[22:02:20] #startmeeting [22:02:20] DanielK_WMDE: Error: A meeting name is required, e.g., '#startmeeting Marketing Committee' [22:02:20] DanielK_WMDE: Error: A meeting name is required, e.g., '#startmeeting Marketing Committee' [22:02:50] #topic Normalize change tag schema [22:02:56] #link https://phabricator.wikimedia.org/T185355 [22:03:35] soo... who's here for the rfc discussion? [22:04:01] I am [22:04:25] I am [22:04:43] #startmeeting TechCom RfC discussion [22:04:43] Meeting started Wed Mar 7 22:04:43 2018 UTC and is due to finish in 60 minutes. The chair is tgr. Information about MeetBot at http://wiki.debian.org/MeetBot. [22:04:43] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [22:04:43] The meeting name has been set to 'techcom_rfc_discussion' [22:04:51] Meeting started Wed Mar 7 22:04:43 2018 UTC and is due to finish in 60 minutes. The chair is tgr. Information about MeetBot at http://wiki.debian.org/MeetBot. [22:04:51] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [22:04:51] The meeting name has been set to 'techcom_rfc_discussion' [22:04:55] Looks like we still have two bots.. Same issue last week. [22:04:59] I am, as the proposer :) [22:05:01] tgr: ah, thanks - i keep forgetting that [22:05:09] #topic Normalize change tag schema [22:05:15] #link https://phabricator.wikimedia.org/T185355 [22:05:31] tgr: can you #chair me? [22:06:13] RoanKattouw: can you give a brief intro, and tell us what the main questions are you want to discuss today? [22:06:22] Sure [22:06:29] #chair DanielK_WMDE [22:06:30] Current chairs: DanielK_WMDE tgr [22:06:36] So, the schema of the change_tag table and related tables has some issues [22:06:37] thanks tgr [22:06:43] Current chairs: DanielK_WMDE tgr [22:07:00] The three problems that I'm proposing a solution for here are: [22:07:31] 1) Compiling statistics about tag usage (how many edits are tagged with each tag) is very slow. On enwiki, loading [[Special:Tags]] takes ~5s and on Wikidata it's >30s [22:07:59] This will only get worse as we use change tags for more things. The high number for Wikidata is due to its heavy use of OAuth, which tags each edit with the responsible CID [22:08:45] 2) The schema is not normalized, so storage of tag+revid pairs is inefficient since tags are stored as strings. This is not an acute problem, but cutting down on storage requirements is good if we are to expand tag usage [22:09:07] 3) The valid_tag table is silly, it only has one column [22:10:02] So my proposal is to add a new table called change_tag_def (name is bikesheddable) that would define tag ID for each tag name. We'd then put tag_id+rev_id pairs in the change_tag table, instead of tag_name+rev_id pairs, so that solves #2 [22:10:27] a related issue is that some people would like to use tags in a way that results in most edits being tagged (e.g. tag edits made by the old wikitext editor) [22:10:46] which is basically Roan's #2 but would make it very acute [22:10:50] The change_tag_def table would also contain a counter that tracks how many uses of that tag there are, which after an initial population script would be updated over time (increment for every tagged edit), so that solves #1 [22:11:22] Yes, tgr is right that right now the only reason the tag system hasn't fallen over is because most edits aren't tagged [22:11:43] Except on Wikidata, where most edits are tagged, and that's how you get >30s response times for [[Special:Tags]] [22:12:26] And finally change_tag_def would also have a boolean field to say whether the tag is "defined" (a weird term that I can go into detail about if someone wants me to), so that solves #3 [22:12:54] The most complex part of this is the migration, so I'd like input from people on my migration plan [22:13:01] Anomie has already given some, thank you very much for that [22:13:15] The other open questions that I have are more detailed: [22:13:34] Are change_tag_def and ctd_defined good names? [22:13:56] A previous version of the proposal had a ctd_timestamp field, should we resurrect that or should it stay dead? (I think the latter) [22:14:24] Should we store rows with ctd_count=0 or remove them, or behave differently in different circumstances depending on defined-ness etc? [22:14:45] RoanKattouw: what was ctd_timestamp intended to do? [22:15:02] (My current proposal being that only rows with ctd_defined=true are alllowed to survive if they have ctd_count=0) [22:15:22] DanielK_WMDE: It was meant to contain the timestamp of the most recent use of the tag, i.e. the most recent time that a revision was tagged with that tag [22:15:31] ctd_defined seems redundant, this is define twice in the same name, right? [22:15:32] This would aid in spotting tags that are no longer being used and can be marked as inactive [22:15:48] maybe should be ctd_manual? [22:15:57] TimStarling: Hah, yes you're right that there's an ATM machine kind of thing going on with that name [22:16:04] RoanKattouw: a simply join against revision would do the same... i see no need for an extra field. [22:16:12] I used "defined" because that's the jargon from ChangeTags.php [22:16:22] RoanKattouw: how would you handle hard delete/undelete? [22:16:29] but we're saying that tags can be automatically defined by adding them to change_tag_def? [22:16:30] of revisions, I mean [22:16:38] DanielK_WMDE: Technically you need to join against revision and also logging and recentchanges but sure, you could find that info I suppose [22:16:52] tgr: Right now the tag system ignores that completely, and I'm not proposing to change that [22:17:05] RoanKattouw: if this is just for maintenance, we could have a script that does this. and if it runs for 20 seconds, so what [22:17:21] TimStarling: That's how the valid_tag table works right now, my proposal is to roll that table into a boolean field and not change its semantis [22:17:28] anyway... do i remember correctly that "defined" means "can be set manualyl by the user"? [22:17:51] IIRC yes, it means a sysop can apply it using a special page [22:18:12] So for context, the number of rows in enwiki.valid_tag is like 5 [22:18:19] so we have tag_summary that does almost the same thing as this new table? will it be dropped? [22:18:27] Even though there are clearly more than 5 tags in use on that wiki, which makes the old table name officially terrible [22:18:41] TimStarling: tag_summary is orthogonal, and also no longer used in MySQL [22:18:52] also there is ct_params, will it be dropped? [22:19:06] if you drop ct_params then change_tag can have fixed-length rows [22:19:09] tag_summary is a rollup of the GROUP_CONCAT() for the number + names of the tags for one revision, which is a rollup in a different dimension than the count of revisions per tag [22:19:32] Interesting point re ct_params, does anyone know if it was ever used? [22:19:33] I'm not sure whether ct_params was imagined as a thing attached to the tag name or to the revision [22:19:38] RoanKattouw: wait - the way i understand the rfc, "defined" means the tag was manually defined, but it says nothing about how that information is used. So, this means the tag was manually defined, and can be applied manually? [22:19:54] DanielK_WMDE: Yes. I'm not proposing to change the existing semantics of the valid_tag table [22:21:02] Thinking about it again I like Tim's ctd_manual name because it is about manual applicatoin [22:21:17] There is some value in following the jargon terms from ChangeTags.php and its function names, but those are also very confusing [22:22:29] so ChangeTags::addTags() has a $params which is sent to ct_params [22:22:44] RoanKattouw: You know the codebase better, do you think by this normalization, there is a need to practically rewrite ChangeTags.php? If yes, that sounds like an opportunity to fix the jargon [22:23:08] Yeah it looks like ct_params is written but never read [22:23:11] Based on a quick grep [22:23:12] re ct_tags: having a place for generic revision meta-data seems like a good thing. the semantics here is a bit unclear though. [22:23:23] Amir1: I think there's an opportunity to write ChangeTags.php but not a need [22:23:42] ContentTranslation reads it [22:23:48] Thanks [22:23:53] For one thing I am quite confident this change can be implemented without changing the API that ChangeTags.php exposes [22:24:31] So that 1) is a good first step (no breaking changes to other code) and 2) precludes fixing the jargon, since it's embedded in the names of the public functions that are exposed to and used by other code [22:24:51] Lemme look at ContentTranslation's usage [22:25:04] My gut feeling is that ct_params is specific to the rev+tag combo [22:25:08] https://gerrit.wikimedia.org/g/mediawiki/extensions/ContentTranslation/+/master/includes/Stats.php#35 [22:25:15] not super informative [22:25:23] CX reads it but does nothing with it, it looks like [22:25:26] RoanKattouw: it feels to me like the list of tags should be a configuration variable. it may be nice to allow people to define new tags on the wiki, but how often does that happen? Once a year? [22:25:57] the API would make it awkward to attach params to the name/def rather than the change [22:26:11] Also AFAICT the code tgr found is dead code [22:26:34] DanielK_WMDE: On Special:Tags, probably very infrequently, but in AbuseFilter, all the time [22:26:39] you have addTags() where the tag names are an array, and the params is a single string, how would that even work? [22:26:51] Ha really [22:26:56] DanielK_WMDE: it happens a lot thanks to Abuse filter (AFAIK) [22:27:06] let me double check [22:27:28] yea... but does that mean tags that can be assigend by abusefulter can be assigend by people manually? [22:27:32] Extensions can define tags, for example VisualEditor defines a tag for edits made with VE, but AbuseFilter uses its ability to define tags in software in a way that can be controlled by (privileged) users [22:27:35] is that needed? [22:27:41] where the list of tags comes from seems irrelevant to the RfC [22:27:44] but anyway - we are getting distracted [22:27:45] No that is not needed [22:27:56] you'd want the same schema either way [22:28:00] there may be other changes needed to be made to the tags system, but this seems orthogonal to the rfc as proposed [22:28:04] maybe minus the _defined field [22:28:06] But I still disagree that it should be a config var, right now the community has control over it and it wouldn't take much/any work to keep it that way [22:28:16] 14:28:00 there may be other changes needed to be made to the tags system, but this seems orthogonal to the rfc as proposed [22:28:17] This ---^^ [22:28:41] I tripped over lots of skeletons while working on this, they keep falling out of closets left and right [22:29:01] So I had to willfully ignore most of them and try to limit my scope to something I could actually get odne [22:29:43] Killing the valid_tag table is already a bit of scope creep that I was talked into, but I think it makes sense because that way we'd be pretty much done cleaning up the DB schema AFAICT [22:29:48] RoanKattouw: i know that feeling soooo well :) [22:30:08] However! [22:30:09] On that note [22:30:21] If truly ct_params is unused (I can do a little more research on that), we can kill that as well [22:30:40] pretty simple to include it in the schema changes [22:30:44] So as to not need a separate schema change for it later [22:30:48] RoanKattouw: i'm for keepign ct_params, i want to abuse it :) [22:30:48] Yeah exactly [22:30:58] you can have addTags() throw an exception if $params is non-empty [22:31:03] Ha nice [22:31:08] And then make the breaking API change later [22:31:10] That's a great idea [22:31:11] anyway - ctd_defined isn't a good name. How about ctd_user_defined? [22:31:20] What's the !command for highlighting something in the notes? [22:31:41] Aha I like that name [22:31:47] RoanKattouw: #info [22:32:01] #info anyway - ctd_defined isn't a good name. How about ctd_user_defined? [22:32:08] let me read back and do a few more [22:32:16] #info you can have addTags() throw an exception if $params is non-empty [22:32:18] OK go for it [22:33:01] some day these static functions will be replaced with a service, that will be the opportunity to clean up the API [22:33:45] I am not volunteering for that one :) [22:33:51] #info we don't need ctd_timestamp, we can just query for the last time a tag was used. [22:35:19] RoanKattouw: i'm serious about keeping ct_params. it seems potentially useful, and does no harm. i think of it as the equivalent of page_props [22:35:39] #info Killing the valid_tag table is already a bit of scope creep that I was talked into, but I think it makes sense because that way we'd be pretty much done cleaning up the DB schema AFAICT [22:35:44] I don't quite agree that it's equivalent to page_props [22:36:06] You could use it to make rev_props, but only if you add a (user-visible!) tag to the revision [22:36:34] Unless there's a strong tie between what the props are for and a user-visible tag, I don't think that's the best way to attach props to a revision [22:36:41] If we're gonna have a rev_props table, let's have a rev_props table [22:36:53] fair enough [22:37:18] (Also note that change_tag has three ID fields, you can tag either a rev_id, or a log_id or an rc_id, so it's broader than revs as well) [22:37:44] #info If truly ct_params is unused, we can kill that as well. [22:38:07] Welp, 7971 non-null ct_params rows on enwiki [22:38:13] Let's see how many of those are also != '' [22:38:18] All of them :( [22:38:20] RoanKattouw: yea, i don't get why it has rc_id. can we get rid of that? [22:38:23] So let's see why that is [22:38:40] DanielK_WMDE: I was thinking the same thing, RC entries are ephemeral but change_tag rows are forever [22:38:44] So it doesn't really make sense [22:38:48] If unused we can also kill that [22:38:50] Just a wild idea, making page_props actually tags for pages (and storing them in one table) would make sense to me but implementing it won't be easy I guess [22:38:53] If nobody objects to that [22:39:27] maybe when you are generating Special:Recentchanges it is handy to be able to join on rc_id [22:40:01] otherwise the join condition would be more complicated [22:40:02] Right, so then we'd expect all rows to either have rev_id+rc_id or log_id+rc_id, never a bare rc_id [22:40:04] There are cases of injecting rc records which you can not find them in revision table (like Wikidata) [22:40:14] Sure, but those should not be tagged IMO [22:40:22] #info ct_rc_id should perhaps go away. It seems redundant to ct_rev_id. Also, change_tag rows stay around while the recentchange row vanishes. otoh maybe when you are generating Special:Recentchanges it is handy to be able to join on rc_id [22:40:37] #info Right, so then we'd expect all rows to either have rev_id+rc_id or log_id+rc_id, never a bare rc_id [22:40:38] Unless the change_tag rows for them are cleaned up after the fact, but that is also not something the system is set up to handle [22:40:48] It can serves a purpose, we can flag these edits for users in client for vandalism [22:40:57] So the most common ct_params value is {"from":"es","to":"en"} [22:41:04] Looks like ContentTranslation does use it [22:41:11] Amir1: oh, you are right - the RC entries wikidat injects don't have a revision ID! [22:41:23] Kind of in the way that Daniel envisioned [22:41:24] #info There are cases of injecting rc records which you can not find them in revision table (like Wikidata) [22:42:31] #info ct_params is used by ContentTranslation: {"from":"es","to":"en"} [22:43:06] RoanKattouw: this kind of think could no go into comment_data.... [22:43:10] *thing [22:44:01] could *now* ? [22:44:12] or "not" [22:44:25] "now" [22:44:26] sorry [22:44:28] getting late [22:44:36] Btw, since we talked about it earlier: here's the code for the NamedTableStore class. Could perhaps be re-used here. https://gerrit.wikimedia.org/r/c/404445/35/includes/Storage/NameTableStore.php [22:45:09] Thanks [22:45:16] Anyway, so [22:45:27] It sounds like we have consensus on the following: [22:45:36] * ctd_defined -> ctd_user_defined [22:46:09] Ahm, and... nothing else? :D [22:46:35] ct_params is used so it's not easy to kill, and the situation around ct_rc_id is probably fine [22:46:39] I think the schema migration can be approved as proposed [22:46:56] That would be good :) [22:47:05] I suggest to puit this on last call, unless roan wants to refine it first [22:47:21] I think I'll still want ct_params to die, but the fact that it's used makes that harder, so I think we should not couple that to this change [22:47:40] maybe someone should update tables.sql to not lie about ct_params being unused [22:47:44] I support last call with the ctd_defined -> ctd_user_defined edit [22:47:51] Ha yeah will do that [22:47:55] #info maybe someone should update tables.sql to not lie about ct_params being unused [22:48:29] TimStarling: do you think change_tag_def can go ahead? [22:48:37] yes [22:48:48] * DanielK_WMDE thinks change_tag_def is a clunky name, but doesn't have a better idea [22:49:15] Same [22:49:24] Re ct_params, https://gerrit.wikimedia.org/r/417168 [22:50:09] yeah _def is a new convention, but english is hard [22:50:17] I couldn't think of a better name [22:50:24] Well the better name I can think of is change_tag [22:50:26] But that name is already in use [22:50:32] indeed :) [22:50:40] I initially suggested "tag", but as I expected people didn't like that either [22:51:09] So, sounds liek this can go on last call, to be approved on MArch 21. kchapman, can you take care of that when you compile the minutes? I suppose we'll be annouoncing two last calls this week, with different end dates. [22:51:21] (The fact that I was writing code that created a table with a three-character name is actually what made me realize I should make this an RFC) [22:51:47] yes, I can do that. [22:51:49] RoanKattouw: we could have change_tags and change_tag. not confusing at all! [22:51:54] kchapman: thank you! [22:52:02] haha right [22:52:44] we still have 8 minutes [22:52:53] but if there is nothing left to say, we can finish early [22:53:56] i guess that means we can close early :) [22:54:10] Yeah I think we're done [22:54:27] I'll go edit the task description for the field name change [22:54:45] cool! [22:54:49] #endmeeting [22:54:49] Meeting ended Wed Mar 7 22:54:49 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [22:54:49] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.html [22:54:49] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.txt [22:54:49] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.wiki [22:54:49] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.log.html [22:54:49] Meeting ended Wed Mar 7 22:54:49 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [22:54:49] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.html [22:54:49] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.txt [22:54:49] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.wiki [22:54:50] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-07-22.04.log.html [22:54:54] thanks everyone!