[20:01:29] #startmeeting TechCom RFC discussion [20:01:29] Meeting started Wed Nov 15 20:01:29 2017 UTC and is due to finish in 60 minutes. The chair is DanielK_WMDE_. Information about MeetBot at http://wiki.debian.org/MeetBot. [20:01:29] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [20:01:29] The meeting name has been set to 'techcom_rfc_discussion' [20:01:44] #topic Handling of imported usernames [20:01:51] #link https://phabricator.wikimedia.org/T179832 [20:02:18] who's here for the meeting? [20:02:22] * anomie is here [20:02:31] * _joe_ as well [20:02:45] hello [20:02:46] * Scott_WUaS is here too [20:02:46] Me too [20:02:49] Hello [20:03:16] me too [20:03:48] excellent! [20:04:00] :) [20:04:11] anomie: can you briefly outline the idea of the RFC, and what drives it? [20:04:21] ok [20:04:23] Are there any specific questions you want to get answered today? [20:04:40] #chair TimStarling [20:04:40] Current chairs: DanielK_WMDE_ TimStarling [20:04:44] #chair RoanKattouw [20:04:44] Current chairs: DanielK_WMDE_ RoanKattouw TimStarling [20:06:30] ...or does anyone else have specific questions about this RFC they want to see answered today? [20:07:14] well, we have a gerrit change up already, so the main question is "can we merge this" [20:07:15] The actor table change is what's really driving this. Instead of having rev_user and rev_user_text columns for each row in the revision table, with the latter being 255 bytes per row, we're going to have a rev_actor column that refers to rows in a new table that holds both registered user names (like the user table) and unregistered names/IP addresses. But it turns out there's an inconsistency in the existing schema: for imports, there are rows [20:07:15] with valid usernames in rev_user_text but rev_user is 0. Even if someone creates the name later, these rows still have rev_user = 0 and therefore will be found or not depending on whether calling code searches by ID or name. [20:07:23] So we need to do something about these. [20:08:01] The best option seems to me to be to reattribute all these to systematic invalid usernames, which is what https://gerrit.wikimedia.org/r/#/c/386625/ implements. [20:08:23] I think it's a bit odd that we're using a different character (SULF used ~, this uses > ) but I think you've got a good explanation for that [20:08:32] As for questions... Can we merge the patch? Or does anyone have anything to say about it? [20:08:36] Since IIRC ~ isn't invalid [20:08:54] Also SULF uses a suffix and this proposal uses a prefix but that's just me nitpicking really [20:09:10] ~ is valid in the sense that if you have a name like "Example~enwiki" you can still log in with it. [20:09:17] anomie: can adding the prefix lead to exceeding the max length of the field? [20:09:21] the username being invalid means that the User:: factory methods will throw an exception/return false, right? [20:09:26] that seems a bit scary [20:09:45] DanielK_WMDE_: Theoretically, yes. In that case the name would wind up being truncated. [20:09:48] (In what ways might this RFC address importing different names from each of all ~200 countries official main languages, - perhaps drawing on Wikimedia Foundation related databases and communities ?) [20:09:58] Right, that speaks for a prefix rather than a suffix [20:09:59] other than that, it'll be a huge increase in history / log sanity [20:10:10] tgr: That's a feature, IMO. No code will be able to accidentally use these as a valid name. [20:10:11] Because a prefix ensures that the '>' char won't itself be truncated [20:10:21] <_joe_> anomie: I have a question - it's not completely clear to me what happens to old imports [20:10:46] IIRC the only usernames which come close to the limit are ones specifically designed to be large, to annoy people [20:10:47] Scott_WUaS: I don't understand your question. [20:12:19] <_joe_> "On WMF wikis, we'll run a maintenance script to clean up the existing rows with valid usernames and rev_user = 0. The current plan there is to attribute these edits to existing SUL users where possible and to prefix them with a generic prefix otherwise, but we could as easily prefix them all."" [20:12:50] (anomie: Tim, Daniel and Bryan invited me to this IRC - and perhaps also re - https://wiki.worlduniversityandschool.org/wiki/Nation_States - where WUaS donated itself to Wikidata in 2015 and has this new Miraheze Mediawiki thanks to that - so I'm wondering about user names in these regards too). [20:12:52] Specifically, that maintenance script is https://gerrit.wikimedia.org/r/#/c/386625/5/maintenance/cleanupUsersWithNoId.php [20:13:07] We could run it with or without the --assign option. [20:13:09] <_joe_> ok I'll take a look [20:14:16] Scott_WUaS: i18n does nto seem to be a concern relevant to this RFC. User names from imports will continue to support all scripts that user names regularly do. [20:14:18] Scott_WUaS: If edits were actually imported using Special:Import, this proposal affects the user names that the edits would be attributed to. If data is going to be manually entered, this proposal isn't going to affect that. [20:14:21] though, that makes me wonder... [20:14:33] is there a larger plan to make code work with imported usernames without having to hand-implement username parsing? like a RemoteUser class? [20:15:00] Thanks, DanielK_WMDE_ and anomie [20:15:01] this patch seems like a reasonable middle step but leaves the user data in a fairly developer-unfriendly state [20:15:01] TimStarling: even in languages that use a lot of bytes per character? Chinese generally uses fewer characters, but Russian?... [20:15:23] tgr: Only in the vague "it would be nice to rewrite User" sense. [20:16:47] WUaS in Mediawiki / Wikidata is also anticipating usernames in all 7,000 languages, mentioned on the Wikimedia 2030 panel this August with Katherine Maher and Magnus Manske - and I wonder about today's user names' focus in these regards, as well. [20:16:55] as it is now, even the importer will break when encountering imported usernames, I think [20:16:57] $wgMaxNameChars is 85 on WMF wikis, and the field length is 255 [20:17:03] anomie: so, conceptually, this introduces a new concept of "qualified user name" to the codebase. These would currently never exist as a User object, but only as a string, right? [20:17:17] TimStarling: Has it always been much less than 255? No worrying gremlins in the DB? [20:17:35] anomie: would Title objects exist for something like User:meta>FooBar ? [20:18:23] anomie: in which places in the codebase can such prefixed user names show up? Would it make sense to wrap them in objects, so the idea of prefixed vs unprefixed user names is reflected by the type system? [20:18:54] DanielK_WMDE_: I did some quick queries in Tool Labs. Longest name in ruwiki.user is 85 characters. The query over centralauth.globaluser is being slower, we'll see what that says. [20:19:01] (... and even re WUaS's planning for user names for an Universal Basic Income for all 7.5 billion people with cryptocurrency and the blockchain ... conceptually ) [20:19:10] (quick reminder to everyone to make liberal use of the #info command) [20:19:14] DanielK_WMDE_: "User:meta>FooBar" is an invalid title. [20:19:23] if a username is longer than the limit, it won't be usable for login, so presumably the cleanup script checkUsernames.php was run when the limit was introduced [20:20:16] anomie: yea, i thought so. so, prefixed names cannot show up in Titles, nor in User objects. Where can they show up? Just in plain DB query results? In Actor objects? Is that a thing? [20:20:21] DanielK_WMDE_: They could "show up" anywhere where something selects rev_user, or in the future actor.actor_name. Since we don't have comprehensive DAO or whatever the acronym is, there's no way to comprehensively wrap things in an object. [20:20:47] of course not. but do we want to provide such an object, and advocate its use? [20:21:34] re: 255-byte usernames, is this worth increasing the field length / introducing a new interwiki column? with MCR the tables are going to be changed anyway so shouldn't be much extra effort [20:21:36] Scott_WUaS: It seems like you're more concerned with what scripts are allowed in usernames than anything this meeting is about. FYI, MediaWiki allows anything that can be represented in UTF-8. [20:21:50] to make a more concrete point: we want to link to "foreign" users in the revision history, right? How exactly does that work? [20:21:55] Thanks, anomie [20:21:59] DanielK_WMDE_: In globaluser, there is 1 username with 90 bytes, 1 with 93, 1 with 145, and 3 at 255 characters. [20:23:32] DanielK_WMDE_: In the patch in Gerrit, it gets passed to Linker::userLink() and Linker::userToolLinks() as a string, Linker sees the '>' character, and formats the link differently. [20:24:01] Note the "gets passed to Linker::userLink() and Linker::userToolLinks() as a string" is not a change introduced in this patch. [20:24:13] ah ok, I was under the impression that interface was asking for a User object [20:24:51] this would be the place where i could see use for an interface for "potentially foreign user name". But perhaps that's overkill [20:24:56] tgr: I considered adding a column, but that means all code has to be reviewed to handle the column, and anything that's missed would treat the name as the local user because it would ignore the new column. [20:25:11] DanielK_WMDE_: Save it for the future ;) [20:25:15] heh [20:25:36] No, the Linker interface asks for ( $userID, $userName ) [20:25:53] I ran into that just an hour ago and scratched my head a little, but that's how that interface already works [20:26:23] anomie: rev_user_text would never contain a prefixed name, right? nor would ar_user_text. It would always be in the new actor table. So code looking at rev_user_text directly wouldn't be affected... Or am I missing something? [20:26:27] anomie: is that bad? I imagine almost all code would look up actors by id, not name, so there wouldn't be that much chance for breakage [20:26:50] RoanKattouw: i'm about to introduce a UserIdentity interface that basically has that pair. [20:26:54] Aha [20:27:24] DanielK_WMDE_: It would be in rev_user_text. The actor table migration script actually errors out if it finds a valid username with xx_user === 0, telling you to run this migration script first. [20:27:41] hm... why do that? [20:27:55] why not do the prefixing as part of the migration to actor, and leave rev_user_text alone? [20:28:01] tgr: You're overlooking code like Special:Contributions where the end user inputs a username to be searched for. [20:28:16] if it's separable, better to separate it, right? [20:28:22] for CR etc. [20:28:36] DanielK_WMDE_: I thought you hated big patches that do everything all at once? ;) [20:28:55] TimStarling: yes... you want to separate the process. i want to keep the new format out of existing db fields. different kind of separation. same idea... [20:29:30] I understand anomie has reviewed things that look at rev_user_text and found that they're not much of a problem [20:29:47] anomie: doesn't have to be a single patch. can be 20 patches to get to the actualy working script. [20:30:03] no, tests have to pass in every commit [20:30:27] TimStarling: I hope he's right. I'd prefer to put the prefixed names only into the new table, leaving the old table alone [20:30:28] you can't have one commit that does half the job if it leaves the code in a broken state [20:30:41] of course not. [20:30:54] that's why unit tests are isolated [20:31:20] DanielK_WMDE_: Even if they were only in the new table, the way the actor table patch works most callers would still see it the same as if it were in rev_user_text once we get to that step of the migration process. [20:32:01] anomie: ideally, they would see a UserIdentity, not a string. Or an ActorIdentity or something. [20:32:08] The actor table patch basically changes all the "SELECT rev_user_text" into "SELECT COALESCE(actor_name, rev_user_text) AS rev_user_text", an later into just "SELECT actor_name AS rev_user_text" [20:32:43] DanielK_WMDE_: No, I'm not going to do the crazy "have Database->select() mangle it to return objects" idea. [20:32:52] 12:31:20 DanielK_WMDE_: Even if they were only in the new table, the way the actor table patch works most callers would still see it the same as if it were in rev_user_text once we get to that step of the migration process. [20:32:56] I find that the most compelling argument [20:33:06] That and being able to split the migration into smaller steps [20:33:25] for the record, I'm not opposed. [20:33:31] i'm pointing out risks and alternatives [20:33:36] but i don't want to derail the discussion [20:35:12] anomie: that crazy idea is a standard component of DB connections on many platforms :) But yes, doing this in ResultSet would break assumptions, no never mind. [20:35:37] you've seen the patch, right? it's surprisingly small [20:35:52] https://gerrit.wikimedia.org/r/#/c/386625/ [20:36:34] DanielK_WMDE_: In platforms that force DAO, I suppose, unlike MediaWiki. Anyway, we're getting off-topic. [20:37:08] so, AIUI the two concerns so far are: [20:38:03] - this patch is making the trade-off of being somewhat hacky/unexpectedly breaking tools but avoiding any possibility of code ending up with the wrong user; is that the right trade-off? [20:39:05] - the patch creates technical debt by introducing (well, standardizing) user data which cannot be handled by high-level code like User; what will happen to that? [20:39:25] are there any other concerns that weren't mentioned yet? [20:39:26] TimStarling: Small patches can cause much fallout :) I'm trying to poke holes in this not to be annoying but to avoid grief later on. I think anomie knows that. [20:41:05] I am a bit unsure about the fact that things could blow up if one of these fake user names accidentally found their way into User::newFromName() or similar [20:41:25] Daniel seems to have some ideas that he's about to propose that might handle that [20:42:10] we could have a factory function that constructs a User from a result row containing revision fields [20:42:11] tgr: Re your second point, I'd hesitate to call this "technical debt". We already have two classes of user names, one that passes User::isValid() and one that passes User::isIP(). The latter fails when passed to User::newFromName() in the same way these new type of names will, although they fail differently if you try to Title::newFromText( "User:$name" ). [20:42:34] tgr: for (1), i'm in favor of failing fast & safe. for (2)... well, this patch doesn't great that debt, btu it perpetuates it. But it doesn't make it worse, and doesn't make it more difficult to fix later. [20:42:41] but what would use it? [20:43:17] RoanKattouw: my proposal would be: put the prefixed names only into the actor table. never select actor_name as rev_user_text. Instead, force the use of the appropriate abstraction. [20:43:38] Re a factory function, we already have User::newFromRow(). You just have to name the fields "user_id" and "user_name" rather than "rev_user" and "rev_user_text". [20:43:55] RoanKattouw: code that now looks at rev_user_text directly would have to do $userReferenceFactory->newFromRow( $row )->getName() [20:44:17] a "user reference" is to a user as LinkTarget is to Title. [20:44:22] DanielK_WMDE_: That means the "appropriate abstraction" has to be created and put into use everywhere, which would be an epic on top of the existing actor-table epic. [20:44:41] that's how i would do it. but i'm not asking anomie to do it that way. [20:45:20] anomie: well, it would fit nicely with the MCR refactoring. we will have to touch all that code anyway. [20:45:29] but as i said: that's just how i'd do it [20:45:40] DanielK_WMDE_: We don't have to touch every piece of code that handles a User for MCR. [20:45:44] I think this patch improves the prospect of future development in this area substantially [20:45:54] actually... the code that needs touch is exactly the code that now needs $queryInfo. [20:45:57] because data is not lost anymore on input, we are storing more things [20:46:02] input -> import [20:46:26] anomie: no, but most (all?) code that looks at rev_user_text [20:46:38] TimStarling: i agree, yes. this is just about where to store it [20:46:47] I guess as long as methods that give you a user from a revision/log entry keep working, it's not really problematic [20:47:07] since those are the only ways to find an imported user, anyway [20:47:26] MCR doesn't even need to touch all the code that looks at rev_user_text. MCR only affects code that makes use of contents of revisions. [20:48:33] yes, true, at least technically [20:49:04] as for the idea of using a separate field [20:49:40] anomie: hm... if Revision::getUser returns a UserIdentity, UserIdentity::getName needs to be allowed to return a prefixed name. Is that right? [20:49:41] can it be split out to a separate RFC? considering the small size of this patch, and the actor table patch which will be merged shortly after it, wouldn't you rather be altering the actor table than the revision table? [20:50:00] the patch should probably add something like $user->isLocal() for convenient detection of such imported users, though [20:50:21] tgr: all User objects will be local. Construction of User will fail for prefixed names [20:50:28] at least, that's what I understood [20:50:40] not via User::newFromRow [20:50:47] the actor table work enables future abstraction of user concepts, lots of things will be easier once we have it [20:50:58] and that's how most user construction happens internally, I'd guess [20:51:02] tgr: YAGNI, IMO. It can be added easily enough once someone has a need for it. [20:51:10] ten minute warning [20:51:59] tgr: User::newFromRow wants a row from the user table. It would look at user_name - which is never prefixed. Only rev_user_text is. [20:52:09] (please tell me if I'm wrong, anomie) [20:52:13] probably $user->isLocal() should be added when we find a use case for User objects constructed from actor rows [20:52:31] I would discourage that [20:52:50] the idea of an actor is distinct from the idea of a locally registered user. the two are not compatible [20:52:52] * Krinkle late entry [20:52:53] anomie: if I'm a feature developer and I read through User.php to see how to use that object, that will leave me completely unaware of this [20:52:57] DanielK_WMDE_: Well, someone could manually construct a "row". Or you could do User::newFromName( $name, false ) to bypass the normal validation. Or User->setName(), I think. [20:53:12] at a minimum the documentation of that class should mention this somewhere, IMO [20:53:22] anomie: but nothing actually does that, right? [20:53:36] Perhaps a new static new*() method could exist that has a lighter contract (e.g. UserInterface or IActor), that will return either User or ImportedActor, with only a narrow subset of methods available. [20:53:41] tgr: User represents a locally registered user. It does not represent an actor who edited a page. We conflated these ideas, and are now separating them. [20:53:55] we're discussing adding a factory which nothing will call? [20:54:00] Having a full User object for an imported username indeed probably doesn't make as much sense, but then again, the same applies to user objects for logged-out users [20:54:27] except that IPs can still be blocked. [20:54:59] Anon User objects are a convenient but hacky way to represent the current anon user kind-of-as-if-it-was a real user. [20:55:07] DanielK_WMDE_: LogEntry::getPerformer constructs User from a log row I think; would log_user_text be prefixed? [20:55:12] But only in context of a request/session. not page history. [20:55:18] Krinkle: Again, though, that's a much more sweeping refactor of the User class. I'm already at epic-overload with the comment table, the actor table, MCR, and my next project which is messing with ExternalStore to remove use of unserialize(). [20:55:30] please let's not get side tracked with 4 minutes to go [20:55:37] we need to close this [20:55:52] tgr: ah, good call! sadly, quite late! anomie can you say something about LogEntry::getPerformer? That should probably return an Actor, not a User, right? [20:56:16] DanielK_WMDE_: That would require that "Actor" actually exists as a distinct class. [20:56:46] TimStarling: we can't have a meeting anyway, Google docs seems to be down :) [20:57:02] <_joe_> yeah I was about to tell you in private [20:57:07] <_joe_> :P [20:57:12] put it in the cloud they said, it'll be safe they said [20:57:15] anomie: if we don't want User objects to represent "remote" users, we should have that, yes. [20:58:33] anomie: Yeah, given the amount of work involved, might also make sense to defer this until after "actor" settles? [20:59:00] DanielK_WMDE_: you're the chair, what is the resolution? [20:59:02] DanielK_WMDE_: If we're going to be changing it, we don't want User objects to have a lot of the crap that they currently contain. We'd want a base class, and separate classes for registered users versus IPs versus these remote users, and service classes instead of having methods on User to do all the saving and group-changing and everything. [20:59:16] TimStarling: right. [20:59:34] And doing all that means that everything that uses a "User" object now would need changing/breaking. [20:59:50] I propose this to go on last call. There were concerns raised, but nothing that should block progress [21:00:08] any opposition to that? [21:00:13] sgtm [21:01:02] ok, closing meeting... [21:01:06] #endmeeting [21:01:06] Meeting ended Wed Nov 15 21:01:06 2017 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [21:01:06] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-11-15-20.01.html [21:01:06] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-11-15-20.01.txt [21:01:06] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-11-15-20.01.wiki [21:01:06] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-11-15-20.01.log.html [21:01:11] thank you all! [21:02:37] thanks for working on this! all concerns aside, it will be a huge improvement over the current state of import randomly misattributing things