[21:55:01] schema chat in a few minutes
[21:59:29] * DanielK_WMDE will be back in a few minutes to chair the thing
[21:59:49] \o/
[22:00:00] * brion coffees up
[22:01:56] * RoanKattouw quickly spams the hell out of the talk page in the last 15 mins before the meeting
[22:02:22] :)
[22:02:52] #startmeeting
[22:02:52] DanielK_WMDE: Error: A meeting name is required, e.g., '#startmeeting Marketing Committee'
[22:03:01] Should/can https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2 be moved to Requests_for_comment/ ?
[22:03:02] #startmeeting ArchCom RFC meeting
[22:03:02] Meeting started Wed Mar 8 22:03:02 2017 UTC and is due to finish in 60 minutes. The chair is DanielK_WMDE. Information about MeetBot at http://wiki.debian.org/MeetBot.
[22:03:02] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[22:03:02] The meeting name has been set to 'archcom_rfc_meeting'
[22:03:32] Marybelle: yes it should :) I'll move it after this meeting
[22:03:46] Cool. :-)
[22:03:51] #action brion should rename the page out of userspace into rfc space
[22:03:54] #topic Compacting the Revision table Maybe How should we store longer revision comments?
[22:04:10] oops, pasted the wrong thing :)
[22:04:15] heh
[22:04:15] * DanielK_WMDE is a bit slow today, sorry
[22:04:22] it's related tho :D
[22:04:30] #topic Compacting the Revision table https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2#Provisional
[22:04:35] #link https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2#Provisional
[22:04:40] ok!
[22:04:42] ok.
[22:04:50] the main changes since a couple weeks ago are:
[22:05:01] * archive table is updated to mostly match revision
[22:05:16] * user_entry table has been renamed to 'actor' since user_entry was confusing
[22:05:30] * started to flesh out the transition plans in more detail
[22:05:39] Is merging archive and revision off the table? We'd talked about it in the past to make deletions less painful.
[22:06:04] brion: ar_title is new, right?
[22:06:07] Marybelle: that's another possibility: just kill archive outright. however, there would need to be some bigger changes made, and we'll want to decide whether that's worth it at this stage
[22:06:21] James_F: no, that's old; it's kept around for if the page is completely deleted
[22:06:47] It's just listed as — as if there's no change, but it's not in the listed the current schema.
[22:06:52] -the
[22:06:52] I was about to ask, what's keeping us from storing revs of deleted pages in the revision table -- and that's it, there is no page_id to point rev_page to
[22:06:54] Marybelle: specifically, rev-deletion has no good provision right now for pages that are "completely" deleted
[22:06:56] we were talking about archiving rows from page, instead of archiving rows from revision.
[22:07:00] James_F: oh i must have missed it in my cut-n-paste
[22:07:04] i still like the idea, but maybe it's too much for this rfc
[22:07:15] otoh, if we go that way, we don't need all the schema changes on the archive table
[22:07:27] Surely there's an ar_title in the current schema already?
[22:07:29] hmmmm, an archived page row is possible yeah
[22:07:37] RoanKattouw: We could just keep the entries in the page table and add a bit field to expose.
[22:07:39] RoanKattouw: yeah it's there, i saw it in tables.sql ;)
[22:07:58] That would save a lot of headaches with page re-creations, at least.
[22:08:03] AIUI, the plus side in doing it is simplicity of outcome and getting things done now, not in five years' time; the minus side is more complexity of change. Is that right?
[22:08:08] #action correct the missing "existing" fields on archive in the page
[22:08:12] Yeah, fair enough, we could bit-flag deleted pages in the page table or have an archive table for deleted pages (I prefer the former), but that's also more schema changes
[22:08:22] you know rev_deleted was originally meant to be boolean, making it a bit field was a clever hack to avoid a schema change
[22:08:37] hehe
[22:08:39] #info daniel likes archiving page rows instead of revision rows. probably not in scope here though.
[22:08:40] Hmm
[22:08:43] we're so clever indeed
[22:08:56] brion: What did you want to discuss today?
[22:08:57] So do we feel like we want to split that out, or keep it as a bitfield?
[22:09:18] i see no need to split it out
[22:09:25] So the current plan is to not change the _semantics_ of deletion, just to update the archive table. Bitfield fun on page would be awesome but may change semantics.
[22:09:38] #info Zppix likes keeping as a bitfield
[22:09:42] OK
[22:09:43] Also
[22:09:45] #info ar_title is already in the old schema, to allow page deletions to be undone
[22:09:45] last time I said changing deletion semantics should be a separate project
[22:09:56] I think that if we did this, we wouldn't actually need to touch the revision tables, right?
[22:10:06] #info So the current plan is to not change the _semantics_ of deletion, just to update the archive table.
[22:10:12] We'd need to make a schema change to the page table, but not to any of the tables we're discussing today
[22:10:26] Marybelle: getting back to what i wanted to discuss. :) mainly whether the updated tweaks still make sense, and whether the transition plans sound feasible
[22:10:27] RoanKattouw: Depends whether we touch _user_text fields?
[22:10:29] (except maybe delete the archive table, or add a new bit to the rev_deleted bitfield)
[22:10:30] TimStarling: i agree. but which should come first?
[22:10:37] i'll have to get more performance estimates out of jynus
[22:11:01] Re: https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2#Transition_plan
[22:11:15] I still think the actor (previously user_entry) table is a bit contentious.
[22:11:35] Marybelle: *nod* what would you consider doing differently?
[22:11:40] there's a couple possible tradeoffs
[22:11:57] I liked the idea of killing _user_text a lot.
[22:12:12] I like the actor table, and I like the coined jargon too
[22:12:16] TTO discussed that on the talk page, I think. We could do that without a schema change, as far as I know.
[22:13:22] my 2¢: the transition plan sounds good. "actor" is way better than "user_entry". I'm still unsure about the extra level of indirection, though
[22:13:34] Marybelle: the IP range table idea? or sth else
[22:13:37] if we want to change how anonymous users work, having the actor table in existence would be useful
[22:13:57] I like the transition plan, kinda meh on having the indirection of the actor table
[22:13:58] brion: rev_user_text and img_user_text could be NULL except for non-logged in users, without moving anything else, I think?
[22:14:06] ah i see
[22:14:15] hm, how big do we expect the actor table to grow?
[22:14:24] Marybelle: yes, we actually join to user and use user_name in many cases
[22:14:28] how many distinct IPs do we have recorded for anonymous edits?
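A rough sketch of the actor indirection under discussion, in plain MySQL: one row per distinct registered user or IP, referenced by id from revision and the other tables. Column names here are illustrative, not the RFC's final DDL.

```sql
-- Sketch only: one row per distinct "actor" (registered user or IP).
-- Tables like revision would then store a single actor id instead of
-- carrying a (rev_user, rev_user_text) pair on every row.
CREATE TABLE actor (
  actor_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  actor_user INT UNSIGNED NULL,          -- user_id for registered users, NULL for IPs
  actor_name VARBINARY(255) NOT NULL,    -- user name, or the IP address as text
  UNIQUE KEY actor_name (actor_name)
) ENGINE=InnoDB;
```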
[22:14:38] IPs would still bloat the _user_text column though
[22:14:39] brion: Yeah, but not in enough cases to make renames cheap. :P
[22:14:40] DanielK_WMDE: Fewer than IPv4.
[22:14:43] #info I like the transition plan, kinda meh on having the indirection of the actor table
[22:15:19] James_F: what do you mean?
[22:15:25] RoanKattouw: what would you think about keeping rev_user but still indirecting for IPs and non-local import usernames?
[22:15:33] Yeah, maybe
[22:15:48] Has it been suggested to have user table rows for IPs?
[22:15:50] DanielK_WMDE: Last time I asked about the number of distinct IPs that have edited, that was the ballpark answer I got.
[22:15:54] i'm not sure there's a huge benefit to keeping rev_user as it is, since you need the indirection logic for looking up IP users anyway
[22:16:02] RoanKattouw: Yes, many times.
[22:16:07] RoanKattouw: Yes, and shot down because of all the account-specific stuff there.
[22:16:08] Or have some other kind of schema where we don't need an extra level of indirection just to distinguish between IPs and users
[22:16:08] and for users it's just a (SELECT actor_id WHERE actor_user=1234)
[22:16:21] James_F: well, "less than two billion" is a big ballpark.
[22:16:25] the number of actor rows will be strictly smaller than the number of revision rows, right?
[22:16:26] DanielK_WMDE: Indeed.
[22:16:46] even in the worst case of a user deliberately polluting the actor table by cycling through every IPv6 address in a /64 block
[22:16:50] TimStarling: Yes, it's marked as "<<1 per revision" in the wiki page.
[22:16:53] TimStarling: <= (revision+archive+anything else using it)
[22:17:02] RoanKattouw: how many rows would we need to add to cover all the IPs we need?
[22:17:16] 128 bits of IPv6.... too big for BIGINT ;)
[22:17:21] Fair, a user could deliberately pollute the actor table with account creations.
[22:17:28] Hmm yeah...
[22:17:33] brion: We could just use two bigints. ;-)
[22:17:34] TimStarling: probably, unless we also use the actor table for log entries.
[22:17:47] brion: should we reference the actor table from rc and log rows?
[22:17:49] I think I'm reluctantly sold on the actor table
[22:17:54] James_F: Is that distinct from polluting the user table (or the CentralAuth tables) with nonsense accounts?
[22:17:57] DanielK_WMDE: oh yes! we should, crap i forgot to update those
[22:17:57] DanielK_WMDE: I'd +1 that
[22:18:05] Anywhere else we have (user, user_text) pairs
[22:18:24] yay more updates
[22:18:28] Marybelle: No, just that *theoretically* it could grow beyond revision as we have non-revision-row-creating IP actions.
[22:18:32] #action update recentchanges and logging with actor refs
[22:18:32] #info recentchanges and log tables should also reference the actor table
[22:19:03] James_F: well, either we overflow log_id or we overflow actor_id
[22:19:03] #info some concerns about the size of the actor table. would always be smaller than the revision+log table though.
[22:19:05] either way it ain't good
[22:19:09] Indeed.
[22:19:34] RoanKattouw: Looks like revision, archive, oldimage, filearchive, recentchanges, and logging have _user/_user_text.
[22:19:56] #action update filearchive and oldimage with actor refs
[22:20:15] (I think oldimage might be getting refactored as part of Krinkle's work? Not sure.)
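Pointing the other tables at that actor row, per the #action items above, would look roughly like this. Column names (rev_actor, log_actor, rc_actor) are hypothetical, not the RFC's final choice.

```sql
-- Sketch: each table that today carries an (xx_user, xx_user_text) pair
-- would instead carry a single actor reference.
ALTER TABLE revision      ADD COLUMN rev_actor BIGINT UNSIGNED NOT NULL DEFAULT 0;
ALTER TABLE logging       ADD COLUMN log_actor BIGINT UNSIGNED NOT NULL DEFAULT 0;
ALTER TABLE recentchanges ADD COLUMN rc_actor  BIGINT UNSIGNED NOT NULL DEFAULT 0;

-- "Who made revision 12345?" then becomes an integer join instead of
-- reading (and indexing) a large string column:
SELECT actor_name, actor_user
FROM revision
JOIN actor ON actor_id = rev_actor
WHERE rev_id = 12345;
```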
[22:20:19] We don't have to update those all at once, I suppose, but I would suggest we agree that we're going to update all of those eventually
[22:20:21] we could change those later too, in principle, but i'd rather be consistent
[22:20:21] Yes.
[22:20:32] I'd like to briefly mention that there is an RFC proposing to factor rev_comment out of revision: https://phabricator.wikimedia.org/T153333
[22:20:34] #link https://phabricator.wikimedia.org/T153333
[22:20:37] revision will be the time-limiting factor
[22:20:45] DanielK_WMDE: Yeah, it's part of brion's RfC.
[22:20:54] s/whatever we're calling this document/
[22:20:55] conceptually, I think so, yes
[22:20:57] changing logging to reference actor will mean we can finally have anonymous loggable actions
[22:21:01] DanielK_WMDE: yeah I think we'll officially merge it in
[22:21:15] hmm, or do we already have those?
[22:21:16] yea, i just wanted to make sure everyone here sees it.
[22:21:16] TimStarling: Finally? We already have log_user_text?
[22:21:24] Marybelle: Indexable.
[22:21:28] i hate it when we forget about a schema change from years ago ;)
[22:21:37] ah indexing is nice
[22:21:39] +1
[22:21:39] We index rev_user_text. Surely we could index log_user_text as well if we wanted.
[22:21:56] it would be nice if we didn't have to have these indexes
[22:22:00] they are big and expensive
[22:22:03] Actually, we already do index log_user_text.
[22:22:07] looking things up by an integer id is much quicker
[22:22:21] yep, only need one table's worth of large string index this way
[22:22:25] Smaller indexes are a nice benefit but not Earth-shattering.
[22:22:30] that's the kind of thing that makes DBAs happy
[22:22:40] Is there a reason to /not/ do it?
[22:22:43] #info T153333 should be merged into Brion's RFC
[22:22:43] T153333: RFC: How should we store longer revision comments? - https://phabricator.wikimedia.org/T153333
[22:22:52] I'm concerned about the scope of this work and how it can be broken up into pieces.
[22:22:59] *nod*
[22:23:04] In execution?
[22:23:09] Re merging T153333, kaldari suggested he'd be OK with that but did ask questions about the timescale of it all
[22:23:17] (on the task)
[22:23:18] James_F: Yeah, especially over a multi-year period.
[22:23:23] I was under the impression that we'd break it into pieces (e.g. the image tables).
[22:23:26] I'm trying not to let it get too big, which is one of the reasons I didn't want to majorly change archive beyond using the compact fields.
[22:23:42] we could also do some tables in later pieces, yes
[22:23:46] I have a feeling additional requirements will crop up in 2017 and 2018 and disrupt some of the current plans.
[22:23:50] what James_F said. We should distinguish between splitting design documents/discussions, and splitting (or bundling) schema change deployment.
[22:23:58] Marybelle: i agree, i do as well
[22:24:21] that's one of the reasons i _am_ bundling things together ;)
[22:24:28] for instance upgrading things to bigint
[22:24:53] and combining the compaction with the multi-content foundation
[22:24:54] Better to have an overall plan and adjust it as facts change instead of having no overall plan, just a lot of 'little' adjustments, though?
[22:25:05] perhaps we should leave the bundling (or lack thereof) to the DBAs
[22:25:12] Wait, why don't we compress the rev comments db and uncompress when needed?
[22:25:45] Zppix: compress as in zip? because comments are too short to benefit much from compression
[22:25:45] Zppix: we could gzip and they'd be smaller, yes. might be a wise addition actually. (or can mysql do that for us, I wonder?)
[22:25:50] With a comment table that could be doable; not sure how much there is to gain there though, because comments aren't that long?
[22:25:56] most would not benefit a lot but ... i wonder
[22:25:58] InnoDB page compression would probably be more effective than application-level compression in this case
[22:26:01] brion: i think mysql can
[22:26:10] you know there's a Unicode algorithm for short string compression
[22:26:10] MySQL rather
[22:26:13] RoanKattouw: They might become vast given the change to 700-byte-long ones. :-)
[22:26:21] but I don't think it's worth doing in this case
[22:26:21] brion: Unless there's magic MySQL compression, we'd also need a field in the comment table telling us whether the content is compressed and how (like in the text table)
[22:26:31] yeah
[22:26:35] Zppix: but normalizing by introducing an indirection will provide big improvements on its own
[22:26:52] DanielK_WMDE: we would have to do that eventually, no?
[22:26:59] the unicode thing is based on escape codes that switch between code pages, which makes it a bit shorter for e.g. russian text
[22:27:00] (There's a bit of work that we need to do on the front end to discourage users from writing very long edit summaries in one-byte-per-glyph languages like English without doing so for three-byte-per-glyph ones that need the change.)
[22:27:01] But what I hear from Tim is that 1) we might not need it and 2) if we do, it'll be a MySQL magic thing
[22:27:03] that brings us back to something DanielK_WMDE brought up about using the content storage for comments, but that implies more overhead on the fetch
[22:27:09] Zppix: introduce an indirection? Yes, it's part of this proposal.
[22:27:25] No, i meant compression, my bad daniel
[22:27:38] we could. not sure that we should, atm
[22:27:55] It would be nice to have some input from a dba
[22:28:01] compression means less IO, more CPU and memory. it's always a tradeoff.
[22:28:08] #info some talk about compression for long comments. worth investigating later, might need a second flags field
[22:28:29] thinking about other things that might need updating for actor: AbuseFilter has three tables with *_user_text fields
[22:28:43] re what brion said regarding using content (blob) storage: i'd prefer not to wait for that. if we introduce a new table for comments, we can still make that available as content (blob) storage later, when we have the MCR infrastructure in place.
[22:28:57] Agreed
[22:28:58] RoanKattouw: oh also I saw you made a comment on the talk page about reusing comment fields. If we wanted to search for matching comments we'd need an indexed hash field
[22:29:02] TimStarling: AbuseFilter will need a serious overhaul for MCR
[22:29:04] maybe I should just query the mysql table for every field named *_user_text
[22:29:21] Ah, yeah, I was only looking at MediaWiki core's tables.sql.
[22:29:24] brion: Yeah, I don't think that's really worth it
[22:29:35] *nod*
[22:29:37] Hence my skepticism
[22:29:48] #info Daniel would prefer to not wait for MCR to allow comments to be stored as content blobs. if we introduce a new table for comments, we can still make that available as content (blob) storage later
[22:29:49] We've got to keep in mind replication lag as well; will this affect that at all?
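TimStarling's InnoDB page compression suggestion is a table option rather than application code, which is why it avoids the extra "is this compressed, and how" flags field. A sketch against a hypothetical comment table of the kind T153333 proposes:

```sql
-- Sketch: InnoDB page compression on a (hypothetical) comment table,
-- instead of gzipping individual comments in PHP and tracking a flags field.
-- On older MySQL this needs innodb_file_per_table and the Barracuda format.
ALTER TABLE comment ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```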
[22:29:53] That said I think the comment table is fine, to reduce the size of the main table
[22:30:10] I'm just skeptical that comment reuse will happen basically ever
[22:30:14] And allow longer comments.
[22:30:17] *nod*
[22:30:25] If I were doing this, I'd probably focus on the easiest/least controversial parts first. Splitting out rev_comment seems like an easy win, for example.
[22:30:38] #info Zppix suggests to compress comments. TimStarling says probably not worth it, but could use InnoDB compression
[22:30:41] Maybe if the logging table also used it, then page moves and protections and stuff could share a comment row between the log entry and the fake revision
[22:30:48] But that's about all I see
[22:30:54] Which bits do you think are controversial?
[22:31:09] Zppix: more tables does mean more INSERT commands and such, would change the balance of write queries a bit but should not hugely alter the total amount of data being put in
[22:31:20] James_F: I think keeping archive around should be more controversial, but probably isn't.
[22:31:31] RoanKattouw: comment reuse will happen for bot edits. also for wikidata edits. also for gadget edits.
[22:31:42] brion: i suggest working with operations with this patch majorly
[22:31:45] RoanKattouw: also for stuff like "rvv".
[22:31:46] I'm pretty wary of the actor/user_entry stuff, but if it will make it easier to move away from IP addresses in the future, eh.
[22:31:48] Zppix: oh yes :)
[22:31:48] DanielK_WMDE: How though? How would MW know that it's the same comment
[22:32:01] jynus and i will become good friends or enemies i think before this is done ;)
[22:32:13] RoanKattouw: my suggestion is to hash the text instead of using incremental ids.
[22:32:23] brion: hehe
[22:32:25] DanielK_WMDE: for common edit summaries maybe only keep track of the users who use common edit summaries?
[22:32:28] Marybelle: Agreed in principle, but the DBAs may want to bundle certain changes in which case I think that should take priority
[22:32:39] James_F: Also, whether and how we keep the text table seems up for debate.
[22:32:45] DanielK_WMDE: Right, OK, hashes as IDs. Yeah, maybe.
[22:32:52] Plus smaller pieces like continuing to use SHA1 in archive and revision, for example.
[22:32:56] OK, should we discuss those?
[22:33:03] Yeah I've been wanting to bring up SHA1
[22:33:07] RoanKattouw: i really like that approach, but i often get strange looks for it :)
[22:33:11] all *_user_text fields in enwiki: http://paste.tstarling.com/p/lTjzkE.html
[22:33:14] rev_sha1 seems to be used by nothing, AFAIK
[22:33:22] Marybelle: vague plan is to replace the text table later in the MCR deployment timeframe, replace it with more flexible blob storage. we'll probably keep text as is for this round.
[22:33:34] Zppix: supporting multiple ways to store comments would be very complex/annoying
[22:33:45] How multi-content revisions happen, in relation to this RFC, remains murky to me.
[22:33:47] RoanKattouw: iirc it's exposed in dumps and api and can be used by bots etc, but isn't used internally
[22:33:49] iirc rev_sha1 is not used
[22:33:51] Like slots are mentioned in this RFC.
[22:33:55] #info daniel suggests to use the text hash as the id in the comment table.
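DanielK's hash-keyed comment idea in concrete terms: deduplicate on an indexed hash of the comment text so that repeated summaries ("rvv", bot and Wikidata summaries) share one row. This sketch keeps an integer id plus a unique hash rather than literally using the hash as the primary key; all names are illustrative.

```sql
-- Sketch: deduplicate comment text via an indexed hash.
CREATE TABLE comment (
  comment_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  comment_hash BINARY(20) NOT NULL,   -- e.g. UNHEX(SHA1(comment_text))
  comment_text BLOB NOT NULL,
  UNIQUE KEY comment_hash (comment_hash)
) ENGINE=InnoDB;

-- Insert-or-reuse: if the same text already exists, fetch its id instead
-- of creating a duplicate row.
INSERT INTO comment (comment_hash, comment_text)
VALUES (UNHEX(SHA1('rvv')), 'rvv')
ON DUPLICATE KEY UPDATE comment_id = LAST_INSERT_ID(comment_id);
SELECT LAST_INSERT_ID();   -- id of the new or pre-existing row
```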
[22:34:02] Yes, it is exposed
[22:34:06] Zppix: it's used by external tools
[22:34:09] And perhaps someone does use it for revert detection somewhere
[22:34:15] Zppix: the main use case is detecting manual reverts
[22:34:15] However, note that this proposed schema breaks it anyway
[22:34:22] brion: So like slots and content tables... those are still part of this RFC and not part of a separate RFC for MCR?
[22:34:39] There will no longer be a SHA1 of a *revision* (unless you compute it by hashing the concat of the hashes), only of the content of each slot
[22:34:40] Marybelle: This is the MCR RfC.
[22:34:43] Marybelle: potentially we could use a different table for the revision<->content join that doesn't have infrastructure for MCR, then replace it later
[22:34:45] DanielK_WMDE: how often though? cause if it's not used regularly then why have it?
[22:34:56] James_F: Hmmm, oh wow.
[22:35:06] Zppix: important is not the same as frequent.
[22:35:07] but it's so easy to put in the slots table now and just use the one :)
[22:35:17] ( https://phabricator.wikimedia.org/T23860 for context on rev_sha1)
[22:35:23] Zppix: there are things that are rarely used that are still very good to have
[22:35:30] brion: And it would delay everything massively, which'd be irritating.
[22:35:45] OK so I'm willing to accept that a SHA1 field has some use
[22:35:46] DanielK_WMDE: optimization is preferred no?
[22:35:54] longer list of _user_text fields, from s3: http://paste.tstarling.com/p/ErblGg.html
[22:35:56] (Or maybe SHA256 if we're concerned about the recent SHA1 collision attacks)
[22:36:10] TimStarling: Ah, thanks for table name.
[22:36:13] Zppix: yes, but we optimize for scalability, not bits.
[22:36:18] But in that situation I think we should have a plan for how it remains relevant
[22:36:26] RoanKattouw: i don't think sha256 is needed, per se
[22:36:45] DanielK_WMDE: true i guess good point
[22:36:46] Marybelle: I should have sorted by it, hang on
[22:36:54] For multi-slot content, do we expose sha1( slot1hash + slot2hash + slot3hash ) as the fake hash of the content?
[22:36:59] ob, but on the topic of sha1 and db fields...
[22:37:15] do we want to use names that indicate the hash algorithm, or avoid it?
[22:37:16] (I didn't realize some of these tables were still around. Some should be archived/deleted.)
[22:37:25] rev_sha1, rev_sha256, or rev_hash?
[22:37:27] RoanKattouw: i think that's roughly the idea
[22:37:29] http://paste.tstarling.com/p/AbraOv.html
[22:37:30] i'm a bit torn on the issue
[22:37:40] RoanKattouw: yes.
[22:37:42] DanielK_WMDE: On the one hand I'd say avoid (rev_hash), on the other it seems like the length is pretty tied to the algorithm
[22:37:46] Should rev_sha1 be hashed instead then DanielK_WMDE
[22:37:47] DanielK_WMDE: if we want to keep the old field for things that might need it then we should keep the name in there
[22:38:10] also the length does depend on the algo yeah
[22:38:12] RoanKattouw: actually, hash( slot3 + hash( slot2 + hash( slot1 ) ) )
[22:38:39] TimStarling: I assume s3 covers all the main production extensions?
[22:38:48] RoanKattouw: Personally I've found the base 36 part the most annoying part when dealing with our hashes.
[22:38:50] James_F: probably not
[22:38:57] DanielK_WMDE: You can't do that without another DB field. All you have is content_sha1 fields that store hash(slot1), hash(slot2) and hash(slot3)
[22:39:06] RoanKattouw: yes, and if the algo isn't in the name, how do we know which one to use? a prefix?
[22:39:10] TimStarling: Ha. Something to double-check later.
[22:39:21] James_F: out of curiosity, will whatever the outcome is affect bots?
[22:39:28] Zppix: Yes.
[22:39:43] James_F: i'd be sure to keep note of that then
[22:40:02] Zppix: There'll be lots of notice given before anything affects production users.
[22:40:04] Zppix: initially many bots won't notice from the outside. but once we start using the multiple content slots later, it'll become more important to be MCR-aware
[22:40:11] DanielK_WMDE> RoanKattouw: actually, hash( slot3 + hash( slot2 + hash( slot1 ) ) )
[22:40:14] Oh right, we store hashes in base36, I'd forgotten about that
[22:40:15] blockchain lol
[22:40:33] base36... the poor man's ASCII-compatible compression
[22:40:39] #info RoanKattouw: Personally I've found the base 36 part the most annoying part when dealing with our hashes.
[22:40:48] I've not read the whole rfc and other info but are all WMF dbs affected?
[22:40:51] have we thought of an actual use case for rev_sha1 yet?
[22:40:55] Let's kill that? At the very least we could move to base64 like normal people...
[22:40:56] RoanKattouw: It's easier to forget until you have to re-implement the damn things in another language/outside MediaWiki. ;-)
[22:41:07] Zppix: yes, every wiki will get upgraded
[22:41:15] brion: ack thanks
[22:41:27] but probably not at the same time, beware :)
[22:41:29] TimStarling: All I've heard is mumbling about external applications that might use it / are believed to use it because it's exposed by the API and in dumps
[22:41:40] We're going to break the dumps with MCR.
[22:41:46] What's another change?
[22:41:56] Also all the hashes will change
[22:41:59] Once we have MCR
[22:42:06] Should we ask around for usages of the sha1?
[22:42:10] Even single-slot content will probably have a different hash
[22:42:13] Yeah, perhaps
[22:42:24] If it turns out we can get away with killing rev_sha1, I'd love to
[22:42:27] Will we have a different opinion if people are using it?
[22:42:27] Which brings me to another topic
[22:42:30] removing it would have the advantage of avoiding the need to change the hash function once every few years to prevent an attack on a non-existent application
[22:42:36] T19237
[22:42:37] T19237: Better integration of patrol feature (rc_patrolled, rc_bot in revision: rv_bot, rv_patrolled) - https://phabricator.wikimedia.org/T19237
[22:42:45] heh
[22:42:50] TimStarling: +1
[22:42:53] FKA bug 17237
[22:43:03] I've used the dump hashes before to verify that my terrible XML parser ("parser") is getting the full content of the revision.
[22:43:19] RoanKattouw: hashes for single-slot content will not change, i took care that it works that way.
[22:43:26] Marybelle: But that'd be the same use, just with a different value, right?
[22:43:26] RoanKattouw: that's actually the reason for the odd nesting.
[22:43:30] so my feeling on patrolled: patrolling should be an extension, and it should store patrolling state in its own table.
[22:43:40] DanielK_WMDE: If you want to do your blockchain-style nesting, though, you will need a hash field in the revision table, not just in the content table
[22:43:46] James_F: Yeah, I was gonna say, the hash type matters a lot less, it's mostly just for data integrity purposes.
[22:43:53] TimStarling: use case for rev_hash is detecting manual deletions / other cases of duplicate content. that's the only use case i know.
[22:43:55] Which is why I'm not sure it matters that it's SHA1 or whatever.
[22:43:55] * James_F nods.
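The nested construction DanielK quotes, hash( slot3 + hash( slot2 + hash( slot1 ) ) ), can be written directly in SQL. With a single slot it collapses to plain SHA1(content), which is why existing single-slot hashes would not change (MediaWiki additionally re-encodes the hex digest in base 36 before storing it, the part RoanKattouw finds annoying). A sketch with placeholder slot contents:

```sql
-- Sketch of the nested revision hash for three content slots.
-- With only one slot the expression is just SHA1(slot1), i.e. what
-- rev_sha1 effectively holds today (modulo the base-36 re-encoding).
SELECT SHA1(CONCAT(
         'slot3 content',
         SHA1(CONCAT(
           'slot2 content',
           SHA1('slot1 content')
         ))
       )) AS rev_hash;
```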
[22:44:11] Marybelle: Also that doesn't require us to store hashes in the DB, just to generate them as part of the dump process
[22:44:13] RoanKattouw: sure, i was planning to keep the hash field in the rev table.
[22:44:28] DanielK_WMDE: OK, well brion apparently wasn't ;)
[22:44:29] brion: +1 re patrol
[22:44:31] RoanKattouw: I think the dump process itself may also use them or want to?
[22:44:35] (from the looks of the wiki page)
[22:44:42] #info patrolling should be an extension, and it should store patrolling state in its own table.
[22:44:45] 14:43:29  so my feeling on patrolled: patrolling should be an extension, and it should store patrolling state in its own table.
[22:44:56] Like the dumps want to verify that they're dumping what's actually in the text table.
[22:44:59] I don't agree on the extension bit, but I agree on the separate table
[22:45:08] RoanKattouw: *if* we want to keep rev_sha1, it should use the nesting approach for MCR. if we drop it, fine.
[22:45:16] we don't have any need for it internally, i don't think
[22:45:16] I like the idea of de-coreifying patrolling, so we can have different systems.
[22:45:17] There are other things I would like, like rev_reverted_by, that would be easier to do if there was a separate table
[22:45:38] E.g. FlaggedRevs and Patrolling would be alternatives, not one hacking around the other.
[22:45:43] I like the idea of getting rid of recentchanges, more generally.
[22:45:54] Since it's basically revision with a few extra fields.
[22:46:08] domas: Hi!
[22:46:09] RoanKattouw: i'd rather have rev_logical_parent than rev_reverted_by. or both.
[22:46:10] *nod* it's a funky summary table, and could be divided up differently
[22:46:17] Marybelle: Sounds like a new RfC for that, unless there's a major win?
[22:46:17] Hi!!
[22:46:25] hey domas :)
[22:46:29] #info I like the idea of de-coreifying patrolling, so we can have different systems.
[22:46:41] James_F: Prolly. Though this one seems like a catchall for all schema changes for the next few years. ;-)
[22:46:42] DanielK_WMDE: Sure, the main thing I want to know is *whether* a revision is reverted, knowing by what is secondary
[22:46:49] I disagree with decoring patrolling
[22:46:59] #info need more info to decide if we need to keep rev_sha1. the only use cases seem to be external (e.g. detecting manual reverts)
[22:47:04] i don't think there's a huge need to change recentchanges table in this round, if we know we're probably going to rip it out and redo it along with patrolling etc
[22:47:08] domas: we're changing ALL the tables
[22:47:12] but i'd love to change it later ;)
[22:47:19] about time! to what?
[22:47:25] I mean it would cause lots of userscripts and stuff to break, de-coring patrolling
[22:47:26] brion: We'll need to at least prolly keep recentchanges in sync with changes to revision?
[22:47:27] * brion trolls: MONGODB
[22:47:27] domas: Different tables.
[22:47:38] recentchanges isn't just revision, it's also logging
[22:47:41] domas: silly putty
[22:47:46] domas: https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2
[22:47:49] It's like revision plus logging minus duplication between the two :/
[22:47:51] Marybelle: we could trim a couple bits, or we could leave it. it's accessed on different paths so not a biggie
[22:47:51] Zppix: No. I'm not talking about any change to WMF production wikis, just moving code into different repos for cleaner config.
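brion's idea of patrol state living in its own table, owned by a patrolling extension rather than riding on recentchanges rows, might look something like this sketch; all names are invented for illustration.

```sql
-- Sketch: patrol state keyed by revision in its own table, so it is not
-- tied to the limited lifecycle of recentchanges rows.
CREATE TABLE patrol_state (
  ps_rev_id    INT UNSIGNED NOT NULL PRIMARY KEY,   -- the patrolled revision
  ps_patroller BIGINT UNSIGNED NOT NULL,            -- actor who marked it patrolled
  ps_timestamp BINARY(14) NOT NULL                  -- MediaWiki-style timestamp
) ENGINE=InnoDB;
```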
[22:48:05] Marybelle: archive needs to be changed because Revision can read from both revision and archive tables
[22:48:06] Ah ok then carry on James_F
[22:48:20] oh wow
[22:48:24] RoanKattouw: I mean, logging covering all non-edit actions is also kind of goofy. And we expose it so that like Special:Contributions only shows edits, but not every other type of contribution (except weird edge cases like page moves).
[22:48:30] It's all kind of goofy.
[22:48:36] Yeah that's weird too
[22:48:41] good that you have server resources for this
[22:48:42] \o/
[22:48:49] flash and what not!
[22:48:50] Oh and RC also contains things like categorization and Wikidata these days, which aren't either revisions or log entries
[22:48:57] heh
[22:49:01] #info i don't think there's a huge need to change recentchanges table in this round
[22:49:16] RoanKattouw: recentchanges can't contain anything too important given its limited lifecycle.
[22:49:38] Marybelle: It's pretty critical to how wikis operate right now.
[22:49:42] Yeah it's mostly there as an optimization AIUI
[22:49:51] (Given patrolling.)
[22:49:59] an optimization that they grew to store patrolling :P
[22:50:00] Should we play around with the ideas on testwiki and/or betacluster before even touching prod?
[22:50:05] And also yes, patrol flags and bot flags are only stored in RC
[22:50:08] Zppix: oh hells yes
[22:50:11] James_F: Fair. :-)
[22:50:14] That's what T19237 asked to fix
[22:50:15] T19237: Better integration of patrol feature (rc_patrolled, rc_bot in revision: rv_bot, rv_patrolled) - https://phabricator.wikimedia.org/T19237
[22:50:30] Marybelle: But yes, fixing it so that we didn't need it would be great.
[22:50:41] are you sure you don't want to have (page, revision, ...) structured PKs on other tables?
[22:50:46] like comment and others?
[22:50:47] the code changes will have to survive transition mode with the old code, so we can dribble things out over some time
[22:50:51] if you
[22:50:57] if you're really eager to move stuff out
[22:51:37] domas: that'd be great for just splitting tables out, but then we get no reuse of common values
[22:51:57] you lose data clustering too
[22:52:01] true dat :)
[22:52:28] I say we go through the db and delete anything we don't need, see what happens, then move forward with the plan
[22:52:33] heh
[22:52:45] domas: jynus swears that join lookups are cheap enough ;)
[22:53:04] obviously we'll have to put some things through their paces
[22:53:06] the actor move is also weird
[22:53:10] James_F: Re: actor/user_entry, I'm also not sure how all of this will interact with CentralAuth and its tables.
[22:53:23] Marybelle: I have faith we'll make it work.
[22:53:25] you could keep user_id in the table
[22:53:46] how are actor rows going to be created?
[22:53:51] Marybelle: same as current rev_user/rev_user_text at present. always refers to local user rows
[22:53:53] Marybelle: Eventually™ we're going to migrate to a single set of user_ids in WMF production so we can get rid of CentralAuth or something.
[22:53:58] Ok, i think we should leave the nitty gritty for now, and come to the next steps
[22:54:04] James_F: From your lips to God's ears.
[22:54:07] 5 minutes left.
[22:54:11] domas: insert ignore probably
[22:54:15] brion: do you feel you can move on based on today's conversation?
[22:54:22] do you need additional input?
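domas's question about how actor rows get created, and brion's "insert ignore probably", in concrete form. This is a sketch that reuses the hypothetical actor table from earlier, not the RFC's final code path.

```sql
-- Sketch: create the actor row if it does not exist yet, ignoring the
-- duplicate-key error if a concurrent request already created it,
-- then read the id back. Relies on the unique index on actor_name.
INSERT IGNORE INTO actor (actor_user, actor_name)
VALUES (NULL, '203.0.113.42');   -- an anonymous (IP) actor
SELECT actor_id FROM actor WHERE actor_name = '203.0.113.42';
```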
[22:54:27] brion: (type, id) may work
[22:54:29] DanielK_WMDE: yeah i've got a bunch of todos and i need more input from jynus later
[22:54:33] instead of (id)->(id)
[22:54:43] I don't understand actor table
[22:54:53] * DanielK_WMDE sounds the five minute claxon
[22:55:03] domas: it's a place to put IPs
[22:55:10] how large is ipv4?
[22:55:12] brion: ill be around to help out if need be (with what access i do have)
[22:55:14] how large is ipv6?
[22:55:17] domas: it lets us consolidate the (id-for-user, name-for-not-user) space
[22:55:21] and yes ipv6 is huge ;)
[22:55:27] ipv6 is two bigints
[22:55:33] but if you can kill those tables you're already flooding revision or logging table ;)
[22:55:38] domas: 4 billion. but we do ipv6 now. anyway, we discussed that earlier. actors < revisions, obviously.
[22:55:42] domas: # of distinct IPs is still less than the # of edits, so rev_id will overflow before actor_id ;)
[22:55:45] *actor_id
[22:55:53] domas: Also we'll just block those users.
[22:55:59] hehe
[22:56:10] *shrug*, your row overhead for actor row will be much larger than any overhead within the table would be
[22:56:25] just store addresses efficiently
[22:56:42] domas: we may also expand actor later to cover more data for non-logged-in folks
[22:56:50] right
[22:56:54] but if we don't then yeah it's a couple BIGINTs :)
[22:56:57] domas: my secret plan is to just store an ipv6 address, and encode the ids of logged-in users in a reserved ip block. but i get dirty looks for suggesting that :D
[22:57:00] Storing addresses efficiently (or at least in a way that is more indexable for ranges) was also suggested on the talk page
[22:57:13] DanielK_WMDE: you could just hash them
[22:57:20] DanielK_WMDE: not bad, long jedi
[22:57:26] hehe
[22:57:28] young
[22:57:30] ergh
[22:57:43] you could also route those IPs
[22:57:44] bad, your typing is!
[22:57:45] to user pages!
[22:57:45] brion: is there a phab task for this yet?
[22:57:53] #info domas warns actor table is low-efficiency for space. consider benefits of keeping separate fields with more efficient storage for users
[22:57:53] anyway
[22:58:05] To WMF cluster IPv6s.
[22:58:11] comment storage is also a weird thing
[22:58:16] #info Logical storage of IP addresses is desired for range scans anyway
[22:58:18] TimStarling: should we use the MCR RFC or should I open a new one? I'll open a new one
[22:58:20] Zppix: you still need a place to look up the actual text for the hash. which brings you back to square 1.
[22:58:26] open a new one
[22:58:33] you don't need separate table for most of them, do you?
[22:58:35] #action brion will open a fresh RfC ticket on phab for the current schema initiative
[22:58:35] With MCR as a sub-task?
[22:58:35] what is median comment length?
[22:58:41] DanielK_WMDE: that could be a different rfc perhaps?
[22:58:41] we need to have somewhere for jynus and domas to duke it out, with a public record that we can analyse later
[22:58:46] brion: +1 for new one. MCR is already a frigging kitchen sink
[22:58:50] domas: Support for triple-byte languages.
[22:58:57] I understand
[22:58:58] you can have both
[22:59:02] yes :D
[22:59:12] brion: just create a subtask
[22:59:16] you know, like with text -> ES , hack hack hack!
[22:59:29] when in doubt, use another level of indirection ;)
[22:59:31] domas: i want to put json there...
[22:59:34] brion will create a task setting out the plan, and domas, do you think you could explain on that task why it is a terrible idea?
[22:59:38] brion: indeed!
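domas's "just store addresses efficiently" and the earlier two-BIGINT idea could both be met by keeping IPs in binary form, which also gives the range-scannability noted in the #info above. A sketch with invented names, not part of the RFC:

```sql
-- Sketch: binary IP storage. INET6_ATON() packs IPv4 and IPv6 into at most
-- 16 bytes (MySQL 5.6+); the same value could instead be split across two
-- unsigned BIGINT columns if fixed-width integers are preferred.
CREATE TABLE ip_addr (
  ipa_actor BIGINT UNSIGNED NOT NULL PRIMARY KEY,   -- refers to the actor sketch above
  ipa_addr  VARBINARY(16) NOT NULL,
  KEY ipa_addr (ipa_addr)
) ENGINE=InnoDB;

-- Binary IPv6 sorts bytewise, so a /64 range scan is a simple BETWEEN:
SELECT ipa_actor FROM ip_addr
WHERE ipa_addr BETWEEN INET6_ATON('2001:db8::')
                   AND INET6_ATON('2001:db8::ffff:ffff:ffff:ffff');
```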
[22:59:42] :D
[22:59:54] well, I don't know how much you care about efficiency
[22:59:55] #info file a proper rfc, linked with but not part of MCR
[22:59:58] ok folks
[23:00:04] I just know that if change decreases efficiency, I look for minimal damage
[23:00:06] :)
[23:00:10] any last comments? wishes? insults?
[23:00:17] I got nothing anything else
[23:00:25] awesome, thanks all :)
[23:00:31] \o/
[23:00:35] #endmeeting
[23:00:36] Meeting ended Wed Mar 8 23:00:35 2017 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[23:00:36] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-08-22.03.html
[23:00:36] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-08-22.03.txt
[23:00:37] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-08-22.03.wiki
[23:00:37] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-08-22.03.log.html
[23:00:51] i gotta run; domas i'll run future updates by you as well :)
[23:01:14] TimStarling: no ideas are terrible!
[23:01:35] ok, just wrong then ;)
[23:01:35] they are just at different positions in various valuation criteria!
[23:01:55] domas: please comment on the talk page! it has flow!
[23:02:24] brion: ill be around ping me if you want some help
[23:02:48] cool
[23:03:48] ok, I'll update the text & the ticket on phab tomorrow
[23:04:28] Ack thanks brion
[23:05:20] thanks for input! be back later...