[21:57:45] \o
[21:58:22] * robla waves in anticipation of https://phabricator.wikimedia.org/E107
[21:59:31] * robla checks that yurik is here
[22:00:34] * yurik is here
[22:00:42] * yurik pokes robla
[22:00:50] #startmeeting RFC meeting
[22:00:51] Meeting started Wed Dec 9 22:00:50 2015 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[22:00:51] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[22:00:51] The meeting name has been set to 'rfc_meeting'
[22:01:16] #topic RFC: Graph/Graphoid/Kartographer - data storage architecture | Wikimedia meetings channel | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[22:01:31] #link https://phabricator.wikimedia.org/T119043
[22:03:22] the big question - how do we handle data storage that is easily accessible to both MW, services, and frontend
[22:04:10] restbase!
[22:04:21] ping gwicke
[22:04:53] well, we have been discussing this for a while, and would suggest to reword it more as a question about separating data and presentation
[22:05:07] like I said on the ticket, I would like it if we could do this the same way for everything, instead of reinventing data storage every time we have a new kind of rich content
[22:06:02] including migrating the Score extension to the new thing
[22:06:04] the storage part itself isn't so interesting, but there are real questions about update propagation, layering / reuse and visual editing
[22:06:44] agree 100%, but the main issue remains - should we build on top of the current SQL table structure, but expose it directly to the services, or should we continue wrapping SQL with MW API, or should we build a totally separate storage system for this.
[22:07:22] *for this meaning nodejs services, etc
[22:07:40] * yurik pings DanielK_WMDE__
[22:08:57] the question is whether to store data in SQL or cassandra?
[22:09:37] is our cassandra setup considered highly reliable now? it's not going to eat all our data?
[22:09:42] TimStarling, i think this is among the bigger questions, yes. Should we continue with the SQL model wrapped in PHP, SQL model exposed to all services, or SQL+Cassandra model
[22:09:48] So far cassandra has only contained things we can regenerate.
[22:09:50] i think its not
[22:10:15] And this data will be editable by users on-wiki. Presumably those graphs and maps will present some kind of edit interface and store in "the storage" instead of revision table.
[22:10:19] revision/text table
[22:10:26] it hasn't eaten any data, and availability has been 100% for the last months
[22:10:45] does it have offsite replication?
[22:10:55] but, I think you are starting from the wrong end here
[22:10:56] Krinkle, the data for maps/graphs can also be regenerated because it comes from wiki markup.
[22:11:00] for access control, labs replication, dumps, etc. sql would make sense imho.
[22:11:16] But we can access it through a service for HA if we want, e.g. like virtual rest service.
[22:11:28] so that in prod we wouldn't query it directly.
[22:11:31] you haven't even defined what should be stored, what the API will do & what data massaging might be needed, and are already talking about which storage backend to use
[22:12:15] TimStarling: RB was one of the first services to be fully replicated to codfw
[22:12:18] yurik: But you said it's limited to 64KB?
[22:12:38] yurik is currently using page_props, which is wrong and has to stop
[22:13:14] page_props is limited to 64KB
[22:13:15] Right, so we're not talking about storing the revision text for Graph pages elsewhere, but about things inside <graph> that will remain in wikitext as well, but then during the parsing get stored elsewhere. More like secondary data that we need available more widely.
[22:13:22] Right, not the text table. OK.
[22:13:42] Krinkle, TimStarling - yes, i need a bigger store, and as a stopgap i will patch it shortly with gzip
[22:14:08] when yurik explained it to me on IRC, the problem seemed identical in every detail to math and score
[22:14:14] Yeah, I'm doing the same with TemplateData at the moment. also gzipped in page_props
[22:14:21] well, every detail except one...
[22:14:22] exactly
[22:14:23] one of the major issues I see with the current implementation is the mixing of presentation and data, and the hacky use of transclusion to combine the two into a json blob
[22:14:25] with a query API
[22:14:27] for VE
[22:15:14] you have a tag in wikitext, and the contents of it gets sent off to a service and turned into an image
[22:15:22] gwicke, that is a very different issue really - if you look at the spec i will use for maps, it also combines data and presentation -- https://github.com/mapbox/simplestyle-spec/tree/master/1.1.0
[22:15:26] Isn't this kind of storage what DanielK_WMDE__'s proposal for alternate resource streams for a page was about?
[22:15:26] and then you deliver an image link to the user in the HTML
[22:15:50] same as math and score
[22:16:01] math and score don't transclude their data
[22:16:06] Yeah, and we need a deterministic url that can be regenerated via a 404 handler ideally (and cache miss / purging)
[22:16:09] so if the images are in a cache, you need a reliable mapping of graph IDs to graph sources
[22:16:27] TimStarling, but there is also a second type of usage - client-side rendering, where the client side asks for that same data from some api
[22:16:27] gwicke: yeah, that is the one detail
[22:16:32] so the service that generates the url needs to be able to access the data based on an ID of sorts in the url. That's the tricky part especially for Graph.
[22:16:38] Not sure how Score does that right now
[22:16:45] I guess Score creates it during the parse, not on 404
[22:16:48] bd808: storing map data alongside wikitext (instead of inside wikitext) is exactly the kind of thing i had in mind, yes
[22:17:05] the one difference, as I understand it, is that images need to expire periodically and be regenerated
[22:17:05] math is fairly straightforward, as formulas are simply defined inside the tag
[22:17:13] kaldari's quality assessments would be another good example for a use case for multi-content revisions
[22:18:05] TimStarling, the image for *older* revisions should stay intact, and only the HEAD version should be regenerated, ideally only when one of its dependencies has changed
[22:18:10] I don't think multi-content revisions are relevant
[22:18:20] Krinkle: if the data is in a separate content slot (stream), it would be simple to address it by slot id.
[22:18:28] * DanielK_WMDE__ was away and is reading the backlog
[22:19:28] but it is not proposed to separate graphs into a separate content slot
[22:19:47] from my pov, having only the graph spec in the extension tag & data as a reference would simplify this a lot
[22:19:48] but why not?
[22:20:01] if it was, that would be another RFC
[22:20:01] TimStarling: i propose it should do that
[22:20:15] Yep. Score right now produces the media file synchronously in the parser, puts it in FileBackend and that's it. No expire strategy it seems. No idea how that's scaling in prod right now when pages are edited and how stale audio files are garbage collected.
[22:20:33] math is doing the same
[22:20:35] it would be blocked on the implementation of the multi-content rfc...
[22:20:37] which is fine
[22:21:09] gwicke: math has the primary data embedded in the wikitext, right?
[22:21:10] for math, we could even make the generation async
[22:21:12] the question is how to fix the train wreck that we have right now and deliver the exact same user-visible feature
[22:21:42] DanielK_WMDE__: yeah, there is no transclusion shenanigans
[22:21:44] for graphs, both the primary and the generated data could be in revision slots
[22:22:03] that would also allow for a dedicated editor for graph data
[22:22:23] yeah ok, so file an RFC for that and we'll discuss it some other week
[22:22:24] (template substitution could be applied too, if we want)
[22:22:26] If we extract the json and put it in an auto-increment sql table (just for the sake of argument), then we'd have an ID to put in the url and for API consumers and renderers to cache and access the data. But then we'd store them indefinitely in the sql table, all variations. That doesn't seem scalable. For link tables we solve this by indexing on page id, and cleaning stale data on-pagedelete and when new edits are saved.
[22:22:54] Krinkle: that's basically why we have so far resisted storing this in restbase
[22:23:08] having to store random snapshots forever is not a very sane solution
[22:23:27] But we can't not store it synchronously, because we need to give the parser back an ID that uniquely identifies this one from other ones on the same page.
[22:23:46] page deletion seems like a pretty small corner case
[22:23:47] so we're more or less forced to store it somewhere in a persistent way during the parse.
[22:24:03] we could just store graph sources against ID forever
[22:24:07] TimStarling: If not page delete, then just in general new revisions to the same page that remove or change the extension tags.
[22:24:49] Krinkle: you could store a stub that contains the raw data, and replace it with the generated data later, asynchronously or on 404
[22:24:51] a page can have any number of scores or graphs on it
[22:24:59] DanielK_WMDE__: Yeah, exactly
[22:25:01] * aude waves
[22:25:03] so the rendering doesn't need to happen during parse
[22:25:08] we only persist the raw text source
[22:25:15] yeah, sure, I was referring to the text only
[22:25:17] * yurik waves back to aude
[22:25:27] gwicke: what do you mean by random?
[22:25:56] aude: data can contain arbitrary time-dependent information like timestamps
[22:25:59] If we do it in an sql table with a pageid column, we essentially get what we have now: page_props
[22:26:06] if you store things by hash, you need to keep all that forever
[22:26:07] restbase stores the old revision HTML, so just editing the page does not ensure that there is no HTML live that references the images
[22:26:18] except that we'd have an additional primary key column
[22:26:19] gwicke: ok
[22:26:34] i like the idea of storing data separately from the graph specs
[22:26:53] TimStarling: we could retain just enough data to be able to go back and extract the data again from the appropriate revision
[22:26:59] but want the data to be revisioned
[22:27:15] having only the graph specs inline could even allow us to just pass the entire spec in the URL
[22:27:25] aude: multi-content revisions \o/
[22:27:34] the spec can be more than 64KB, you can't put it in the URL
[22:27:36] which would enable async generation & caching
[22:27:49] i think this is the way we want to go with supporting more complex geodata types in wikidata :)
[22:27:51] the spec itself is only a couple of options
[22:27:56] the data is the big part
[22:27:57] aude, gwicke, there are different types of data. There are large data blobs, and there are tiny tables that specify how large data should be filtered or mapped or labeled. I am totally for storing large data outside of graphs too
[22:28:13] Right, multi-content revisions would be a cleaner way to solve this I guess. The reason we don't want to persist old revisions' secondary data is because of the growth. With multi-content revisions they wouldn't be duplicated in the first place.
[22:28:35] if this was basically a URL referencing the data, regeneration would be easy
[22:29:11] If it's tied to a revision-id, we can get away with sequential ids for any entity within, though. E.g. revision-123-graph-2
[22:29:17] Krinkle: as long as that stuff can be regenerated for old revisions...
[22:29:19] can be reliably regenerated
[22:29:26] there are some tricks we can play to fit quite a lot of information in a compact URL: https://phabricator.wikimedia.org/T118028
[22:29:27] e.g. to view a diff or such
[22:29:42] please keep in mind that most of the revisions will produce identical graphs - so we shouldn't keep duplicating it
[22:30:22] Right
[22:30:35] yurik: maybe the revisions can use a hash to refer to some data blob
[22:30:44] also, it might be good to have a "temp store" - where we store things while the user saves/previews something, which by the way could still be identical to the previous revisions
[22:30:50] multiple revisions can refer to the same blob
[22:31:45] aude, yes, it seems at the end of the day we will end up with some complex redirection system - a hash per revision pointing to the hash of the image that points to the hash of the data
[22:32:19] that's what we have implemented in restbase, but I think we can do better than that
[22:32:20] and this will allow us to go back and remove some unreferenced entries
[22:33:19] TimStarling: the spec could just be a revision id + graph id. from that, the raw data can be extracted, and then rendered
[22:33:23] gwicke, but should we rely on cassandra for this, or should we keep all the data together in a much more well-known sql?
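For concreteness, a rough sketch of the hash-based addressing being discussed above: hash the expanded graph spec and use that hash as the stable key under which the rendered image is stored and regenerated on demand. The choice of SHA-1 and canonical JSON here is an assumption for illustration only, not necessarily what Graphoid/RESTBase actually uses.

    import hashlib
    import json

    def spec_hash(expanded_spec):
        """Stable content hash of an expanded graph spec."""
        # Canonical JSON (sorted keys, no extra whitespace) so the same logical
        # spec always produces the same hash.
        canonical = json.dumps(expanded_spec, sort_keys=True, separators=(",", ":"))
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    # A renderer or 404 handler can then address images by this hash, e.g.
    # /graph/png/<page>/<spec_hash>.png (hypothetical path), and regenerate
    # them on demand from the stored spec.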
[22:33:32] For media files, old revisions aren't a problem since they can be regenerated when those revisions are viewed (they'd retain the same url in this new system, unlike the current score and math extensions)
[22:33:48] But the text itself cannot be magically regenerated when querying the database table or other store
[22:33:52] yurik: lets discuss *what* to store first
[22:34:01] at least if it is in SQL then it is not such an enormous task to set up the extension outside WMF
[22:34:28] i agree - sql has more flexibility in that regard
[22:34:30] Krinkle: it can be re-extracted from the old page revision, no?
[22:34:41] it's getting easier though, we have docker images now
[22:34:56] https://phabricator.wikimedia.org/T92826#1804775
[22:35:07] DanielK_WMDE__: Yes, in theory. But it's harder to make a model for that. If the database contains identifiers like 'rev-2-graph-3' then you get duplication. If the identifiers are a hash, you can't fall back to regenerating.
[22:35:12] Maybe we need a link table?
[22:35:32] Then we can garbage collect unused ones.
[22:35:50] yurik: how large would the graph options without inline data be?
[22:36:00] Krinkle: duplication how?
[22:36:00] assuming they'd reference a pre-defined graph type
[22:36:01] Krinkle, yep, :31:44
[22:36:12] DanielK_WMDE__: rev-5-graph2 is probably not different from rev-4-graph2
[22:36:33] yurik: what is 31:44
[22:36:37] Krinkle: ah, right
[22:36:39] gwicke, it really depends - most examples here don't have inline data - http://vega.github.io/vega-editor/?spec=dimpvis
[22:36:44] did you see the PHP version survey results? almost half of MW's users don't even have root access
[22:36:46] Krinkle, my message at that time )
[22:36:56] https://docs.google.com/forms/d/1Z-io754bUxVujh100D4xvIwkiBIFk9Ef0j4TYrJ2zMc/viewanalytics
[22:36:59] right. min:sec ok
[22:37:25] TimStarling: most users can click buttons, though
[22:37:29] Krinkle: put that plus a hash into the table. the raw data blob and the rendered blob can be accessed by hash
[22:37:42] we just store revid+graphid -> hash
[22:37:49] DanielK_WMDE__: Right. As long as GC will leave alone things recently added.
[22:37:51] Krinkle, the problem with multi-part data is that it becomes very hard to regenerate older images. I would much prefer to store images forever
[22:38:08] Otherwise if GC runs after you click preview, the graph renderer won't find it
[22:38:23] Krinkle: yes, we'd want a timestamp. though LRU would be better
[22:38:33] Yeah
[22:38:46] So basically cassandra linked data, in SQL. Great.
[22:38:48] Krinkle: it would still be able to re-generate it, but that's expensive of course. so yea, smart gc :)
[22:38:57] DanielK_WMDE__: No, it won't, not for preview.
[22:39:00] TimStarling: one common thread through many discussions with third party users is that they'd love to leverage VMs, but don't want to deal with administering their own server manually
[22:39:13] the blob in the data table for preview won't be in the link table yet
[22:39:16] Krinkle: or no gc - disc space is probably cheaper than developer time
[22:39:29] but we shouldn't store every blob added for previews forever.
[22:39:36] I imagine that's currently happening for score and math as well
[22:39:45] Krinkle: oh, preview. you are one step ahead of me again :)
[22:39:51] yeah, it's happening for math - but that's a lot smaller
[22:40:48] the mean math formula encoded as a compressed url is 40 bytes
[22:41:09] the largest 222
[22:41:33] OK. So a data table (not unlike page_props), but just with hash as primary key and blob (no pageid). Just a plain data store. And a separate link table that allows us to regenerate blobs if they fall out. If we account for them falling out, maybe we can put them in a proper object store instead, like BagOStuff or something.
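A minimal sketch of the two-table shape Krinkle outlines above: a content-addressed data table keyed by hash, plus a link table from revision + graph index to hash so blobs can be regenerated if they get evicted. SQLite, SHA-1, and all table/column names here are illustrative assumptions, not a proposed production schema.

    import hashlib
    import json
    import sqlite3
    import time

    db = sqlite3.connect("graph_store.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS graph_data (
        gd_hash    TEXT PRIMARY KEY,  -- hash of the expanded spec
        gd_blob    BLOB NOT NULL,     -- the spec itself (could be gzipped)
        gd_touched INTEGER NOT NULL   -- last time this blob was written
    );
    CREATE TABLE IF NOT EXISTS graph_links (
        gl_rev_id   INTEGER NOT NULL, -- revision the graph appears in
        gl_graph_id INTEGER NOT NULL, -- index of the graph tag on that revision
        gl_hash     TEXT NOT NULL,    -- points into graph_data
        PRIMARY KEY (gl_rev_id, gl_graph_id)
    );
    """)

    def store_spec(spec, graph_id, rev_id=None):
        """Store an expanded spec; link it to a revision once the edit is saved.

        Previews pass rev_id=None: the blob is written but not linked, so a later
        garbage-collection pass may drop it if no saved revision ends up using it.
        """
        blob = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")
        h = hashlib.sha1(blob).hexdigest()
        # Upsert: preview and save usually write the identical blob.
        db.execute(
            "INSERT OR REPLACE INTO graph_data (gd_hash, gd_blob, gd_touched) "
            "VALUES (?, ?, ?)",
            (h, blob, int(time.time())),
        )
        if rev_id is not None:
            db.execute(
                "INSERT OR REPLACE INTO graph_links (gl_rev_id, gl_graph_id, gl_hash) "
                "VALUES (?, ?, ?)",
                (rev_id, graph_id, h),
            )
        db.commit()
        return h

    def gc_unlinked(max_age=7 * 24 * 3600):
        """Drop blobs no saved revision links to, keeping recent ones so that
        preview blobs are not collected before they have been rendered."""
        cutoff = int(time.time()) - max_age
        db.execute(
            "DELETE FROM graph_data WHERE gd_touched < ? "
            "AND gd_hash NOT IN (SELECT gl_hash FROM graph_links)",
            (cutoff,),
        )
        db.commit()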
[22:42:01] backed by redis or memc
[22:42:02] gwicke, for math, i agree that it is fine to url-encode it. But we are talking about multi-part data with a complex and quickly evolving rendering engine, which means if we want to support older images, we should just store those, and keep a link to them
[22:42:10] what is in the blob?
[22:42:11] Krinkle: yay, link tables for old revisions!
[22:42:21] good luck ;)
[22:42:33] Krinkle: maybe mark blobs that are added by "save" differently than ones added by "preview" and collect the latter more aggressively?
[22:42:36] in the blob would be the text content of <graph> or <score>
[22:42:58] SMalyshev: Will need upsert handling, since they'll often be the same.
[22:43:06] most of the time, both "preview" and "save" blobs are identical
[22:43:06] save would need to 'win'
[22:43:09] Krinkle: yes, of course
[22:43:10] so the hash key is the hash of that text content?
[22:43:21] TimStarling: Yeah
[22:43:32] and they are identical for many past revisions as well
[22:43:42] that's what we have in RB
[22:43:49] it's the hash of the expanded spec
[22:43:59] data store of hash -> blob, and a link table tracking where that hash can be extracted (e.g. hash -> rev-123-graph-2?). Just thinking out loud.
[22:44:02] and images are generated on-demand from the stored spec, addressed by hash
[22:44:06] gwicke, the problem with the RB data is that it's not easily accessible by other systems
[22:44:27] um, it has an API?
[22:44:44] to me, the issue is more that we can never delete any of this data
[22:45:24] gwicke, if we cannot delete old data, it means we don't know what's in use. which means we may need another layer of indirection ((
[22:45:59] I don't think that maintaining link tables for all revisions is a good use of our time
[22:46:00] so the link table is a reliable store of hash to something from which the blob can be extracted
[22:46:06] we can treat this similar to parser cache in a way. The HTML is also a secondary data item derived from the wikitext.
[22:46:14] TimStarling: yeah
[22:46:18] [14:44] gwicke to me, the issue is more that we can never delete any of this data
[22:46:19] whereas the data store table is a cache
[22:46:29] BTW keep in mind that we need to at least be able to make data inaccessible
[22:46:32] Because of oversight etc
[22:46:39] yeah, I think if we go this route, we'd be better off making the data store table not a table but actually use cache interfaces from BagOStuff for it.
[22:46:49] RoanKattouw: yeah, that's another whole can of worms
[22:48:10] I would like to see data on the size of graph specs if all they do is a) specify a data source, b) specify a pre-defined graph display, and c) provide some options for b)
[22:48:44] Do we want RESTBase to be able to re-generate this data (text content of <graph>, etc.) without a roundtrip to MW if it has that revision in cassandra already?
[22:48:49] I would not be surprised if this information fit well into a URL
[22:49:10] Then we'd need RESTBase to be able to access the link table. Otherwise it would have to go through the MW API.
[22:49:15] what is being stored in page_props at the moment?
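To make gwicke's question at 22:48:10 concrete: a purely hypothetical "small" spec that only names a data source, a pre-defined display, and a few options might look like the following. The field names and the source-reference syntax are invented for illustration; this is not an actual Vega, Graph extension, or Graphoid format.

    # Made-up minimal spec: (a) data source by reference, (b) pre-defined
    # display type, (c) a few presentation options. Illustrative only.
    minimal_spec = {
        "source": "tabular:Berlin_population.tab",  # (a) reference, not inline data
        "display": "line",                          # (b) pre-defined graph type
        "options": {                                # (c) small presentation tweaks
            "title": "Population of Berlin",
            "x": "year",
            "y": "population",
        },
    }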
[22:49:20] link tables only cover current versions
[22:49:27] page_id, 'graph', 'contents of <graph>'
[22:49:33] so they are useless for suppression & use tracking
[22:49:43] I thought it was the contents of the tag, and that it was that content that occasionally needed to be over 64KB
[22:50:09] if that is right, we can easily do statistics
[22:50:11] TimStarling: the expanded form is that large, meaning with the actual data injected via transclusion
[22:50:27] rather than referenced
[22:50:47] TimStarling: The problem with page_props is that there is no unique ID for each entry of <graph> on a given page, and it doesn't have a strategy for viewing old revisions.
[22:50:59] here are some stats & a link to a zip with the full dataset:
[22:51:00] https://phabricator.wikimedia.org/T118028#1802155
[22:51:05] there are many problems with page_props, I was just going to suggest doing statistics on its size
[22:51:49] gwicke, TimStarling, think of usages - most of the time, the data is actually tiny - e.g. population data for 10 years, whereas the json to draw it is much bigger. But, if we start using large data stores, the data could grow much bigger and will be filtered by the graph
[22:51:55] I'd be more interested in learning about how large these specs actually *need* to be, though
[22:52:26] I guess we'll have to continue this on phabricator
[22:53:20] and of course we don't have time for the second RFC
[22:53:43] TimStarling, do we have time to discuss if we should allow direct JSON embedding in api responses?
[22:53:53] seems to be a much smaller issue
[22:54:06] it was triaged in the committee meeting last hour
[22:54:06] https://phabricator.wikimedia.org/T120380
[22:54:20] it probably doesn't need to be an RFC
[22:54:25] Krinkle: another issue with page props is if graphs ever have multilingual content
[22:54:31] maybe just submit a gerrit change and see what happens?
[22:54:58] that page props is unaware of languages and does not currently support multilingual anything
[22:55:01] TimStarling, i would like to hear your, gwicke's and DanielK_WMDE__'s opinions on it - as it seems anomie is not very happy
[22:55:05] (e.g. display_title)
[22:55:17] I see
[22:55:48] and Krinkle as well btw, as he uses api from js
[22:55:57] well, let's wrap up on the graph extension first, are there any action items for the notes?
[22:56:03] * anomie got pinged
[22:56:21] it seems like we have a number of competing designs that should be written out on the task
[22:56:33] TimStarling, i could try to implement a generic SQL structure
[22:56:36] I'll put those as action items for gwicke and Krinkle?
[22:56:43] next week: agenda bashing WikiDev '16: https://phabricator.wikimedia.org/E121
[22:57:23] as an alternative to restbase - to see if there are any benefits to it. obviously i would much rather have one system than two though
[22:57:39] #action Krinkle to summarise his proposed design in a comment on T119043
[22:57:56] I'd nominate yurik to investigate how small specs could be
[22:58:03] yurik: could you take that on?
[22:58:19] #action yurik to investigate how small specs could be and whether they could fit in URLs
[22:59:07] gwicke, sure
[22:59:22] yurik: cool, thanks!
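A rough sketch of the compact-URL idea referenced in T118028 and in the action item above: serialize the spec, deflate it, and base64url-encode the result. The exact encoding used in T118028 may differ; this only illustrates why small, data-free specs could plausibly ride along in a URL.

    import base64
    import json
    import zlib

    def spec_to_url_token(spec):
        """Deflate + base64url-encode a spec so it can be carried in a URL."""
        raw = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")
        packed = zlib.compress(raw, 9)
        return base64.urlsafe_b64encode(packed).rstrip(b"=").decode("ascii")

    def url_token_to_spec(token):
        """Inverse of spec_to_url_token."""
        padded = token + "=" * (-len(token) % 4)
        return json.loads(zlib.decompress(base64.urlsafe_b64decode(padded)))

    # e.g. spec_to_url_token(minimal_spec) for a spec like the one sketched
    # earlier yields a short opaque token; how short real-world specs can get
    # is exactly what the action item above is meant to find out.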
[22:59:26] but i already sent most of the specs to gwicke before - and most of it is not data
[22:59:37] but sure, i will look at it further
[22:59:50] my guess - specs could be literally huge, possibly auto-generated
[22:59:51] yurik: assuming they'd reference a graph definition
[22:59:51] #info no firm resolution, this meeting was mostly design exploration from a number of competing angles
[23:00:09] Okay
[23:00:36] #endmeeting
[23:00:38] Meeting ended Wed Dec 9 23:00:36 2015 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[23:00:38] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-12-09-22.00.html
[23:00:38] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-12-09-22.00.txt
[23:00:38] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-12-09-22.00.wiki
[23:00:38] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-12-09-22.00.log.html
[23:01:34] Links copied to https://phabricator.wikimedia.org/E107
[23:01:47] I suggest JSON inclusion be discussed with anomie in #mediawiki-core
[23:04:10] YuviPanda: I think one of the first things I want to do with the Labs instance we discussed is automating some of the minute posting and such from this meeting
[23:07:37] James_F: https://www.mediawiki.org/wiki/Reading/Quarterly_Planning/Q3
[23:12:21] yurik: sorry, got distracted. i'm a bit torn on the json/api stuff. in principle, i'm with anomie, but for practical reasons, i kind of like your proposal.
[23:12:28] let's talk at the summit :)
[23:13:22] DanielK_WMDE__, thx, i'm always torn between philosophy and practice, and the practicality of things usually produces much better results in the longer run ;)
[23:14:00] but in theory, they don't...
[23:18:07] yurik: well, in theory, theory and practice are the same. but in practice, they aren't...
[23:20:50] Tech Talk: Secure Coding For MediaWiki Developers is starting in 10 minutes
[23:22:14] The stream will be at http://www.youtube.com/watch?v=iKdufZQTTao
[23:29:12] Tech Talk: Secure Coding For MediaWiki Developers is about to start.
[23:29:57] We might wait a couple of minutes for late arrivals, but Darian and Chris are ready.
[23:31:27] OK, we are going to start.
[23:31:41] Stream: https://www.youtube.com/watch?v=iKdufZQTTao
[23:31:57] We are live!
[23:32:25] YouTube video is working for me
[23:33:14] Just checking, who in this channel is watching the Tech Talk right now?
[23:33:26] * zuzak
[23:33:29] o/
[23:41:35] hi
[23:54:26] if anyone watching has any questions ping qgil and he will ask at the next break :)
[23:54:31] hi legoktm
[23:57:32] honestly this hidden iframe js submit stuff makes me feel that browsers are just broken.