[02:21:28] Science Tuesday tonight, anyone?
[22:01:24] #startmeeting RFC meeting
[22:01:24] Meeting started Wed Nov 5 22:01:24 2014 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[22:01:24] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[22:01:24] The meeting name has been set to 'rfc_meeting'
[22:02:19] #topic Content API / Storage service | RFC meeting | PLEASE SCHEDULE YOUR OFFICE HOURS: https://meta.wikimedia.org/wiki/IRC_office_hours | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[22:02:57] hello & welcome
[22:03:12] hi
[22:03:24] hi
[22:03:57] do you want to start with telling us what progress you have made to date?
[22:04:19] sure
[22:04:35] we have basically implemented what's described in the README
[22:05:03] currently I'm tweaking the puppet module and finishing the last bits around config & logging
[22:05:31] next step is deploying this, using puppet, to a set of test servers
[22:05:35] you may want to give some context
[22:05:43] and testing it
[22:06:05] the README is at https://github.com/gwicke/restbase
[22:07:07] there are also design docs linked from the readme
[22:07:07] where is this in gerrit btw?
[22:07:27] mark: https://gerrit.wikimedia.org/r/#/c/171162/
[22:07:29] thanks
[22:07:44] so basically you have a REST frontend for cassandra which introduces some configurable structure for the data stored?
[22:08:23] it's a storage service built around the concept of tables, and 'buckets' built on top of those tables
[22:08:57] table schemas are specified using JSON
[22:09:14] creation is by PUTing a schema with the appropriate privileges
[22:09:29] and a page is a bucket? or a revision, or both?
[22:09:45] there is simple secondary index support in the schema
[22:10:13] a bucket is an abstraction that bundles behavior and storage at a higher level
[22:10:56] gwicke: can you give an example?
[22:11:09] examples are a revisioned blob bucket, which stores revisions of a blob object that has a few attributes like content-type etc, and can be dereferenced directly
[22:11:19] well, there are two files in the "bucket" directory in restbase at present
[22:11:28] kv.js and pagecontent.js
[22:11:54] so I guess a page is a bucket, and alternatively a key/value pair is a bucket
[22:12:05] another example is currently a pagecontent bucket, which tracks MediaWiki revisions in a table & by default creates revisioned blob buckets for html, wikitext, data-mw and data-parsoid
[22:12:36] it's possible to add more buckets to this pagecontent structure if other properties need to be stored
[22:12:46] yeah, and there's some specific handling for parsoid data in pagecontent.js, right?
[22:12:53] like, let's say, some custom HTML or metadata for mobile
[22:13:34] TimStarling: yeah, currently that lives directly in there; in the longer term I'd like to move this to a config layer
[22:13:47] like getPageCSS?
[22:14:08] likely with a yaml syntax similar to https://github.com/gwicke/restbase/blob/master/doc/Architecture.md#declarative-proxy-handler-definition
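To make the table discussion above concrete: creating a table might look roughly like the request below. The schema language shown is a guess pieced together from this conversation (JSON attributes, a hash/range index, simple secondary indexes), not the exact RESTBase format.

```
PUT /v1/en.wikipedia.org/html
Content-Type: application/json

{
    "table": "html",
    "attributes": {
        "key": "string",
        "tid": "timeuuid",
        "rev": "int",
        "value": "blob"
    },
    "index": {
        "hash": "key",
        "range": "tid"
    },
    "secondaryIndexes": {
        "by_rev": { "hash": "key", "range": "rev" }
    }
}
```

The hash key would put all revisions of one page into a single partition, and the range key would keep them sorted within it; that layout is what the compression discussion later in the meeting relies on.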
[22:14:16] *a* page is a bucket, or is there a bucket holding all pages? from the readme, it seems like the latter is the case.
[22:15:01] for example, the revisioned blob bucket can hold all revisions of a type of content for all pages
[22:15:19] gwicke: i'm a bit confused - the example in the readme says you create a "table" for enwiki, and a "bucket" for pages. to me that sounds like "database" and "table". why is the top level structure called a "table"? do all buckets in a table have the same structure/schema?
[22:15:27] internally, it's keyed off <domain>/<bucket>/<key>
[22:16:33] the terms table & bucket have evolved a bit over time, and the generic table layer was added later in the design process
[22:16:39] so in the short term, it will be basically a cache for data generated from wikitext by parsoid?
[22:16:47] and a REST interface to that data?
[22:16:48] "declarative proxy handler definition"... that's like a complex rewrite rule?
[22:16:50] as it stands, all buckets are implemented in terms of tables
[22:17:12] as for tables, think DynamoDB tables
[22:17:39] Is a javascript callback api implemented via a yaml file?
[22:17:50] TimStarling: first use case is HTML & metadata storage
[22:18:18] with an eye to storing wikitext as well
[22:18:34] but there is no facility for compression by differential storage?
[22:18:50] bd808: it's independent of the implementation language
[22:18:53] you know we are wasting something like 95% of our disk space at present by storing revisions independently
[22:19:10] yeah
[22:19:32] for a wikitext dump import, I got a compression ratio of 18% of the input text size with Cassandra
[22:19:48] what compresses it?
[22:19:50] it could be much better (single-digit %) with lzma -1 support
[22:20:05] mark: cassandra has tunable compression per table
[22:20:14] so cassandra, ok
[22:20:40] and in the schema, revisions of each property are laid out sequentially on disk, which of course helps
[22:21:12] ah right
[22:21:27] how would it stay sequential after random additions?
[22:21:27] this will be fronted by varnish for external traffic?
[22:21:58] there are some old notes about this at https://www.mediawiki.org/wiki/User:GWicke/Notes/Storage#History_compression
[22:22:29] TimStarling: the schema is defined in a way that keeps revisions sorted on disk
[22:22:39] mark: yes
[22:22:51] so objects are moved?
[22:23:01] a goal in the REST layout is cache / purgeability
[22:23:18] TimStarling: no, cassandra uses LSMs
[22:23:33] https://en.wikipedia.org/wiki/Log-structured_merge-tree
[22:24:04] cassandra is the first backend, but it's also not intended to be the only one
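To make the keying ("<domain>/<bucket>/<key>") and the cache/purgeability goal concrete, the read path for stored page content might look like the sketch below. The URL layout is illustrative, assembled from the bucket names in this discussion rather than the actual routes.

```
GET /v1/en.wikipedia.org/pages/Main_Page/html           # latest stored HTML render
GET /v1/en.wikipedia.org/pages/Main_Page/html/<tid>     # a specific render, by timeuuid
GET /v1/en.wikipedia.org/pages/Main_Page/data-parsoid/<tid>
PUT /v1/en.wikipedia.org/pages/Main_Page/html/<tid>     # internal writes only
```

One URL per retrievable object is what makes fronting this with varnish straightforward: an edit only requires purging the small, predictable set of URLs for that title.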
[22:24:06] regarding mark's question, it's not clear to me if this API is supposed to be exposed to external users as well or if it's purely for internal use
[22:24:17] my understanding is both...
[22:24:22] * gwicke nods
[22:24:41] with private information appropriately protected
[22:24:42] what concerns me about it is that it seems to mix a lot of different concerns
[22:24:57] you know it's not just a REST interface to the cassandra store
[22:25:05] it also has a proxy to MediaWiki's api.php?action=query
[22:25:19] and it also has presentation-layer logic specific to parsoid
[22:25:30] i.e. post-cache link colouring
[22:25:31] presentation-layer?
[22:25:43] // Self links
[22:25:43] css += 'a[href="./' + sanitize(decodeURIComponent(rp.key)) + '"] {'
[22:25:43] eh, no
[22:25:43] + ' font-weight: bold; color: inherit;'
[22:25:43] + ' text-decoration: inherit; pointer-events: none;'
[22:25:43] + ' cursor: text; }\n';
[22:25:52] that's an experiment
[22:26:33] an experiment that worries me, can you elaborate on it? :)
[22:26:33] so please disregard that bit ;)
[22:27:32] mark: it relates to an earlier discussion about how we can implement all the content-affecting user prefs client-side, using data exposed through an API
[22:27:51] one option was to implement red links with server-generated CSS
[22:28:02] but we agreed to do this differently
[22:28:11] ok
[22:28:48] in any case, I'd like to keep any advanced logic out of restbase in the name of performance & reliability
[22:29:10] the yaml config I mentioned earlier came out of that line of thought
[22:29:34] how would that do this differently?
[22:29:42] basically allow teams to hook up new entry points and some simple fall-back behavior, without the ability to break things with arbitrary code
[22:30:03] so there wouldn't be "pointer-events: none" in the YAML config?
[22:30:31] no ;)
[22:30:55] the yaml config is pretty much restricted to HTTP proxying
[22:31:20] and some simple stuff like "if it's not in storage, then call this service over there to make it"
[22:31:45] with a possible extra bit of "and store it back while returning it to the client"
[22:32:33] gwicke: reminds me of this from 2007 (look at the source): http://brightbyte.de/demo/xskin/Test.xml
[22:32:41] basically a caching proxy?
[22:32:43] I want to keep the option to move to a different implementation language in the medium term
[22:32:55] TimStarling: ask Krinkle
[22:32:55] the yaml config helps with that as well
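The fall-back behavior gwicke sketches ("if it's not in storage, call this service over there to make it", "store it back while returning it to the client") could be declared roughly as below. The syntax is a loose sketch in the spirit of the Architecture.md link above; the actual format and the parsoid service URL are assumptions.

```yaml
/{domain}/pages/{title}/html:
  get:
    - send_request:
        url: /{domain}/html/{title}              # try storage first
      on_response:
        - if:
            status: 404                          # miss: generate the content
          then:
            - send_request:
                url: http://parsoid.example/{domain}/{title}
              on_response:
                - put: /{domain}/html/{title}    # store it back ...
                - return: response               # ... while returning it
        - else:
            - return: response
```

Because a declaration like this can only proxy, branch on status codes, and store, teams get new entry points without the ability to inject arbitrary code, which is the performance and reliability argument made above.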
[22:33:04] except that it also allows PUT requests...
[22:33:35] the idea of integrating a public REST API with a storage backend for wikitext seems very wrong to me
[22:34:32] TimStarling: how so?
[22:34:55] the API is basically all about providing raw data
[22:35:10] wikitext is one bit of data provided
[22:35:11] gwicke: it feels a bit like exposing mysql to the public, relying on database layer permissions to keep people honest...
[22:35:17] because authenticating insertion of wikitext is a large and complex thing
[22:35:29] it'd be like integrating EditPage.php with Revision.php
[22:35:51] TimStarling: right, that should live in its own service(s), some of which might just be MediaWiki
[22:36:04] there are no immediate plans to do that in any case
[22:36:12] if RESTBase manages the storage then some other thing is going to want to insert into that storage
[22:36:40] the SOA auth RFC talks about how we can require signed assertions from the sanitization & anti-spam services
[22:36:45] and the API for that should presumably be minimal and use internal, trivial authentication
[22:37:04] so I'm confused - if the storage allows put requests, how will that get keyed back out? or is the idea that put is just for changing things, not for retrieving different page versions?
[22:37:15] no, I don't think that is right
[22:37:19] TimStarling: but that's getting ahead of ourselves a lot
[22:37:36] for now, no public PUT is allowed
[22:37:36] even enumerating the services which need to sign the data is not a job for a storage backend
[22:37:53] (technically, will be before a public deploy)
[22:38:01] I would like to get a little bit ahead
[22:38:34] I don't want it to be off my radar for 6 months and then have it presented to me as a done thing, with 24 man-months of work behind it
[22:38:36] TimStarling: none of that needs to be decided right now
[22:39:31] presumably the idea of RFC discussions is to get at least some distance ahead of current work
[22:39:31] there are some security goals in the auth RFC
[22:39:48] it seems to me like it would be useful to have a clear distinction between the public-facing and the backend/storage API. they should be separate services, even if they conform to the same interface.
[22:39:58] preferably to form goals for all foreseeable development work
[22:40:27] DanielK_WMDE_: which advantage would this have?
[22:40:28] yes, mixing them up also feels wrong to me
[22:41:08] there will always be the desire to add a lot of complexity to the public-facing one to support features
[22:41:09] mind you, I started with a separate content API
[22:41:17] gwicke: security, mostly. but also conceptual clarity. and more freedom to change things under the hood later.
[22:41:26] which will be at odds with the desire to keep a highly reliable, simple, secure storage service
[22:41:35] but then realized that it would just add a network hop, and internal users would still want to use a consistent API
[22:41:50] brion: any opinion on splitting the public API from the storage backend?
[22:41:57] it's easy to run the storage layer on a separate box
[22:42:05] in the code it actually talks HTTP
[22:42:06] i’d tend to want them split as well
[22:42:16] * AaronSchulz leans that way too
[22:42:25] RoanKattouw?
[22:42:31] a network hop might be the least of our problems
[22:42:36] gwicke: in trivial cases, that hop could be implemented as a rewrite rule.
[22:42:41] though it’s not insane to do a hybrid model where a public-facing api gives you metadata
[22:43:00] and we use the same storage api for fetching the actual text data
[22:43:06] I don't think that we all have enough information to vote right here
[22:43:08] not sure where this stands
[22:43:21] then perhaps we should get that information before we make a decision
[22:43:38] as usual, it's a trade-off
[22:43:38] Hi
[22:43:50] the storage backend only provides low-level table storage
[22:43:58] TimStarling: Rereading backscroll and thinking
[22:44:01] it's not a very convenient API to use
[22:44:31] the same convenient API could be used on top of the storage backend
[22:44:33] gwicke: which should probably not be public. think revision deletion, etc. you don't want the low-level storage layer to deal with that. or at least, i don't.
[22:44:39] but it shouldn't gain any features needed by public use
[22:44:46] it's very possible to expose it though if people insist on using it
[22:45:11] i think i’d rather see something like this as a backend, with a frontend forwarding/filtering/doing access control
[22:45:18] DanielK_WMDE_: revision deletion is all higher-level
[22:45:25] at the bucket layer, using multiple tables
[22:45:40] as long as the system design doesn't rely on exposing it, and we don't expose it any time soon, fine...
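The "signed assertions" idea from the SOA auth RFC could, in its simplest possible form, be an HMAC over the request body that the storage service checks before accepting a write. The sketch below only illustrates that idea; the header names, key distribution, and service list are assumptions, not the RFC's actual design.

```javascript
var crypto = require('crypto');

// Shared secrets for services allowed to write, e.g. the sanitization
// and anti-spam services mentioned above (hypothetical names).
var serviceKeys = {
    sanitizer: 'replace-with-real-secret',
    antispam: 'replace-with-real-secret'
};

// Verify that a PUT carries a valid assertion from a known service.
// `req` is assumed to expose the raw body and lower-cased headers.
function verifyAssertion(req) {
    var service = req.headers['x-asserting-service'];
    var signature = req.headers['x-assertion-signature'];
    var key = serviceKeys[service];
    if (!key || !signature) {
        return false;
    }
    var expected = crypto.createHmac('sha256', key)
        .update(req.body)
        .digest('hex');
    // A constant-time comparison would be preferable in real code.
    return signature === expected;
}
```

The point matches the argument made above: the storage service only verifies that a trusted service vouched for the content; deciding what counts as sanitized or non-spammy stays in those services.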
[22:46:01] I think what you are hinting at is that you'd like to see the proxy / fall-back handler & rewriting bit in a separate layer
[22:46:07] TimStarling: I'm thinking, if what we expose to the public is a wrapper service that proxies GETs straight to the storage service and intercepts PUTs to do complex processing and auth before proxying to storage, that would be OK with me
[22:46:18] gwicke: you want to re-implement app-level logic under the hood of the storage api?...
[22:46:25] DanielK_WMDE_: no
[22:46:32] That would allow you to have a service that pretends to be simple GET/PUT to the outside world, but internally it would still be segregated
[22:46:51] RoanKattouw: what you describe is the current implementation
[22:46:55] But yes, internally those things should not all be in the same service, and the storage service itself should not have auth concerns
[22:47:25] gwicke: Well -- what I'm proposing is separating that outwards-facing proxy service from the actual storage service
[22:47:33] re auth: I'd like to make sure that all storage access is properly authenticated
[22:47:36] I don't have a good grasp of to what extent RESTBase is one or the other
[22:47:50] otherwise we'd have to audit a lot more code that could access it
[22:47:57] RoanKattouw: thanks, i completely agree
[22:48:20] gwicke: authed for the specific wiki user?
[22:48:25] so we have Daniel, Brion, Mark, Aaron and myself in favour of splitting
[22:48:33] and gwicke in favour of integrating
[22:48:50] TimStarling: what exactly do you want to split?
[22:48:51] oh, and RoanKattouw in favour of splitting
[22:49:07] the public API from the storage backend
[22:49:07] the proxy layer?
[22:49:41] do we have any block diagrams/architectural diagrams anywhere?
[22:49:46] that's not very hard to do
[22:49:52] not sure though what the advantage would be
[22:50:07] i was just thinking how to draw such a diagram with ascii art in irc :)
[22:50:09] the proxy layer is simple enough that at least part of its work can be done directly in varnish
[22:50:50] if you split the public API from the storage backend and then redesign the storage backend accordingly, you end up with a simpler backend API
[22:50:56] I think that an informed discussion would need a spec of what you'd like to split, along with the rationale
[22:51:02] I think that is appropriate
[22:51:13] regarding only the backend bit, can we begin with multiple backends to benchmark? and performance aside, maybe that would encourage clearer separation of layers...
[22:51:43] cassandra, couchbase, regular sql store
[22:52:02] yes, good point springle
[22:52:13] i think we're overloading words like "backend" a lot here
[22:52:18] gwicke: how strongly does restbase rely on cassandra at present?
[22:52:31] and with some diagrams we can at least pinpoint them clearly
[22:52:33] is it easy enough to add other storage engines for benchmarking?
[22:52:34] springle: the table storage layer is well abstracted; it's actually a separate module
[22:52:54] I was thinking about adding a restbase-sqlite module
[22:52:58] https://github.com/gwicke/restbase-cassandra
[22:53:57] but right now we don't really have the resources to implement every possible backend under the sun
[22:54:15] #info DanielK_WMDE_, brion, mark, RoanKattouw, AaronSchulz, TimStarling in favour of altering gwicke's proposal to introduce separation between public API and storage API
[22:55:35] #info I think that an informed discussion would need a spec of what you'd like to split along with the rationale
[22:55:41] there are very interesting backends out there though
[22:56:18] restbase-cassandra is 1800 lines at present
[22:56:59] so I can see that it would be difficult to multiply that by 3
[22:57:26] but maybe some kind of simplified indicative benchmark could be done?
[22:57:30] gwicke: two or three isn't all of them :) it would give us some chance to prove we are on the right track with cassandra
[22:57:42] TimStarling: are you asking us to do this, or are you saying that others could do it?
[22:57:44] and perhaps instead of sqlite, mysql would be more useful?
[22:57:50] and presumably not much more work?
[22:58:17] the point of using Cassandra is that we need to store more data than can be stored on a single node
[22:58:18] besides benchmarking, a MySQL backend could allow us a more iterative approach to deploying this
[22:58:24] mysql is the null hypothesis
[22:58:30] yes
[22:58:41] you kind of have to prove some benefit over it to justify deployment of cassandra, right?
[22:59:15] #info comparative benchmark against mysql would be nice, but gwicke not offering to implement
[22:59:17] maybe an implementation based on Revision::getRevisionText? Written as a MediaWiki extension?...
[22:59:20] TimStarling: no, but our current use case requires storage of HTML of all pages & eventually revisions
[22:59:23] (only partially joking)
[22:59:48] we know that this doesn't fit on a single box, see ExternalStore (and html is quite a big bigger than wikitext)
[22:59:53] *bit
[23:00:07] right, and you don't want to implement sharding yourself
[23:00:14] I think it would fit on a single box if it were properly compressed
[23:00:22] I don't see the point in doing that as a first step
[23:00:42] TimStarling: on a related note... is external store documented somewhere? i know more or less how it works by now, but i can't find a doc to point people to...
[23:00:44] but of course we can do so at some point in the future
[23:00:44] gwicke: thing is, externalstore has limitations that we could avoid if we wanted to for a service like this. tokudb can match cassandra compression, for example
[23:01:17] DanielK_WMDE_: not that I know of
[23:01:37] hehehe...
[23:02:03] we are pretty much out of time now, so please think about any #action/#info/#agreed commands you want to send to meetbot
[23:02:07] right, it's only the backbone of our application ;)
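gwicke notes above that the table storage layer is a separate, well-abstracted module (restbase-cassandra). The restbase-sqlite or MySQL backend discussed for benchmarking would then just be another implementation of the same handful of operations. A hypothetical sketch of that contract; method names and signatures are illustrative, not the actual module interface:

```javascript
// A minimal table-storage backend contract; restbase-cassandra would
// be one implementation, a sqlite or MySQL module another.
function SQLiteBackend(options) {
    this.path = options.path; // e.g. '/srv/restbase/storage.db'
}

// Create a table from a JSON schema like the one discussed earlier,
// translating its hash/range keys into an SQL primary key.
SQLiteBackend.prototype.createTable = function (domain, schema, callback) {
    // CREATE TABLE ... PRIMARY KEY (hashKey, rangeKey) DDL would go here
    callback(null);
};

// Store one item; revisions of a property share a hash key.
SQLiteBackend.prototype.put = function (domain, table, item, callback) {
    callback(null);
};

// Fetch items by hash key, optionally bounded on the range key.
SQLiteBackend.prototype.get = function (domain, table, query, callback) {
    callback(null, { items: [] });
};

module.exports = SQLiteBackend;
```

With the contract this small, a comparative benchmark mostly means implementing three methods per engine, which is the "simplified indicative benchmark" TimStarling asks about.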
[23:03:23] so, back to separation of layers.. one issue I see with that is that, for example, the pagecontent bucket needs to do some queries about revisions
[23:03:39] which is currently implemented as a proxy endpoint for action=query
[23:04:13] yeah, that's not the sort of thing you would expect to see in a storage backend
[23:04:17] having that as part of our internal storage api layer feels very wrong
[23:04:18] now, we could implement this particular case as a library & distribute the config to front & backend
[23:04:26] ha, i managed to doodle a basic diagram: http://www.gliffy.com/go/publish/image/6431470/L.png
[23:04:26] but there will be more abstractions like this
[23:04:59] (gliffy seems worth a look...)
[23:04:59] anyway, we at least have consensus minus gwicke that something needs to change
[23:05:01] heh daniel
[23:05:27] TimStarling: all I'm saying is that it might appear more appealing at first than it is when you look at it more closely
[23:06:01] gwicke: there's a clear outline of steps you can take to make your plans more palatable, and at this point it'd be more efficient to just go ahead and take them rather than argue the point. Just IMO.
[23:06:02] apart from a vague 'security' bit I haven't seen many reasons for it so far
[23:06:25] ok, thanks everyone
[23:06:30] #endmeeting
[23:06:30] Meeting ended Wed Nov 5 23:06:30 2014 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[23:06:30] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2014/wikimedia-office.2014-11-05-22.01.html
[23:06:30] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2014/wikimedia-office.2014-11-05-22.01.txt
[23:06:30] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2014/wikimedia-office.2014-11-05-22.01.wiki
[23:06:30] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2014/wikimedia-office.2014-11-05-22.01.log.html
[23:06:55] ori: I'm happy to have an informed discussion
[23:07:03] i can give you additional reasons; for example, having the internal storage layer tied to a pretty public and fairly unstable mediawiki action=query API makes for horrible coupling which is difficult to make reliable
[23:07:19] but at this point I'm not sure that there is enough shared understanding of the trade-offs
[23:07:32] that may be the case, but in that case it falls on you to provide it
[23:08:01] mark: actually, a public API needs to be very stable
[23:08:03] and versioned
[23:08:07] so that's hardly an argument
[23:08:16] it currently isn't
[23:08:50] so is your point that it would need to be too stable?
[23:09:18] my point is that I wouldn't like to see our internal storage layer coupled to something that is in a separate layer of concerns AND isn't stable
[23:09:49] it'd easily become one cascaded house of cards
[23:10:26] I share the desire to keep the layers separate; on the other hand I do like the idea of reusing abstractions and avoiding repetition and inconsistency
[23:10:45] yeah, so let's think about that more
[23:11:04] at least we should be able to weight the cons and pros, I don't think we have those fully yet
[23:11:07] weigh
[23:11:10] I think we should have a look at the code & request flow, the relevant costs etc & then have another discussion
[23:11:38] *nod*
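For context on the coupling mark objects to: the revision metadata the pagecontent bucket needs currently comes from a proxied MediaWiki action=query call, along the lines of the sketch below. The function and its parameter choices are illustrative; the real proxy endpoint may differ.

```javascript
var request = require('request'); // the npm 'request' module, common in 2014

// Look up revision metadata for a title via the public action API.
// This is the dependency being criticized above: the storage layer's
// behavior now rides on a separate, evolving public API.
function getRevisionInfo(domain, title, callback) {
    request({
        url: 'https://' + domain + '/w/api.php',
        qs: {
            action: 'query',
            prop: 'revisions',
            rvprop: 'ids|timestamp|user',
            titles: title,
            format: 'json'
        },
        json: true
    }, function (err, res, body) {
        if (err) { return callback(err); }
        // The response nests revisions under a numeric page id, one of
        // several action=query quirks the storage layer then depends on.
        var pages = body.query.pages;
        var page = pages[Object.keys(pages)[0]];
        callback(null, page.revisions);
    });
}
```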
[23:14:07] DanielK_WMDE_: btw, I didn't get the bit about rightbyte.de/demo/xskin/Test.xml
[23:14:18] http://brightbyte.de/demo/xskin/Test.xml
[23:19:45] gwicke: AaronSchulz: very old experiment of mine. serving page content and user-specific content separately, one cacheable server-side, the other one client-side. they get combined in the browser, using xslt.
[23:20:07] seemed related to your experiment. here's the writeup: http://brightbyte.de/page/Client-side_skins_with_XSLT
[23:20:24] it's a bit dated... xslt didn't really become a thing...
[23:20:54] err, sorry AaronSchulz, that was a random
[23:22:12] gwicke: i seem to remember you were involved in the original discussion back then... but memory is hazy :P
[23:23:51] basically, the idea was to enable web caching for logged-in users