[00:09:42] TimStarling: https://gerrit.wikimedia.org/r/#/c/252877/ & https://gerrit.wikimedia.org/r/#/c/252878/1 [00:20:20] Oh, ostriches-- https://phabricator.wikimedia.org/T118531 [00:22:26] All done [00:24:10] Thanks! [16:23:32] * bd808 is writing java code and remembering why he stopped doing this years ago [17:13:16] bd808: what java? :p [17:13:50] ostriches: making a new UDF for hive queries [17:14:16] it will classify an ip address as "internal", "external" or "labs" [17:14:29] good java I enjoy. bad java is worse than the dentist. [17:14:42] this is mostly boring java [17:15:07] but with mvn build tools which I always hated [19:22:38] anyone have some PHP/MW vim config I could steal? :) [19:54:02] Krinkle: ping [19:54:32] Krinkle: do you have a sec to talk about https://gerrit.wikimedia.org/r/#/c/252950/2? [19:55:10] I do [19:55:11] gwicke: [19:55:45] i think we (I) have some crossed wires, and maybe it means that doing this as an rcfeed is the wrong choice [19:56:28] https://phabricator.wikimedia.org/T116786 is about writing events from MW to this new event service [19:56:42] urandom: krinkle & I just chatted in -dev, see https://gist.github.com/gwicke/df8b347058a19e6556f6 for a log [19:57:47] * urandom is reading [20:03:06] to me, it seems that we could have a thin layer below RCFeed to handle events in general, and then dispatch those that should be included in feeds to RCFeed [20:03:34] Not unlike log groups, however. [20:03:42] Which uses Monolog now [20:04:07] wgFeedGroups['rcfeed'][], perhaps similar to wgHooks? [20:04:15] keyed by "topic" [20:04:22] or recentchanges as topic rather, not rcfeed [20:04:28] yeah, and each event could have several topics set [20:04:30] and then another for other types of events. [20:04:39] Several? 
That would get confusing [20:04:51] With regards to formatting expectations [20:05:10] sure; if we make the topics fine-grained then the feeds can list what they are interested in [20:05:25] I was thinking about including a 'rcfeed' meta topic [20:05:26] We should not hardcode seemingly arbitrary aggregations of topics inside mediawiki core. If we want that maybe we should have a kafka consumer on mediawiki_recentchanges, and have it feedback into Kafka for specific things and formats. [20:05:40] yeah, agreed [20:05:42] to a different topic. [20:06:09] the other aspect is the private / public data split per event [20:06:29] one way to handle that would be to emit two events right at the source [20:06:30] Yeah. As long as it's all bound to recent changes I think we can do it on that side of the division. [20:06:31] Makes sense [20:07:24] OK. So yeah, we need them to be outside recentchanges. I realise why now. [20:07:29] Because they would not have an rcid [20:07:32] unless they are in the table [20:07:49] and if they are in the table, we break the interface between consumers of the table and require filtering, which I'd like to avoid. [20:08:14] yeah, unless all those consumers go through a single accessor it would be hard to enforce consistent filtering [20:08:38] It would be interesting to have the SQL writer be an in-PHP consumer of the feed, but can't be because of rcid [20:08:43] which is only known after writing [20:08:43] on the other hand, having a table with all events could be nice as a low-fi eventbus backend [20:08:50] Yeah [20:08:54] Hm.. [20:09:30] we were talking about consistent event emission as well [20:09:49] it's a lot easier to guarantee consistency if you can write to a table as part of a primary transaction [20:09:50] Yeah, we can add restricted to the rc table. Index on it. Hidden by view. Hidden from feeds by default. Optionally made visible. For Redis, IRC etc. (which are currently unfiltered to the public) they don't change anything. 
For internal Kafka we can do the full feed including restricted. [20:10:22] But we may still need an idea of topics at some point. [20:10:31] But for the initial use case it seems like rc covers it [20:10:49] a lot of other things currently use logstash and statsd instead. [20:11:31] yeah, which makes sense if you don't need 100% reliability [20:11:56] It does mean we can only emit events from POST requests. [20:12:01] I mean, with the goal to not connect to master on GET. [20:12:22] which is already the case, but something to keep in mind [20:13:00] that's fine for the events we have in mind right now [20:13:24] The added rc table field should be simple, but not trivial. [20:13:57] now, if we used this as a backend for eventbus in small installs, would it be fine to write all kinds of events in there? [20:14:05] Hm.. [20:14:11] like RESTBase signaling that HTML for some revision was rerendered? [20:14:23] Not sure I follow [20:14:38] eventbus is a general event system [20:14:44] Right [20:14:55] it'll include events from different services, including MW, RB etc [20:15:05] would small installs have RB? [20:15:06] a lot of the functionality will be job queue like [20:15:36] yes, I am operating under the assumption that small installs will have Parsoid, RB and VE [20:15:59] and some form of EventBus [20:16:14] Right and we want to stop MediaWiki from being directly aware of RB (with the RB extension) [20:16:26] yeah [20:16:48] Do we define EventBus as a protocol, or as a service? [20:17:14] RCfeed right now is an interface/protocol. It can be whatever one configures it to be. [20:17:17] it's in flux; the minimal definition is queuing functionality similar to kafka's [20:17:24] There is no RCFeed composer or node package. 
[20:17:30] a medium definition is that there's a producer API as well [20:17:44] the full-fledged definition might eventually have a consumer API, too [20:18:21] the implementation is at the 'medium' stage [20:18:41] We could have a MediaWiki extension that implements the basic system as an SQL table. And exposes its HTTP through the MediaWiki API :/ [20:19:05] Krinkle: yeah, especially for the producer side that should be fine [20:19:19] websockets might be a bit tricky in that scheme, but.. maybe with hack? [20:19:19] But how RB is going to listen to that is another matter.. [20:20:16] Right now Parsoid and RB both don't need a complicated installation right? Parsoid is a service that works out of the box with npm-install and a config file. RB can fallback to sqlite [20:20:29] yeah [20:20:32] if we do eventbus as a node service, then it needs a database. [20:20:40] and we should be able to bundle both into a single service [20:21:00] using service-runner to offer both services in a single node worker on low-mem installs [20:21:09] Right [20:21:33] I won't be able to install this on my 2 wikis that I run in shared hosting though. [20:21:53] yeah, the other option would be to leverage RB's storage modules [20:21:54] I was able to enable memcached and APC and upgrade to PHP 5.6 from their config panel. But cgi through apache only. [20:22:40] the packaging discussion is a fun one [20:23:07] hopefully, we'll eventually have either 'docker run mediawiki', or 'apt-get install mediawiki' [20:24:00] I can upload any PHP and python. but it's cgi through apache. And no node. [20:24:09] Now I could easily add 1 or 2 dollars per month and then I have it elsewhere [20:24:20] There are old sites I don't mind upgrading. 
It's just an example :) [20:24:30] you can get a full VM for $2 a month ;) [20:24:40] http://serverbear.com/ [20:24:42] But that means I'm now maintaining a server [20:24:44] I don't want that [20:24:52] I've spent less than 4 hours in total on these sites in the past 3 years. [20:24:56] yeah, hence docker or the like [20:26:06] this conversation isn't easy to following coming in late, can someone summarize? [20:26:12] s/following/follow/ [20:26:59] urandom: an issue is that we need some private events that aren't in rcfeed right now [20:27:23] the ones we need right now are compatible with RC, they're just omitted right now. [20:27:53] So for now we don't need an extra layer of topics yet and where to store them. [20:27:58] so we were thinking about ways to filter in a central place [20:28:03] * ori still does not understand the crux of the disagreement [20:28:12] But we do need to figure out the stock install strategy for eventbus. [20:28:21] There is no spoon, ori. [20:28:35] is there a spork? that would do, in a pinch. [20:28:40] :) [20:28:50] a Titanium Spork. [20:28:51] Krinkle: what are you objecting to? [20:28:56] Nothing in fact. [20:28:57] by Light My Fire [20:29:08] what is gwicke objecting to? [20:29:11] gwicke: So let's go concrete. [20:29:14] or are the two of you in ferocious agreement with one another? [20:29:20] Let's try something and see what we think. [20:29:25] ori: we were just brainstorming [20:29:34] I got to run, sorry [20:29:37] a node service for eventbus, bundleable with parsoid/rb via service-runner for small installs [20:29:38] well, don't I feel like an asshole now! :P [20:29:40] using the same sqlite db? [20:29:47] Krinkle: why node? [20:29:50] configurable with mysql of course [20:29:56] * urandom 's head implodes [20:29:59] there was the decision to extend EventLogging IIRC, and ottomata is working on packaging [20:30:05] ori: node or php, for pubsub, you probably won't want php. [20:30:19] ori: This is not for prod. 
[20:30:23] Krinkle: there's a history there, let's not regress [20:30:42] But we are not going to advocate that the minimal working install of MediaWiki + VisualEditor includes Kafka and EventLogging [20:31:11] why not? [20:31:27] It needs Parsoid + RB, and right now that works because RB has a MW extension that will inform RB about any actions it needs to hear about. [20:31:36] That is something we want to replace. [20:31:54] next year in jerusalem [20:32:11] MW is going to get a generic interface (designed after Kafka) that has a topic and a message, basically. And how you publish it, and with what, is configurable, null, by default. We'll use Kafka. [20:32:11] anyways [20:32:24] or EventLogging [20:32:26] either way [20:32:38] But for plain installs we're thinking what is the right approach. [20:32:49] urandom is new to mediawiki-core development, so let's ease him into our wonderful world of "perhaps you thought you had consensus, but i'm here now to challenge all the points of agreement you thought you had finally established" [20:33:04] \o/ [20:33:09] They already have a node server running with parsoid and restbase (independent, but running alongside in the same service-runner) [20:33:13] and in general make room for the people doing the development work to actually do the thinking [20:33:39] OK. I didn't know that! [20:33:51] Krinkle: this was discussed, in RFC meetings, and in Phab threads long enough to make Samuel Richardson feel inadequate [20:35:34] OK. Let's start again with the commit we have in Gerrit. [20:36:15] that, is based on the discussions in the ticket, and some here with legoktm [20:36:24] urandom: We both miss some context. I for one, am missing the context of EventLogging and Kafka somewhat. I'm aware of that being the direction and am absolutely fine with that. I'm here to learn. [20:36:50] urandom: I'd like to understand what brought you to the state of this commit. 
E.g. is the $event format being modelled after something pre-existing? [20:37:05] for some value of pre-existing, yes [20:37:27] the schemas are under discussion at https://github.com/wikimedia/restevent/pull/5 [20:37:40] there is a service that ottomata has been working on, and another at https://github.com/wikimedia/ [20:38:14] and yeah, the events used with those APIs can be found here: https://github.com/d00rman/restevent/tree/basic-events/schemas [20:39:25] Krinkle: so the assumptions that led to that gerrit are that we'd hook into MW *somehow*, to do an HTTP post to a service, using events formatted according to those schemas [20:39:40] and there seemed to be some consensus that RCFeed was a good starting point [20:39:48] OK. I don't want to re-open any made decisions, but I think there is some leeway within implementation that may make it seem different, but maintains the same semantics. For example, we don't need to filter down the stream to just the three hand picked sub events. That could introduce an odd subset into the mix that is harder to re-use. If we need a narrow subset for RestBase, I think we can either make the events configurable, or we can have a Kafka consumer within Wikimedia's set up that will listen to recentchanges from EventBus and feed the subset back into another Kafka topic that RESTBase will consume. [20:39:59] urandom: Yeah, that sounds good. [20:40:54] what do you mean by making the events configurable? [20:41:23] Like $wgRCFeeds['eventbus'] => array( ... rcType => array( 'log' ) ) [20:41:25] it seems that we could evolve how events get into RCFeed slightly to support some amount of filtering / mapping [20:41:29] or something like that. [20:41:44] But I don't think we need to do that per se. [20:41:47] Krinkle: i see [20:42:07] yeah, i'd thought about adding a configurable filter of some kind [20:42:17] I mean, it depends. 
We want to use this system for more things, so might as well just expose recentchanges as-is in a format we know and understand. [20:42:30] otherwise the feed should be called restbase-purges [20:42:50] we need a way to add new events without them showing up in all feeds by default [20:43:11] & some way to audit what goes where [20:43:12] define events. You mean things that are not recentchanges? [20:43:43] gwicke: I guess you don't want the restbase purger to receive all of rcfeed? [20:43:48] so, pushing everything to a kafka topic (or two, if i understood, private and public), and then reprocessing them into eventbus topics was one option? [20:44:14] recentchanges is a collection of events that are all deemed useful & not too sensitive for a public recent changes feed [20:44:29] I think 'restricted' would be a JSON property of the blob pushed to mediawiki_recentchanges topic. But it could also be a separate feed. [20:44:30] that's how I see it, at least [20:44:38] it includes several different events [20:44:40] Yeah [20:45:07] Krinkle: that's interesting [20:45:10] edits, new pages, log events, categories. [20:45:16] those are the main RC types [20:45:30] and log events has many subtypes. Many of which are provided by MediaWiki extensions and plugins. [20:45:38] it seems easy, we already have code in eventlogging for this, i think [20:46:29] categories is actually not exposed in rcfeed. So that gives us a nice compat list: new pages, edits and log events. [20:46:32] compact* [20:46:51] we'd want to add suppression events [20:47:13] They have a generic schema to them defined in MachineReadableRCFeedFormatter that doesn't vary on the subevent. So we can make a schema for it that is generic. E.g. not specific to delete events. [20:48:06] urandom: Yeah, so that is the only tricky part. There is a particular type of event, suppression. Which is important to RESTBase, and is not currently provided by the RCfeed system. So as-is it wouldn't solve the first use case. 
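The configurable per-feed filter floated above (the `$wgRCFeeds['eventbus']` idea with an `rcType` list) could look roughly like the sketch below. This is an illustration in Python rather than MediaWiki PHP, and the `rc_types` key, the `FEEDS` table and the `feeds_for` helper are all hypothetical names invented here, not existing MediaWiki options.

```python
# Hypothetical feed configuration, loosely mirroring the
# $wgRCFeeds['eventbus'] idea from the chat. The 'rc_types' key is an
# assumption for illustration, not an existing MediaWiki setting.
FEEDS = {
    "irc": {},                           # no filter: receives every RC type
    "eventbus": {"rc_types": {"log"}},   # only log events
}


def feeds_for(event, feeds=FEEDS):
    """Return the sorted names of feeds that should receive this event."""
    selected = []
    for name, config in feeds.items():
        allowed = config.get("rc_types")
        # A feed without a filter accepts everything.
        if allowed is None or event["type"] in allowed:
            selected.append(name)
    return sorted(selected)
```

Under these assumptions an edit would reach only the unfiltered feed, while a log event would reach both.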
[20:48:48] is suppression not something that could be added to recentchanges? [20:48:52] even a binary 'recentchanges' vs. 'suppression' topic distinction would already be useful to us [20:49:02] but, there are already types that we could elevate to the topic level [20:49:21] I'm checking now whether suppression itself also emits a log event. [20:49:27] I wonder if one can suppress a suppression event? [20:49:43] I think so [20:49:54] but then that one also emits a suppression event [20:50:06] which should be visible internally only [20:50:13] so that RB can suppress content [20:50:35] * Foo deleted page Sandbox. [20:50:39] suppression is for deletes? [20:50:46] urandom: Suppression is for any log entry. [20:50:57] Account creation, page move, page delete, user block, anything. [20:51:09] See the dropdown menu at https://en.wikipedia.org/wiki/Special:Log [20:51:28] yeah, there's two kinds of suppression ;) [20:51:34] one for log entries, and one for revisions [20:51:34] Authorised users have an additional UI component on that page that allows them to suppress an entry. [20:51:59] That will flip a flag in the logging table, causing the UI rendering of that list item to be greyed out. [20:52:10] what is that used for? [20:52:15] And it then adds a new log entry above it that says a suppress took place. [20:52:34] to make a log/entry secret? [20:52:37] Foo changed visibility of a delete log event. [20:52:50] :/ [20:52:53] Here's an example. [20:53:00] I just deleted my local wiki's Main page [20:53:01] 19:29, 13 November 2015 Root (Talk | contribs | block) deleted page Main Page [20:53:05] It says that on my Special:Log [20:53:35] Now, I hide that event. 
Which changes the UI rendering of that log entry to this: [20:53:43] I agree that this could be an attribute [20:53:43] 19:29, 13 November 2015 (username removed) (log details removed) [20:53:54] redacted [20:54:07] 19:35, 13 November 2015 Root changed visibility of a delete event on Special:Log [20:54:13] but, that doesn't necessarily address the revision suppression use case [20:54:27] Yeah, I don't think Restbase is interested in log/suppress [20:54:31] but you want rev delete [20:54:35] which is similar [20:54:48] it's actually an event topic [20:55:06] rather than a restriction on which events (or parts of events) are visible to which audience [20:55:37] The attribute doesn't work indeed. Because no event would ever be restricted from that point of view. [20:56:19] I mean, the delete event wouldn't be restricted when the consumer originally got it. [20:56:29] ok, so tl;dr, these suppressions aren't something that can be handled by recentchanges? [20:56:39] It can be, but currently is omitted from it [20:56:43] It is added to Special:Log [20:56:44] k [20:56:54] and almost all of Special:Log is also in Special:RecentChanges (limited to 30 days) [20:57:05] but suppression is currently omitted from recent changes [20:57:09] it depends on whether "recentchanges" is what you see in public recent changes, or if it's "all events" [20:57:21] The logging system has a 'restricted' attribute which controls whether a user sees it when they view Special:Log [20:57:32] the recentchanges system does not currently have this attribute. [20:57:34] But we could add it. [20:58:09] in the short term, it seems safer to not feed sensitive information into RCFeed [20:59:08] Well, it would be hidden by default. 
[20:59:09] one way to guarantee that could be to dispatch those events at a layer below [21:00:31] the simplest version of that could perhaps be to start with a single internal topic called 'recentchanges' [21:01:02] we can then add a second one called 'suppression', and subscribe the eventbus producer (but not others) to both [21:03:34] Yeah [21:04:10] later, we could consider un-bundling 'recentchanges' into different events [21:04:47] gwicke: when you say 'topic', do you mean to have MW produce to kafka directly, and then subscribe and forward to eventbus? [21:04:56] So I checked where revision delete comes in. Both revision delete and log suppression are part of log type 'delete'. With log actions delete/revision and delete/logging respectively. [21:05:02] or are you referring to 'topic' conceptually here [21:05:37] urandom: in this context, I mean a string property we can match on [21:05:39] Krinkle: so rc *does* supply the suppression we need? [21:10:21] OK. I'm gonna zoom out for a minute. [21:11:15] there's definitely a "logentry-delete-revision" message [21:12:06] Let's consider a (simplified) version of MediaWiki: revision table, logging table, recentchanges table. Edits are saved in revision table. A summary of this is also saved in the recentchanges table with added info we only keep for 30 days. Non-edit actions (such as deleting pages, account creation and renaming pages) are logged in the logging table. The logging table is kept indefinitely and viewable via Special:Log. [21:12:24] Log actions, like edits, also result in the creation of a recent changes entry. [21:12:42] All recent changes entries are also published through any RCfeeds the site has configured. [21:13:25] the logging table has an attribute 'restricted'. Entries with this set are omitted when a user queries rows from the logging table for Special:Log. [21:13:43] When log entries with 'restricted' are created, they also do not go to recentchanges. 
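The two-topic idea above (an internal 'recentchanges' topic plus a 'suppression' topic that only the eventbus producer subscribes to) can be sketched as a small dispatcher. This is a Python illustration under stated assumptions, not MediaWiki code: the `Dispatcher` class, the handler lists and the event payloads are hypothetical.

```python
from collections import defaultdict


class Dispatcher:
    """Hypothetical thin layer that routes events by a topic string."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def emit(self, topic, event):
        # Deliver only to handlers subscribed to this exact topic.
        for handler in self._subscribers.get(topic, []):
            handler(event)


public_feed = []  # stands in for public feeds such as IRC or Redis
eventbus = []     # stands in for the internal eventbus producer

dispatcher = Dispatcher()
dispatcher.subscribe("recentchanges", public_feed.append)
dispatcher.subscribe("recentchanges", eventbus.append)
dispatcher.subscribe("suppression", eventbus.append)  # internal only

dispatcher.emit("recentchanges", {"action": "delete", "page": "Sandbox"})
dispatcher.emit("suppression", {"action": "revision-hidden", "rev": 123})
```

With this arrangement a new topic reaches no feed until something subscribes to it, which is one way to keep new events out of public feeds by default and to audit what goes where.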
[21:14:40] When an admin deletes a page in MediaWiki, Restbase needs to know about this so it can delete it there too. [21:14:47] it's a more specialized version of the topic thing [21:14:58] it would basically hard-code some mappings [21:15:26] with topics, we could have events that aren't logged internally at all [21:15:29] Page Delete events are normally public and as such in recent changes. [21:16:17] When an admin suppresses an individual revision (edit) but keeps the page, this creates a restricted log entry since it is not public that this happened. Restbase needs that event as well. But those don't go to recent changes since recent changes does not have a restriction ability at the moment. Public-only. [21:16:50] urandom: I hope that makes more sense? [21:17:00] Krinkle: yes, that helps a lot. [21:17:17] Krinkle: thank you [21:18:14] I mostly care about not having the logic about which events & properties to include spread across all those formatters / backends [21:19:07] gwicke: Yeah. agreed. But I think splitting up at the topic layer and not having individual formatters decide are mutually exclusive. [21:19:24] However, if we do add it to recentchanges, we have to add it all the way, which is a project we may not want to take on. [21:20:00] So while imho it is semantically wrong to publish this event outside topic=recentchanges, I think it is the best way forward for now. [21:20:11] yeah, we can't consistently implement filtering etc right now [21:20:27] for all logging table consumers [21:20:37] recentchanges table consumers* [21:20:52] logging and revision already have it. That was a major change we did in the last 2 years. [21:20:56] both, I think [21:21:10] logging already has it too. It's just RC that doesn't. [21:21:24] oh, okay [21:21:42] E.g. 
if someone suppresses a revision, most people will not see it on Special:Log, but authorised users do [21:22:42] *nod* [21:22:47] or rather, they can if they click on it (it's struck through with placeholder by default for everyone, but details) [21:23:20] It was a bigger deal for revision, which is often queried by tools in labs. [21:23:44] but we did it, and it's now mandatory for tool owners to filter responsibly. They have access to it (and supposed to), but it's just tricky. [21:24:15] depending on the level, if it's restricted above system, then it's not visible in labs either. [21:24:19] sysop* [21:24:33] okay, so just to clarify: currently, RCFeed is not seeing suppression events, but the log table does? [21:24:35] but labs seems default hidden revs and archive table etc. [21:24:40] gwicke: Yes [21:25:03] okay, then: how would we tap into those events? [21:25:33] If the delete revision event ends up suppressing content, (as opposed to unsuppressing content or doing something else) then it will flip the prefix from delete/* to suppress/* e.g. suppress/revision, and then it matches isRestricted() which makes updateLog() in RevDelList return early instead of publishing to recentchanges as well. [21:25:49] RCFeed production looks very ad-hoc & inline right now [21:26:00] LogEntry::publish, not RevDelList::updateLog [21:26:02] anyway. [21:27:08] gwicke: It's a lot better than it was actually [21:27:16] But yeah, not pretty. [21:27:32] So there is one separation for example. A log entry can be published to sql, feeds, or both. [21:27:44] awkwardly named 'rc' and 'udp' respectively. [21:27:54] where 'rc' means DB and 'udp' means feeds. [21:28:07] quite obvious [21:28:16] default is 'rcandudp' but patrol log actions, for example, after the logging table, only go to feeds, not to recent changes. [21:28:29] since that would be conceptually less useful to see patrol events of rc events in rc again. 
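The publish targets described above ('rc' for the database, 'udp' for feeds, 'rcandudp' for both, with restricted entries returning early) can be summarised in a short sketch. This is a Python paraphrase of the behaviour as described in the chat, not the actual LogEntry::publish code; the `publish` function, the dictionary shape and the destination labels are hypothetical, and treating a restricted entry as going nowhere is an assumption.

```python
def publish(entry, to="rcandudp"):
    """Return the destinations a log entry would be sent to.

    'rc' means the recentchanges table, 'udp' means the configured feeds,
    and 'rcandudp' (the default) means both, as described in the chat.
    """
    # Assumption: a restricted entry (e.g. a suppression) skips
    # publication entirely, mirroring the early return mentioned above.
    if entry.get("restricted"):
        return []
    destinations = []
    if to in ("rc", "rcandudp"):
        destinations.append("recentchanges_table")
    if to in ("udp", "rcandudp"):
        destinations.append("feeds")
    return destinations
```

Under these assumptions a normal delete goes to both destinations, a patrol action published with `to="udp"` goes to feeds only, and a restricted entry goes nowhere.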
[21:28:33] as patrollable events :P [21:28:44] that could be fun [21:28:47] wat. [21:28:48] another one is 'changetags' [21:29:05] Those are the only two in core right now that go to feeds but not DB. [21:29:10] So it seems we have this already. Weird. [21:29:16] I would expect that to break since those don't have an rcid [21:30:22] great, so we have a two-topic event emission system already [21:30:27] :D [21:30:40] I don't think anyone knows [21:30:51] including the people that made it, piece by piece unaware of what it became [21:31:04] got a link to that code? [21:31:17] includes/changetags/ChangeTags.php:755, includes/logging/PatrolLog.php:65 [21:31:31] RevDelList::updateLog [21:31:32] awesome, thanks! [21:31:36] LogEntry::publish() [21:31:38] Those