[21:01:10] <TimStarling>	 #startmeeting RFC meeting
[21:01:10] <wm-labs-meetbot>	 Meeting started Wed Sep  2 21:01:10 2015 UTC and is due to finish in 60 minutes.  The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:01:10] <wm-labs-meetbot>	 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:01:10] <wm-labs-meetbot>	 The meeting name has been set to 'rfc_meeting'
[21:01:31] <TimStarling>	 #topic RFC: Master/slave datacenter strategy for MediaWiki  | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[21:01:35] <jzerebecki>	 \o
[21:01:39] <ori>	 Hello
[21:01:43] <robla>	 \o
[21:01:48] <bblack>	 hi
[21:01:51] <legoktm>	 hi
[21:01:59] <paravoid>	 hello
[21:02:04] <godog>	 'lo
[21:02:07] <chasemp>	 hi
[21:02:14] <matt_flaschen>	 Hey
[21:02:22] <SMalyshev>	 hey
[21:02:28] <NDK|Cloud>	 hi
[21:03:04] <spagewmf>	 wuff
[21:03:15] <TimStarling>	 #link https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki
[21:03:15] <comets>	 great meeting..short and sweet :D
[21:03:19] <AaronSchulz>	 heh
[21:03:57] <ori>	 AaronSchulz: would it make sense to start by reminding everyone what is the problem you're trying to solve?
[21:04:17] <ori>	 just to get everyone on the same page
[21:04:19] <robla>	 +1 to ori
[21:04:45] <AaronSchulz>	 ori: I suppose, though it hasn't changed since I wrote the rfc intro yet :)
[21:05:06] <ori>	 up to you; it's your show
[21:05:07] <robla>	 AaronSchulz: feel free to cut and paste the appropriate part of the RFC  :-)
[21:05:31] <gwicke>	 https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki#Background
[21:05:34] <AaronSchulz>	 so, the idea is to have more traffic served by our new Dallas DC while also having the Ashburn one serve traffic
[21:06:22] <bblack>	 Well, I can put in some thoughts on how we're looking at the very basics at the traffic layer.  Even if we didn't have true live master/slave at the applayer, the intent has been to have the two primary DCs have hot, in-use cache layers that do not rely on each other.
[21:06:24] <AaronSchulz>	 I heard that CDN is up in codfw (Dallas), which is cool. This will extend that to HTTP GET page requests for MediaWiki in general (logged out or in, doesn't matter)
[21:07:11] <bblack>	 The idea would be (if nothing else here came to fruition) that the non-primary directly reaches the applayer over the wan link, and we could configuration-switch both of them to backend to the opposite DC
[21:07:38] <RoanKattouw>	 So we will have appservers in codfw serving traffic, while eqiad remains the primary appserver DC?
[21:07:48] <AaronSchulz>	 yes
[21:07:52] <bblack>	 so, we can have user-facing traffic split already fairly easy.  the key thing here is being able to truly use both at the same time at the applayer to keep various lower-level things warm and running
[21:08:34] <gwicke>	 so it sounds like you are shooting for a phased approach, with CDN being possibly the first phase?
[21:08:47] <AaronSchulz>	 ideally it could lower latency in the common (non-switch) case as well, though keeping both DCs warm and at the ready is the main focus
[21:08:51] <TimStarling>	 AaronSchulz: what are the remaining problems and what do you need input/help on?
[21:08:57] <bblack>	 well I'm saying the CDN part I mentioned above isn't even a part of this.  It's already being architected that way.  It's a precursor to this :)
[21:09:00] <AaronSchulz>	 TimStarling: good question :)
[21:09:37] <gwicke>	 bblack: I mean in terms of actual traffic switching, it would probably be the first step?
[21:09:47] <bblack>	 but having that traffic split/switch capability at the edge/cache layer does not give us hot/ready/tested/filled caches at lower layers, etc
[21:09:58] <AaronSchulz>	 I'll need to know the status of Swift replication (I've only heard bits and pieces) and I've been keeping an eye on Elastic replication (the search team has been doing some work on this)
[21:10:48] <bblack>	 gwicke: it's already happening in a limited way.  Texas users are hitting codfw right now, but the config isn't in the final desirable state yet either.
[21:11:03] <AaronSchulz>	 I'll also need to recruit someone, maybe eevans/ori to work with me on getting cache relay purging working
[21:11:07] <RoanKattouw>	 What is the progress with adapting MW code and extensions to the New Ways of doing things (WANCache, idempotent GET, etc.), and is there a place where devs can educate themselves on these practices/changes?
[21:11:12] <godog>	 I can speak to the swift part, ATM we are replicating originals in codfw from eqiad with 'swiftrepl' from mark
[21:11:25] <urandom>	 AaronSchulz: i'm still game
[21:11:33] <godog>	 I tried last year per-container swift replication but didn't get very far
[21:11:35] <AaronSchulz>	 the model is there, but we need a real service for that (not just the toyish prototype)
[21:11:35] <gwicke>	 bblack: oh, nice; didn't know that we were that far already
[21:11:49] <paravoid>	 AaronSchulz: what do you mean by "cache relay purging"
[21:12:08] <AaronSchulz>	 paravoid: as in relaying memcached deletes and purges
[21:12:17] <AaronSchulz>	 to all DCs
[21:12:34] <TimStarling>	 there is an application layer (i.e. MW) option for swift isn't there?
[21:12:44] <AaronSchulz>	 TimStarling: for replication?
[21:12:48] <TimStarling>	 yeah
[21:12:49] <bblack>	 right, that's where I see the benefit of the work in this RFCs, beyond what we're doing at the edge: having memcached, db, redis, etc all working hot all the time, and ready for a fast easy switch to known-working stuff.
[21:13:02] <bblack>	 with the edge-layer work alone, we'd be looking at a very cold/questionable switch at the applayer
[21:13:03] <TimStarling>	 at least, I thought it was discussed
[21:13:07] <paravoid>	 AaronSchulz: ah. the RFC hints at that ("memcached purger daemons") but it isn't really spelled out
[21:13:09] <AaronSchulz>	 I guess the FileJournal could be turned on again with SyncFileBackend in a service loop or something
[21:13:42] <AaronSchulz>	 it would need some slight bolstering (a --uselock mode flag, not much work)
[21:13:54] <TimStarling>	 right
[21:13:58] <AaronSchulz>	 I never got around to that since I wanted to see if a lower layer setup could work
[21:14:12] <ori>	 paravoid: AaronSchulz has something working with https://github.com/AaronSchulz/python-memcached-relay but (AFAIK) is not sure it's the right solution; something off-the-shelf would be nicer.
[21:14:13] <AaronSchulz>	 (e.g. something in Swift...container sync, geo-clusters, ect)
[21:14:52] <godog>	 we haven't tried native swift cluster replication no, but the thought of leaving the two unaware of each other was nice
[21:15:46] <brion>	 is there anything we need to change in file storage to be friendlier to replication? just thinking of, eg, replacing large files on re-upload
[21:16:22] <gwicke>	 content-addressable storage would reduce the number of updates
[21:16:39] <TimStarling>	 so swiftrepl is done and in production? is that enough?
[21:16:53] <gwicke>	 there is a separate conversation in progress on that
[21:17:12] <AaronSchulz>	 TimStarling: I don't think that works by tailing updates though, but it would be useful for random things that still use swift outside of FileRepo (does math still do this?)
[21:17:27] <TimStarling>	 yeah, math and score
[21:17:35] <paravoid>	 I would hardly call swiftrepl "done"
[21:17:42] <paravoid>	 it has always been a hack coded in one afternoon or two
[21:18:09] <godog>	 agreed, it does need some love if it stays
[21:18:11] <paravoid>	 it sequentially goes through all files in all containers and copies them (or deletes them) across
[21:18:15] <AaronSchulz>	 TimStarling: I'd be worried it would take too long for some math/score files to show
[21:18:21] <AaronSchulz>	 that would need some though at the least
[21:18:32] <AaronSchulz>	 using swiftrepl/copyFileBackend
[21:18:38] <ori>	 paravoid, godog: so looking ahead to the end of the meeting -- can you capture what remains to be done in a short one-liner TODO?
[21:18:41] <paravoid>	 so yes, it takes hours to propagate changes
[21:18:46] <ori>	 (re: swiftrepl)
[21:19:03] <AaronSchulz>	 using the syncFileBackend would require mariadb FileJournal updates just for math/score stuff...ugly though maybe not world ending
[21:19:06] <godog>	 ori: sure
[21:19:33] <AaronSchulz>	 it could replicate quickly though, which is the advantage
[21:19:37] <paravoid>	 I don't think swiftrepl is a solution that shuld be considered here at all, honestly
[21:19:38] <gwicke>	 so one project that could potentially help with speeding up replication is the event queue we are going to work on next quarter
[21:19:51] <paravoid>	 AaronSchulz: we don't do keep thumbs in the journal though, do we?
[21:20:02] <AaronSchulz>	 nope, same as math/score
[21:20:03] <paravoid>	 it has been a long time since I cared for those things, forgive me if I'm totally mistaken :)
[21:20:09] * AaronSchulz didn't want to spam the DB
[21:20:18] <ori>	 gwicke: I don't think it would be responsible of us to bank on that. It sounds like a tricky thing to get right.
[21:20:26] <AaronSchulz>	 though thumbs at least can regenerate on request
[21:20:45] <gwicke>	 sure, just saying that we'll likely have an event stream of things like file changes soon
[21:20:55] <paravoid>	 possibly, yes; but then we'd have to handle propagating deletes somehow...
[21:21:40] <AaronSchulz>	 sure, that's why swift sucks for thumbnails isn't it?
[21:21:48] <paravoid>	 among other reasons :)
[21:21:50] <AaronSchulz>	 *doesn't
[21:22:21] <AaronSchulz>	 it would be nice if the CDN and thumbnail regeneration didn't have to care about the other DCs
[21:22:25] <godog>	 we also never garbage collect thumbs afaik
[21:22:27] <AaronSchulz>	 alas, one can dream...
[21:22:59] <TimStarling>	 direct FileBackend callers in deployed extensions are ConfirmEdit, Score, Math, GWToolset and timeline
[21:23:42] <matt_flaschen>	 I thought that MediaViewer needed thumbs generated up-front for performance reasons.  I could be off or out of date, though.
[21:24:10] <brion>	 i wouldn't mind just keeping thumbs in varnish though :)
[21:24:18] <godog>	 matt_flaschen: yes standard thumb sizes are pregenerated now iirc
[21:24:19] <brion>	 they can be precached via URL
[21:24:25] <TimStarling>	 yeah I saw that Aaron has a patch for that
[21:24:37] <TimStarling>	 https://gerrit.wikimedia.org/r/#/c/126210/2 "[WIP] Added support for CDN-only thumbnail storage"
[21:24:39] <bd808>	 brion: I think there's an RfC for that (thumbs in varnish) ;)
[21:24:59] <AaronSchulz>	 TimStarling: yeah, well the real work would be CDN for that, so that patch does very little ;)
[21:25:03] <TimStarling>	 "updated Apr 18, 2014"
[21:25:15] <paravoid>	 that's a whole different can of worms, gilles resurrected that discussion and we had a meeting about it last week again
[21:25:27] <bblack>	 well they're in varnish anyways, the hot ones.  the question is how much we want to trust that we never lose all of our varnish caching all at once globally, and fall back directly on regenerating from scratch.
[21:25:36] <bblack>	 swift saves us from the worst impact of that
[21:25:40] <AaronSchulz>	 brion: sure, if we could have variant purging and a decent (not nmap) persistence model...maybe Apache TrafficServer with some hacks
[21:25:49] <AaronSchulz>	 paravoid: right, easy to be derailed
[21:25:58] <bblack>	 (and yes, agreed, huge separate topic)
[21:26:11] <TimStarling>	 swift also acts as a third tier cache big/cold cache
[21:26:23] <bblack>	 right
[21:26:46] <gwicke>	 yeah, and bounds the effort for huge multi-page tiffs by keeping base thumbs around
[21:27:02] <bblack>	 AaronSchulz: ATS eval vs varnish for all related things in our long-term plans regardless, but they'd serve a similar functional purpose and have roughly-similar issues wrt this
[21:27:26] <AaronSchulz>	 if we have to direct thumbnail regenerations to eqiad to keep the listings (for purge) in sync, we can start off that way. I'd like to actually use both DCs for scaling so that would be temporary until something better comes (either the thumbnail redo or just decent swift replication)
[21:27:41] <brion>	 i'd also like to consider separating swift groups between original files and long-lived derivatives (such as large video transcodes)... they can't be created on demand as easily as thumbs but are still distinct i think
[21:27:50] <AaronSchulz>	 bblack: well I assume the disk persistence would be more trustable, no?
[21:27:58] <ori>	 we have a few things on the table now, but it needs to be divided up into units of work and sequenced logically
[21:28:01] <AaronSchulz>	 bblack: but bucking would still need tricks, yes
[21:28:06] <ori>	 s/it needs/they need/
[21:28:12] <AaronSchulz>	 *bucketing
[21:28:21] <bblack>	 AaronSchulz: it's pretty trustable as it is.  We've made a lot of improvements to it over the past couple of years to make it reliable.
[21:28:55] <AaronSchulz>	 bblack: good to hear; last time it came up in discussion it was very sketchy
[21:28:59] <bblack>	 but we could still face a C-level or VCL-level or hashing-level bug that causes to accidentally lose things there.  falling back on swift to re-warm the caches is one thing.  falling directly back on scalers is another.
[21:29:18] * robla wonders just how much of this RFC should be incorporated into https://www.mediawiki.org/wiki/Architecture_guidelines
[21:29:26] <gwicke>	 some kind of timeline with milestones & a testing plan could help to give others an idea of where we are at & what's ahead
[21:29:50] <gwicke>	 not so much with dates, more the sequencing / dependencies as ori says
[21:30:07] <chasemp>	 re: search the plan is to have the jobqueue at eqiad for now be responsible for updating indexes at both locations, this gets us parity for writes.  We just now have hardware in for dallas and it's not yet even OS'd.  Precursor refactor for jobqueue https://gerrit.wikimedia.org/r/#/c/235149/, there is a one off here that doesn't go through jobqueue that needs some thought: https://phabricator.wikimedia.org/T109126. A
[21:30:07] <chasemp>	 nd no solution for redirecting read traffic has been sorted out to my knoweldge.  Briefly we discussed having an etcd key that keeps the active "svc" url or some such and relying on its multi-dc replication if available.  We we want the ability to hit search at either location from either location.  But it's an open ended question.
[21:30:13] <ori>	 (TimStarling: are '#info's fair game throughout the meeting, or should they be reserved for the end?)
[21:30:16] <chasemp>	 that was longer than I expected
[21:30:34] <TimStarling>	 you can use #info and #action throughout
[21:30:34] <spagewmf>	 robla yes, and to RoanKattouw's question: https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki#Design_implications says "Some changes would be needed to MediaWiki development standards", I would love to meet with AaronSchulz  and turn that list into wiki improvements, e.g. https://www.mediawiki.org/wiki/Performance_guidelines#Persistence_layer
[21:31:06] <brion>	 +1 should add to guidelines!
[21:31:10] <AaronSchulz>	 chasemp: yeah, that was my understanding
[21:31:12] <spagewmf>	 #action spagewmf meet AaronSchulz and incorporate RFC guidelines into developer docs
[21:31:27] <spagewmf>	 *RFC's guidelines
[21:31:43] <AaronSchulz>	 brion: some has been done already (e.g. caching). I can't recall how much GET/POST stuff has been mentioned on mw.org yet...
[21:32:07] <ori>	 #info Plan for search is to have the jobqueue at eqiad for now be responsible for updating indexes at both locations, this gets us parity for writes.  We just now have hardware in for dallas and it's not yet even OS'd.  Precursor refactor for jobqueue G235149, there is a one off here that doesn't go through jobqueue that needs some thought: T109126.
[21:32:12] <AaronSchulz>	 anyway, I still need to finish slogging through MW to get to more of the ops side stuff (e.g. the POST/GET vcl logic)
[21:32:16] <ori>	 #info And no solution for redirecting read traffic has been sorted out to my knoweldge.  Briefly we discussed having an etcd key that keeps the active "svc" url or some such and relying on its multi-dc replication if available.  We we want the ability to hit search at either location from either location.  But it's an open ended question.
[21:32:29] <bblack>	 relatedly, the stuff in the RfC about Sticky DC cookies: that's something we can implement wholly at the edge layer too.  We'd nominally have a natural traffic split between the two regardless, but we can do the post-POST short cookie thing as well at that layer, so that even if the users falls into a corner case where they bounce between DCs at the edge layer, they still hit a consistent applayer
[21:32:35] <bblack>	  DC during their cookie window.
[21:32:36] <AaronSchulz>	 at this point, it looks like almost everything at least has a patch, so I hope that's close to wrapping up
[21:33:04] <AaronSchulz>	 I've been looking at DBPerformance logs in Kibana and tracking down/fixing usage patterns
[21:33:25] <chasemp>	 tx ori
[21:33:37] * AaronSchulz would also like to get slave lag down a bit, since it affects the WAN cache usage patterns if too high
[21:33:46] <ori>	 #info direct FileBackend callers in deployed extensions are ConfirmEdit, Score, Math, GWToolset and timeline
[21:33:51] <gwicke>	 AaronSchulz: is https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki#Deployment_steps still up to date?
[21:33:53] <jynus>	 row based replication, AaronSchulz
[21:34:11] <AaronSchulz>	 https://phabricator.wikimedia.org/T95501
[21:34:23] <jynus>	 +SSDs will solve most of the issues
[21:34:29] <AaronSchulz>	 jynus: yeah, that will be awesome; I suspect there is lots we can do to MW in the mean time
[21:35:05] <AaronSchulz>	 it's not like we never do anything stupid in MW ;)
[21:35:06] <jynus>	 for mw, join the fight of "no transaction larger than 1 second"
[21:35:11] <ori>	 #info <AaronSchulz> if we have to direct thumbnail regenerations to eqiad to keep the listings (for purge) in sync, we can start off that way. I'd like to actually use both DCs for scaling so that would be temporary until something better comes (either the thumbnail redo or just decent swift replication)
[21:35:18] <jynus>	 :-)
[21:35:18] <bblack>	 how do we plan to address split-brain in all of this? if the world can still reach both DCs, but the DCs can't see each other, and that condition persists for a notable timeperiod.  Do we allow writes to proceed on both sides? does everything related have the ability to handle that?
[21:35:29] <AaronSchulz>	 like I noticed/fixed stuff (e.g. a15cf051885b9) that randomly stumbled across
[21:36:02] <ori>	 (jynus: I'd like to talk with you about that, by the way -- I have some ideas for better query performance monitoring. But that's off-topic.)
[21:36:05] <AaronSchulz>	 jynus: ori and I had a bit of talk about tooling for catching slave lag offenders
[21:36:15] <AaronSchulz>	 I miss the days of the ishmael tool
[21:36:26] <gwicke>	 bblack: aren't writes (in the db sense) limited to one DC for now?
[21:36:48] <ori>	 #info Slave lag is especially problematic for multi-DC b/c it affects the WAN cache usage patterns if too high.
[21:36:49] <brion>	 yes i don't think we have full multi-master on the plan yet...
[21:36:54] <bblack>	 gwicke: that's one possible outcome, but some things in the RfC seem to sound like multi-DC writes, with some stickiness so that a single user doesn't face replication update woes
[21:36:58] <AaronSchulz>	 brion: baby steps...
[21:37:01] <brion>	 :)
[21:37:12] <bblack>	 (the sticky cookie part)
[21:37:24] <ori>	 #action AaronSchulz / jynus / ori to discuss improving observability for query performance
[21:37:39] <DanielK_WMDE_>	 yea, what's the preferred_datacenter cookie thing about?
[21:37:42] <AaronSchulz>	 bblack: yeah, writes can happen across DC for some special cases, but it's heavily frowned upon
[21:37:44] <brion>	 bblack: dc stickiness is nice for consistency i think, as long as it's paired with a master version check when doing a read after a write
[21:37:46] <jynus>	 lag affects read-write/read-only too
[21:37:50] <AaronSchulz>	 hence the whole GET/POST rule
[21:37:51] <gwicke>	 my understanding is that writes to authoritative data at least will be limited to one DC for now
[21:37:59] <bblack>	 ok
[21:38:07] <TimStarling>	 you would configure MW so that the master is in eqiad and the slave is in codfw
[21:38:13] <brion>	 though if you've got a version (chron protector) in the cookie ..... maybe don't need the stickiness
[21:38:25] <ori>	 #info writes to authoritative data at least will be limited to one DC for now
[21:38:27] <bblack>	 so in a single-writeable-DC scenario, we still want sticky-cookie, but we want it to mean that post-POST, that user has to directly use the primary DC for reads for a short window as well?
[21:38:28] <TimStarling>	 so if MW decides to write to the DB from codfw, it would just be slow, not fatal
[21:38:55] <gwicke>	 *nod*
[21:38:55] <AaronSchulz>	 brion: note that chronprot handles multiple LBs
[21:39:14] <AaronSchulz>	 I guess you could stuff positions in one/several cookies (maybe hmac it for sanity)
[21:39:26] <AaronSchulz>	 that could avoid stickiness for the DB itself, but not other stuff like swift
[21:39:27] <DanielK_WMDE_>	 bblack: that would make sense, to make sure that people see the change they just make
[21:39:31] <DanielK_WMDE_>	 *made
[21:39:37] <jynus>	 yes, I assumed that if we learn about writes too late, it is an option still
[21:39:59] <ori>	 the sticky cookie bit needs to be implemented still, right?
[21:40:08] <bblack>	 at the edge layer, effectively codfw+eqiad will both be *capable* of backending requests to both codfw+eqiad MW/services.  So we're saying GETs would flow to the local DC, POSTs to the write-DC even if it's not local, and POST sets a cookie for the user to keep their reads going there too for a short while.
[21:40:30] <RoanKattouw>	 #info <bblack> 	at the edge layer, effectively codfw+eqiad will both be *capable* of backending requests to both codfw+eqiad MW/services.  So we're saying GETs would flow to the local DC, POSTs to the write-DC even if it's not local, and POST sets a cookie for the user to keep their reads going there too for a short while.
[21:40:36] <bblack>	 ori: yeah not implemented.  but as I was saying earlier, if it's truly just on POST/GET distinction, we can do that entirely in the traffic edge layer.
[21:40:40] <TimStarling>	 what is the schedule? it sounds like the schedule is constrained by ops setting up hardware in codfw?
[21:40:43] <ori>	 bblack: nod
[21:40:47] <RoanKattouw>	 Schedule for what?
[21:40:49] <AaronSchulz>	 brion: so yeah, I'd stick with the sticky cookie initially at least
[21:40:53] <paravoid>	 hardware for what?
[21:41:02] <ori>	 for the schedule!
[21:41:03] <ori>	 :D
[21:41:05] <TimStarling>	 search, I heard earlier
[21:41:15] <paravoid>	 ah, yes, that's still pending
[21:41:21] <chasemp>	 search is relatively late to the dallas ballgame.
[21:41:23] <paravoid>	 as is a solution for swift, I guess.
[21:41:43] <AaronSchulz>	 bblack: yeah, just a matter of checking HTTP method as well as the sticky cookie if set
[21:41:52] <RoanKattouw>	 I'm guessing it's gonna take a while to get MW core and its extensions to be multi-DC-ready? In terms of correct WanCache usage, and GET idempotence, etc
[21:42:02] <bblack>	 right, I'm just saying, all of the cookie part can be out at the edge layer.  we don't need to put MW code into it.
[21:42:25] <godog>	 we need to match swift capacity in codfw by three machine as in eqiad, in the next swift hw order
[21:42:27] <gwicke>	 the RB dallas hardware has an ETA of 9/15, so we might even have a small chance of making our goal of having replication set up by the end of this month
[21:42:30] <TimStarling>	 for swift it sounds like more development work in swift itself is needed
[21:42:34] <AaronSchulz>	 RoanKattouw: MW has come pretty far, and I'll probably finish that soon except a few bits.
[21:42:35] <bblack>	 and if we face split brain in that world, what will happen is half the world won't be able to do POST traffic at all, until we decide to DNS-move them off to the write-side of the split.
[21:42:44] <ori>	 #action To do sticky-cookie request scheduling, all we need to know is the HTTP method, so this could be implemented entirely at the edge layer. This still needs to be done.
[21:42:48] <DanielK_WMDE_>	 RoanKattouw: it could help to implement a check that would log a warning if a DB write is triggered during a GET request. easy enough via global state.
[21:43:10] <AaronSchulz>	 Those 'bits' being Flow & Echo
[21:43:11] <DanielK_WMDE_>	 that should surface most problematic code
[21:43:24] <AaronSchulz>	 they have tasks...hopefully legoktm will get to them ;)
[21:43:40] <ori>	 #action Flow and Echo still need to be made multi-DC-ready
[21:43:48] <legoktm>	 :|
[21:43:51] <AaronSchulz>	 I guess the AJAX rollback thing is still unassigned (as of now)
[21:43:52] <gwicke>	 bblack: for now, there would also be a lot of manual switching needed; my impression is that this is some ways out
[21:44:06] <AaronSchulz>	 maybe I can trick Krinkle into doing that
[21:44:31] <AaronSchulz>	 https://phabricator.wikimedia.org/T88044
[21:44:35] <bblack>	 gwicke: I'm talking about a scenario where we're not really moving which DC is active for writes.  we've just suddenly lost IP comms between the two DCs, but users can still hit the readonly DC
[21:45:12] <bblack>	 in that case, that half of the users become effectively readonly until ops notices the issue and flips full user load over to the write-DC at the DNS layer to work around it.
[21:45:25] <ori>	 OK, so goals for next Q are due for WMF teams later this month, and to protect resourcing for this project (rather than doom it to the "things people do in their spare time because management doesn't understand them") we need some sort of timeline
[21:46:21] <ori>	 Getting flow, echo, and ajax rollback ready is prerequisite, yes? and certainly doable in a quarter?
[21:46:43] <TimStarling>	 I was looking for a project management page since this seems quite complex from that perspective
[21:46:43] <AaronSchulz>	 as long as the right people are assigned, I don't see why not
[21:46:53] <TimStarling>	 I think I found it in the big list of blockers at https://phabricator.wikimedia.org/T88445
[21:47:17] <ori>	 #action Ori to create a workboard for multi-DC work
[21:47:40] <TimStarling>	 but maybe we need some prose or a gantt chart or something
[21:47:45] <robla>	 ori: thanks for thinking through this stuff
[21:47:46] <RoanKattouw>	 Thanks Ori, I was about to ask something like "where can I find a list of tasks that are in my team's wheelhouse that block this project"
[21:47:51] * AaronSchulz likes it when Tim says "prose"
[21:47:55] <brion>	 i think patch for ajax rollback is half done, it just needs a little more lovin' to complete it. anybody more interested in it than i am? :)
[21:48:03] <brion>	 heh
[21:48:06] <Krinkle>	 AaronSchulz: Relaying memcached deletes and purges - is that for the main cache, or for making wancache/stash/persisent stash use memcached.
[21:48:17] <Krinkle>	 ori: What is G235149?
[21:48:24] <TimStarling>	 there are a lot of bits to this, affecting a few teams, and ideally it would all come together at the same time
[21:48:30] <AaronSchulz>	 Krinkle: it means implementing the relay logic in WAN cache with the requisite services
[21:48:52] <TimStarling>	 so that we don't have e.g. hardware in codfw halfway through its warranty before it gets used
[21:49:01] <ori>	 #action Q2 goal: ensure all PHP code ready for multi-DC: Flow, Echo, Ajax rollback and the direct FileBackend callers (ConfirmEdit, Score, Math, GWToolset, timeline)
[21:49:02] <AaronSchulz>	 Krinkle: see the EventRelayer part
[21:49:03] <robla>	 I'm trying to figure out how to both help ori and AaronSchulz with protecting this work, and figuring out how to avoid a Gantt chart  :-)
[21:49:38] <bblack>	 from the Traffic perspective, we're just pending some network link turnups for fuller capacity (soon) and we'll have user traffic splitting well.  we already split a limit number of users to dallas.  for getting to the independent tier1s + sticky-dc-cookie, we've got to fix the traffic sec issues in T108580 , and then implement the VCL changes for stickie-dc-cookie.
[21:49:40] <ori>	 does EventLayer need a server?
[21:49:49] <AaronSchulz>	 bblack: is there any reason to assume container sync is still is crappy as it was?
[21:49:58] <bblack>	 my question on that stuff is, what kind of schedule am I on there? when does traffic have to be ready for the rest?
[21:50:01] <gwicke>	 an updated version of https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki#Deployment_steps, perhaps slightly more milestone-y could already be helpful
[21:50:06] <paravoid>	 AaronSchulz: swift container sync you mean?
[21:50:16] <paravoid>	 (bblack was talking about varnish)
[21:50:18] <legoktm>	 There are some tasks which apply to multiple extensions and it would it be easier to track and work on them if they had multiple subtasks (https://phabricator.wikimedia.org/T94480 applies to both PageTriage and Echo for example)
[21:50:20] <AaronSchulz>	 paravoid: yeah; how much does ops want to look into swift replication?
[21:50:27] <AaronSchulz>	 or should I assume it is a MW problem
[21:50:33] <ori>	 bblack: I think you are in a position to dictate rather than be dictated to. What do you think is reasonable?
[21:50:46] <paravoid>	 godog was experimenting with container sync some time ago, I'm not sure if he got anywhere
[21:50:49] <ori>	 How much time do you need? Would it be sensible to make it a Q2 goal?
[21:50:51] <legoktm>	 and if there was a #Multi-DC project, we could have a phab search for #XX-Team AND #Multi-DC
[21:51:10] <bblack>	 well reasonable would be not already having a year-long backlog of tasks.  I can prioritize, but I don't want to rush through this next quarter if the applayer can't do it by then yet anyways (multi-dc reads, etc)
[21:51:39] <ori>	 bblack: the few extensions that were brought up are a small minority; the bulk have already been migrated
[21:51:44] <godog>	 paravoid AaronSchulz I didn't go very far, upstream hasn't been investing a lot in it last I looked, I can timebox some time to look into that with recent versions of swift
[21:51:47] <bblack>	 traffic sec issues in T108580 are an unknown, it's a significant work item that could consume a quarter goal
[21:52:01] <paravoid>	 bblack: traffic security is slated for Q2 according to our roadmap
[21:52:07] <AaronSchulz>	 godog: they are pushing there geo-replication cluster stuff then I assume?
[21:52:08] <bblack>	 yeah, true
[21:52:22] <AaronSchulz>	 that would be risky to switch to probably, sigh
[21:52:35] <ori>	 (not contesting, but trying to understand) how is traffic sec a blocker? is it simply that it is a priority and needs attention from the same set of people?
[21:52:39] <AaronSchulz>	 it has affinity features in theory (accept DELETE afaik), but not sure if trust that yet
[21:52:49] <bblack>	 if we get that fixed up in Q2, then we're just looking at some moderately-complex VCL work for the cookies.  would not be a whole quarter's worth, just needs a couple weeks of dedication.
[21:52:50] <robla>	 bblack: I want to +1 ori here on telling us what you need.  you should be able to say "we need multiple quarters from multiple teams" if that's what its going to take
[21:53:12] <paravoid>	 AaronSchulz: I'd propose postponing the swift decision (or even discussion) for after this meeting, as what gilles is doing in this space (and other performance team's goals) might be relevant
[21:53:48] <paravoid>	 AaronSchulz: (and not split it just yet between "ops" and MW)
[21:53:59] <AaronSchulz>	 paravoid: yeah, if does content hashing (and some pixie dust for revdelete) then this would be a lot easier, for example
[21:54:06] <paravoid>	 right
[21:54:07] <ori>	 #action Swift to be discussed in follow-up meeting.
[21:54:08] <bblack>	 re traffic sec blocker: the problem is right now with a single tier-1 DC, we have traffic sec nailed down to some degree: we're not leaking user traffic.  if we don't solve the really deep traffic sec issues in T108580 before promoting codfw to a tier1 site at the traffic layer, then we regress and start exposing user traffic again (on the codfw<->eqiad link)
[21:54:23] <bblack>	 I probably need a quarter to fix that whole issue
[21:54:29] <TimStarling>	 ok, we've got about 7 minutes left, please write only summaries and #info/#action now
[21:54:38] <AaronSchulz>	 paravoid: we can always keep image CDN misses going to ashburn as I mentioned earlier
[21:54:40] <TimStarling>	 time to wrap up
[21:55:02] <AaronSchulz>	 bblack, paravoid, godog: do we agree on postponing swift then?
[21:55:18] <ori>	 #info (bblack) with a single tier-1 DC, we have traffic sec nailed down to some degree: we're not leaking user traffic.  if we don't solve the really deep traffic sec issues in T108580 before promoting codfw to a tier1 site at the traffic layer, then we regress and start exposing user traffic again (on the codfw<->eqiad link). traffic sec issues in T108580 are an unknown, it's a significant work item that could consume the quarter.
[21:55:36] <paravoid>	 AaronSchulz: ori already made it an action item :)
[21:55:39] <ori>	 oops
[21:55:43] <bblack>	 that works
[21:55:43] <ori>	 oh
[21:55:51] <AaronSchulz>	 haha, ok
[21:56:02] <gwicke>	 #info there is a strong desire to get a better understanding of what is left & where help is needed, so that other teams can help
[21:56:06] <ori>	 this has been extremely productive from my perspective. can we agree to do this regularly?
[21:56:15] <AaronSchulz>	 jynus: how much time do have to prod at T95501 ?
[21:56:20] <paravoid>	 AaronSchulz, ori et al: let's sort it out in the coming weeks before the end of Q1 between our two teams
[21:56:21] <ori>	 and is this the appropriate forum? (i think so, but checking.)
[21:56:22] <bblack>	 given this is a very large thing, I think it does need some regular syncups going forward
[21:56:34] <AaronSchulz>	 even if it's just file bugs about queries that suck (I don't mind fixing such things :) )
[21:56:36] <gwicke>	 bblack: yeah, agreed
[21:56:58] <AaronSchulz>	 as well as looking at the tooling aspects
[21:57:09] <jynus>	 most queries are identified, but needs work
[21:57:10] <TimStarling>	 you want another IRC meeting in a month or so?
[21:57:12] <ori>	 #action Schedule regular sync-ups for multi-DC work
[21:57:17] <godog>	 AaronSchulz: yep
[21:57:23] <TimStarling>	 or just a working group meeting?
[21:57:28] <jynus>	 RBR has a blocker which is labs replicas
[21:57:30] <ori>	 TimStarling: yes, if it's cool with AaronSchulz and everyone else
[21:57:36] <ori>	 I think this forum works well
[21:57:38] <robla>	 ori: yeah, agreed this is productive for regular sync ups
[21:57:45] <TimStarling>	 ok
[21:57:56] <AaronSchulz>	 meetings are fine
[21:57:58] <jynus>	 new hardware- to be deployed thoughout the year
[21:57:59] <AaronSchulz>	 not sure on the medium
[21:58:10] <gwicke>	 I wouldn't mind a conference call
[21:58:11] <ori>	 jynus: is that #action or #info?
[21:58:13] * AaronSchulz always things of IRC as a fallback
[21:58:24] <AaronSchulz>	 as in "tool X does not support this many people in a video conf" ;)
[21:58:36] <matt_flaschen>	 AaronSchulz, yeah, and keep the Collaboration team posted on what the priority blockers are for Flow and Echo (meeting might also help).  Having legoktm do everything is not the only option. :)
[21:58:43] <jynus>	 ori, info resumed on T95501#1577186 already to fix lag issues
[21:58:45] <ori>	 hangouts are great if we want to waste the first 40 minutes figuring out why it's not working
[21:58:47] <gilles>	 bluejeans would support that many people, wouldn't it?
[21:58:52] <gwicke>	 yeah
[21:58:52] <spagewmf>	 https://meta.wikimedia.org/wiki/IRC_office_hours isn't just for RFC meetings
[21:58:53] <robla>	 AaronSchulz: I'm willing to support whatever the main participants think is best
[21:58:58] <TimStarling>	 I think action items are generally short tasks that are mostly administrative
[21:59:11] <ori>	 #summary <jynus> new hardware- to be deployed thoughout the year
[21:59:15] <AaronSchulz>	 TimStarling: yeah
[21:59:19] <TimStarling>	 action items shouldn't be used for the actual work product
[21:59:25] <gwicke>	 bluejeans worked well the last time we used it, and it supports up to 100 participants
[21:59:38] <paravoid>	 bluejeans still hasn't launched WebRTC, has it?
[21:59:39] <ori>	 i'm going to just lump that into <ori> #action Schedule regular sync-ups for multi-DC work
[21:59:52] <AaronSchulz>	 ok, I can't think of anything I need to say before we wrap up
[21:59:55] <gwicke>	 paravoid: we used webrtc the last time, and didn't have issues
[22:00:04] <spagewmf>	 TimStarling is AaronSchulz 's  https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki#Proposal approved ?
[22:00:05] <AaronSchulz>	 we just need to follow up on getting meetings setup
[22:00:17] <robla>	 TimStarling: really?  I guess that makes sense if "file x in phabricator" is the administrivia
[22:00:19] <paravoid>	 ok, then I'd be willing to try it, although I'd honestly prefer IRC
[22:00:24] <paravoid>	 especially for this size
[22:00:25] <ori>	 Thanks SO MUCH to everyone who participated, I think that this has been extremely productive, and I feel a lot better about where we're at.
[22:00:28] * ori prefers IRC too
[22:00:38] * Krinkle also prefers IRC - asynchronous IO for the win.
[22:00:52] <RoanKattouw>	 Yeah IRC was great for this meeting
[22:00:53] * robla thinks IRC is working well for this group, and likes it too
[22:00:55] <chasemp>	 really agreed
[22:00:59] <TimStarling>	 I guess we can call it approved
[22:01:01] <paravoid>	 heh
[22:01:05] <RoanKattouw>	 I know how they say in-person is higher bandwidth, but with a group this big I don't think that's true
[22:01:10] <Krinkle>	 Also easier with less accent variation, and able to re-read, and natural notes taken :) –plus urls rich content
[22:01:15] <TimStarling>	 we can't approve every last detail but it is broadly going to happen
[22:01:23] <ori>	 yay
[22:01:32] <robla>	 also, notetaking is easier on IRC  :-)
[22:01:34] <TimStarling>	 #agreed RFC approved
[22:01:35] <gwicke>	 there's a lot to be worked out still, I think
[22:01:39] <TimStarling>	 #endmeeting
[22:01:41] <wm-labs-meetbot>	 Meeting ended Wed Sep  2 22:01:39 2015 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[22:01:41] <wm-labs-meetbot>	 Minutes:        https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-09-02-21.01.html
[22:01:41] <wm-labs-meetbot>	 Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-09-02-21.01.txt
[22:01:41] <wm-labs-meetbot>	 Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-09-02-21.01.wiki
[22:01:41] <wm-labs-meetbot>	 Log:            https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-09-02-21.01.log.html
[22:01:42] <ori>	 \o/
[22:01:47] <gwicke>	 but there's very broad support for the direction
[22:02:01] <spagewmf>	 \o/ , is it worth flagging the accepted/still-contentious parts of the RFC #Proposal section ?
[22:02:31] <robla>	 excellent work AaronSchulz!
[22:02:44] <ori>	 +1
[22:02:44] <gwicke>	 thanks, everybody!
[22:03:41] <matt_flaschen>	 Yep, this is great stuff.
[22:04:01] <TimStarling>	 there's always been a lot of difficulty in marking a work in progress as "approved"
[22:04:15] <TimStarling>	 or approving anything, really
[22:04:20] <chasemp>	 conceptual approval anyways :)
[22:04:32] <TimStarling>	 in practice only the shortest RFCs have been approved
[22:04:34] <ori>	 right, it doesn't mean "we'll plough on even if we discover a fatal flaw"
[22:04:37] <bblack>	 so we've approved that we need to move in the direction of having more things to approve? :)
[22:05:27] <TimStarling>	 I don't know if we should put this in the approved column on the phab workboard since that will make it less visible for scheduling
[22:05:57] <TimStarling>	 but we've been talking about redoing the columns anyway
[22:06:25] <TimStarling>	 maybe we should have "in progress/check in"
[22:12:58] <spagewmf>	 TimStarling, robla: anything in "approved" is not "implemented" (separate column) thus is in progress, I agree it's hard to know which RFCs to check up on
[22:14:03] <spagewmf>	 if you click the anchor you see most RFCs don't have a priority, so ArchCom could use that to flag the RFCs to attend to
[22:41:10] <robla>	 spagewmf: sorry I missed this part of the conversation.  Yeah, in short, I agree that we need more clarity here
[22:42:16] * robla contemplates which channel this conversation should move to, and tries to resist the temptation to create a new channel
[22:42:43] <spagewmf>	 robla: sure "Add a phab comment when RFC status changes" is good advice, but reviewing comments doesn't work to track RFCs
[22:43:19] <spagewmf>	 robla: there's #wikimedia-devtools for Phab, and maybe a Team Processes Group IRC channel ;-)
[22:43:41] * robla thinks #wikimedia-tech may be the right home for "Architecture" conversations
[22:46:32] <spagewmf>	 robla: sure, I'm there. The #wikimedia-devtools and #wikimedia-teampractices are good to get advice on detailed Phab usage
[22:50:04] * robla moves conversation over to #wikimedia-tech
[23:21:34] <robla>	 AaronSchulz: I wanted to talk about your RFC in #wikimedia-tech, but you aren't there