[00:15:53] <wikibugs__>	 10Traffic, 10MediaWiki-Cache, 06Operations, 06Performance-Team: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#3170181 (10Krinkle)
[00:18:01] <wikibugs>	 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Operations, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3170197 (10Krinkle)
[03:05:57] <wikibugs__>	 10netops, 06Operations: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#2913735 (10faidon) This is great to see and a very good catch. Nice work @ayounsi!
[09:09:51] <wikibugs__>	 10netops, 06Operations: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3170689 (10fgiunchedi) Indeed, thanks a lot @ayounsi for fixing this long-standing issue!
[11:21:39] <wikibugs__>	 10Traffic, 06Operations, 10media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3170871 (10fgiunchedi) As far as swift upstream is concerned this issue was raised before in https://review.openstack.org/#/c/150149/ but...
[11:52:15] <wikibugs>	 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3170959 (10ayounsi) return part UPS tracking#: 1Z81648Y9142072038
[12:28:33] <wikibugs>	 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10Marostegui)
[12:37:43] <wikibugs>	 10Traffic, 06Operations: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3171126 (10BBlack)
[12:38:00] <wikibugs>	 10Traffic, 06Operations: Server hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#3171129 (10BBlack)
[12:39:10] <wikibugs__>	 10Traffic, 06Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (10BBlack)
[12:39:46] <wikibugs__>	 10Traffic, 06Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171146 (10BBlack)
[12:40:22] <wikibugs>	 10Traffic, 06Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171160 (10BBlack)
[12:40:25] <wikibugs__>	 10Traffic, 06Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171161 (10BBlack)
[12:40:27] <wikibugs>	 10Traffic, 06Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3171162 (10BBlack)
[12:40:29] <wikibugs__>	 10Traffic, 06Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3171159 (10BBlack)
[12:41:39] <wikibugs>	 10Traffic, 06Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (10BBlack)
[12:41:42] <wikibugs>	 10Traffic, 06Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171146 (10BBlack)
[12:41:53] <wikibugs__>	 10Traffic, 06Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (10BBlack)
[12:41:55] <wikibugs>	 10Traffic, 06Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2962044 (10BBlack)
[12:42:17] <wikibugs__>	 10Traffic, 06Operations: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#3171168 (10BBlack)
[12:42:19] <wikibugs>	 10Traffic, 06Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (10BBlack)
[12:42:50] <wikibugs>	 10Traffic, 06Operations: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3171174 (10BBlack)
[12:42:52] <wikibugs__>	 10Traffic, 06Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3171173 (10BBlack)
[12:42:54] <wikibugs>	 10Traffic, 06Operations: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#2962033 (10BBlack)
[12:42:56] <wikibugs__>	 10Traffic, 06Operations: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3171176 (10BBlack)
[12:43:17] <wikibugs>	 10Traffic, 06Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3171177 (10BBlack)
[12:43:19] <wikibugs__>	 10Traffic, 06Operations: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#2962020 (10BBlack)
[12:44:22] <wikibugs>	 10Traffic, 06Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#2962007 (10BBlack)
[12:44:24] <wikibugs__>	 10Traffic, 06Operations: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#2968867 (10BBlack)
[12:44:26] <wikibugs>	 10Traffic, 06Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171179 (10BBlack)
[12:45:33] <bblack>	 phab-spam (sorry, trying to sort out the inter-dependencies in phab terms)
[12:45:56] <wikibugs>	 10Traffic, 06Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171146 (10BBlack) p:05Triage>03Normal
[12:46:18] <wikibugs>	 10Traffic, 06Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (10BBlack) p:05Triage>03Normal
[13:00:19] <ema>	 we apparently do still serve some images with CT: text/plain, but not because of the swift bug. They seem to just have the wrong CT 
[13:00:52] <ema>	 eg: http://ms-fe.svc.eqiad.wmnet/wikipedia/commons/c/ce/Antu_audio-off.svg
[13:01:15] <ema>	 godog: ^
[13:10:01] <godog>	 ema: interesting, I tried "file --mime" on that and indeed it thinks text/plain
[13:10:18] <godog>	 IIRC that's essentially what mw does at upload time, so that might be it
[13:12:44] <bblack>	 is there a way for the upload to correct an incorrectly-guessed mime?
[13:13:56] <bblack>	 I never really thought about this part of the problem, but it seems problematic :)
[13:14:33] <bblack>	 as the mime-type of a file is metadata, if it's not specified explicitly (e.g. by the uploader) any automatic mime-type is always a heuristic guess and subject to being wrong by nature
[13:15:16] <ema>	 good old file extension would be a good guess in this case though 
[13:16:26] <volans>	 I wouldn't trust it, but maybe have a whitelist of accepted mime for images, although I guess we support many of them and might be complex to keep it up to date
[13:17:17] <bblack>	 it would be easier to maintain a blacklist of unacceptable image mimes, like text/plain and text/html
[13:18:53] <godog>	 yeah I'm looking in maintenance/ and the closest seems to be refreshImageMetadata.php
[13:18:59] <ema>	 only SVG files seem to be affected BTW
[13:19:12] <godog>	 though applicable if e.g. the file magic db gets smarter
[13:20:23] <bblack>	 https://commons.wikimedia.org/wiki/Commons:File_types
[13:20:33] <moritzm>	 file upstream follows the Debian BTS, might be worth filing a bug there
[13:20:51] <bblack>	 I walked through the normal upload process just to be sure: it never asks explicitly for mime type, or offers any field to override it
[13:21:33] <bblack>	 I wonder if, for the limited file types commons allows uploading, extensions would be more-reliable than inferring from data headers
[13:21:47] <volans>	 that image is an uncompressed svg, so basically xml
[13:21:48] <bblack>	 (or check both and report some kind of correctable error if they don't match)
[13:22:01] <volans>	 it's strange how the text/plain was picked
[13:23:25] <volans>	 it doens't have the doctype
[13:23:35] <volans>	 that's seems the issue to me, has only the xmlns="http://www.w3.org/2000/svg
[13:23:36] <bblack>	 right
[13:23:47] <bblack>	 it's either not well-formed XML or not "valid" XML, I forget which is which
[13:23:58] <bblack>	 it should ideally have an <xml> tag and a DTD
[13:24:04] <bblack>	 (in xml terms)
[13:26:46] <volans>	 just adding <?xml version="1.0"?> make file --mime recognize it correctly, even without a doctype
[13:26:59] <bblack>	 the w3 validator parses that file as correct SVG, with 2 warnings about the missing xml charset declaration
[13:27:00] <volans>	 seems to me an "embedded" svg t
[13:27:37] <volans>	 maybe we could recognize and fix them at upload time, but I'm wondering if they are used in ajax calls to embed them directly in some other html 
[13:28:40] <bblack>	 adding this at the top fixes the only reasonable validator warning: <?xml version="1.0" encoding="utf-8" ?>
[13:29:13] <bblack>	 but http could specify the charset in the content-type and then even that wouldn't be necc
[13:29:16] <godog>	 funnily enough forcing the w3c validator with SVG then it warns also about missing doctype
[13:30:08] <bblack>	 anyways, bottom line is text/plan and text/html are not valid upload document types according to https://commons.wikimedia.org/wiki/Commons:File_types
[13:30:27] <bblack>	 if whatever mechanism detects it as those, it should reject the upload (or to be nicer, offer some workaround or advice for the user to fix it)
[13:30:38] <volans>	 +1
[13:32:29] <godog>	 +1 too
[13:37:21] <ema>	 cache_misc upgraded to linux 4.9, not a single issue today :)
[13:38:09] <XioNoX>	 bblack: during the switchover eqiad will not serve traffic at all, or be similar to a caching site?
[13:38:22] <XioNoX>	 especially about the CP* servers
[13:38:42] <bblack>	 XioNoX: it will be capable of handling user traffic, but it will be disables in DNS and inter-cache routing, so there shouldn't be much
[13:38:55] <bblack>	 (but there will be a handful of users with broken DNS, plus our own internal+external monitor reqs, etc)
[13:39:50] <XioNoX>	 but it's not a big deal if 4 of its cp* servers go down a bit of time while we move them to a new rack
[13:40:33] <bblack>	 ah that
[13:40:46] <bblack>	 we can depool individual cp servers from traffic, too
[13:40:58] <bblack>	 unless all 4 are in the same cluster and it's a 4-node cluster heh
[13:41:10] <bblack>	 which ones is it?
[13:42:50] <XioNoX>	 bblack: cp1071 to cp1074 ( https://racktables.wikimedia.org/index.php?page=rack&rack_id=2110 )
[13:43:43] <bblack>	 they're all cache_upload
[13:43:54] <bblack>	 but yeah, we can just depool them ahead of the move
[13:44:14] <bblack>	 all 4 at once is kind of heavy to depool in normal times, but during the switchover period it should be no big deal
[13:45:04] <XioNoX>	 we can look at how little traffic they get before as well, to be sure. but they will have to move at one point anyway
[13:46:42] <bblack>	 the traffic will be minimal while we're on codfw
[13:47:32] <bblack>	 it'd be nice at some point to re-evaluate eqiad's cache layout in terms of rack rows anyways
[13:47:47] <bblack>	 but probably not worth the disruption to try to rebalance it now, better to wait for the next hw refresh
[13:48:21] <bblack>	 (which is coming up this next FY)
[13:49:58] <volans>	 bblack, ema: when you can, kindly reminder to review and update the traffic switchover documentation at [1]. Also it will be nice to replace salt commands with cumin (I can help for the "translation" of course).
[13:50:03] <volans>	 [1] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Traffic
[13:50:41] <bblack>	 volans: yeah we need to discuss all of that I guess.  I'm again assuming the core "switchdc" stuff is all about the MW/RO process, not the rest
[13:51:00] <bblack>	 the rest, it's not time-critical/outagey, and we're doing it a day ahead async, etc
[13:51:37] <bblack>	 also we had Swift specially-documented last year, I don't know that it needs to be, or what we're changing there
[13:51:51] <bblack>	 (or whether we can just go a/a this week before we get there)
[13:52:15] <volans>	 bblack: yes, excatly to discuss about those stuff
[13:52:29] <bblack>	 the steps need editing on wikitech in any case, I'm just kinda waiting to edit them until we know for sure what we're doing
[13:52:45] <volans>	 for the MW switch yes it MW-focused but we have the step in the middle for switching text_cache as discussed with joe the other day
[13:53:03] <volans>	 so I want to be sure that that's ok and doens't need refactoring
[13:53:10] <bblack>	 there shouldn't be any step related to cache_text as a whole
[13:53:14] <bblack>	 in the MW stuff
[13:53:16] <bblack>	 just app_directors
[13:53:52] <bblack>	 (maybe that's what you meant)
[13:53:52] <volans>	 I mean, running puppet on the cache_text hosts in eqiad first and then in codfw
[13:53:57] <bblack>	 right, ok
[13:54:21] <volans>	 then, to be sure that the documentation on the wiki is up-to-date with what we'll do this year
[13:54:32] <volans>	 for swift, I've already pinged godog to update it
[13:55:10] <bblack>	 godog: ping re: all things swift x-dc: (1) Can we go active/active for orig and/or thumbs now as our normal default going forward? (2) How we're handling the switchover period, which kinda depends on (1)
[13:55:15] <volans>	 I'm also finishing a more detailed page with all the switcdc automated steps
[13:55:19] <volans>	 so that they are clear
[13:55:47] <bblack>	 What about "Services" at the bottom - is that translating to exact commands/commits, or just left there as a guide as it is now in more-general terms?
[13:56:05] <bblack>	 (I mostly ask as a guideline for level of detail elsewhere for async steps the day before)
[13:56:05] <volans>	 that also need to be translated 
[13:56:19] <volans>	 although I know that joe can switch them blindly
[13:56:47] <volans>	 it would be nice to have the commands so to 1) not have to think too much during the switch and 2) be able for other people to do/review them
[13:56:52] <bblack>	 right
[13:57:11] <bblack>	 I can handle all the traffic-level stuff blindly too, but then that kinda defeats documenting things :)
[13:57:20] <volans>	 right :D
[13:58:07] <bblack>	 should we talk about timeline in the main wiki page, or re-order it based on the approximate timeline, etc?
[13:58:28] <bblack>	 or just the method for executing each service's change, but leave timeline for this year elsewhere?
[13:59:05] <volans>	 the top part has indicative times for all the pieces, or at least it should
[13:59:16] <volans>	 not sure if we want to add all of them to the deployment calendar or similar
[13:59:19] <volans>	 to have a clear timeline
[13:59:22] <volans>	 visually
[14:05:01] <bblack>	 in any case, barring necessary complications related to replication or whatever, swift is basically just another a/a application service for the cache clusters
[14:07:42] <bblack>	 volans: is there a topic or commitmsg convention for pre-staged commits to be used during the switch process?
[14:08:31] <volans>	 bblack: I don't think it exists yet, but let's agree on one
[14:08:32] <bblack>	 codfw-switchover-2017 as a topic label maybe?
[14:08:47] <volans>	 sure, sounds good
[14:12:45] <bblack>	 https://gerrit.wikimedia.org/r/#/q/topic:codfw-switchover-2017
[14:13:09] <bblack>	 those two commits are it for the cache-level traffic part a day before
[14:13:31] <bblack>	 + a cumin command to push the puppet one, and a standard authdns-update for the dns one
[14:13:47] <volans>	 ok
[14:13:57] <bblack>	 I'll fix up wikitech around those
[14:15:46] <volans>	 thanks! I'll tell the others about the topic so to all use the same
[14:16:17] <volans>	 are you optimistic to not have another switchover this year? :)
[14:17:44] <bblack>	 who knows, not planned yet is it? but it's been a year since the last :)
[14:18:04] <volans>	 yeah, I think the wish would be to aim at 6 months
[14:18:07] <volans>	 but who knows
[14:18:22] <bblack>	 so the next one can be codfw-switchover-2017-october or something :)
[14:18:53] <volans>	 yeah
[14:19:39] <bblack>	 really, for the day-ahead async stuff, we don't have to care about immediate push with cumin, we can just let it roll on normal agent timing
[14:19:49] <bblack>	 but either way
[14:20:20] <volans>	 sure, fine for me too
[14:20:21] <bblack>	 what's the standard way to invoke puppet from cumin with correct exit codes if you don't care about verbose output?
[14:20:31] <bblack>	 just "puppet agent"?
[14:20:51] <volans>	 run-puppet-agent -q if you don't care about the output
[14:20:54] <bblack>	 ok
[14:21:04] <volans>	 first example here: https://wikitech.wikimedia.org/wiki/Cumin#Cumin_CLI_examples_in_the_WMF_infrastructure
[14:21:34] <volans>	 and you probably want a selector like:
[14:21:40] <volans>	 R:class = profile::cumin::target and R:class%cluster = cache_upload and R:class%site = codfw
[14:21:53] <volans>	 to be adapted to the DC and cluster that you need of course
[14:22:17] <volans>	 unless cpN* is ok for the task
[14:22:19] <volans>	 :D
[14:24:55] <godog>	 bblack: yes I think we can go a/a with upload varnish this year, during the switchover I was thinking of essentially porting the istructions at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Media_storage.2FSwift to the current way hieradata is configured with app_directors and so on
[14:25:13] <godog>	 which is related to the question I had for you on how to do that
[14:26:57] <bblack>	 yeah I can do that part
[14:27:40] <bblack>	 the big question is whether we can go a/a *now* (as in, say, from today forward or tomorrow forward), and assume an a/a world from now on.  In which case the April stuff is just disabling the eqiad side then turning it back on.
[14:27:57] <bblack>	 or whether we're not ready to do that now, in which case it's an actual switch from one side to the other or whatever
[14:30:31] <bblack>	 godog: well also the other question is whether you have any non-traffic procudures that need to be timed with this.  e.g. changing something about replication just before/after.
[14:30:55] <volans>	 I guess swiftrepl
[14:31:09] <volans>	 at least
[14:31:12] <bblack>	 this is for the user-facing part with cache_upload
[14:31:25] <bblack>	 swiftrepl I think is more about the mediawiki side of things, right?
[14:31:31] <volans>	 yes
[14:31:56] <volans>	 that I think we agreed to just stop before the MW switchover and start it again inverted after
[14:31:59] <volans>	 it's async
[14:32:04] <volans>	 can be down for a bit
[14:32:15] <volans>	 MW takes care of writing in both DCs anyway
[14:32:18] <bblack>	 ok yeah that part I'm not asking about
[14:32:35] <bblack>	 this is for switching the upload.wikimedia.org public traffic => ms-fe
[14:33:02] <volans>	 yeah sorry got confused ;)
[14:33:18] <paravoid>	 I don't think we can, by virtue of MediaWiki not enforcing strong consistency between the two backing stores
[14:33:38] <bblack>	 godog: maybe a better question to ask to sync up: what's blocking us from turning on a/a for upload.wm.o=>swift today?
[14:33:41] <paravoid>	 aiui mediawiki won't fail an upload if it only gets uploaded to the "primary" swift cluster but fails to be uploaded to the secondary
[14:34:03] <paravoid>	 this means that inconsistencies may happen and will only be reconciled hours later by swiftrepl (hopefully, probably etc.)
[14:34:14] <paravoid>	 so I don't think it'd be safe to serve traffic in an active/active manner
[14:34:25] <bblack>	 ok
[14:34:32] <paravoid>	 it may work in the majority of the use cases but weird issues may creep in I think
[14:35:07] <paravoid>	 of course if we think that's valuable from an end-user and/or load perspective, we can talk with $whoever about implementing strong consistency between the swift clusters
[14:35:07] <godog>	 yeah that's my understanding as well, I'll ask Aaron too either today or tomorrow ahead of the meeting
[14:35:15] <bblack>	 so next question after that would be, I guess, can we tolerate a short window of a/a during the switch?
[14:35:25] <bblack>	 or do we have to outage the service briefly during the switch?
[14:35:39] <bblack>	 (again, the switch of public user traffic routing into swift ms-fe)
[14:36:22] <bblack>	 and then the next question is: do other non-traffic steps need to be coordinated with the switch of users from one side to the other?
[14:36:43] <godog>	 I think a/a during the switch like we did last year is tolerable, and no the mw-config change from last year to move "async" to "sync" is permanent now
[14:36:52] <volans>	 or does this mean that swift is thightly coupled with MW and need to be switched during the MW RO period?
[14:37:09] <bblack>	 ok
[14:37:36] <bblack>	 so if no other steps need coordinating and a short a/a is acceptable (which is great, because we don't yet have a way to do it outage-style at the traffic layer!)...
[14:37:48] <paravoid>	 to answer an earlier question
[14:37:53] <bblack>	 then moving the public-facing swift traffic is just like any other service (e.g. on cache_text)
[14:38:08] <paravoid>	 I don't think switchdc is as a project mediawiki-specific (or appserver-specific)
[14:38:36] <bblack>	 do you mean switchdc the script or ?
[14:38:36] <paravoid>	 I'd like it to be the tool that handles datacenter switches across all the different components, incl. traffic
[14:38:41] <paravoid>	 switchdc the script, yes
[14:38:53] <paravoid>	 maybe we can't do it this time around, but it's by no means a design choice I'd say
[14:39:02] <bblack>	 paravoid: this gets into a whole other discussion - should it also be the tool we use in a real outage?
[14:39:26] <bblack>	 (I'd think the right long-term answer is yes, maybe with a different flag or something)
[14:39:30] <volans>	 bblack: in the long term I think so, but the tasks will need to be outage-aware
[14:39:35] <paravoid>	 agreed :)
[14:39:38] <paravoid>	 with both of you :)
[14:40:03] <bblack>	 from a higher-level pov... the procedure we're executing now basically won't work for a real outage, on several different levels
[14:40:12] <bblack>	 it's something we need to converge towards
[14:40:27] <paravoid>	 sure
[14:41:00] <bblack>	 so that eventually we have e.g. "switchdc --safe" (for planned moves/testing) vs "switchdc --outage" (does not depend on dead side, moves fast and unsafe on the assumption things are already broken and dead side is dead)
[14:41:33] <paravoid>	 nod
[14:41:42] <bblack>	 so here's the thing that bothers me in the present:
[14:41:45] <paravoid>	 and some kind of stonith etc. :)
[14:42:08] <bblack>	 (and yeah let's rewind to the stonith topic later if we have time...)
[14:42:38] <bblack>	 we could have a DC outage at any time, it won't necessarily wait for us to perfect all of this
[14:43:02] <bblack>	 and some of the changes we've made to how we're handling these things are moving us in the opposite direction of quickly handling a real outage
[14:43:18] <paravoid>	 like what?
[14:43:35] <bblack>	 e.g. a few years ago it might've been brutal on a different level since we didn't even have good documentation of everything we need to reconfigure, etc...
[14:44:10] <bblack>	 but now we're resetting our brains around the idea that these steps we're doing in the test-failovers are some kind of guideline at least for how we'd handle the real thing
[14:44:35] <bblack>	 but they would totally fail and totally fail to account for some aspects of the real thing, and we haven't talked about that much or figured out how to work around it
[14:44:47] <bblack>	 I see it as "forward progress on planned switches, at the cost of backwards progress on smoothly handling real ones"
[14:44:57] <bblack>	 (until some later date when it all reconverges)
[14:45:40] <bblack>	 e.g. almost everything relies on confctl/etcd, and as soon as eqiad powers off confctl/etcd is broken, and we have no documented procedure for handling that as Step 0
[14:45:51] <bblack>	 (which AIUI involves manually updating authdns first)
[14:48:40] <bblack>	 (and then have we ever tested how things reconverge when we do that, vs software timing out or getting broken somehow when it first can't reach eqiad for etcd)
[14:49:14] <volans>	 I think this should be a separate effort, that indeed we should do while going towards everything a/a
[14:49:32] <bblack>	 sure
[14:49:42] <volans>	 that involves also more testing in a lab those scenarios
[14:50:01] <bblack>	 but my point is, the etcd-failover problem is a new one we've introduced along the way towards a better failover procedure.  We've essentially regressed in some ways for handling the real-case.
[14:50:41] <paravoid>	 yeah, ok
[14:51:04] <paravoid>	 one of the things that I'm planning to do is to make a plan for next steps after this failover test is done
[14:51:21] <paravoid>	 we have accumulated a few longer-term actionables
[14:51:29] <bblack>	 the stonith problem is a whole separate nasty issue that's unrelated to all the above really.
[14:51:43] <paravoid>	 yeah, that too
[14:51:59] <bblack>	 there isn't any way, afaik, for us to actually stonith the whole eqiad deployment once we start switching to codfw after an eqiad outage.  and it could bounce back or flap and wreak havoc.
[14:52:18] <paravoid>	 depends on the eqiad outage
[14:52:35] <paravoid>	 which layer of eqiad died for instance
[14:52:42] <bblack>	 one thing we can do, is once we make the call, we can shut off all our wan interfaces from others sites <=> eqiad to isolate it from them (on the other sites' routers)
[14:52:59] <paravoid>	 e.g. we could conceivable lose power in 2/4 rows and decide that it's large enough to do an emergency switchover to codfw, even though eqiad isn't completely dead
[14:53:14] <paravoid>	 it's an outage large enough*
[14:53:30] <bblack>	 but this leaves open the possibility that it can flap-up and still connect to our transit/peers and advertise address space (including DNS that still points users towards that address space)
[14:53:40] <paravoid>	 well
[14:53:43] <bblack>	 which breaks users, but not our remaining infra
[14:53:56] <paravoid>	 there's always the option of asking chris (or smarthands) to just pull the cables of our transits
[14:53:59] <paravoid>	 :)
[14:54:07] <paravoid>	 pretty good way to stonith :)
[14:54:10] <bblack>	 well....
[14:54:33] <paravoid>	 if the event is e.g. a complete power ourage
[14:54:34] <bblack>	 we can wargame scenarios all day and never iterate them all or predict how they turn out
[14:54:42] <paravoid>	 yup
[14:55:06] <bblack>	 but it's certainly possible I think, that we could end up in a situation where nobody can go inside eqiad to make changes (or is allowed, or can drive there) and power to all of eqiad dies, then flaps back.
[14:55:56] <bblack>	 and death->powerup time interval could easily be perfectly-wrong: just long enough we assume it's dead and start switching over, but just short enough that there's no good information from the site, and nobody can get there and do anything
[14:56:00] <bblack>	 e.g. ~30 minutes
[14:58:08] <bblack>	 so, I think we can eventually reliably-automate switching everything to codfw-active on a software level while eqiad's dead, and I think we can also software-automate shutting off wan links to eqiad to prevent it harming codfw when it flaps back
[14:58:21] <bblack>	 I just don't see a good answer to prevent it bringing up transit/peering and harming users when it flaps back
[14:59:01] <bblack>	 I'm hoping there's some network magic by which we can advertise from our other sites to not trust eqiad, using something about bgp path lengths and metrics or whatever and some pre-configuration for this scenario
[14:59:24] <bblack>	 (e.g. advertise a better route for eqiad's IPs from codfw than what it would normally advertise itself)
[14:59:27] <volans>	 worst case scenario should be just a spike, in the sense that once flap back we can connect and manually shut down the network/dns I guess
[14:59:42] <bblack>	 yeah, hopefully
[14:59:43] <volans>	 and for data corruption we could have stopped replicas from eqiad to codfw
[14:59:55] <bblack>	 it will be in the midst of a lot of chaos though.  it would be nice to have a more-solid answer
[15:00:02] <volans>	 I agree
[15:00:26] <volans>	 if there is a better way to blacklist it from outside
[15:00:48] <bblack>	 yeah I don't know the bgp-level engineering well enough
[15:01:32] <bblack>	 but you'd think there would be some way to essentially add some small negative metric to all our normal route advertisements, so that in an emergency we can have one site advertise another at a better metric and effectively take it over forcefully
[15:02:34] <volans>	 not sure that might work also for anycast though... (assuming we'll go that way for the DNS)
[15:02:43] <bblack>	 yeah
[15:02:53] <bblack>	 anycast throws another wrench in those ideas for sure :)
[15:03:32] <bblack>	 I don't think public bgp has a way to advertise MED type stuff right? so people tend to hack it with path-length by repeating ASes?
[15:04:00] <bblack>	 anyways
[15:04:59] <XioNoX>	 either repeated AS, or advertise a smaller prefix (2*/24 instead of a /23)
[15:05:32] <volans>	 right
[15:06:45] <bblack>	 ah interesting
[15:06:46] <XioNoX>	 it's things that are seen worldwide though, so it takes time for the word's routers to converges
[15:07:02] <bblack>	 peers would re-consolidate the 2*/24 into a /23 when re-advertising to others?
[15:07:17] <paravoid>	 no
[15:07:20] <XioNoX>	 they should not
[15:07:30] <paravoid>	 I don't think the scenario you're describing is very realistic
[15:07:35] <bblack>	 that's really interesting, and seems more reliable than adding repeat-AS stuff
[15:07:46] <bblack>	 but we don't have /23 everywhere anyways heh
[15:08:10] <paravoid>	 eqiad dies and we have the time to change our BGP announcements on our backup site (dangerous by itself!) but not the time to tell them to just pull the power cords from our routers
[15:08:12] <XioNoX>	 I haven't read the whole conversation, I only started at BGP :)
[15:08:15] <paravoid>	 or the patch cords from our transits
[15:08:38] <bblack>	 paravoid: define "realistic" :P odds are high neither site will experience this level of event over a 10 year period or whatever
[15:08:50] <paravoid>	 well yeah ok, sure
[15:09:02] <paravoid>	 but I mean even in the case such an even happens (*knocks wood*)
[15:09:31] <bblack>	 in the case this happens, my view is the affected (seems dead) site - everything about it is likely to be in chaos
[15:09:41] <bblack>	 there's a weather event or regional power grid outage going on
[15:10:07] <bblack>	 local people may not be able to move around the local city.  communications (e.g. from DC staff) may be spotty.  people will take time to figure out what's happening.
[15:10:45] <bblack>	 a power flap down and then backup inside an hour, and us not being able to reliably execute any operation physically inside during the DC during that time, seems a reasonable possibility.
[15:10:48] <paravoid>	 ok, 20 minutes until meetings start, and I have to finish the QR slides
[15:10:54] <paravoid>	 so I'll have to cut this short, sorry :(
[15:11:02] <bblack>	 :)
[15:12:29] <bblack>	 TL;DR - DC-level stonith is what we really need, but it's not possible.  But DC-level we might be able to tell our own other sites and the world to treat eqiad as if it were shot in the head.
[15:17:02] <paravoid>	 TL;DR there's a lot of future work to be done in this area :P
[15:17:30] <volans>	 :D
[15:18:41] <bblack>	 yeah stonith is just future-thinking
[15:19:21] <bblack>	 I worry more about the need in the short-term to document how new tooling we're introducing to our operational methods, in the interest of smoother planned-fail, can be used (or worked around) in a real-outage scenario.
[15:20:19] <volans>	 agree as long as we document also old non a/a tools (gerrit & co.)
[15:21:14] <bblack>	 (commit directly on puppetmasters is basically the answer)
[15:21:21] <volans>	 all of them
[15:21:42] <bblack>	 well, that's up to puppet-merge's design really
[15:21:51] <bblack>	 ideally it should handle syncing to the remaining live ones
[15:22:34] <volans>	 IIRC it depends on git when propagating
[15:24:29] <bblack>	 yeah, that's another problem, kinda similar to the etcd-in-eqiad problem
[15:24:39] <bblack>	 these are all tooling issues in handling the outage case
[15:25:12] <bblack>	 as we build up from the old manual BS of how we would've handled the outage case in the past to relying on tools, we need to know that the tooling works for the outage-case too
[15:25:37] <bblack>	 (or else we're just making the outage-case more confusing and difficult as we go)
[15:54:33] <wikibugs>	 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3171998 (10ayounsi) Talked to Chris and Brandon, we're going to aim for doing the work on Wednesday April 26.
[16:34:17] <_joe_>	 are you merging varnish changes right now? I would like to test the disable puppet - run in codfw - run in eqiad
[16:34:27] <_joe_>	 dance we will have to do on the switchover day
[16:34:34] <_joe_>	 just without merging the patch
[16:34:55] <volans>	 and we prefer with a patch that change something, a noop puppet run is probably much quicker
[16:35:20] <_joe_>	 volans: not so much, I think it's ~ 2-3 seconds faster AFAIR
[16:35:27] <volans>	 you sure?
[16:35:31] <_joe_>	 so I'm ok to test without a patch
[16:35:39] <_joe_>	 it's still a order of magnitude estimate
[16:35:42] <volans>	 ok
[16:36:04] <_joe_>	 bblack: ok to proceed with doing some puppet runs on cache::text?
[16:39:40] <_joe_>	 I'll assume a noop is ok since I see no patches flying
[16:39:48] <bblack>	 yeah I'm in meetings, things are normal on aches
[16:39:50] <bblack>	 *caches
[16:40:00] <bblack>	 afaik
[16:40:02] <_joe_>	 ok sorry
[16:40:08] <bblack>	 modulo maybe checking ema
[16:40:57] <bblack>	 you could do a small VCL patch on one on the text-backend VCL if you want
[16:41:00] <bblack>	 e.g. add a blank line
[16:41:07] <bblack>	 it would simulate the same effect in practice
[16:41:31] <bblack>	 (in terms of puppet mechanics or whatever)
[16:41:40] <volans>	 do you think it would be much slower?
[16:41:50] <volans>	 othwerwise not worth
[16:41:59] <volans>	 or we can do it at the next vlc patch you'll hae
[16:42:06] <bblack>	 no idea, but at least it pushes an actual diff/change of one of the VCL files, triggers reloads
[16:42:16] <bblack>	 changes exit status, etc
[16:42:43] <bblack>	 the app::directors change will do those same things (but with real effects in the content of the VCL instead of a blank line)
[16:43:25] <volans>	 ok, thanks
[22:49:54] <wikibugs>	 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching): Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134380 (10bmansurov) I'm late to the party, but I'd like to make a couple of points below.  1. Would it make sense to separa...