[02:32:10] 10Traffic, 10Varnish, 06Operations, 06WMF-Design, and 2 others: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3156017 (10Krinkle)
[02:32:25] 10Traffic, 10Varnish, 06Operations, 06WMF-Design, and 2 others: Better WMF error pages - https://phabricator.wikimedia.org/T76560#805779 (10Krinkle)
[05:44:47] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10Pokefan95) All files listed here works for me except https:/...
[05:47:27] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3156223 (10Pokefan95) https://upload.wikimedia.org/wikipedia/commons/th...
[06:01:06] 10Traffic, 06Operations, 05MW-1.28-release (WMF-deploy-2016-08-09_(1.28.0-wmf.14)), 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#3156237 (10Tbayer) >>! In T107430#2886903, @Tbayer wrote: >>>! In T107430#2882009, @fgiunchedi wrote: >>>>! In T107430#288195...
[07:54:54] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3156449 (10faidon)
[08:42:20] ok I've merged the X-Orig-Content-Type workaround and lowered keep on upload backends to 1d
[08:43:37] later on we can re-ban all objects on upload with CT: textish and keep an eye on the rate of such objects coming out of the frontends
[08:43:47] should be less than now hopefully :)
[08:43:52] :D
[08:44:17] volans: good morning sir
[08:45:56] good morning to you sir, thanks for trying hard to prevent me from patching swift ;)
[08:48:30] volans: so let's say I'm going to use cumin for the bans
[08:48:34] does this:
[08:48:41] cumin -b 1 'R:class = role::cache::upload and *.eqiad.wmnet' "varnishadm ban 'obj.http.content-type ~ \"^text\"'"
[08:48:47] look functionally equivalent to this:
[08:48:54] salt -b 1 -v -t 30 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run "varnishadm ban 'obj.http.content-type ~ \"^text\"'"
[08:49:15] * volans checking
[08:51:31] so practically yes, formally you could use this other way of querying with the recent changes in puppet (I need to update the docs)
[08:51:39] R:class = profile::cumin::target and R:class%cluster = cache_upload and R:class%site = eqiad
[08:51:57] the result is the same in this case because you don't have any .wikimedia.org hosts in the equation
[08:52:09] for the timeout, is that important? do you want a timeout?
[08:52:46] mmh no I don't think so, the varnishadm command should return immediately
[08:52:52] also if you want to sleep between hosts add -s N
[08:53:01] -s BATCH_SLEEP, --batch-sleep BATCH_SLEEP
[08:53:10] advanced! thanks!
[08:53:16] and it accepts a float :D
[08:53:44] now for the tricky part :)
[08:53:46] also, what is the exit code of varnishadm?
[08:53:55] * volans hopes 0
[08:53:57] 'G@cluster:cache_upload and not G@site:codfw and not G@site:eqiad'
[08:54:16] I guess the new syntax would come in handy in this case?
^
[08:54:43] 'R:class = role::cache::upload and not *.eqiad.wmnet and not *.codfw.wmnet'
[08:54:46] with your syntax
[08:55:14] 'R:class = profile::cumin::target and R:class%cluster = cache_upload and not R:class%site = eqiad and not R:class%site = codfw'
[08:55:17] with the other
[08:55:39] perfect
[08:55:43] unfortunately PuppetDB API v3 does not support regex on the resource parameters
[08:55:46] oh, and for your question above:
[08:55:46] v4 will
[08:55:46] If a command is given, the exit status of the varnishadm utility is zero if the command succeeded, and non-zero otherwise.
[08:55:56] which seems reasonable
[08:56:05] ok, this is because -b 1 means you run it on one host at a time
[08:56:22] if it fails (!= 0) cumin stops, does not schedule the next one
[08:56:35] if the -p 0-100, --success-percentage 0-100 is left at the default of 100%
[08:57:27] yeah that's what we want. If the ban command fails that probably means the syntax is wrong (or worse!)
[08:58:48] yeah, if you know that one host might be misbehaving or is down, you can use -p 90 or whatever percentage can account for small failures
[08:59:25] and thanks for trying cumin! :D
[09:23:09] volans: so I've looked at the diff between swift 1.13.1 and 2.2.0, the fix seems easy
[09:23:15] https://phabricator.wikimedia.org/P5207
[09:23:28] thoughts?
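A sketch of the batching semantics described above at 08:56, assuming nothing about cumin's real internals (the function name and arguments here are made up for illustration): with `-b 1` and the default `--success-percentage` of 100, the first non-zero exit code means no further hosts are scheduled.

```python
# Illustrative model (NOT cumin's actual API): run a command host by
# host in batches; stop scheduling new batches once enough failures
# have accumulated that the success-percentage target is unreachable.
def run_batched(hosts, command, batch_size=1, success_pct=100):
    results = {}
    failed = 0
    for i in range(0, len(hosts), batch_size):
        for host in hosts[i:i + batch_size]:
            rc = command(host)  # exit code, 0 = success
            results[host] = rc
            if rc != 0:
                failed += 1
        # With success_pct=100, any single failure aborts the run.
        if (len(hosts) - failed) * 100 < success_pct * len(hosts):
            break
    return results
```

With `success_pct=90` and a fleet where one host is known to be down, the run keeps going past that single failure, matching the `-p 90` suggestion in the chat.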
[09:25:19] not tested and all, but at first glance I can't see other changes that might affect IMS behavior
[09:25:56] upgrading the entire swift cluster to jessie (or potentially stretch) is part of the current Q goal for swift, so I think the 1.13.1 installations will be gone quite soon
[09:28:05] ema: no I think the change should be different, that logic is already done in server.py and raises an HTTPNotModified that calls status_map[304], which calls an HTTPException that calls Response.__init__(self, *args, **kwargs) with status=304
[09:28:17] so in class Response you'll already have self.status = 304
[09:28:27] and we can just drop the CT when this is the case
[09:29:38] s/raises/returns/
[09:32:56] volans: mmh, to me it seems that in Response._response_iter they do take care of conditional responses with etag and INM by returning with status 304 and CL: 0 immediately, but they don't take the IMS case into account in 1.13.1
[09:33:06] and that's fixed in 2.2.0
[09:33:51] in 2.2.0 the whole logic is different
[09:34:24] all GETs call resp = request.get_response(response)
[09:34:54] and delegate the behaviour to the Response class
[09:40:52] but I don't have time to dig more/test it now unless the hotfix doesn't work and we need it
[09:41:02] ;)
[09:54:24] also ema, I don't think your fix will fix it, because the CT is set from _resp_content_type_property() that looks into self.headers, which is already populated with the default one
[09:55:07] maybe something like this could work:
[09:55:07] def getter(self):
[09:55:07] - if 'content-type' in self.headers:
[09:55:08] + if 'content-type' in self.headers and self.status != 304:
[09:58:59] or unset the value in self.headers at some point, but it might be harder to find the right place
[10:06:40] volans: I think you've confused 1.13.1 with 2.2.0 :)
[10:06:54] the HTTPNotModified stuff is in 1.13.1 and has been removed in 2.2.0
[10:07:09] yes, and we need to fix 1.13.1
[10:07:14] that is the broken one
[10:07:27] if
the hotfix doesn't work and we cannot wait for the swift upgrade
[10:08:53] right, so probably the proper fix (again, if the VCL workaround doesn't work) would be to cherry-pick the change where they remove HTTPNotModified and introduce the diff I pasted above
[10:09:05] let me dig for that commit, I've seen it somewhere
[10:10:45] I think this is it https://github.com/openstack/swift/commit/62a1e7e0593723ff19a83c1533b02f6f444f6942
[10:11:11] at any rate, it would be nice to be able to test this stuff
[10:11:37] do we know if it's ok to try out things on the swift hosts in deployment-prep?
[10:11:56] * volans #noidea
[10:12:19] ema: it is ok but better to log it in #wikimedia-releng
[10:12:39] ema: that change also removes that logic from server.py, it's basically moving it
[10:12:45] from server.py to swob.py
[10:13:00] and no, I don't think it fixes the issue
[10:13:06] because the CT header is already set
[10:13:09] ema: I would say that beta is exactly for that :-}
[10:13:34] ema: assuming we have a swift 1.13.1 in deployment-prep, have you checked/
[10:13:37] ?
[10:13:39] we do, yeah
[10:13:44] and the bug is repro'd there
[10:13:48] ok
[10:14:19] but note on beta the servers use Ubuntu Trusty
[10:14:32] exactly what we needed hashar ;)
[10:14:37] yeah
[10:14:51] comes from trusty-updates/universe
[10:15:00] hashar: awesome, I'll play around then!
[10:15:03] isn't production Swift on Jessie already?
[10:15:14] hashar: half & half
[10:15:54] <_joe_> I think we should really not patch swift 1.13
[10:16:17] <_joe_> and work to move to 2.2 in the next month or so
[10:16:30] <_joe_> let's hear from filippo next week as well
[10:16:30] _joe_: it's already agreed not to patch it if the hotfix works in VCL
[10:16:41] obviously
[10:16:58] <_joe_> volans: if it doesn't, let's just reduce keep to 0 for upload until we've migrated
[10:23:06] we can't actually prevent conditional reqs from varnish to swift. even at keep=0 we have grace-mode reqs, they're just rarer.
[10:24:09] (also, even if the current patch solves the problem in the long term, it may take some time before we really cycle out cached objects which lack the new X-Orig-Content-Type, especially if they're being 304-refreshed...)
[10:25:35] maybe we can match arbitrary object headers in a ban though...
[10:26:58] so the bans on CT~text today reset the clock again on bad cached objects. for the next 3d we'll keep accumulating some new ones (because existing cache objects aren't protected with the new X-Orig-Content-Type header), so we'll want to periodically re-ban.
[10:28:01] perhaps on the 2nd or 3rd day, we could also ban all objects that lack the X-Orig-Content-Type header from cache to finish the cleanup?
[10:28:23] (in theory we could do that right now, but it would destroy most of the cache and cause load pain for swift and network links, etc)
[10:29:39] I'm not even sure that varnish has a way in a ban expression to specify all obj that are lacking header foo
[10:30:32] <_joe_> yeah I was wondering
[10:31:04] <_joe_> bblack: I have a proposal for the switchover for traffic, can I borrow 5 minutes of your time?
[10:31:12] not directly, but we could test (not in prod!) a workaround like: obj.http.X-Orig-Content-Type !~ "."
[10:31:22] _joe_: yes
[10:31:34] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10MoritzMuehlenhoff) @ayounsi I've added you to pwstore and re-encrypted the password files. Docs can be found at https://office.wikimedia.org/wiki/Pwsto...
[10:31:45] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3156881 (10MoritzMuehlenhoff)
[10:32:08] <_joe_> bblack: ok so, when changing cache::app_directors, that only influences the config of codfw and eqiad, right?
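The `obj.http.X-Orig-Content-Type !~ "."` trick at 10:31 relies on the regex "." only matching a value with at least one character, so the negated match selects objects where the header is absent or empty. A Python sketch of that test (this models the intent of the ban expression, not Varnish's actual matcher):

```python
import re

def ban_test_missing_header(headers, name="X-Orig-Content-Type"):
    """True (i.e. ban the object) when `name` is absent or empty:
    the regex "." needs at least one character to match, so a missing
    or empty value makes the negated match succeed."""
    value = headers.get(name, "")
    return re.search(".", value) is None
```

As the chat stresses, this should be verified outside production first; Varnish's handling of a regex match against a nonexistent header is the part being tested.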
[10:32:24] <_joe_> so the caches that can go directly to the backend
[10:32:42] <_joe_> or did I get that wrong?
[10:32:56] in practice today, yes. technically, although I avoid documenting it very explicitly, you could put entries for ulsfo+esams in those backends lists too and it would Just Work, intentionally.
[10:33:29] <_joe_> yeah, i was using "codfw" and "eqiad" just for easiness of practical conversation
[10:33:38] <_joe_> so, I thought we could do the following:
[10:34:18] <_joe_> 1) before the switchover, we disable puppet on all caches in $dc_from and $dc_to
[10:34:23] <_joe_> all text caches, even
[10:34:55] (we're talking about the traffic-switching 24h before that's not time-critical, as opposed to MW-RO, right?)
[10:35:13] <_joe_> bblack: nope, I'm talking about MW-RO
[10:35:21] ok
[10:35:47] <_joe_> 2) after that, we merge a "broken" change, that just represents the final state, where $dc_to is direct to its appservers, and $dc_from has no direct route
[10:36:26] <_joe_> 3) once we are read-only, we run puppet on the cache hosts in $dc_to; this takes us to an active-active state
[10:36:53] <_joe_> 4) once we've switched mediawiki, before we go rw, we run puppet on the cache hosts in $dc_from
[10:37:11] <_joe_> so that their traffic will now flow through the caches in $dc_to
[10:37:15] yes, that sequence works as you think it does
[10:37:40] <_joe_> we've added scripts to consistently disable, enable and run puppet via cumin
[10:38:06] what does "consistently disable" mean? disable + wait for running agents to finish draining out before returning?
[10:38:10] <_joe_> https://github.com/wikimedia/puppet/tree/production/modules/base/files/puppet
[10:38:13] <_joe_> bblack: yes
[10:38:25] ok
[10:38:31] <_joe_> https://github.com/wikimedia/puppet/blob/production/modules/base/files/puppet/disable-puppet and https://github.com/wikimedia/puppet/blob/production/modules/base/files/puppet/puppet-common.sh
[10:38:51] <_joe_> a further review is welcome, but i tested those scripts a bit
[10:39:46] <_joe_> I think that those scripts are absolutely needed even in the 2-commit scenario
[10:40:05] <_joe_> unless you can wait ~ 30 minutes between states
[10:40:28] my only initial thought is that the 30s default (unchangeable) wait in run-puppet-agent is probably too short
[10:40:44] <_joe_> yeah I thought of raising that to 60s
[10:41:01] <_joe_> but note that if that doesn't happen, we fail and cumin would report it
[10:41:04] normal agent runs on caches still commonly take over 30s
[10:41:08] <_joe_> ok
[10:41:14] <_joe_> I'll raise that to 120s
[10:41:17] <_joe_> just to be sure
[10:41:27] <_joe_> it won't work on einsteinium, but w/e
[10:41:45] and I agree, any way we slice this we need those functions to reliably enable/disable/run
[10:42:07] yep
[10:42:18] it might be nice to have run-puppet-agent take an argument to enable for itself too, so that the enable->run timing is closer and less likely to race a cron agent in the first place
[10:42:35] <_joe_> bblack: good idea
[10:43:20] <_joe_> my only concern was having to pass a message to run-puppet-agent
[10:43:31] <_joe_> as we might have hosts with puppet disabled by others
[10:43:44] well it can have a force-flag like the enable script does
[10:43:44] <_joe_> but adding a switch to go nuclear seems like a good idea
[10:43:58] <_joe_> ack :)
[10:44:24] if --enable is passed then we can call enable-puppet with $@
[10:44:31] doing a shift
[10:44:40] so that we can still pass the message or --force to it
[10:44:43] <_joe_> volans: or just add a --force switch
[10:44:53] what
do you mean?
[10:44:54] <_joe_> and in that case we do enable puppet before running
[10:45:02] <_joe_> unconditionally
[10:45:07] I might want to do enable+run without forcing the enable
[10:45:13] if it was for a different reason
[10:45:20] <_joe_> in that case, just fucking use puppet-enable
[10:45:21] <_joe_> :P
[10:45:44] <_joe_> the nuclear option is just useful if you want to be quick and consistent
[10:46:01] so my only concern about the sequence outlined above is that in the general case it's scary to (a) rely on puppet pre-disable and nobody messing with that state while executing some other (switchover or emergency or whatever) action during the rest of your procedure that breaks the critical disables + (b) merge a single-step change that is known to cause a user-damaging race condition without the careful disable-controls
[10:46:08] yes, but we want to be quick and consistent without messing with that single host that was disabled for a hardware issue
[10:46:47] the alternative that's less scary on that front is like:
[10:46:50] (the above was a reply to _joe_ ofc)
[10:47:57] <_joe_> bblack: something like this https://gerrit.wikimedia.org/r/#/c/346320/1/hieradata/role/common/cache/text.yaml will allow just 1 change to be propagated during the ro-phase, but it will cause a pii leak
[10:48:18] 1) Just before MW-RO window, merge+deploy a change which does "eqiad: foo.eqiad codfw: foo.eqiad" (this is "active/active" from traffic perspective, but leaks PII and is still active/passive from the app's perspective)
[10:49:02] <_joe_> heh ok, I went the other way around, but still the same outcome
[10:49:46] 2) During MW-RO window, merge+deploy a change to "codfw: foo.codfw" only (this is now active/passive with no PII leak, and no race when transitioning from the above state.
There will be a temporary race of active/active-to-the-applayer while this rolls out to agents)
[10:50:06] 3) Done (MW-RO can end when other things are ready, caches already in correct state)
[10:50:29] <_joe_> bblack: yeah I used a slightly different approach in what I described here: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki
[10:50:36] <_joe_> but equivalent to what you just said
[10:50:54] <_joe_> but I thought we could have the best of both worlds by being clever with puppet merges
[10:51:05] <_joe_> puppet runs, sorry
[10:52:50] <_joe_> so either solution is ok for me, I would love to have 1 commit to merge and to be able to do no PII leaks, but I'm ok with either way
[10:53:04] well in your proposed sequence earlier, during RO, we only need 2x sets of DC-specific cache agent runs
[10:53:19] <_joe_> yes
[10:53:26] in my sequence above, during RO we need 1x actual merge of a commit -> agent-run on both core DCs
[10:54:09] so, saving the merge until during RO-time in my version is going to slow you down
[10:54:10] <_joe_> yes, well, I assumed we disabled puppet before the switchover and merged all the changes minus the second varnish change, then enabled puppet
[10:54:32] <_joe_> we did that last time, we do it for other things
[10:54:45] <_joe_> (jobrunners and cronjobs)
[10:54:45] yeah it just seems a little scary/fragile to me to pre-merge changes that will definitely break things, and rely on the careful control of re-enables of puppet to sequence it
[10:55:25] mostly because there could be other actors involved, there's a lot of steps spread around different things to execute.
[10:56:03] tbh in theory not
[10:56:16] the automation was done to handle the whole process (merges excluded ofc)
[10:56:23] so there shouldn't be other actors
[10:57:04] because the whole team will be hands-off during the whole RO phase + the time between cache puppet disable and RO-phase start?
[10:57:41] <_joe_> what do you mean with "hands-off"?
[10:58:02] <_joe_> we shouldn't need to do any action besides monitoring and checking
[10:58:12] (and because of this fragility, be specifically instructed not to be working on any other incidental issue manually. e.g. if 3 steps into RO we see a problem and ema wants to push a VCL change to hack around it, he can't, because enabling puppet and running will break the sequence due to the pending dangerous merged change)
[10:58:41] <_joe_> yeah that would need merging a revert
[10:59:15] <_joe_> I'm ok with not merging changes before the time they need to be applied
[10:59:37] the revert could already be prepared right after merging it, JIC
[10:59:54] yeah it just seems a little scary to me in the real world, the general idea of "pre-disable puppet, pre-merge several broken changes, sequence them out in an unbroken fashion with carefully controlled re-enables" during a complex sequence
[11:00:35] I'm not necessarily trying to veto the idea. If you think you can make it fool-proof in the real world that's fine.
[11:00:43] <_joe_> bblack: ok, if you remove the "pre-merge several broken changes" and you just merge that broken change just before running puppet, is that still scary?
[11:00:44] but it scares me a bit :)
[11:01:04] _joe_: yeah that removes a lot of the scare-factor
[11:01:11] <_joe_> ok
[11:01:17] <_joe_> that could work :)
[11:01:39] <_joe_> in fact, the procedure around merges is very much a WiP
[11:02:06] <_joe_> bblack: I'll try to amend that procedure working on that hypothesis
[12:10:54] bblack: low-prio, but cp1008's backend vcl is broken: https://phabricator.wikimedia.org/P5206
[12:24:46] bblack: I was thinking, existing cached objects either have CT~text (and thus we can ban them) or don't (and thus should get X-Orig-CT on IMS), right?
[12:24:51] or am I missing a piece?
[12:26:38] oh yeah, I am :)
[12:27:17] they've been cached w/ X-Orig-CT so on IMS they'd get CT~text and we can't reset it
[12:27:44] but perhaps we could set X-Orig-CT on vcl_hit if missing to improve the situation a bit?
[12:28:52] s#cached w/#cached w/o# above
[12:49:03] bblack: also to answer the previous question about banning on missing headers, it seems to work (at least in vcl)
[12:49:07] https://phabricator.wikimedia.org/P5210
[13:09:13] <_joe_> I have a question for you cache gurus: do we have a cache policy for ores.wikimedia.org?
[13:09:22] <_joe_> or do we just pass through to the backend?
[13:13:03] _joe_: see hieradata/role/common/cache/misc.yaml
[13:13:30] look for caching: 'pass' in cache::req_handling
[13:13:35] ema: Re:swift: on second thought yes, also the commit you pointed out will work using the real CT of the object, but not just the paste you sent before, the whole move of the logic from server.py to swob.py.
[13:14:24] _joe_: so it seems to me that ores isn't pass-only
[13:15:05] volans: right, I would have been very surprised if that commit wasn't part of the solution :)
[13:15:20] at any rate, it'd be nice to keep the diff to a minimum
[13:15:27] perhaps even something like:
[13:15:35] if status != 304:
[13:15:35]     self.headers = HeaderKeyDict(
[13:15:35]         [('Content-Type', 'text/html; charset=UTF-8')])
[13:15:35] else:
[13:15:35]     self.headers = HeaderKeyDict([])
[13:15:35] then the oneliner is the minimum :-P
[13:18:54] (the one I posted before)
[14:56:03] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157520 (10dr0ptp4kt) @Dzahn does Google Search Console for noc@ show that the site verification code matches what's already in DNS like you communicated here? I...
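A toy model of the two variants discussed above (the getter guard from 09:55 and the `HeaderKeyDict` init from 13:15). This is not swift's real `swob.Response` class; a plain dict stands in for `HeaderKeyDict` (which, unlike a dict, is case-insensitive), and the point is just that a 304 must not re-assert a default Content-Type:

```python
# Illustrative sketch only, not swift's actual swob.Response.
class Response:
    def __init__(self, status=200):
        self.status = status
        # 13:15 variant: don't seed the default Content-Type on a 304.
        if status != 304:
            self.headers = {'Content-Type': 'text/html; charset=UTF-8'}
        else:
            self.headers = {}

    @property
    def content_type(self):
        # 09:55 getter variant, including volans' 15:55 nitpick of
        # also checking that the header is actually present.
        if 'Content-Type' in self.headers and self.status != 304:
            return self.headers['Content-Type']
        return None
```

Either guard alone would fix the symptom in this model; the chat concludes that belt-and-braces checks are warranted "because of the madness above".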
[15:51:50] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157734 (10Dzahn) @dr0ptp4kt I don't really see the code but it has a green check box next to "DNS". I gave full access to abaso@wikimedia.org for https://media...
[15:52:28] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3157735 (10ayounsi) pwstore works fine! We should be good to close this task.
[15:54:04] OK so after about 1h of madness we found out that neither the approach I was suggesting nor volans' works properly, for unspeakable reasons
[15:54:12] however, this does work fine: https://phabricator.wikimedia.org/P5211
[15:54:33] it needs to be applied on swift-proxy only, currently applied to deployment-ms-fe01
[15:54:53] I want to stress the "unspeakable" part! ;)
[15:55:31] the request I've been using for testing is:
[15:55:38] curl -v -H 'If-Modified-Since: Sat, 16 Jul 2016 12:40:26 GMT' http://localhost:80/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png 2>&1
[15:55:48] ema: I know it's a nitpick but I would add: and 'Content-Type' in self.headers in the if
[15:55:52] JIC
[15:56:10] technically it should be and self.headers and ...
[15:56:46] just because of the madness above we cannot be sure of anything ;)
[15:56:57] definitely
[15:58:16] ema: yeah I'd avoid the update-on-hit. at our reqrates, touching storage a whole lot more could be costly and make the mailbox stuff worse
[15:58:44] ema: but we can certainly just keep re-banning on CT~text periodically for the first couple of days, then ban everything that's missing X-Orig-CT
[15:59:59] P5211 looks nice, are we considering patching that into all the ms-fe?
[16:01:30] I've mixed feelings about it
[16:01:40] sorry, meeting, can discuss later
[16:04:35] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157769 (10dr0ptp4kt) Bummer, it looks like that didn't do it. Would you please check the noc@ email to see if there's a site verification request that you can a...
[16:06:17] bblack: I hope that the ban() vcl function and varnishadm ban are functionally equivalent, if so yeah we should ban on CT~text periodically and then ban on !X-Orig-CT
[16:06:42] we can probably double check on cp1008
[16:07:42] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157775 (10Dzahn) I don't really know how to check noc@wikimedia.org email. If i try to login with the credentials i have on mail, i get the " Add Gmail to your...
[16:08:15] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157776 (10Deskana) >>! In T161343#3157734, @Dzahn wrote: > Also note an existing owner you share full access with is "searchteam+gwt@wikimedia.org". Do you know...
[16:09:50] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157778 (10Dzahn) checking the "messages" in Search Console itself, the last one is from 4/1/17.
[16:10:51] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3157779 (10ayounsi) Juniper case 2017-0405-0571 opened.
[16:15:00] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157854 (10Dzahn) @dr0ptp4kt I also added you http://mediawiki.org, https://www.mediawiki.org and https://www.mediawiki.org with Full access.. any difference...
[16:17:03] with P5211 applied all the test.unit.proxy tests are still green, which seems encouraging :)
[16:19:18] lol
[16:21:44] tests pass, ship it, right? :)
[16:22:05] works on my machine!
[16:24:51] "ops problem now"
[16:26:04] I've added a section to wikitech for cluster-wide bans with cumin:
[16:26:07] https://wikitech.wikimedia.org/wiki/Varnish#How_to_execute_a_ban_across_a_cluster_with_cumin
[16:28:12] whenever we are going to ban, it would be interesting to wait for a while between the steps and see when/if/how problems arise
[16:28:49] the last two times, it's always been only ulsfo having troubles
[16:58:17] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3157942 (10MoritzMuehlenhoff) 05Open>03Resolved
[17:46:32] <_joe_> ema: we should really turn that into a script
[17:47:12] <_joe_> as in, a proper script using cumin as a library :)
[17:47:42] ehehe
[18:00:18] how to order the steps depends on the service and the dynamic topology though
[18:00:22] it's a tricky script! :)
[18:01:12] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158202 (10Dzahn) it went down again: 11:02 < icinga-wm> PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:01:21] bleh
[18:01:37] should i leave it down?
[18:01:40] or powercycle again
[18:01:49] perhaps power it off to make sure it doesn't blip back on, for now
[18:02:10] no evidence last time around?
a USB device that is flapping the whole time
[18:02:40] keeps finding a USB hub
[18:02:40] oh I see now, clicked on ticket :)
[18:02:51] ok let's keep it powered off until ops-codfw can look at it
[18:02:55] i think it's safe to say this will just end in calling HP and replacing the board
[18:03:01] ok
[18:03:35] (but we should make that a priority, since we'll soon be moving all our traffic to codfw temporarily!)
[18:03:58] powering off via mgmt
[18:04:04] yes
[18:05:52] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158211 (10Dzahn) < bblack> perhaps power it off to make sure it doesn't blip back on, for now
```
Server Power: On
hpiLO-> power off
status=0
status_tag=COMMAND COMPLETED
Wed Apr 5 18:03:48 2017...
```
[18:06:01] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158212 (10Dzahn) p:05Normal>03High
[18:08:08] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158216 (10Dzahn) @papaul Could you take a look at this. It seems we might have to call HP. We should make this a priority since we'll soon be moving all our traffic to codfw temporarily.
[18:13:36] XioNoX: re T162099 - we should probably flip the metrics and defaults here between lvs2002 and lvs2004 on the cr[12]-codfw for now
[18:13:36] T162099: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099
[18:14:28] (lvs2002 and 2004 are a redundant pair that speak BGP to those routers to advertise service IPs.
I think it's in the router config that 2002 gets a better metric to be the default when both are advertising, and is also the static fallback if neither is advertising)
[18:17:00] bblack: sure, looking
[18:17:42] bblack, paravoid, Juniper is okay to proceed with the RMA, I'll need onsite contacts and shipping info
[18:22:07] hmmm oddly I can't ssh into our routers like I used to remember doing
[18:22:30] every time there are uploads to releases.wm.org, e.g. https://releases.wikimedia.org/debian/pool/main/p/parsoid/, we first still see the older package version cached by varnish. Then a while later the cache expires, but we have to wait. Can i tell how long it will take until it expires?
[18:23:15] mutante: do you mean just the directory index page?
[18:24:04] hmm the directory index isn't cached anyways
[18:24:16] the actual files should have different names...
[18:24:25] well, 2 things. the directory index page, i still dont see 0.71 there, but subbu does
[18:24:33] and also the actual package you get with apt-get install
[18:25:06] it is still giving us the old version but if we wait long enough.. eventually the new one
[18:25:27] ah you're right, the index is cached, just not for me because $random_cookies
[18:25:39] it's happened on another occasion
[18:25:48] when subbu uploads new parsoid versions
[18:26:05] we get tickets that the upload has worked but you still can't install
[18:26:08] the package installed by "apt-get install" probably depends on some index file that doesn't change names
[18:26:14] yes
[18:26:23] the actual file is probably fine
[18:26:29] the Packages file i think
[18:27:07] since the releases server isn't sending any CC stuff
[18:27:34] it should use the default ttl on misc, which is 1h
[18:27:56] oh!
that is shorter than i thought
[18:28:07] that's short enough to not worry about it i think
[18:28:16] somehow expected something like 24h
[18:28:27] thanks, for now subbu just wanted to know an "ETA"
[18:29:40] it's probably subject to rare race conditions too, which could make it take a few hours at most. but it should usually be 1h
[18:30:09] (that's the whole "the TTL is per-cache-layer" thing)
[18:30:23] alright
[18:30:30] yeah so, I can't ssh into any of our network devices anymore
[18:30:40] like it doesn't remember my ssh keys
[18:31:27] (I should have 3x keys everywhere named bb@neo-[123])
[18:31:50] the last change that affected ALL network devices at once was probably when Faidon added Arzhel yesterday
[18:32:04] yeah
[18:32:10] I'd go peek, but... P
[18:33:47] oh nevermind, I'm being dumb
[18:34:09] I forgot that I never did go try to switch them all to the new keys, I still have an on-disk key I'm supposed to manually use for net stuff
[18:35:44] XioNoX: peeking at cr1-codfw, the metric thing is:
[18:36:21] policy-statement LVS_import { term secondary { ... adds metric 10 to the IPs for .4, .5, .6 which are lvs200[456] }}
[18:36:39] [edit policy-options policy-statement LVS_import term secondary from]
[18:36:39] - neighbor [ 10.192.17.4 10.192.17.5 10.192.17.6 ];
[18:36:39] + neighbor [ 10.192.17.5 10.192.17.6 10.192.17.2 ];
[18:36:51] yeah
[18:37:01] bblack: It looks like I have to do that on both devices
[18:37:04] or we could just do a separate term that adds 20 to .2 separately, since it's temporary
[18:37:05] cr1/2
[18:37:27] sure
[18:38:06] now I'm reading through trying to see where we set the static-default stuff
[18:38:22] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158350 (10ayounsi) Juniper is ready to proceed with an RMA. We need to sync up with the DC's remote hands for that.
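Re the "TTL is per-cache-layer" point at 18:29-18:30: with the same default TTL applied independently at each cache layer, a lower layer can refresh from an upper layer just before the upper copy expires, so the worst-case age of what users see can stack up to one TTL per layer. A simplified model (real varnish layering has more nuance, e.g. grace and keep):

```python
def max_staleness(ttl_seconds, layers):
    """Worst-case object age when each of `layers` caches applies
    `ttl_seconds` independently, counted from when *that* layer
    fetched the object: the ages can stack to layers * ttl."""
    return layers * ttl_seconds
```

This matches the chat's estimate: a 1h TTL usually means about 1h of staleness, but rare refresh races across layers can stretch it to a few hours.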
[18:39:49] ah found it [18:40:04] XioNoX: look for the comments in show config that say "backup route" (2x for ipv6, 3x for ipv4) [18:40:57] got it [18:41:38] bblack: what's the difference between 10.192.1.* and 10.192.17.* ? [18:43:53] different interfaces on different vlans for the same hosts in the case of LVS [18:44:48] I'm not sure why it's using different IPs for neighbor vs static-route there tbh [18:45:19] but faidon might know, or it might be just an inconsistency that doesn't matter in practice? [18:45:50] I'd just stick with the current scheme and mess with the .2 vs .4 for now [18:46:20] (and note the changes in the lvs2002 ticket so we remember to undo this stuff later) [18:47:39] for context: the LVS machines in the core datacenters all have 4x physical interfaces on them, so that they're directly-connected to all the per-row switch stacks [18:47:53] (so that they can route traffic to machines in any row) [18:48:29] (and then they're virtually on both the public and private VLAN in each row as well) [18:49:06] hmmm [18:49:09] okay [18:49:19] now that makes me wonder more about the .1 vs .17 question though :) [18:50:07] ok so, we're using the .17 IPs for 4+5+6 because that's their primary interface [18:50:19] and the .1 IPs for 1+2+3 because that's *their* primary interface [18:50:23] bblack: so the static route to 10.192.1.2 needs to be pointing to .4 now ? or deactivated? [18:50:34] ah okay [18:50:36] so I think I misled you above [18:50:58] the +20 metric for lvs2002 should add to 10.192.1.2 [18:51:18] and switching the static route should change from 10.192.1.2 to 10.192.17.4 [18:51:35] that makes more sense :) [18:52:40] "primary interface" being the one IP on that host that maps to its actual FQDN in DNS [18:53:18] bblack@lvs2001:~$ host 10.192.1.2 [18:53:18] 2.1.192.10.in-addr.arpa domain name pointer lvs2002.codfw.wmnet.
[18:53:18] bblack@lvs2001:~$ host 10.192.17.2 [18:53:18] 2.17.192.10.in-addr.arpa domain name pointer vl2018-eth1.lvs2002.codfw.wmnet. [18:53:21] bblack@lvs2001:~$ host 10.192.1.4 [18:53:23] 4.1.192.10.in-addr.arpa domain name pointer vl2017-eth1.lvs2004.codfw.wmnet. [18:53:26] bblack@lvs2001:~$ host 10.192.17.4 [18:53:29] 4.17.192.10.in-addr.arpa domain name pointer lvs2004.codfw.wmnet. [18:53:45] I'm still a bit confused :) [18:54:07] ok let me reword things a bit: [18:54:46] codfw has 4x rows of equipment A B C D, each has a separate switch stack [18:55:20] each row has two primary VLANs (aside from special things like labs, analytics, etc), e.g. public1-a-codfw + private1-a-codfw are the public- and private-address VLANs for Row A in codfw [18:56:04] the LVS hosts, like the juniper routers, are special because they're actually functioning as cross-vlan routers [18:56:37] so the LVS hosts, like the juniper routers, have 4x physical interfaces reaching those 8x VLANs, so that they can directly-route packets to any host in any row/vlan [18:57:28] so even just speaking about IPv4, there are 8x different IP addresses for host lvs2002 [18:57:53] (which you can see if you go to e.g. lvs2001 CLI and do "ip -4 addr") [18:58:27] but only one of the 8x different interface IPs for a given LVS box is its "primary" address as a host (as opposed to just for routing packets around) [18:59:03] which is the private IP on eth0, and it's the only one of the IPs that maps in DNS to the actual LVS hostname, instead of the interface-specific ones [18:59:24] 10.192.1.2 is the IP of lvs2002.codfw.wmnet and is on eth0 there [18:59:51] whereas 10.192.17.2 is the IP of "vl2018-eth1.lvs2002.codfw.wmnet." in DNS and is not the primary interface on eth0 for lvs2002 [19:00:02] $ ssh lvs2002.codfw.wmnet [19:00:02] Connection timed out during banner exchange [19:00:02] Did I miss something?
[19:00:18] lvs2002 is powered down for failure, try 2001 for similar comparison [19:01:15] anyways, the point is: for the static backup routes and the BGP metric tweaking in the juniper router configs, we're using the "primary" IP of whichever LVS we're specifying [19:01:42] in codfw, that's the .17.[456] IPs for lvs200[456], but it's the .1.[123] IPs for lvs200[123] [19:02:04] we placed their primaries on opposite vlans so that if a switch stack dies we can't lose both sides of the LVS redundancy [19:03:26] okay [19:03:30] that all makes sense [19:03:44] there's probably a simpler way to explain that, sorry :) [19:04:39] I think I got it [19:06:45] most other things are relatively straightforward, LVS-related things are a bit complicated :) [19:07:04] rewinding a bit to give more context that will help down the line: [19:07:33] the bulk of our public traffic (e.g. the HTTPS reqs to en.wikipedia.org) happens via LVS using this stuff [19:07:36] bblack, so on cr2, which is where lvs2002 is primary, I need to push the following: 1/ make the static point to lvs2004 behind cr1 (.17.4) and 2/ make the route to 10.192.1.2 less preferred, even though 10.192.1.2 is down right now, so I guess it's for when it comes back up? [19:08:17] XioNoX: sounds right. cr1 should be identical, I think! [19:08:29] bblack https://www.irccloud.com/pastebin/nvJKZjiD/ [19:08:45] looking at BR1 now [19:09:07] oh I need to not forget v6 :) [19:09:57] yeah just for the static backup route [19:10:07] also in the past, does the term need "protocol bgp;" like the other term? [19:10:11] (don't ask me!) [19:11:12] s/past/paste/ two lines up [19:13:34] it *shouldn't* be needed, but you're right, better to have it [19:13:56] while we're on the topic, to provide some broader future context: [19:14:13] bblack: 2620:0:860:101:10:192:17:4 doesn't reply to pings, did I get the v4->v6 mapping wrong?
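The v4→v6 mapping being puzzled over here follows a visible convention in the addresses quoted in this log: the IPv4 octets are reused verbatim (in decimal) as the last four IPv6 groups, and only the fourth group (the per-VLAN part, 101/102/etc.) varies per VLAN. A minimal sketch of that convention; the prefix, the function name, and the idea that this generalizes beyond the examples shown are assumptions, not an authoritative addressing plan:

```python
# Sketch of the v4 -> v6 mapping visible in this log, e.g.:
#   10.192.17.4 on vlan group "102" -> 2620:0:860:102:10:192:17:4
# The IPv4 octets are embedded verbatim as the last four IPv6 groups.
# The prefix and per-VLAN group are assumptions taken from the examples.

def v6_for_v4(v4_addr, vlan_group, prefix="2620:0:860"):
    """Embed an IPv4 address's decimal octets into a v6 address."""
    return "%s:%s:%s" % (prefix, vlan_group, v4_addr.replace(".", ":"))

print(v6_for_v4("10.192.17.4", "102"))  # 2620:0:860:102:10:192:17:4
```

This also shows why the ping above failed: the same v4 address under group "101" yields a different (wrong for this VLAN) v6 address.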
[19:14:27] the bulk of our public service traffic comes in through this mechanism with LVS, and it's asymmetrically routed (in LVS terms it's LVS-DR "Direct Routing") [19:14:53] the inbound packets from users hit the cr[12] core routers, they forward the packet to the lvs machines that are advertising the public service IPs to them via BGP [19:15:21] LVS makes an intelligent decision about routing it to one of several internal servers directly (without hopping back through the router, because it is directly connected to all VLANs) [19:16:05] the internal server also has the public IP configured, on its loopback interface, and sends the response packet with that public source IP, directly back to the core cr[12] routers (skipping over LVS for the return path) [19:17:00] XioNoX: yeah the :101: part is different for each vlan for v6 [19:17:20] it's 2620:0:860:102:10:192:17:4 [19:17:21] ah, it's 102 [19:17:22] ayounsi@lvs2004:~$ ip -6 addr | grep 17 [19:17:23] for lvs2004 [19:19:11] ugh [19:19:20] before you commit anything! :) [19:19:47] while I've been diving through all these details, I've made a mistake way up at the top of this whole topic, which has persisted through it all! [19:20:07] in the core sites there are 6x LVS boxes: 123456 [19:20:18] 1 2 3 are primary, 4 5 6 are secondary [19:20:33] they each have separate roles (separate sets of public IPs) [19:20:52] so lvs2001's secondary is lvs2004, lvs2002 matches with lvs2005, lvs2003 matches with lvs2006 [19:21:09] so we're trying to swap lvs2002 for 2005 here, not 2004 :P [19:21:29] never trust me on the details I guess :) [19:22:05] (at the edge sites there's only 4 of them and the mapping is 1:3 + 2:4, at the core sites there's 6 of them and it's 1:4, 2:5, 3:6) [19:22:31] XioNoX: ^ [19:23:29] ah okay [19:23:54] bblack: so replace .4 with .5 everywhere in the diff? [19:24:15] yeah, basically.
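The primary/secondary pairing just corrected above (core sites with 6 boxes: 1:4, 2:5, 3:6; edge sites with 4 boxes: 1:3, 2:4) can be sketched as a small helper. The hostname scheme, and the assumption that the thousands digit identifies the site (1/2 core, 3/4 edge), are inferred from this log and not from any canonical source:

```python
# Map a primary LVS host to its secondary, per the pairing in the log:
# core sites have 6 boxes (pairs 1:4, 2:5, 3:6), edge sites have 4
# (pairs 1:3, 2:4). Assumptions: hostnames look like lvs<site><box>,
# e.g. "lvs2002", and site digits 1/2 are core while 3/4 are edge.

def lvs_secondary(host):
    num = host[len("lvs"):]           # e.g. "2002"
    site, box = num[0], int(num[1:])  # site "2", box 2
    offset = 3 if site in ("1", "2") else 2
    return "lvs%s%03d" % (site, box + offset)

print(lvs_secondary("lvs2002"))  # lvs2005 -- not lvs2004, hence the fix
```

A helper like this is in the spirit of the "script on top of config management" idea discussed next: encoding the pairing once avoids exactly the .4-vs-.5 mix-up that happened here.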
(or :4 with :5 in the ipv6 case) [19:25:50] thinking longer-term to however we end up doing our next iteration of network config management [19:26:05] it would be nice to be able to script on top of it to make certain well-known changes like this [19:26:31] (e.g. build a script on top of the base cfg mgmt that can do things like "swap LVS primacy for lvs2002 + lvs2005") [19:29:52] there's so many ways you could get there. some ideal pie-in-the-sky future might have at least some of the router config being generated from our central config metadata in puppet/hiera [19:30:02] indeed, but how often does that happen? [19:30:19] not often really [19:30:57] XioNoX: it's nice to see you around here :) [19:31:01] bblack, good to push to both cr1/2 ? https://www.irccloud.com/pastebin/Kecy61aZ/ [19:31:58] XioNoX: LGTM conceptually! :) [19:32:43] it would also be cool to actually be pushing most of our router config changes through real CI/review like puppet, too :) [19:33:11] bblack: I'm about to call it a day, will proceed with the purges tomorrow morning (but feel free to go ahead if you've got the time meanwhile) [19:33:38] XioNoX: anyways, once that's in place update the ticket too and let them know it's now safe to power on an unreliable lvs2002 for debugging or post-replacement or whatever [19:33:42] ema: ok :) [19:34:03] ema: how often have you been purging? I was just thinking clean it up 1-2x a day until we get to the final ban [19:34:55] bblack: yeah, I had ideas to do that at Mozilla, but people were not very familiar with git/CI and other tools like that, which kind of blocked it [19:35:27] bblack: today I haven't! 
All purges in SAL, last one was yesterday at 13:00 [19:35:37] 1-2x a day seems reasonable [19:36:09] XioNoX: oh another thing, you might want to log some basic mention of router config changes for now (since we don't have good CI/review on them, we often do that) [19:36:12] alright, pushed [19:36:31] I'm not sure I understand [19:36:33] by "log", I mean there's a bot that will record whatever you put on a line starting with "!log" in the wikimedia-operations channel [19:36:42] so over in that channel, you could say something like: [19:37:08] !log changed lvs routing fallback/metrics for T123456 [19:37:09] bblack: Not expecting to hear !log here [19:37:09] T123456: Special:CentralAuth reports account attachment, which - being standalone - is confusing, report accout creation as well - https://phabricator.wikimedia.org/T123456 [19:37:30] just some kind of indicator of what you did [19:37:39] (doesn't need deep details, but ticket links are nice) [19:38:00] and then that's gonna end up on https://wikitech.wikimedia.org/wiki/Server_Admin_Log (SAL) [19:38:05] right [19:38:23] and then when you're AFK and someone else shows up and has some horrible problem to debug, they can dig through SAL for time-correlations with work done, etc [19:39:30] bye folks :) [19:39:34] bye! :) [19:41:20] okay I see [19:41:54] !log pushing https://www.irccloud.com/pastebin/Kecy61aZ/ to cr1/2.codfw for T162099 [19:41:54] XioNoX: Not expecting to hear !log here [19:41:55] T162099: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099 [19:42:04] bblack: ^ [19:42:15] XioNoX: but it doesn't listen for that command here, so do that over in #wikimedia-operations [19:42:39] (it probably should accept !log here in this channel too, but I've never gone and figured out how to get it set up) [19:43:41] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3152568 (10ayounsi) Pushed the following to cr1/2.codfw. 
When lvs2002 comes back online for troubleshooting it should not receive any traffic. ``` [edit routing-options rib inet6.0 static route 2620:0... [19:46:46] thanks :) [20:07:05] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3156449 (10RobH) I've emailed evoswitch to open an inbound shipment ticket. Once I have that reference, I'll update this task so @ayounsi can have Juniper dispatch the replacement part. [20:07:12] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158709 (10RobH) a:03RobH [20:11:43] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158717 (10ayounsi) Step by step instructions for the remote hands: # Locate the chassis: http://www.juniper.net/techpubs/en_US/release-independent/junos/topics/concept/mpc-mx480-description.html # L... [20:14:31] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158719 (10RobH) Inbound ticket # is 7326745, please go ahead and have them dispatch the part. Update this task with the tracking # and assign to me, and I'll get the inbound ticket updated. [20:50:15] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158836 (10ayounsi) From Juniper: > Thank you for the information on provided, and the RMA request has been processed for the FPC: > - RMA number: R200119594 > - Product ID: MPC5E-40G10... [21:35:00] 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158919 (10RobH) I'm also being CC'd on those emails from Juniper. Once they reply back with the tracking #, I'll update EvoSwitch for the open shipment ticket and open the ticket for the smart hands req... 
[22:57:55] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3159129 (10Papaul) a:03Papaul [23:56:36] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3159330 (10Dzahn) re: mail to noc@ I was stupid of course i can check that, it's just an alias for root@ and all ops get that. but .. i can still not see one fr...