[00:00:23] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3135842 (10faidon) I haven't heard back, but I noticed [[ https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1205416 | PR1205416 ]] now says: >... [00:25:41] 07HTTPS, 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253#3135933 (10Krinkle) [00:45:39] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3129255 (10Dzahn) It looks like the mediawiki.org zone in DNS already has a TXT record for Google verification: 600 IN TXT "google-site-verific... [00:46:26] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3135970 (10Dzahn) p:05Triage>03Normal [08:35:53] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136379 (10akosiaris) Done. I've deleted the `vlan default` entry as well as the manually added (by me) `private1-d-eqiad` one, and added the `all` VLAN. ``` show... [08:55:45] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136403 (10akosiaris) 05stalled>03Open [09:32:42] 10Traffic, 06Operations: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360#3129683 (10ema) > upload.wikimedia.org resolves to 194.168.4.100 (cache1.service.virginmedia.net.) That's not right. Perhaps this was a temporary issue with [[ http://community.virginmedia.com/t5/Switched... [10:54:37] 10Traffic, 06Analytics-Kanban, 06Operations, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3136586 (10elukey) The traffic has definitely decreased a lot since last week, but I am still seeing some 503s (way more than before). I a... [12:51:46] 10Traffic, 10DNS, 06Operations: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3136754 (10dr0ptp4kt) Oh! In that case it is *probably* just actions done with the noc@ account to delegate "full" access to abaso@wikimedia.org to https://media... [12:52:04] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136756 (10faidon) We've lived with this bug in codfw for so long, I'd say to let it be as-is until we're done with the switchover and postpone that for May on... [12:56:36] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136757 (10akosiaris) Agreed [13:54:21] _joe_ and/or volans: some minor design-question stuff on the active/active patch for varnish (public) services and the active/passive services (mediawiki, swift?), when you have a moment: [13:54:52] bblack: sure [13:55:03] for those that are active/passive (so, in any normal steady-state, we'll have it turned on for exactly one of codfw or eqiad, like mediawiki)...
[13:55:44] there's a few options for how we handle transitioning them from one side to the other, preferences on that really come down to whether they can all tolerate an active/active period (from varnish perspective) or not [13:56:20] if they can tolerate a short window of active/active (e.g. while they're in RO-only mode), we can design around expecting that as part of the process [13:57:13] if even one service can't tolerate that during switching, then I need to build in a mechanism for the only other alternative, which is to temporarily block all traffic (5xx) during the switching process, so that there's never a time with async queries going to both sides [13:57:37] using failoid? [13:57:54] (since the process, no matter how far down the design road we eventually get, involves async updates to multiple cache nodes and/or sites, those are really the only two options: either service-outage during switch, or temporary active/active during switch) [13:58:13] it wouldn't use failoid in this case. varnish would just generate a synthetic 5xx on its own [13:58:19] ok [13:58:55] to put it in concrete-ish terms, the configuration for such a service for the varnish-level stuff looks like: [13:59:19] cache::app_directors: [13:59:19] appservers: [13:59:19] backends: [13:59:19] eqiad: 'appservers.svc.eqiad.wmnet' [13:59:20] # codfw: 'appservers.svc.codfw.wmnet' [13:59:24] ^ yaml fragment [13:59:36] with codfw commented out like that, we're active/passive and only talking to eqiad [14:00:00] the option is either to comment both or to uncomment both I guess [14:00:15] if we assume a model where it's always ok to go through a temporary active/active during the switching process, then we can do that sanely in two commits [yes to your question]: [14:00:28] 1) uncomment codfw (now active active) and ensure that change is pushed to all [14:00:33] 2) comment out eqiad [14:01:13] whereas if we need outage-like behavior, we'll have to support commenting-out both, and the VCL code would interpret that as "send 5xx for all queries to this service" [14:01:23] still two commits, but the middle state is both-commented-out [14:01:37] is complex to support both cases? [14:02:13] no, supporting both is about the same complexity as supporting just the latter (outage-mode). the main reason I ask at all is that, if we're in a world where we can assume we don't need the outage-like case, [14:02:33] I'm asking because the switchover is an ideal case, in case of a real outage for example MW might not be in RO mode hence not support active/active [14:02:36] and so on... 
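To make the two-commit flow described above concrete, here is an illustrative progression of the hieradata fragment for one active/passive service (same structure as the `cache::app_directors` snippet pasted above); the three states are shown as successive labeled fragments, not one real file:
```
# illustrative progression for one active/passive service (appservers),
# mirroring the cache::app_directors fragment above

# state 0 - active/passive, eqiad only:
backends:
  eqiad: 'appservers.svc.eqiad.wmnet'
  # codfw: 'appservers.svc.codfw.wmnet'

# commit 1 - uncomment codfw; brief active/active while the change reaches all caches:
backends:
  eqiad: 'appservers.svc.eqiad.wmnet'
  codfw: 'appservers.svc.codfw.wmnet'

# commit 2 - comment out eqiad; active/passive, codfw only:
backends:
  # eqiad: 'appservers.svc.eqiad.wmnet'
  codfw: 'appservers.svc.codfw.wmnet'
```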
[14:02:37] it's nice to have the code simply fail (as in "config error, not updating varnish") when both are commented out, as a safety-net against mis-configuration [14:02:50] but probably in a real outage we're already getting failures for the outage :) [14:02:55] well in a real outage we kind of don't care about the dead-side, right [14:03:39] but if you think we'll functionally need intentional-outage-mode, then I won't make "all backends missing" a human config error, and support that as a valid way to ask varnish to 5xx all queries to that service [14:05:23] it's indeed a dangerous thing to allow for an automatic 5xx if there is a config mistake [14:06:13] well, if we need intentional outage mode and also want that sanity-check, we can make outage-mode more-explicit in the config, too [14:06:30] and passing through failoid using the yaml would mean 4 commits, so too ugly [14:07:56] so yeah, I could add an extra switch that makes it explicit, too [14:08:07] I'm not aware of all the services' internal logic enough to assure you that we don't have any case that requires the failure in the middle [14:08:50] like have an optional key (alongside backends:) like "outage: true" or something english-obvious [14:09:09] where if that key is set, zero-backends is allowed [14:09:12] yeah, that would help to be safe and keep the minimum number of commits [14:09:39] allow_empty_backends :D [14:10:22] that's much clearer as to what it means in the code, but I like making people actually type that they're causing an outage :) [14:11:02] but ok [14:11:08] yeah, anything works [14:11:14] thanks! [14:11:37] I guess _joe_ might be able to better answer your real underlying question, sorry [14:12:16] well it's a fair point that we don't really know if you ask it generally-enough. we might not need it today but reasonably need it during the next switch attempt for something new/unanticipated about the process. [14:13:06] aside from hacking on that last bit though, the bulk of the patch is pretty-well vetted and working correctly (but it is the "human error if zero backends" mode) [14:13:22] it's on deployment-prep and has been through compiler output checking, etc. I think it will probably be mergeable. [14:13:27] <_joe_> sorry I'm just back [14:13:27] (this week) [14:13:34] <_joe_> reading backlog [14:13:42] that's great! [14:14:09] I guess adding the failsafe should not be that complex then [14:14:22] yeah it's not bad [14:16:49] <_joe_> so I think we don't need the outage-like mode [14:18:46] <_joe_> bblack: I'm still not sure if it makes sense to have 2 commits + puppet runs to switch varnish [14:19:17] <_joe_> given the stress we're putting on switching everything else fast and with one command (sort-of) [14:19:26] "sort-of" :) [14:19:57] <_joe_> well, at the logical level, almost everything switches in a few seconds, if we cut out the whole [14:20:04] <_joe_> read-only-part [14:20:27] <_joe_> having to do two puppet runs to switch traffic is going in exactly the opposite direction [14:20:38] how would you cut out the read-only part? [14:21:21] <_joe_> I am saying that the traffic patterns internally will switch in a few seconds [14:22:03] <_joe_> we keep a longer read-only phase basically to ensure an easy transition [14:22:11] yes, but you're still going through an outage-like mode or an active/active phase that requires readonly-mode during those few seconds, right?
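For reference, the explicit outage switch floated a few lines up ("outage: true" / allow_empty_backends) might look roughly like this in the hieradata; the key name and placement are hypothetical, nothing here was decided:
```
cache::app_directors:
  appservers:
    # hypothetical explicit switch (name not decided - "outage: true" and
    # "allow_empty_backends" were both floated): only when it is present is an
    # empty/fully-commented backends list accepted, and varnish then answers
    # 5xx for this service instead of treating it as a config error
    allow_empty_backends: true
    backends: {}
```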
[14:22:56] <_joe_> oh yes, I was commenting on the "two commits phase" [14:23:15] <_joe_> right now we need one commit, IIRC, to switch traffic [14:23:34] so, the varnish solution (the active/active patch I'm discussing) is fairly flexible in how we handle this, it supports all the reasonable ways one could re-arrange how things are done [14:24:10] <_joe_> bblack: that was my past understanding as well. [14:24:58] the "one commit" model from how things work today is from re-arranging the traffic steps so that parts of it aren't in the critical window, at the expense of a temporary PII leak either before or after the MW switch for a window of time until all related varnish steps are done [14:25:09] there's still actually more than one commit, just some of them can be placed outside your RO window [14:25:22] <_joe_> bblack: right [14:25:25] (as part of the "traffic" pre- or post- steps, whichever way you want to arrange it) [14:25:34] we could still do that in the new model, with the same tradeoffs [14:25:54] or we can step through two commits in the middle of RO and never leak PII [14:26:37] <_joe_> bblack: I'm just not so happy that varnish will be the only thing for which we need commits to do it. But now I get the reason of "two commits" [14:26:51] that wasn't really an option before, because one of the steps was whole-cluster, so it would've synced the RO phase with switching all the other services on text cluster at the same time, too [14:27:00] <_joe_> that's assuming we never want to have cross-dc traffic between the varnishes and the appservers [14:27:12] basically, yes [14:27:40] the one-commit model in the new stuff would look like this (well, this is the pre-switch mode, there's a similar mode with the order reversed and varnish cleanup post-RO): [14:27:51] starting point is: [14:28:03] eqiad: 'appservers.svc.eqiad.wmnet' [14:28:15] # codfw: 'appservers.svc.codfw.wmnet' [14:28:29] first commit (before RO period, not in critical time window) is to change to: [14:28:38] eqiad: 'appservers.svc.eqiad.wmnet' [14:28:42] eqiad: 'appservers.svc.eqiad.wmnet [14:28:45] bah! [14:28:49] first commit (before RO period, not in critical time window) is to change to: [14:28:52] eqiad: 'appservers.svc.eqiad.wmnet' [14:29:04] codfw: 'appservers.svc.eqiad.wmnet' [14:29:15] second commit (during your RO period) is to change to: [14:29:25] eqiad: 'appservers.svc.codfw.wmnet' [14:29:32] codfw: 'appservers.svc.codfw.wmnet' [14:29:38] <_joe_> bblack: ok that makes more sense :) [14:29:56] third commit is: [14:30:03] # eqiad: 'appservers.svc.eqiad.wmnet' [14:30:08] codfw: 'appservers.svc.codfw.wmnet' [14:30:39] so you get three total commits, but only the middle one is during the RO period, and we leak PII on both sides (nevermind my earlier comments about two ways to do it with either pre- or post- leak. it's actually always a leak on both sides I think) [14:31:06] I'm sure I'm missing something, but we couldn't use a discovery DNS name? [14:31:23] not really, no [14:37:55] well, if I try to wrap my head around a way that dns discovery could be used here, it's possible it could be done. we'd still need the bulk of this patch in place first, and it would just plug different hostnames into the config above and not require commits. [14:38:16] but it would also require that they be different names from the internal switching, and that the names be varnish-cache-dc -specific, too [14:38:57] e.g. 
even though we have an appservers.discovery.wmnet for internal stuff, the config above would be something more like: [14:39:08] yes I was thinking you needed more than one [14:39:16] eqiad: appservers.for-varnish-in-eqiad.discovery.wmnet [14:39:24] codfw: appservers.for-varnish-in-codfw.discovery.wmnet [14:39:39] or whatever not-as-awful scheme one had that conveyed the same bits [14:40:23] but even then, it doesn't quite work right for active/passive and optimal routing and/or avoiding PII leak [14:40:23] appservers.eqiad-backends.d.w ? :) [14:40:43] why? [14:41:05] I mean, assuming that varnish will pick the new names within the TTL [14:41:19] it has a setting for its own TTL, basically (which is kinda horrible, but whatever) [14:41:21] changing those names should be absolutely equivalent to committing in puppet right? [14:41:52] not quite: changing those names doesn't give us the ability to do what commenting out one of those two lines does above [14:42:14] (and having them resolve to failoid or whatever doesn't, either) [14:43:57] <_joe_> brandon is saying that to control that via etcd you'd need to create yet some more confd horror [14:44:10] <_joe_> if I'm getting it right [14:44:17] <_joe_> as in getting the same flexibility [14:44:24] well, more than that, but yes, it's not easy with the constraints in play [14:44:34] maybe it's better to step back and explain what commenting-out does: [14:45:01] I was about to say: different possible approach, what if the equivalent of that bits in hiera are managed by etcd and just loaded by VCL/varnish so that the result is the same as in the same generated file [14:45:05] so we have this cache-tiering, right? esams-cache misses go to eqiad-cache and so-on [14:45:13] yep [14:45:34] in the current model (what's live today), only one DC is the final "direct" destination in the cache route table (currently eqiad) [14:45:46] which means all other cache backends just forward to eqiad, and eqiad decides which appserver to send things to [14:46:04] (that everything funnels to eqiad is a per-cache-cluster decision, not per-appservice) [14:46:34] in the newer active/active patch we're discussing, that stuff all fundamentally changes [14:47:03] there's still an inter-cache routing table that applies to the whole cache-cluster (e.g. "text", which is what handles MW and RB and such) [14:47:15] but it looks like this, with a loop between the two core DCs: [14:47:25] cache::route_table: eqiad: 'codfw' codfw: 'eqiad' ulsfo: 'codfw' esams: 'eqiad' [14:47:53] when a request comes into a cache at any of these DCs, the routing decision goes like this: [14:48:14] 1) Identify which appservice the request belongs to (host/uri regexes, etc) [14:48:42] 2) In that appservice's data, check if a backend is defined for the DC we're in right now. If so, send traffic to that applayer backend. [14:48:53] 3) Else, route to the next DC in the cache route table for the cluster. 
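A tiny sketch (Python pseudocode, not the real VCL) of the three-step decision just described, using the route_table and appservers data quoted in this conversation; step 1 (matching the request to an appservice) is represented by the `service` argument:
```python
# Illustrative only: mirrors the per-request routing decision described above.
route_table = {'eqiad': 'codfw', 'codfw': 'eqiad', 'ulsfo': 'codfw', 'esams': 'eqiad'}
app_backends = {'appservers': {'eqiad': 'appservers.svc.eqiad.wmnet'}}  # codfw "commented out"

def next_hop(service, my_dc):
    backends = app_backends[service]
    # step 2: a backend is defined for the DC we're in, so hand the request
    # straight to the local applayer
    if my_dc in backends:
        return ('applayer', backends[my_dc])
    # step 3: otherwise forward to the next cache DC in the route table
    return ('cache', route_table[my_dc])

# e.g. next_hop('appservers', 'eqiad') -> ('applayer', 'appservers.svc.eqiad.wmnet')
#      next_hop('appservers', 'codfw') -> ('cache', 'eqiad')
```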
[14:49:15] so if restbase's backend list is defined as: [14:49:24] eqiad: rb.eqiad.wmnet codfw: rb.codfw.wmnet [14:49:39] esams still forwards RB traffic to eqiad, and ulsfo still forwards RB traffic to codfw [14:49:56] ok, clear [14:49:58] but at both of those places, it knows to drop the traffic directly to the local rb service instead of forwarding to another cache [14:50:19] but if you comment out the codfw entry there, codfw forwards the traffic to eqiad, where eqiad drops it all to only the eqiad RB [14:50:38] so basically we can have a routing from the local varnish backend directly to the applayer or do another step to the other varnish backend and let it decide [14:51:08] right... [14:51:23] so given that, probably the next iteration of trying to put dns discovery into this would be: [14:52:11] "well ok, so when we want the effect of commenting out restbase's codfw entry, we update the discovery-dns so that restbase.for-varnish-in-codfw.discovery.wmnet resolves to the eqiad varnishes' IP" [14:52:43] except crossing from cache to cache is not like crossing from cache to app [14:53:09] we have to have a globally-consistent chash when hopping to another cache, to make the cache effectively scale its size across machines [14:53:13] (chash on URI, basically) [14:53:34] so LVS can't do that, and DNS can't do that. only varnish having the whole list of eqiad caches to run through its chash director can do that [14:54:03] (and on top of that problem, having varnish use DNS this way uses a "DNS director", which replaces the slot the chash director code lives in) [14:55:39] stepping back out of example-land, the varnish routing problem needs more metadata than dns-disc can provide; that's the core of the problem [14:55:59] we could still make it etcd-driven, but it would have to be direct etcd data consumption by the VCL code, not via-DNS [14:56:46] could it be via file? I didn't check how current puppet does it [14:56:48] so that means either writing an etcd vmod that works for this purpose, or templating out confd fragments of VCL code [14:56:58] but both paths are rather complex [14:57:13] it's not something we've really allocated any time to solve, yet [14:57:32] the core issue there is that VCL is not a real general-purpose language [14:57:38] the only way DNS would "comment out" things would be through NXDOMAIN, but far be it from me to suggest that, for obvious reasons [14:58:10] yeah but NXDOMAIN doesn't really help us via varnish's DNS director. it would just make it fail on the side resolving to NXDOMAIN (5xx) [14:58:23] there's no good way to base a decision on that in the VCL code to switch routing [14:59:50] the only realistic paths forward that get us to a better place than "hieradata commits" are: [14:59:57] so, still on the DNS option, there could be a way, but it is rather ugly [15:00:17] (no, I don't think there's any way, ugly or not, via DNS) [15:00:20] use a predefined IP to "comment it out" [15:00:44] the predefined IP is consumed by varnish's DNS-director code, not our VCL code [15:00:57] meaning you don't have access to it? [15:01:08] or able to make logic on top of it [15:01:31] hmmm ok, that's possible. that's taking ugly to the extreme, though :( [15:01:49] I said "rather ugly" as a premise to my defence [15:02:07] anyways, the slightly-less-ugly approaches are: [15:02:46] 1) We use confd to template out another VCL fragment. To keep the confd template simple, we just have it setting req.http.Foo variables that the rest of the code can act on.
[15:03:12] 2) We tie etcd directly into the VCL with some kind of vmod_etcd that makes the same workflow as the above a bit less painful. [15:04:09] (2) seems like wasted effort, writing new vmods in C is tricky and a long-term investment, and we're still aiming towards getting rid of varnish anyways [15:04:48] yes I'd rather go for (1) [15:05:12] (1) is a viable path forward, but there's a long and complex road between "where our varnish config was 6 months ago" (in terms of the awful puppetization -> data -> templating -> runtime VCL, etc) to where we can drop that into place [15:05:38] you have to get there in steps that can be swallowed in smaller chunks (even then they're painful refactor steps) [15:06:16] I've merged up several of those pre-steps over the past 6 months. this active/active work is the final part of that, basically. [15:06:41] it gets us to the point where structurally we're ready for (1), but it's still driven currently by a hieradata commit instead of confd-templated values. [15:07:15] and even that, I wasn't sure I'd have ready in time for this effort (we're still cutting it close), which is why I didn't promise etcd-driven for varnish for this quarter. [15:07:31] sure [15:07:41] so that's kinda where things are at [15:08:03] the active/active commit is a step in the right direction, and it doesn't go all the way, and there's little chance we can engineer our way through to the end before the 19th safely [15:09:03] assuming we merge this, we still have the option of 1 commit + short PII leak or 2 commits without PII leak right? [15:10:04] 1 commit during RO of course, I'm not counting the ones before/after [15:10:48] "short" probably being the broader duration of the whole switchover affair, maybe up to hours? I donno, but yes. [15:11:23] last time around a year ago, we spent 46 minutes in RO-mode. Shoving back-to-back serial commits through cache_text should take 3-5 minutes tops. I don't see it as a big deal, really. [15:11:52] say 2 minutes if it's a single commit instead? [15:12:04] the switchback was much quicker and we aim for a pretty short RO period [15:12:35] if it's 1 commit we can disable puppet and ensure disabled before RO, merge and puppet merge, hence just having to enable puppet + puppet run in parallel, I expect less than 1m for this [15:12:44] probably even less [15:12:53] well, it's a tradeoff to make. Do you want to spend [time for 2x serial commit pushes] and leak no PII, or [time for 1x serial commit push] and leak PII around the window for a bit? [15:13:45] before, the PII leak was unavoidable unless we took a much longer RO-window and synced MW-switching with switching every other cache_text service [15:13:47] I think that the important fact is that the new active/active mode is a step forward, it allows 2 options while last year we had one [15:13:54] at this point it's avoidable with a slight delay [15:14:05] so let's go for it and maybe see if we can have it with etcd too (no promises of course) [15:14:24] I don't see the etcd part happening before the 19th. this is barely ready in time as it is.
[15:14:31] ok then [15:14:58] but it's still an improvement compared to last year and leaves us the choice of 1 vs 2 commits, PII leak vs no-leak [15:15:18] so for me it's: let's go forward with this and then make the decision of which one we want to use separately [15:15:26] "ensure disabled before RO" is of course possible, but we really need to look at the whole global sequence and make sure that happens in the right place [15:15:27] at that point it's a pure "political" decision [15:15:41] because again, that blocks other cache_text-affecting commits before RO too [15:15:57] sure [15:16:09] I suspect the political decision will be to take the extra minute to not leak PII :) [15:16:18] or minute(s) as the case may be [15:17:04] if you want it "quick" it requires a gerrit rebase + a blind merge (manual +2 verified) [15:17:13] the pre-puppet-disable also doesn't guarantee success of the puppet step. either way you're going to have to wrap some logic around ensuring puppet succeeds and wasn't stuck hung/running before, or whatever. [15:17:16] + puppet-merge + puppet run [15:18:00] (we still have random master failures, and the pre-disable needs to include a pre-wait that all running agents finish up) [15:20:12] yes, I can ensure when we start the RO that puppet is disabled and no puppet runs on any of the text cache hosts [15:20:20] manually? [15:20:46] can be part of the switchdc steps as a pre-step or not, our choice [15:20:56] in any case, though, how are we ensuring we get past a random failure of an agent run during the critical apply? [15:21:25] get past meaning re-run of puppet? [15:21:56] yeah I suppose that would be the best way, all things considered [15:22:04] loop on puppet agent until it succeeds, basically [15:22:47] or fail hard so we can step out and look (and then the timeline goes out the window if that unexpected case happens, but it's better than not-knowing that one of the hosts didn't update to the new config and breaking active/active+RW rules) [15:23:29] we already know if any command fails, the steps are/will be designed to be idempotent, so the quick way is [15:23:42] one host fails puppet, the step fails, just re-run the step [15:24:08] I'm guessing cumin makes the failure obvious [15:24:09] or add the loop logic inside the python step to retry the failed ones [15:24:20] or add the loop logic inside the bash commands cumin executes [15:24:21] (searching text salt outputs and counting hosts to be sure it didn't miss one sucked) [15:24:38] any exit code != 0 is considered a failure [15:24:54] still catching up with this backlog [15:25:09] so silly question: what about that old patch series that is posted on gerrit? [15:25:43] yeah honestly, I think we'll always have a need for a cumin-wrapped action of some kind (wherever the logic lies) that allows us to say "ensure agent runs successfully on these hosts", which wraps up retrying failures and getting around racing an already-running agent from before your merge, or whatever [15:26:09] and having it now would make this part easier [15:26:56] also your time estimates for puppet runs may be off (or not), last time we definitely ran into puppet load issues that slowed everything down [15:27:02] paravoid: what old patch series? [15:27:23] this time it's way fewer servers + more puppetmasters, so it may not actually be affected much [15:27:29] paravoid: puppet has gotten better since then, it's been tested to run fast even with e.g.
100x hosts in parallel via salt or whatever [15:27:29] bblack: well ok, not so old, https://gerrit.wikimedia.org/r/#/c/339667/ [15:28:00] paravoid: most of this conversation is about that patch in one way or another. in what context above do you mean "what about?" [15:28:45] maybe I should catch up with the backlog then :) [15:28:47] that's the "active/active" patch [15:29:23] so after that patch, we're going to need 1 or 2 commits? [15:29:42] paravoid: so a TL;DR is: with the old patch we had only 1 commit and we were leaking PII during the switchover, if we merge the active/active patch we have the choice 1 commit leaking PII or 2 commits without leaking PII [15:29:54] ^ that [15:30:04] with the intent to move the 2 commits to etcd, but not before the switchover [15:30:17] s/intent/general intent/ [15:30:18] the 1-patch leaking PII means two other patches (before and after) and leaking from when the first goes in until the third goes in [15:31:43] so that patch series right now it doesn't buy us much w.r.t. the switchover? [15:31:54] (if we go the no-leak 2-patches-during-RO path, the two patches are serial - there's no step inbetween them, it's just critical that the first patch fully agents before the second patch merges) [15:32:36] paravoid: that wasn't really a question earlier in the conversation, but I could address its benefits :) [15:34:22] 1) What it buys for the MediaWiki Readonly-mode switching process is just choice: we can do it much like we did a year ago, or we can push 2x varnish commits during the RO phase and avoid the PII leak we had last year. [15:35:14] 2) For the active/active services, it changes things entirely: we can make them fully active/active for normal use ahead of this switch, and then just shut off their eqiad sides for load-testing during the switch period (with singular async commits). [15:36:14] 3) More importantly for all the above: all of the services are de-coupled from each other, instead of being tied to making global cache-routing changes for the whole of e.g. cache_text. active/active vs active/passive vs "which dc is currently disabled" etc become independent per-service distinctions without inter-dependencies due to being in the same varnish cluster. [15:38:20] 4) From a less-functional-today perspective, it's progress towards the kind of code/data/model that our VCL needs to have to get etcd-based (instead of commits-based) switching in the future. Even though we didn't have time to keep pushing for etcd, this gets the bulk of the refactoring towards that eventual goal out of the way and tested [15:40:20] (also it's not a patch series anymore, I squashed it to one patch. it looks good in compiler outputs across the varnish fleet, and it's deployed on deployment-prep since yesterday and working fine) [15:40:37] oh nice [15:44:58] anyways, we (well, volans and I) decided we should make a decision by the end of the week to either merge it or not, since we need time to get all the steps in order after [15:45:30] I wasn't sure then (early yesterday), but at this point I think it will be pretty safe to merge during this week. 
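Going back to the "re-run puppet until it succeeds" idea from a bit earlier: a minimal sketch of what such a step could look like, assuming a hypothetical run_puppet() helper that runs the agent on a set of hosts (via cumin or otherwise) and returns the hosts that failed; this is not the real switchdc or cumin API:
```python
def ensure_puppet_applied(hosts, run_puppet, max_attempts=5):
    """Keep re-running the puppet agent on the given hosts until every one of
    them has applied the new config, or give up after max_attempts."""
    remaining = set(hosts)
    for _ in range(max_attempts):
        # run_puppet() is an assumed wrapper returning the hosts whose agent run failed
        remaining = set(run_puppet(remaining))
        if not remaining:
            return
    raise RuntimeError('puppet still failing on: ' + ', '.join(sorted(remaining)))
```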
[15:46:24] the other decision point is then, for the MediaWiki case: do we minimize RO-time at the cost of PII leak, or do we kill the PII leak and double the RO time imposed by the varnish part of the process (from 1 commit merge->push->success, to two such commits in a row) [15:46:27] nice [15:47:19] brb [15:47:19] and I need that decision (the second one) in order to adapt some parts of the switchdc automation based on the choice made [15:48:21] bblack: I didn't look at the patch (I can if needed of course) but yeah, seems to me that at this point is worth concluding this effort by merging it to go forward [15:51:43] volans: I would normally ask for review. Well maybe taking a step further back: I would normally never write such awful code to begin with, it's really ugly. But the ugly is imposed by the constraints of the situation (that VCL is what it is, everything flows from that). [15:52:03] lol [15:52:10] volans: but realistically, I don't expect anyone else to be able to meaningfully review that patch. Although you can check it for really stupid glaring things. [15:52:39] it's code that generates code that generates code that generates code [15:53:53] and compiles both in pascal and fortran and VCL? :) [15:54:32] something like that [15:55:12] the layers in the system as a whole are like hieradata->puppet->ruby->VCL->C [15:55:47] hieradata looks are kinda-messy, puppet is a broken language, ruby is a good but odd language, VCL is not even a proper language and is possibly more horrible than anyone can imagine, and then C is scary and dangerous [15:56:17] the perfect combination! [15:56:42] luckily the patch contains no actual inline-C code :) it's just implicit in my mental model and makes it sounds scarier :) [15:57:14] now I want to see this patch, do you have the link/ID at hand? [15:57:27] :) [15:57:36] if VCL were a real language, we could simply have hieradata/puppet export raw data structure blocks to include files and have VCL operate on them as functional data directly [15:58:00] it's the fact that VCL is such an incomplete thing that makes it necessary to generate VCL code from ruby code instead of that [15:58:09] https://gerrit.wikimedia.org/r/#/c/339667/9/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb [15:59:05] nice, ruby function definitions inside the erb [15:59:35] which generate varnish code-blocks from hieradata :) [16:00:04] the goal there in set_backend__ is to transform from this kind of hieradata: [16:00:07] cache::app_directors: [16:00:09] appservers: [16:00:12] backends: [16:00:15] eqiad: 'appservers.svc.eqiad.wmnet' [16:00:17] # codfw: 'appservers.svc.codfw.wmnet' [16:00:19] api: [16:00:22] ... 
[16:01:06] to this kind of VCL code on-disk (taken from deployment-prep live deployed config): [16:01:09] sub set_backend__ { [16:01:11] if (req.http.Host == "citoid.wikimedia.org") { [16:01:14] set req.backend_hint = citoid_backend.backend(); [16:01:16] } elsif (req.http.Host == "cxserver.wikimedia.org") { [16:01:19] set req.backend_hint = cxserver_backend.backend(); [16:01:21] } else { [16:01:24] if (req.url ~ "^/api/rest_v1/") { [16:01:26] set req.backend_hint = restbase_backend.backend(); [16:01:29] } elsif (req.url ~ "^/w/api\.php") { [16:01:31] if (req.http.X-Wikimedia-Debug) { set req.backend_hint = appservers_debug.backend(); } else { set req.backend_hint = api.backend(); } [16:01:34] } elsif (req.url ~ "^/w/thumb(_handler)?\.php") { [16:01:37] if (req.http.X-Wikimedia-Debug) { set req.backend_hint = appservers_debug.backend(); } else { set req.backend_hint = rendering.backend(); } [16:01:40] } else { [16:01:43] if (req.http.X-Wikimedia-Debug) { set req.backend_hint = appservers_debug.backend(); } else { set req.backend_hint = appservers.backend(); } [16:01:45] } [16:01:48] } [16:01:50] } [16:01:53] the code templates out differently depending on which DC we're evaluating it within (e.g. eqiad vs codfw vs esams), and of course any changes to the hieradata [16:02:05] right [16:02:13] if this were codfw, parts of that would replace: [16:02:18] set req.backend_hint = citoid_backend.backend(); [16:02:19] with: [16:02:31] set req.backend_hint = cache_eqiad.backend(); [16:02:45] based on current hieradata, anyways, and within the context of one specific application service by the host/uri regexes [16:03:23] (and set X-Next-Is-Cache if doing so, which triggers other code that needs to know if we're talking to the applayer or another cache) [16:04:49] %Q, you rubysta :) [16:05:11] I don't even like ruby much! :) [16:05:43] me neither, I was just trolling :D [16:08:04] oh my... I just find the join(' els'), nice trick, I guess it's worth 100 points of technical debt by itself :) [16:12:31] in some larger philosophical view, varnish really isn't an HTTP cache, it's a tech-debt generation engine [16:13:25] but from that same 10,000-ft view, my stance on all related things lately is to just accept the VCL-induced ugliness, but try to confine it to code rather than data [16:13:47] move data that would've otherwise been implicitly inside of VCL code, instead to explicit hieradata declarations consumed by the awful code [16:14:08] because that structured data helps with and lives through the future transition from VCL to what's-beyond-VCL (e.g. ATS) [16:14:26] I agree [16:15:59] and I also agree that is practically impossible to me to fully validate that code, without studying all the VLC part [16:19:09] I trust your testing :) [16:19:21] now I have to go, I'll try to reconnect later, sorry I have an errand to do [16:49:32] oh I guess I missed one thing in all the above shortly after joe chimed in [16:49:49] 14:16 < _joe_> so I think we don't need the outage-like mode [16:50:26] if we don't actually need an outage-like mode (varnish can always go temporary-active-active during the rollout of a switch from one side to the other, for any active/passive service) [16:50:48] then the patch is already complete (the variant that's on deployment-prep now). if we do, I have a small chunk of additional work to do there. [16:51:50] an outage-like mode is also by-definition a 2-commit process, too. 
the one-commit way is always going to overlap requests to both sides of the fence like temporary active/active (fairly briefly, during async pushing of that commit to all the nodes) [16:54:18] (and pulling this topic back around into DNS discovery: outage-like-mode is what failoid is for, too. if we really never need that, we don't need the failoid pattern. we just flip the new DC on before flipping the old DC off, and there's some active/active-ish overlap while it pushes slightly-async to DNS servers and as the TTLs expire) [16:55:44] <_joe_> bblack: actually once our apps have clear read/write separation, the "outage mode" will only be for writes [16:55:48] <_joe_> and that's ok [16:55:50] actually even the order of the on/off flips doesn't matter in that case, as it would treat 2xdown the same as 2xup, basically [16:56:13] <_joe_> say you could have all reads reliably sent to one backend that's always active-active [16:56:19] ok so we're keeping the failoid outage-mode stuff for future separated-readwrite [16:56:22] <_joe_> and have the writes go through outage mode [16:56:29] <_joe_> bblack: yes, basically [16:56:40] but we don't want that for varnish similarly? [16:57:00] (e.g. varnish has appservers-ro and appservers-rw as distinct backends and makes the choice based on GET vs POST or whatever) [16:57:12] the sticky-cookie stuff, basically [16:57:51] personally I'd love for that to never actually happen, but it was the last-known plan on that front [16:59:05] (I think it would be superior to shove the ro/rw mess inside the software stack of the offending application - let it re-route its own post requests to the endpoint in the other DC where writing is active, and let other services/varnish treat it like it's active/active and be oblivious) [16:59:46] <_joe_> bblack: well, if the offending application is the user's browser [16:59:48] <_joe_> :P [17:00:01] (but I get that's asking a lot from the MW side of things in practice, and that effort could be better spent moving towards real A/A) [17:00:19] by "offending" I mean "failing to be active/active" [17:01:11] <_joe_> oh ok [17:02:15] <_joe_> the joys of systemd: I forgot to attach my timecapsule to the linux "server", and systemd just refused to start without that fucking useless partition [17:02:23] <_joe_> because "a mount job has failed" [17:02:44] <_joe_> sorry, just to say that's the shit I'm dealing with while we're talking :P [17:02:57] in other words, don't expose the rest of the world (varnish and other consuming internal services) to the ro/rw madness. Pretend it's active/active for everyone else. Make the front edge of the MW stack somewhere (php code? crazy apache config?) deal with the issue and do whatever's necessary there (e.g. know which side is active for writes, and proxy POST requests over to the other side directly, or whatever) [17:04:33] <_joe_> bblack: that might already be taken care of somehow, btw, but let's not rush to the next six months. By all means, the outage-mode for varnish is not needed now [17:04:46] <_joe_> and it *might* be interesting (not mandatory) later [17:22:21] 10Traffic, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3137827 (10MaxSem) The only valid use for labs is WMF projects, and those don't support JS in IE under 9 (9 is soon going to be...
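Purely for illustration, the GET-vs-POST split mentioned above (appservers-ro / appservers-rw as distinct varnish backends) would amount to a choice like this; the backend names are hypothetical and, as noted in the conversation, this was only the last-known plan, not something being built:
```python
def pick_mw_backend(method):
    # writes would go to the single DC currently active for writes,
    # everything else to an always active/active read pool
    if method in ('POST', 'PUT', 'DELETE', 'PATCH'):
        return 'appservers-rw'
    return 'appservers-ro'
```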
[18:34:48] _joe_: ok, makes my life easier :) [19:55:07] 10Traffic, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3138272 (10grin) >>! In T161256#3137827, @MaxSem wrote: > The only valid use for labs is WMF projects, This is not true in it... [20:17:56] 10Traffic, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3126688 (10Peachey88) >>! In T161256#3138272, @grin wrote: > I would expect some background check from you before answering. Le... [20:22:40] 10Traffic, 06Operations, 10Pybal: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#3138340 (10ema) [20:37:03] 10Traffic, 06Operations, 10Pybal: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (10ema) >>! In T114104#1685872, @mark wrote: > It's trivial to have Pybal clear the ipvsadm table on startup of course, but I deemed that undesirabl... [21:12:54] bblack: hey :) [21:13:02] bblack: can I go ahead and merge the first ttl_cap/keep swap patch tomorrow? https://gerrit.wikimedia.org/r/#/c/343844/ [22:08:37] ema: yes :) [22:09:55] bblack: cool! [22:10:10] also, while doing spring cleaning I've bumped into https://phabricator.wikimedia.org/T82747 [22:10:33] the patch is merged into pybal's master branch but it doesn't seem to be in the 1.13 branch [22:10:46] nor in prod, consequently [22:12:09] heh [22:12:18] nice find :) [22:12:48] might be worth cherry-picking I guess [22:17:26] honestly it's been so long I have little memory of that for voicing an opinion on the patch [22:17:54] reconstructing from phab, I think I independently observed the ipv6 issue, filed a pointless phab task at https://phabricator.wikimedia.org/T103880 , then merged it into the older rt-imported one [22:18:05] and then mark made a patch that got lots of +1/+2 [22:18:45] but I have no idea, perhaps it doesn't cleanly pick back to 1.13? [22:20:59] it does pick cleanly, I'll try it out on poor pybal-test2001 and see [22:21:13] CC: _joe_ [22:26:35] 10Traffic, 06Operations, 10Pybal: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (10BBlack) I think wiping the whole table, even at startup, is probably not ideal (but certainly better than wiping it on shutdown!)., What we shou...