[06:28:18] Krenair: so far, nothing... I've pinned cryptography version to 1.7.1 (the one shipped in stretch) since, as you can see in the announcement, stretch is not affected: [stretch] - python-cryptography (Vulnerable code introduced later)
[08:58:38] 10Traffic, 10Operations: Discard of cold, labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (10ema)
[11:38:30] alternate domains patch merged with puppet disabled, testing on cp1008
[12:00:22] no crashes so far :)
[12:01:18] config-master.w.o and grafana.w.o work fine through cp1008 (default pass)
[12:01:31] phab.wmfusercontent.org too
[12:08:20] depooling cp1067 and testing the changes there
[12:21:24] cache_text sites keep on working fine, cache_misc requests to varnish-be go to the right place (eg: phabricator)
[12:21:30] (and no crashes yet!)
[12:26:22] it's harder to test the whole fe->be part though, as requests get routed to backends that don't have the changes applied yet
[12:26:56] perhaps it would be useful to be able to specify a backend hint through a special header, accepted from trusted_nets only or so
[12:36:50] well in the long term, we probably want backends (ATS) handling the decision rather than frontends anyways
[12:42:23] frontends still need to choose which cache backend to use though
[12:43:50] right, but if the ATS backend layer is universal, there's no real decision-code, just a constant "chash over these N backends" for all traffic
[12:44:52] anyways, I think I misread your last line as meaning for real traffic, I now think you meant for debugging purposes
[12:44:58] correct :)
[12:46:04] it would be nice to choose one of those N backends for testing purposes, in case we apply a change to that one only
[12:46:35] yeah
[12:49:27] btw, this is also languishing a bit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/434055
[12:49:33] anyways, so far I've verified that varnish-be on cp1067 works for enwiki, phabricator and phab.wmfusercontent.org
[12:50:02] int responses (eg: TLS redirect) from varnish-fe also work fine for various domains
[12:50:12] I picked it apart back in Barcelona and it seems solid on basic principles as a first step (we might want to change how we threshold, at a later time, etc)
[12:50:52] and the code is mostly-tested (in vagrant I think), and there wasn't a better way to structure it in terms of which VCL hooks it used and how, that I could find.
[12:51:04] just needs some final review and shepherding through deployment
[12:51:21] sounds good
[12:53:53] [added a nitpick while I had it open]
[12:55:54] bblack: unless you have specific tests in mind to perform on cp1067 alone, I'd enable puppet on the rest of text-eqiad
[12:57:09] well, repool cp1067 first and keep an eye on it
[13:02:13] what sucks is that any test traffic may pollute FEs with bad responses from un-puppeted BEs
[13:02:22] (including cross-dc)
[13:02:39] yeah
[13:03:01] so far I've been trying pass-y things mostly (eg: phab, https://it.wikipedia.org/wiki/Genova?x=1)
[13:03:30] so I guess core DCs first minimizes the damage, and hopefully nobody's misdirecting misc-domain requests to the nodes anyways
[13:03:40] I can't think of any other tests than what you've run
[13:04:15] the main ??? in my head is the tradeoff between rolling out slowly to give time for some new crashy clusterfuck to appear on less than all nodes, vs rolling out quicker to avoid pollution issues.
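A minimal VCL sketch of the debug backend-hint idea floated above (12:26:56 / 12:46:04): the header name, backend names, port, ACL range and the use of the shard director are all illustrative assumptions, not the actual contents of directors.frontend.vcl.

```vcl
vcl 4.0;

import directors;

# hypothetical backend definitions; port 3128 assumed for varnish-be
backend be_cp1065 { .host = "cp1065.eqiad.wmnet"; .port = "3128"; }
backend be_cp1067 { .host = "cp1067.eqiad.wmnet"; .port = "3128"; }

acl trusted_nets {
    "10.0.0.0"/8;   # placeholder for the real trusted ranges
}

sub vcl_init {
    # stand-in for the production chash director over the N cache backends
    new cache_local = directors.shard();
    cache_local.add_backend(be_cp1065);
    cache_local.add_backend(be_cp1067);
    cache_local.reconfigure();
}

sub vcl_recv {
    if (client.ip ~ trusted_nets && req.http.X-Debug-Backend == "cp1067") {
        # debugging aid: pin the request to the one backend carrying the new change
        set req.backend_hint = be_cp1067;
    } else {
        unset req.http.X-Debug-Backend;
        # normal path: consistent hashing over all backends (keyed on URL here)
        set req.backend_hint = cache_local.backend(by=URL);
    }
}
```

With something like this, a trusted tester could send `X-Debug-Backend: cp1067` to exercise only the freshly-patched backend, while all other traffic keeps the usual chash behaviour.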
[13:04:51] but I guess, so long as cache_text traffic is working fine, we can just not care if cache_misc results are polluted, and wait a few days before we start moving IPs or DNS, etc...
[13:04:54] pollution issues shouldn't be *that* bad as no misc DNS entry points to text yet
[13:05:09] right
[13:05:51] I'd say go ahead and finish up eqiad today, so we have a solid 2-layer we can test against by hitting eqiad-fe directly
[13:06:08] awesome
[13:06:33] repooling 1067 and keeping an eye on it for some minutes to see if real traffic messes things up
[13:06:48] the other minor complication to keep in mind as we get to the next steps: text's IP is in high-traffic1 and misc's is in high-traffic2.
[13:07:10] ordinarily I'd say move the IP to the text cluster first, then change DNS later, then remove the legacy IP much later
[13:07:34] but given it's misc (heh), and the above, we might be wiser to simply leave the IPs alone and make the change solely at the DNS layer.
[13:07:47] (and if someone has hardcoded IPs past cache_misc decom, well, they shouldn't have)
[13:08:20] we can look for traffic a few days after the DNS move anyways, to double-check
[13:09:35] semi-related: it sounds like cp1075-99 are almost ready for us to start banging on. I'm gonna poke at the hardware layer once one of them is available and see if I can sort out the basic issues with the new disks and new NICs
[13:10:03] but the timing of that and misc->text still seem fairly-well aligned, which changes which eqiad nodes we use for ATS testing as well.
[13:10:34] site.pp relevant to that, now:
[13:10:38] node /^cp20(0[39]|15|21)\.codfw\.wmnet$/ { # ex-cache_maps, earmarked for experimentation...
[13:10:51] node 'cp1046.eqiad.wmnet', 'cp1047.eqiad.wmnet', 'cp1059.eqiad.wmnet', 'cp1060.eqiad.wmnet' { # ex-cache_maps, earmarked for experimentation...
[13:11:05] those were the 8 nodes saved off (from cache_maps) for the ATS initial test clusters
[13:11:55] and with the eqiad hardware refresh, it doesn't include any new nodes for cache_misc, and cp1071-74 are the newest of the old nodes, so those would be temporarily cache_misc until this misc->text thing is done.
[13:12:29] but if the timing aligns, we can instead shift all of this a layer, and go ahead and decom those cp10(46|47|59|60), and use 71-74 for ATS instead.
[13:12:52] yeah that would be good
[13:15:36] ah, I forgot to merge the VTC tests! That would be a useful thing to... test :)
[13:22:06] :)
[13:24:10] all green (except for 16-normalize-path.vtc which is broken for unrelated reasons and we should fix at some point)
[13:29:17] random varnishlog output looks ok, no crashes after > 20 minutes
[13:29:58] if there are no issues in the next 10 minutes I'll proceed with the rest of text-eqiad
[13:44:49] still no drama, applying the patch elsewhere in text-eqiad
[13:54:42] patch applied, we can now test cache_misc things with 208.80.154.224
[13:55:32] websockets are known not to work yet, we need to poke at nginx for that
[13:55:53] but grafana, config-master and such should work fine
[13:56:43] doh, segfault on cp1068 :(
[14:00:47] seems like the only one so far, and I've caused it
[14:00:50] investigating
[14:03:47] what was the trigger?
[14:04:31] bblack: a specific request it seems, see cp1068:/root/2018-07-24.panic
[14:11:29] it seems to at least think the vcl was warm
[14:12:04] something happened I guess at fetch-time
[14:12:48] connection workspace was NULL too, that seems questionable
[14:13:43] the ws part of http_conn?
[14:14:06] yeah
[14:14:23] interesting, yes
[14:16:48] so another interesting point, is that I think the crashing-point is probably right near where our old extrachance patch was (is?)
[14:17:12] vbe_dir_gethdrs() is likely where that backtrace ends in, I think (somewhere close anyways!)
[14:18:59] I wonder if this was a "force a fresh connection due to retry" sort of case, or somehow related
[14:21:14] retries = 0, failed = 0, flags = {do_stream, do_pass, uncacheable}
[14:21:31] that seems to rule out the retry scenario ^
[14:23:27] still
[14:23:31] I just got another repro, exactly the same request
[14:23:40] (which I'm gonna stop trying now)
[14:23:51] there's some overlap in the two patches that touch that area around GetHdrs -> ...
[14:24:22] the 0009-force-fresh and 0010-extrachance-retries
[14:24:53] I wonder if there's some bug already in that area in our current set of code, and the alternate vcl stuff is just what is needed to hit the case
[14:25:36] possibly!
[14:25:57] was the new panic in the same backtrace?
[14:27:17] yes, see 2018-07-24.panic-2
[14:27:47] I donno, I think maybe I'm off on a distracting tangent that may be unrelated here
[14:27:57] I'm not finding more evidence as I go, just less
[14:29:57] different backend chosen and different memory offset are the only real diff, otherwise it's essentially an identical panic
[14:30:05] yup
[14:30:15] I guess that's good. if it were kinda random, then the crash is probably far from the source where some corruption occurs
[14:30:49] also good: we seem to have a very precise way to reproduce
[14:31:05] objcore[fetch]->boc->state = invalid
[14:31:46] that seems (a) smelly in general (unless "invalid" is what it's supposed to temporarily be right when we're in the crashing code?) and (b) seems to trigger a memory of some other recent patchwork/bugs we looked at?
[14:32:07] maybe boc just isn't fleshed out until headers are done receiving though
[14:33:46] it'd be nice if we had a more-precise backtrack
[14:33:48] *backtrace
[14:34:13] I'm guessing it's function-pointer indirections that cause the lack of function names there
[14:34:16] I'm gonna install varnish-dbg, restart, and see if we get one
[14:34:25] ok
[14:34:49] nevermind, it's installed already
[14:39:25] I wonder if this is somehow related to where and how backends/directors are defined wrt the two VCLs?
[14:39:40] are they all defined in the main vcl, or redefined in the alt vcl for the misc backends, or?
[14:39:56] oh but this is an fe crash, it should be all the same, hmmm
[14:41:02] are the directors/backends for the be stuff (random, and chash) named conflictingly between the two VCLs, or maybe need different naming?
[14:42:20] they're defined twice, which we should likely avoid, once in wikimedia-common_misc-backend.inc.vcl and once in wikimedia-common_text-backend.inc.vcl:
[14:43:18] right, defined twice with the same naming and setup, too
[14:43:30] I'm not even sure how backends/directors map to VCLs anyways
[14:44:26] could try as a repro test: disable puppet on 1068, make a second copy of directors.frontend.vcl for cache_misc, change the naming slightly (e.g. prepend "x_" for all the labeling), change the misc include and misc references to it manually
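A hedged sketch of what that renamed copy for the misc (labeled) VCL could look like: the x_ names, backend hosts, port and director types are assumptions for illustration, not the real directors.frontend.vcl contents.

```vcl
# hypothetical x_-prefixed copy, so the misc VCL no longer shares
# backend/director symbol names with the main text VCL
vcl 4.0;

import directors;

backend x_be_cp1065 { .host = "cp1065.eqiad.wmnet"; .port = "3128"; }
backend x_be_cp1067 { .host = "cp1067.eqiad.wmnet"; .port = "3128"; }

sub vcl_init {
    # chash-style director, renamed from cache_local
    new x_cache_local = directors.shard();
    x_cache_local.add_backend(x_be_cp1065);
    x_cache_local.add_backend(x_be_cp1067);
    x_cache_local.reconfigure();

    # random director, renamed from cache_local_random
    new x_cache_local_random = directors.random();
    x_cache_local_random.add_backend(x_be_cp1065, 1.0);
    x_cache_local_random.add_backend(x_be_cp1067, 1.0);
}

sub vcl_recv {
    # misc VCL references edited to match the new names
    set req.backend_hint = x_cache_local.backend(by=URL);
}
```

If the crash disappears with fully disjoint names and returns with the shared ones, that would support the false-sharing theory about identically-named backends across the two loaded VCLs.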
prepend "x_" for all the labeling), change the misc include and misc references to it manually [14:45:11] or alternately and simpler, try not defining them in misc at all (stop including directors.frontend.vcl there) [14:45:23] if either approach fixes it, it's a pointer [14:46:11] let's try [14:46:59] the second approach may of course just not compile because they need to be defined in the vcl they're referenced in, I donno [14:47:31] correct : [14:47:46] Symbol not found: 'cache_local.backend' (expected type BACKEND): [14:48:01] ok [14:48:37] so try making a separate copy, with x_cache_local, x_cache_local_random, and x_be_cp10... labeling, and edit misc to match? [14:49:21] what I'm thinking about, is we know from the normal single-VCL case, there's some kind of sharing of shared backends that match, IIRC [14:49:37] maybe there's some kind of false sharing here between backends defined in different VCLs that breaks things, or something [14:49:49] maybe, yes! [14:49:59] unfortunately I have to go out for an errand soon :( [14:50:02] ok [14:50:24] as far as we know, the crash is confined to the misc case anyways [14:50:36] if someone starts abusing it, I guess I can revert changes from today [14:51:01] it should be enough to drop cache::alternate_domains from hieradata/role/common/cache/text.yaml [14:51:29] ok [14:51:36] the rest of the DCs are puppet-disabled right? [14:51:41] correct [14:51:52] ok [14:53:25] I should be back before 18:30 CEST [14:54:05] re-enabling puppet on 1068 for the time being [16:02:46] 10netops, 10Operations: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10ayounsi) p:05Triage>03Normal [16:26:33] I'm back! [16:28:43] btw re: the webp patch I mentioned earlier, we should at least wait for gilles to get back from vacation first so he can observe deploy [16:28:53] (I kinda forgot tha tpart) [16:29:22] I'm in a meeting now, and I haven't done anything functional or useful re: varnish while you were away [16:29:33] np [17:39:47] ok, I've disabled alternate domains support on text for now and left it enabled on cache_canary [17:39:55] re-enabling puppet on all cache nodes [19:24:37] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) [20:15:12] 10Traffic, 10Analytics-Kanban, 10DNS, 10Operations, and 5 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10JAllemandou) [20:31:24] FYI, about to depool eqsin for cr1-eqsin planned software upgrade [20:31:28] bblack: ^ [20:51:08] Confirmed oob/console works, waiting a bit more for traffic to drain, will then restart the router [21:26:00] Router seem healthy, still taking a while to the routing daemon to process the ton of new prefixes to learn [21:38:11] like we saw previously, ~15min while the routing daemon is at full speed, back to normal [21:43:26] Will repool eqsin in 15min if everything stays quiet [23:01:59] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Sakretsu) For the record, IP and registered users are still reporting this issue from mobi...