[11:53:00] extrachance PR opened https://github.com/varnishcache/varnish-cache/pull/2135, 4.1.3-1wm4 built and uploaded to carbon
[11:53:29] (that's the one including the gethdr_extrachance parameter)
[13:46:22] bblack: I'm gonna test the wm4 upgrade on cp1008
[13:48:39] ema: ok :)
[14:28:46] hmmm I wonder what happened with cache_maps lately. Looks like the maps team must've enabled something big for prod use, but I don't remember news about it.
[14:29:28] their avg reqrate multiplied by 8x or so starting ~1w ago
[14:31:51] bblack: from yesterday's ops meeting: "maps: mapframe activated on ruwiki"
[14:34:17] I guess I missed it! :)
[14:46:36] ema: I'm still staring at the vcl_synth/vcl_backend_error thing. I think the current change is an improvement, but:
[14:49:16] 1) in vcl_synth, the "> 400 && !413" conditional on emitting the synthetic response seems wrong. If it goes via vcl_synth it's synthetic and we're not trying to preserve some error response from a backend. >= 400 would make sense, to avoid rewriting 2xx and 3xx that are internally-generated, though.
[14:50:24] right, I think we removed the conditional already in the past
[14:50:34] and then we brought it back when reverting the change
[14:51:17] I don't have a succinct (2), but there's still something wrong in the big picture
[14:51:38] at least part of the reason we have those synth error pages is to normalize backend errors to all look the same to the user style-wise.
[14:51:59] in the old v3 vcl_error, we were catching e.g. 404 or 501 from backends and re-synthesizing too, I think.
[14:52:17] which in v4 would have to happen in vcl_backend_response, if we want to *replace* [45]xx content with our standardized output
[14:52:43] (which is where at least the 413 exception makes sense - for some reason we wanted backend 413 to show to user directly)
[14:52:56] the 400 one I still don't get though, maybe the reason for it has been lost
[14:53:43] either way
[14:54:34] in v4: vcl_synth is truly-synthetic (we generated the response code in VCL). vcl_backend_error is only for a fetch failure (no valid HTTP response from immediate backend for whatever reason) or failing max_restarts in vcl_backend_response (which we'll never do, because we retry once in frontends and zero times in backends)
[14:54:47] a valid 404 or 501 from the backend never hits either path
[14:56:16] bblack: we did ask mar.k about the 400/413 exception but that was introduced a few days before his vacation to thailand :)
[14:56:42] from my irc logs:
[14:56:44] > presumably i wanted to let any existing error message from upstream (apache/mw) through instead of generating the varnish error, but don't remember why
[14:56:50] I'm ok with keeping the 413 exception. I know it has something to do with content uploads (e.g. images to commons), and it seems legit there could be a reason not to hide the details of that message
[14:57:06] but failing to overwrite a 400 doesn't sound like something I can explain
[14:58:07] either way, there's 3 potential changes from all of the above rambling:
[14:58:28] 1. in vcl_synth, the conditional for synthetic error body should just be >= 400 (I'm pretty sure about that)
[14:59:09] 2. if we want varnish to overwrite all applayer error output bodies with its own, something like the > 400 && !413 -> synth logic needs to exist in vcl_backend_response itself
[14:59:19] 3. maybe the above should be >= 400 && !413
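(A minimal VCL 4 sketch of change (1) above, for illustration only: the vcl_synth sub and resp variables are standard Varnish 4, but the HTML body here is a placeholder, not the actual cluster error template.)

    sub vcl_synth {
        # (1): anything we synthesized ourselves at >= 400 gets the standard
        # error body; internally-generated 2xx/3xx (e.g. redirects) pass through.
        # Backend-generated 4xx/5xx never reach vcl_synth, so no 400/413
        # exemptions are needed here.
        if (resp.status >= 400) {
            set resp.http.Content-Type = "text/html; charset=utf-8";
            synthetic({"<!DOCTYPE html>
    <html><body><h1>"} + resp.status + " " + resp.reason +
                {"</h1><p>Placeholder for the standard error template.</p></body></html>"});
        }
        return (deliver);
    }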
[15:04:59] although you can't call synthetic("FOO") from vcl_backend_response, only vcl_backend_error or vcl_synth
[15:06:37] and there's no sane way to get from vcl_backend_response->vcl_backend_error
[15:07:47] well, there's kind-of a way with max_retries
[15:08:25] ok let's rewind a bit. What's the goal here? Fixing up X-CDIS, bringing clarity to _synth/_backend_error, or both?
[15:08:50] this is all after the X-CDIS patch, unrelated
[15:09:08] the goal in (1) is correctness. if we synthesize a 400, it should have our standard error output, ditto 413.
[15:09:38] sounds reasonable
[15:09:41] 2/3 is about whether varnish should standardize (overwrite) error bodies on application-generated [45]xx
[15:10:09] I think, in the past, that was what we were doing in at least some cases? but the history there is complex and decisions questionable
[15:10:39] the current varnish4 code doesn't overwrite them, ever, AFAICS
[15:10:54] how about we go ahead with (1), then monitor the current 400s/413s and see how to proceed wrt (2) and (3)?
[15:11:42] well we're already exempting 400 and 413 from the applayer from body-rewrite, we don't need those conditionals anymore for that purpose (because we're exempting all applayer [45]xx from rewrites implicitly)
[15:12:21] the question is whether we should bring back some past concept of overwriting applayer [45]xx bodies, and exempt 400 and 413 from that for hysterical raisins
[15:13:08] there's a ticket somewhere about standardizing
[15:13:45] https://phabricator.wikimedia.org/T76560 is related
[15:14:39] I thought there was a related ticket somewhere showing examples of differing HTML output from various sources for 5xx (e.g. MW, Varnish, whatever-else)
[15:15:41] I don't see it under traffic anymore, though
[15:17:45] anyways, let's just leave it as it is. synthetic() is only for synthetic(), and backend errors show through. that's probably the best generic way to do things anyways, otherwise we get bogged down in the details of which error responses from which backends need to not be overwritten.
[15:18:26] overwriting in varnish doesn't seem like a great way to standardize [45]xx outputs anyways. Those standards can be templates used by app devs. Or someone can raise it as a new issue if they want.
[15:21:00] yep
[15:21:45] meanwhile, I'm checking if search results for 'vcl site:phabricator.wikimedia.org' and the likes are tagged as traffic
[15:22:12] I found that search engines are better than phab search sometimes :)
[15:22:30] s/sometimes//
[15:24:54] ema: you done on cp1008 for now?
[15:25:16] bblack: yes the upgrade went fine (and the new param does show up in varnishadm)
[15:25:35] awesome
[15:25:41] so actually we can start upgrading/rebooting whenever you like
[15:26:19] ok
[15:26:27] have you checked if the new kernel package is everywhere?
[15:27:37] yeah it's upgraded on all cache nodes
[15:27:43] ok
[15:28:11] I'm gonna manually try the vcl_synth stuff on cp1008 right quick, just in case it fails to compile or blows up
[15:28:58] alright
[15:30:30] cp1008 is running the new kernel already for the record
[15:37:51] lgtm
[15:38:06] (I broke appservers.svc backend on cp1008 to test connectfail)
[15:38:56] which ends up with a 503 after waiting 5s for connect timeout, with: X-Cache: cp1008 int, cp1008 miss and X-Cache-Status: int
[15:39:25] unbreaking/puppeting
[15:50:00] bblack: can I start upgrading and rebooting maps?
[15:50:15] ema: yes! :)
[15:50:21] alright then! :)
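(For contrast with vcl_synth, a bare-bones vcl_backend_error sketch: this is the path the cp1008 connectfail test above exercises, a fetch failure with no valid HTTP response from the backend, and the only other place synthetic() can be called. Again, the body is a placeholder and this is not the production VCL.)

    sub vcl_backend_error {
        # only reached on fetch failure (connect timeout/refused, malformed
        # response, etc.); a valid 4xx/5xx from the applayer never gets here,
        # so its body shows through to the client unmodified.
        set beresp.http.Content-Type = "text/html; charset=utf-8";
        synthetic({"<!DOCTYPE html>
    <html><body><h1>"} + beresp.status + " " + beresp.reason +
            {"</h1><p>Placeholder for the standard error template.</p></body></html>"});
        return (deliver);
    }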
[16:02:42] ok cp2015 came back online fine, I've bumped the icinga downtime from 420s to 480s to take into account package upgrade time
[16:02:49] carrying on with the rest of maps
[16:49:08] random TLS stuff: our 1-week averages for FS+AEAD are over 90% for the past week or two. been slowly growing forever of course, just a random milestone crossed on the way.
[16:49:33] the progress on that is more about the rest of the world upgrading browsers than about anything we do, although we do configure better than most to maximize what we can about that number.
[16:50:35] but now we can really say (without various asterisks) that 10% or less of the client population/requests out there are using sub-optimal crypto, when negotiating with an optimally-configured site like ours.
[16:51:42] there's a further ~9% that are using "not great, but not horrible" middle-ground crypto too (FS, but still CBC-mode).
[16:52:25] and then there's still the slowly-declining ~0.4% that don't do forward secrecy at all
[16:55:09] a year ago the FS+AEAD number was ~78%, and the crappy non-FS number was about ~0.6%
[17:41:14] ema: re traffic-pool on cp1047
[17:41:15] https://phabricator.wikimedia.org/T148723
[17:41:25] ^ it was reimaged recently, that's the change vs other servers
[17:41:51] perhaps on fresh reimage, we're not ending up with traffic-pool started by-default, but existing nodes still do
[17:42:04] that also explains why ssh failed :)
[17:42:16] maps just finished btw
[17:42:23] puppet intentionally doesn't start the unit, but it's supposed to install it such that it starts auto on boot
[17:42:33] oh you mean it went to PXE and reimaged itself again?
[17:42:48] bblack: nope, I mean my ssh loop skipped it
[17:43:05] oh ok
[17:43:30] so you did that one, but it wasn't in your loop because you had it out before?
[17:43:57] anyways, looking at the traffic-pool situation there
[17:44:15] Nov 15 17:34:14 cp1047 sh[1816]: ERROR:conftool:Error when trying to set/pooled=yes on cp1047.eqiad.wmnet
[17:44:16] Nov 15 17:34:14 cp1047 sh[1816]: ERROR:conftool:Failure writing to the kvstore: Backend error: Compare failed : [30175 != 30184]
[17:44:20] yeah
[17:44:26] so, not an issue with init stuff
[17:44:31] etcd failure :P
[17:44:43] yeah I'm pleasantly surprised it was the only one
[17:44:43] so, fix pooling and start the unit
[17:45:11] that's a new kind of etcd failure too
[17:45:12] _joe_: ^
[17:45:20] I did fix pooling already, unit started right now
[17:45:42] ok
[17:45:57] <_joe_> bblack: uhm let me check the etcd status
[17:46:01] puppet doesn't arm the unit automatically because it could mess with manual operations
[17:46:15] <_joe_> bblack: that looks like a race condition as I do CAS based on revisions
[17:46:45] <_joe_> so something else modified the same value between when I read it and when it tried to write
[17:46:54] seems likely!
[17:46:57] upgrading/rebooting misc
[17:47:01] should be a case we handle though, right?
[17:47:13] <_joe_> bblack: probably yes
[17:47:33] did we ever do something about the raft internal error thing?
[17:47:45] <_joe_> remember conftool is at the moment one of my "120% time" projects, so it's likely missing a lot of polish
[17:47:49] <_joe_> bblack: nope
[17:47:55] 120% time lol
[17:49:10] _joe_: so the CAS you're talking about is driven by the client (conftool)?
[17:49:21] <_joe_> yes
[17:49:42] I wonder if we even need something like CAS there
[17:49:44] <_joe_> it reads the value, does some work, then tries to write if needed
[17:49:55] as opposed to just a blind write at the end
[17:50:01] <_joe_> bblack: it's a failsafe, but yes it's debatable
[17:50:21] failsafe against multiple conftools (or similar tools) making concurrent updates
[17:50:30] <_joe_> yes
[17:50:55] I'd argue there's nothing fundamentally wrong with concurrent updates in this case. if they're concurrently changing from no=>yes and one wins, the other can just succeed and pointlessly write a second time.
[17:51:08] <_joe_> yes
[17:51:17] and if there's a real-world race with differing desired values, well. either way it's a failure to someone somewhere.
[17:51:47] if the other side had been delayed another 0.1s you'd never observe the CAS failure and still end up in the situation you were trying to prevent
[17:52:08] (that you see no=>yes in your output, but someone/thing else changed it to no by the time you saw the output)
[17:53:03] I wonder what it was racing with in this case anyways, though.
[17:54:07] <_joe_> bblack: I still have to confirm, though
[17:55:47] <_joe_> bblack: argh
[17:55:48] <_joe_> Nov 15 17:17:32 conf1001 etcd[618]: the clock difference against peer f7dc0a60b27bbf3 is too high [1.029440001s > 1s]
[17:56:23] <_joe_> can someone else look into it?
[17:56:37] <_joe_> I was taking a break :)
[17:56:43] <_joe_> or I'll do in 20
[18:00:46] oh, nice
[18:15:50] <_joe_> bblack: any more failures?
[18:16:00] not that I know of
[18:16:08] <_joe_> it would be interesting to understand what happened there with ntp and all
[18:16:16] the clock thing is interesting. probably post-reboot, NTP hadn't synced up yet when traffic-pool invoked conftool
[18:16:17] <_joe_> at the moment I see no more such errors
[18:16:29] it was right after reboot, which is common in this case
[18:16:36] <_joe_> bblack: it's an etcd error, those machines have been up for 19 days
[18:16:40] (it's a service that comes up on boot and executes conftool)
[18:16:45] <_joe_> the client-side clock doesn't count
[18:16:57] oh, I thought this was client<->server clock
[18:17:07] <_joe_> one of the servers drifted more than 1 second compared to the others
[18:17:11] ok
[18:17:14] <_joe_> and that's worrying, tbh
[18:17:23] did we reboot the recdns/ntp machines today?
[18:17:28] or apply another ntp.conf change?
[18:17:38] <_joe_> no idea
[18:17:42] <_joe_> moritzm: ^^
[18:17:50] nah hydrogen has been up 6 days
[18:17:58] and its ntpd similarly
[18:18:16] oh...
[18:18:21] conf1001 is virtual isn't it?
[18:18:24] no, no recdns reboots today
[18:18:33] <_joe_> nope it's a physical server
[18:18:38] in fact no reboots at all from my side
[18:18:42] are any of the pool virtual?
[18:19:01] ntpd and VMs don't play well historically
[18:27:23] <_joe_> bblack: nope
[18:27:44] <_joe_> anyways, i'm in the 12th hour being around, it's enough I guess :)
[18:28:46] I'm planning to test-drive systemd-timesyncd BTW: https://phabricator.wikimedia.org/T150257
[18:33:14] on cp3009's bios:
[18:33:15] Error: Memory initialization warning detected
[18:33:43] then the boot process stops, "Strike the F1 key to continue, F2 to run the system setup program"
[18:34:14] nothing happens pressing either key though
[18:34:16] wasn't that one already-dead?
[18:34:34] https://phabricator.wikimedia.org/T148422
[18:34:41] it's supposed to be depooled
[18:34:49] you have to hit F1 to get it through this, though
[18:34:57] it's Esc and then !
[18:35:05] Esc, Shift+1
[18:35:39] oh yeah you're right!
[18:36:21] nothing to see then :)
[18:49:39] misc done
[18:51:03] bblack: tomorrow morning I'll work on text and upload unless you want to go ahead first :)
[18:51:17] ema: ok sounds awesome :)
[19:26:15] netops, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2764445 (RobH)
[19:47:37] netops, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2796865 (RobH)
[21:00:38] netops, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2797084 (jcrespo)
[21:01:11] netops, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2764445 (jcrespo) Open>Resolved a:jcrespo This is done, only pending tasks are...
[21:38:24] netops, Labs, Labs-Infrastructure, Operations, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2797271 (Andrew)
[21:42:28] netops, Labs, Labs-Infrastructure, Operations, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2797284 (Andrew)
[21:43:47] netops, Labs, Labs-Infrastructure, Operations, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773828 (Andrew)
[21:44:56] netops, Labs, Labs-Infrastructure, Operations, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2797286 (Andrew)
[21:50:37] what's sanitarium?
[21:54:37] a database
[21:55:20] or better, a database role, Platonides
[21:55:32] ah
[22:05:00] thanks, volans
[22:08:26] feel free to ask if you want more info :)
[22:09:25] not sure how up-to-date it is but you can find some more info on https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs
[22:09:30] Platonides: ^^^
[22:12:17] :)