[09:09:25] 10Traffic, 10Beta-Cluster-Infrastructure, 06Operations, 13Patch-For-Review, 07WorkType-Maintenance: beta cluster varnish cache can't apt-get upgrade nginx-full: nginx: [emerg] unknown "spdy" variable - https://phabricator.wikimedia.org/T134362#2279766 (10hashar) 05Open>03Resolved That was a transient... [12:24:28] 10Traffic, 06Operations: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2280319 (10BBlack) [12:24:49] 10Traffic, 06Operations: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2280333 (10BBlack) [12:24:53] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2280332 (10BBlack) [12:27:29] 10Traffic, 06Operations: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2280338 (10BBlack) [12:28:10] 10Traffic, 06Operations: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2280353 (10BBlack) [12:28:13] 10Traffic, 06Operations: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2280352 (10BBlack) [13:35:53] 10Traffic, 06Operations: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2280645 (10BBlack) [13:40:30] bblack: can't believe you actually have a patch for this already [13:45:23] paravoid: it's a blind patch, I have no idea if it compiles or works yet. more like "this is what I think things might look like" [13:46:17] also, it's exposing all the ugly corner issues in our abstractions for cache::foo and tlsproxy, and how they mismatch nginx's per-server vs per-site settings :P [13:46:55] I think for the real thing, we'll probably want to whitelist what hostnames are allowed to do the websocket upgrade too (at least in varnish, maybe in nginx too) [13:47:27] (server hostnames I mean, since there are many unrelated services on cache_misc, and maybe some of them don't deal with upgrade->pipe well) [13:52:24] what's the downside of having it enabled for all, though? [13:52:39] worst case an Upgrade will pass-through and get rejected by the backend, no? [13:53:33] I would hope so :) [13:53:49] we may find we have to set idle disconnect timeouts longer for websockets, too [13:54:15] but I hope not, our defaults are ~3 mins [13:54:27] and I think websocket server apps can send ping frames (say once a minute) to keep things alive [13:54:56] but everything around websockets is a little grey-ish, it's not a very consistent standard in practice [13:56:36] if the magic upgrades work well and don't interfere with normal traffic flow over the same hostnames/ports, maybe we can get phab to re-use its existing ports/services for its websocket thing too [13:57:00] might be nicer than setting up a separate hostname/IP/service/etc [13:57:58] nod [14:11:44] both cp1065 and cp1066 rebooted fine, no mdadm issues [14:13:25] \o/ [14:16:39] bblack: did you by any chance see more varnishstatsd issues in the past weeks? T132430 [14:16:39] T132430: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430 [14:18:05] I haven't gone looking for them, but no [14:19:17] alright! DC preferences to begin the misc upgrade? How about esams?
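A minimal sketch of the ping-frame keepalive idea from 13:54 above, assuming the third-party Python `websockets` library; the echo handler, port and 60-second interval are illustrative, not what rcstream or Phabricator's notification daemon actually do:

```python
# Hypothetical websocket server that pings clients once a minute, so that
# otherwise-idle connections never sit silent long enough for a ~3 minute
# proxy idle-disconnect timeout to fire.
import asyncio
import websockets  # third-party; pip install websockets

async def handler(ws, path=None):
    # Trivial echo handler; the keepalive is handled by the library below.
    async for message in ws:
        await ws.send(message)

async def main():
    # ping_interval=60: send a protocol-level ping frame once a minute.
    # ping_timeout=20: give up on the connection if no pong comes back.
    async with websockets.serve(handler, "0.0.0.0", 8080,
                                ping_interval=60, ping_timeout=20):
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```

If the frontend idle timeouts really are around 3 minutes, a once-a-minute ping like this should keep quiet connections alive without any timeout changes on the cache side.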
[14:21:39] sounds ok to me [14:22:03] esams, ulsfo, codfw, eqiad [14:30:10] mmmh I've depooled cp3007 but it still gets frontend traffic [14:31:24] ok, can you pause there a sec? [14:31:28] sure [14:31:36] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2280887 (10Dzahn) >>! In T112765#2242776, @BBlack wrote: > ... or if phab's websocket stuff would be on a completely different public hostname and/or backend server... [14:38:05] ema: yeah so, etcd data is definitely right (cp3007 depooled), but ipvsadm is not updated on lvs3002 [14:38:17] pybal there is still running, but it's no longer processing etcd updates [14:38:36] I wish we had a check for this, or that it just killed itself completely if the etcd thread died [14:39:28] the current pid 8373 started on Apr 27, and we have full log output since then [14:39:43] <_joe_> bblack: uhm this is indeed annoying [14:39:57] <_joe_> truth is any error in that "thread" should be recovered [14:40:38] well [14:40:53] the scary thing is, for pid 8373 in journal output, there's no indication of any failure [14:41:06] <_joe_> bblack: which pybal is this? [14:41:06] it starts up, goes through the usual Initialization Complete from loading the initial lists [14:41:19] then just normal runtime random check failures, but no etcd-related outputs/crashes [14:41:23] lvs3002 [14:41:24] <_joe_> is it still running? [14:41:26] yep [14:41:37] <_joe_> because I suspect an ipvsadm shellout fuckup [14:41:41] <_joe_> won't be the first time [14:41:43] checking 3004 [14:41:48] <_joe_> and we have a way to verify that [14:42:16] <_joe_> bblack: which pool is that? [14:42:17] ok so lvs3004 has the failure [14:42:28] cache_misc pool, it should have depooled cp3007 from nginx,varnish-fe [14:42:47] lvs3004 has: May 03 17:00:04 lvs3004 pybal[15612]: Unhandled error in Deferred: .... [14:42:59] with an ssl handshake failure for etcd [14:43:06] <_joe_> curl localhost:9090/pools/misc_weblb_80 [14:43:06] <_joe_> cp3007.esams.wmnet: enabled/up/pooled [14:43:22] <_joe_> ok so it's an etcd failure [14:43:30] no [14:43:39] root@palladium:~# confctl --tags dc=esams,service=varnish-fe,cluster=cache_misc --action get 're:cp3007.*' [14:43:42] {"cp3007.esams.wmnet": {"pooled": "no", "weight": 1}, "tags": "dc=esams,cluster=cache_misc,service=varnish-fe"} [14:43:47] it's a pybal etcd-fetcher failure [14:43:54] <_joe_> bblack: pybal-etcd failure, yes [14:43:57] ok [14:43:59] <_joe_> sorry I wanted to say that [14:44:15] so lvs3004 looks kind of like what I expect [14:44:25] we see an exception in the journal output, and it didn't recover or kill itself [14:44:38] lvs3002 is scarier: it also is missing the update, but has no exception output [14:45:04] <_joe_> ok, can you put all of this in a ticket? I don't have time to look at pybal _now_ [14:45:14] <_joe_> but I'll hopefully have in the next few days [14:45:21] on lvs3002 there are a couple of 'Memory allocation problem', perhaps unrelated [14:45:28] where? 
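The check bblack wishes existed at 14:38:36 could look roughly like the sketch below: compare confctl's (etcd) desired state with what the running pybal reports on its status port. The pool/service names and the output formats being parsed are taken from the commands quoted in this log and may differ across conftool/pybal versions.

```python
# Hedged monitoring sketch: flag hosts where etcd (via confctl) and a running
# pybal disagree about pooled state, i.e. the silent failure seen on lvs3002.
import json
import subprocess
import urllib.request

def confctl_state(dc, cluster, service):
    """Desired pooled state per host according to etcd, via confctl."""
    out = subprocess.check_output([
        "confctl", "--tags",
        "dc={},service={},cluster={}".format(dc, service, cluster),
        "--action", "get", "re:.*",
    ]).decode()
    state = {}
    for line in out.splitlines():
        # One JSON object per host, e.g.
        # {"cp3007.esams.wmnet": {"pooled": "no", "weight": 1}, "tags": "..."}
        obj = json.loads(line)
        for host, attrs in obj.items():
            if isinstance(attrs, dict) and "pooled" in attrs:
                state[host] = attrs["pooled"] == "yes"
    return state

def pybal_state(pool, host="localhost", port=9090):
    """Live state per host according to pybal's HTTP status endpoint."""
    url = "http://{}:{}/pools/{}".format(host, port, pool)
    state = {}
    with urllib.request.urlopen(url, timeout=10) as resp:
        for line in resp.read().decode().splitlines():
            # e.g. "cp3007.esams.wmnet: enabled/up/pooled"
            name, _, flags = line.partition(": ")
            state[name] = flags.endswith("/pooled")
    return state

if __name__ == "__main__":
    desired = confctl_state("esams", "cache_misc", "varnish-fe")
    live = pybal_state("misc_weblb_80")
    for h in sorted(set(desired) & set(live)):
        if desired[h] != live[h]:
            print("MISMATCH {}: etcd pooled={}, pybal pooled={}".format(
                h, desired[h], live[h]))
```

Anything printed as MISMATCH is exactly the condition being debugged here: etcd says depooled, pybal still has the host pooled in ipvsadm.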
[14:45:39] journalctl _PID=8373 | grep Mem [14:45:41] oh at startup [14:45:45] that's normal [14:46:07] alright [14:46:28] if it runs an ipvsadm command that fails for some reason (say, the service already exists and it's a no-op), ipvsadm's way of signalling that is to say "Memory allocation failure", which is braindead [14:46:41] and the pybal just logs that with no indication of where it came from, but that's what it is [14:47:38] <_joe_> bblack: that will all be better (well, solved) when I have time to finish https://gerrit.wikimedia.org/r/#/c/272679/ [14:47:51] _joe_: honestly, if we don't even have a reliable way to detect that pybal has stopped updating from etcd, how do we trust any depool operation? does train/scap stuff depend on that already? [14:49:01] _joe_: I'll take care of the ticket [14:49:02] <_joe_> bblack: nope [14:49:29] <_joe_> bblack: the whole etcd integration is a bit shaky, I would love to make it better [14:49:46] <_joe_> but again, pybal would need some dedicated engineering time [14:50:22] <_joe_> I was envious of google where the "OIT" team could dedicate 4 engineers full time for 6 months to write seesaw [14:50:35] <_joe_> which is sort of similar to pybal [14:51:04] ok, next time around I'm making this a quarterly goal to fix all the high-prio pybal code issues [14:51:49] for now, I'm going to restart pybal on 3004 + 3002 [14:52:03] ema: we can check the first depool on the others and confirm if they need restarts too, when we get to each DC [14:52:17] OK [14:52:19] and I guess tail the pybal log to confirm the other 3 machines as we go too [14:52:31] <_joe_> bblack: honestly, I think the etcd integration should've used the python client for etcd with defertothread [14:52:44] <_joe_> (since python-etcd is blocking) [14:53:06] <_joe_> and just not try to build a somewhat-working driver itself [14:53:22] <_joe_> there are a full array of things it would gain from using python-etcd [14:53:33] <_joe_> including retries, failure management, etc [14:53:52] well yeah that's all great, as is the state machine and netlink, etc [14:54:02] but for now I just want: if it stops working, something should scream at us [14:54:16] right now it's silent and deadly [14:54:22] <_joe_> that's the way to solve the bugs, but I agree, screaming should be enough :P [14:57:40] bblack: did you restart pybal? cp3007 looks actually depooled now [14:58:01] yeah [14:58:14] I have log tails open on both pybals to see what happens with repool and next depool too [14:58:26] is a pybal restart usually enough to fix these issues? [14:58:45] ema: yeah, the problem is pybal restarts aren't exactly a safe/routine operation [14:59:14] but usually, if things are working ok (and you may want to verify that they are), restarting the secondary, then waiting a minute or so, then restarting the primary, is reasonably-ok, as long as it's not done too often [14:59:23] it will interrupt some user connections with RST :/ [14:59:47] oh I see [15:00:19] perhaps in these scenarios, a better plan of action would be: [15:00:27] I was wondering the same when I had to restart pybal in codfw after a puppet upgrade [15:00:43] stop pybal on secondary, wait a short while, restart pybal on primary, then wait a while longer, then start pybal on secondary [15:01:02] if we know both machines are actually working fine, and it's just a question of kicking pybal... 
[15:01:27] the mechanism here is that they both advertise the service IPs to the hardware routers via BGP [15:01:43] but the routers also have a fallback that if neither one is speaking BGP, they static fall back to the primary [15:02:44] if you stop the primary while the secondary is running, routing flips over to the secondary, and probably RST most active connections, etc [15:03:13] but if we stop the secondary first, then restart the primary, I think routing would stay on the primary and not RST [15:03:41] the other factor in all of this is that it takes a short delay after pybal startup before it even begins speaking BGP, which is async from service-start command [15:04:07] so yeah... [15:04:37] secondary-stop, wait say 30s to be sure, primary-restart, wait say 1 minute to be very very sure BGP is back, secondary-start, would be a better sequence that's less user-disruptive [15:04:50] or watch logs for BGP [15:05:25] but then again, I don't see the normal BGP info in the logs anymore, I think that's something we lost when we tried to suppress pybal log spam in general a while back :/ [15:05:45] I see BGP log entries on shutdown, but not on startup [15:10:38] 10Traffic, 06Operations, 10Pybal: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2281050 (10ema) [15:18:24] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281073 (10TheDJ) [15:21:49] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281083 (10BBlack) Can we get some more debugging details from the browser? Is there some way you can ask it for more detail about the nature of the failure when it happens? There was a similar report from... [15:24:27] ema: in any case, we just have to be careful and watch LVS/pybal, but otherwise proceed [15:24:44] proceeding [15:31:18] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281170 (10Danny_B) I can confirm that most of the cases I remember happened when POSTing (save/preview page, save preferences, filtering on special pages...), can't guarantee it was //only// POST though...... [15:32:24] bblack: https://gerrit.wikimedia.org/r/#/c/287971/ [15:33:40] ema: LGTM [15:34:04] ema: I'd save time and do those by-dc for the next ones maybe [15:34:27] as long as puppet's disabled before the merge [15:35:10] sounds good [15:44:01] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281216 (10Samtar) Just to tag on, also Europe (UK) and also on POST events - it's very periodic (twice today through a couple of hours editing) [15:44:15] looks like the upgrade worked on cp3007 [15:44:32] some of the vtc tests are failing, perhaps because of vcl changes during the past weeks? [15:46:54] probably! [15:47:02] does the VCL even compile? [15:47:23] not only it compiles, it even works! [15:47:26] ok [15:47:31] what's the vtc test failing? [15:47:36] git.w.o [15:47:46] eg: 02-git.w.o-x-forwarded.vtc [15:47:57] EXPECT resp.status (301) == "200" failed [15:48:21] oh maybe those tests predate all the changes for HTTPS on misc-web? 
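The less-disruptive restart ordering described between 15:00 and 15:05 (stop the secondary, wait, restart the primary, wait for BGP, start the secondary), sketched as a script. The hostnames are placeholders, and the `pybal` systemd unit plus plain ssh invocation are assumptions about how these boxes are managed:

```python
# Sketch of the secondary-stop / primary-restart / secondary-start sequence,
# with settle times so BGP has a chance to come back before the next step.
import subprocess
import time

def pybal(host, action):
    # Assumes ssh access and a systemd unit literally named "pybal".
    subprocess.check_call(["ssh", host, "systemctl", action, "pybal"])

def safe_restart(primary, secondary,
                 settle_after_stop=30, settle_after_restart=60):
    # Stop the secondary first: the primary keeps announcing the service IPs,
    # so traffic does not flip over and active connections are not RST.
    pybal(secondary, "stop")
    time.sleep(settle_after_stop)
    # While neither pybal speaks BGP, the routers fall back to the static
    # route pointing at the primary, so traffic stays put during the restart.
    pybal(primary, "restart")
    time.sleep(settle_after_restart)  # pybal starts BGP async; be very sure
    pybal(secondary, "start")

if __name__ == "__main__":
    # Placeholders: confirm which LVS host is actually primary for the
    # service before running anything like this.
    safe_restart(primary="lvs3002.esams.wmnet",
                 secondary="lvs3004.esams.wmnet")
```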
[15:48:44] maybe [15:48:54] so, this request works [15:48:56] curl -v -I -k -H 'Host: etherpad.wikimedia.org' https://localhost [15:49:05] using git instead of etherpad doesn't [15:49:30] I think gitblit is still alive [15:49:30] as in: curl hangs [15:49:57] oh, wait then after some seconds the request goes through [15:50:35] git.wm.o is very very very slow [15:50:52] is VTC hitting the live server backend? I thought it used mock backends [15:51:01] VTC isn't [15:51:11] it isn't hitting the live server [15:52:28] ok so, after the last date of commits to the tests, git.wm.o did have HTTPS changes [15:53:48] it's now using varnish-level HTTPS redirect/enforcement [15:54:26] so your client req needs X-Forwarded-Proto: https if it doesn't want to get a 301 from varnish [15:55:00] mmh it looks like a different problem [15:55:14] it makes sense [15:55:22] EXPECT resp.status (301) == "200" failed [15:55:37] you sent the VCL a query for git.wm.o without XFP: https, so it's sending the client a 301->HTTPS [15:56:09] it should look like the etherpad test does (which already had HTTPS redirect support) [15:56:43] since those tests were last updated, we removed all redirect exceptions [15:56:59] every service on cache misc now requires XFP: https from nginx (or test client) or it gets 301->HTTPS [15:57:08] indeed adding XFP is enough to make the 02- test pass [15:57:50] the server s1 expectations are still the same though. we still hack in X-Forwarded-Port, and git.wm.o still expects both headers and gets them [16:01:15] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281288 (10BBlack) Without a more-detailed error message or some kind of trace of the connection attempt, it's difficult to get to the bottom of this. There are a thousand reasons a secure connection can f... [16:13:00] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281369 (10BBlack) As a random experiment, perhaps some of those reporting could try this in FF 46.0.1? 1. Type 'about:config' in the URL bar (it will probably pop up a warning about voiding your warranty... [16:14:55] mmh, 01-basic-caching.vtc still fails. We expect a 404 when no Host header is specified (or when the domain is not served by misc), but we get a 200 instead from varnishtest [16:15:28] (wikimedia_misc-backend.vcl this time) [16:16:22] curling :3128 we get the expected behavior though [16:20:31] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281416 (10BBlack) Note the above would test the hypothesis that we're hitting this: https://www.ruby-forum.com/topic/6878264 [16:22:35] ema: yeah that seems odd... [16:23:22] ema: does it fail both c1 and c2 (no host, host==example.com)? [16:23:31] yep [16:23:53] note that varnishtest *does* add a host header even if you don't specify any, but we should get 404 anyways (it adds Host: 127.0.0.1) [16:24:32] **** s1 0.6 rxhdr| GET / HTTP/1.1\r\n [16:24:32] **** s1 0.6 rxhdr| X-Forwarded-For: 127.0.0.1, 127.0.0.1\r\n [16:24:32] **** s1 0.6 rxhdr| X-CDIS: miss\r\n [16:24:32] **** s1 0.6 rxhdr| Accept-Encoding: gzip\r\n [16:24:32] **** s1 0.6 rxhdr| X-Varnish: 1002\r\n [16:24:34] **** s1 0.6 rxhdr| Host: 127.0.0.1\r\n [16:24:41] **** s1 0.6 txresp| HTTP/1.1 200 OK\r\n [16:24:41] **** s1 0.6 txresp| Content-Length: 0\r\n [16:24:59] hmmmm [16:25:25] CL:0 is suspicious. perhaps this is badly-interacting with the HTTPS redirect code? 
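A quick way to re-check the X-Forwarded-Proto behaviour from 15:52–15:57 against a live frontend, mirroring the curl checks above; a sketch that assumes the varnish frontend is listening on local port 80 and that git.wikimedia.org is a representative cache_misc host:

```python
# Without X-Forwarded-Proto: https the frontend VCL should answer with a
# 301 to HTTPS; with the header set, the request should pass through.
import http.client

def status_for(host_header, xfp=None, varnish="localhost", port=80):
    conn = http.client.HTTPConnection(varnish, port, timeout=30)
    headers = {"Host": host_header}
    if xfp:
        headers["X-Forwarded-Proto"] = xfp
    conn.request("HEAD", "/", headers=headers)
    status = conn.getresponse().status
    conn.close()
    return status

if __name__ == "__main__":
    assert status_for("git.wikimedia.org") == 301               # forced to HTTPS
    assert status_for("git.wikimedia.org", xfp="https") != 301  # passes through
```

These are the same two requests the 02-git.w.o-x-forwarded.vtc test needs to send: the old expectation of a 200 without XFP is exactly what changed when the redirect exceptions were removed.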
[16:26:38] I donno, it seems to work fine in practice [16:26:44] it does [16:26:48] something must be amiss with the test? [16:27:39] it could be, but IIRC the test was Working Fine [16:28:03] yeah [16:28:45] uh, it works on my test misc instance in labs... [16:29:12] this is weird [16:29:39] well the VTC tests don't go through the live daemon at all, right? they spawn separate varnishd? [16:29:50] correct [16:32:07] how are you running them on cp3007? [16:32:22] varnishtest $filename (as root) [16:33:39] **** v1 0.4 CLI RX| Unused sub cluster_be_recv_applayer_backend, defined:\n [16:33:46] which is the part that does the 404-ing [16:34:23] <% if @cache_route == 'direct' -%> // tier-one caches must select an applayer backend call cluster_be_recv_applayer_backend; [16:34:26] <% else -%> [16:34:29] ah! [16:34:38] the backend tests don't work on all backends, they only work on direct backends (eqiad) [16:34:45] that's why it works on my test instance [16:34:46] in configuration terms, anyways [16:35:24] you could set the direct flag in tests that expect direct behavior, when templating out the test VCL, I guess? [16:35:41] this gets into the whole messy ordeal of how exactly we define what we're testing though [16:35:54] exactly, we had similar discussions in the past [16:36:03] ideally our varnishtest run should test all the VCL scenarios on any run, not just the VCL that's applied to the current host in some sense [16:36:16] and some tests will be specific to what layer of backend we're dealing with [16:36:47] I guess there are two approaches to that really, but I like that one [16:37:10] varnishtest run tests its own node's VCL, and CI infra has to do varnishtest in the context of several distinct nodes in different clusters/DCs [16:37:45] vs varnishtest runs tests on all valid combinations of VCL variables like "direct", thus hitting the VCL for all types of nodes in one run. CI infra can run it abstractly and not specifically for a node. [16:38:28] (making it more like python unit tests, rather than node-specific puppet-compiler checks) [16:38:54] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281529 (10Dzahn) list of IPs that still show up now.. and the names they resolve to: | 10.68.17.70 | integration-slave-precise-1011.integration.eqiad.wmflabs. | 10.68.16.5... [16:39:06] anyways, I'd say carry on and ignore the VTC tests for now if we have explanations for the failures [16:40:00] it's hard to reason about the long term of the 'direct' flag in VCL right now anyways [16:40:17] really that we have a 'direct' flag set in backend VCL based on what datacenter you're in is a temporary hack [16:40:28] in the longer term, direct access is going to be a very dynamic thing [16:41:05] (even a frontend in esams will sometimes contact the applayer directly for pass-traffic, and thus run the 'direct' code blocks to do necessary final transforms for the applayer and select the correct applayer host) [16:41:41] cp3007 is serving traffic and looks good [16:41:56] pybal logs look sane [16:42:03] nice! [16:42:47] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10AlexMonk-WMF) Ran `root@rt1:~# dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia` on rt1.servermon.eqiad.wmflabs (10.68.19.15) Puppet is broken on t...
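The "run every VCL flavour in one go" idea from 16:36–16:38 could be wired up roughly like this. Everything here is hypothetical scaffolding, not existing tooling: the template filename, the @CACHE_ROUTE@ placeholder standing in for the ERB templating, and the -D macros that the .vtc files would consult.

```python
# Sketch: render the backend VCL once per value of the direct/tiered knob and
# run the whole VTC suite against each rendering, unit-test style, instead of
# only testing the flavour deployed on the current node.
import glob
import subprocess
import tempfile

def render_vcl(template, direct):
    # Stand-in for the puppet/ERB templating; in the real VCL,
    # @cache_route == 'direct' decides whether the applayer backend
    # selection (cluster_be_recv_applayer_backend) is compiled in.
    return template.replace("@CACHE_ROUTE@", "direct" if direct else "esams")

def run_suite(direct):
    with open("wikimedia_misc-backend.vcl.tmpl") as f:
        rendered = render_vcl(f.read(), direct)
    with tempfile.NamedTemporaryFile("w", suffix=".vcl", delete=False) as tmp:
        tmp.write(rendered)
    for test in sorted(glob.glob("tests/*.vtc")):
        # Each .vtc can consult ${direct} to pick the right expectations
        # (e.g. 404-on-unknown-host only applies to direct backends).
        subprocess.check_call([
            "varnishtest",
            "-Dvcl_path=" + tmp.name,
            "-Ddirect=" + ("1" if direct else "0"),
            test,
        ])

if __name__ == "__main__":
    for direct in (True, False):
        run_suite(direct)
```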
[16:44:53] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281550 (10Krenair) I can't log in to phab-01.phabricator.eqiad.wmflabs (10.68.16.201), even as root. Maybe someone with access to the labs salt master can get in. [16:45:07] bblack: I'll stop here for today and continue with the other hosts in esams tomorrow morning if nothing breaks [16:45:41] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281552 (10fgiunchedi) I did `graphite-labs.graphite.eqiad.wmflabs` and `graphite1.graphite.eqiad.wmflabs` [16:47:55] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281561 (10Dzahn) integration-raita can be disregarded. that was fixed by hashar. i think it just needs a little more time to disappear from the UI but there is no new data [16:48:48] ema: ok, cya! [16:48:59] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281578 (10Krenair) >>! In T115330#2281529, @Dzahn wrote: > | 10.64.48.132 | 3(NXDOMAIN) `templates/wmnet:822:wmf4727-test 1H IN A 10.64.48.132` - resolving that h... [16:49:37] bye! [16:54:32] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281594 (10Dzahn) wlmjurytool2014.wlmjurytool.eqiad.wmflabs. - killed gmond, there is puppet fail about starting ganglia-monitor and i think it's self-hosted master. but gm... [17:18:51] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281737 (10Samtar) @BBlack completely understand, I'll try the above (Win 8.1 Pro + FF 46.0.1) and report back - saying that, I've spent the last couple of minutes trying to force it to happen (both before... [17:21:53] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281758 (10Dzahn) >>! In T115330#2281542, @AlexMonk-WMF wrote: > Ran `dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia` I did the same on the last couple ins... [17:53:21] 10Traffic, 10DNS, 10Mail, 06Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2281892 (10Dzahn) 05Open>03Resolved a:03Dzahn Now it has an SPF record. ``` ;; QUESTION SECTION: ;phabricator.wikimedia.org. IN TXT ;; ANSWER SECTION: phabr... [17:53:37] 10Traffic, 10DNS, 10Mail, 06Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2281896 (10Dzahn) a:05Dzahn>03Mschon [18:52:08] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282229 (10hashar) [18:53:28] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10hashar) I have moved @Dzahn list of IP/FQDN to the task detail ( https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-4wg7fhgbo3bwvli/ ) this wa... 
[18:54:28] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282254 (10hashar) [18:55:04] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10hashar) [18:57:17] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282265 (10Dzahn) [18:58:28] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10Dzahn) [19:00:20] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282270 (10Dzahn) [19:01:51] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10Dzahn) Maybe somebody in Analytics could take care of limn and maintenance.analytics? [19:26:05] 10Traffic, 06Operations, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2282393 (10Aklapper) [19:52:23] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2282469 (10BBlack) I should've noted above: if you apply the manual HTTP/2 disable, please don't forget to turn it back on later after sufficient testing! [20:05:27] 10Traffic, 06Operations, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2282530 (10BBlack) I've merged the first two patches, which are really pre-patches from this ticket's POV. There's some interaction between this work and T107749 , so I'll put the detail... [20:30:19] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282655 (10BBlack) I've been reviewing and re-testing a bunch of related things today. There are several inter-mixed issues and I'm not even going to try to separate the... [20:38:19] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2282696 (10BBlack) In the patch you mention separate cache-control headers for 'error' responses. What kinds of error responses? Are these 500s? [20:56:47] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282789 (10BBlack) And of course, there are still elevated random 503's on text, like before. Need to confirm if it's unrelated and coincidental (unlikely), or which of... [21:05:13] 10netops, 06Operations: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2282802 (10faidon) [21:16:58] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282869 (10BBlack) FWIW, the random 503s look like this on GET of plain article URLs (and other things, of course): ``` 421 VCL_return c hash 421 VCL_call c mis... [21:21:09] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282886 (10BBlack) Reverting just the HTTP/1.1 nginx patch makes the 503s go away (still using upstream module).... 
needs more digging in more-isolated testing.... [21:28:25] 10netops, 06Operations: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2282892 (10faidon) [21:30:57] 10netops, 06Operations: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2282907 (10faidon) This is Juniper case [[ https://casemanager.juniper.net/casemanager/#/cmdetails/2016-0510-0764 | 2016-0510-0764 ]] now. [21:51:35] 10netops, 06Operations: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2282958 (10faidon) p:05High>03Normal Link up since 21:12:38Z. Waiting to hear from Zayo about the root cause and if it was on their side. [22:27:40] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283170 (10Yurik) @BBlack, the graphoid service now sets 3600 maxage on success, and 300 maxage on failure: ```Cache-Control: public, s-maxage=3600,... [22:35:35] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283201 (10BBlack) @Yurik: Well, we can talk about longer cache lifetimes later. For something new it's fine. But my earlier question still stands:... [22:45:15] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283231 (10Yurik) @bblack, those are actually 400 errors, e.g. [[ https://www.mediawiki.org/api/rest_v1/page/graph/png/Extension%3AGraph%2FDemo/21130... [23:24:23] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2283386 (10mmodell) @dzahn: yes but it can run on the same hardware as the www service.
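For illustration only (graphoid is a Node service, and this is not its code): the split caching policy from the last few comments, with a long s-maxage on successful renders and a short one on 400-class errors, boils down to something like this toy handler. The quoted task comment truncates the full Cache-Control value, so only the s-maxage parts stated there are reproduced.

```python
# Toy handler showing per-status Cache-Control: cache good renders for an
# hour, but let 400-class errors expire after five minutes.
from http.server import BaseHTTPRequestHandler, HTTPServer

SUCCESS_CC = "public, s-maxage=3600"  # 3600 maxage on success (per the task)
ERROR_CC = "public, s-maxage=300"     # 300 maxage on failure

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ok = self.path.startswith("/render/")  # hypothetical routing rule
        self.send_response(200 if ok else 400)
        self.send_header("Cache-Control", SUCCESS_CC if ok else ERROR_CC)
        self.send_header("Content-Length", "0")
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```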