[04:57:10] we've just had a brief 503 spike in upload@ulsfo https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&panelId=2&fullscreen&from=1506657939219&to=1506660260286&var-site=ulsfo&var-cache_type=upload&var-status_type=5
[04:57:40] the culprit in this case is cp4021's frontend, cp4021 being the only host left running with numa_networking true
[05:10:03] ok, let's see what happens unsetting max_connections on text@esams backends
[05:35:06] re: cp4024, yeah, they need a manual (or wait-for-cron) stapling run
[05:35:11] before pooling, after a long downtime
[05:35:51] I was watching some stats on 4021; it had a higher loadavg than 4022 too, and very different profiles in other performance senses
[05:36:22] yeah
[05:36:36] much better to only keep one node with numa enabled, I think
[05:36:38] so yeah, hopefully it's just the NUMA stuff
[05:36:55] we can also back off to "numa_networking: on" instead of "isolate"
[05:37:02] yup
[05:37:09] my suspicion at this point is that "isolate" just doesn't work out with our current hw
[05:37:33] (since the only eth+disk are all stuck on node0, putting a heavy process like varnish-be that uses both over on node1 causes problems)
[05:37:45] on the esams front, I've disabled max_connections and I'm varnishlogging all !PURGE requests (-w ~ema/2017-09-29.varnishlog)
[05:38:20] as a delta from cp402x, we'd want the disk controller moved to node1, and a second eth port on node1 (separate IP for varnish-be)
[05:38:21] tricky not to run out of disk space though :)
[05:38:26] or something like that, to make the "isolate" style work
[05:39:30] perf top was showing some filesystem and blockdevice junk higher than it should be, vs 4022, too. which further backs up the theory that the problems stem from varnish-be's cross-node disk traffic (and spread from there to affect other things)
[05:40:19] anyways, back on the subject of 4024: it probably needs a post-puppet reboot too, if you haven't done one already
[05:41:32] just rebooted it again, yeah
[05:43:07] ok
[05:43:46] in that case, probably ok to pool it up
[05:45:14] re: 4021, I think the evidence against "isolate" is pretty clear at this point
[05:45:43] maybe bump it down to "on" in hieradata -> puppetize -> reboot, and let it be a singular comparison point for that rather than the probably-busted "isolate"
[05:46:28] sounds good!
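[Editor's note: for context on the max_connections experiment above — in VCL, max_connections is a per-backend cap on concurrent backend connections, and "unsetting" it means removing the attribute from the backend definition so connections are uncapped. A minimal sketch follows; the backend name, host, port, and the 1000 cap are hypothetical illustrations, not the actual text@esams configuration.]

```vcl
vcl 4.0;

# Hypothetical backend definition; name, address, and cap are
# illustrative, not the real text@esams values.
backend be_text {
    .host = "appserver.example.internal";
    .port = "80";
    # With this set, fetches beyond the cap fail fast (one possible
    # source of 503s); deleting this line removes the limit entirely.
    .max_connections = 1000;
}
```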
[06:42:06] heh, of course now that I'm looking nothing happens
[06:46:39] re-enabling max_connections on a subset of the nodes (cp304[0-3])
[07:34:00] so, no significant issues this morning
[07:34:03] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&from=1506658581540&to=1506669253612
[07:34:26] enabling puppet on the remaining ones (cp303[0-3])
[11:56:46] heh, yet another ensure=>absent failure that wasn't expected, this time it's sysfs stuff
[11:56:54] (luckily, no reboots needed to fix these)
[11:58:18] oh wait, no, there is no "absent"; it's my failure, not modules/sysfs
[11:58:56] (unless it should be fully-managing /etc/sysfs.d/)
[12:00:29] hmmm, it should be
[12:37:41] ah, I get it now, slow morning heh
[12:38:30] /etc/sysfs.d/ is declared as purge/recurse in sysfs::init, which in turn is generally only invoked indirectly, from sysfs::conffile and/or sysfs::parameters
[12:39:08] therefore, only as long as you still have one or more sysfs things defined does the sysfs module manage the dir and remove files that no longer apply
[12:39:32] if you remove the last/only sysfs::parameters from a node, the corresponding file in /etc/sysfs.d/ is not removed
[12:47:13] hey :)
[12:47:46] codfw's LVSs upgraded to the latest pybal; also, we now have pybal metrics from ulsfo/codfw in prometheus
[13:49:14] I'm starting to mess around with the data: https://grafana.wikimedia.org/dashboard/db/pybal
[14:56:30] does do_stream=false set CL? yes :) https://phabricator.wikimedia.org/P6059
[16:02:51] netops, Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3646919 (ayounsi)
[16:29:19] netops, Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3647025 (ayounsi)
[17:04:04] ema: next question in that line of thinking: given N layers of varnish cache, if the bottom-most generates the CL header by setting beresp.do_stream=off in vcl_backend_response(), but the other intermediate caches leave the default beresp.do_stream=on ...
[17:05:22] does the generated CL get preserved and seen up through the stack and at the client?
[17:06:53] (also, does it go back to being a chunked/streamed transfer for the other layers, just one that carries an additional "unnecessary" CL header? or does it just become a functionally-streaming but non-chunked response for the rest of the stack? (just the async part of streaming - not waiting on full responses before forwarding traffic through))
[17:14:40] even if all that works out well, we probably need to fine-tune cache_misc's implementation a bit before spreading it around. it seems pretty naive with just the !CL check; maybe it should be checking beresp.status too, or something.
[19:08:46] netops, Operations: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3647470 (BBlack) RFC 8205 (BGPSec) got published this week, which will use RPKI to secure against bad route announcements by signing UPDATE messages - https://tools.ietf.org/html/rfc8205
[20:04:42] Traffic, Operations, hardware-requests, ops-ulsfo, Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3647562 (BBlack) a: BBlack > RobH
[20:05:51] Traffic, Operations, hardware-requests, ops-ulsfo, Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3622887 (BBlack) @RobH - these are good to go for decom now.
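[Editor's note: to make the cache_misc approach discussed above concrete, here is a hedged VCL sketch of the "naive !CL check" plus the suggested beresp.status refinement. This illustrates the idea only; the exact condition is an assumption, not the actual production VCL.]

```vcl
vcl 4.0;

sub vcl_backend_response {
    # If the origin didn't send a Content-Length, buffer the whole body
    # (do_stream = false) so varnishd can compute and emit CL itself,
    # rather than passing a chunked response up the stack.
    # The status check is the refinement suggested above ("maybe it
    # should be checking beresp.status too"), not confirmed config.
    if (!beresp.http.Content-Length && beresp.status == 200) {
        set beresp.do_stream = false;
    }
}
```

[Per the paste referenced at 14:56 (P6059), the layer that does this does set CL; the open question in the 17:04 discussion is whether intermediate layers that keep do_stream=on preserve that CL up to the client.]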
They're still booted, but have been depooled, removed from confd/lvs/etc, re-roled in p...
[22:23:09] https://medium.com/netflix-techblog/serving-100-gbps-from-an-open-connect-appliance-cdb51dda3b99