[04:57:10] we've just had a brief 503 spike in upload@ulsfo https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&panelId=2&fullscreen&from=1506657939219&to=1506660260286&var-site=ulsfo&var-cache_type=upload&var-status_type=5
[04:57:40] the culprit in this case is cp4021's frontend, cp4021 being the only host left running with numa_networking true
[05:10:03] ok, let's see what happens unsetting max_connections on text@esams backends
[05:35:06] re: cp4024, yeah, they need a manual (or wait-for-cron) stapling run
[05:35:11] before pooling, after a long downtime
[05:35:51] I was watching some stats on 4021; it had a higher loadavg than 4022 too, and very different profiles in other performance senses
[05:36:22] yeah
[05:36:36] much better to only keep one node with numa enabled, I think
[05:36:38] so yeah, hopefully it's just the NUMA stuff
[05:36:55] we can also back off to "numa_networking: on" instead of "isolate"
[05:37:02] yup
[05:37:09] my suspicion at this point is that "isolate" just doesn't work out with our current hw
[05:37:33] (since the only eth+disk are all stuck on node0, putting a heavy process like varnish-be that uses both over on node1 causes problems)
[05:37:45] on the esams front, I've disabled max_connections and I'm varnishlogging all !PURGE requests (-w ~ema/2017-09-29.varnishlog)
[05:38:20] as a delta from cp402x, we'd want the disk controller moved to node1, and a second eth port on node1 (separate IP for varnish-be)
[05:38:21] tricky not to run out of disk space though :)
[05:38:26] or something like that, to make the "isolate" style work
[05:39:30] perf top was showing some filesystem and blockdevice junk higher than it should be, vs 4022, too. which further backs up the theory that the problems stem from varnish-be's cross-node disk traffic (and spread from there to affect other things)
[05:40:19] anyways, back on the subject of 4024: it probably needs a post-puppet reboot too, if you haven't done one already
[05:41:32] just rebooted it again, yeah
[05:43:07] ok
[05:43:46] in that case, probably ok to pool it up
[05:45:14] re: 4021, I think the evidence against "isolate" is pretty clear at this point
[05:45:43] maybe bump it down to "on" in hieradata -> puppetize -> reboot, and let it be a singular comparison point for that rather than the probably-busted "isolate"
[05:46:28] sounds good!
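[Editor's note: for context on the max_connections experiment above — in VCL, max_connections is a per-backend cap on concurrent backend connections, and "unsetting" it means removing the attribute from the backend definition so connections are uncapped. A minimal sketch follows; the backend name, host, port, and the 1000 cap are hypothetical illustrations, not the actual text@esams configuration.]

```vcl
vcl 4.0;

# Hypothetical backend definition; name, address, and cap are
# illustrative, not the real text@esams values.
backend be_text {
    .host = "appserver.example.internal";
    .port = "80";
    # With this set, fetches beyond the cap fail fast (one possible
    # source of 503s); deleting this line removes the limit entirely.
    .max_connections = 1000;
}
```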
[06:42:06] heh, of course now that I'm looking nothing happens
[06:46:39] re-enabling max_connections on a subset of the nodes (cp304[0-3])
[07:34:00] so, no significant issues this morning
[07:34:03] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&from=1506658581540&to=1506669253612
[07:34:26] enabling puppet on the remaining ones (cp303[0-3])
[11:56:46] heh, yet another ensure=>absent failure that wasn't expected, this time it's sysfs stuff
[11:56:54] (luckily, no reboots needed to fix these)
[11:58:18] oh wait, no, there is no "absent"; it's my failure, not modules/sysfs
[11:58:56] (unless it should be fully-managing /etc/sysfs.d/)
[12:00:29] hmmm, it should be
[12:37:41] ah, I get it now, slow morning heh
[12:38:30] /etc/sysfs.d/ is declared as purge/recurse in sysfs::init, which in turn is generally only invoked indirectly, from sysfs::conffile and/or sysfs::parameters
[12:39:08] therefore, only as long as you still have one or more sysfs things defined does the sysfs module manage the dir and remove files that no longer apply
[12:39:32] if you remove the last/only sysfs::parameters from a node, the corresponding file in /etc/sysfs.d/ is not removed
[12:47:13] hey :)
[12:47:46] codfw's LVSs upgraded to the latest pybal; also, we now have pybal metrics from ulsfo/codfw in prometheus
[13:49:14] I'm starting to mess around with the data: https://grafana.wikimedia.org/dashboard/db/pybal
[14:56:30] does do_stream=false set CL? yes :) https://phabricator.wikimedia.org/P6059
[16:02:51] netops, Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3646919 (ayounsi)
[16:29:19] netops, Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3647025 (ayounsi)
[17:04:04] ema: next question in that line of thinking: given N layers of varnish cache, if the bottom-most generates the CL header by setting beresp.do_stream=off in vcl_backend_response(), but the other intermediate caches leave the default beresp.do_stream=on ...
[17:05:22] does the generated CL get preserved and seen up through the stack and at the client?
[17:06:53] (also, does it go back to being a chunked/streamed transfer for the other layers, just one that carries an additional "unnecessary" CL header? or does it just become a functionally-streaming but non-chunked response for the rest of the stack? (just the async part of streaming - not waiting on full responses before forwarding traffic through))
[17:14:40] even if all that works out well, we probably need to fine-tune cache_misc's implementation a bit before spreading it around. it seems pretty naive with just the !CL check; maybe it should be checking beresp.status too, or something.
[19:08:46] netops, Operations: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3647470 (BBlack) RFC 8205 (BGPSec) got published this week, which will use RPKI to secure against bad route announcements by signing UPDATE messages - https://tools.ietf.org/html/rfc8205
[20:04:42] Traffic, Operations, hardware-requests, ops-ulsfo, Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3647562 (BBlack) a: BBlack > RobH
[20:05:51] Traffic, Operations, hardware-requests, ops-ulsfo, Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3622887 (BBlack) @RobH - these are good to go for decom now.
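[Editor's note: to make the cache_misc approach discussed above concrete, here is a hedged VCL sketch of the "naive !CL check" plus the suggested beresp.status refinement. This illustrates the idea only; the exact condition is an assumption, not the actual production VCL.]

```vcl
vcl 4.0;

sub vcl_backend_response {
    # If the origin didn't send a Content-Length, buffer the whole body
    # (do_stream = false) so varnishd can compute and emit CL itself,
    # rather than passing a chunked response up the stack.
    # The status check is the refinement suggested above ("maybe it
    # should be checking beresp.status too"), not confirmed config.
    if (!beresp.http.Content-Length && beresp.status == 200) {
        set beresp.do_stream = false;
    }
}
```

[Per the paste referenced at 14:56 (P6059), the layer that does this does set CL; the open question in the 17:04 discussion is whether intermediate layers that keep do_stream=on preserve that CL up to the client.]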
They're still booted, but have been depooled, removed from confd/lvs/etc, re-roled in p...
[22:23:09] https://medium.com/netflix-techblog/serving-100-gbps-from-an-open-connect-appliance-cdb51dda3b99