[04:14:25] netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237613 (Dzahn)
[04:15:00] netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237601 (Dzahn) btw, what is the right phab tag for fundraising-tech ?
[04:16:45] netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237615 (Dzahn)
[04:19:02] netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237616 (Dzahn) pfw means "payments firewall" [[ https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions | naming conventions ]], so this is a Fundraising Tech issue
[04:20:03] netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237601 (Cmjohnson) These hosts are being decommissioned.
[04:23:44] netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237622 (Dzahn) p:Triage>Low thanks @cmjohnson. lowering prio.
[07:12:18] netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237745 (ayounsi) Open>Resolved a:ayounsi >>! In T164554#3237613, @Dzahn wrote: > btw, what is the right phab tag for fundraising-tech ? https://phabricator.wikimed...
[10:38:05] Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238212 (ema) p:Triage>Normal
[12:54:56] Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238201 (BBlack) How timely! The subject of how to do completely-seamless reloads (especially for TCP) is quite thorny.
I've been pondering it and fighting with the issues for years on the UDP side for gd...
[12:57:27] Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238580 (BBlack) Also note from that lengthy post - if we were willing to test the scalability of iptables on cache hosts (which we've avoided for fear that it won't scale over cores like the rest of what w...
[13:24:53] Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238690 (BBlack) Hmmm another thing - when we first deployed this OCSP updating method, GlobalSign was giving us 8-hour OCSP validity windows. At present (just checked) we're getting 4-day validity from Gl...
[13:45:59] netops, Monitoring, Operations, Patch-For-Review, Scap (Scap3-Adoption-Phase1): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#3238719 (akosiaris) Open>Resolved After a year and 2 months, I can finally happily resolve this. scap3 is now used to deploy librenms,...
[13:59:12] all cache hosts upgraded to varnish 4.1.6 \o/
[14:00:20] I was looking for a decent way of adding dropped/overlimit/requeue stats to prometheus (tc -s -d qdisc show dev eth0|grep dropped) and ended up in an endless spiral that led me to using netlink in golang
[14:01:52] \o/
[14:02:09] well at your first line anyways
[14:02:13] :)
[14:03:21] so how did netlink in golang go?
:)
[14:04:40] hehe I haven't really made friends with github.com/mdlayher/netlink yet, which is what node_exporter uses for wifi stats
[14:05:12] it doesn't look impossible, but it also doesn't look trivial so perhaps using prometheus textfile exporter with a cron job would be easier as a start
[14:05:41] btw on the nginx ocsp cron stuff, apparently in git history I noticed the change from 8h to 4d for globalsign back in october and already adjusted the icinga check for it heh
[14:05:50] just not the hourly cron
[14:06:00] ok
[14:06:45] so yeah daily reload instead?
[14:06:56] or is there any way to detect whether a reload is needed in the first place?
[14:07:06] right, at least reduces the impact by 1/24 or so
[14:07:14] the reload is always needed to update the ocsp file
[14:07:32] err
[14:07:39] right, at least reduces the impact to 1/24 of what it was before :)
[14:30:42] I just validated all the mem sizing results, only cp4010 doesn't have the new malloc sizing out of text+upload (fixed now)
[14:30:57] maps/misc missed it of course, so doing a splayed out restart of all of those
[14:31:32] sounds good!
[14:45:35] does the modules-load.d approach work fine on cp* to address the ipvs sysctl race we noticed a while ago?
for conntrack sysctl settings on at least the kafka brokers it's ineffective: https://phabricator.wikimedia.org/T136094#3238925
[14:46:37] moritzm: AFAIR on lvs hosts it did the trick, yes
[14:47:51] I'm wondering if that's actually atomic enough, modules-load.service will trigger the loading of the kernel module in the kernel, but I doubt that it actually waits to return until the module load has been completed
[14:48:47] really our puppet runs should enforce the settings anyways (not that that's the complete answer either, but it would help)
[14:49:04] (as in check the live values and re-set if necc on each run)
[14:49:36] I think it only invokes the command if the files changed on-disk right now, I don't think it queries live values
[14:50:12] yeah
[14:50:48] it really sucks that systemd-sysctl is so useless, the recommendation by upstream is to ship custom udev rules :-/
[14:51:03] that would also eliminate the need for the unit-restart-on-file-change stuff. the unit would just be for boot time (where this race is)
[14:51:38] I guess on jessie+ the old sysctl -p type of command is gone?
[14:51:57] oh no, it's still there
[14:53:41] I don't know if it's very smart either though (about not writing if it doesn't have to)
[14:54:55] heh yeah "sysctl --system" just blindly writes all the configured values
[14:55:13] I'm not sure if that's universally non-disruptive. some kernel modules might take action on the write, even if the value is unchanged.
[15:00:26] Traffic, Monitoring, Operations, User-fgiunchedi: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3239010 (fgiunchedi)
[15:22:21] Traffic, netops, Operations, Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3239118 (elukey) I can see some interesting logs on mw2146 with error log set to info: ``` 2017/05/05 15:20:53 [info] 7794#7794: *7 client timed out (110: Connection t...
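
(Editor's sketch, not part of the log.) The 14:48-14:55 exchange proposes checking live sysctl values on each run and re-setting only the ones that differ, rather than blindly rewriting everything the way "sysctl --system" does. A minimal Python illustration of that idea follows. The function names, the `root` parameter, and the simple dot-to-slash key mapping are assumptions made for this example; it is not how puppet or systemd-sysctl actually work.

```python
from pathlib import Path


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under root.

    Assumption: no path component contains a dot (not true for some
    interface names such as VLAN devices), so a plain replace is enough.
    """
    return Path(root) / key.replace(".", "/")


def normalize(value):
    """Collapse whitespace so tab-separated multi-value sysctls compare equal."""
    return " ".join(str(value).split())


def enforce(desired, root="/proc/sys"):
    """Write only the keys whose live value differs; return the keys written.

    Keys whose file does not exist (for example because the kernel module
    is not loaded yet) are skipped, not created.
    """
    changed = []
    for key, want in desired.items():
        path = sysctl_path(key, root)
        if not path.exists():
            continue
        if normalize(path.read_text()) != normalize(want):
            path.write_text(normalize(want) + "\n")
            changed.append(key)
    return changed
```

Called as `enforce({"net.netfilter.nf_conntrack_max": "262144"})` (the value is only illustrative) it would need root privileges; an unchanged value produces no write at all, which sidesteps the 14:55:13 concern about modules reacting to redundant writes.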
[15:37:06] Traffic, netops, Operations, Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3239166 (elukey) Red herring, I found a way to reproduce the problem. I've set up `sudo tcpdump -n -v -i lo port 443` in tmux on mw2146 and ran the following requests:...
[15:56:14] netops, Operations, ops-codfw: codfw: ganeti2007-ganeti2008 switch power configuration - https://phabricator.wikimedia.org/T164594#3239225 (Papaul)
[15:56:40] netops, Operations, ops-codfw: codfw: ganeti2007-ganeti2008 switch port configuration - https://phabricator.wikimedia.org/T164594#3239241 (Papaul)
[17:56:40] Traffic, Operations: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3239715 (BBlack)
[17:57:04] Traffic, Operations: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609#3239728 (BBlack)
[18:00:25] Traffic, Operations: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3239748 (BBlack)
[18:00:45] Traffic, Operations: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3239764 (BBlack)
[18:00:48] Traffic, Operations, ops-ulsfo, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (BBlack)
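
(Editor's sketch, not part of the log.) At 14:05:12 the suggestion is to start with the prometheus textfile exporter and a cron job instead of netlink. One rough way to do that is to parse the text output of `tc -s qdisc show dev eth0` and render it in the Prometheus text format. The metric names (`tc_qdisc_*_total`), the label set, and both function names are hypothetical, and the regular expressions assume the usual "(dropped N, overlimits N requeues N)" line that tc prints; check them against the output of the iproute2 version in use.

```python
import re

# "qdisc <kind> <handle> [dev <name>] root|parent <id> ..." header line.
QDISC_RE = re.compile(
    r"^qdisc (?P<kind>\S+) (?P<handle>\S+)(?: dev \S+)? "
    r"(?:(?P<root>root)|parent (?P<parent>\S+))"
)
# Counter line printed by "tc -s" under each qdisc header.
STATS_RE = re.compile(
    r"\(dropped (?P<dropped>\d+), overlimits (?P<overlimits>\d+) "
    r"requeues (?P<requeues>\d+)\)"
)


def parse_tc_stats(text):
    """Return one dict per qdisc with its kind, handle, parent and counters."""
    rows = []
    current = None
    for line in text.splitlines():
        header = QDISC_RE.match(line)
        if header:
            current = {
                "kind": header["kind"],
                "handle": header["handle"],
                "parent": header["parent"] or "root",
            }
            continue
        stats = STATS_RE.search(line)
        if stats and current is not None:
            counters = {k: int(v) for k, v in stats.groupdict().items()}
            rows.append({**current, **counters})
            current = None
    return rows


def to_prometheus(rows, device):
    """Render parsed rows as Prometheus text-format counters."""
    lines = []
    for name in ("dropped", "overlimits", "requeues"):
        lines.append(f"# TYPE tc_qdisc_{name}_total counter")
        for r in rows:
            labels = (
                f'device="{device}",kind="{r["kind"]}",'
                f'handle="{r["handle"]}",parent="{r["parent"]}"'
            )
            lines.append(f"tc_qdisc_{name}_total{{{labels}}} {r[name]}")
    return "\n".join(lines) + "\n"
```

A cron job could feed the tc output to `parse_tc_stats`, then write the result of `to_prometheus` to a `.prom` file in the directory the textfile collector watches (that path is deployment-specific), ideally via a temporary file and rename so the collector never reads a partial file.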