[04:14:25] <wikibugs>	 netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237613 (Dzahn)
[04:15:00] <wikibugs>	 netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237601 (Dzahn) btw, what is the right phab tag for fundraising-tech ?
[04:16:45] <wikibugs>	 netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237615 (Dzahn)
[04:19:02] <wikibugs>	 netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237616 (Dzahn) pfw means "payments firewall" [[ https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions | naming conventions ]], so this is a Fundraising Tech issue
[04:20:03] <wikibugs>	 netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237601 (Cmjohnson) These hosts are being decommissioned.
[04:23:44] <wikibugs>	 netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237622 (Dzahn) p:Triage>Low thanks @cmjohnson. lowering prio.
[07:12:18] <wikibugs>	 netops, Operations: pfw-eqiad.wikimedia.org - 3 interfaces down (fundraising hosts) - https://phabricator.wikimedia.org/T164554#3237745 (ayounsi) Open>Resolved a:ayounsi >>! In T164554#3237613, @Dzahn wrote: > btw, what is the right phab tag for fundraising-tech ?  https://phabricator.wikimed...
[10:38:05] <wikibugs>	 Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238212 (ema) p:Triage>Normal
[12:54:56] <wikibugs>	 Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238201 (BBlack) How timely!  The subject of how to do completely-seamless reloads (especially for TCP) is quite thorny.  I've been pondering it and fighting with the issues for years on the UDP side for gd...
[12:57:27] <wikibugs>	 Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238580 (BBlack) Also note from that lengthy post - if we were willing to test the scalability of iptables on cache hosts (which we've avoided for fear that it won't scale over cores like the rest of what w...
[13:24:53] <wikibugs>	 Traffic, Operations: Investigate nginx reload behavior - https://phabricator.wikimedia.org/T164579#3238690 (BBlack) Hmmm another thing - when we first deployed this OCSP updating method, GlobalSign was giving us 8-hour OCSP validity windows.  At present (just checked) we're getting 4-day validity from Gl...
[13:45:59] <wikibugs>	 netops, Monitoring, Operations, Patch-For-Review, Scap (Scap3-Adoption-Phase1): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#3238719 (akosiaris) Open>Resolved After a year and 2 months, I can finally happily resolve this. scap3 is now used to deploy librenms,...
[13:59:12] <ema>	 all cache hosts upgraded to varnish 4.1.6 \o/
[14:00:20] <ema>	 I was looking for a decent way of adding dropped/overlimit/requeue stats to prometheus (tc -s -d qdisc show dev eth0|grep dropped) and ended up in an endless spiral that led me to using netlink in golang
[14:01:52] <bblack>	 \o/
[14:02:09] <bblack>	 well at your first line anyways
[14:02:13] <ema>	 :)
[14:03:21] <bblack>	 so how did netlink in golang go? :)
[14:04:40] <ema>	 hehe I haven't really made friends with github.com/mdlayher/netlink yet, which is what node_exporter uses for wifi stats
[14:05:12] <ema>	 it doesn't look impossible, but it also doesn't look trivial so perhaps using prometheus textfile exporter with a cron job would be easier as a start
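(Editor's note) The textfile-exporter route ema mentions could look roughly like the sketch below. It parses the per-qdisc counters from `tc -s qdisc show dev <device>` and renders them in the Prometheus text exposition format. The metric names (`node_qdisc_*_total`), the label set, and the function names are assumptions made for illustration and do not come from the discussion; the regular expression assumes the usual `(dropped N, overlimits N requeues N)` statistics line printed by iproute2.

```python
import re
import subprocess

# "qdisc <kind> <handle> ..." header line printed by tc for each qdisc.
QDISC_RE = re.compile(r"^qdisc (\S+) (\S+) ")
# Statistics line, e.g. " Sent 1 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)".
STATS_RE = re.compile(r"\(dropped (\d+), overlimits (\d+) requeues (\d+)\)")


def parse_tc_stats(text):
    """Return one dict per qdisc with its dropped/overlimits/requeues counters."""
    results = []
    current = None
    for line in text.splitlines():
        header = QDISC_RE.match(line)
        if header:
            current = {"kind": header.group(1), "handle": header.group(2)}
            continue
        stats = STATS_RE.search(line)
        if stats and current is not None:
            current.update(
                dropped=int(stats.group(1)),
                overlimits=int(stats.group(2)),
                requeues=int(stats.group(3)),
            )
            results.append(current)
            current = None
    return results


def to_textfile(device, stats):
    """Render parsed stats as Prometheus text format (hypothetical metric names)."""
    lines = []
    for metric in ("dropped", "overlimits", "requeues"):
        name = "node_qdisc_%s_total" % metric  # assumed name, not an existing metric
        lines.append("# TYPE %s counter" % name)
        for s in stats:
            labels = 'device="%s",kind="%s",handle="%s"' % (device, s["kind"], s["handle"])
            lines.append("%s{%s} %d" % (name, labels, s[metric]))
    return "\n".join(lines) + "\n"


def collect(device="eth0"):
    """Run tc for one device and return the textfile body (requires iproute2)."""
    out = subprocess.run(
        ["tc", "-s", "qdisc", "show", "dev", device],
        check=True, capture_output=True, text=True,
    ).stdout
    return to_textfile(device, parse_tc_stats(out))
```

A cron job would call `collect()` and write the result to a `.prom` file in the directory the textfile collector watches, ideally by writing a temporary file and renaming it so the exporter never reads a partial file.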
[14:05:41] <bblack>	 btw on the nginx ocsp cron stuff, apparently in git history I noticed the change from 8h to 4d for globalsign back in october and already adjusted the icinga check for it heh
[14:05:50] <bblack>	 just not the hourly cron
[14:06:00] <ema>	 ok
[14:06:45] <ema>	 so yeah daily reload instead?
[14:06:56] <ema>	 or is there any way to detect whether a reload is needed in the first place?
[14:07:06] <bblack>	 right, at least reduces the impact by 1/24 or so
[14:07:14] <bblack>	 the reload is always needed to update the ocsp file
[14:07:32] <bblack>	 err
[14:07:39] <bblack>	 right, at least reduces the impact to 1/24 of what it was before :)
[14:30:42] <bblack>	 I just validated all the mem sizing results, only cp4010 doesn't have the new malloc sizing out of text+upload (fixed now)
[14:30:57] <bblack>	 maps/misc missed it of course, so doing a splayed out restart of all of those
[14:31:32] <ema>	 sounds good!
[14:45:35] <moritzm>	 does the modules-load.d approach work fine on cp* to address the ipvs sysctl race we noticed a while ago? for conntrack sysctl settings on at least the kafka brokers it's ineffective:  https://phabricator.wikimedia.org/T136094#3238925
[14:46:37] <ema>	 moritzm: AFAIR on lvs hosts it did the trick, yes
[14:47:51] <moritzm>	 I'm wondering if that's actually atomic enough, modules-load.service will trigger the loading of the kernel module in the kernel, but I doubt that it actually waits to return until the module load has been completed
[14:48:47] <bblack>	 really our puppet runs should enforce the settings anyways (not that that's the complete answer either, but it would help)
[14:49:04] <bblack>	 (as in check the live values and re-set if necc on each run)
[14:49:36] <bblack>	 I think it only invokes the command if the files changed on-disk right now, I don't think it queries live values
[14:50:12] <moritzm>	 yeah
[14:50:48] <moritzm>	 it really sucks that systemd-sysctl is so useless, the recommendation by upstream is to ship custom udev rules :-/
[14:51:03] <bblack>	 that would also eliminate the need for the unit-restart-on-file-change stuff. the unit would just be for boot time (where this race is)
[14:51:38] <bblack>	 I guess on jessie+ the old sysctl -p type of command is gone?
[14:51:57] <bblack>	 oh no, it's still there
[14:53:41] <bblack>	 I don't know if it's very smart either though (about not writing if it doesn't have to)
[14:54:55] <bblack>	 heh yeah "sysctl --system" just blindly writes all the configured values
[14:55:13] <bblack>	 I'm not sure if that's universally non-disruptive.  some kernel modules might take action on the write, even if the value is unchanged.
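(Editor's note) The idea bblack describes, checking live values on each run and writing only when they differ, can be sketched as follows. This is a minimal illustration under stated assumptions, not how puppet or `sysctl --system` behave: the function names and the `root` parameter (which exists only so the logic can be exercised against a scratch directory instead of `/proc/sys`) are hypothetical.

```python
import os


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under root.

    Simplification: keys whose components themselves contain dots
    (for example some interface names) are not handled here.
    """
    return os.path.join(root, *key.split("."))


def normalize(value):
    """Collapse whitespace so '4096\t87380' and '4096 87380' compare equal."""
    return " ".join(str(value).split())


def enforce(desired, root="/proc/sys", dry_run=False):
    """Write only the keys whose live value differs; return the changed keys."""
    changed = []
    for key, want in sorted(desired.items()):
        path = sysctl_path(key, root)
        try:
            with open(path) as f:
                live = normalize(f.read())
        except FileNotFoundError:
            # The owning module is probably not loaded yet (the boot-time
            # race discussed above); skip and let a later run catch it.
            continue
        if live == normalize(want):
            continue  # already correct, so avoid a possibly disruptive write
        if not dry_run:
            with open(path, "w") as f:
                f.write(normalize(want) + "\n")
        changed.append(key)
    return changed
```

Run periodically (for instance from each configuration-management run, which is an assumption about deployment), this leaves unchanged values untouched, which addresses the concern that some kernel modules act on a write even when the value is the same.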
[15:00:26] <wikibugs>	 Traffic, Monitoring, Operations, User-fgiunchedi: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3239010 (fgiunchedi)
[15:22:21] <wikibugs>	 Traffic, netops, Operations, Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3239118 (elukey) I can see some interesting logs on mw2146 with error log set to info:  ``` 2017/05/05 15:20:53 [info] 7794#7794: *7 client timed out (110: Connection t...
[15:37:06] <wikibugs>	 Traffic, netops, Operations, Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3239166 (elukey) Red herring, I found a way to reproduce the problem. I've set up `sudo tcpdump -n -v -i lo port 443` in tmux on mw2146 and ran the following requests:...
[15:56:14] <wikibugs>	 netops, Operations, ops-codfw: codfw:  ganeti2007-ganeti2008 switch power configuration - https://phabricator.wikimedia.org/T164594#3239225 (Papaul)
[15:56:40] <wikibugs>	 netops, Operations, ops-codfw: codfw:  ganeti2007-ganeti2008 switch port configuration  - https://phabricator.wikimedia.org/T164594#3239241 (Papaul)
[17:56:40] <wikibugs>	 Traffic, Operations: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3239715 (BBlack)
[17:57:04] <wikibugs>	 Traffic, Operations: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609#3239728 (BBlack)
[18:00:25] <wikibugs>	 Traffic, Operations: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3239748 (BBlack)
[18:00:45] <wikibugs>	 Traffic, Operations: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3239764 (BBlack)
[18:00:48] <wikibugs>	 Traffic, Operations, ops-ulsfo, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (BBlack)