[07:11:28] HTTPS, Traffic, Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3266376 (Bawolff)
[09:03:01] netops, Operations, Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3266536 (fgiunchedi) @ayounsi thanks! I'm for excluding analytics hosts for this alarm on the basis that the alarm itself isn't actionable...
[09:05:04] netops, Operations, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3266556 (ayounsi)
[09:33:55] from #varnish-hacking:
[09:33:55] 10:20 < hermunn> ema: I am hoping that #1764 will be fixed in 4.1.7, but thanks for reminding me.
[09:34:16] so yeah, perhaps 4.1.7 will include the backported fix to make -pnuke_limit work again
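For context on the parameter being discussed (this explanation is not from the log): nuke_limit caps how many LRU objects varnishd will evict, or "nuke", to make room for one incoming object before the allocation gives up; the default is 50. Below is a conceptual Python sketch of that bound, using a made-up storage API rather than Varnish's actual C code.

    # Conceptual sketch of what -p nuke_limit is meant to bound; the storage
    # object and its methods are hypothetical, not a real Varnish API.
    def allocate_with_nuking(storage, size, nuke_limit=50):
        """Try to allocate `size` bytes, evicting at most nuke_limit LRU objects."""
        nuked = 0
        while True:
            chunk = storage.try_alloc(size)   # succeeds or returns None
            if chunk is not None:
                return chunk
            if nuked >= nuke_limit or not storage.nuke_lru_object():
                return None                   # give up; caller surfaces the failure
            nuked += 1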
[11:38:45] Traffic, Operations, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#3267222 (TheDJ)
[13:05:00] Traffic, Analytics, Operations, User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3267476 (elukey)
[13:16:03] Traffic, Analytics, Operations, User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3227497 (Nuria) Let's (as a first step) send these errors to graphite.
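A minimal sketch of what "send these errors to graphite" looks like at the wire level, assuming Carbon's standard plaintext listener on TCP port 2003; the hostname and metric path below are placeholders, and the real counters would presumably be emitted from varnishkafka's own stats reporting rather than a one-off script like this.

    # Push a single counter to Graphite using Carbon's plaintext protocol
    # ("<path> <value> <timestamp>\n" per line). Host and metric are made up.
    import socket
    import time

    CARBON_HOST = 'graphite.example.wmnet'   # placeholder, not the real relay
    CARBON_PORT = 2003                        # standard Carbon plaintext port

    def send_metric(path, value, timestamp=None):
        ts = int(timestamp if timestamp is not None else time.time())
        line = '%s %s %d\n' % (path, value, ts)
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode('ascii'))

    # e.g. report the VSL errors seen by varnishkafka in the last interval
    send_metric('varnishkafka.cp1074.vsl_errors', 12)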
[13:44:40] going back over some lists of things we can do during the BBR-related "downtime" - probably one of the better ones is fixing up the lvs1007-12 situation
[13:45:16] I think hypothetically the Row D situation is out of the way, we just need to go back and validate how the ports are hooked up and VLAN'd and then re-install the machines
[13:46:00] T150256
[13:46:00] T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256
[13:53:15] Traffic, netops, Operations: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#2779434 (BBlack) Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478 . The idea was to try our best to ensure that a given vl...
[13:54:06] BBR-related downtime?
[13:55:43] paravoid: meaning that we don't want to change anything that could impact performance till BBR deployment, in order to have a baseline for performance comparisons
[13:55:51] ah!
[14:03:27] which slows down our usual insane rate of changes that tend to sometimes impact perf :)
[14:03:50] I guess downtime is the wrong word without context here. Downtime in our rate of pushing changes
[14:04:24] bblack: then you might have time to take a look at https://gerrit.wikimedia.org/r/#/c/353332/ :-P
[14:04:57] volans: hmmm...
[14:05:02] TL;DR after Faidon's changes that module is unused
[14:05:35] volans: I'm assuming you're right about currently unused, but that's only because we're stable on a combination of kernels and interface cards that happen to not require disabling any offload features...
[14:06:02] exactly, so we wanted to check with you if you want to keep it around
[14:06:54] I guess if we need it again, we could always go history searching, if we remember to.
[14:07:31] the probability of needing it is probably low? but it's hard to predict the future
[14:07:58] I can add a comment somewhere else to point to it and the latest SHA1 that has it
[14:08:17] any suggestion where it could be a good place? :D
[14:08:23] guessing about that future and making a call is basically bikeshedding and arbitrary
[14:09:10] a more concrete way would be to have some kind of (informal?) policy about these issues in general. Is our puppet philosophy to maintain a library of possibly-dead code in case of future need, or to keep it as lean as possible and avoid dead code?
[14:09:35] I think I probably lean towards the latter in the general case.
[14:09:44] dead code doesn't get tested or ported well as the ecosystem evolves
[14:10:47] me too
[14:10:52] 38 lines even counting comments and blanks. re-implementing it next time has at least a decent chance of doing it better anyways :)
[14:11:53] lol :)
[14:12:03] ok, thanks for the time
[14:30:35] let's talk about pfw in -netops?
[15:00:33] netops, Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3267810 (faidon) 14.1X53-D43 seems to have been released on May 11th. This particular PR isn't mentioned on the release notes, so the fix may or may not be i...
[15:14:52] New approaches to network fast paths
[15:14:57] https://lwn.net/Articles/719850/
[16:12:11] ema, bblack: FYI I'm about to merge https://gerrit.wikimedia.org/r/#/c/350771/ (compiler results linked in the last comment)
[16:12:44] in case you want to have a second look. I've picked random cp hosts for the compiler, assuming they are equivalent for this change.
[16:15:50] volans: yeah they are, lgtm
[16:16:13] great, thanks!
[16:41:11] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3268379 (Pcoombe) @DStrine @AndyRussG Can we prioritise working on this again once banner sequencing is done? It would...
[17:21:52] netops, Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3268647 (faidon) After another round with ATAC, this is the latest: > PR 1238906 is the original PR for this issue and it was raised by me. This is fixed sta...
[18:05:44] netops, Operations, ops-eqiad: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3268810 (Cmjohnson) @ayounsi It appears that the optics swap on asw-c did not help... should I replace on cr2? cmjohnson@asw-c-eqiad> show interfaces xe-8/0/38 extensive | match error...
[18:22:38] hmm, there's some upload 503 spikes happening
[18:24:17] maybe cp1074
[18:25:08] yeah, mailbox lag :(
[18:27:13] heh
[23:15:57] so, cp4021 power results:
[23:15:59] #Max.Power=327 W | 1116 Btu/hr
[23:15:59] #Max.Power.Timestamp=Wed May 3 11:16:11 2017
[23:16:15] ^ this is the max it ever recorded in drac, but honestly I think it's a fluke during bootup
[23:16:33] I looked at some existing older R630s and such, and they have crazy unrealistic maximums recorded as well
[23:16:53] cp3039 has:
[23:16:54] #Max.Power=1746 W | 5959 Btu/hr
[23:16:54] #Max.Power.Timestamp=Fri Apr 14 16:07:36 2017
[23:17:02] and also this to go with it:
[23:17:07] #Max.Amps=65.5 Amps
[23:17:07] #Max.Amps.Timestamp=Wed Nov 16 11:54:13 2016
[23:17:19] neither one of those figures is actually achievable by the hardware power supplies, so :P
[23:17:48] anyways, using sysbench to max out everything I can on cp4021:
[23:18:38] 5 different sysbenches running thread+mutex+io tests, all cpus maxed out, load avg in the 100s
[23:19:22] best I can hit is 245W
[23:20:12] also notable as a reference point, the cp3039 average over the last hour/day/week is more like 169W
[23:20:38] and I expect these new hosts to draw less power (they're slower CPUs, and a newer gen)
[23:20:50] so I think 245W peak is a reasonable figure
[23:22:07] so even if we assumed all the cp* and all the lvs/misc boxes all peaked up at the same time maxing out their cpu and io
[23:22:21] 18*245 = 4.4kW
[23:23:54] throw in headroom for router + switches
[23:24:05] still, there's no good reason we can't live within 2x3kW racks
[23:24:19] (which is DR's base quote)
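Spelling out the arithmetic behind those last few lines as a quick Python calculation; the 245 W peak and the 18-host count come from the log above, while the 1 kW of router/switch headroom is only an illustrative guess.

    # Back-of-the-envelope rack power budget. Figures are from the log except
    # the network headroom, which is an assumed placeholder value.
    hosts = 18                 # all cp* plus lvs/misc boxes peaking at once
    peak_per_host_w = 245      # max observed on cp4021 under sysbench load
    network_headroom_w = 1000  # assumed allowance for router + switches

    total_w = hosts * peak_per_host_w + network_headroom_w
    print('hosts only:    %.1f kW' % (hosts * peak_per_host_w / 1000.0))  # 4.4 kW
    print('with headroom: %.1f kW' % (total_w / 1000.0))                  # 5.4 kW
    print('fits in 2 x 3 kW racks: %s' % (total_w <= 2 * 3000))           # True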