[07:11:28] HTTPS, Traffic, Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3266376 (Bawolff)
[09:03:01] netops, Operations, Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3266536 (fgiunchedi) @ayounsi thanks! I'm for excluding analytics hosts for this alarm on the basis that the alarm itself isn't actionable...
[09:05:04] netops, Operations, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3266556 (ayounsi)
[09:33:55] from #varnish-hacking:
[09:33:55] 10:20 < hermunn> ema: I am hoping that #1764 will be fixed in 4.1.7, but thanks for reminding me.
[09:34:16] so yeah, perhaps 4.1.7 will include the backported fix to make -pnuke_limit work again
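For context on the parameter being discussed (this explanation is not from the log): nuke_limit caps how many LRU objects varnishd will evict, or "nuke", to make room for one incoming object before the allocation gives up; the default is 50. Below is a conceptual Python sketch of that bound, using a made-up storage API rather than Varnish's actual C code.

    # Conceptual sketch of what -p nuke_limit is meant to bound; the storage
    # object and its methods are hypothetical, not a real Varnish API.
    def allocate_with_nuking(storage, size, nuke_limit=50):
        """Try to allocate `size` bytes, evicting at most nuke_limit LRU objects."""
        nuked = 0
        while True:
            chunk = storage.try_alloc(size)   # succeeds or returns None
            if chunk is not None:
                return chunk
            if nuked >= nuke_limit or not storage.nuke_lru_object():
                return None                   # give up; caller surfaces the failure
            nuked += 1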
[11:38:45] Traffic, Operations, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#3267222 (TheDJ)
[13:05:00] Traffic, Analytics, Operations, User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3267476 (elukey)
[13:16:03] Traffic, Analytics, Operations, User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3227497 (Nuria) Let's (as a first step) send these errors to graphite.
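A minimal sketch of what "send these errors to graphite" looks like at the wire level, assuming Carbon's standard plaintext listener on TCP port 2003; the hostname and metric path below are placeholders, and the real counters would presumably be emitted from varnishkafka's own stats reporting rather than a one-off script like this.

    # Push a single counter to Graphite using Carbon's plaintext protocol
    # ("<path> <value> <timestamp>\n" per line). Host and metric are made up.
    import socket
    import time

    CARBON_HOST = 'graphite.example.wmnet'   # placeholder, not the real relay
    CARBON_PORT = 2003                        # standard Carbon plaintext port

    def send_metric(path, value, timestamp=None):
        ts = int(timestamp if timestamp is not None else time.time())
        line = '%s %s %d\n' % (path, value, ts)
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode('ascii'))

    # e.g. report the VSL errors seen by varnishkafka in the last interval
    send_metric('varnishkafka.cp1074.vsl_errors', 12)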
[13:44:40] going back over some lists of things we can do during the BBR-related "downtime" - probably one of the better ones is fixing up the lvs1007-12 situation
[13:45:16] I think hypothetically the Row D situation is out of the way, we just need to go back and validate how the ports are hooked up and VLAN'd and then re-install the machines
[13:46:00] T150256
[13:46:00] T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256
[13:53:15] Traffic, netops, Operations: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#2779434 (BBlack) Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478 . The idea was to try our best to ensure that a given vl...
[13:54:06] BBR-related downtime?
[13:55:43] paravoid: meaning that we don't want to change anything that could impact performance till BBR deployment, in order to have a baseline for performance comparisons
[13:55:51] ah!
[14:03:27] which slows down our usual insane rate of changes that tend to sometimes impact perf :)
[14:03:50] I guess downtime is the wrong word without context here. Downtime in our rate of pushing changes
[14:04:24] bblack: then you might have time to take a look at https://gerrit.wikimedia.org/r/#/c/353332/ :-P
[14:04:57] volans: hmmm...
[14:05:02] TL;DR after Faidon's changes that module is unused
[14:05:35] volans: I'm assuming you're right about currently unused, but that's only because we're stable on a combination of kernels and interface cards that happen to not require disabling any offload features...
[14:06:02] exactly, so we wanted to check with you if you want to keep it around
[14:06:54] I guess if we need it again, we could always go history searching, if we remember to.
[14:07:31] the probability of needing it is probably low? but it's hard to predict the future
[14:07:58] I can add a comment somewhere else to point to it and the latest SHA1 that has it
[14:08:17] any suggestion where it could be a good place? :D
[14:08:23] guessing about that future and making a call is basically bikeshedding and arbitrary
[14:09:10] a more concrete way would be to have some kind of (informal?) policy about these issues in general. Is our puppet philosophy to maintain a library of possibly-dead code in case of future need, or to keep it as lean as possible and avoid dead code?
[14:09:35] I think I probably lean towards the latter in the general case.
[14:09:44] dead code doesn't get tested or ported well as the ecosystem evolves
[14:10:47] me too
[14:10:52] 38 lines even counting comments and blanks. re-implementing it next time has at least a decent chance of doing it better anyways :)
[14:11:53] lol :)
[14:12:03] ok, thanks for the time
[14:30:35] let's talk about pfw in -netops?
[15:00:33] netops, Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3267810 (faidon) 14.1X53-D43 seems to have been released on May 11th. This particular PR isn't mentioned on the release notes, so the fix may or may not be i...
[15:14:52] New approaches to network fast paths
[15:14:57] https://lwn.net/Articles/719850/
[16:12:11] ema, bblack: FYI I'm about to merge https://gerrit.wikimedia.org/r/#/c/350771/ (compiler results linked in the last comment)
[16:12:44] in case you want to have a second look. I've picked random cp hosts for the compiler, assuming they are equivalent for this change.
[16:15:50] volans: yeah they are, lgtm
[16:16:13] great, thanks!
[16:41:11] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3268379 (Pcoombe) @DStrine @AndyRussG Can we prioritise working on this again once banner sequencing is done? It would...
[17:21:52] netops, Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3268647 (faidon) After another round with ATAC, this is the latest: > PR 1238906 is the original PR for this issue and it was raised by me. This is fixed sta...
[18:05:44] netops, Operations, ops-eqiad: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3268810 (Cmjohnson) @ayounsi It appears that the optics swap on asw-c did not help... should I replace on cr2? cmjohnson@asw-c-eqiad> show interfaces xe-8/0/38 extensive | match error...
[18:22:38] hmm, there's some upload 503 spikes happening
[18:24:17] maybe cp1074
[18:25:08] yeah, mailbox lag :(
[18:27:13] heh
[23:15:57] so, cp4021 power results:
[23:15:59] #Max.Power=327 W | 1116 Btu/hr
[23:15:59] #Max.Power.Timestamp=Wed May 3 11:16:11 2017
[23:16:15] ^ this is the max it ever recorded in drac, but honestly I think it's a fluke during bootup
[23:16:33] I looked at some existing older R630s and such, and they have crazy unrealistic maximums recorded as well
[23:16:53] cp3039 has:
[23:16:54] #Max.Power=1746 W | 5959 Btu/hr
[23:16:54] #Max.Power.Timestamp=Fri Apr 14 16:07:36 2017
[23:17:02] and also this to go with it:
[23:17:07] #Max.Amps=65.5 Amps
[23:17:07] #Max.Amps.Timestamp=Wed Nov 16 11:54:13 2016
[23:17:19] neither one of those figures is actually achievable by the hardware power supplies, so :P
[23:17:48] anyways, using sysbench to max out everything I can on cp4021:
[23:18:38] 5 different sysbenches running thread+mutex+io tests, all cpus maxed out, load avg in the 100s
[23:19:22] best I can hit is 245W
[23:20:12] also notable as a reference point, the cp3039 average over the last hour/day/week is more like 169W
[23:20:38] and I expect these new hosts to draw less power (they're slower CPUs, and a newer gen)
[23:20:50] so I think 245W peak is a reasonable figure
[23:22:07] so even if we assumed all the cp* and all the lvs/misc boxes all peaked up at the same time maxing out their cpu and io
[23:22:21] 18*245 = 4.4kW
[23:23:54] throw in headroom for router + switches
[23:24:05] still, there's no good reason we can't live within 2x3kW racks
[23:24:19] (which is DR's base quote)
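Spelling out the arithmetic behind those last few lines as a quick Python calculation; the 245 W peak and the 18-host count come from the log above, while the 1 kW of router/switch headroom is only an illustrative guess.

    # Back-of-the-envelope rack power budget. Figures are from the log except
    # the network headroom, which is an assumed placeholder value.
    hosts = 18                 # all cp* plus lvs/misc boxes peaking at once
    peak_per_host_w = 245      # max observed on cp4021 under sysbench load
    network_headroom_w = 1000  # assumed allowance for router + switches

    total_w = hosts * peak_per_host_w + network_headroom_w
    print('hosts only:    %.1f kW' % (hosts * peak_per_host_w / 1000.0))  # 4.4 kW
    print('with headroom: %.1f kW' % (total_w / 1000.0))                  # 5.4 kW
    print('fits in 2 x 3 kW racks: %s' % (total_w <= 2 * 3000))           # True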