[11:47:30] Traffic, Operations: Setup a new PKI software as an alternative to the puppet CA for managing services certificates - https://phabricator.wikimedia.org/T194031#4186323 (Joe) p:Triage>Normal
[13:12:44] Traffic, Operations, Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4186612 (Vgutierrez)
[13:12:49] Traffic, Operations: Gather 24h data cluster wide of AES128-SHA usage - https://phabricator.wikimedia.org/T193376#4186609 (Vgutierrez) Open>Resolved a:Vgutierrez
[14:05:48] https://blog.acolyer.org/2018/05/03/stateless-datacenter-load-balancing-with-beamer/
[14:16:55] pretty interesting, seems like a smart solution
[14:20:29] indeed, it is mentioned in the paper only but it is actually on github https://github.com/Beamer-LB
[14:21:15] AES128-SHA usage dropped again below 0.09% (0.0889%) \o/
[14:24:59] at first I wondered why they bothered with the DIP/VIP tunneling/rewriting (as opposed to configuring VIPs on loopback like we do), but I suspect on a second re-read it will turn out that it's critical to their daisy-chaining strategy I'm sure.
[14:27:56] (either that or it's just a simpler abstraction when making this work with virtual instances and/or containers but not strictly necessary. have to dig more to figure that out)
[14:28:53] the downside reason to avoid it if it's not strictly necessary, is that it requires encapsulation, and therefore you need a bigger MTU on the mux->server path to avoid fragmenting full-sized request packets from the user-facing side.
[14:29:14] (and server<->server for daisychain)
[14:30:24] one of many cases when I wish it were simple to turn on jumbo frames inside our network.
[14:32:27] (but it's not: for one we'd have to be careful about not breaking devices that can't handle it, but those might mostly be on the mgmt network which can be left at normal MTU. Another is fixing initial install-time issues with the bigger MTU vs default. And another is we'd need an Internet-MTU hack on the hosts that applies only to non-WMF networks, so they don't try to send oversized packets to
[14:32:33] clients (e.g. the caches and other hosts that send direct public-facing outbound traffic).
[14:33:57] the latter bit is the hardest one, but only applies to interfaces on our public subnets, not private.
[14:40:40] Traffic, netops, Operations, ops-codfw: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677#4186910 (Papaul) @BBlack please let me know when you have time to work on this. Thanks.
[14:49:33] TIL why it would be hard to turn on jumbo frames, thanks :)
[15:06:21] Traffic, Operations, Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4186976 (Vgutierrez) After completing T193376 and analyzing the gathered data, we've got the following results for 24h of traffic data beginning at 2018-05-03 16:57: * 46%...
[15:11:56] Traffic, Operations, Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4186984 (Vgutierrez)
[15:56:54] elukey: BTW, did you have the chance to ask your team about including TLS data in webrequest? O:)
[15:57:48] nope sorry didn't manage to yet but I promise I'll do it soon :)
[16:00:16] haha thx :D
[17:03:48] bblack: cr1-eqsin's interfaces are back up
[17:04:29] annnnnnnnd down again
[17:04:50] lol
[17:04:53] randomly?
[17:05:07] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4187502 (Krinkle) @Vgutierrez @ema I'm working on using the Prometheus metrics for the ResourceLoader dashboards but running into an issue with the `va...
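Editor's note on the encapsulation/MTU point from 14:28:53: a rough sketch of the arithmetic, assuming a plain IP-in-IP or GRE tunnel between mux and server. The log does not say which encapsulation Beamer uses, so the tunnel types here are illustrative, and the helper names `required_path_mtu` and `would_fragment` are made up for this example.

```python
# Outer-header overhead in bytes for a few common tunnel types.
# These are standard header sizes; which one applies to Beamer is an assumption.
ENCAP_OVERHEAD = {
    "ipip": 20,    # outer IPv4 header only
    "ip6ip6": 40,  # outer IPv6 header only
    "gre": 24,     # outer IPv4 header + 4-byte base GRE header
}


def required_path_mtu(client_mtu: int, encap: str) -> int:
    """MTU needed on the mux->server path so that a full-sized
    client packet still fits after encapsulation."""
    return client_mtu + ENCAP_OVERHEAD[encap]


def would_fragment(packet_size: int, path_mtu: int, encap: str) -> bool:
    """True if the encapsulated packet exceeds the internal path MTU."""
    return packet_size + ENCAP_OVERHEAD[encap] > path_mtu


if __name__ == "__main__":
    for encap in ENCAP_OVERHEAD:
        print(encap, required_path_mtu(1500, encap))
    # A full 1500-byte request over a 1500-byte internal path:
    print(would_fragment(1500, 1500, "ipip"))
    # The same request with jumbo frames (9000) on the internal path:
    print(would_fragment(1500, 9000, "ipip"))
```

With a standard 1500-byte internal MTU, any full-sized client packet needs fragmenting once a tunnel header is added, which is why jumbo frames on the internal path come up in the discussion.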
[17:08:29] bblack: nah, I think we're starting to narrow down the issue, and it seems to be related to the HA features (graceful-switchover), will update the task
[17:08:59] great to have a router with 2 routing-engines, if they can't failover
[17:10:23] the whole software-licensing-the-enablement-of-hardware-ports is kind of scammy and ugly to begin with
[17:11:34] seems like they could've at least taken an approach more like, e.g. Intel's 80486SX move.
[17:12:20] (historical context: the 80486DX had a floating-point unit, and the 80486SX didn't and was cheaper. they were all manufactured the same, but intel clipped a wire on the SX to disable the FPU and sold it cheaper.
[17:12:24] )
[17:12:50] you can question the economics of the thing, but that kind of pricing "works". you can at least do it in hardware though, instead of some fallible software-licensing system :P
[17:14:10] for a long time Juniper's licenses were an "honor based system", you could use the features, but it would not be legal without the license, and the device was only generating syslog saying "feature X is being used without license"
[17:16:19] my guess so far is that the router *sometimes* fails at synchronizing the licenses between the 2 REs after any unrelated commit
[17:17:49] heh the router's actually-gone now (as in we're not getting full transit)
[17:18:09] I have my local hostsfile hacked to use eqsin even though I'm in TX, was working up until sometime just recently :)
[17:22:18] indeed
[17:22:26] and CPU has been at 100% https://librenms.wikimedia.org/graphs/to=1525713600/type=device_processor/from=1525692000/legend=no/lazy_w=652/device=159/
[17:24:08] all the BGP sessions are UP, the router seems to be struggling to process all the received prefixes
[17:24:23] "rpd" is the daemon eating the CPU
[17:24:55] `Table Tot Paths Act Paths Suppressed History Damp State Pending`
[17:24:55] `inet.0 1567453 677387 0 0 0 402113`
[17:25:03] see the pending
[17:25:19] down to 51999
[17:26:19] is this some effect of the issues with the RE-failover and/or licensing, etc? or is this a general problem we're gonna have anytime we restart all the interfaces (or the router itself), where the CPU takes a while to chug through initially processing all the global routing it gets?
[17:27:27] I'd guess the former, such long downtime is not normal, even for this kind of weak router
[17:29:57] that issue is also similar to what we had when I tried to enable `nonstop-routing` (note that it's not enabled)
[17:32:40] alright, pending is at 0, and router is now routing
[17:53:56] Traffic, netops, Operations: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4187668 (ayounsi) Current troubleshooting actions based on JTAC suggested next step: ```lang=diff [edit system] - commit synchronize; [edit chassis redundancy] - graceful-switchover; [ed...
[17:58:25] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4187673 (Cmjohnson) @vguiterrez I updated the firmware on lvs1016
[18:52:32] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4187802 (Vgutierrez) @Cmjohnson I still see the same FW version from ethtool and same MSI-X: ```name=FW version root@lvs1016:~# ethtool -i enp4s0f0 |grep firmware firmware...
[20:17:39] Traffic, Multimedia, Operations: Update Media dashboard in Grafana to use Prometheus metrics - https://phabricator.wikimedia.org/T193445#4188187 (Imarlier) Hey, Multimedia team -- probably makes the most sense for you to handle this.
[20:25:01] Traffic, netops, Operations: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4188239 (ayounsi) > So you can use either the configuration statement and as long as the configuration active on both REs no affectation should be seeing on license status or use the request s...
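Editor's note on the route table row quoted at 17:24:55: a small sketch of how the `Pending` column could be pulled out of such a row and turned into a backlog fraction. The column names are taken from the header pasted in the channel; the function names `parse_row` and `pending_fraction` are hypothetical, and the parser assumes exactly the seven-column layout shown there.

```python
# Column order as pasted in the channel:
# Table | Tot Paths | Act Paths | Suppressed | History | Damp State | Pending
COLUMNS = (
    "table", "tot_paths", "act_paths",
    "suppressed", "history", "damp_state", "pending",
)


def parse_row(row: str) -> dict:
    """Split one whitespace-separated row into named fields."""
    fields = row.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    parsed = {"table": fields[0]}
    for name, value in zip(COLUMNS[1:], fields[1:]):
        parsed[name] = int(value)
    return parsed


def pending_fraction(parsed: dict) -> float:
    """Share of received paths that rpd has not finished processing."""
    return parsed["pending"] / parsed["tot_paths"]


if __name__ == "__main__":
    row = parse_row("inet.0 1567453 677387 0 0 0 402113")
    print(row["pending"], f"{pending_fraction(row):.1%}")
```

At the time of the paste roughly a quarter of the received paths were still pending, which fits the router not yet carrying full transit until the counter reached 0 at 17:32:40.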
[21:11:16] bblack: we can re-pool Singapore anytime now, let me know if you want to wait a bit longer
[21:19:29] XioNoX: is there any remaining unusual risk?
[21:20:17] bblack: no
[21:20:32] ok let's do it then, can you do the commit?
[21:21:23] sure
[21:22:18] bblack: I updated the task with the "RCA" and changes made
[21:23:03] ok
[22:53:08] Traffic, Fundraising-Backlog, Operations, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4188625 (CCogdill_WMF) Thanks for the meeting on Thursday, everyone! I'm following up with IBM about potentially: * getting them to obtain a DV cert *...