[02:21:14] another ref point: dell's essa configurator claims at full load these machines should consume ~380W
[02:22:46] I think that's just doing the math, it may not be easy to get them there with realistic software
[02:23:04] the processor TDPs are 105W x 2cpu = 210W max for CPUs
[02:24:15] in any case, all of this data is about the 12x caches. the 6x lvs/misc are similar machines, but with slower/lower-power CPUs, far less installed ram, and far less expected CPU load
[02:24:58] if I can only benchmark the cache box to 245W, then 18x245 = 4.4kW seems like a reasonable peak (the actual 18 servers will never get there in practice)
[07:13:12] 10netops, 06Operations, 10ops-eqiad: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3269921 (10ayounsi) @Cmjohnson Yes please, and you can put the previous optic back on asw-c-eqiad. If that still doesn't solve the issue, the cable will need to be swapped.
[07:17:15] https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=218af599fa635b107cfe10acf3249c4dfe5e4123
[07:17:38] (BBR with embedded pacing)
[08:06:24] 10Traffic, 06Operations, 13Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3239715 (10Ottomata) I don't think that this should cause any problems on our side. I'm not aware of any maps specific jobs we run. @jallamandou let's remember to remove...
[08:12:12] godog: cool
[08:12:48] and it automatically disables itself if egress uses fq
[08:15:32] indeed, I think it should make it to 4.13, not that it makes a big difference to us
[08:29:45] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270027 (10elukey) I didn't read a lot of documentation about BBR but I am wondering if it could help in a local LAN use case like the Hadoop clu...
[10:19:05] 10Traffic, 06Analytics-Kanban, 06Operations: Artificial spike in offset of unique devices from November 14th to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270402 (10ema)
[10:20:12] 10Traffic, 06Analytics-Kanban, 06Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270406 (10ema)
[10:23:54] 10Traffic, 06Analytics-Kanban, 06Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270435 (10Nuria) Summing up from IRC's conversation between @nuria and @ema: From the 2nd of November we start seeing a shift of th...
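For the BBR/fq thread above ([07:17]-[08:29]): a minimal sketch of the two kernel settings being discussed, not the production puppetization; eth0 is only an example interface name.

```
# fq as the default qdisc plus BBR as the congestion control. With fq doing
# pacing on egress, the internal TCP pacing from the linked net-next commit
# disables itself (per the [08:12:48] comment above).
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Sanity checks, e.g. after a reboot, to confirm the settings "stick":
sysctl net.core.default_qdisc net.ipv4.tcp_congestion_control
tc qdisc show dev eth0
```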
[10:24:50] 10Traffic, 06Analytics-Kanban, 06Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270438 (10Nuria) {F8109003} This is the offset data for wikimedia mobile
[10:31:16] 10Traffic, 06Analytics-Kanban, 06Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270516 (10Nuria)
[10:54:26] 10Traffic, 10netops, 06Operations, 10Pybal: Deploy pybal with BGP MED support (for primary/backup) in production - https://phabricator.wikimedia.org/T165584#3270574 (10mark)
[12:22:02] 10Traffic, 06Analytics-Kanban, 06Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270816 (10Nuria) @ema: has the way we compute nocookies flag on X-analytics changed? It should take into account "all" cookies not...
[12:48:31] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270916 (10BBlack) There's not a lot of good data on how BBR behaves in datacenter-like networks (high bandwidth, low latency, low loss, etc). I...
[12:49:13] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270917 (10BBlack) Also while I'm thinking about it - we should validate that the sysctl setting for fq as default qdisc "sticks" on reboot and i...
[12:57:23] 10Traffic, 10Analytics, 10Analytics-Cluster, 06Operations: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#3270926 (10Ottomata)
[12:57:31] 10Traffic, 10Analytics, 10Analytics-Cluster, 06Operations: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881737 (10Ottomata)
[13:20:08] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#2696835 (10ayounsi) >>! In T147569#3270027, @elukey wrote: > where LibreNMS periodically notifies us that the switch ports are saturated due to s...
[13:28:26] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270985 (10elukey) Thanks for the feedback, I thought it was more a problem of port capacity completely used (100%) and buffers filled, now it ma...
[13:28:32] ema: I think the simplistic way we did the qdisc switch to fq on cp1008 isn't going to work out well elsewhere
[13:28:46] ema: I think on cp1008 we did "tc qdisc replace dev eth0 fq" right?
[13:29:59] ema: in any case, on the real hosts with real bnx2x cards and all the interface-rps hacks, etc... the situation is very different.
the interface-scaling hacks use the "mq" qdisc to spread over many queues (1:1 mapping of hw+sw queues)
[13:30:20] ema: but then mq uses pfifo_fast within each queue, and we can use fq there
[13:31:08] ema: so the right approach on the bnx2x hosts is to set the default to fq (the sysctl we're setting from puppet), and then do "tc qdisc replace dev eth0 mq", which re-inits the mq using multiple fq underneath it instead of multiple pfifo_fast
[13:32:01] 10Traffic, 10netops, 06Operations: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#2779434 (10ema) Current situation:
| host | port | switch | port | redundancy issues |
| lvs1007 | eth0 | asw2-a5-eqiad | xe-0/0/8.0 | lvs1010 eth1 also on asw2-a5...
[13:32:38] ema: (but assuming we have no startup race problems with the sysctl, the "default fq" sysctl alone should still work right for the bnx2x on fresh reboot - it will set the default for the initial mq setup)
[13:35:26] 10Traffic, 10netops, 06Operations: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3271021 (10BBlack) Also notable: lvs1009 and lvs1012 connections to row B (eth2) are using 1GbE ports rather than 10GbE?
[13:36:34] bblack: for now, in the table above I've just highlighted the redundancy issues instead of proposing a solution :)
[13:38:17] yeah
[13:38:51] I've gotten myself lost in a black hole of re-examining how our interface-rps script works on modern hosts with huge cpu core counts heh
[13:39:02] (which led to the realization about the fq qdisc stuff)
[13:39:13] nice
[13:39:23] but now I'm noticing some other things too
[13:39:35] on cp1008 we went for `tc qdisc replace dev eth0 root fq`, to answer your previous question
[13:39:53] one is that I thought when #cpus was > #queues, we would spread each IRQ onto multiple cores, but that doesn't seem to be the case, we just use the first cpus
[13:40:51] yeah for the others, we really have to do the fq sysctl first, and then replace to mq
[13:41:12] I wish there was a way to puppetize it properly, maybe I'll figure it out along the way
[13:44:46] hmmm I think interface-rps actually is still working as-designed, but it only doubles up cpus-per-IRQ if the cpu count is 2x or more of the queue count (which it isn't, we have 15 or 16 hardware queues and 24 hardware cpu cores)
[13:45:36] one thing that could be fixed up a little, though, would be for it to be cpu-die-aware (or numa-aware if you prefer?) in distributing them
[13:46:33] with the current code it just runs through the physical core numbers linearly, so if you have 16 IRQs and 24 physical cores (12 on each cpu die), it puts 12 IRQs on the 12 cores of the first cpu, and then the last 4 IRQs on the first 4/12 cores of the second cpu
[13:46:43] it could've got 8/12 on each
[13:49:01] 10Traffic, 10Analytics, 10Analytics-Cluster, 06Operations: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#3271107 (10Ottomata) It looks like a lot of the hard work for this has been done for Cassandra over in T108953 and T111113. Documentation for this is...
[14:34:19] 10netops, 06Operations, 10ops-eqiad: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3271277 (10Cmjohnson) @ayounsi Changing the optics on cr1 appears to have worked... no new errors. Please review and resolve if you're satisfied. cmjohnson@asw-c-eqiad> show interfaces x...
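Condensing the bnx2x procedure described above ([13:31:08] / [13:40:51]) into a sketch; this is not the puppetized version, and eth0 is just an example interface name:

```
# The default per-queue qdisc must be fq *before* re-creating mq, so that
# the mq children come up as fq instead of pfifo_fast.
sysctl -w net.core.default_qdisc=fq
tc qdisc replace dev eth0 root mq

# Verify: expect "qdisc mq" at the root plus one "qdisc fq" per tx queue.
tc qdisc show dev eth0
```

And a hypothetical illustration (not the actual interface-rps script) of the die-aware spread suggested at [13:46:33], assuming 16 IRQs and 2 dies of 12 physical cores each (cores 0-11 on die 0, cores 12-23 on die 1):

```
# Interleave IRQ assignments across the two dies instead of filling die 0
# first and spilling the remainder onto die 1.
for i in $(seq 0 15); do
    die=$(( i % 2 ))
    slot=$(( i / 2 ))
    echo "IRQ $i -> physical core $(( die * 12 + slot ))"
done
# Result: 8 IRQs per die (cores 0-7 and 12-19) rather than 12 + 4.
```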
[16:04:14] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3271522 (10elukey) Ticket closed as won't fix. The main issue is that not all the clients are sending the close notify, and nginx follows what the majority of the browser...
[16:21:37] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3271570 (10BBlack) I wonder if Chrome (which is the dominant browser now, not MSIE as indicated in that nginx source comment) sends the close notify?
[16:36:19] 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10ema)
[16:36:44] 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271633 (10ema) p:05Triage>03Normal
[16:38:05] 10Traffic, 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10ema)
[16:46:28] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3271654 (10BBlack) On the reboot issue: I've tested cp4021 and the existing puppetization works fine on reboot (even given the other stuff below)...
[16:51:23] 10Traffic, 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10BBlack) So, a couple points: 1. Probably the reason for a lack of neighbors is that some (most?) of the switches don't blanket-enable LLDP for all ports. They explicitly list certain groups like `i...
[17:30:33] 10Traffic, 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10faidon) It's worse than that I'm afraid :( LLDP regularly crashes on some of our older switches (running ancient JunOS): ``` # asw2-a5-eqiad> show system core-dumps # fpc0: # ------------------...
[17:40:51] 10Traffic, 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271850 (10BBlack) Answering for @ema I think this mostly came up as a consequence of trying to map out the data in T150256#3271004 using lldpcli to confirm port connections. That led to an in-depth conversati...
[19:24:30] 10netops, 06DC-Ops, 06Operations, 10ops-ulsfo: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3272118 (10RobH)
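As a footnote to the LLDP thread: the kind of check being discussed, assuming the lldpd package is installed and running on the cache host and that the switch port actually has LLDP enabled (which, per the comments above, often isn't the case on these switches):

```
# On a cache host (requires lldpd to be installed and running):
lldpcli show neighbors details

# On the JunOS side, "show lldp neighbors" on the switch shows whether the
# corresponding port is exchanging LLDP at all.
```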