[05:15:56] netops, DBA, DC-Ops, Infrastructure-Foundations, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10026182 (Marostegui) >>! In T370852#10010096, @Ladsgroup wrote: > This should have the map: https://fault-tolerance.toolforge.org...
[05:47:48] fabfur: everything is back to normal in france, you can revert your geoip change
[05:51:52] ack!
[05:51:55] thanks
[06:18:43] netops, Infrastructure-Foundations, SRE: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10026230 (ayounsi) We can potentially re-use the `move_server.MoveServer` script but make the server selection a `MultiObjectVar` as input and make the rack U...
[06:55:06] ready to apply this: https://gerrit.wikimedia.org/r/c/operations/dns/+/1058026
[06:55:17] anyone for a +1 ?
[06:59:02] fabfur: +!
[06:59:04] 1
[06:59:08] tnx!
[07:00:49] Traffic: Route FR to esams - https://phabricator.wikimedia.org/T371216#10026268 (Fabfur) In progress→Resolved Revert applied, FR now back to drmrs
[08:21:06] Traffic, conftool: support the haproxy silent-drop hysteresis gadget from requestctl - https://phabricator.wikimedia.org/T371144#10026447 (Joe) Thanks for the thorough explanation! I know the traffic folks were a bit worried about controlling stick tables from requestctl but I think this format is ok. I...
[08:40:22] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10026510 (MatthewVernon) @NBaca-WMF thanks for the update :) If you could let me know when you've got planned timescales for this, that'd be helpful, pl...
[10:56:13] vgutierrez: wb!
[10:56:26] when/if you get a moment could I ask you to have a look at this task?
[10:56:27] vgutierrez
[10:56:32] sry
[10:56:40] https://phabricator.wikimedia.org/T370635
[10:56:58] TL;DR wanted to get your view on moving all the LVS vlans to the primary link in codfw
[11:00:20] * vgutierrez looking
[11:22:50] topranks: so just using 1 NIC per LVS box?
[11:23:15] vgutierrez: yeah, basically leveraging the fact we can trunk them all from the top of rack
[11:23:16] we're in the middle of a heatwave here.. so give me some extra CPU time :)
[11:23:36] I ran it by s.ukhe and b.black last week and they were ok with it, but wanted to get your input
[11:23:39] enable_liquid_cooling: true
[11:23:41] heh ok no probs :)
[11:23:53] first day it's not been freezing cold and raining here in about 2 weeks :P
[11:24:12] topranks: will this reduce the bandwidth available or we plan to trunk the existing nics?
[11:24:18] sorry, I'm out of context
[11:24:37] and maybe we already don't use much of it, being in direct routing
[11:25:01] technically speaking we will have "only" 10G available to feed realservers
[11:25:06] it'll reduce the bandwidth to the one 10G link. But peaks are like ~2Gb/sec so I don't think that's an issue
[11:25:49] at the same time... given the asymmetry between inbound and outbound interfaces that's always been the case, right?
[11:26:13] I guess we had limited inbound capacity always - on the single NIC
[11:26:18] the LVS only gets inbound traffic via 1 interface.. effectively limiting the input BW on 10G
[11:26:25] and so yeah - that limits the amount that can be sent out after rewriting the MAC header
[11:26:32] yeah good point
[11:26:53] so we are applying the same limit on outbound traffic from the LVS to the realservers
[11:26:54] ah right, from the router to the lvs we have already just 10g
[11:26:58] the new switches can support 25G also, so that's an option, but when I looked at usage didn't seem required in the short term
[11:27:06] makes sense
[11:27:10] topranks: what about PPS?
[11:27:33] don't think that will be affected, switches etc. can do line-rate at 64-byte frames so no issue network side
[11:27:44] secondary question, what about all the puppetization we have to fine-tune for performance all the NIC queues?
[11:27:49] probably on the LVS-side there is no potential of crossing the NUMA bridge so potentially better there?
[11:28:16] topranks: I was thinking about the LVS side, any chance of hitting the PPS limit of the 10G NIC?
[11:28:53] I think the NIC is gonna be ok to deliver line rate, it's more the CPU processing the packets that may be the limit
[11:29:18] and so I don't think its capacity should change, in fact may be more optimized as it's just processing the same N input queues
[11:29:31] in terms of the optimizations / queues it's not something I looked at though
[11:29:57] in general setting the number of inbound queues to equal the number of CPU cores makes sense - but I'm not that familiar with what we have already
[11:30:21] nitpick: CPU cores on the same NUMA node as the NIC
[11:30:46] topranks: modules/interface/manifests/rps.pp and modules/interface/files/interface-rps.py
[11:31:49] called from the more generic modules/cacheproxy/manifests/performance.pp
[11:32:22] but that's only for cache hosts I think
[11:32:23] not lvs
[11:32:23] * topranks looking
[11:32:28] (the latter)
[11:33:05] yeah the lvs one is modules/profile/manifests/lvs/interface_tweaks.pp I think
[11:34:31] topranks: overall it looks good and it also closes the gap between the old and the new
[11:35:07] ok
[11:35:21] yeah I'm looking at the optimization stuff here in a bit more detail
[11:35:36] but yeah - in terms of closing the gap - I guess the IPIP changes are gonna end up with something similar
[11:35:45] in terms of NIC usage, cpu affinity etc
[11:36:18] I guess with the current optimizations that means only CPU cores on the same NUMA as the NIC are gonna get used
[11:36:33] potentially limiting the number of CPU cores dedicated to transmit
[11:36:49] *but* the other side of that is no packets crossing bw-limited NUMA bridge
[11:36:59] and we probably have enough cpu cores to do it without problem
[11:40:21] topranks: usually setting up a CPU core in one NUMA node to handle a queue of a NIC in another NUMA node won't work
[11:41:25] you can run packets across NUMA nodes at a higher level, aka setting a network daemon to run in a different NUMA node as the NIC
[11:42:17] but my tests with the broadcom NICs in our servers showed that there is no communication between the NIC and the second NUMA node in our lvs boxes
[11:43:32] vgutierrez: yeah that makes sense, it all depends what NUMA node the NIC is connected to
[11:44:33] the optimizations look to prevent NIC queue processing being assigned to cores on the opposite numa node as the NIC
[11:45:13] looking at CPU usage for the busiest one - lvs2013 - the busiest CPU is hitting 25% usage roughly over the past week or so
[11:45:13] https://grafana.wikimedia.org/goto/QoNboGrIg?orgId=1
[11:46:05] that usage on that core could obviously be anything - isn't necessarily network usage
[11:46:37] but overall the picture looks good in terms of usage so I don't think the change is going to run into any limits
[12:00:07] netops, DBA, DC-Ops, Infrastructure-Foundations, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10027115 (Ladsgroup) Yeah. If we can build a public API from zarcillo, it'd make the whole easier.
[12:44:25] Traffic, Continuous-Integration-Config: Migrate docker-registry.wikimedia.org/releng/operations-dnslint from Buster to Bookworm - https://phabricator.wikimedia.org/T371001#10027378 (hashar)
[12:58:07] netops, Infrastructure-Foundations, SRE: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375 (cmooney) NEW p:Triage→Low
[13:11:36] vgutierrez: I'll move forward with that work to migrate the codfw lvs to new switches then?
[13:11:47] topranks: sounds good
[13:12:11] cool, I distributed the patch reviews amongst the traffic team members to try and be fair
[13:12:29] sukh.e did the first one for me last week (lvs2012) so I'll start there
[13:12:30] thanks!
[13:52:26] Traffic, conftool: support the haproxy silent-drop hysteresis gadget from requestctl - https://phabricator.wikimedia.org/T371144#10027717 (CDanis) >>! In T371144#10026447, @Joe wrote: > I think the easiest way to do this is to do as follows: > * Add the `concurrency_limit (bool)` and `concurrency_thresh...
[14:07:23] 7~/13
[14:07:27] err :)
[15:17:42] netops, DBA, DC-Ops, Infrastructure-Foundations, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10028353 (Marostegui) @ABran-WMF please coordinate with @cmooney for this.
[15:33:46] hi traffic -- anyone willing to review and deploy https://gerrit.wikimedia.org/r/547929 for today's Puppet request window?
[15:34:14] (doesn't have to actually be during the window if that's not a good time, I just want to find someone for it)
[15:34:37] rzl: we are in a meeting right now
[15:34:41] but
[15:35:31] I think this will need some time to review as well and maybe tests
[15:36:04] will comment on the patch shortly, thanks
[15:36:36] that sounds right to me -- appreciate it
[15:51:47] rzl: responded, you can consider this as an assignment to us and we will take care of the deploy
[15:53:57] thank you <#
[15:53:59] <3
[17:15:42] Folks I'm proceeding to move row c/d vlans to the primary int on lvs2012 (T370862)
[17:15:42] T370862: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862
[17:16:05] step 1 is to disable the bgp session on the switch, I'll monitor then and continue provided no issues once connections die off
[17:18:09] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10028892 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=22b0edee-c7a6-4b0f-9fea-2095ec62...
[17:18:56] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10028893 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6ff7dee3-4248-4c63-812a-befb7aa3...
[17:22:40] Wikimedia-Apache-configuration, serviceops, Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830#10028896 (Pppery)
[17:55:32] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10028999 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dd309020-6739-44e3-aae7-1db7e069...
[17:55:45] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029001 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e014b03e-5922-4caa-80c4-c950cc41...
[18:11:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp1102:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:13:26] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029064 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a53d3f9e-80ae-429e-b814-01f035f8...
[18:16:40] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:21:40] FIRING: [29x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:26:40] FIRING: [32x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:30:30] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029125 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=428f84f9-4ca7-4d64-ba2f-941c3927...
[18:30:48] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029126 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fea7df87-a776-4ad1-b5ea-1c4c47a6...
[18:41:40] FIRING: [17x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:46:40] RESOLVED: [16x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:59:58] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029209 (cmooney) Work on this one is completed, all that remains is to remove the old cross-rack links which a...
[19:26:48] FYI I'm going to move on and do the same move for lvs2011 now T370891
[19:26:49] T370891: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891
[19:28:56] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10029484 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fdb9ae19-db19-42c1-a837-d30eff23...
[19:29:28] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10029498 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0cfea209-8c6a-4d44-8fbf-96f5cd79...
[20:09:16] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10029715 (cmooney) Work completed on this one on the network & LVS side. @papaul we can now remove the cross-ra...
[20:27:20] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, ops-codfw: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434 (RobH) NEW
[20:27:23] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, ops-eqiad: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435 (RobH) NEW
[20:27:32] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, ops-codfw: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10029796 (RobH)
[20:27:45] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, ops-eqiad: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10029801 (RobH)
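Note on the 11:29-11:44 exchange about NIC queues and NUMA locality: the point made there is that queue processing should only be assigned to CPU cores on the same NUMA node as the NIC. The following is a minimal sketch of how one might look up that set of cores on a Linux host, assuming the standard sysfs files `device/numa_node` and `node<N>/cpulist` are present. The function names and the interface name `eth0` are hypothetical, and this is not the logic of interface-rps.py or interface_tweaks.pp, which the log names but does not show.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a sysfs cpulist string such as "0-3,8,10-11" into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def nic_local_cpus(iface: str, sysfs: Path = Path("/sys")) -> list[int]:
    """Return the CPUs on the NUMA node that the NIC is attached to.

    The `sysfs` argument exists only so the lookup can be pointed at a
    fake directory tree for testing; it is not part of any real tool.
    """
    node_file = sysfs / "class/net" / iface / "device/numa_node"
    node = int(node_file.read_text())
    if node < 0:
        # -1 means the kernel reports no NUMA affinity for the device,
        # so fall back to every online CPU.
        return parse_cpulist((sysfs / "devices/system/cpu/online").read_text())
    cpulist = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())
```

On a real host, `nic_local_cpus("eth0")` would give the candidate cores for queue affinity. Capping the queue count at the length of that list is one way to read the "cores on the same NUMA node as the NIC" nitpick, though the log does not confirm that this is how the existing puppetization does it.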