[01:44:28] 06Traffic, 06Infrastructure-Foundations: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11426657 (10CDanis) >>! In T411503#11425771, @daniel wrote: >>>! In T411503#11424246, @taavi wrote: >> I don't think we currently have any places outside of https://wikitech.wikimedia.org/wik...
[07:35:43] FIRING: [4x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[07:40:43] FIRING: [5x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[07:50:43] RESOLVED: [5x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:48:25] 06Traffic, 06Infrastructure-Foundations: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11427023 (10daniel) >>! In T411503#11426657, @CDanis wrote: > We already support the Googlebot JSON format, which has become something of a de facto standard. Do you have a link to an exampl...
[09:29:44] 06Traffic: Refresh trafficserver_backend_requests_seconds histogram - https://phabricator.wikimedia.org/T411584 (10Vgutierrez) 03NEW
[09:30:20] 06Traffic: Refresh trafficserver_backend_requests_seconds histogram - https://phabricator.wikimedia.org/T411584#11427165 (10Vgutierrez) a:03CDobbins
[09:34:19] 06Traffic: Refresh trafficserver_backend_requests_seconds histogram - https://phabricator.wikimedia.org/T411584#11427184 (10Vgutierrez) p:05Triage→03Medium
[11:12:10] 06Traffic, 10bot-traffic-requests: FY 25/26 WE 5.4.6 Classify the top 30 spiders by traffic as known bots - https://phabricator.wikimedia.org/T408061#11427595 (10Fabfur) Ranked bots paste has been superseded by the shared doc: https://docs.google.com/spreadsheets/d/1PKfAhcc2jXl72CbF73JXTeZMTbw_RtQnYJ6YZ4Fozyk/...
[11:28:20] 06Traffic, 06Infrastructure-Foundations: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11427653 (10taavi) Ok, we now publish https://meta.wmcloud.org/cloudvps-ips-all.json (which is now documented at https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_IP_space#Machine-readable_d...
[11:46:52] 06Traffic, 06Infrastructure-Foundations: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11427760 (10daniel) >>! In T411503#11427653, @taavi wrote: > Ok, we now publish https://meta.wmcloud.org/cloudvps-ips-all.json (which is now documented at https://wikitech.wikimedia.org/wiki/...
[12:31:47] 06Traffic, 06Infrastructure-Foundations: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11428006 (10taavi) Sure, done at https://tools-static.wmflabs.org/admin/meta/worker-ips.json and documented at the same place.
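Both endpoints above are plain JSON over HTTPS, so a consumer can script against them directly. A minimal sketch of doing so (not part of the task itself), assuming each file is a flat JSON array of CIDR strings; the jq filter and the grepcidr step are illustrative only and would need adjusting to the actual schema documented on Wikitech:

```
# Hypothetical consumer sketch: fetch the published Cloud VPS and Toolforge
# worker IP lists and flatten them into one prefix-per-line file.
# Assumption: each endpoint returns a flat JSON array of CIDR strings;
# adjust the jq filter to the real schema documented on Wikitech.
curl -s https://meta.wmcloud.org/cloudvps-ips-all.json \
        https://tools-static.wmflabs.org/admin/meta/worker-ips.json \
  | jq -r '.[]' | sort -u > wmcs-prefixes.txt

# Example use, if grepcidr happens to be installed: pick out log lines whose
# client IP falls inside any of the published prefixes.
grepcidr -f wmcs-prefixes.txt access.log
```

The same prefix list could equally be loaded into an ipset/nftables set or joined against request data in analytics; the point is only that the ranges are now published in a machine-readable place.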
[12:32:00] 10netops, 06Traffic, 06Infrastructure-Foundations, 06SRE: GDNSd discovery records: balance requests from POPs across core sites - https://phabricator.wikimedia.org/T411617 (10cmooney) 03NEW p:05Triage→03Medium
[12:33:51] 10netops, 06Traffic, 06Infrastructure-Foundations, 06SRE: GDNSd discovery records: balance requests from POPs across core sites - https://phabricator.wikimedia.org/T411617#11428021 (10cmooney)
[12:36:56] 06Traffic, 06Infrastructure-Foundations: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11428049 (10daniel) >>! In T411503#11428004, @taavi wrote: > Sure, done at https://tools-static.wmflabs.org/admin/meta/worker-ips.json and documented at the same place. Awesome, thank you!
[14:26:22] ori: hi. following up on our discussion, vgutierrez is around for the NUMA discussion in case you are around
[15:21:17] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11428720 (10RobH) Day 12 Update (in progress, will edit as day progresses): * alert1002 migration complete * 306 of 308 hosts migrated. * lvs1019 will migrat...
[15:45:30] 06Traffic, 10Math, 06MediaWiki-Platform-Team, 10PageViewInfo, 07OKR-Work: Fix external calls to AQS in Wikimedia extensions - https://phabricator.wikimedia.org/T411641 (10Tgr) 03NEW
[16:16:05] sukhe: vgutierrez: I'm around sporadically, not sure I have anything to contribute, was just curious to learn more about the NUMA issues you were facing on the caches
[16:16:37] if anything is written down anywhere that could be easiest. And of course you don't owe me a reply at all, I'm just idly curious.
[16:21:52] ori: in a meeting but if you are curious, we can link to some of the stuff we are doing and then happy to discuss
[16:23:18] cool
[16:41:23] ori: in theory, we could treat a NUMA machine like two machines with one kernel - run separate instances of all our stack on each side, listening on different IPs/ports? but then we're halving the ram-cache space or disk space on the machine between those two instances, so we'd at least need to build them differently for that kind of idea (double up those resources, and make sure they're
[16:41:29] attached correctly in terms of the mem split and the PCIe<->cpu-socket natural mapping, etc)...
[16:42:05] or the better idea we actually did explore at one time, was trying to pin processes and reserve memory to avoid the NUMA inefficiencies. e.g. run the front edge TLS + mem cache on numa node 0 and the backend disk cache on node 1.
[16:42:32] so that, in theory, they're just communicating cross-socket for what's effectively unix domain socket traffic between them.
[16:42:50] but again, half the ram on each socket makes that problematic. one side wants all the mem, one side wants all the disk.
[16:43:17] but also, when we tried related pinning and exclusion stuff in the past, inevitably we ran into inexplicable crashes that were difficult to debug
[16:44:41] when we treat it naively (pretend NUMA doesn't exist)... well, half the ram is attached to each socket, and varnish has thousands of threads accessing a giant shared memory cache on both sides, and you're just hoping the linux abstractions can migrate threads and/or memory back and forth efficiently? but it's a lot of ping pong, and no generic algorithm is going to make it work
[16:44:47] optimally.
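For illustration only, and not the configuration that was actually tried: the pinning idea described at 16:42 (front-edge TLS and in-memory cache on node 0, disk-cache layer on node 1, for both CPUs and memory) could be sketched with numactl roughly as below. All daemon flags, ports, sizes and CPU ranges are placeholders.

```
# Hypothetical sketch of the "front edge on node 0, disk cache on node 1"
# pinning idea; daemon flags, ports and cache sizes are placeholders.
numactl --cpunodebind=0 --membind=0 -- haproxy -f /etc/haproxy/haproxy.cfg
numactl --cpunodebind=0 --membind=0 -- varnishd -a :3120 -s malloc,96G
numactl --cpunodebind=1 --membind=1 -- traffic_server

# Equivalent intent via a systemd unit drop-in (NUMAPolicy=/NUMAMask= need a
# reasonably recent systemd):
#   [Service]
#   CPUAffinity=0-31        # placeholder: the CPU range belonging to node 0
#   NUMAPolicy=bind
#   NUMAMask=0

# Afterwards, numastat shows whether a process's memory actually stayed on
# the intended node.
numastat -p "$(pgrep -o varnishd)"
```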
[16:45:46] NUMA works great if you have several discrete chunks of workload with separate state, all of which are <= a whole NUMA domain in terms of CPU+mem+IO and can be kept there.
[16:47:19] otherwise it's just inefficient and a PITA to engineer around, or just inefficient usage of the purchased hardware resource, either way.
[16:51:55] the theory I guess is we'll get more workload-per-$ out of the solution if we drop the second CPU socket and everything is attached to the only socket (non-numa, effectively), and then maybe spend a little more on making that one socket a little beefier.
[16:52:28] the shape of the hardware needs to match the shape of the software (which is not a great shape, but it's the shape we have)
[17:15:45] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11429389 (10ssingh) >>! In T408892#11426444, @Papaul wrote: > @ssingh yes we have to depool the site, yes 10 AM CT Thanks, that works. Will send an invite.
[17:26:12] FIRING: [2x] LVSHighRX: Excessive RX traffic on lvs3008:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[17:27:09] FIRING: [2x] LVSHighRX: Excessive RX traffic on lvs1016:9100 (enp4s0f0) - https://bit.ly/wmf-lvsrx - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[17:27:09] FIRING: [8x] LVSHighCPU: The host lvs1016:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1016 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:31:09] FIRING: [5x] LVSHighRX: Excessive RX traffic on lvs3008:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[17:32:38] FIRING: LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:37:09] FIRING: [8x] LVSHighCPU: The host lvs1016:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1016 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:37:38] RESOLVED: LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:42:09] RESOLVED: [8x] LVSHighCPU: The host lvs1016:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1016 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:46:09] RESOLVED: [5x] LVSHighRX: Excessive RX traffic on lvs3008:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[17:47:09] RESOLVED: [2x] LVSHighRX: Excessive RX traffic on lvs1016:9100 (enp4s0f0) - https://bit.ly/wmf-lvsrx - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[18:07:02] re: treating a NUMA machine like two machines with one kernel. That's ~basically how Google's fleet operates. There are asterisks to that -- e.g.
we let CPU and memory allocations spill across socket boundaries under extreme resource pressure scenarios, but for the most part every NUMA node is a discrete scheduling domain.
[18:09:42] re: halving the ram-cache space or disk space -- fair enough, but it also maximizes density by amortizing shared infra like PSUs, fans, chassis, management interfaces, etc.
[18:13:14] re: pinning and exclusion leading to inexplicable crashes. That's surprising, these are generally pretty basic/reliable mechanisms in the kernel but I guess application code can do surprising things
[18:13:41] re: no generic algorithm is going to make it work. yeah 100%
[18:15:13] (also re: halving ram/disk -- if you're anyway intending to limit yourself to what could be directly attached to a single socket, isn't that effectively the same?)
[18:16:56] but single-socket sounds sane, for the WMF the complexity overhead of managing NUMA might be more significant than the potential savings
[18:18:12] also hi and thanks for humoring my questions :)
[18:18:49] yes, that's the NUMA-based model that would work ideally: if we doubled the RAM we're purchasing today (split between sockets), and doubled the disks. The disks are PCIe card NVMes, and so it also matters which socket which slot is homed to. We currently deploy 2x6.4TB cards for the storage. I don't think the small chassis we use today could fit 4, or even 2 double-sized ones, and keep them
[18:18:55] where the PCIe attaches to the correct socket, too? It could be explored, but seemed problematic or a bigger chassis either way in the past.
[18:18:58] hi :)
[18:20:37] part of the problem is the extra custom engineering for all of this too, since nothing else in our fleet really operates this way in terms of machine config at that point. it's just simpler to have a 1:1 of the full CDN stack to the hardware in some sense.
[18:21:19] (ditto scaling and redundancy: if we replaced the 8 nodes of a cluster with 4 of these beefed up split-numa nodes... failure domains grow, because we're small)
[18:21:56] good point
[18:24:19] but really at the heart of all of this, is that our CDN stack is poorly-shaped in multiple ways to be easy to scale out/up/down to other shapes
[18:25:22] varnish's (silly IMHO) model of thread-per-active-connection vs something closer to thread-per-core, wanting to use most of the RAM in one giant blob for frontend cache, vs the bottom layer of the stack wanting ~all the disk storage it can, etc, etc...
[18:26:06] and we can't easily get away from Varnish, yet. We've tried before, we're kinda locked into it for the foreseeable future.
[18:26:11] I haven't kept up with the evolving CDN software stack. Did Varnish survive because it's worst-except-for-all-the-others?
[18:26:20] it has lots of pain points, but lots of value we can't find elsewhere, either, basically.
[18:26:26] yeah, basically
[18:28:12] we tried at one point to move towards a two-layer model that was ATS as frontend TLS termination + frontend ram cache, and ATS again as the backend disk cache layer (as a transitional step to just 1x ATS daemon covering the whole stack)
[18:29:46] but shaking varnish is hard: VCL logic can do a lot of things that are harder in the ATS+Lua model, and we had problems with ATS as a TLS terminator too.
So for now our stack is haproxy (TLS termination and early-stage defenses)->Varnish(frontend cache, "business logic", deeper defenses)->ATS(large backend disk cache + routing to applayer)
[18:30:56] 06Traffic: 404 error when using action=raw on an admin-level hidden revision - https://phabricator.wikimedia.org/T351688#11429688 (10Pppery)
[18:31:07] 06Traffic, 10MediaWiki-Revision-deletion: 404 error when using action=raw on an admin-level hidden revision - https://phabricator.wikimedia.org/T351688#11429692 (10Pppery)
[18:42:11] 06Traffic, 10MediaWiki-Revision-deletion: 404 error when using action=raw on an admin-level hidden revision - https://phabricator.wikimedia.org/T351688#11429760 (10taavi) As far as I can tell this is MW serving a blank page: `lang=shell-session taavi@deploy2002 ~ $ curl -v --connect-to ::mw-web.discovery.wmnet...
[19:00:39] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11429856 (10BCornwall)
[19:07:00] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11429884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@...
[19:29:53] cdanis: any concerns from your side in pushing out the DNS patches today for the gerrit-addrs changes?
[19:30:35] sukhe: None
[19:30:51] mutante: ^
[19:31:11] :))
[19:31:14] cdanis: ok, I will take care of those and then will leave the fun ones for you for the service addition for tomorrow or later :>
[19:31:20] thanks both
[19:31:21] thank you!
[19:31:25] since you are out next week, I can pick up then
[19:31:31] or someone from Traffic basically but yes
[19:51:34] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumi...
[19:54:53] 06Traffic, 06Infrastructure-Foundations: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11430089 (10HCoplin-WMF) Getting the IP ranges documented is a great first step -- thanks, @taavi ! @KCVelaga_WMF -- this step might be useful for your current API traffic analysis work, es...
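Circling back to the NUMA thread above: the homing questions raised there (which socket the memory, the NVMe cards and the NIC hang off) can be answered from sysfs and the numactl tooling on any existing box. A small diagnostic sketch follows; the device and interface names are examples only (the interface name is borrowed from the LVS alerts earlier in the log, not from a cache host).

```
# Diagnostic sketch: confirm the NUMA topology and how PCIe devices are
# homed to sockets. Device and interface names below are examples only.
numactl --hardware      # nodes, CPUs and memory per node, node distances
numastat                # numa_hit / numa_miss / numa_foreign counters

# Which socket each NVMe card hangs off:
for dev in /sys/class/nvme/nvme*; do
    printf '%s -> NUMA node %s\n' "$dev" "$(cat "$dev"/device/numa_node)"
done

# Same question for a NIC (interface name is just an example):
cat /sys/class/net/eno12399np0/device/numa_node
```

A numa_node value of -1 means the kernel has no locality information for that device, e.g. on a single-socket machine or when the firmware does not expose it.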
[20:01:02] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430098 (10cmooney)
[20:06:26] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430105 (10Jclark-ctr)
[20:14:32] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430132 (10cmooney)
[20:32:10] 06Traffic: druid-public-coordinator: no backend servers pooled - https://phabricator.wikimedia.org/T411675 (10ssingh) 03NEW
[20:32:49] 06Traffic: druid-public-coordinator: no backend servers pooled - https://phabricator.wikimedia.org/T411675#11430188 (10ssingh)
[20:43:28] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430232 (10BCornwall)