[08:13:12] 10netops, 06Operations, 10fundraising-tech-ops, 10ops-eqiad: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3288263 (10ayounsi) As ETA is very short for the new routers and switches, let's wait for them and plan/rack everything at the same time. [08:14:03] 10netops, 06Operations, 10fundraising-tech-ops, 10ops-eqiad: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3288264 (10ayounsi) a:03Cmjohnson [09:03:51] so most machines are still at 0 dropped [09:04:44] then there are a few (e.g. cp4007) with a relatively small number of drops, 663 in that specific case [09:05:05] all drops on one single queue [09:05:47] ema: see _security, we're depooling esams if that doesn't interfere with your work [09:06:04] XioNoX: go ahead and thanks for the heads up [09:34:42] 10Traffic, 10Monitoring, 06Operations, 15User-fgiunchedi: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3288519 (10fgiunchedi) a:03ema [10:47:34] https://grafana.wikimedia.org/dashboard/db/qdisc-stats?orgId=1 [10:47:47] 10netops, 06Operations, 13Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3288680 (10akosiaris) A `tcpdump -vvvv -ttt -i eth0 icmp6 and 'ip6[40] = 134'` on cp3036 shows RAs still being received by the box with i... [10:48:02] no drops so far, requeues in eqiad increased since esams depool [10:49:17] (see cp1074, that was the only host publishing qdisc metrics before the depool) [11:01:49] ema: more BGP unit tests added, and moved it into a separate test sub package running as part of the pybal test suite [11:04:12] mark: +1 [11:05:00] mark: (the sub-package thing, I'll look into the new tests later) :) [11:05:07] hehe [11:05:17] i hope coveralls etc will also run them then [11:05:22] no idea how that works otherwise [11:07:50] mark: yeah I think coveralls should pick up the ip/bgp tests with that change [11:08:13] mark: re: https://gerrit.wikimedia.org/r/#/c/354680/, I'm not sure what the alternative to eval could be to get a list of strings from config? [11:08:29] i probably used eval at the time [11:08:34] but we're trying to move away from that [11:10:30] i guess if other lists do the same it's fine (I didn't check tbh), but let's replace that at some point :) [11:15:13] 10Traffic, 10Monitoring, 06Operations, 15User-fgiunchedi: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3288748 (10ema) 05Open>03Resolved Fixed! We now have per-IPv4/IPv6 backend metrics available: ``` node_ipvs_backend_connections_active{local_address="2620:0:863:ed1... [12:03:24] mhh I just depooled thumbor1001 but lvs1003 doesn't agree it seems [12:03:31] i.e. https://config-master.wikimedia.org/pybal/eqiad/thumbor [12:03:41] but [12:03:42] lvs1003:~# ipvsadm -L | grep -i thumbo [12:03:43] TCP thumbor.svc.eqiad.wmnet:8800 wrr -> thumbor1001.eqiad.wmnet:8800 Route 10 581 2600 -> thumbor1002.eqiad.wmnet:8800 Route 10 537 2284 [12:04:42] godog: likely the thread that updates from etcd is borked [12:04:55] and thus it's not responding to any more pool/depool [12:05:33] ah that's T134893 isn't it? 
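The thumbor1001 exchange just above is the desync in a nutshell: the depool shows up in the state pybal publishes at config-master, but the realserver is still configured in IPVS on lvs1003. A rough way to spot such a mismatch from the LVS host might look like the sketch below; it's purely illustrative (not an existing tool), and the config-master line format and the service name are assumptions.

```
#!/usr/bin/env python3
"""Illustrative only: compare the pool state pybal publishes on
config-master with the realservers actually configured in IPVS.
The config-master line format and the service name are assumptions."""
import re
import subprocess
import urllib.request

POOL_URL = "https://config-master.wikimedia.org/pybal/eqiad/thumbor"
SERVICE = "thumbor.svc.eqiad.wmnet:8800"


def desired_pool():
    """Hosts marked enabled in the published pybal config (assumed one dict-like line per host)."""
    enabled = set()
    with urllib.request.urlopen(POOL_URL) as resp:
        for line in resp.read().decode().splitlines():
            host = re.search(r"'host':\s*'([^']+)'", line)
            if host and "'enabled': True" in line:
                enabled.add(host.group(1))
    return enabled


def ipvs_pool():
    """Realservers ipvsadm currently lists under the thumbor service."""
    out = subprocess.check_output(["ipvsadm", "-L"], text=True)
    servers, in_service = set(), False
    for line in out.splitlines():
        if line.startswith(("TCP", "UDP")):
            in_service = SERVICE in line
        elif in_service and "->" in line:
            # realserver lines look like "  -> thumbor1002.eqiad.wmnet:8800 Route 10 537 2284"
            servers.add(line.split("->", 1)[1].split(":", 1)[0].strip())
    return servers


if __name__ == "__main__":
    want, have = desired_pool(), ipvs_pool()
    print("in IPVS but depooled in etcd:", sorted(have - want))
    print("pooled in etcd but missing from IPVS:", sorted(want - have))
```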
[12:05:34] T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893 [12:05:37] thanks I'll update that [12:07:39] 10Traffic, 06Operations, 06Operations-Software-Development, 10Pybal, 13Patch-For-Review: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2281050 (10fgiunchedi) Something similar just happened on lvs1003, where thumbor1001 isn't bein... [12:11:46] I wish we had more time to do more rigorous and artificial testing of all the possible tuning scenarios we could do related to linux networking bits [12:12:15] while digging into BBR and FQ and all related issues, I stumbled on a number of papers and presentations and results that seemed interesting to follow up [12:13:39] one is that our FQ may be somewhat-hobbled by the offload features of our NICs. Or not. It's hard to say, as the advice becomes outdated quickly. We should try to look into whether our current kernels' bnx2x and our cards do adaptive tso sizing, etc, at least, and whether some NIC features need turning off (at minor CPU cost) to let FQ do its job better [12:14:48] (the idea is that traditional aggressive TSO means dumping large chunks of data from OS->card->wire and letting it offload tcp segmentation, and that can really hamper the efforts of something like FQ that's trying to fine-tune an output queue at the level of individual packets and nanoseconds) [12:16:13] 10netops, 06Operations: Filter outgoing BGP announcements on AS regex - https://phabricator.wikimedia.org/T83037#3288863 (10ayounsi) Pushed to all cr* in AMS. BGP sessions and advertised routes haven't change. Will roll it to more sites shortly. [12:16:29] another one of the more-relevant/interesting/probable things I stumbled on, is that we should test taking interface-rps a bit further and making it NUMA-aware with respect to the cards [12:17:16] in the LVS case I don't know that it matters much (but maybe), because there's nothing else to do, and spreading interrupt load is at a premium there vs other things [12:17:49] but for the cache nodes, the idea would be that if eth0 is physically attached to numa node 0 (cpu die 0, basically), we should confine its IRQ routing to that NUMA node too [12:18:01] right now we spread it across both dies [12:19:04] so on a typical machine where we have 2x cpu dies with 12 physical cpu cores each, right now we try (if the driver lets us) to configure up to 24x IRQs and fan them out over the 24x physical cores. Then we also start up nginx workers pinned to every core in the system (well 2 per HT pair) [12:20:17] it might be better if interface-rps notices eth0 is attached to numa node 0 (via /sys/class/net/eth0/device/numa_node), and looks at the numa nodes the cpu cores belong to (via /sys/devices/system/node/node0/cpulist)... [12:20:57] and then only ask for 12x IRQs, and map them only to cores in the first numa node (which would be every other core in the linear linux numbering, e.g. cpus 0, 2, 4, 6, ...) [12:21:08] and then bind nginx only on those cores as well [12:21:32] so wire->nginx, it all stays in the same numa domain it started on where the ethernet card exists [12:22:01] (and of course we have varnish-frontend, varnish-backend, and a billion other ancillary things that can use up the other cpu die) [12:25:35] there's also still some modern sources out there advocating against enabling HT if you really care about fast/low-latency networking. 
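To make the NUMA-pinning idea above concrete, here is a minimal sketch (not the actual interface-rps script) that restricts a NIC's interrupts to the CPUs of the NUMA node the card is attached to, using the two sysfs files mentioned. The msi_irqs discovery and the blanket smp_affinity_list writes are assumptions about one possible wiring; a real script would additionally fan the individual IRQs out across those cores.

```
#!/usr/bin/env python3
"""Sketch only (not interface-rps): pin a NIC's IRQs to the CPUs of the
NUMA node the card is attached to, per the sysfs files discussed above."""
import glob
import os

DEV = "eth0"

# NUMA node the card is attached to; -1 means "no NUMA information".
node = int(open(f"/sys/class/net/{DEV}/device/numa_node").read())
node = max(node, 0)

# CPUs local to that node, e.g. "0,2,4,..." on a 2-die box with interleaved numbering.
local_cpus = open(f"/sys/devices/system/node/node{node}/cpulist").read().strip()

# Assumption: the device's MSI-X vectors (one per RX/TX queue) show up here.
irqs = sorted(int(os.path.basename(p))
              for p in glob.glob(f"/sys/class/net/{DEV}/device/msi_irqs/*"))

for irq in irqs:
    # Keep every queue interrupt on the NIC-local die; a real script would
    # spread the individual IRQs across those cores one by one.
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(local_cpus)
    print(f"IRQ {irq} -> CPUs {local_cpus} (node {node})")
```

On the cache nodes the same cpulist would then also be used to pin the nginx workers, so the wire->nginx path stays on the die the card is attached to, as described above.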
but maybe they didn't think to have their IRQ stuff be smart enough to only use half of each HT pair either, or have different hw, etc. It's one of those things you'd only really know by testing I think. [12:26:10] I think for the caches it's not even worth looking into. we have enough other things happening CPU wise that at the end of the day the HT is probably still going to make sense, I think. [12:30:38] bblack: drops seem rare enough, there have been only a few in ulsfo-upload https://grafana.wikimedia.org/dashboard/db/qdisc-stats?orgId=1&var-cluster=upload [12:32:17] so far at least :) [12:34:36] so i haven't read all this [12:34:42] what's the executive summary on BBR? :) [12:37:17] mark: it's currently enabled on all the caches. Instead of relying mostly on the congestion window to send data at the appropriate rate, BBR on linux uses the tc-fq scheduler [12:37:32] yes [12:37:40] what's the observed effect so far? [12:38:12] so far we've been trying to find the appropriate tc-fq parameters to reduce packet drops [12:38:38] performance-wise I don't think we have a clear understanding of the situation yet (or at least I don't) :) [12:39:36] ok :) [12:40:04] the idea is to leave bbr enabled for a bit and compare with last week's performance metrics [12:41:58] nice [12:42:14] that was a good executive summary [12:42:17] maybe we should swap jobs ;p [12:43:06] but then I'd have to learn how to use google docs! I don't think I'm ready for that :P [12:44:31] bummer [12:51:13] > Having a good estimate of the bottleneck bandwidth, unlike other congestion control algorithms, BBR can decide on the optimal TSO size required to saturate the link, at acceptable CPU usage, without causing collateral damage and congestion [12:51:21] (from https://netdevconf.org/1.2/papers/bbr-netdev-1.2.new.new.pdf) [12:53:01] it sounds like BBR does influence TSO's aggressiveness? [12:54:31] oh but we need to check if we have adaptive sizing enabled [12:59:25] `ethtool -k eth0 | grep tcp-segmentation-offload` says it's enabled on all cache hosts [13:00:17] and automatic TSO sizing should be enabled by default since kernel 3.12 https://lwn.net/Articles/564979/ [13:20:59] bblack: do we know if DR/Equinix in Singapore provide OOB internet access? I'd guess yes, but I haven't seen any mention of it [14:04:15] XioNoX: I'd have to double-check DR. I think earlier on they offered it to us (back then I was trying to ignore it and get them to tell us about neutral carriers) [14:04:24] XioNoX: Equinix certainly does [14:04:53] thanks! [14:07:12] ema: yeah looking good on drops I think. nice stats :) [14:08:02] I've even managed to add a dropdown to the template! :) [14:08:13] bblack: tcp_tso_win_divisor seems interesting re: TSO [14:08:20] tcp_tso_win_divisor - INTEGER This allows control over what percentage of the congestion window can be consumed by a single TSO frame. The setting of this parameter is a choice between burstiness and building larger TSO frames. Default: 3 [14:09:28] mark: alternative executive summary: BBR is a congestion control alg (replaces cubic) developed by Google (who also developed FQ scheduler and a bunch of other recent Linux TCP perf hacks it works well with). It's supposed to be much better at filling the available BDP to a client, even slow/mobile clients, and especially in the face of occasional packet loss, carrier throttling/dropping/queuein [14:09:34] g, etc. 
[14:10:27] thanks, now please put that in a google doc [14:10:52] it does that by building a little mathematical model of each flow that models the actual unbuffered bdp/rate of the connection's links, and then telling the FQ packet scheduler to pace outbound TCP flows down to that exact rate (and thus avoids bufferbloat stuff on the network too) [14:11:16] :P [14:12:35] ema: I also looked into the FQ maxrate thing. I don't think it interferes with BBR working correctly. If FQ maxrate ends up actually limiting a fast user's rate, it will appear to BBR just like any other such rate control and be part of the model. [14:13:07] yeah, I know what BBR is, was just wondering if you knew about impact yet [14:13:10] but I guess I should wait :) [14:13:10] and it has the potential to smooth things out further [14:13:30] (to avoid giant-rate bursts in the transmit queue) [14:13:49] the only downside is any reasonable maxrate we'd want for clients is probably too low for inter-cache [14:14:40] so next we should gather those rate stats per network block and use it for geodns [14:14:55] but still, without one our effective maxrate is the whole 10G. We could set it somewhere conservative like 1G. even if a big refill of remote cache contents is happening, it happens over many separate machines/daemons, many separate threads / tcp conns, etc. [14:15:06] and then you have EXACTLY what I wrote those pybal bgp classes for, originally :) [14:15:24] (not part of pybal then) [14:15:34] heh [14:16:08] yeah you could theoretically log out the rates, and then also have geodns randomly mis-target N% of users so we get data on all netblocks to all sites eventually [14:16:14] exactly [14:17:01] although I was gonna use latency stats then [14:18:12] you can get that from BBR too [14:18:36] for each connection it has an ephemeral, evolving model of both the bottleneck_bandwidth and min_rtt for the client. [14:19:15] i want to build this now [14:19:28] you could even build up both datasets, and in some potential corner cases send their text traffic to the best rtt destination and their upload traffic to the best BW destination :) [14:20:04] (but I'd be willing to bet they rarely differ) [14:20:23] (especially as usually the limit is in the last two miles: either carrier throttling or the user link itself) [14:20:38] should send this into analytics cluster [14:21:12] yes, and have them dump out new mappings once a month that parse into geodns "nets" files [14:21:19] using long-term averages [14:21:59] part of my plan was actually to also react to bgp updates [14:22:07] but it's tricky [14:22:38] well the most basic thing there is you can look at bgp path length as some kind of information [14:22:45] path length alone isn't a good indicator of closeness [14:22:57] sure, but the fact that a path change happens is good info [14:23:02] but a recent/sudden increase in path length might be a reason to fall back to your next choice on the regular geoip map [14:23:07] unfortunately you have only one way [14:23:11] yup [14:23:56] Heya traffic team, is ema around? 
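For the mapping idea being kicked around above (~14:16-14:21), the end product would be something like a "netblock -> preferred site" table derived from long-term measurements. A purely illustrative sketch of that aggregation step follows; the input format, netblock granularity and sample data are all made up, and it assumes per-connection (client_net, site, min_rtt) samples have already been exported somewhere like the analytics cluster.

```
#!/usr/bin/env python3
"""Illustration only: turn per-netblock RTT samples into a geodns-style
"netblock -> preferred site" mapping. Input format and names are invented."""
from collections import defaultdict
from statistics import median

# (client_net, site, min_rtt_ms) samples, e.g. exported from BBR's per-connection
# min_rtt estimates over a month. Hypothetical data.
samples = [
    ("198.51.100.0/24", "esams", 22.0),
    ("198.51.100.0/24", "eqiad", 95.0),
    ("203.0.113.0/24", "ulsfo", 140.0),
    ("203.0.113.0/24", "codfw", 180.0),
]

per_net_site = defaultdict(list)
for net, site, rtt in samples:
    per_net_site[(net, site)].append(rtt)

# Long-term aggregate (median here) per netblock/site, then pick the best site.
best = {}
for (net, site), rtts in per_net_site.items():
    agg = median(rtts)
    if net not in best or agg < best[net][1]:
        best[net] = (site, agg)

for net, (site, rtt) in sorted(best.items()):
    print(f"{net} -> {site}  (median min_rtt {rtt:.0f} ms)")
```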
[14:24:02] he was a short while ago [14:24:34] bblack: cool, I'll wait for him :) [14:24:51] ema: ^ [14:26:35] joal: o/ [14:26:54] hi ema \o [14:27:18] ema: I have other weird things to discuss around cookies in wikidata [14:27:31] :/ [14:28:47] ema: I have 302 hits on www.wikidata.org that have Last-Access and Last-Access-Global cookies set, without any previous hit on this domain for the same fingerprint-hash [14:29:17] ema: clearer: 302 on www.wikidata.org that redirects to 200 on m.wikidata.org [14:30:25] joal: do you see those only for wikidata or for other domains as well? [14:30:43] ema: I have not investigated on other domains yet [14:31:16] joal: can you send me the pivot link? [14:31:36] ema: not in pivot that time, requesting webrequest for an issue in our global counts of uniques [14:34:50] joal: ok, can you share those logs? [14:35:07] ema: currently running a query trying to show that [14:35:19] alright [14:41:02] joal: I'll go afk soon for a bit, can you create a ticket with your findings? We'll discuss it later on today :) [14:41:37] ema: I'm trying to document stuff here: https://phabricator.wikimedia.org/T143928, but it's WIP [14:41:43] later ema [15:01:56] ema: re: qdisc monitoring, think you could add the "throttled" counter and such as well from FQ? [15:07:42] I don't know that "inactive" or total matter, and flows_plimit probably isn't useful since we have overall drops [15:07:53] but "throttled" might be helpful if we try to tune maxrate [16:49:18] so back on the maxrate thing, I've done some poking at some caches' tables of connection info with BBR stats [16:49:50] there are public outliers that get into the several-hundred megabits territory (at least briefly as a sending pace!) [16:49:56] but they're not the norm [16:50:13] once you start looking at 1Gbps+, it's mostly intercache or cache<->app [16:50:25] e.g. this is an inter-cache flow within eqiad for upload: [16:50:27] ESTAB 0 0 10.64.32.81:3128 10.64.0.101:23533 bbr wscale:9,9 rto:204 rtt:0.129/0.033 ato:40 mss:1448 cwnd:81 ssthresh:392 send 7273.7Mbps pacing_rate 4133.3Mbps reordering:300 rcv_rtt:22.5 rcv_space:28960 [16:50:47] I think even then, those aren't sustained real send rates [16:50:57] they're snapshots of tiny rate spikes into buffers, etc [16:52:18] and we do have some high-rate public ones, and they tend towards being known aggregators of traffic that are probably sitting in DCs [16:52:23] e.g. 
this one is from zscaler (eww): [16:52:30] bbr wscale:5,9 rto:204 rtt:0.774/0.089 ato:40 mss:1448 cwnd:87 send 1302.1Mbps pacing_rate 1732.0Mbps rcv_space:28960 [16:52:56] and this one is supposedly from some dsl-like end-user IP, but wtf: [16:52:57] bbr wscale:9,9 rto:204 rtt:1.653/0.035 ato:40 mss:1448 cwnd:378 send 2649.0Mbps pacing_rate 1751.9Mbps rcv_space:28960 [16:53:43] some of them are from opera proxies, too, and other such things [16:54:45] given we operate 10G interfaces (both at the per-cache host level, and also at the per-transit-link level and such when it comes to public traffic) [16:55:02] I think it's reasonable for us to configure FQ to set a maxrate of 1Gbps [16:55:42] and shouldn't harm inter-cache traffic either, to let a single TCP flow (of the hundreds coming into a given cache) to not stealing more than 10% of the outbound bandwidth even in a brief spike [16:55:58] s/let/limit/ [16:56:47] and by eliminating these spikes/oddities/superfast-clients/intercaches to not even briefly entertaining >1Gbps rates, it should have some positive effect on FQ doing its fairness job for everyone else. [17:07:10] so to that end: https://gerrit.wikimedia.org/r/#/c/355451/ [17:13:29] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Security-Team: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3289487 (10fgiunchedi) [17:13:56] 10Traffic, 06Operations, 10Phabricator: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3289489 (10fgiunchedi) [17:16:14] 10Traffic, 06Operations, 10Phabricator: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3285916 (10jcrespo) T104735 [17:17:32] 10Traffic, 06Operations, 10Phabricator: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3285916 (10Dzahn) fwiw, class phabricator::redirector has: ``` 16 $alt_host = 'fab.wmfusercontent.org' ``` but "Host fab.wmfusercontent.org not found: 3(NXDOMAIN)".... [17:28:32] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Security-Team: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3288964 (10Legoktm) There's a code comment that says: ```lang=php // 500 so it shows up in browser's developer console. $this-... [17:53:29] 10Traffic, 06Operations, 10Phabricator: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3289633 (10mmodell) @dzahn: good catch but that's overridden: In modules/role/manifests/phabricator/main.pp $altdom = hiera('phabricator_altdomain', 'phab.wmfusercontent.or... [18:57:31] 10netops, 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3289951 (10faidon) RIPE responded with a new USB image; I sent that to Chris over a separate medium. 
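The connection snapshots pasted above (~16:50) come from ss's TCP info output. A throwaway sketch of sifting that output for the >1Gbps outliers under discussion, assuming `ss -tin` prints the same pacing_rate/rtt fields as in those snippets; everything else is illustrative.

```
#!/usr/bin/env python3
"""Sketch: list TCP peers whose current pacing_rate exceeds a threshold,
parsing `ss -tin` text output (field names as seen in the snippets above)."""
import re
import subprocess

THRESHOLD_MBPS = 1000.0                      # roughly the 1Gbps line discussed here
SCALE = {"K": 0.001, "M": 1.0, "G": 1000.0}  # ss prints Kbps/Mbps/Gbps

lines = subprocess.check_output(["ss", "-tin"], text=True).splitlines()

# ss prints one line with the addresses, then an indented line with the
# per-connection details (rtt:..., send ...Mbps, pacing_rate ...Mbps, ...).
for addr_line, info_line in zip(lines, lines[1:]):
    rate = re.search(r"pacing_rate (\d+(?:\.\d+)?)([KMG])bps", info_line)
    if not rate or not addr_line.split():
        continue
    peer = addr_line.split()[-1]
    rtt = re.search(r"\brtt:(\d+(?:\.\d+)?)", info_line)
    mbps = float(rate.group(1)) * SCALE[rate.group(2)]
    if mbps >= THRESHOLD_MBPS:
        print(f"{peer}  pacing_rate={mbps:.0f}Mbps  rtt={rtt.group(1) if rtt else '?'}ms")
```

Against the examples above, this would flag the intercache flow (pacing_rate 4133.3Mbps) and the zscaler and "dsl-like" outliers, which are exactly the kind of flows a 1Gbps FQ maxrate is meant to cap.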
[19:55:22] bblack: we definitely can add throttled and the other fq-specific metrics, but it'll take a bit of work [19:55:55] see the TODO here https://github.com/ema/qdisc/blob/master/get.go#L102 :) [19:56:43] all that is stored into iproute2 struct tc_fq_qd_stats (include/linux/pkt_sched.h) [19:58:20] so yeah the question is how to cleanly distinguish between scheduler-specific stats in ema/qdisc and how to report those in node_exporter without introducing the churn of reporting all those metrics as 0 whenever the specific scheduler in use does not emit those [20:42:18] skimming through sch_fq.c, there's also the notion of "internal packets" [20:43:05] which apparently don't obey flow_plimit: [20:43:13] if (unlikely(f->qlen >= q->flow_plimit && f != &q->internal)) { [20:43:18] q->stat_flows_plimit++; [20:43:21] return qdisc_drop(skb, sch, to_free); [20:43:24] } [20:44:31] oh, ok internal would be packets with high priority [20:44:57] if (unlikely((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL)) [20:44:57] return &q->internal; [20:50:25] so I don't seem to be able to get flows_plimit from tc (iproute2) [20:51:35] there's some confusion between flow_limit, which is the per-flow packet threshold, and flows_plimit, which is the counter of packets going above the threshold [20:51:55] in particular, tc prints the counter like this: [20:52:00] tc/q_fq.c: fprintf(f, ", %llu flows_plimit", st->flows_plimit); [20:52:36] but I don't get any output with `sudo tc -s qdisc show | grep flows_plimit` [20:53:59] oooh, it's only emitted when non-zero [20:54:14] fun times [20:58:39] 10netops, 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3191129 (10Cmjohnson) @faidon the usb with the image is attached. [21:09:25] bblack: ok so in the interest of pragmatism I've just added gcflows, throttled and flows_plimit straight to the generic QdiscInfo struct I'm returning from ema/qdisc [21:09:29] https://github.com/ema/qdisc/commit/e2e5ae489bf8b6b1796ad921b91d5ff559414b06 [21:10:13] it's not really clean, but whatever [21:26:09] (in the sense that it would be nicer to come up with a proper API returning the right datastructure according to the scheduler, and blah blah) [21:34:18] bblack, godog: I've added node_qdisc_{flows_plimit,gcflows,throttled}_total to my node_exporter qdisc-fq-metrics branch in case you want to build and updated version before Monday https://github.com/ema/node_exporter/commits/qdisc-fq-metrics [21:34:40] s/and updated/an updated/ :) [22:14:08] I'm testing out https://gerrit.wikimedia.org/r/#/c/350493/4/modules/varnish/manifests/common/vcl.pp but it seems I'm unable to get the new version of errorpage.html active. Already tried manual vcl-reload (since puppet doesn't do it for the html file). I did understand that std.fileread is cached somewhere, but not sure what will make it refresh. [22:14:14] "Please note that std.fileread is only read once and is cached until varnish is reloaded." [22:14:54] I have it triggered with a small hack on this url - https://en.wikipedia.beta.wmflabs.org/--errorpage-noise [22:15:03] It's using the new vcl, but the old version of the html file still. [22:19:36] also tried sudo service varnish reload, force-reload and restart. 
Still no luck :/ [22:39:13] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3290652 (10ggellerman) [23:50:30] Krinkle: try "sudo service varnish-frontend restart"? [23:50:49] (it should be varnish-frontend instance for your files, and what the docs mean is the whole daemon has to restart to get past std.fileread caching I think) [23:51:31] which is one of the reasons we're avoiding relying on std.fileread() for anything even remotely volatile.
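Back on the FQ counters from earlier (~20:50): since tc only prints flows_plimit when it's non-zero, a missing field simply means zero. A throwaway way to spot-check it from tc's text output is sketched below; the node_exporter branch mentioned above reads the netlink tc_fq_qd_stats struct instead.

```
#!/usr/bin/env python3
"""Sketch: spot-check the fq flows_plimit drop counter via `tc -s qdisc show`.
tc omits the field entirely when the counter is zero, hence the default."""
import re
import subprocess

DEV = "eth0"  # assumed interface name

out = subprocess.check_output(["tc", "-s", "qdisc", "show", "dev", DEV], text=True)

# iproute2 prints it as e.g. ", 42 flows_plimit" -- and only when non-zero.
m = re.search(r"(\d+) flows_plimit", out)
print(f"{DEV} fq flows_plimit: {int(m.group(1)) if m else 0}")
```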