[00:07:41] Traffic, MediaWiki-API, Operations, Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314 (Tgr) Open→Declined Given we have a REST API now, which should probably be the preferred way to implement cached endpoints, and that...
[08:26:17] Traffic, Operations: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema)
[08:29:36] Traffic, Operations: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) cp4027 has been running fine since yesterday with Varnish 6.0.6. Performance-wise there's no impact either, I've added a panel with p75 response time comparison to [[https://grafana.wikimedi...
[09:05:28] Traffic, Operations: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839 (fgiunchedi) Yes this is still valid IMHO despite the lack of activity. Specifically as a defense in depth measure, swift ACLs being the primary line of defense.
[09:21:10] hi traffic - I would like to fiddle with PyBal/LVS again. This time I would like to add a new service (more like a TLS version of an existing service)
[09:21:54] jayme: ok!
[09:47:19] Traffic, MediaWiki-REST-API, Operations: Route requests to the REST MediaWiki API to the api cluster - https://phabricator.wikimedia.org/T263729 (Joe)
[10:10:23] netops, Operations: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (ayounsi) This is now pushed to eqiad and codfw. Result can be seen on: https://librenms.wikimedia.org/graphs/id=16333/type=port_bits/ and https://librenms.wikimedia.org/graphs/id=16552/type=port...
[11:05:08] Traffic, Operations, RESTBase, RESTBase-API, Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178 (Physikerwelt) From the math perspective, the change to the new MW Rest API is already implemented but not yet reviewed. Thereafter, restbas...
[11:15:47] Traffic, netops, Operations, Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (ayounsi)
[11:20:00] Traffic, Operations: Analyze the impact of removing TLSv1/v1.1 on puppetmasters - https://phabricator.wikimedia.org/T242991 (jbond) Open→Resolved a:jbond This has been deployed and every thing looks good, closing, please re open if you see any issues
[11:20:02] Traffic, Operations: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (jbond)
[12:23:51] netops, Operations: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (BBlack) >>! In T263212#6490669, @ayounsi wrote: > Ideally we would take the links state into consideration: If the twin link is down alert at 80%, if it's up alert when the sum is at 80% of the i...
[12:36:33] Traffic, netops, Operations: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) Some thoughts/idea: * enable IPFIX on all/most of the routers interfaces In the current state of our setup this means double/triple accounting a flow as packets cross interface...
[12:50:40] bblack: spring cleaning yesterday? :)
[12:52:15] Autumn?
[12:52:16] :)
[12:54:51] ema: a first pass anyways
[12:55:26] I think I only managed to quick-close like 28 tickets or so, and then a handful more got "Is this still useful?" sort of commentary
[12:56:36] I think next I'm going to migrate what is probably the bulk of the stale-ish tickets into some kind of wishlist/backlog columns. at least anything complicated and/or blocked that we don't have immenent plans to work on.
[12:58:20] and then hopefully what's left after that is simpler to deal with in a basic kanban-ish setup, with maybe a few excess columns around blockage/triage, or "trivial external requests", etc
[12:59:00] and then maybe go back again to the bulk that got shoved off to wishlist/backlog, and try to consolidate/update those into some less-messy state.
[13:00:59] imminent I think, hmm
[13:01:13] mourning speling :)
[13:24:34] Traffic, Operations: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 (MoritzMuehlenhoff) > Security - are we ok with ssh bastions inside ganeti alongside other public service instances? Sounds fine to me. As long as we have two baremetal bastions in eqiad/codfw wh...
[13:43:12] \o/ spring cleaning seems awesome
[13:43:26] maybe we should make it a thing and do it across the board
[13:46:37] Traffic, Operations: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) ulsfo upgraded!
[13:48:56] # TO DEPOOL ESAMS:
[13:48:56] # Don't use this file, instead edit the file "config" and switch the geo-maps part there!
[13:48:56] Might not be needed anymore now that we have 2x10G links for Telia
[13:50:27] netops, Operations: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (CDanis) >>! In T263212#6490669, @ayounsi wrote: > This is now pushed to eqiad and codfw. Result can be seen on: > https://librenms.wikimedia.org/graphs/id=16333/type=port_bits/ > and > https://l...
[13:50:48] XioNoX: because of the VRRP change?
[13:51:18] cdanis: https://phabricator.wikimedia.org/T261723
[13:51:59] ah nice
[13:52:42] yeah, that might be enough, then
[13:52:58] well, hm
[13:53:59] netops, Operations: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (ayounsi) Open→Resolved This is all done.
[13:54:39] yeah I *think* eqiad barely has the outbound link capacity now, assuming nothing else is down, and also assuming reasonable distribution between the transits
[13:54:53] plus all the peering requests Faidon sent :)
[13:55:00] :)
[13:57:25] re: swift and link capacity, I think we're okay to keep swift discovery record pooled in eqiad but I have this ready to go: confctl --object-type discovery select 'dnsdisc=swift,name=eqiad' set/pooled=false
[13:57:44] I was wondering if we wanted to lower the TTL in the meanwhile, but I don't feel strongly either way
[13:58:33] of which record(s) ?
[13:59:05] of the swift dnsdisc
[14:00:04] ah, yeah I don't feel strongly either way tbh
[14:07:50] XioNoX: re: the esams depool stuff and "now that we have 2x10G Telia links", I haven't been following the link stuff closely lately...
[14:07:59] the 2x10G Telia, that's transit?
[14:08:44] yeah we got one more transit port in eqiad
[14:09:07] as when we were depooling esams, we saw that Telia 10G port saturating
[14:10:03] right
[14:10:52] so the 2x10G telia transit, is it some kind of bonded virtual 20G link, or is it really just 2 separate bgp peerings that will have virtually identical views of the neighboring stuff?
[14:11:32] bblack: bonded, yeah
[14:11:33] and yeah maybe ECMP could help balance outbound (as much as I hate sounding like ECMP is the answer to everything when we know it's not ideal in a number of ways)
[14:11:37] so it's a 20G port now
[14:11:38] oh even better
[14:12:55] cdanis: what do you think about, after the dc switchback, to do a planned voluntary esams depool the legacy way just to find out what the new situation is?
[14:13:35] seems reasonable to me!
[14:14:21] ok, let's aim for that maybe. it'd at least be nice to know where we stand on ability to undo the geoip loadshifting hack (worst case, it will probably go away when we get another EU DC though)
[14:14:46] +1, if we want to make it more "safe" we could still use the loadshifting hack and gradually remove the hacks there
[14:14:49] yeah I do see that as the 'real' fix; esams shouldn't be too big to fail
[14:14:53] until it's back to the original one
[14:22:03] we can use that window to upgrade cr2-esams as well :)
[14:41:27] yeah, the intention for that 2nd Telia port was to address some of these concerns we had in the past with regards to Telia congesting when esams was drained
[14:41:32] and we got that for free btw ;)
[14:42:32] we also a) peered with Charter, which lifted ~2G or so from transit to peering, plus a bunch of other networks (Rogers, Internet2 etc.) b) changed the peering policies to prefer peering over transit (T259614)
[14:42:33] T259614: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614
[14:43:24] of course the flip side now is that we may end up congesting our IXP port
[14:44:37] we fixed that in esams (with another IXP port) and in codfw/eqsin we're adding secondary IXPs as well; in Ashburn it's trickier, there aren't any good alternative IXPs, but we could always add a 2nd 10G port if it comes to it
[15:01:49] netops, Operations, ops-eqiad: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (ayounsi)
[15:01:51] netops, Operations, Sustainability (Incident Followup): D1<->D8 VC link failure - https://phabricator.wikimedia.org/T251663 (ayounsi)
[15:01:54] netops, Operations, ops-eqiad: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (ayounsi)
[15:02:03] netops, DBA, Operations, ops-eqiad, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi)
[15:02:06] netops, Operations, ops-eqiad: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (ayounsi) Open→Resolved All done * we briefly (<5s) lost `D1` * some disabled ports automatically re-enabled themselves, causing some latency issues `1/1 Auto-Configured -...
[15:02:26] netops, Operations, Sustainability (Incident Followup): D1<->D8 VC link failure - https://phabricator.wikimedia.org/T251663 (ayounsi) Open→Resolved Solved in T256112.
[15:03:00] netops, Operations, ops-eqiad: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (ayounsi) Stalled→Resolved Solved in T256112.
[15:05:04] netops, DBA, Operations, ops-eqiad, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi) a:ayounsi→Cmjohnson
[15:27:09] Traffic, netops, Operations: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (faidon) I wonder as what kind of ASN would these flows show up as, as well as whether we could have a dimension to be able to differentiate between internet traffic, and backhaul traffi...
[15:30:10] Traffic, netops, Operations: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (CDanis) >>! In T263277#6491680, @faidon wrote: > The per-ASN views we have right now for front-facing traffic are priceless, and it would be a pity to make navigating these more difficu...
[15:35:28] bblack: ema: if one of you has a minute, this is ready for review https://gerrit.wikimedia.org/r/c/operations/puppet/+/629717
[15:37:04] XioNoX: re T263210: DE-CIX will be peered at codfw and not eqdfw, and furthermore we'll prefer it as the 'underdog'? is that right?
[15:37:16] cdanis: correct
[15:37:20] 👍 💯
[15:37:33] thanks, sounds great
[15:37:49] also, feels nice to be in more underdog IXes
[15:38:54] indeed!
[16:10:38] bblack: ema: if I don't hear anything re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/629717 in another half-hour or so, I'll just self-merge :)
[16:13:52] cdanis: added one nitpick
[16:14:38] bblack: ah, that was intentional -- I wanted to not fuss with adding e.g. (www\.)?(m\.)? or something to catch all the language variants
[16:14:44] oh but I see what you wrote
[16:14:47] yeah okay, that sounds good
[16:14:51] I swear I can read
[16:15:15] so that you're intended match of e.g. ca.m.wikipedia doesn't also match e.g. abca.m.wikipedia
[16:15:22] ugh, "your" :P
[19:00:40] netops, Analytics, Analytics-Kanban, Operations: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (Nuria) assigning to @mforns
[19:53:20] Traffic, Operations: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (CDanis)
[19:55:24] Traffic, Operations, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Krinkle) Open→Stalled Pending feedback or confirmation from trwiki editors.
[20:09:34] Traffic, Operations: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (ssingh)
[20:12:22] Traffic, Operations: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (ssingh)
[20:12:27] Traffic, Operations, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[20:12:39] Traffic, Operations: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (ssingh) p:Triage→Medium a:ssingh
[21:13:18] Traffic, Operations: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (CDanis) Turns out this was trivial. `1:1.31-1+deb10u1` is now in buster-wikimedia. I'll test on some backup LVS machine tomorrow or early next week.
[21:14:01] Traffic, Operations: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (CDanis)
[21:14:08] Traffic, Operations: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (CDanis)
[21:14:10] Traffic, Operations: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (CDanis)
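
Editor's note on the 16:13-16:15 regex nitpick: the Gerrit change itself is not quoted in this log, so the following is only a rough sketch of the general point (an unanchored hostname pattern for ca.m.wikipedia would also match abca.m.wikipedia). The pattern strings, the `.org` suffix, and the `matches` helper are all hypothetical and are not taken from the actual patch.

```python
import re

# Hypothetical patterns for illustration only; the real regex in the
# Gerrit change is not shown in the log.
UNANCHORED = re.compile(r"ca\.m\.wikipedia\.org")
ANCHORED = re.compile(r"^ca\.m\.wikipedia\.org$")


def matches(pattern, host):
    """Return True if the pattern is found anywhere in host (search semantics)."""
    return pattern.search(host) is not None


if __name__ == "__main__":
    for host in ("ca.m.wikipedia.org", "abca.m.wikipedia.org"):
        # Print whether each pattern accepts the hostname.
        print(host, matches(UNANCHORED, host), matches(ANCHORED, host))
```

With search semantics the unanchored pattern accepts both hostnames, while the anchored one accepts only the intended host; whether the real patch anchors with `^` or by some other means is not stated in the log.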