[01:36:01] Traffic, DNS, Operations: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (Dzahn) Really wildcard or more like "populate DNS (langlist.tmpl) with all language codes from [[ https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab | ISO-693-3...
[04:04:03] Traffic, Operations, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (Vgutierrez) I've debugged locally what we seen yesterday on production with the following lua script: `lang=lua WEBSOCKET_SUPPORT = nil function __init__(argtb) dofil...
[04:14:30] Traffic, Operations, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (BBlack) @Vgutierrez - I really think, reading the Lua plugin code, that `__reload__` in 8.0.x might not do what you'd sanely expect (although it is undocumented). I thin...
[04:15:31] bblack: interesting
[04:15:46] I was checking the source code right now
[06:44:29] netops, Operations, ops-esams, Patch-For-Review: Setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (ayounsi) Open→Resolved a:ayounsi All done.
[06:47:42] Traffic, Operations, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (ema) >>! In T233274#5691926, @Vgutierrez wrote: > I've debugged locally what we seen yesterday on production with the following lua script: > `lang=lua > WEBSOCKET_SUP...
[07:04:03] Traffic, Operations, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (ema) Tried this on a test instance instead: `lang=lua function read_config() local confffile = ts.get_config_dir() .. "/default.lua.conf" ts.error("Load...
[09:27:13] Traffic, Operations, Wikidata, Wikidata-Campsite, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Addshore)
[09:27:35] Traffic, Operations, Wikidata, Wikidata-Campsite, and 2 others: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Addshore) Open→Resolved a:Ladsgroup I'll close this one along with the subtask then :)
[09:27:37] Traffic, Operations: ats-be on the text cluster is experiencing broken connections - https://phabricator.wikimedia.org/T236988 (Addshore)
[10:43:25] Traffic, netops, Operations, observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (fgiunchedi) >>! In T224888#5690188, @CDanis wrote: > I've a proposal for doing this: > > - Add some special tag like `#NRPE` or `#page` to the names of any...
[11:08:42] (reading backlog) I'm assuming the logging spike on centrallog hosts was the ats logging spam
[11:09:04] this thing: https://grafana.wikimedia.org/d/000000596/rsyslog?orgId=1&from=1574697409436&to=1574712077343&var-server=acmechief-test1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=acmechief
[11:10:22] yep
[11:12:22] kk, thanks!
[11:47:52] Traffic, Operations, fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (jbond) p:Triage→Normal
[11:51:05] Traffic, Operations, observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (jbond) p:Triage→Normal
[11:53:19] Traffic, DNS, Operations: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (jbond) p:Triage→Normal
[11:53:27] netops, Operations, User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (jbond) p:Triage→Normal
[13:30:35] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (Gilles) Seems like the varnish re-imaging and repooling of cp3064 helped: https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&var-met...
[13:31:36] Traffic, Operations, Performance-Team, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles) Seems like the varnish re-imaging and repooling of cp3064 helped: https://grafana.wikimedia.org/d/000000230/navigation-ti...
[13:32:51] probably https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552862/ will help more :)
[13:34:07] netops, Operations, ops-esams: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (ayounsi) Open→Resolved This is all done.
[13:34:37] netops, Operations, ops-esams: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (ayounsi) Resolved→Open I guess we can keep it open until we return the old one.
[13:38:19] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (BBlack) Open→Resolved Seems good so far, has been up a few days and in full service for about a day, without incident. Calling this resolved until anything changes!
[13:56:59] Traffic, Operations, Performance-Team, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) >>! In T238494#5693304, @Gilles wrote: > Seems like the varnish re-imaging and repooling of cp3064 helped: Interestingly, th...
[13:57:14] I'm gonna disable puppet on cp-ats nodes to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552862/
[13:58:37] may the force be with you!
[14:02:45] which host do you think has the best-sounding name for testing things? I like cp3050
[14:05:16] is that even a text node? :)
[14:05:36] oh it is
[14:05:40] nope! text_ats :)
[14:05:43] I was gonna say, testing on cache_upload is kinda cheating :)
[14:09:13] Traffic, DNS, Operations: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (Bugreporter) >>! In T238825#5691807, @Dzahn wrote: > Really wildcard or more like "populate DNS (langlist.tmpl) with all language codes from [[ https://iso639-3.sil.org/sites/iso63...
[14:10:09] service descriptions that don't make sense when read after systemd's actions are my pet peeve for today
[14:10:15] Nov 26 14:07:46 cp3050 systemd[1]: Reloading Apache Traffic Server is a fast, scalable and extensible caching proxy server..
[14:21:53] Traffic, DNS, Operations: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (BBlack) I could go either way on the subject of explicit langlist vs wildcard, really, so long as we're confident the MediaWiki layer handles all unknown language codes sanely, inc...
[14:34:23] not looking good, things get cached despite calling ts.http.config_int_set(TS_LUA_CONFIG_HTTP_CACHE_HTTP, 0)
[14:37:56] Traffic, Operations, Prod-Kubernetes, Pybal, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (fgiunchedi) First thank you for getting the ball rolling on this proposal! A question: are all approaches proposed targe...
[14:46:12] Traffic, Operations, Prod-Kubernetes, Pybal, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (Joe) >>! In T238909#5693597, @fgiunchedi wrote: > From my POV there's great value in having a single solution for load b...
[14:55:59] Traffic, Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (herron) Will this Ganeti cluster use vlan tagged interfaces and associated public/private bridges, or will separate physical interfaces connect to both public and private vlans? If tagging, are the switch...
[14:56:55] so it turns out that both patches should be reverted
[14:57:14] the one unsetting TS_LUA_CONFIG_HTTP_CACHE_HTTP just does not work
[14:57:38] the one disabling request coalescing is wrong because we do it upon read_response, which is too late :)
[14:58:06] reverting
[14:59:05] Traffic, Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (BBlack) I think we'll keep them private-vlan only and no tagging, and for the rare cases of "public" service instances we'll use LVS to route the traffic (same for all the edge-site ganeti).
[14:59:41] the naming of the related variables seems confusing at best
[15:01:21] actively hostile at worst
[15:13:41] Traffic, Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (herron) Ok, for my own edification, how would the private only LVS model work if we wanted to stand up a public facing non HTTP(S) service in a VM at one+ of these sites?
[15:29:26] Traffic, netops, Operations, observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (CDanis) >>! In T224888#5692672, @fgiunchedi wrote: > Sounds great to me! I am assuming on the icinga side it'll be only one alert at least to start with, for e...
[15:39:48] Traffic, netops, Operations, observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (fgiunchedi) >>! In T224888#5693759, @CDanis wrote: > > Any preferences or thoughts re: the special tag? Right now I'm leaning towards `#page` as that seems t...
[15:50:05] netops, Operations, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (fgiunchedi) FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188
[16:11:15] Traffic, netops, Operations, observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (ayounsi) That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but can start with the existing ones. Note tha...
[16:13:55] Traffic, netops, Operations, observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (CDanis) >>! In T224888#5693928, @ayounsi wrote: > That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but ca...
[16:58:31] Traffic, Operations, SRE-tools, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (crusnov) After a conversation with @Volans an extended ask is having the generator able to add and remove files (eg, override completely the c...
[17:17:51] Traffic, Operations, Performance-Team, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) The experiment with request coalescing disabled on cp3050 is running. Meanwhile I've noticed something likely interesting abo...
[17:21:38] ema: interesting... didn't we historically reuse conns for varnish-be->app (and vfe->vbe, and vbe->vbe), just not nginx->vfe?
[17:21:48] why aren't we reusing ats-be->app?
[17:22:48] we are, but not all of them!
[17:23:28] what do you mean?
[17:23:53] wait let me guess. it slots reuseable connections by the SNI and/or Host: header :)
[17:23:55] I mean: ats-be-> app connection reuse is in place, but not all requests go through reused connections
[17:24:24] not sure about the reason yet! :)
[17:24:26] because who would have 36,789 different hostnames for the same backend service, right? :)
[17:28:54] on cp3050, right now, I see we had 758 misses in one second, of which 82 had no reuse
[17:29:15] 1 out of 10 seems a lot
[17:29:26] yeah it does
[17:29:46] lots of potential factors, I don't know enough to help
[17:30:05] well I'm happy we have something rational to look into! :)
[17:30:39] could be ATS is closing up potentially-reuseables at a higher rate than vbe because it's being stricter about calling the responses erroneous-enough to be connection-fatal in some sense.
[17:31:11] I can't even recall if varnish closes on all 5xx's, or keeps the conn if the 5xx is well-formed
[17:31:23] could be the hostname thing mentioned earlier
[17:31:56] could be that ATS is scaling "smarter" by having reuse pools be per-thread, and we have way more threads than we need and the reqs rotate through relatively-fresh threads a lot
[17:33:14] could be that ATS and/or envoy on the other side are enforcing some automatic limits on reusability (close after X requests and/or Y time), none of which applies to the non-TLS case for varnish-be.
[17:34:22] ah yes envoy, there's that too
[17:34:53] (nginx for the appservers, really)
[17:35:38] more research is needed, and ttfbeer is getting lower and lower :)
[17:35:41] Traffic, Operations, Readers-Web-Backlog (Needs Product Owner Decisions): [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (dr0ptp4kt) @cchen copying you in for visibility. iPad iOS 13 is a desktop UA, in case that's useful info in o...
[17:35:42] see you tomorrow!
[17:36:53] cya
[17:48:48] netops, Operations: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (ayounsi) p:Triage→Normal
[18:19:03] I am looking at the existing setup in trafficserver/backend.yaml for "thorium" (various analytics web sites) and it uses "https://thorium.eqiad.wmnet" as replacement for several targets. then i look at the SSL cert on thorium itself and it contains all the target names as SANs but not the replacement name. So the replacement name does not actually have to be on the cert apparently. It just gets
[18:19:09] the right names via SNI before making the request?
[18:19:33] (but didn't we have to do just that and add missing replacement names to certs a couple times now)
[18:26:56] netops, Operations, ops-esams: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (RobH) IRC Update: * @mark is going to take home defective chassis for later return. * https://support.juniper.net/support/rma-locations/ lists the address, but the support case doesn't have a return label I ha...
[21:09:41] Traffic, Operations, Wikidata, Wikidata-Campsite, and 2 others: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (DannyS712) I don't know if this is indeed done: `Request from [snip] via cp4028.ulsfo.wmnet, ATS/8.0.5 Error: 502, Cannot find server. at 2019-11-26 21:08:03 GMT` w...
[21:13:20] Traffic, Operations, Wikidata, Wikidata-Campsite, and 2 others: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (BBlack) I think you ran into a temporary blip in some unrelated DNS work (which is already dealt with), not this bug (502 errors can happen for real infra failure r...
[21:17:47] Traffic, Operations, Wikidata, Wikidata-Campsite, and 2 others: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content) - https://phabricator.wikimedia.org/T237319 (CDanis)
[22:40:53] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack) Status update: the blended authdns+recdns(+ntp) role is now nearly-complete in `role::dnsbox`. There's a hieradata flag `profile::dnsbox::include_auth`...