[01:58:39] Traffic, Continuous-Integration-Infrastructure, Operations, Release-Engineering-Team-TODO: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn) a:Dzahn
[02:15:27] Traffic, Continuous-Integration-Infrastructure, Operations, Release-Engineering-Team-TODO, Patch-For-Review: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn) @hashar It's fixed for me now. It was missing the releases-jenkins.w...
[02:16:07] Traffic, Continuous-Integration-Infrastructure, Operations, Release-Engineering-Team-TODO, Patch-For-Review: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn) Open→Resolved
[02:16:15] Traffic, Continuous-Integration-Infrastructure, Operations, Release-Engineering-Team-TODO: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn)
[02:50:52] Traffic, Operations, ops-codfw, ops-eqiad: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (BBlack)
[07:53:19] Traffic, Continuous-Integration-Infrastructure, Operations, Release-Engineering-Team-TODO: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (hashar) Magic! Danke Schon!
[10:03:56] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1083.eqiad.wmnet'] ` The log can be found in `/var/log/wm...
[10:28:22] Traffic, Operations, Performance-Team, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) >>! In T238494#5702143, @Krinkle wrote: > I do have a gut-feeling, though, that these two example you mention cannot (should...
[10:31:25] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1083.eqiad.wmnet'] ` and were **ALL** successful.
[11:13:30] Traffic, Continuous-Integration-Infrastructure, Operations, Release-Engineering-Team-TODO: https://releases-jenkins.wikimedia.org yields a 502 unreachable - https://phabricator.wikimedia.org/T239629 (Aklapper)
[13:30:13] Traffic, Operations: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (BBlack)
[13:51:13] https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&var-metric=ssl&var-location=All&var-prop=p75&fullscreen&panelId=54&from=1568035360929&to=now
[13:51:57] likely the main thing to watch for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554291/
[13:56:41] Traffic, Operations, Wikimedia-Logstash, observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (fgiunchedi) >>! In T189333#5645365, @EBernhardson wrote: >>>! In T189333#5488005, @Krinkle wrote: >>>>! In T189333#5483346, @fgiunchedi w...
[13:58:24] ema o/ i updated the SAN to include stream.wm.org (in staging, will deploy the rest in just a bit)
[13:58:27] how's https://gerrit.wikimedia.org/r/c/operations/puppet/+/551247 looking?
[14:02:49] bblack: cp3050 looks good
[14:02:55] curl -v https://en.wikipedia.org 2>&1 | grep issuer
[14:02:56] * issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
[14:03:41] (no restart needed, just puppet agent run)
[14:03:55] looks right
[14:04:09] I'll point my browser to it just to err on the side of caution
[14:05:53] my chrome is happy too, pooling cp3050
[14:25:58] ok deployed, ema you look busy but the public routing thing I think is ready to go if you find some time!
[14:25:59] ty!
[14:26:56] ack ottomata
[15:17:34] thanks ema, switched to the regular http port
[15:18:19] ottomata: running pcc!
[15:29:59] ottomata: +1, seems fine!
[15:30:05] coool
[15:30:10] shall I merge then?
[15:30:12] or do you want to?
[15:30:19] * ottomata is afraid of breaking things
[15:30:54] ottomata: have you not won the tshirt yet? :)
[15:31:03] haha, i do have one of those tshirts
[15:31:16] ottomata: then we're fine, please go ahead
[15:31:21] ok!
[15:34:28] ran puppet on cp1075, looks ok
[15:34:32] running on all cache texts
[15:34:59] ack
[15:37:05] hm vcl compile failon just cp1089?
[15:37:25] Name of VCL object, 'eventgate-logging-external', contains illegal character '-'
[15:37:26] ?
[15:37:47] ottomata: likely because cp1089 is still varnish-be (not migrated to ATS yet)
[15:37:52] huh ok
[15:37:53] ottomata: let's replace - with _
[15:37:57] the rest succeeded
[15:39:04] ema: for just varnish?
[15:39:13] directory?
[15:39:15] director*?
[15:39:26] ottomata: s/eventgate-logging-external/eventgate_logging_external/ as the director name, please change both text.yaml and text_ats.yaml for consistency
[15:40:55] ok
[15:42:23] hm is there no reference to the director for ats? i only see the definition
[15:43:18] ottomata: nope, the data structure is just copy-pasted from role::cache::text to role::cache::text_ats. The latter only uses it for varnish-fe, which does not use the 'director' key
[15:43:28] ah ok
[15:44:24] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/554312
[15:44:50] ottomata: +1
[15:47:43] looks good, running puppet on all texts again
[15:48:30] ottomata: you can limit the puppet run to role::cache::text actually
[15:49:10] ottomata: cumin 'O:cache::text' [...]
[15:49:21] ya that's what i'm doing
[15:49:23] actually i had
[15:49:24] O:cache::text or O:cache::text_ats
[15:49:26] but i guess they are the same?
[15:49:30] nope!
[15:49:34] ah
[15:49:37] you can skip O:cache::text_ats
[15:49:46] because it isn't being routed to anyway?
[15:50:02] use the aliases when possible ;)
[15:50:24] ottomata: O:cache::text_ats are varnish-fe + ats-be, so puppet did run successfully there already before
[15:50:31] aye right
[15:50:36] it only failed on O:cache::text, which is varnish-fe + varnish-be
[15:50:39] ok looks good
[15:51:18] trying it...
[15:51:47] hmm
[15:53:57] ema how can I tell which backend it is being routed to?
[15:53:58] i'm getting 404
[15:54:07] not sure if that's because routing is wrong or something else
[15:54:20] bblack: about w/rest.php and it seemingly not being cached well in varnish/ATS, the tasks around this work are part of https://phabricator.wikimedia.org/T231338 and https://phabricator.wikimedia.org/T229662 , which is primarily for the Wikipedia iOS app right now tracked at https://phabricator.wikimedia.org/T228783.
[15:55:12] https://stream.wikimedia.org/produce/logging/?doc
[15:57:21] ottomata: does https://eventgate-logging-exte
[15:57:24] sorry
[15:57:43] does https://eventgate-logging-external.discovery.wmnet:43192/ return the right thing?
[15:58:11] curl https://eventgate-logging-external.discovery.wmnet:43192/_info
[15:58:12] yes
[15:58:19] https://stream.wikimedia.org/produce/logging/_info
[15:58:21] does not
[15:58:33] but there's no _info in your request
[15:58:45] https://stream.wikimedia.org/produce/logging/?doc
[15:59:08] sorry switched to _info instead of ?doc
[15:59:14] easier to test (json response instead of html)
[15:59:18] either should work
[15:59:31] ?doc works with discovery url
[16:00:39] to which request against the origin server (eventgate-logging-external.discovery.wmnet) do you want https://stream.wikimedia.org/produce/logging/_info to be rewritten?
[16:01:12] https://eventgate-logging-external.discovery.wmnet:43192/_info
[16:01:19] it should be
[16:01:57] /produce//(.*) -> eventgate--external.discovery.wmnet:/$1
[16:02:24] we are using produce/ to just route everything to the correct eventgate backend
[16:02:54] so anything after produce//(.*) should just be rewritten to the top level path of the backend
[16:04:37] well we need to instruct varnish and ats to do so, they can't guess it :)
[16:05:14] hm i thought we did that
[16:05:33] OHHH i need the regex?
[16:05:34] ahhhh oops
[16:05:37] missing .+
[16:05:41] like eventstreams has
[16:06:00] ah i have it for varnish
[16:06:01] missed it for ats
[16:07:09] in the case of varnish backends BTW things won't work as eventgate-logging-external does not listen on port 33192, but we knew this
[16:07:17] it should listen on that port...
[16:07:22] it has both exposed
[16:07:53] I guess not in LVS, but it's fine
[16:08:00] we're gonna convert the remaining varnish-be soon enough
[16:08:02] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554318
[16:08:56] ema: does the port matter in LVS? i guess just for failover?
[16:10:05] hm interesting.
[16:10:31] yes, the port matters to LVS, it's a layer 4 router :)
[16:10:59] oook hm, got confused about something there ooook
[16:11:00] anyway
[16:11:05] (think of the common use-case of ports 80 and 443 for the same public IP going to different clusters for TLS vs regular HTTP)
[16:11:19] hm.
[16:11:19] k
[16:11:34] ottomata: is there any specific reason for rewriting the url? can't we just map stream.wikimedia.org/produce/logging to eventgate-logging-external.discovery.wmnet:43192/produce/logging ?
[16:11:34] ema: i think https://gerrit.wikimedia.org/r/c/operations/puppet/+/554318/1/hieradata/role/common/cache/text_ats.yaml should fix
[16:11:51] yes, the app there doesn't know anything about /produce/logging
[16:11:54] it is just /v1/events there
[16:12:06] yeah I share that concern as well
[16:12:14] we've been down this road with e.g. restbase and still paying for it
[16:12:16] we are going to have multiple eventgate endpoints (2 exposed rn)
[16:12:22] ema: bblack: looks like a full recovery of the TLS handshake overhead. but only half of a full recovery for responseStart
[16:12:28] https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&fullscreen&panelId=54&from=1568035360929&to=now&var-metric=responseStart&var-location=All&var-prop=median
[16:12:42] so we decided to resuse stream.wm.org as endpoint, with path based routing to proper backend
[16:12:47] https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&fullscreen&panelId=54&from=1568035360929&to=now&var-metric=ssl&var-location=All&var-prop=median
[16:13:02] happy to change, would have loved some input before though! :p
[16:13:36] ottomata: path-based routing is fine, the issue we're taking is with the URI not staying consistent between the public and private views.
[16:13:58] https://phabricator.wikimedia.org/T236386#5667115
[16:14:49] ?
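Editor's note: the rewrite being debated above (everything after the /produce/ prefix and backend name is forwarded to the top level of the matching origin) can be sketched roughly as follows. This is only an illustration of the idea, not the Varnish or ATS configuration from the Gerrit changes. The `ROUTES` table, the `rewrite` function and the `name`/`rest` group names are hypothetical; the logging origin host and port 43192 are the ones quoted in the log.

```python
import re

# Hypothetical routing table: backend name in the public path -> internal origin.
# Only the logging entry (host and port 43192) appears in the log.
ROUTES = {
    "logging": "eventgate-logging-external.discovery.wmnet:43192",
}

# Capture the backend name and everything after it. Without the trailing
# capture group nothing after the prefix would be forwarded to the origin.
PRODUCE_RE = re.compile(r"^/produce/(?P<name>[^/]+)/(?P<rest>.*)$")


def rewrite(public_path):
    """Map a public path to an origin URL, or None if no route matches."""
    match = PRODUCE_RE.match(public_path)
    if match is None or match.group("name") not in ROUTES:
        return None
    return "https://" + ROUTES[match.group("name")] + "/" + match.group("rest")


print(rewrite("/produce/logging/_info"))
```

Note that the public path (/produce/logging/_info) and the origin path (/_info) differ after this step, which is exactly the inconsistency raised at 16:13:36.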
[16:15:24] hmm actually maybe a it too early to tell based on daily patterns
[16:15:42] bblack: if we have multiple endpoints, we'd need to vary the paths if we do a single domain
[16:15:49] why?
[16:15:55] how else to route to the right one?
[16:16:01] or maybe I don't understand your statement
[16:16:11] eventgate-logging-external vs eventgate-analytics-external
[16:16:17] two different app backends
[16:16:20] ok
[16:16:21] both need public routing
[16:16:27] and their public URIs are?
[16:16:32] in current plan:
[16:16:41] stream.wikimedia.org/produce/logging and stream.wm.org/produce/analytics
[16:16:45] right
[16:17:16] gilles: nice, I see that p75 responseStart(Europe) was 353ms one month ago 2019-11-03T17:00, 440 right now
[16:17:19] aside from configuration differences, they are the same app software though
[16:17:25] so same HTTP API
[16:17:33] so what we're saying is, if those are the public URIs to the two services, then eventgate-logging-external's applayer should accept URIs of the form stream.wikimedia.org/produce/logging and eventgate-analytics-external should accept URIs of the form stream.wm.org/produce/analytics
[16:17:48] most software has some kind of $baseURI argument or something to fix that
[16:18:04] ema: can this stay on for 24 hours or more? to rule out the daily pattern
[16:18:16] or if you can't change the software's scheme to match the public scheme, then maybe change the public scheme to match the software's capability
[16:18:30] hm we could do that, but it seems a bit awkward to modify the HTTP API for that...i'd almost rather do separate public domains at the point
[16:18:49] what we're asking is that you don't modify the API
[16:18:55] gilles: fine by me, bblack?
[16:19:32] gilles: I can do 24h, but we really need to pull it back soon after that.
[16:19:45] good enough :)
[16:20:36] hm, i guess it just seems weird to vary the API uri because of a services configuration. would be like making elasticsearch HTTP API change baseUri depending on the installation of it
[16:20:48] but maybe we should consider difffernt domains
[16:20:52] well
[16:21:06] it is weird to vary the public API based on internal service config details
[16:21:22] the ideal is the other way around, but we also lack a whole lot of other things to fit that proper model, properly
[16:21:27] i guess so, that just seems like routing stuff but i get your point
[16:21:38] sooo back to naming!
[16:21:52] the gripe about mangling URIs in transit is two-fold:
[16:21:58] would prefer not to put 'eventgate' in the public domain name, so naming is hard
[16:22:10] 1) We don't want the traffic layers having to deal with URI-mangling, it makes their logic more difficult
[16:22:41] 2) It makes debugging and analysis and logging and analytics harder for everyone involved when the same request has different URIs at different infra layers.
[16:22:49] aye
[16:23:17] bblack should i just KISS and make the public domain match the internal one?
[16:23:22] eventgate-logging-external.wm.org?
[16:23:27] or maybe with subdomain?
[16:23:32] that's a huge domainname :)
[16:23:34] logging-external.eventgate
[16:23:39] please no subdomains, they break all kinds of TLS/DNS stuff
[16:23:42] or maybe external is redundant in pubplic URI
[16:23:44] oh.
[16:23:46] sigh ok
[16:24:05] ok will think of something...
[16:24:24] to us, there's very little difference between e.g.: eventgate.wm.o/logging/v1/foo and evengate-logging.wm.o/v1/foo
[16:24:33] you're just shifting things around within the namespace of the URI
[16:24:46] yes, but making it so the API paths match between public and internal app
[16:25:07] ... by using a base path to shift the internal applayer view of the URI to match the public shared view
[16:25:46] wait, you were saying you don't like
[16:25:47] eventgate.wm.o/logging/v1/foo and evengate-logging.wm.o/v1/foo
[16:25:58] sorry don't like
[16:25:58] eventgate.wm.o/logging/v1/foo
[16:26:00] no, I'm saying I don't like those if you have to rewrite them
[16:26:05] right?
[16:26:05] if it routes to /v1/foo
[16:26:24] aye, and i don't like modifying the internal API with a baseUri...soooooo i will do it via a domain
[16:26:53] whether you use the baseuri or a domain, the result is basically the same
[16:26:59] ok
[16:27:10] as long as you don't use subdomains :)
[16:27:45] but you could make them eventgate-(analytics|logging).wm.o/v1/foo or eventgate.wm.o/(analyics|logging)/v1/foo and it would make little difference
[16:27:55] it moves a word to a different part of the URI space
[16:28:05] either way the important thing is it's consitent inside and out
[16:28:10] *consistent
[16:28:47] aye
[16:29:02] ok, i want to find a good domain name without 'eventgate' in the name
[16:29:14] so am heading over to the bikeshed paint store...
[16:52:06] Traffic, netops, Operations, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns3002.wikimedia.org', 'dns5002.wikimedia.org'] ` The log can be fo...
[17:39:16] Traffic, netops, Operations, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3002.wikimedia.org', 'dns5002.wikimedia.org'] ` and were **ALL** successful.
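Editor's note: the base-path alternative raised at 16:17:48 and 16:25:07 (the application itself accepts the public URI, so the traffic layer forwards it unchanged) might look roughly like the sketch below. It assumes a hypothetical `base_path` setting and a toy handler; the log does not show what eventgate actually supports. Only the `/v1/events` path is taken from the conversation.

```python
def make_app(base_path):
    """Build a toy handler that serves one API under a configurable base path.

    base_path is a hypothetical setting; an empty string means the API lives
    at the top level, as with a dedicated per-service domain.
    """
    base = base_path.rstrip("/")

    def handle(path):
        # Requests outside the configured base path are not ours.
        if not path.startswith(base + "/"):
            return 404
        # Strip the base path inside the app, so no proxy has to rewrite it.
        api_path = path[len(base):]
        return 200 if api_path == "/v1/events" else 404

    return handle


# Shared domain with a path prefix: the public URI reaches the app unchanged.
logging_app = make_app("/logging")
print(logging_app("/logging/v1/events"))

# Dedicated domain: same API, no prefix, still no rewriting in between.
plain_app = make_app("")
print(plain_app("/v1/events"))
```

Either layout keeps the URI identical at every layer, which is the property asked for at 16:28:05; the choice only moves a word between the hostname and the path.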
[19:10:13] Traffic, netops, Operations, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns2002.wikimedia.org', 'dns1002.wikimedia.org'] ` The log can be fo...
[19:13:25] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (EBernhardson) In summary, it seems we need to merge the patch[1] for the /entity/ endpoint, and this should be resolved? [1] ht...
[19:39:34] Traffic, netops, Operations, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns1002.wikimedia.org', 'dns2002.wikimedia.org'] ` and were **ALL** successful.
[20:28:14] Traffic, Operations, hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (faidon)
[22:43:09] Traffic, Operations, hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (RobH)
[22:43:12] Traffic, Operations, hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (RobH)
[22:48:29] Traffic, Operations, hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (RobH) Open→Resolved Please note sub-tasks have been created in the private S4 #procurement space, and quotes requested from Dell for these hosts....