[03:40:40] Traffic, Operations: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez)
[03:40:54] Traffic, Operations: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) p:Triage→Normal
[04:01:14] Traffic, Operations: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) Further testing shows that the issue is apparently not related to OCSP stapling: ` vgutierrez@cp5001:~$ openssl s_client -connect < /dev/null CONNECTED(00000003) write:errn...
[05:22:14] netops, Operations, cloud-services-team: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (Marostegui) Resolved→Open Re-opening and it doesn't look like it can connect: ` marostegui@tools-sgebastion-07:~$ telnet dbproxy1019.eqiad.wmn...
[05:36:35] netops, Operations, cloud-services-team: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (ayounsi) `lang=diff [edit firewall family inet filter cloud-in4 term labsdb from destination-address] { ... } + 10.64.3...
[05:38:41] netops, Operations, cloud-services-team: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (Marostegui) Open→Resolved And now it works! ` marostegui@tools-sgebastion-07:~$ telnet dbproxy1019.eqiad.wmnet 3306 Trying Conn...
[07:04:45] !log repooling cp5001 - T231262
[07:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:51] T231262: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262
[07:07:31] !log depooling cp5001 - T231262
[07:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:46] Traffic, Operations, Patch-For-Review: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) from a tcpdump capture, it looks like ATS is actually dropping connections: ` 1007 153.027859 → TCP 74 60211 → 443 [SYN] Seq=0 Win=43690...
[07:36:59] Traffic, Operations, Patch-For-Review: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) Further analysis of ats-tls metrics shows that connections were actually being dropped without being logged: ` vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/...
[07:58:06] yeah, see ./modules/trafficserver/files/*.stp for some systemtap probes used in the past to debug ATS-related things
[07:58:30] cdanis: every time I look at the eBPF userspace tools for debugging, they seem "almost ready" :)
[08:28:48] Traffic, Operations, Goal, Patch-For-Review: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1075.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[08:53:37] Traffic, Operations, Patch-For-Review: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) Open→Resolved After disabling the connection throttling, cp5001 behaves as expected and no longer drops connections: ` vgutierrez@cp5001:~$ sudo -i traffic_ct...
[08:53:40] Traffic, Operations, Patch-For-Review: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez)
[08:57:36] Traffic, Operations, Goal, Patch-For-Review: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1075.eqiad.wmnet'] ` and were **ALL** successful.
[09:02:20] netops, Operations: NTT Transit link flapping, now BGP session down - https://phabricator.wikimedia.org/T231278 (Volans) p:Triage→High
[09:17:10] netops, Operations: NTT Transit link flapping, now BGP session down - https://phabricator.wikimedia.org/T231278 (Volans) It seems that the session is misconfigured on their side: ` Aug 27 09:10:38 cr2-eqord rpd[13953]: bgp_process_open:4072: NOTIFICATION sent to 2001:418:0:5000::a34 (External AS 2914):...
[09:25:09] Traffic, netops, Operations, IPv6, Patch-For-Review: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (jbond) >>! In T102099#5433561, @jcrespo wrote: > Hi, I am bit disconnected about the planning of deployment of this- Once all hosts (o...
[09:33:18] vgutierrez: re: T231262 out of curiosity are the metrics in prometheus too? should they be?
[09:33:19] T231262: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262
[09:33:27] err, the last one wasn't a question
[09:33:45] that particular metric isn't in prometheus right now
[09:33:51] we need to upgrade the trafficserver exporter
[09:33:59] and ship a custom metrics.yaml that includes those metrics
[09:34:05] (and more TLS related)
[09:34:26] ah! got it, thanks for the info
[09:35:00] hopefully the debian maintainer is kinda close O:)
[09:35:04] s/hopefully/gladly/
[09:36:34] lolz
[09:38:25] Traffic, Operations: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez) Open→Resolved
[09:38:32] Traffic, Operations, Goal, Patch-For-Review, Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Vgutierrez)
[09:38:49] Traffic, Operations: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez)
[09:38:51] Traffic, Operations: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers - https://phabricator.wikimedia.org/T228135 (Vgutierrez) Open→Resolved
[09:44:21] Traffic, Operations: Track TLS related ATS metrics in prometheus - https://phabricator.wikimedia.org/T231286 (Vgutierrez)
[09:45:57] netops, Operations: NTT Transit link flapping, now BGP session down - https://phabricator.wikimedia.org/T231278 (Volans) Open→Resolved a:Volans It was a maintenance, tracked with GIN-1-2116159603, that was not present to the calendar because sent to noc@ and not the maint announce ML. We need...
[09:48:12] Traffic, Operations: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (Vgutierrez)
[10:06:55] Traffic, Commons, MediaWiki-File-management, Operations, media-storage: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (ema) p:Triage→Normal
[10:15:57] Traffic, netops, Operations, IPv6, Patch-For-Review: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (jcrespo) > Sorry for the lack of clarity, once all servers have the mapped ipv6 address i plan to move this to the base profile with s...
[11:25:36] Traffic, Operations: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (Vgutierrez) Triggering the issue is relatively easy browsing https://maps.wikimedia.org with Chrome 76: ` t=264968 [st=29427] HTTP2_SESSION_RECV_GOAWAY --> active_streams = 2...
[11:29:52] Traffic, Operations: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (Vgutierrez)
[12:37:40] we're now serving prod text traffic with ATS
[12:37:58] wow, congrats Ema
[12:38:54] mutante: \o/
[12:39:21] awesome news!
[12:43:40] ema: yay!
[12:56:51] \o/
[13:17:22] Traffic, Operations, observability, User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (fgiunchedi) Per-backend metrics are in place now via mtail, specifically: * request count: by backend, method, and status * total time spent took by requests: by b...
[13:36:04] ema: \m/
[13:56:56] Traffic, Operations, Patch-For-Review, User-fgiunchedi: Prometheus varnish metric churn due to VCL reloads - https://phabricator.wikimedia.org/T150479 (fgiunchedi) There is still some churn due to the fact that multiple VCLs are loaded at the same time, and we're generating new uuids via `reload-...
[13:57:18] Traffic, Operations, Patch-For-Review, User-fgiunchedi: Prometheus varnish metric churn due to VCL reloads - https://phabricator.wikimedia.org/T150479 (fgiunchedi) Open→Resolved a:fgiunchedi
[14:20:33] ema: nice :)
[14:21:07] :D
[14:21:07] ema: I assume this is the split model? true "text" going via cache_upload's ATS backends (+cp1075 in that set now?), and the misc VCL still using varnish-be?
[14:30:04] bblack: so, cp1075 now is out of the cache_text set, so that codfw/esams varnish-be do not use it
[14:30:20] other than that, it is just a normal cache_text node, meaning that all eqiad frontends use it
[14:33:57] ema: so, cp1075 is running varnish-fe + ats-be, it's not in the set of be's for indirect use from codfw/esams/etc, but it's part of the chashed pool for eqiad varnish-fe to use?
[14:34:05] (but there's no misc/text split?)
[14:34:10] exactly
[14:34:17] we got all the miscs with TLS?
[14:34:38] not yet, but they're in the same DC
[14:35:46] as in, ats-be and the non-tls misc origins
[14:36:23] right
[14:36:40] well, except the possible corner case of .discovery.wmnet if the eqiad side is dns-depooled via confctl :)
[14:37:00] ah yes
[14:37:18] anyways, let's see how it goes!
[14:37:24] yup!
[14:38:01] entirely unrelated: varnishkafka-webrequest.service and varnishkafka-statsv.service failed on cp1081
[14:38:35] the units say Restart=on-failure, but it does not seem like they got restarted
[14:43:10] Traffic, Operations: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (Vgutierrez) Further analysis shows that actually ATS is rate limiting PRIORITY frames even when they are disabled: ` proxy.config.http2.stream_priority_enabled: 0 proxy.config.http2.max_priority_f...
[14:43:21] sigh
[14:43:57] so ats-tls is able to handle all the traffic in cp5001 now
[14:44:13] but I'm seeing issues with the HTTP/2 rate limiting
[14:44:24] specifically with priority frames
[14:44:46] taking into account that ATS ignores those frames right now (priority support is experimental)
[14:45:19] I think I'll disable that specific rate limiting (tomorrow)
[14:47:52] Traffic, Analytics, Operations: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (ema)
[14:48:03] Traffic, Analytics, Operations: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (ema) p:Triage→Normal
[14:57:39] oh.. and finally I got https://github.com/apache/trafficserver/pull/5726 merged :D
[15:00:33] I want to link to a wikimedia grafana dashboard for general network stats, would you have any suggestion?
[15:00:56] Traffic, Analytics, Operations: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (Nuria) CPU at 100%: https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cp1081&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache...
[15:01:16] searching network on grafana I see some probes and performance per host, but no general overview
[15:01:34] jynus: there is none, they're all in LibreNMS
[15:01:40] jynus: do you mean links or hosts?
[15:01:40] I guess private?
[15:01:50] ^XioNoX: ?
[15:01:57] you can get some aggregates by clusters/hosts in grafana I think
[15:02:05] but our external links, that's all in librenms, which yes is private
[15:02:28] it is ok, it was documentation, didn't need anything functional
[15:02:35] jynus: the data itself is not private
[15:02:38] oh
[15:03:10] yeah, but I just wanted a web link for illustration purposes
[15:03:22] don't worry, I will figure something
[15:03:27] thanks for the help!
[15:03:32] we export it to grafana but don't do anything with it as the format is not convenient
[15:03:43] yeah, I 100% understand
[15:03:59] jynus: let me know what you need and I can probably find something :)
[15:04:06] it is not important
[15:04:24] just in the future you don't have anything to do (unlikely :-P)
[15:04:31] just an aggregation like we do with
[15:04:38] https://grafana.wikimedia.org/d/000000278/mysql-aggregated
[15:04:55] "there are some aggregation for our network stats"
[15:05:05] *this is
[15:05:23] but it may not make sense as it would duplicate many in and out stats
[15:05:23] jynus: I'm sharing a librenms dashboard named "global view"
[15:05:48] you should be able to see it in librenms
[15:06:06] that's my go-to if someone asked anything about a network issue
[15:06:58] I see it now
[15:07:10] that is cool, BTW
[15:07:19] it has kibanda and icinga and other stuff
[15:07:23] *kibana
[15:07:32] yeah
[15:08:49] I never realized in fact how different our up and down is
[15:09:04] up as in in and out
[15:09:59] I think I will link to Faidon's network diagram that he created some time ago, even if it is outdated
[15:10:26] X made some better ones on wikitech
[15:10:37] X?
[15:10:51] https://wikitech.wikimedia.org/wiki/Network_design
[15:10:59] X == XioNoX
[15:11:04] ah, nice, thank you!
[15:11:20] those are nice indeed
[15:46:08] bblack: I was thinking, now that we've got ncredir, can we get rid of the https_recv_redirect regex in VCL for canonical domains and return 301 regardless of Host?
[15:46:42] ema: if we've moved all such domains to ncredir for sure (probably need to audit that!)
[15:47:16] bblack: ack, will look into it tomorrow
[15:48:08] not all of them /o\
[15:48:46] I guess I can migrate more domains this week to ncredir if ATS gives me a break
[15:48:48] :)
[16:07:02] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata) > OO, when we reimage these, let's use Buster! :) I take it back, use Stretch. Buster ships with J...
[16:07:13] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata)
[19:48:02] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 5 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Smalyshev) a:Smalyshev→Gehel
[19:49:25] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 5 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Smalyshev)
[20:00:19] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Cmjohnson) a:Cmjohnson→Jclark-ctr @Jclark-ctr Can you move these servers as evenly as you can into r...
[20:04:09] Traffic, Commons, MediaWiki-File-management, Multimedia, and 10 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (LGoto)
[20:15:13] Traffic, Commons, MediaWiki-File-management, Multimedia, and 9 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847 (LGoto)
[21:21:06] Traffic, Operations, Readers-Web-Backlog (Needs Product Owner Decisions): [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (ovasileva)
[21:34:31] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (Jdforrester-WMF) Tagging in Traffic; this is the server (cp1075) running ATS not Varnish, right?
[21:41:43] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (BBlack) a:ema Assigning to @ema to investigate (yes, this is the live test server for ATS backends for these servers). Most likely the problem is specific to ATS<->docker-regist...
[21:44:48] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (greg) p:Triage→Unbreak! This is blocking CI runs.
[21:47:10] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (ayounsi) Note that it's breaking Jenkins on the Puppet repo (goes straight to -1). https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/20234/console
[21:51:57] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (BBlack) Depooled cp1075 `ats-be` service via confctl, can someone retry and confirm mitigated?
[21:58:43] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (Clarakosi) >>! In T231388#5443941, @BBlack wrote: > Depooled cp1075 `ats-be` service via confctl, can someone retry and confirm mitigated? It works!
[21:59:38] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (BBlack) Please leave this open for now so @ema can look at a more-permanent fixup tomorrow!
[22:00:28] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (Jdforrester-WMF) p:Unbreak!→Normal De-prioritising.
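Note on the "chashed pool" and the cp1075 depool mentioned in the log above: a minimal Python sketch of why pulling one backend out of a consistently hashed pool only moves the requests that were landing on that backend. This is an illustration only. It uses rendezvous (highest-random-weight) hashing, which is not claimed to be the algorithm Varnish or the production setup uses. Apart from cp1075, the backend names and the URL keys are hypothetical.

```python
import hashlib


def pick_backend(key, pool):
    """Return the backend with the highest hash score for this key."""
    def score(backend):
        digest = hashlib.sha256(f"{backend}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(pool, key=score)


def remapped_keys(keys, before, after):
    """List the keys whose chosen backend differs between two pools."""
    return [k for k in keys if pick_backend(k, before) != pick_backend(k, after)]


# cp1075 comes from the log; the other names are made up for illustration.
POOL = ["cp1075", "backend-b", "backend-c", "backend-d"]
DEPOOLED = [b for b in POOL if b != "cp1075"]

# Hypothetical request keys standing in for cacheable URLs.
KEYS = [f"/wiki/Page_{i}" for i in range(1000)]

moved = remapped_keys(KEYS, POOL, DEPOOLED)
on_cp1075 = [k for k in KEYS if pick_backend(k, POOL) == "cp1075"]

if __name__ == "__main__":
    print(f"{len(moved)} of {len(KEYS)} keys moved after depooling cp1075")
    print("only cp1075's keys moved:", set(moved) == set(on_cp1075))
```

With this scheme, keys served by the remaining backends keep their assignment after the depool, so their cached objects stay warm. A plain `hash(key) % len(pool)` would instead reshuffle most keys whenever the pool size changes.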