[01:07:16] HTTPS, Traffic, Gerrit, Patch-For-Review, Security: Disable TLS 1.0 and 1.1 in apache for gerrit.wikimedia.org - https://phabricator.wikimedia.org/T221499 (Reedy)
[01:10:43] HTTPS, Traffic, Gerrit, Operations, and 2 others: Disable TLS 1.0 and 1.1 in apache for gerrit.wikimedia.org - https://phabricator.wikimedia.org/T221499 (Reedy) Looks like this has been done {F31706596 size=full} Do we still want to change cipher stuff? Should we move that to another task?
[05:35:07] Wikimedia-Apache-configuration: Can't access the page &_; via the path /wiki/%26_; - https://phabricator.wikimedia.org/T248728 (Aklapper) (Making a guess on a project tag, please correct me)
[05:42:12] Wikimedia-Apache-configuration: Can't access the page &_; via the path /wiki/%26_; - https://phabricator.wikimedia.org/T248728 (Ammarpad)
[05:42:17] Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Ammarpad)
[05:52:23] HTTPS, Traffic, Gerrit, Operations, and 2 others: Disable TLS 1.0 and 1.1 in apache for gerrit.wikimedia.org - https://phabricator.wikimedia.org/T221499 (hashar) The old TLS has been disabled by the traffic team. @Krenair patch propose to elevate the set of cypher to a stronger set which seems go...
[08:20:43] hi!
[08:20:54] cp1089 seems acting weird, some alerts fired
[08:21:07] the ATS tls instance seems not doing much after midnight UTC
[08:21:08] https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&from=now-24h&to=now&var-site=eqiad%20prometheus%2Fops&var-instance=cp1089&var-layer=tls
[08:23:11] and it seems pooled yes, just checked
[08:25:25] Cc: ema, vgutierrez --^
[08:25:54] checking
[08:26:52] thanks!
[08:27:10] trafficserver-tls has some interesting ssl failures registered
[08:27:15] but other than that didn't find much
[08:29:51] yep
[08:30:02] the node is actually depooled
[08:30:13] (by pybal)
[08:32:47] something happened with ats-tls yesterday around midnight
[08:32:50] Mar 27 23:35:53 cp1089 traffic_manager[27021]: [Mar 27 17:50:07.069] [ET_TASK 1] NOTE: ssl_multicert.config done reloading!
[08:32:51] Mar 27 23:35:53 cp1089 traffic_manager[27021]: [Mar 27 22:56:32.353] [ACCEPT 0:443] WARNING: accept thread received transient error: errno = 24
[08:32:51] Mar 27 23:35:53 cp1089 traffic_manager[27021]: [Mar 27 23:05:52.570] [ET_TASK 0] WARNING: Unable to finalize sync of cache to disk /srv/trafficserver/tls/var/run/host.db: Bad file descriptor
[08:33:20] hmmm
[08:33:21] Mar 27 23:00:14 cp1089 traffic_manager[27021]: [Mar 27 23:00:14.216] [LOG_FLUSH] ERROR: Error opening logging directory /srv/trafficserver/tls/var/log to perform a space check: Too many open files.
[08:33:46] it looks like ats-tls ran out of FDs
[08:35:53] nice :)
[08:36:08] I'll open a task to track this on Monday
[08:36:13] thanks for pinging elukey <3
[08:39:41] <3
[08:40:07] Traffic, Operations: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (Vgutierrez)
[08:44:00] Traffic, Operations: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (Vgutierrez) cp1089 [[ https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad%20prometheus%2Fops&var-instance=cp1089&var-layer=tls&from=1584780221656&to=1585385021657 | shows...
[08:54:18] Traffic, Operations: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (Vgutierrez) [[ https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&from=1584780826363&to=1585385626363&var-site=eqiad%20prometheus%2Fops&var-instance=cp1089&var-layer=tls | even though...
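Note on the "errno = 24" / "Too many open files" lines above: errno 24 is EMFILE on Linux, which is what a process gets once it hits its open-file limit. The log does not show how the FD count was actually inspected, so the following is only a minimal sketch, assuming a Linux host with /proc and enough privileges to read another process's /proc/<pid>/fd. The function names `parse_max_open_files` and `fd_usage` are hypothetical helpers, not part of ATS or any Wikimedia tooling.

```python
import errno
import os

# errno 24 from the traffic_manager log is EMFILE ("Too many open files").
assert errno.EMFILE == 24


def _limit(value):
    """Convert one limit field to int, or None when it is 'unlimited'."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) open-file limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Layout: "Max open files  <soft>  <hard>  files"
            fields = line.split()
            return _limit(fields[3]), _limit(fields[4])
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a Linux process.

    Assumes /proc is mounted and the caller may read /proc/<pid>/fd
    (usually root, or the same user as the target process).
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as handle:
        soft, _hard = parse_max_open_files(handle.read())
    return open_fds, soft
```

Calling `fd_usage(pid)` with the PID of the process in question and comparing the two numbers would show how close it is to the limit; which PID to use for ats-tls is not stated in the log.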
[08:54:35] Traffic, Operations: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (Vgutierrez) p:Triage→Medium
[08:58:34] Traffic, Operations: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (Vgutierrez) @ema we need to check if this can be related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/583295
[10:11:02] HTTPS, Traffic, Gerrit, Operations, and 2 others: Disable TLS 1.0 and 1.1 in apache for gerrit.wikimedia.org - https://phabricator.wikimedia.org/T221499 (Reedy) >>! In T221499#6006826, @hashar wrote: > The old TLS has been disabled by the traffic team. @Krenair patch propose to elevate the set of...
[11:30:05] vgutierrez: sorry to bother again, it seems that cp1077 has the same issue :(
[11:30:08] to perform a space check: Too many open files.
[11:30:19] duh...
[11:36:31] maybe we could quickly check via cumin if other ats-tls processes have an unusual list of open files?
[11:36:42] I checked that already on eqiad
[11:36:51] cp1081 is showing the same behaviour
[11:37:38] lovely
[11:40:14] so sockets are piling up in CLOSE_WAIT state
[11:42:51] cp3062 could be experiencing the same issue as well
[11:55:50] back sorry
[11:57:18] np
[11:58:57] vgutierrez: and the socket in that state is between client and ats-tls right?
[11:59:01] indeed
[12:00:00] the first server showing the issue is cp3062 the 25th beginning at ~15:00
[12:00:27] this https://gerrit.wikimedia.org/r/c/operations/puppet/+/583295 has been merged 30 minutes earlier than that
[12:01:32] ah interesting, so ats-tls might get stuck in some state and not close its end of the tcp conn
[12:01:47] (meanwhile the client already sent a FIN)
[12:01:48] yep
[12:03:22] to avoid forcing you to spend your weekend time on this, maybe we could keep this monitored once in a while and then restart the investigation on monday?
[12:05:59] (lunch, will read in a few)
[12:06:47] yep.. checking CLOSE_WAIT sockets on text nodes is pretty easy
[12:18:21] elukey: quick tracking with https://grafana.wikimedia.org/d/M-GAD-9Zz/t248736
[12:20:19] vgutierrez: that's perfect, will check it once in a while
[12:20:39] in case you are not around, is it ok to restart the ats-tls instance if it shows these issues?
[12:21:09] yep
[12:21:14] I'll keep an eye as well
[12:21:22] thx again elukey <3
[12:22:06] thank you! have a good weekend!
[17:50:44] Traffic, Operations, Projects-Cleanup, Release-Engineering-Team-TODO, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF)
[17:53:01] Traffic, Operations, Projects-Cleanup, Release-Engineering-Team-TODO, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) @Krinkle Are there any more items mentioned in T203899#4895417 that need to be done? Is the item above about purging j...
[19:25:46] Traffic, Operations, Projects-Cleanup, Release-Engineering-Team-TODO, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Krinkle) * JobQueue: I defer to CPT. I'm not sure how to check that. * T203899#4895417: The only thing mentioned there not yet addres...
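Note on the CLOSE_WAIT discussion at 11:40 to 12:06: a socket sits in CLOSE_WAIT when the peer has sent a FIN but the local process has not yet closed its end, so a growing count on port 443 would match the "ats-tls does not close its side" theory. The log does not say how the check or the Grafana dashboard was implemented; what follows is only a rough sketch, assuming a Linux host and the /proc/net/tcp text format (state code 08 is CLOSE_WAIT). The function name `count_close_wait` is a hypothetical helper.

```python
TCP_CLOSE_WAIT = 0x08  # state code used in /proc/net/tcp and /proc/net/tcp6


def count_close_wait(table_text, local_port=None):
    """Count CLOSE_WAIT sockets in the text of /proc/net/tcp (or tcp6).

    If local_port is given, only sockets bound to that local port count.
    """
    count = 0
    for line in table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        local_address, state = fields[1], fields[3]
        if int(state, 16) != TCP_CLOSE_WAIT:
            continue
        # The local address is "<hex ip>:<hex port>".
        port = int(local_address.rsplit(":", 1)[1], 16)
        if local_port is None or port == local_port:
            count += 1
    return count


def close_wait_on_host(local_port=443):
    """Sum CLOSE_WAIT sockets over IPv4 and IPv6 tables on this host."""
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as handle:
                total += count_close_wait(handle.read(), local_port)
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled, or not a Linux host
    return total
```

Parsing is kept separate from file access so the counting logic can be checked against a captured table. Port 443 is taken from the `[ACCEPT 0:443]` log line; whether that is the only port ats-tls listens on is an assumption.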