[07:12:31] Traffic, Operations: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (Vgutierrez) after depooling cp4029, the issue moved to cp4030, and upon the restart of varnish-fe on cp4030, the number of fds is now increasing on cp4031 (22k right now)
[10:42:42] Traffic, Operations: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (jijiki)
[12:20:08] Traffic, Operations: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (jijiki)
[14:06:13] Traffic, Operations: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (ArielGlenn) I can see that the challenges get set on the dns hosts by e.g. dig @208.80.154.238 -t txt _acme-challenge.wiki-pedia.org a little past the hour and get...
[15:04:26] Traffic, Operations: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (Vgutierrez) we have several bugs here: 1. acme-chief should refresh the OCSP stapling response even if it is unable to renew the certificate 2. acme-chief should i...
[15:22:19] Traffic, Operations: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (jcrespo) If I can add a 4th and 5th, with lower priority, and feel free to disagree: "Ensure acme-chief-backend is running only in the active node" check should no...
[15:28:10] Domains, Traffic, Operations: nameserver change for wikimedia.sk - https://phabricator.wikimedia.org/T241084 (Luky001) Open→Resolved Everything worked out well. Thanks a lot for the help!
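The SSL CRITICAL alert reports how many seconds remain before the stapled OCSP response expires, which is independent of whether the certificate itself was renewed (hence bug 1 above). A minimal sketch of such a check, assuming the staple's nextUpdate field (RFC 6960) has already been parsed into a timezone-aware datetime; the function names and the WARN_SECONDS/CRIT_SECONDS thresholds are hypothetical and are not the values used by the real Wikimedia check.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical thresholds: NOT the values used by the real Wikimedia check.
WARN_SECONDS = 3 * 24 * 3600   # warn when fewer than 3 days of validity remain
CRIT_SECONDS = 1 * 24 * 3600   # critical when fewer than 1 day remains


def staple_seconds_left(next_update: datetime, now: Optional[datetime] = None) -> int:
    """Seconds until the staple's nextUpdate; negative if the staple is already stale."""
    if now is None:
        now = datetime.now(timezone.utc)
    return int((next_update - now).total_seconds())


def staple_status(seconds_left: int, warn: int = WARN_SECONDS, crit: int = CRIT_SECONDS) -> str:
    """Map remaining staple validity to a Nagios-style state string."""
    if seconds_left < crit:
        return "CRITICAL"
    if seconds_left < warn:
        return "WARNING"
    return "OK"
```

With these assumed thresholds, a staple whose nextUpdate is six hours away has 21600 seconds left and reports CRITICAL, so refreshing the staple alone would clear the alert even while certificate renewal keeps failing.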
[15:40:26] Traffic, Operations: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (Vgutierrez) hmm actually I'm wrong, the prevalidation works as expected for wiki-pedia.org, it's the actual DNS challenge validation that fails on acme-chief side...
[16:37:50] Traffic, Operations: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (Vgutierrez) I've run a manual OCSP refresh for non-canonical-redirects-3 running: ` sudo http_proxy=http://webproxy.eqiad.wmnet:8080 python3 ~vgutierrez/ocsp.py no...
[18:16:49] Traffic, Operations: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (Vgutierrez) ulsfo is the only DC where we are seeing this issue, and at the same time it's the DC where we are testing the cache nodes buster upgrades (T242093). To rule out that cp4032 (buster text node) could b...
[18:42:30] Traffic, Operations: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (Vgutierrez) after finishing the rolling restart, this is the current amount of fds on varnish-frontend per node: ` ===== NODE GROUP ===== (1) cp4028.ulsfo.wmnet ----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -...
[19:17:40] Traffic, Operations: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (Vgutierrez) in ~30 minutes cp4029 has gone from 1400 to ~8600 so it doesn't look like cp4032 is at fault here: `(1) cp4029.ulsfo.wmnet ----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' ----- 8570 ============...
[23:40:44] Traffic, Operations, Phabricator, serviceops, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (mmodell) questions: why are there 2 yaml files for apache traffic...
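The per-node fd counts above come from listing the entries under /proc/<pid>/fd for the varnish-frontend process (the exact pipeline that finds the PID is truncated in the log). A minimal sketch of the same measurement plus the growth-rate arithmetic, assuming a Linux /proc layout; `count_open_fds`, `fd_growth_per_minute`, and the `proc_root` parameter are hypothetical helpers introduced here for illustration.

```python
import os


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count open file descriptors of `pid` by listing <proc_root>/<pid>/fd.

    Same idea as `ls -1 /proc/<pid>/fd | wc -l`; inspecting another user's
    process (e.g. varnishd) usually requires root.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_growth_per_minute(before: int, after: int, minutes: float) -> float:
    """Average fd growth rate between two samples taken `minutes` apart."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (after - before) / minutes


if __name__ == "__main__":
    # cp4029 went from ~1400 to ~8600 fds in ~30 minutes, per the 19:17:40 message.
    print(fd_growth_per_minute(1400, 8600, 30))  # 240.0
```

Sampling the same node twice and comparing rates across nodes is one way to check, as in the 19:17:40 message, whether a leak tracks a specific host (such as the buster node cp4032) or whichever frontend is currently taking the affected traffic.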