[01:34:00] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle)
[01:34:00] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle)
[01:34:21] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle)
[02:00:11] sigh, what a mess..
[02:43:53] hmm, this is weird
[02:44:07] keyholder instances on acmechief servers are consuming a lot of memory
[02:44:31] 13.5% of the system memory on acmechief1001 and acmechief-test1001
[02:44:57] I don't see anything similar on other instances like cumin1001
[02:49:44] so the last time acme-chief reported anything before going AWOL was on November 25th at 09:00 UTC, notifying that the unified cert is staged till 2019-11-30 07:11:00 and refreshing the OCSP staples for unified as well
[02:50:21] no exceptions, no errors.. nothing weird
[02:51:03] acme-chief ran into some issues renewing the unified cert... apparently the LE servers were having an issue at that moment
[02:52:16] Nov 23 08:00:12 acmechief1001 acme-chief-backend[8837]: acme.messages.Error: urn:ietf:params:acme:error:serverInternal :: The server experienced an internal error :: Error creating new order
[02:56:19] finally the cert got issued and it's looking good
[03:01:57] regarding keyholder
[03:02:19] even on acmechief2001, where acme-chief is not running (active/passive), it's consuming a lot of memory
[03:02:24] actually even more than on acmechief1001
[03:02:33] 13.5% vs 14.9%
[03:05:54] upon restart, keyholder memory consumption goes back to "normal"
[03:06:15] I left it untouched on the acmechief-test instances for debugging purposes
[03:17:11] weird...
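The per-host comparison above (keyholder's share of system memory across acmechief boxes) can be sketched in Python by reading /proc directly, similar to the %MEM column of ps. This is a minimal illustration, not the command actually used in the log; the process-name matching against /proc/<pid>/cmdline is an assumption:

```python
# Minimal sketch: report the summed RSS of matching processes as a share of
# total system memory, roughly what ps's %MEM column shows. Linux-only.
import os


def meminfo_total_kib():
    # MemTotal appears near the top of /proc/meminfo, value in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found in /proc/meminfo")


def rss_kib(pid):
    # VmRSS line of /proc/<pid>/status, in kB; absent for kernel threads
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except FileNotFoundError:
        pass  # process exited while we were scanning
    return 0


def mem_share(name):
    """Summed memory share (percent) of processes whose cmdline contains name."""
    total = meminfo_total_kib()
    used = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmd = f.read().replace(b"\0", b" ").decode(errors="replace")
        except FileNotFoundError:
            continue  # raced with process exit
        if name in cmd:
            used += rss_kib(pid)
    return 100.0 * used / total


if __name__ == "__main__":
    print(f"ssh-agent-proxy: {mem_share('ssh-agent-proxy'):.1f}% of system memory")
```

Run on each host, this gives a number directly comparable to the 13.5% / 14.9% figures quoted above.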
a quick check against instances with the puppet class keyholder present: the acmechief ones are the only ones showing big memory consumption by the keyholder agent proxy (/usr/local/bin/ssh-agent-proxy)
[03:19:23] and of course acmechief boxes are the only ones using buster (and python 3.7.x)
[03:23:11] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-publish: Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315 (10Krinkle) >>! In T181315#4116607, @gerritbot wrote: > Change 425045 at [[https:...
[03:29:28] What does HTTP 502 mean in our context? I've not seen these before until recently, so I guess it's ATS.
[03:29:39] seeing a few affecting load.php
[03:29:40] https://logstash.wikimedia.org/goto/331782f9800a9085915426f5284abf35
[03:29:49] Krinkle: a 502 is ATS having issues reaching an origin server
[03:30:29] origin being the varnish frontend behind the ATS layer, or from the ATS backend to MW?
[03:30:52] Does that mean there was an Apache/PHP/MW issue or could it be something closer to ATS/LVS?
[03:31:06] Krinkle: actually, both
[03:31:12] both ATS layers can trigger a 502
[03:31:24] Also, do we have similar limits on ATS<->MW the way Varnish has, in that it will throw a 503 if it has more than X reqs concurrently?
[03:32:09] hmm, not that I'm aware of
[03:32:25] ok, one less thing to worry about I suppose :)
[03:32:37] but double check with e.ma
[03:32:40] How would I begin investigating why it is emitting 502?
[03:36:28] 10Traffic, 10Operations, 10Performance-Team (Radar): User traffic sometimes gets HTTP 502 from ATS - https://phabricator.wikimedia.org/T239382 (10Krinkle)
[03:36:29] Filed a task for now.
[03:36:31] I think submitting 5xx reports from ATS to logstash is still on the TODO list
[03:41:48] Krinkle: correct me if I'm wrong, but only GETs are expected against load.php, right?
[03:45:22] Yes
[03:45:29] and HEAD/OPTIONS
[03:51:23] ats-be is complaining from time to time about CONNECTION_CLOSED and CONNECTION_ERROR against 10.2.2.1 when trying to fetch load.php
[03:51:44] of course that doesn't automatically translate into a 502, because those requests can be retried
[04:08:16] but apparently it happens
[05:54:43] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10Nuria) I think the collection that is heavy on cookies and tracking should have been reviewed by our privacy engineer @JFis...
[07:12:13] 10Traffic, 10Operations, 10Prod-Kubernetes, 10Pybal, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10Joe) Also to clarify further: Pybal does **none** of the meaningful load-balancing. Load-balancing between pods is done...
[07:35:47] 10Traffic, 10Operations, 10Prod-Kubernetes, 10Pybal, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10Joe)
[08:06:17] 10Acme-chief, 10Traffic, 10Operations: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Vgutierrez)
[09:37:20] 10Acme-chief, 10Traffic, 10Operations: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Volans) I'm doing a quick debug attempt on `acmechief-test2001`
[09:56:35] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe)
[09:59:20] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe)
[10:29:28] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10ema) p:05Triage→03Normal
[10:39:14] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles)
[10:57:36] 10Traffic, 10Operations, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10ema) 05Open→03Resolved Reloading global Lua scripts now works. Closing.
[11:02:09] 10Acme-chief, 10Traffic, 10Operations, 10Patch-For-Review: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Volans) I was able to debug the issue using `tracemalloc`:
`
# at the top of the file
import tracemalloc
tracemalloc.start(5)

# in the SshAgentProx...
`
[11:26:20] 10Traffic, 10Operations: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Seb35) For information, due to [[https://bugzilla.mozilla.org/show_bug.cgi?id=1002724|this bug in Firefox]], when the user types the URL without the "https://" prefix F...
[11:30:11] 10Traffic, 10Operations: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) Hmm, with HSTS the browser shouldn't even try port 80.
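The tracemalloc technique Volans used on T239386 is quoted only in truncated form in the log. As a generic illustration of the approach (not the actual code applied to the SshAgentProxy class), the usual pattern is to take a snapshot before and after the suspect workload and diff the allocations:

```python
# Generic tracemalloc sketch (not the exact code from T239386): snapshot
# before and after a workload, then diff allocations grouped by source line.
import tracemalloc

tracemalloc.start(5)  # record up to 5 frames of traceback per allocation

before = tracemalloc.take_snapshot()

# Suspect workload; here just an illustrative ~1 MB allocation standing in
# for whatever the long-running proxy does per request.
leak = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)  # biggest growth first: file:line, size diff, count diff
```

In a long-running daemon like keyholder-proxy, the same diff taken periodically (e.g. from a signal handler or debug endpoint) points at the file:line where memory keeps growing.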
[11:56:43] 10netops, 10Operations, 10ops-esams, 10procurement: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10mark)
[12:06:32] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10DannyS712) Just came across this at https://en.wikipedia.org/wiki/Template_talk:; when I couldn't access the pa...
[13:36:36] 10netops, 10Operations: Librenms sessions are stored inside the deployment directory - https://phabricator.wikimedia.org/T239412 (10Volans) p:05Triage→03Normal
[13:38:42] 10netops, 10Operations, 10ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) The nl-ams-as14907 anchor is now fully online and has ID #6671.
[13:49:49] 10netops, 10Operations, 10ops-esams, 10Patch-For-Review: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon)
[13:50:02] 10netops, 10Operations, 10ops-esams, 10Patch-For-Review: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) 05Open→03Resolved a:03faidon All done!
[14:54:57] 10Traffic, 10Operations: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Seb35) Yes, indeed, I should clarify that my test was with a non-HSTS site, and it seems there is no issue with HSTS-preloaded sites according to [[https://bugzilla.mozill...
[15:55:02] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10ema) We could also think of writing a sort of HTTP router that returns a list of PyBal AP...
[16:01:15] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe) >>! In T239392#5700283, @ema wrote: > We could also think of writing a sort of HTTP r...
[16:30:07] 10Traffic, 10Operations, 10Performance-Team (Radar): User traffic sometimes gets HTTP 502 from ATS - https://phabricator.wikimedia.org/T239382 (10jbond) p:05Triage→03Normal
[16:31:01] 10Acme-chief, 10Traffic, 10Operations, 10Patch-For-Review: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10jbond) p:05Triage→03Normal
[23:10:29] 10Traffic, 10Operations: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (10jijiki)
[23:18:09] 10Traffic, 10Operations: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (10Volans) It might be another occurrence of T238305 (model matches)