[01:34:00] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle)
[01:34:00] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle)
[01:34:21] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle)
[02:00:11] sigh, what a mess..
[02:43:53] hmm, this is weird
[02:44:07] keyholder instances on acmechief servers are consuming a lot of memory
[02:44:31] 13.5% of the system memory on acmechief1001 and acmechief-test1001
[02:44:57] I don't see anything similar on other instances like cumin1001
[02:49:44] so the last time acme-chief reported anything before going AWOL was on November 25th at 09:00 UTC, notifying that the unified cert is staged till 2019-11-30 07:11:00 and refreshing the OCSP staples for unified as well
[02:50:21] no exceptions, no errors.. nothing weird
[02:51:03] acme-chief ran into some issues renewing the unified cert... apparently the LE servers were having an issue at that moment
[02:52:16] Nov 23 08:00:12 acmechief1001 acme-chief-backend[8837]: acme.messages.Error: urn:ietf:params:acme:error:serverInternal :: The server experienced an internal error :: Error creating new order
[02:56:19] finally the cert got issued and it's looking good
[03:01:57] regarding keyholder
[03:02:19] even on acmechief2001, where acme-chief is not running (active/passive), it's consuming a lot of memory
[03:02:24] actually even more than on acmechief1001
[03:02:33] 13.5% vs 14.9%
[03:05:54] upon restart, keyholder memory consumption goes back to "normal"
[03:06:15] I left it untouched on the acmechief-test instances for debugging purposes
[03:17:11] weird...
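The per-host comparison above (keyholder's share of system memory across acmechief boxes) can be sketched in Python by reading /proc directly, similar to the %MEM column of ps. This is a minimal illustration, not the command actually used in the log; the process-name matching against /proc/<pid>/cmdline is an assumption:

```python
# Minimal sketch: report the summed RSS of matching processes as a share of
# total system memory, roughly what ps's %MEM column shows. Linux-only.
import os


def meminfo_total_kib():
    # MemTotal appears near the top of /proc/meminfo, value in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found in /proc/meminfo")


def rss_kib(pid):
    # VmRSS line of /proc/<pid>/status, in kB; absent for kernel threads
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except FileNotFoundError:
        pass  # process exited while we were scanning
    return 0


def mem_share(name):
    """Summed memory share (percent) of processes whose cmdline contains name."""
    total = meminfo_total_kib()
    used = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmd = f.read().replace(b"\0", b" ").decode(errors="replace")
        except FileNotFoundError:
            continue  # raced with process exit
        if name in cmd:
            used += rss_kib(pid)
    return 100.0 * used / total


if __name__ == "__main__":
    print(f"ssh-agent-proxy: {mem_share('ssh-agent-proxy'):.1f}% of system memory")
```

Run on each host, this gives a number directly comparable to the 13.5% / 14.9% figures quoted above.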
a quick check against instances with the puppet class keyholder present: the acmechief ones are the only ones showing big memory consumption by the keyholder agent proxy (/usr/local/bin/ssh-agent-proxy)
[03:19:23] and of course acmechief boxes are the only ones using buster (and python 3.7.x)
[03:23:11] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-publish: Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315 (10Krinkle) >>! In T181315#4116607, @gerritbot wrote: > Change 425045 at [[https:...
[03:29:28] What does HTTP 502 mean in our context? I've not seen these before until recently, so I guess it's ATS.
[03:29:39] seeing a few affecting load.php
[03:29:40] https://logstash.wikimedia.org/goto/331782f9800a9085915426f5284abf35
[03:29:49] Krinkle: a 502 is ATS having issues reaching an origin server
[03:30:29] origin being the varnish frontend behind the ATS layer, or from the ATS backend to MW?
[03:30:52] Does that mean there was an Apache/PHP/MW issue or could it be something closer to ATS/LVS?
[03:31:06] Krinkle: actually, both
[03:31:12] both ATS layers can trigger a 502
[03:31:24] Also, do we have similar limits on ATS<->MW the way Varnish has, in that it will throw a 503 if it has more than X reqs concurrently?
[03:32:09] hmm, not that I'm aware of
[03:32:25] ok, one less thing to worry about I suppose :)
[03:32:37] but double check with e.ma
[03:32:40] How would I begin investigating why it is emitting 502?
[03:36:28] 10Traffic, 10Operations, 10Performance-Team (Radar): User traffic sometimes gets HTTP 502 from ATS - https://phabricator.wikimedia.org/T239382 (10Krinkle)
[03:36:29] Filed a task for now.
[03:36:31] I think submitting 5xx reports from ATS to logstash is still on the TODO list
[03:41:48] Krinkle: correct me if I'm wrong, but only GETs are expected against load.php, right?
[03:45:22] Yes
[03:45:29] and HEAD/OPTIONS
[03:51:23] ats-be is complaining from time to time about CONNECTION_CLOSED and CONNECTION_ERROR against 10.2.2.1 when trying to fetch load.php
[03:51:44] of course that doesn't automatically translate into a 502, because those requests can be retried
[04:08:16] but apparently it happens
[05:54:43] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10Nuria) I think the collection that is heavy on cookies and tracking should have been reviewed by our privacy engineer @JFis...
[07:12:13] 10Traffic, 10Operations, 10Prod-Kubernetes, 10Pybal, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10Joe) Also to clarify further: Pybal does **none** of the meaningful load-balancing. Load-balancing between pods is done...
[07:35:47] 10Traffic, 10Operations, 10Prod-Kubernetes, 10Pybal, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10Joe)
[08:06:17] 10Acme-chief, 10Traffic, 10Operations: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Vgutierrez)
[09:37:20] 10Acme-chief, 10Traffic, 10Operations: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Volans) I'm doing a quick debug attempt on `acmechief-test2001`
[09:56:35] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe)
[09:59:20] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe)
[10:29:28] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10ema) p:05Triage→03Normal
[10:39:14] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles)
[10:57:36] 10Traffic, 10Operations, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10ema) 05Open→03Resolved Reloading global Lua scripts now works. Closing.
[11:02:09] 10Acme-chief, 10Traffic, 10Operations, 10Patch-For-Review: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Volans) I was able to debug the issue using `tracemalloc`:
`
# at the top of the file
import tracemalloc
tracemalloc.start(5)

# in the SshAgentProx...
`
[11:26:20] 10Traffic, 10Operations: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Seb35) For information, due to [[https://bugzilla.mozilla.org/show_bug.cgi?id=1002724|this bug in Firefox]], when the user types the URL without the "https://" prefix F...
[11:30:11] 10Traffic, 10Operations: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) Hmm, with HSTS the browser shouldn't even try port 80.
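The tracemalloc technique Volans used on T239386 is quoted only in truncated form in the log. As a generic illustration of the approach (not the actual code applied to the SshAgentProxy class), the usual pattern is to take a snapshot before and after the suspect workload and diff the allocations:

```python
# Generic tracemalloc sketch (not the exact code from T239386): snapshot
# before and after a workload, then diff allocations grouped by source line.
import tracemalloc

tracemalloc.start(5)  # record up to 5 frames of traceback per allocation

before = tracemalloc.take_snapshot()

# Suspect workload; here just an illustrative ~1 MB allocation standing in
# for whatever the long-running proxy does per request.
leak = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)  # biggest growth first: file:line, size diff, count diff
```

In a long-running daemon like keyholder-proxy, the same diff taken periodically (e.g. from a signal handler or debug endpoint) points at the file:line where memory keeps growing.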
[11:56:43] 10netops, 10Operations, 10ops-esams, 10procurement: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10mark)
[12:06:32] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10DannyS712) Just came across this at https://en.wikipedia.org/wiki/Template_talk:; when I couldn't access the pa...
[13:36:36] 10netops, 10Operations: Librenms sessions are stored inside the deployment directory - https://phabricator.wikimedia.org/T239412 (10Volans) p:05Triage→03Normal
[13:38:42] 10netops, 10Operations, 10ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) The nl-ams-as14907 anchor is now fully online and has ID #6671.
[13:49:49] 10netops, 10Operations, 10ops-esams, 10Patch-For-Review: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon)
[13:50:02] 10netops, 10Operations, 10ops-esams, 10Patch-For-Review: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) 05Open→03Resolved a:03faidon All done!
[14:54:57] 10Traffic, 10Operations: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Seb35) Yes, indeed, I should clarify that my test was with a non-HSTS site, and it seems there is no issue with HSTS-preloaded sites according to [[https://bugzilla.mozill...
[15:55:02] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10ema) We could also think of writing a sort of HTTP router that returns a list of PyBal AP...
[16:01:15] 10Traffic, 10Operations, 10Pybal, 10SRE-tools, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe) >>! In T239392#5700283, @ema wrote: > We could also think of writing a sort of HTTP r...
[16:30:07] 10Traffic, 10Operations, 10Performance-Team (Radar): User traffic sometimes gets HTTP 502 from ATS - https://phabricator.wikimedia.org/T239382 (10jbond) p:05Triage→03Normal
[16:31:01] 10Acme-chief, 10Traffic, 10Operations, 10Patch-For-Review: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10jbond) p:05Triage→03Normal
[23:10:29] 10Traffic, 10Operations: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (10jijiki)
[23:18:09] 10Traffic, 10Operations: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (10Volans) It might be another occurrence of T238305 (model matches)