[12:58:49] Traffic, Operations: Multiple 503 Erros - https://phabricator.wikimedia.org/T175473#3594540 (elukey)
[13:36:40] Traffic, Operations: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3594564 (Aklapper) p:Triage>High
[13:49:51] Traffic, Operations: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3594587 (elukey) p:High>Normal The 503s seems to be down to zero now, lowering down the priority to Normal since we are aware of this issue (https://phabricator.wikimedia.org/T174932) and everything looks good at...
[13:50:51] hello people, I've restarted cp1053's varnish backend for --^ and then cp1073's varnish backend too (upload) since it was alarming as well..
[13:55:58] moreover some hours ago in #operations I saw several alarms like
[13:56:01] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100%
[13:56:42] for cp300[5-8,10], self recovered in the same minute of the alarm
[15:03:50] the mailbox lag alerts are unfortunately pretty routine at this point. I usually scan for them once or twice a day in icinga and restart any that are crit even if the 503s haven't shown up yet.
[15:04:03] but lately on the weekends I'm not around much, so :/
[15:04:27] the cp3 ping flag, probably a switch issue in esams, hopefully rare
[15:04:33] s/flag/flap/
[15:07:00] currently our best-odds chances of fixing mailbox lag in the short term is https://gerrit.wikimedia.org/r/#/c/376751/ (and/or more tweaking of keep times and TTLs) and in the medium term (~1Q out) it's Varnish5 upgrades.
[15:08:18] well, I should add that there's another short-to-medium possibility, which is that we can deploy the NUMA isolation work on some of the previous-gen caches, which may make the mailbox locks more efficient and solve it from a different angle.
[20:26:31] Traffic, Operations: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3595030 (Samtar) Open>Resolved a:elukey Overarching issue tracked in T174932, still minimal 503s in logstash and no further reports. Closing resolved (503s "resolved" by restart of varnish backends)