[05:27:31] Traffic, Operations, Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (Vgutierrez) Since yesterday, memory increased up to 10Gb in cp1085, current snapshot: ` Allocated | In-Use | Type Size | Free List Name --------------------|------...
[08:50:06] Traffic, MediaWiki-Cache, Operations, Page Content Service, and 4 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (Urbanecm) I think this should have its priority increased to UBN. I'm receiving many reports from community that pur...
[09:27:21] netops, Operations: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (faidon)
[10:26:40] is this normal/expected? https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?panelId=2&fullscreen&orgId=1
[10:48:09] jynus: yeah, purged jobs now have 0% availability as I paused testing the system during Easter holidays. That will be fixed in the afternoon: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/588660/
[10:50:54] ok
[10:51:08] no worry, just in case it was to be handled
[10:51:39] netops, Analytics, Analytics-Kanban, Operations: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (elukey)
[13:46:34] could we please get some team updates in the weekly SRE meeting etherpad soon?
[13:46:36] thanks :)
[13:55:40] <_joe_> hi there! do we have anywhere where we wrote down our retry policy to the backend from ats-be?
[14:04:00] _joe_: ats-be does not retry requests. varnish-fe does retry 503s once
[14:04:19] <_joe_> ema: for POSTs too?
[14:04:27] <_joe_> ema: does it retry 504s?
[14:04:35] if (beresp.status == 503 && bereq.retries == 0 && bereq.method ~ "^(GET|HEAD|OPTIONS|PUT|DELETE)$")
[14:04:38] <_joe_> I would expect any code > 501
[14:05:01] <_joe_> but not retrying on backend timeouts makes sense
[14:05:08] <_joe_> why not 502?
[14:05:56] yeah I think we could retry 502s too potentially
[14:09:15] _joe_: did some specific bug hunting trigger the questions or were you just wondering about the logic in general?
[14:09:57] <_joe_> ema: I wanted to repro the same logic in envoy when used as a service proxy
[14:10:10] <_joe_> so it's just the logic I was reasoning about :P
[14:11:49] the main takeaway from the previous process (there was an RFC process and a bunch of debate)
[14:12:08] was that retries on the inside are always going to be a potential source of error amplification
[14:12:35] that's why we moved the single retry to the outermost point. this provides a blanket solution to paper over any truly-transient blip of an error at any deeper level.
[14:12:56] (without causing exponential growth in a tree of retries)
[14:13:47] there are other ways we can do things (like the discussion last week about tracking some of the fanout through headers, and timeout-remaining, etc)
[14:14:01] but I think just adding more internal retries without those global mechanisms is a pathway to pain
[15:03:56] <_joe_> this is completely a separate discussion, it's for services that need to call another one and depend on it to return a good response, else they throw a 500
[15:04:03] <_joe_> which doesn't get retried on the frontend
[15:04:19] <_joe_> also, we are going to add circuit breaking
[15:04:51] <_joe_> I just didn't want to deviate too much from the retry logic on the frontend.
[15:05:29] <_joe_> so say restbase is calling parsoid, on errors 502 and 503, we'll retry, with a backoff once we reach a certain concurrency
[15:07:21] Traffic, Analytics, Operations, Research: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (Nuria) closing as this is happening as part of our monthly sync up.
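Editorial sketch of the retry discussion above (14:04:35 to 14:12:56). The first function restates the quoted VCL condition in Python; the second illustrates why retries at every internal layer can multiply load. All names here (`RETRYABLE_METHODS`, `should_retry`, `worst_case_attempts`) are hypothetical and not taken from the Wikimedia configuration, and the amplification bound assumes every layer retries independently and every attempt fails.

```python
# Hypothetical names; this only mirrors the condition quoted at 14:04:35.
RETRYABLE_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "PUT", "DELETE"})


def should_retry(status: int, retries: int, method: str) -> bool:
    """Retry once, only on a 503, only for methods listed in the quoted regex."""
    return status == 503 and retries == 0 and method in RETRYABLE_METHODS


def worst_case_attempts(retries_per_layer: int, layers: int) -> int:
    """Upper bound on requests hitting the innermost service.

    Assumes each of `layers` nested proxies retries a failed call
    `retries_per_layer` times and that every attempt fails.
    """
    return (1 + retries_per_layer) ** layers


if __name__ == "__main__":
    print(should_retry(503, 0, "GET"))    # 503 on first attempt, listed method
    print(should_retry(503, 0, "POST"))   # POST is not in the method list
    print(worst_case_attempts(1, 3))      # one retry at each of three layers
```

With a single retry only at the outermost layer, the bound stays at two attempts regardless of depth, which is the design choice described at 14:12:35.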
[15:08:53] Traffic, Analytics, Operations, Research: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (Nuria) Open→Resolved
[15:40:59] Traffic, Operations, Pybal: pybal-related issue on host start can break service IPs... - https://phabricator.wikimedia.org/T113597 (fgiunchedi) p:High→Medium We've been routinely rebooting lvs hosts multiple times and IIRC this issue hasn't come up again (?) Lowering priority
[15:49:58] Traffic, Operations, Pybal: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388 (fgiunchedi) Open→Resolved a:fgiunchedi AFAICT this issue hasn't reoccurred, boldly resolving
[16:09:49] <_joe_> bblack: do we have a general task for reducing the purge rates?
[16:13:56] _joe_: there's https://phabricator.wikimedia.org/T249325
[16:14:14] I should edit the description
[16:14:31] it's just copied and pasted from bblack's comment when we first realized the probable extent of the issue, and before we had proper monitoring
[16:14:34] <_joe_> cdanis: ok, no, I wanted a task specifically to that end. This is about one issue we're having
[16:14:45] ah! no, I do not think there is that
[16:14:47] <_joe_> I will create a task and mention it
[16:14:54] +1 please cc me
[16:15:43] <_joe_> I'm trying to translate the recommendations we made in that doc into tasks
[16:16:55] _joe_: please cc aaron on it as well -- the idea of disabling rebound purges turned out to be misguided/not possible
[16:17:17] _joe_: see commentary on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/586390
[16:18:02] <_joe_> yeah :/
[16:23:55] netops, Operations: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (Volans) The structure looks good to me, we could optionally skip the duplicate `import_policy` and `export_policy` if we don't have cases of override, but it's fine.
[16:49:20] _joe_: did you file a placeholder task?
[16:49:36] <_joe_> cdanis: nope, I was taking a break :)
[17:07:23] _joe_: I created T250205 so I could reference it from another task, assigned to you, please edit that one when you do :)
[17:07:24] T250205: placeholder: reduce rate of purges emitted by Mediawiki - https://phabricator.wikimedia.org/T250205
[17:07:55] Traffic, MediaWiki-Cache, Operations, Page Content Service, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (CDanis)
[17:08:24] <_joe_> cdanis: ok :)
[17:24:31] Traffic, Operations, SRE-tools, Continuous-Integration-Config, and 4 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (crusnov) Open→Resolved This has been complete for some time.
[17:24:36] Traffic, Operations, SRE-tools, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (crusnov)
[17:26:59] Traffic, Core Platform Team, Operations, Performance-Team, serviceops: Reduce rate of purges emitted by Mediawiki - https://phabricator.wikimedia.org/T250205 (Joe)
[17:34:19] Varnish, Commons: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (Pigsonthewing)
[17:35:02] Varnish, Commons: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (Pigsonthewing)
[19:59:28] Traffic, Core Platform Team, Operations, Performance-Team, serviceops: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Krinkle)
[20:00:23] Traffic, Core Platform Team, Operations, Performance-Team, serviceops: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Krinkle)
[20:01:12] Traffic, Core Platform Team, Operations, serviceops, Performance-Team (Radar): Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Gilles)
[20:12:34] Traffic, Core Platform Team, Operations, serviceops, Performance-Team (Radar): Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (daniel) p:Triage→Medium @Joe You are assigned to this ticket, is this something you are going to work on in the code?...
[20:48:21] Traffic, Varnish, Commons, Operations, Wikimedia-General-or-Unknown: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (Reedy)
[21:17:02] Traffic, Commons, Operations, Wikimedia-General-or-Unknown: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (Aklapper)
[21:39:36] Traffic, MediaWiki-Cache, Operations, Page Content Service, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (QEDK) >>! In T249325#6054065, @Urbanecm wrote: > I think this should have its priority increased to UBN. I'm receivi...