[01:07:41] 10Traffic, 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2829562 (10fgiunchedi) Looks like it can be any number really, ATM I'm seeing three different UUIDs ``` varnish_backen...
[09:22:31] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2829981 (10doctaxon) @Paladox @elukey : no, it's not fixed. These gateway errors happen just as they did before the API server restart. The URLs: * https://de.wikipedia.org/w/api.php?action=query&titles=Geschichte%20des%20Verkehrs&...
[09:33:49] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830010 (10doctaxon) from what I can tell, the errors are becoming more and more frequent every day
[10:08:37] upload 404s went back to reasonable values apparently
[10:08:41] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=3&fullscreen&var-site=All&var-cache_type=upload&var-status_type=4&from=1479684350119&to=1480414057306
[10:13:03] ema: I am still seeing varnish-frontend(s) dying from the workspace assert issue, and I am wondering if the 502s could be triggered by nginx calling varnish when the assert fails (or right after)
[10:14:08] so other requests experience issues because one of them triggered the workspace assert failure, knocking down varnish
[10:21:38] (in the meantime, nginx's unified.error.log might contain some good info)
[10:22:36] elukey: yeah, I skimmed through the nginx logs yesterday and couldn't really find anything useful
[10:23:18] but we should probably keep on looking :)
[10:25:28] ema: take a look at cp1065 for example.. journalctl -u varnishlog-frontend shows that varnish died at ~2:40 and 9:08 UTC today
[10:26:09] /var/log/nginx/unified.error.log and /var/log/nginx/unified.error.log.1 show errors in those time frames
[10:26:47] mostly connection refused
[10:27:49] and also "upstream timed out (110: Connection timed out) while reading response header from upstream"
[10:28:05] interesting
[10:29:41] we should probably try doubling workspace_backend and see if that helps
[10:33:44] I still can't figure out why they put an assert like that in there
[10:34:25] why not just abort and return an error code like 503 or similar?
[10:36:36] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830171 (10doctaxon) I'll try monitoring with an API query traffic loop and will report here
[10:38:29] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830172 (10ema) @doctaxon: thanks. Please also include request and response headers. I haven't managed to reproduce the issue yet.
[11:48:28] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830390 (10mark)
[12:24:46] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830424 (10Joe) >>! In T143132#2830385, @Joe wrote: > Please note that exposing the service via restbase doesn't mean it's a good idea to call it via...
[13:29:31] 10Traffic, 06Operations, 13Patch-For-Review: Varnishkafka seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2830578 (10ema) This is the list of hosts affected by the issue in the last 3 days, sorted by number of crashes.
    28 cp1065.eqiad.wmnet
    21 cp1068.eqiad.wmnet
    18 cp3032...
[13:38:49] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830584 (10Joe) For comparison, I just confirmed that when using OCG, MediaWiki issues `Cache-control: no-cache`; that's because OCG is caching conten...
[14:47:58] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830907 (10GWicke) @joe: The traffic we are talking about here is very low. OCG currently sees about 2 req/s.
[14:54:54] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830931 (10ema)
[15:10:15] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830968 (10ema) 05duplicate>03Open
[15:32:43] ema: did we make any other parameter changes last week?
[15:33:35] (re the various vsl/vsm/workspace sizing)
[15:33:47] nope
[15:40:05] how bad is the crash rate?
[15:40:19] it seems weird that a rare varnish crash would be the cause of the 502s, unless it's not so rare
[15:42:11] I'm slowly catching up on the tickets now, which already answer my questions
[15:43:34] ema: so in repros like: T151563#2830921
[15:43:53] are you actually able to see a crash timestamp somewhere + a 502 timestamp line up pretty closely?
[15:44:31] varnish has historically had issues with this whole workspace thing, we've hit it before. supposedly v4 was re-designed to fix the old issues, but I guess it left us with new ones.
[15:45:08] (previous behavior was worse: it wouldn't even fail an assert, it would corrupt memory and then whatever happens happens (sometimes a crash))
[15:46:24] bblack: yes, varnishd crashed and I got the 502 back
[15:48:34] elukey: so that would explain the assert. The alternative is memory corruption. ^
[15:49:31] that's the reason we have that other parameter in there from long ago: -p http_req_size=24576
[15:49:59] it was to help eliminate one possible source of exhausting workspace under what should be "normal" conditions, by limiting the maximum size of a request (headers, etc)
[15:50:22] the default at the time was 32k I think
[15:51:08] anyways, +1 on doubling the workspace
[15:52:13] alright, merging
[15:52:14] things that were factors in workspace consumption when I looked at this under v3: size of request url+headers, size of response headers (I think?), plus the size of all temporary memory used by VCL (vmod workspace allocations, also temporary internal headers, etc)
[15:52:34] maybe the crashing URLs have huge response headers?
[15:52:54] we could just be closer to the limit in general from more consumption by our VCL/vmods somehow
[15:55:13] qq - do we have nginx metrics and related alarms for these kinds of situations?
[15:55:27] I'm trying one of your crashers directly against api.svc to see what it looks like
[15:55:31] elukey: nope :(
[15:55:40] it has been a sore point before
[15:57:26] so that contributors URL that induces a crash, when I run it against api.svc, is taking a long time so far, I haven't seen a response yet
[15:57:30] over a minute though for sure
[15:59:01] it waits 2 minutes and then gives a 503 (from MW)
[15:59:12] ok
[15:59:20] a 503 with 50KB of content
[15:59:40] perhaps, unlike with 2xx content, 503 contents end up in workspace somehow, at least under v4?
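To make the crash-vs-502 correlation discussed above concrete (journalctl on cp1065 showing the frontend dying, nginx's unified.error.log showing connection refused / upstream timeouts in the same windows), here is a rough shell sketch of that kind of check on a single cache host. The unit name and log paths come from the discussion above; the grep patterns and time bucketing are illustrative, not an exact match for the production log format.

```sh
# Sketch: line up varnish frontend deaths with nginx upstream errors on one cache host.

# 1) When did the frontend varnishd go away? The varnishlog-frontend unit's journal
#    (as used on cp1065 above) records when the log daemon lost its varnishd.
sudo journalctl -u varnishlog-frontend --since today

# 2) Do nginx upstream errors cluster around those times? Count errors per minute
#    across the current and previous unified.error.log.
sudo grep -hE 'Connection refused|upstream timed out' \
    /var/log/nginx/unified.error.log /var/log/nginx/unified.error.log.1 \
  | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c | sort -rn | head -20
```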
[16:00:07] you could imagine some explanation to do with error handling and synthetics and blah blah
[16:00:20] so you got a 503 but 50k of partial content?
[16:00:36] no, MW's normal error page output is ~50KB
[16:00:49] it's fancy and has built-in translations for many languages, etc
[16:01:46] probably worth a separate ticket if there isn't one, about why in MediaWiki-land that URL hangs for exactly 2 minutes and then 503s out
[16:02:11] I currently have a GET request hanging for > 7 minutes (going through cp1008)
[16:03:04] also worth mentioning that when I reproduced the 502 I got no varnishlog output whatsoever
[16:12:47] > 17 minutes
[16:13:30] to varnish or mw?
[16:13:37] varnish
[16:13:48] did we leave extrachance defaulting to zero?
[16:14:00] that's a great question
[16:14:25] because the other one I tested hung for 2m, which sounds like it could induce those internal retries, etc
[16:14:31] 1
[16:15:27] even with extrachance=1 and extrachance-- at the bottom of the loop, a 2 minute hang could become 32 minutes from the esams-client POV
[16:15:40] I think
[16:15:43] that's cp1008 though
[16:16:09] oh, cp1008 would theoretically be... 16 mins?
[16:16:11] hmmm
[16:16:23] 21m and counting
[16:16:43] if we assume the 3m timeout somehow applies in at least some cases, maybe longer, I dunno
[16:17:01] 24 I guess tops, if that whole set of assumptions works
[16:17:18] same query?
[16:17:22] anyways, no reason to CTRL-C such a hard-working curl
[16:17:54] second-to-last mentioned in T151563#2831109
[16:19:28] bblack: the whole set of assumptions does seem to work, I got a 504 after 00:24:01
[16:19:47] (Gateway Time-out)
[16:20:05] score one for random mental guesswork!
[16:20:21] :)
[16:20:34] getting rid of extrachance=1 should reduce the total time to failure pretty dramatically
[16:20:50] more like either 4 or 6 minutes total (if you change it on both the cp1008 be + fe)
[16:22:19] oh, meanwhile, should I varnishadm-salt the workspace_backend change? That's a runtime parameter, no need to restart
[16:24:02] ema: yeah, good idea
[16:24:20] even though it's a runtime param, threads may need recycling before they pick it up
[16:24:28] the easiest route to that would probably be a VCL reload
[16:24:39] (a bit after the runtime param change)
[16:27:21] ok, I'll wait a bit and do that
[16:30:14] reload done
[16:38:07] so workspace is set to 128k everywhere, right?
[16:39:15] should be
[16:42:55] yes
[16:43:19] ema: is extrachance a runtime param as well?
[16:44:13] seems like it
[16:44:16] bblack: nope, I haven't changed that, it's still set to 1
[16:44:23] oh
[16:44:25] but it's runtime-settable right?
[16:44:28] yeah, it is a runtime param yes
[16:45:00] we should switch it to zero in systemd + runtime I think
[16:45:12] * ema agrees
[16:45:31] (and prepares the CR)
[16:52:24] bblack: https://gerrit.wikimedia.org/r/#/c/324229/
[17:04:36] ema, bblack - can we also chat briefly about https://phabricator.wikimedia.org/T151643 whenever you have time?
[17:06:23] I think the PURGE thing is really just a side note about efficiency, not an answer to the problem, right?
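As a concrete, hypothetical version of the per-host runtime changes applied above (workspace_backend doubled to 128k, extrachance dropped to zero, then a VCL reload so existing threads pick things up), a sketch could look like the following. The `-n frontend` instance name, the VCL name and path, and the idea of fanning this out via salt/cumin are assumptions; extrachance is only a settable parameter here because the locally built varnishd exposes it that way, as stated in the discussion above.

```sh
# Sketch of the per-host runtime changes discussed above (not the exact commands used).

# Double the backend workspace on both varnishd instances:
sudo varnishadm param.set workspace_backend 128k
sudo varnishadm -n frontend param.set workspace_backend 128k  # assumes the frontend runs as instance "frontend"

# Drop the extra internal backend-fetch retry; with extrachance=1 a single 2-minute
# backend hang can multiply into tens of minutes of client-visible wait, as observed above.
sudo varnishadm param.set extrachance 0
sudo varnishadm -n frontend param.set extrachance 0

# Per the discussion above, worker threads may need recycling before they pick up the
# new workspace size, so nudge them with a VCL reload a bit later (name/path illustrative):
NEWVCL="ws128k_$(date +%s)"
sudo varnishadm -n frontend vcl.load "$NEWVCL" /etc/varnish/frontend.vcl
sudo varnishadm -n frontend vcl.use "$NEWVCL"
```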
[17:06:52] I don't think we set X-Analytics on PURGE anyways, so it should get skipped, you'd think
[17:07:20] yeah, filtering out PURGEs helps just a little bit
[17:09:43] oh, I'm wrong
[17:09:53] we end up with "X-Analytics: nocookies=1"
[17:10:09] and wmf-last-access set too, heh
[17:10:21] we should probably skip over a bunch of those output/mangling things on PURGE
[17:14:17] there is also the question of whether VSL space might need to be bumped, or if we are hitting a limit of the python scripts, or something else that I can't think of
[17:15:06] or we ditch varnishreqstats in favor of the prometheus exporter
[17:15:20] is that something doable or am I saying something super wrong?
[17:16:31] I don't think you can get that stuff out of varnishstat
[17:16:37] eg: number of GET requests
[17:18:07] ah ok, I thought we had the same metrics out of the two
[17:18:16] nevermind
[17:18:32] nice try :)
[17:19:48] oh sorry, I was mixing up reqstats and webrequest in my head above
[17:20:06] but still, we probably should exclude purge responses from the complex VCL output-mangling stuff
[17:20:21] at least analytics_deliver
[18:53:14] bblack: there's a discussion on varnish-dev about geolocation vmods
[18:53:34] bblack: since you basically rewrote one, and it currently is better than what's out there (AIUI), you might be interested :)
[19:05:42] I might be, if I had time for it :)
[19:06:50] the obvious things to do right in a geoip vmod are: make it efficient for high request volumes, make sure it can handle an arbitrary ip or ipstring (not just client.ip), and expose all of the available data via an efficient API to VCL.
[19:06:55] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2832114 (10GWicke) The PR is now merged, and I also checked with @bblack about object sizes & Varnish cache times. With expected volume & sizes (< 100...
[19:07:06] "make it efficient" covers a lot of ground I know :P
[19:07:31] on an unrelated note, do we have something that is like pinkunicorn but behind LVS?
[19:07:46] I'm troubleshooting interactively with a user who's reporting IPv6 PMTU issues to me
[19:08:02] pinkunicorn isn't reproducible, but the lack of an LVS layer there makes the test case quite different from prod
[19:08:26] hm, maybe a service with only a few backends where I can tcpdump across all of them
[19:08:42] right
[19:08:55] one can envision a tcpdump-across-multiple-machines, can't be that hard
[19:08:55] well really even pinkunicorn *should* be behind LVS, we just never bothered doing it
[19:09:02] it doesn't hurt TLS testing, and helps with other testing
[19:09:10] sure, looking for something kinda now though :P
[19:09:10] and private is better for sec anyways
[19:09:24] what would you suggest? maps? misc?
[19:09:31] maps would be simpler
[19:09:51] and only 4 caches to look at manually (find whichever one his IP pops up on)
[19:09:52] maps1001-1004?
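For the PURGE / X-Analytics point above, one quick way to see whether PURGE responses are actually going through the analytics output mangling would be a varnishncsa one-liner along these lines. The instance name is an assumption, and the chosen fields (status, X-Analytics response header, Set-Cookie) simply mirror what the discussion above says ends up on PURGEs.

```sh
# Sketch: watch what the frontend emits on PURGE responses.
# VSL query syntax per varnishncsa's -q option in Varnish 4.
sudo varnishncsa -n frontend -q 'ReqMethod eq "PURGE"' \
    -F '%m %s x-analytics=%{X-Analytics}o set-cookie=%{Set-Cookie}o' | head -20
```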
[19:10:09] no, the caches
[19:10:09] well I guess depends where he's coming from
[19:10:17] oh duh
[19:10:19] right, I figured you knew that part
[19:10:21] sorry, dunno what I was thinking
[19:10:43] whichever DC he's hitting, you have 4x caches there to look at with maps, and the reqs are simplistic
[19:10:48] nod
[19:10:49] thanks :)
[20:27:56] 10netops, 06Operations, 10fundraising-tech-ops: Cleanup layer2 firewall config from pfw-eqiad - https://phabricator.wikimedia.org/T111463#2832558 (10Jgreen) 05Open>03declined We might as well close this task since we plan to replace the firewalls very soon.
[22:41:35] 10Traffic, 06Operations, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2373208 (10fgiunchedi) I can confirm all my requests to upload today were cookie free, anything left to do?
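Picking up the "tcpdump across the 4 maps caches in whichever DC he's hitting" idea from the PMTU discussion above, a throwaway sketch could look like this. The host names and the client address are placeholders, and the interface name and capture filter (TCP/443 plus ICMPv6, to catch any Packet Too Big messages) are assumptions rather than anything taken from the log.

```sh
# Sketch: capture the reporting user's traffic on all maps caches in one DC at once.
CLIENT_IP6='2001:db8::1234'               # placeholder for the user's IPv6 address
for h in cp1001 cp1002 cp1003 cp1004; do  # placeholder host names, not the real maps cache list
  # eth0 and the 5-minute cap are assumptions; adjust per host.
  ssh "$h.eqiad.wmnet" "sudo timeout 300 tcpdump -ni eth0 -w /tmp/pmtu-$h.pcap 'host $CLIENT_IP6 and (tcp port 443 or icmp6)'" &
done
wait  # then pull the pcaps back and see which cache his requests land on
```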