[13:37:14] Traffic, DBA, Operations, Performance-Team: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224974 (jcrespo)
[13:42:23] Traffic, DBA, Operations, Performance-Team: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224978 (jcrespo) https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-cli...
[14:56:44] ema: ping, what's up with upload this morning?
[14:58:21] looks like since ~18h ago, we've been seeing increasing 5xx spike patterns, possible mailbox or similar again
[14:58:39] I'm looking at a couple other things but will circle back to this in a little while if you don't
[15:15:44] ema: I restarted 2002 just now. Next on my list (with some spacing, and in worst-lag-first order) is 2022, 2014, 2017
[15:34:25] Traffic, Citoid, ContentTranslation, ContentTranslation-CXserver, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#3225276 (Jdforrester-WMF)
[16:38:44] I fell behind my own timeline a bit, starting 2022
[16:47:41] and starting 2014 now (timing's a little tight, but better than leaving it going right now)
[17:05:17] restarting 2017
[17:05:51] also re: the codfw mailbox issues, I just merged (but didn't force-push) the puppet change to revert esams inter-cache routing
[17:05:57] so that's going to take the pressure off eventually
[17:06:22] (today was the planned date for that)
[17:29:22] restarting 2026 (only one left showing lag in icinga presently)
[18:28:18] hey yaaalllll, anybody around today?
[18:28:33] we have some data loss warnings from webrequest upload logs
[18:28:53] know of anything that might have caused that, off the top of your head?
[18:32:03] same issues as before, 5xx's from mailbox stuff
[18:32:20] but what are "data loss warnings" anyways?
[18:32:47] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5&from=now-7d&to=now
[18:33:16] is "data loss warning" the stuff we've seen before that's driven by interrupted serial#'s from varnishkafka or whatever?
[18:34:02] (because it might not be the 5xx themselves, it could be that you're losing windows of data due to something about our depooled restarts, too)
[18:37:39] it could be related bblack, but i *think* we account for that
[18:38:03] we keep some stats on # of requests vs. number of expected requests
[18:38:11] and if it is a varnishkafka restart, sequence_min is 0
[18:38:29] but, 5xx's shouldn't be webrequest loss, since we log those
[18:39:18] in a single hour today (hour 13), there was about 1.62% loss
[18:39:28] spread over all active upload caches
[18:39:33] evenly
[18:40:30] could be related to kafka itself, too
[18:40:36] could be, ya
[18:40:37] (e.g. dropouts of traffic from caches<->kafka?)
[18:40:48] usually varnishkafka logs about dropping messages though
[18:41:14] but, hour 13 today was one of the most-notable broad swipes of 5xx from the mailbox issue, too
[18:41:17] it's from all datacenters (except eqiad)
[18:41:19] hmmm
[18:41:19] so it seems like it's probably related
[18:41:26] interesting
[18:41:36] i don't know the mailbox issue, is it related to caches?
[18:41:38] the 5xxs?
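The loss accounting described above (per-host expected vs. actual request counts, with sequence_min == 0 treated as a varnishkafka restart rather than real loss) could look roughly like the sketch below. This is a minimal illustration, not the actual Analytics check; the record field names `hostname` and `sequence`, the function name `loss_by_host`, and the report layout are assumptions.

```python
# Minimal sketch of sequence-based loss detection for webrequest data.
# Assumption: varnishkafka stamps each record with a per-host, monotonically
# increasing sequence number, so for one host and one hour the expected
# record count is max(seq) - min(seq) + 1.
from collections import defaultdict


def loss_by_host(records):
    """records: iterable of dicts with 'hostname' and 'sequence' keys (illustrative schema)."""
    seqs = defaultdict(list)
    for r in records:
        seqs[r['hostname']].append(r['sequence'])

    report = {}
    for host, s in seqs.items():
        actual = len(s)
        seq_min, seq_max = min(s), max(s)
        expected = seq_max - seq_min + 1
        lost = max(expected - actual, 0)
        report[host] = {
            'expected': expected,
            'actual': actual,
            'lost': lost,
            'loss_pct': 100.0 * lost / expected if expected else 0.0,
            # sequence_min == 0 suggests a varnishkafka restart, not data loss
            'varnishkafka_restart': seq_min == 0,
        }
    return report
```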
[18:42:10] the mailbox issue is just something that's generating some 5xx spikes lately in general
[18:42:23] ya
[18:42:23] hm
[18:42:23] but these "spikes" are quite small statistically
[18:42:36] but the 5xxes are from 'mailbox', not due to some cache infrastructure issue, right?
[18:43:02] e.g. during 13:xx the 503s from the mailbox lag were ~0.03% of all requests
[18:43:08] (globally)
[18:43:31] the "mailbox" issue is a cache infrastructure issue - it's an internal varnish issue
[18:44:30] oh interesting
[18:45:50] basically, varnish has some internal issues dealing with managing its storage (keeping up with objects expiring or being evicted to make room for new ones)
[18:46:10] the architecture builds in some slack so that those things can fall behind a bit asynchronously without any ill effects
[18:46:57] but we've had an ongoing issue, of varying severity, for some time now where it really falls behind, to the point that it starts affecting varnish's ability to fetch new cache misses at all, resulting in a low rate of 503 errors on cache miss
[18:47:55] given your timing, it's probably related
[18:47:57] hm
[18:48:22] perhaps during the same timeframes we're having 503 errors due to the mailbox issue, it's also causing varnish to fail at some VSL/shmlog stuff that varnishkafka is reading from
[18:48:34] would those 503s possibly not be logged? or, are they delayed for a long time?
[18:48:49] delayed meaning time between initial request and response?
[18:48:50] they're definitely logged, or else we couldn't show them in the graphs :)
[18:48:56] right
[18:49:12] I don't know how to fit it all together
[18:49:28] the logs (for our graphs, and for varnishkafka data) come from the frontend instances
[18:49:48] the mailbox lag issue is really a backend-instance issue, but it shows up in frontend stats as a low rate of 503s when the frontend fetches from the backend and the backend fails
[18:50:28] (so I guess what I'm saying is the mailbox lag on the backend instance doesn't really have a chance to cause a varnish-internal problem mucking up the frontend instance where VK is listening...)
[18:51:06] hm
[19:19:09] netops, Operations, ops-eqiad, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3226013 (ayounsi) @Cmjohnson you're free to decommission/unrack asw-d-eqiad.
[19:23:01] interesting, the logs around a big chunk of missing logs from cp2005 have a few missing fields too
[19:23:13] no timestamp, no IP, no time_firstbyte
[19:23:20] could be a weird varnishkafka problem
[20:10:17] netops, DC-Ops, Operations, ops-codfw: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3226100 (ayounsi) Diff from paste above pushed to mr1-codfw. @papaul, let me know when we can sync-up to configure the AP.
[23:10:03] Wikimedia-Apache-configuration, Labs, Labs-Infrastructure, Operations, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1347928 (Dzahn) Resolved→Open re-using the ticket again for the same issue. we have an alert https://icinga.wikimedia.org/cgi-bin/i...
[23:10:17] Wikimedia-Apache-configuration, Labs, Labs-Infrastructure, Operations, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#3226610 (Dzahn) p: High→Normal
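For reference, the backend "mailbox lag" explained in the 18:45-18:50 messages above (the expiry thread falling behind on expiry/eviction work mailed to it) can be observed from varnishstat counters. The sketch below assumes the flat JSON layout of Varnish 4.x's `varnishstat -1 -j` and the MAIN.exp_mailed / MAIN.exp_received counters; the instance-name handling and warning threshold are invented for illustration and this is not the actual WMF monitoring check.

```python
# Rough sketch: estimate expiry-mailbox backlog on a varnish backend instance.
# Lag is the gap between objects mailed to the expiry thread and objects it
# has actually received/processed; a large, growing gap matches the symptom
# described above (backend falling behind, eventually 503s on cache miss).
import json
import subprocess

MAILBOX_LAG_WARN = 500000  # illustrative threshold, not a real WMF setting


def mailbox_lag(instance_name=None):
    cmd = ['varnishstat', '-1', '-j']
    if instance_name:
        cmd += ['-n', instance_name]  # e.g. the backend varnishd instance
    stats = json.loads(subprocess.check_output(cmd))
    mailed = stats['MAIN.exp_mailed']['value']
    received = stats['MAIN.exp_received']['value']
    return mailed - received


if __name__ == '__main__':
    lag = mailbox_lag()
    status = 'WARN' if lag > MAILBOX_LAG_WARN else 'OK'
    print('{}: expiry mailbox lag = {}'.format(status, lag))
```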