[16:53:22] so I checked memcached slabs for mc1028
[16:53:24] and https://grafana.wikimedia.org/d/000000317/memcache-slabs?panelId=60&fullscreen&orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1028&var-slab=All
[16:53:29] doesn't look good
[16:53:41] slab 136?
[16:54:15] yeah
[16:54:54] looks like SqlBlobStore_blob in WANcache? and that can contain large things so would explain the traffic?
[16:55:24] https://usercontent.irccloud-cdn.com/file/mUFexIKX/image.png
[16:55:52] have you figured out which key elukey ?
[16:56:54] cdanis: I still haven't checked, I was trying to remember if memkeys shows bw usage for both tx / rx or only rx
[16:57:16] addshore: could be!
[16:57:58] if memkeys checks also tx, WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992
[16:58:04] addshore: those do look like they line up
[16:58:09] let's see if it is in slab 136
[16:58:45] although the first increase, just before 09:00, is on other servers (mc1024 and mc1025)
[16:59:15] and things get really bad at the second spike just before 13:00, which is slurping from mc1028
[16:59:17] elukey@mc1028:~$ echo "stats cachedump 136 1000" | nc -q 1 localhost 11211 | grep WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992
[16:59:20] ITEM WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992 [80352 b; 1587833905 s]
[16:59:24] It might not just be 1 key, any general traffic increase hitting non-cached pages would cause an increase to that key
[16:59:39] sqlblobstore is wikitext at a given revision?
[16:59:43] yup
[17:00:13] https://usercontent.irccloud-cdn.com/file/CoDXKwKY/image.png
[17:02:13] could it be related to a template change?
[17:02:38] It could be, I can go and try to find out what the article for that cache key is too
[17:02:39] what page is that revision ID for?
[17:02:59] well, it's text table entry 963546992, let me do the back-tracking now
[17:03:40] if it's easy for you, please show your work when you're done; this is a part of the infra I don't know nearly well enough
[17:03:55] it would be great to start collecting these tricks in a wiki page so even SREs could figure them out
[17:03:57] I had to do this once before for a similar reason, I don't think I wrote my method down, and now wish I had
[17:04:12] +1 elukey, or automate part of the process
[17:08:30] The only way I could think of doing it last time (and what I'm running now) is select * from content where content_address = "tt:963546992"; and work backwards
[17:08:54] but that table does not have an index on that field..... doing it on the analytics slave now
[17:11:21] +1!
[17:11:31] hammer those, I know their SRE, no problem
[17:11:57] addshore: remember to use the analytics-mysql thing
[17:12:00] on enwiki, what goes in the main text table instead of in external storage?
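(Editor's note: a minimal sketch of the memcached-side checks discussed above, assuming shell access to the affected host and the default memcached port 11211; the slab number and WANCache key are the ones from this incident, and `stats items` / `stats cachedump` are standard memcached stats commands.)

```bash
# Per-slab item counters: lines matching "items:136:" show how many objects
# live in slab 136 and how often they are fetched or evicted.
echo "stats items" | nc -q 1 localhost 11211 | grep 'items:136:'

# Dump item metadata for slab 136 (up to 1000 entries) and look for the
# suspect WANCache key; output includes value size and expiry timestamp.
echo "stats cachedump 136 1000" | nc -q 1 localhost 11211 \
  | grep 'WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992'
```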
[17:12:06] elukey: I am :)
[17:12:18] cdanis: the text table just refers to external storage
[17:12:30] text looks something like | 1 | DB://rc1/5801939/0 | external,utf-8 |
[17:12:36] ah okay, mediawiki.org confused me about it then :)
[17:12:44] the content table then looks something like this
[17:12:44] | 1 | 46246 | 5k7m2k7q99piiub2aga5t398kh6xuxn | 1 | tt:863785073 |
[17:13:03] tt:
[17:14:25] interesting that tkos from mcrouters are zero https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1
[17:14:35] so we are good-ish atm
[17:16:26] interesting, still a bunch of saturation on that host
[17:17:12] * addshore twiddles thumbs while the query continues to run
[17:17:26] so mcrouter marks a shard as tko after 10x1s timeouts, it is probably affecting some calls to mc1028 but not enough to trigger a tko
[17:21:29] this should be easier to backtrack and identify, hopefully there's already a better way and it's merely a documentation problem
[17:21:44] | 943285896 | 186750 | 17ti8zx1nn1o0u381j04ctr9nfjwvis | 3 | tt:963546992 |
[17:21:44] +------------+--------------+---------------------------------+---------------+-----------------+
[17:21:44] 1 row in set (14 min 34.84 sec)
[17:21:47] onto the next bit
[17:23:07] The queries are really not meant to be run in this direction :D [enwiki]> select * from slots where slot_content_id = 943285896;
[17:23:36] also I guess the good news is that the local memcached idea would help with this
[17:27:14] also mcrouter supports this natively https://github.com/facebook/mcrouter/wiki/Two-level-caching
[17:30:36] while we wait to track down the issue, what if we cannot "stop it", like if it is a big template that was changed etc..
[17:31:25] (in the meantime, another big burst of TKOs)
[17:32:02] | 951705319 | 1 | 943285896 | 951705319 |
[17:32:02] +------------------+--------------+-----------------+-------------+
[17:32:02] 1 row in set (7 min 26.09 sec)
[17:32:14] heh, the headings would be useful, but enwiki revid 951705319
[17:32:33] * addshore gets a link
[17:32:46] https://en.wikipedia.org/w/index.php?oldid=951705319
[17:32:49] Module:Citation/CS1
[17:33:05] curprev 12:50, 18 April 2020‎ Trappist the monk talk contribs block‎ 186,750 bytes +2,143‎ sync from sandbox; rollback: 2 edits undo thank
[17:33:15] * addshore is off for dinner
[17:33:54] thanks!
[17:34:48] diff is https://en.wikipedia.org/w/index.php?title=Module:Citation/CS1&diff=prev&oldid=951705319
[17:35:39] I assume that this module is used in a ton of places
[17:37:30] cdanis: we are in a mild degraded mode, I have no idea in mind about how to mitigate this. We could simply keep it monitored, and take actions if it gets worse
[17:38:31] one "action" that is very invasive and risky is to swap mc1028 with mc-gp1001, to get 10G, but the host runs memcached 1.5.x etc.. so not really tested with prod traffic
[17:38:40] possibly to use only in case of a big fire
[17:38:48] this doesn't seem the case
[17:40:03] going afk for some time, will re-check later :)
[17:40:42] elukey: it's a slight latency increase, nothing major; we have other, worse latency spikes all the time :) so mild degraded mode is about right
[17:41:19] "This Lua module is used on approximately 4,390,000 pages."
[17:41:21] lol
[17:41:24] :D
[17:41:37] It seems like the reparses of those pages are coming through in waves almost
[17:42:08] cdanis: any nice ideas of where I should doc those queries? (even though they take 30 mins to complete)
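(Editor's note: the back-tracking chain above, collected in one place as a sketch. The plain `mysql` invocation stands in for the `analytics-mysql` wrapper mentioned earlier, the column names are the standard MediaWiki MCR ones rather than copied from the log, and the per-query result comments are the values found in this incident. The first query does a full table scan, since `content_address` is unindexed, hence the multi-minute runtimes seen above.)

```bash
ADDR='tt:963546992'   # content address taken from the hot WANCache key

# 1) Find the content row for that blob address (no index: full table scan).
mysql enwiki -e "SELECT content_id, content_size, content_address
                 FROM content WHERE content_address = '$ADDR';"
# -> content_id 943285896 in this incident

# 2) Find the slot row(s) pointing at that content row;
#    slot_revision_id is the enwiki revision ID.
mysql enwiki -e "SELECT slot_revision_id, slot_role_id, slot_content_id
                 FROM slots WHERE slot_content_id = 943285896;"
# -> slot_revision_id 951705319

# 3) View the revision directly:
#    https://en.wikipedia.org/w/index.php?oldid=951705319  (Module:Citation/CS1)
```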
[17:42:30] uff, good question
[17:43:37] for now how about adding a new section to https://wikitech.wikimedia.org/wiki/Memcached ? not a great place, but a somewhat-natural place, given how these issues manifest
[17:43:49] will do!
[17:47:49] thanks! for now I'm off
[17:47:55] Added https://wikitech.wikimedia.org/wiki/Memcached#WANCache%E2%9C%8C%EF%B8%8Fglobal:SqlBlobStore-blob:%3Cwiki%3E:%3Ccontent_address%3E
[17:53:19] 4 million? that's all? peanuts :-P
[17:54:10] I will be reading that addition to wikitech the next time I'm in work mode (just wrapping up something right now). have tab open.
[18:50:39] addshore: <3
[19:58:05] <_joe_> so I doubt what we're seeing are reparses of linked pages coming from refreshlinks
[19:58:28] <_joe_> else we should see the increase in get requests from mcrouter in jobrunners
[20:00:06] <_joe_> instead we see https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=1&fullscreen&orgId=1&from=1587176242978&to=1587239969420&var-source=eqiad%20prometheus%2Fops&var-cluster=jobrunner&var-instance=All&var-memcached_server=All
[20:00:25] <_joe_> so what's happening is the cdn was probably purged of all those pages already
[20:00:38] <_joe_> and now we're replenishing it bit by bit
[20:01:28] <_joe_> https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=1&fullscreen&orgId=1&from=1587176242978&to=1587239969420&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&var-memcached_server=All all the new memcached traffic comes from the appservers
[20:01:37] <_joe_> but why *so many more* requests?
[20:14:49] _joe_: re: already been purged -- `purged` is deployed in esams now, and there's not a backlog there
[20:15:06] so if the app layer was much more responsive about sending them, they'd also get purged from esams in minutes, not days
[20:34:45] <_joe_> yep
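(Editor's note: a quick, hedged way to spot-check the "CDN already purged" reading from the edge. The header names are assumptions based on the public behaviour of Wikimedia's text CDN, not taken from this log, and the page is an arbitrary example that transcludes Module:Citation/CS1; a run of misses on such pages would be consistent with traffic falling through to the appservers.)

```bash
# Inspect CDN cache headers for a page using CS1 citations:
# "hit-front"/"hit-local" means the object is cached at the edge,
# while "miss"/"pass" means the request went back towards the app layer.
curl -sI 'https://en.wikipedia.org/wiki/Douglas_Adams' | grep -iE '^x-cache'
```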