[16:53:22] so I checked memcached slabs for mc1028
[16:53:24] and https://grafana.wikimedia.org/d/000000317/memcache-slabs?panelId=60&fullscreen&orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1028&var-slab=All
[16:53:29] doesn't look good
[16:53:41] slab 136?
[16:54:15] yeah
[16:54:54] looks like SqlBlobStore_blob in WANcache? and that can contain large things so would explain the traffic?
[16:55:24] https://usercontent.irccloud-cdn.com/file/mUFexIKX/image.png
[16:55:52] have you figured out which key elukey ?
[16:56:54] cdanis: I still haven't checked, I was trying to remember if memkeys shows bw usage for both tx / rx or only rx
[16:57:16] addshore: could be!
[16:57:58] if memkeys checks also tx, WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992
[16:58:04] addshore: those do look like they line up
[16:58:09] let's see if it is in slab 136
[16:58:45] although the first increase, just before 09:00, is on other servers (mc1024 and mc1025)
[16:59:15] and things get really bad at the second spike just before 13:00, which is slurping from mc1028
[16:59:17] elukey@mc1028:~$ echo "stats cachedump 136 1000" | nc -q 1 localhost 11211 | grep WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992
[16:59:20] ITEM WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992 [80352 b; 1587833905 s]
[16:59:24] It might not just be 1 key, any general traffic increase hitting non-cached pages would cause an increase to that key
[16:59:39] sqlblobstore is wikitext at a given revision?
[16:59:43] yup
[17:00:13] https://usercontent.irccloud-cdn.com/file/CoDXKwKY/image.png
[17:02:13] could it be related to a template change?
[17:02:38] It could be, I can go and try to find out what the article for that cache key is too
[17:02:39] what page is that revision ID for?
[17:02:59] well, it's text table entry 963546992, let me do the back-tracking now
[17:03:40] if it's easy for you, please show your work when you're done; this is a part of the infra I don't know nearly well enough
[17:03:55] it would be great to start collecting these tricks in a wiki page so even SREs could figure them out
[17:03:57] I had to do this once before for a similar reason, I don't think I wrote my method down, and now wish I had
[17:04:12] +1 elukey, or automate part of the process
[17:08:30] The only way I could think of doing it last time (and what I'm running now) is select * from content where content_address = "tt:963546992"; and work backwards
[17:08:54] but that table does not have an index on that field..... doing it on the analytics slave now
[17:11:21] +1!
[17:11:31] hammer those, I know their SRE, no problem
[17:11:57] addshore: remember to use the analytics-mysql thing
[17:12:00] on enwiki, what goes in the main text table instead of in external storage?
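(Editor's note: a minimal sketch of the memcached-side checks discussed above, assuming shell access to the affected host and the default memcached port 11211; the slab number and WANCache key are the ones from this incident, and `stats items` / `stats cachedump` are standard memcached stats commands.)

```bash
# Per-slab item counters: lines matching "items:136:" show how many objects
# live in slab 136 and how often they are fetched or evicted.
echo "stats items" | nc -q 1 localhost 11211 | grep 'items:136:'

# Dump item metadata for slab 136 (up to 1000 entries) and look for the
# suspect WANCache key; output includes value size and expiry timestamp.
echo "stats cachedump 136 1000" | nc -q 1 localhost 11211 \
  | grep 'WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A963546992'
```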
[17:12:06] elukey: I am :)
[17:12:18] cdanis: the text table just refers to external storage
[17:12:30] text looks something like | 1 | DB://rc1/5801939/0 | external,utf-8 |
[17:12:36] ah okay, mediawiki.org confused me about it then :)
[17:12:44] the content table then looks something like this
[17:12:44] | 1 | 46246 | 5k7m2k7q99piiub2aga5t398kh6xuxn | 1 | tt:863785073 |
[17:13:03] tt:
[17:14:25] interesting that tkos from mcrouters are zero https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1
[17:14:35] so we are good-ish atm
[17:16:26] interesting, still a bunch of saturation on that host
[17:17:12] * addshore twiddles thumbs while the query continues to run
[17:17:26] so mcrouter marks a shard as tko after 10x1s timeouts, it is probably affecting some calls to mc1028 but not enough to trigger a tko
[17:21:29] this should be easier to backtrack and identify, hopefully there's already a better way and it's merely a documentation problem
[17:21:44] | 943285896 | 186750 | 17ti8zx1nn1o0u381j04ctr9nfjwvis | 3 | tt:963546992 |
[17:21:44] +------------+--------------+---------------------------------+---------------+-----------------+
[17:21:44] 1 row in set (14 min 34.84 sec)
[17:21:47] onto the next bit
[17:23:07] The queries are really not meant to be run in this direction :D [enwiki]> select * from slots where slot_content_id = 943285896;
[17:23:36] also I guess the good news is that the local memcached idea would help with this
[17:27:14] also mcrouter supports this natively https://github.com/facebook/mcrouter/wiki/Two-level-caching
[17:30:36] while we wait to track down the issue, what if we cannot "stop it", like if it is a big template that was changed etc..
[17:31:25] (in the meantime, another big burst of TKOs)
[17:32:02] | 951705319 | 1 | 943285896 | 951705319 |
[17:32:02] +------------------+--------------+-----------------+-------------+
[17:32:02] 1 row in set (7 min 26.09 sec)
[17:32:14] heh, the headings would be useful, but enwiki revid 951705319
[17:32:33] * addshore gets a link
[17:32:46] https://en.wikipedia.org/w/index.php?oldid=951705319
[17:32:49] Module:Citation/CS1
[17:33:05] curprev 12:50, 18 April 2020‎ Trappist the monk talk contribs block‎ 186,750 bytes +2,143‎ sync from sandbox; rollback: 2 edits undo thank
[17:33:15] * addshore is off for dinner
[17:33:54] thanks!
[17:34:48] diff is https://en.wikipedia.org/w/index.php?title=Module:Citation/CS1&diff=prev&oldid=951705319
[17:35:39] I assume that this module is used in a ton of places
[17:37:30] cdanis: we are in a mild degraded mode, I have no idea in mind about how to mitigate this. We could simply keep it monitored, and take actions if it gets worse
[17:38:31] one "action" that is very invasive and risky is to swap mc1028 with mc-gp1001, to get 10G, but the host runs memcached 1.5.x etc.. so not really tested with prod traffic
[17:38:40] possibly to use only in case of a big fire
[17:38:48] this doesn't seem the case
[17:40:03] going afk for some time, will re-check later :)
[17:40:42] elukey: it's a slight latency increase, nothing major; we have other, worse latency spikes all the time :) so mild degraded mode is about right
[17:41:19] "This Lua module is used on approximately 4,390,000 pages."
[17:41:21] lol
[17:41:24] :D
[17:41:37] It seems like the reparses of those pages are coming through in waves almost
[17:42:08] cdanis: any nice ideas of where I should doc those queries? (even though they take 30 mins to complete)
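(Editor's note: the back-tracking chain above, collected in one place as a sketch. The plain `mysql` invocation stands in for the `analytics-mysql` wrapper mentioned earlier, the column names are the standard MediaWiki MCR ones rather than copied from the log, and the per-query result comments are the values found in this incident. The first query does a full table scan, since `content_address` is unindexed, hence the multi-minute runtimes seen above.)

```bash
ADDR='tt:963546992'   # content address taken from the hot WANCache key

# 1) Find the content row for that blob address (no index: full table scan).
mysql enwiki -e "SELECT content_id, content_size, content_address
                 FROM content WHERE content_address = '$ADDR';"
# -> content_id 943285896 in this incident

# 2) Find the slot row(s) pointing at that content row;
#    slot_revision_id is the enwiki revision ID.
mysql enwiki -e "SELECT slot_revision_id, slot_role_id, slot_content_id
                 FROM slots WHERE slot_content_id = 943285896;"
# -> slot_revision_id 951705319

# 3) View the revision directly:
#    https://en.wikipedia.org/w/index.php?oldid=951705319  (Module:Citation/CS1)
```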
[17:42:30] uff, good question
[17:43:37] for now how about adding a new section to https://wikitech.wikimedia.org/wiki/Memcached ? not a great place, but a somewhat-natural place, given how these issues manifest
[17:43:49] will do!
[17:47:49] thanks! for now I'm off
[17:47:55] Added https://wikitech.wikimedia.org/wiki/Memcached#WANCache%E2%9C%8C%EF%B8%8Fglobal:SqlBlobStore-blob:%3Cwiki%3E:%3Ccontent_address%3E
[17:53:19] 4 million? that's all? peanuts :-P
[17:54:10] I will be reading that addition to wikitech the next time I'm in work mode (just wrapping up something right now). have tab open.
[18:50:39] addshore: <3
[19:58:05] <_joe_> so I doubt what we're seeing are reparses of linked pages coming from refreshlinks
[19:58:28] <_joe_> else we should see the increase in get requests from mcrouter in jobrunners
[20:00:06] <_joe_> instead we see https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=1&fullscreen&orgId=1&from=1587176242978&to=1587239969420&var-source=eqiad%20prometheus%2Fops&var-cluster=jobrunner&var-instance=All&var-memcached_server=All
[20:00:25] <_joe_> so what's happening is the cdn was probably purged of all those pages already
[20:00:38] <_joe_> and now we're replenishing it bit by bit
[20:01:28] <_joe_> https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=1&fullscreen&orgId=1&from=1587176242978&to=1587239969420&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&var-memcached_server=All all the new memcached traffic comes from the appservers
[20:01:37] <_joe_> but why *so many more* requests?
[20:14:49] _joe_: re: already been purged -- `purged` is deployed in esams now, and there's not a backlog there
[20:15:06] so if the app layer was much more responsive about sending them, they'd also get purged from esams in minutes, not days
[20:34:45] <_joe_> yep
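(Editor's note: a quick, hedged way to spot-check the "CDN already purged" reading from the edge. The header names are assumptions based on the public behaviour of Wikimedia's text CDN, not taken from this log, and the page is an arbitrary example that transcludes Module:Citation/CS1; a run of misses on such pages would be consistent with traffic falling through to the appservers.)

```bash
# Inspect CDN cache headers for a page using CS1 citations:
# "hit-front"/"hit-local" means the object is cached at the edge,
# while "miss"/"pass" means the request went back towards the app layer.
curl -sI 'https://en.wikipedia.org/wiki/Douglas_Adams' | grep -iE '^x-cache'
```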