[07:26:51] _joe_ look how beautiful this graph is
[07:26:52] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&fullscreen&panelId=9&from=now-2d&to=now
[07:26:56] -.-
[07:28:15] <_joe_> doesn't look beautiful to me... looks like we're failing badly?
[07:28:33] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&fullscreen&panelId=55&from=now-2d&to=now
[07:28:37] <_joe_> what happened in the last couple of days? the train went out?
[07:28:46] that is rx bandwidth, so I guess a huge key
[07:29:18] <_joe_> mc1027
[07:29:21] for mc1027, yeah
[07:29:29] so hopefully one isolated use case
[07:29:33] <_joe_> sigh
[07:29:34] <_joe_> still
[07:29:40] <_joe_> we now need to investigate it
[07:29:54] <_joe_> memcached has horrible instrumentation, we will need to add some at the mediawiki level
[07:32:08] I disagree, there are a ton of metrics; knowing key patterns is the client's responsibility in my opinion
[07:32:18] and we have metrics for the wancache mw stuff
[07:32:26] but we need to have more :)
[07:33:08] <_joe_> we have tons of metrics, minus the ones really useful in debugging issues :P
[07:33:22] yes :D
[07:33:33] so the slab seems to be 154, keys are ~200k
[07:33:38] <_joe_> horrible doesn't mean scarce
[07:39:32] dumping thoughts in here
[07:40:14] - mc1027 was found in /var/log/mcrouter.log on one of the affected api-appservers, located via the mcrouter TKO graph seen before
[07:40:43] - slab 154 was found checking this graph https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1027&var-slab=All&fullscreen&panelId=60&from=now-2d&to=now
[07:41:13] I dumped the slab into a txt file (echo "stats cachedump 154 1000" | nc -q 1 localhost 11211) but there are of course a lot of keys
[07:41:47] all of them ~200k in size; with a 1G NIC it is super easy to saturate tx bandwidth if the traffic rises
[07:41:56] from the graphs:
[07:42:36] - mcrouter TKOs started to show peaks on the 5th at around 4/6 UTC
[07:43:04] - memcached slabs show the first peak at around 8, same day
[07:45:12] from the SAL I don't see a clear suspect, but I might need more coffee
[07:53:25] so ideally, if we wanted to cover missing gaps, we should just look at something like https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1&from=now-2d&to=now to identify a variation in pattern
[07:56:01] the main problem though is that the size of the key is not taken into account in --^
[08:05:09] from a quick ifstat on mc1027
[08:05:10] HH:MM:SS   Kbps in   Kbps out
[08:05:10] 08:04:43  455502.3   982018.6
[08:05:10] 08:04:44  441302.3   982108.1
[08:05:10] 08:04:45  196611.3   730841.9
[08:05:45] baseline is around 200/300k Kbps out
[08:06:22] and in ~50/100k Kbps
[08:08:54] there are some CAS ops registered when we see high gets; this seems to be a similar issue to the old translategroup key one
[08:09:47] ok, coffee then tcpdump sigh
[09:37:31] <_joe_> how did the tcpdump after coffee go?
[09:41:25] still working on it; I collected some dumps but didn't find a clear culprit, and now I am waiting for another spike in timeouts that of course is not happening
[09:41:38] I noticed though that slab 144 also follows the same pattern
[09:41:48] and on both we have those :segment: keys
[09:42:11] could be a coincidence but I'll triple check
[09:42:31] also the pcaps are so huge that I have to split them and filter for either src or dst port
[09:42:34] :D
[10:13:00] no repro up to now, will keep monitoring..
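
A minimal sketch of the slab inspection described above, assuming memcached is answering on localhost:11211 and reusing slab class 154 from the log; note that "stats cachedump" only returns a bounded sample of a slab, so this is indicative rather than exhaustive:

    # which slab classes exist and how big/full they are
    echo "stats slabs" | nc -q 1 localhost 11211 | grep -E ':(chunk_size|used_chunks) '

    # dump up to 1000 keys from the suspect slab class into a file
    echo "stats cachedump 154 1000" | nc -q 1 localhost 11211 > slab154.txt

    # lines look like: ITEM <key> [<bytes> b; <expiry> s]
    # strip the bracket and sort by item size to find the biggest keys
    tr -d '[' < slab154.txt | awk '$1 == "ITEM" {print $3, $2}' | sort -rn | head -20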
[11:04:49] _joe_ I just saw three new mcrouter TKO peaks, but this time for mc1026
[11:05:15] so it might be something that goes across shards, and I am inclined to investigate the segment stuff
[11:06:00] <_joe_> what do you mean with "the segment stuff"?
[11:06:24] the big-key fragmentation in :segment: that we have in the wancache
[11:14:46] very interesting - mc1036 slab stats seem to show that slab 176 was the culprit, and it contains only one key
[11:15:28] prefix is WANCache:v:global:SqlBlobStore-blob:svwiki (580k!!!)
[11:17:21] is it another issue or the same one? :D
[11:20:39] "Service for storing and loading data blobs representing revision content."
[11:20:46] https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php
[11:21:16] and I can find a lot of those keys in mc1027's slabs
[11:21:24] but they are also probably used a lot
[11:23:09] <_joe_> yes they are
[11:25:19] from the wancache dashboard it is difficult to spot changes
[11:27:54] _joe_ in my opinion it would be helpful in these cases to have a wancache metric for the size of a get/set.. then it would be easy to have a bandwidth usage graph, to easily spot anomalies and trends
[11:33:08] last occurrence on mc1033, slab 176 - two keys only, SqlBlobStore related
[11:34:13] ok, a pattern seems to emerge, opening a task
[11:46:54] https://phabricator.wikimedia.org/T239983
[11:57:54] <_joe_> it's not an extremely easy problem to solve tbh
[14:37:11] bblack: re: nodejs and DNS, I just found this https://eng.uber.com/denial-by-dns/
[14:37:48] not obviously the same issue; not obviously different either
[14:40:32] wow
[14:40:56] I skipped like 3/4 of it because it's all noise, but yeah
[14:41:13] this is what happens when someone realizes DNS is something they have to think about for like the first time ever, I guess
[14:43:08] at least they correctly identify that it's the language runtime/libraries/whatever that are at fault in their shitty dns issues :)
[14:45:32] yeah
[14:48:02] they couldn't resolve localhost anymore, how sweet
[14:54:16] _joe_ the spikes in api latency follow the mcrouter TKO spikes, at least starting from this morning
[14:54:35] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-12h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m
[14:54:49] vs
[14:54:49] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=now-12h&to=now&fullscreen&panelId=6
[14:55:11] why not yesterday is a mystery
[14:56:28] something interesting from the mcrouter graphs is the failed deletes
[14:56:33] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&fullscreen&panelId=5&from=now-24h&to=now
[14:56:50] that is closer to the api latency patterns
[15:10:53] at the moment the mcrouter issue seems to be on mc1026 again
[15:11:06] slab 176, there is one key of ~600K
[15:11:26] a few gets and it is relatively easy to saturate the rx on the 1G nic
[15:13:45] the key is
[15:13:46] ITEM WANCache:v:global:SqlBlobStore-blob:svwiki:tt%3A42632535 [581794 b; 1576249835 s]
[15:38:33] ok, tcpdump confirms, I was able to capture some traffic for mc1026
[15:40:49] <_joe_> it's always /that/ key?
[15:40:54] <_joe_> or a series of similar ones?
[15:43:33] that one
[15:44:11] multiple apis requesting it, ~600K each time, and mc1026 saturates rx
[15:46:44] _joe_ any idea about it?
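
A rough sanity check on the "a few gets saturate the NIC" observation: a back-of-the-envelope sketch using the 581794-byte value size from the cachedump line above, ignoring TCP and memcached protocol overhead (so the real threshold is somewhat lower):

    awk 'BEGIN {
      nic_bps = 1e9          # 1 Gbit/s link on the memcached host
      value_b = 581794       # bytes per value, from "ITEM ... [581794 b; ...]"
      printf "%.0f gets/s fill the link\n", nic_bps / (value_b * 8)
    }'
    # prints roughly 215 gets/s, i.e. a modest burst across the api-appserver
    # fleet is enough to saturate the shard's NIC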
[15:48:24] <_joe_> elukey: nope, I can probably look at what's inside it
[15:52:58] I am going to check tcpdump again to see if the key is the same or if we have variations
[15:53:19] from what's reported in the task though, it seems that we had issues with smaller SqlBlobStore keys as well
[15:53:28] but ~200k keys take longer to saturate the nic
[15:58:37] in the pcap that I am checking it is always that key, from multiple api-appservers
[16:03:40] <_joe_> so it's a classic modify-delete-get stampede
[16:12:21] _joe_ but probably from a bot, or similar? I am trying to understand how this happens
[16:13:21] <_joe_> a bot doing what?
[16:13:30] <_joe_> edit/read multiple times?
[16:13:38] I have no idea, I am asking
[16:14:32] if the theory is that the same key gets deleted and then get/set by multiple apis
[16:15:19] I was trying to imagine what could trigger it
[16:16:06] when we had the translategroup key issue, part of the problem initially was that once the key disappeared, every api accessing a page requiring it would have tried to get and set it
[16:16:13] but it should be more resilient now
[16:20:53] <_joe_> report this in a task, point it to timo :)
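
A sketch of how the pcap analysis above could look, assuming the capture runs on the memcached host with the ASCII protocol on port 11211; the interface name, file names, packet count, and the key regex are placeholders for illustration, not the exact commands used:

    # capture only memcached traffic; -s 0 keeps full payloads
    tcpdump -i eth0 -s 0 -c 200000 -w /tmp/mc1026.pcap 'port 11211'

    # requests (dst port 11211) carry the key names: count gets per key
    # (only catches the first key of a multi-get, but enough to spot the hot key)
    tcpdump -r /tmp/mc1026.pcap -A 'dst port 11211' 2>/dev/null \
      | grep -oE 'get [A-Za-z0-9:%_.-]+' | sort | uniq -c | sort -rn | head

    # responses (src port 11211) carry the values; splitting by direction keeps
    # the files manageable and lets you confirm the ~600K payload sizes
    tcpdump -r /tmp/mc1026.pcap 'src port 11211' -w /tmp/mc1026-replies.pcap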