[07:26:51] _joe_ look how beautiful this graph is
[07:26:52] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&fullscreen&panelId=9&from=now-2d&to=now
[07:26:56] -.-
[07:28:15] <_joe_> doesn't look beautiful to me... looks like we're failing badly?
[07:28:33] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&fullscreen&panelId=55&from=now-2d&to=now
[07:28:37] <_joe_> what happened in the last couple of days? the train went out?
[07:28:46] that is rx bandwidth, so I guess a huge key
[07:29:18] <_joe_> mc1027
[07:29:21] for mc1027, yeah
[07:29:29] so hopefully one isolated use case
[07:29:33] <_joe_> sigh
[07:29:34] <_joe_> still
[07:29:40] <_joe_> we now need to investigate it
[07:29:54] <_joe_> memcached has horrible instrumentation, we will need to add some at the mediawiki level
[07:32:08] I disagree, there are a ton of metrics; knowing key patterns is the client's responsibility in my opinion
[07:32:18] and we have metrics for the wancache mw stuff
[07:32:26] but we need to have more :)
[07:33:08] <_joe_> we have tons of metrics, minus the ones really useful in debugging issues :P
[07:33:22] yes :D
[07:33:33] so the slab seems to be 154, keys are ~200k
[07:33:38] <_joe_> horrible doesn't mean scarce
[07:39:32] dumping thoughts in here
[07:40:14] - mc1027 was found in /var/log/mcrouter.log on one of the affected api-appservers, located via the mcrouter TKO graph seen before
[07:40:43] - slab 154 was found checking this graph https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1027&var-slab=All&fullscreen&panelId=60&from=now-2d&to=now
[07:41:13] I dumped the slab into a txt file (echo "stats cachedump 154 1000" | nc -q 1 localhost 11211) but there are of course a lot of keys
[07:41:47] all of them ~200k in size; with a 1G NIC it is super easy to saturate tx bandwidth if the traffic rises
[07:41:56] from the graphs:
[07:42:36] - mcrouter TKOs started to show peaks on the 5th at around 4/6 UTC
[07:43:04] - memcached slabs show the first peak at around 8, same day
[07:45:12] from the SAL I don't see a clear suspect, but I might need more coffee
[07:53:25] so ideally, if we wanted to cover missing gaps, we should just look at something like https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1&from=now-2d&to=now to identify a variation in pattern
[07:56:01] the main problem though is that the size of the key is not taken into account in --^
[08:05:09] from a quick ifstat on mc1027
[08:05:10] HH:MM:SS   Kbps in   Kbps out
[08:05:10] 08:04:43  455502.3   982018.6
[08:05:10] 08:04:44  441302.3   982108.1
[08:05:10] 08:04:45  196611.3   730841.9
[08:05:45] baseline is around 200/300k Kbps out
[08:06:22] and in ~50/100k Kbps
[08:08:54] there are some CAS ops registered when we see high gets; this seems to be a similar issue to the old translategroup key one
[08:09:47] ok, coffee then tcpdump sigh
[09:37:31] <_joe_> how did the tcpdump after coffee go?
[09:41:25] still working on it; I collected some dumps but didn't find a clear culprit, and now I am waiting for another spike in timeouts that of course is not happening
[09:41:38] I noticed though that slab 144 also follows the same pattern
[09:41:48] and on both we have those :segment: keys
[09:42:11] could be a coincidence but I'll triple check
[09:42:31] also the pcaps are so huge that I have to split them and filter for either src or dst port
[09:42:34] :D
[10:13:00] no repro up to now, will keep monitoring..
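
A minimal sketch of the slab inspection described above, assuming memcached is answering on localhost:11211 and reusing slab class 154 from the log; note that "stats cachedump" only returns a bounded sample of a slab, so this is indicative rather than exhaustive:

    # which slab classes exist and how big/full they are
    echo "stats slabs" | nc -q 1 localhost 11211 | grep -E ':(chunk_size|used_chunks) '

    # dump up to 1000 keys from the suspect slab class into a file
    echo "stats cachedump 154 1000" | nc -q 1 localhost 11211 > slab154.txt

    # lines look like: ITEM <key> [<bytes> b; <expiry> s]
    # strip the bracket and sort by item size to find the biggest keys
    tr -d '[' < slab154.txt | awk '$1 == "ITEM" {print $3, $2}' | sort -rn | head -20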
[11:04:49] _joe_ I just saw three new mcrouter TKO peaks, but this time for mc1026
[11:05:15] so it might be something that goes across shards, and I am inclined to investigate the segment stuff
[11:06:00] <_joe_> what do you mean with "the segment stuff"?
[11:06:24] the big-key fragmentation in :segment: that we have in the wancache
[11:14:46] very interesting - mc1036 slab stats seem to show that slab 176 was the culprit, and it contains only one key
[11:15:28] prefix is WANCache:v:global:SqlBlobStore-blob:svwiki (580k!!!)
[11:17:21] is it another issue or the same one? :D
[11:20:39] "Service for storing and loading data blobs representing revision content."
[11:20:46] https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php
[11:21:16] and I can find a lot of those keys in mc1027's slabs
[11:21:24] but they are also probably used a lot
[11:23:09] <_joe_> yes they are
[11:25:19] from the wancache dashboard it is difficult to spot changes
[11:27:54] _joe_ in my opinion it would be helpful in these cases to have a wancache metric for the size of a get/set.. then it would be easy to have a bandwidth usage graph, to easily spot anomalies and trends
[11:33:08] last occurrence on mc1033, slab 176 - two keys only, SqlBlobStore related
[11:34:13] ok, a pattern seems to emerge, opening a task
[11:46:54] https://phabricator.wikimedia.org/T239983
[11:57:54] <_joe_> it's not an extremely easy problem to solve tbh
[14:37:11] bblack: re: nodejs and DNS, I just found this https://eng.uber.com/denial-by-dns/
[14:37:48] not obviously the same issue; not obviously different either
[14:40:32] wow
[14:40:56] I skipped like 3/4 of it because it's all noise, but yeah
[14:41:13] this is what happens when someone realizes DNS is something they have to think about for like the first time ever, I guess
[14:43:08] at least they correctly identify that it's the language runtime/libraries/whatever that are at fault in their shitty dns issues :)
[14:45:32] yeah
[14:48:02] they couldn't resolve localhost anymore, how sweet
[14:54:16] _joe_ the spikes in api latency follow the mcrouter TKO spikes, at least starting from this morning
[14:54:35] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-12h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m
[14:54:49] vs
[14:54:49] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=now-12h&to=now&fullscreen&panelId=6
[14:55:11] why not yesterday is a mystery
[14:56:28] something interesting from the mcrouter graphs is the failed deletes
[14:56:33] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&fullscreen&panelId=5&from=now-24h&to=now
[14:56:50] that is closer to the api latency patterns
[15:10:53] at the moment the mcrouter issue seems to be on mc1026 again
[15:11:06] slab 176, there is one key of ~600K
[15:11:26] a few gets and it is relatively easy to saturate the rx on the 1G nic
[15:13:45] the key is
[15:13:46] ITEM WANCache:v:global:SqlBlobStore-blob:svwiki:tt%3A42632535 [581794 b; 1576249835 s]
[15:38:33] ok, tcpdump confirms, I was able to capture some traffic for mc1026
[15:40:49] <_joe_> it's always /that/ key?
[15:40:54] <_joe_> or a series of similar ones?
[15:43:33] that one
[15:44:11] multiple apis requesting it, ~600K each time, and mc1026 saturates rx
[15:46:44] _joe_ any idea about it?
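
A rough sanity check on the "a few gets saturate the NIC" observation: a back-of-the-envelope sketch using the 581794-byte value size from the cachedump line above, ignoring TCP and memcached protocol overhead (so the real threshold is somewhat lower):

    awk 'BEGIN {
      nic_bps = 1e9          # 1 Gbit/s link on the memcached host
      value_b = 581794       # bytes per value, from "ITEM ... [581794 b; ...]"
      printf "%.0f gets/s fill the link\n", nic_bps / (value_b * 8)
    }'
    # prints roughly 215 gets/s, i.e. a modest burst across the api-appserver
    # fleet is enough to saturate the shard's NIC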
[15:48:24] <_joe_> elukey: nope, I can probably look at what's inside it
[15:52:58] I am going to check tcpdump again to see if the key is the same or if we have variations
[15:53:19] from what's reported in the task though, it seems that we had issues with smaller SqlBlobStore keys as well
[15:53:28] but ~200k keys take longer to saturate the nic
[15:58:37] in the pcap that I am checking it is always that key, from multiple api-appservers
[16:03:40] <_joe_> so it's a classic modify-delete-get stampede
[16:12:21] _joe_ but probably from a bot, or similar? I am trying to understand how this happens
[16:13:21] <_joe_> a bot doing what?
[16:13:30] <_joe_> edit/read multiple times?
[16:13:38] I have no idea, I am asking
[16:14:32] if the theory is that the same key gets deleted and then get/set by multiple apis
[16:15:19] I was trying to imagine what could trigger it
[16:16:06] when we had the translategroup key issue, part of the problem initially was that once the key disappeared, every api accessing a page requiring it would have tried to get and set it
[16:16:13] but it should be more resilient now
[16:20:53] <_joe_> report this in a task, point it to timo :)
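
A sketch of how the pcap analysis above could look, assuming the capture runs on the memcached host with the ASCII protocol on port 11211; the interface name, file names, packet count, and the key regex are placeholders for illustration, not the exact commands used:

    # capture only memcached traffic; -s 0 keeps full payloads
    tcpdump -i eth0 -s 0 -c 200000 -w /tmp/mc1026.pcap 'port 11211'

    # requests (dst port 11211) carry the key names: count gets per key
    # (only catches the first key of a multi-get, but enough to spot the hot key)
    tcpdump -r /tmp/mc1026.pcap -A 'dst port 11211' 2>/dev/null \
      | grep -oE 'get [A-Za-z0-9:%_.-]+' | sort | uniq -c | sort -rn | head

    # responses (src port 11211) carry the values; splitting by direction keeps
    # the files manageable and lets you confirm the ~600K payload sizes
    tcpdump -r /tmp/mc1026.pcap 'src port 11211' -w /tmp/mc1026-replies.pcap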