[13:15:30] what's up with all the opcache alerts on -operations?
[13:16:11] marostegui, jbond42 is discussing it on -serviceops
[13:16:17] thanks liw
[13:16:19] php-fpm restart
[13:16:22] ah ok, thanks
[16:17:38] o/
[16:42:39] I could use advice re: T251457
[16:42:39] T251457: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457
[17:06:36] brennen: just replied
[17:06:50] thanks marostegui
[18:43:19] I was checking memcached graphs to see if the new gutter settings for mcrouter were ok, and I noticed this
[18:43:22] https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&from=now-2d&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&var-memcached_server=All
[18:43:31] the TKOs are only for appservers
[18:44:25] and the good news is that we are using the gutter pool
[18:44:26] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-2d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All
[18:51:43] and from the logs mc1021 seems to be the shard most affected
[18:54:08] but this time it is difficult to find the culprit
[19:05:15] one thing that stands out is the following increase in DELETEs after the mw deploy
[19:05:18] https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=29&fullscreen&orgId=1&from=now-24h&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&var-memcached_server=All
[19:10:29] liw: o/ around by any chance?
[19:15:55] (brb dinner)
[19:45:21] with tcpdump on mc1021 I see a lot of
[19:45:22] c@ .....delete rl-deps:mediawiki.ui.icon|vector|zh:lock
[19:45:22] .8q.....add rl-deps:mediawiki.ui.button|vector|en:lock 1 30 1
[19:45:30] not sure if it was there before or not
[19:48:59] ah that is resource loader
[19:54:51] includes/resourceloader/ResourceLoader.php: $scopeLocks[$entity] = $cache->getScopedLock( "rl-deps:$entity", 0 );
[19:54:52] indeed ;)
[19:55:09] I guess it wasn't changed by the last train right?
[19:55:22] I checked in the changelog but didn't really find anything
[19:55:55] whatever happens, that part of the code is probably not the issue
[19:56:04] I would guess it is something higher up in the stack
[19:56:29] we have a bunch of resource loader dashboards at least
[19:56:46] ah yes makes sense
[19:56:51] https://grafana.wikimedia.org/d/000000066/resourceloader?orgId=1
[19:57:21] seems like we have way more requests made to load.php
[19:57:34] which would explain why there are more locks as well
[19:57:44] right after the deployment?
[19:57:47] seems weird timing
[19:57:57] https://grafana.wikimedia.org/d/000000066/resourceloader?panelId=46&fullscreen&orgId=1&from=now-3d&to=now
[19:58:12] so something happened at 13:00 UTC ;]
[19:58:30] and we have twice as many calls to the resource loader. That would qualify for a UBN
[19:58:32] hashar: https://sal.toolforge.org/sal/log/AEYvy3EBj_Bg1xd31H8L :)
[19:58:42] this is why I was asking about the last train
[19:59:50] elukey: Deployment is when they change, and so the CDNs hit RL.
[20:00:08] is there a task?
[20:00:17] that should be made a train blocker
[20:00:21] What?
[20:00:24] It's been this way for a decade.
[20:00:43] There's a task about whether we should improve it or just live with it, which IIRC got closed last month?
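
The rl-deps:...:lock add/delete traffic in the tcpdump excerpt above follows the usual memcached locking idiom: add only stores a key that does not already exist, so the first caller to add the lock key "wins", and the 30-second expiry visible in the capture ("add ... 1 30 1") keeps a crashed worker from holding the lock forever. Below is a minimal standalone PHP sketch of that pattern, not MediaWiki's actual BagOStuff::getScopedLock implementation; the host, entity name, and TTL are illustrative assumptions.

    <?php
    // Minimal sketch of the lock pattern behind the
    // "add rl-deps:...:lock 1 30 1" / "delete rl-deps:...:lock" traffic above.
    // Not MediaWiki's real code; host and entity name are placeholders.

    $cache = new Memcached();
    $cache->addServer('mc1021.example', 11211);   // hypothetical shard address

    $entity  = 'mediawiki.ui.button|vector|en';   // module|skin|language triple, as in the capture
    $lockKey = "rl-deps:$entity:lock";

    // Acquire: Memcached::add() succeeds only if the key is absent.
    // The 30-second TTL is the ADD seen on the wire.
    if ($cache->add($lockKey, 1, 30)) {
        try {
            // ... update the dependency data for $entity here ...
        } finally {
            // Release explicitly: this is the DELETE seen on the wire.
            $cache->delete($lockKey);
        }
    } else {
        // Another request already holds the lock; skip the write instead of blocking.
    }
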
[20:01:23] Load is about the same each group2 deploy: https://grafana.wikimedia.org/d/000000066/resourceloader?panelId=46&fullscreen&orgId=1&from=now-30d&to=now
[20:02:02] James_F: sure I am super ignorant about it, I am just reporting timings..
[20:02:10] Yeah, no worries.
[20:02:25] But unless something is different from the past, it's not UBN, it's just tech debt/architectural failings.
[20:02:56] In this case, the "fix" is containerised MW where we can pre-build the modules before deployment.
[20:03:36] ahhh
[20:03:37] https://grafana.wikimedia.org/d/000000066/resourceloader?from=1586304000000&to=1588291199000
[20:03:45] indeed there is a spike after the deployment ;D
[20:05:00] sure but right after the deployment I see this DELETE pattern to memcached
[20:05:03] https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=29&fullscreen&orgId=1&from=now-7d&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&var-memcached_server=All
[20:05:06] Yup, because the modules don't change except when we deploy something, and the vast majority of traffic (and different RL bundles shipped) is for group2 only. Smaller spikes after each deploy.
[20:05:07] that is very strange
[20:06:30] Hmm, is that normal? I thought we just used LRU cache for those keys.
[20:08:37] it seems a strange pattern, but maybe it is only temporary. I started investigating some memcached timeouts from MW and ended up seeing this weird pattern, so I thought I'd ask
[20:09:02] nothing is really on fire, I'll start checking again tomorrow morning to see if this is only temporary or not
[20:09:03] * James_F nods.
[20:09:07] and in that case I'll open a task :)
[20:09:15] Is there data older than last week?
[20:09:21] This is the first train for a fortnight, so…
[20:09:36] But grafana says "no data" if I ask for 30 days' data.
[20:10:24] I can check the aggregate metric
[20:10:34] James_F: https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=1&fullscreen&orgId=1&from=now-30d&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&var-memcached_server=All
[20:10:39] and then select "delete"
[20:11:22] Aha, thanks. The spike does indeed look bigger than normal.
[20:11:39] but less horrible in the 30d view
[20:11:42] Though that might be complicated by us reverting the train.
[20:11:45] Ack.
[20:11:55] no no, I was not suggesting that
[20:12:21] I was wondering if there was anything hitting memcached more and possibly generating some extra timeouts
[20:13:21] anyway, going to log off, thanks a lot hashar and James_F for the brainbounce
[20:13:29] o/
[20:13:36] elukey: you are welcome :)
[20:13:41] elukey: thank you for watching stuff!
[20:18:15] +1
[20:18:22] Never a mistake to ask if things are normal. :-)
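
As a follow-up to the question of whether the post-deploy DELETE spike persists, one low-tech cross-check (independent of the Grafana/mcrouter dashboards discussed above) is to sample a shard's own delete counters twice over the plain memcached text protocol and turn the delta into a rate. A rough standalone PHP sketch, where the host name and sampling interval are placeholder assumptions and the real monitoring path remains mcrouter -> Prometheus -> Grafana:

    <?php
    // Sample a memcached shard's delete counters twice and print a
    // deletes/second rate. delete_hits and delete_misses come from the
    // server's standard "stats" output; host and interval are placeholders.

    function memcachedDeleteCount(string $host, int $port = 11211): int
    {
        $fp = fsockopen($host, $port, $errno, $errstr, 2.0);
        if ($fp === false) {
            throw new RuntimeException("connect to $host failed: $errstr");
        }
        fwrite($fp, "stats\r\n");

        $deletes = 0;
        while (($line = fgets($fp)) !== false) {
            $line = rtrim($line);
            if ($line === 'END') {
                break;                       // end of the stats listing
            }
            // Lines look like "STAT delete_hits 12345"
            $parts = explode(' ', $line, 3);
            if (count($parts) === 3 && in_array($parts[1], ['delete_hits', 'delete_misses'], true)) {
                $deletes += (int) $parts[2];
            }
        }
        fclose($fp);
        return $deletes;
    }

    $host     = 'mc1021.example';   // hypothetical shard address
    $interval = 60;                 // seconds between the two samples

    $before = memcachedDeleteCount($host);
    sleep($interval);
    $after  = memcachedDeleteCount($host);

    printf("~%.1f deletes/sec on %s over the last %ds\n",
        ($after - $before) / $interval, $host, $interval);

Run twice a day or so, this gives a crude baseline to compare against the next train, without depending on how far back the dashboard retains per-operation data.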