[07:33:21] I checked the memcache alarms for mw, and they all seem related to conns to mc1035
[07:33:25] that line up with
[07:33:26] https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1035&var-slab=All&fullscreen&panelId=49
[07:34:49] slab 163 hosts keys of ~300K
[07:36:16] but it is a bit difficult at this point to figure out which one caused the issue
[07:48:33] zooming in to mc1035:slab-163
[07:48:35] https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1035&var-slab=163
[07:49:04] at ~4:00 UTC there was a big expire-unfetched event
[07:49:15] after that, there were spikes of sets
[08:00:58] very interesting, there are a lot of keys with the "segment" keyword
[08:01:12] (https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/495321/)
[08:01:57] so one possible working theory is that some big key was segmented into multiple pieces, and a lot of them ended up on the same slab on mc1035
[08:02:53] or even multiple big objects were split and ended up in this situation
[08:20:21] <_joe_> sigh
[08:20:34] <_joe_> so much for trusting consistent hashing stochasticity
[08:21:51] <_joe_> elukey: so we're now splitting large keys? rather than rejecting them as I suggested?
[08:22:22] * _joe_ waits for when we'll find a 150 MB memcache key segmented across 10 backends
[08:23:03] _joe_: yes, we do it for a couple of use cases as far as I know
[08:24:19] apistashedit and parsercache
[08:24:35] I'm very ignorant about them
[08:25:17] but I guess that a big visual editor temporary change could end up saving MBs to memcached?
[08:31:18] <_joe_> yeah I'd expect us to move that to a better storage model (apistashedit specifically)
[08:31:56] <_joe_> if I find the time, this is another thing I should do
[13:10:06] godog, or someone who knows our prometheus setup: I'm trying to add a scrape endpoint as per https://apereo.github.io/cas/6.0.x/configuration/Configuration-Properties.html#prometheus but I can't seem to find the correct place to add the config in puppet. Anyone able to point me in the right direction?
[13:44:18] jbond42: tl;dr is to add a job in ./modules/profile/manifests/prometheus/ops.pp
[13:46:01] effie: ok for me to take, say, mw2274 for some partman tests?
[14:04:09] godog: let me check my secret list
[14:04:21] and I will give you a server to pet
[14:05:15] haha sounds great
[14:08:01] mw2228.codfw.wmnet would be great
[14:08:12] I am putting your name down
[14:08:22] thank you :)
[14:08:30] let me know how it goes
[14:08:50] effie: sweet -- thanks! I'll let you know
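(For context on the scrape-endpoint question answered at [13:44:18]: a job added in modules/profile/manifests/prometheus/ops.pp ultimately becomes a stanza in the Prometheus server's scrape_configs. Below is a minimal sketch of such a stanza, assuming the CAS actuator serves its metrics over HTTPS; the job name, target host, port, and metrics path are illustrative assumptions, not the values actually used in puppet.)

    scrape_configs:
      # hypothetical job for the CAS prometheus endpoint; job_name,
      # metrics_path and the target are assumptions for illustration only
      - job_name: cas
        scheme: https
        metrics_path: /cas/actuator/prometheus
        static_configs:
          - targets: ['idp1001.wikimedia.org:443']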
[15:16:02] godog: you are messing up my streak of successful reimages
[15:16:10] that was not in our deal
[15:16:32] <--- supportive colleague
[15:17:01] * godog scoundrel
[15:17:35] effie: lol I didn't realize we gamified reimages
[15:17:50] I am trying to keep it interesting
[15:18:48] heheh fair enough
[15:23:29] godog: effie wants to level up :-P
[15:23:47] * effie 1UP
[15:25:18] heheh
[15:25:35] <_joe_> effie: lmk which bonuses you find
[15:26:05] godog: sorry, forgot to say thanks for the pointer, thanks also for the +1
[15:26:26] jbond42: you are welcome
[15:27:13] _joe_: so far I found that in some rare cases
[15:27:31] I think it was just 1 host out of who knows how many so far, I will count in a bit
[15:27:51] where memcached was installed 2 seconds before php-fpm restarted
[15:28:06] but looking at our code
[15:28:27] I am not sure it is easy to fix, I mean the way we install php-* packages
[15:33:54] <_joe_> I'm pretty sure it is
[15:34:29] <_joe_> but we didn't restart php-fpm from puppet by design
[15:34:43] <_joe_> so you might need to restart php-fpm everywhere just to be sure
[15:34:53] <_joe_> I mean everywhere you reimage
[15:36:47] yes I think the solution lies there and not in puppet
[15:37:15] then there were a handful of puppet runs taking way longer to finish, which, ok, happens
[15:37:32] and the icinga alerts with moritz's microcode
[15:37:37] that is all really
[15:39:52] _joe_: every reimaged host gets rebooted after the first puppet run
[15:40:03] so as long as you don't need 2+ puppet runs to get everything done there
[15:40:09] it should already be restarted by the reboot
[15:40:16] <_joe_> volans: so I don't get what effie was saying
[15:40:25] <_joe_> we def don't need 2+ puppet runs
[15:40:51] as I said, it was 1 host
[15:41:12] I honestly didn't spend more time on it, since it didn't happen again
[15:42:46] I can go through the logs after this is done
[15:42:57] and see what happens in the second puppet run
[15:45:47] pro tip: puppetboard ;)
[15:47:39] ah yes
[15:51:14] or copy /var/lib/puppet/state/classes.txt between the first and second run and diff it
[15:53:39] <_joe_> I'm pretty sure 1 puppet run is enough
[15:53:51] <_joe_> unless something has changed in the last month
[15:54:00] isn't looking at the puppet logs enough?
[15:56:42] <_joe_> effie: it should, yes, and they tell you more than the puppet state
[15:56:57] <_joe_> ignore the pro-tips by volans
[15:57:15] ok, then we will take a look later, I just want to get eqiad finished
[15:57:17] <_joe_> your problem is the sequence of actions, not whether an object was applied or not
[15:57:21] <_joe_> today?
[15:57:27] no, but asap
[15:57:33] <_joe_> next week !
[15:57:36] I will do codfw at a slower pace
[15:57:46] so we can tune any tiny issues
[16:08:08] <_joe_> shouldn't it be the other way around?
[16:08:15] <_joe_> in codfw you can do 20 hosts at a time
[16:08:52] yes I mean
[16:09:17] I can do 20, but I don't have to check and repool them
[16:09:35] like I need to do with eqiad, so I can spend some time on it
[16:10:12] to check and to pool them back*
[16:21:32] there's an alert about "mw1281 is not in mediawiki-installation dsh group", maybe that slipped through in getting repooled?
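(A minimal sketch of the comparison volans suggested at [15:51:14], run on the reimaged host between its first and second puppet runs; only the /var/lib/puppet/state/classes.txt path comes from the discussion above, the /tmp snapshot location is an illustrative choice.)

    # snapshot the classes applied by the first puppet run
    cp /var/lib/puppet/state/classes.txt /tmp/classes.first-run.txt
    # trigger (or wait for) the second puppet run
    puppet agent --test
    # any difference means the set of applied classes changed between runs,
    # i.e. the first run did not fully converge
    diff -u /tmp/classes.first-run.txt /var/lib/puppet/state/classes.txt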