[06:48:55] good morning
[06:49:06] mc1036 keeps showing good results
[06:49:08] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-3h&to=now
[06:49:43] - with 1.5 and +20G of memory it is now storing ~47M keys (was: 28M)
[06:50:28] - evictions started happening when we reached max memory allocation, but their frequency is way lower than the rest of the pool
[06:52:48] - get hit-ratio keeps growing, now 0.9566 (it was around this number even before)
[06:54:56] from the slab metrics it looks like we are good (at least, afaict)
[07:01:35] effie: --^
[07:05:54] _joe_ - ssh to mc2036 and run 'echo "watch fetchers" | nc localhost 11211'
[07:06:38] https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L1097
[07:06:40] :O
[07:07:14] we could replace that memkeys script in theory
[07:07:27] (the one that dumps keys to disk)
[07:26:44] <_joe_> oh good
[07:30:28] <_joe_> elukey: we should've restarted a jessie machine too
[07:30:51] as control?
[07:37:27] we have one control
[07:37:49] we have mc2037 which replaced the dead memcached shard in august
[07:38:05] but we have added more memory so it is a slightly unfair race
[07:40:30] yes yes but I think that Joe meant a control with the same eqiad traffic
[07:40:36] or similar
[07:41:36] we can try to do it for the next shard, but I think we'll not get surprises
[07:41:48] the new memcached is simply way better than the other one
[07:51:21] let's declare victory and wait for december to roll this out
[07:51:34] we will merge the onhost patch too, soon
[07:51:49] that should lighten the traffic
[07:52:31] yep another big one
[09:03:41] <_joe_> effie: I doubt we will be allowed to do the upgrades in december though
[09:03:57] <_joe_> did we have any news about the deployment freeze?
[09:09:14] on memcached?
[09:10:52] december will be thin in deployments, but that should not prevent the memcached upgrade, provided the redis part is sorted
[09:11:33] that has been my understanding
[09:19:33] <_joe_> we need to clear this stuff up the chain I guess :)
[09:54:02] is there a way to search icinga for multiple patterns at once?
[09:54:21] i'm currently using something like this: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^db2
[09:54:37] but i'd like to be able to also match db1/pc[12]/es[12] etc
[09:55:47] if you find one, let me know
[09:56:50] hmm. it does accept _some_ globbing. `^db[12]` does work.
[09:59:07] this works! https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\)[12]
[10:00:47] the full form for what i want is https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29
[10:01:22] which shows all non-OK services for all db hosts, _ignoring_ downtimes, so i can tell if anything will alert if a DT expires \o/
[10:15:13] oh that's good!
[10:17:50] maybe I'll just bookmark that, heh
[11:23:50] FYI all, i have merged an update to the pcc cli tool which automatically posts to the gerrit change being tested: https://gerrit.wikimedia.org/r/c/operations/puppet/+/636652
[11:25:06] jbond42: nice!
[11:28:06] lovely
[12:34:36] if we have a cname pointing to a production host, is there a mechanism for getting that cname included as a SAN in the puppet cert for that host?
[12:57:32] jbond42: nice, thanks!
[12:59:48] kormat: in theory you can set `profile::base::puppet::dns_alt_names` but i don't think it's used anywhere and it probably requires manual steps to regenerate the cert
[13:00:16] jbond42: mm, ok. thanks.
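A rough sketch of what the `profile::base::puppet::dns_alt_names` approach above could look like in practice. Only the hiera key name comes from the conversation; the value format, the `zarcillo.eqiad.wmnet` CNAME, and the re-signing commands are assumptions (Puppet 5-era syntax; Puppet 6+ uses `puppetserver ca` instead), so treat it as an illustration rather than a verified procedure.

    # hiera for the host (value format assumed; presumably this ends up as
    # dns_alt_names in the agent's puppet.conf):
    #   profile::base::puppet::dns_alt_names:
    #     - 'zarcillo.eqiad.wmnet'    # hypothetical CNAME for the active server
    #
    # re-signing, roughly: revoke the old cert on the CA, wipe the agent's SSL
    # state, request a new cert, and sign it allowing alt names.
    puppet cert clean db1115.eqiad.wmnet                       # on the puppetmaster/CA
    rm -rf /var/lib/puppet/ssl                                 # on the agent (path depends on packaging)
    puppet agent --test                                        # agent submits a new CSR including the SANs
    puppet cert sign --allow-dns-alt-names db1115.eqiad.wmnet  # back on the CA
    puppet agent --test                                        # agent fetches the signed cert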
[14:59:49] <_joe_> kormat: what is the problem you're trying to solve?
[15:00:29] <_joe_> btw the alt names stuff was used in the past, we've just kinda moved away
[15:00:46] <_joe_> because we prefer to declare certs in code as much as possible via cergen
[15:01:05] _joe_: currently there's a bunch of scripts that have `ZARCILLO=db1115` at the top. it would be nice if we could have a `zarcillo` cname that points to the active server. but this will break ssl without a SAN
[15:01:05] <_joe_> esp because you can create ecdsa certs in that case, which is a great perf advantage
[15:01:38] <_joe_> kormat: so that's for connecting to mysql, correct?
[15:01:43] yes
[15:01:55] <_joe_> so you want the cert to contain both the local hostname AND the SAN, correct?
[15:01:59] correct
[15:02:12] <_joe_> so until jbond42 gifts us the next great thing
[15:02:24] <_joe_> I would go with profile::base::puppet::dns_alt_names
[15:02:36] <_joe_> but yes, it means you'll have to re-sign the hosts in puppet
[15:02:41] <_joe_> and remove their current cert
[15:03:10] that doesn't sound awful
[15:05:10] <_joe_> kormat: not 100% sure though, maybe you'd be better off managing them with cergen like everything else
[15:05:43] <_joe_> kormat: actually sorry, we were leading you off base I think
[15:07:44] worry not. it's stevie-in-2-weeks' problem :)
[15:08:00] <_joe_> ack :)
[16:56:12] last maps update for a while, promise (does it even matter now that paging is fixed?): eqiad is stable and has a properly synced OSM database. codfw's sick node (maps2002) has been replaced with maps2005, the cluster is at full strength, and all codfw hosts have regained some disk
[16:56:45] awesome, thanks hnowlan!
[17:08:43] jbond42: thank you again for adding the 'full diff' feature to pcc <3
[17:08:45] <3
[17:12:25] np :)
[18:52:07] I'm seeing mw1379 as pooled, but alerting with status 503 for a couple of days now. Anyone working on this host?
[18:53:47] oh, good catch
[18:54:43] https://logstash.wikimedia.org/goto/328d1b73c5e60e068cd52e62360e06f5
[18:54:46] does not really look healthy
[18:55:11] it's depooled now
[18:55:21] thanks!
[18:55:35] I'll file a ticket
[18:59:06] ... huh, I didn't expect a restart-php7.2-fpm to clear things, but it looks like it did
[18:59:58] it renders the enwiki main page fine now, too
[19:00:26] repooling
[19:01:13] cool :)
[19:03:37] hmmm.
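For reference, the mw1379 depool / restart / repool sequence above, as a rough shell sketch run on the affected appserver. Only `restart-php7.2-fpm` is named in the log; the `depool`/`pool` helpers and the curl spot-check are assumptions about the local tooling rather than a transcript of what was actually run.

    sudo depool                          # take mw1379 out of the serving pools
    sudo restart-php7.2-fpm              # the restart wrapper mentioned above
    # spot-check that the enwiki main page renders before putting it back
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Host: en.wikipedia.org' 'http://localhost/wiki/Main_Page'
    sudo pool                            # repool once it looks healthy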
[20:02:59] effie: regarding memc-on-host-tier, do I understand correctly that the route exists on all appservers today, but is transparent/an alias to the default behaviour on all but debug/canary?
[20:03:22] so we'll roll it out by host from mcr/puppet config rather than e.g. by wiki or by host from mw config
[20:03:36] and use appserver latency metrics or some such to gauge parser cache perf impact?
[20:05:33] in terms of capacity, how do they compare to the main cluster - is there a doc or task with more details on e.g. how it is configured and/or what we're thinking of tuning? the tasks I'm finding are mostly about how it works internally etc, which we're almost done with :)
[20:12:31] Krinkle: so the route exists on all servers, but right now it is exactly the same as /*/mw
[20:12:58] but we have enabled onhost cache on mwdebug1001 and one codfw server
[20:13:28] where this route checks the local memcached first and on a miss fetches from the memcache cluster
[20:14:09] what we are looking for is basically to reduce Rx traffic from the memcached cluster
[20:14:34] in my tests I didn't observe a significant latency improvement
[20:15:40] now in terms of capacity
[20:16:10] I looked a bit at RAM usage across our mediawiki clusters
[20:16:37] so we will start by dedicating 1/4 of a server's RAM to onhost memcached
[20:16:58] after we roll out to canaries (1 canary on monday for starters)
[20:17:28] we will have a better idea if it works or not
[20:17:58] as far as docs go, we have the meeting notes from our meetings
[20:18:46] and the tasks
[20:19:18] https://phabricator.wikimedia.org/T263958
[20:19:32] https://phabricator.wikimedia.org/T244340
[20:19:57] Krinkle: does that answer your questions?
[20:24:24] oh, last but not least
[20:24:41] when we deploy the mediawiki config change
[20:25:28] it will take effect only on servers with the onhost feature flag enabled
[20:25:40] so we can control the rollout via puppet
[20:39:51] * Krinkle had to step out for a minute
[20:40:37] effie: thanks, perfect.
[20:40:51] :)
[20:40:58] effie: ack, I don't expect improvements, although it'd be a nice win if it does. we're actually more concerned about latency increases.
[20:42:58] we'll see!
[20:43:29] once we're through all the stages incl wanobjectcache (which receives many more calls within a web req), I'd like to toggle it all back off as well to compare side by side.
[20:44:27] for now I think it's just a matter of monitoring for functional stuff through each stage, parser cache tends to be called at most once in a whole web req so that should be negliclble
[20:44:43] where the spellling of negligible is also considered negligible
[21:00:27] lol
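Once a canary has the onhost tier enabled, a quick way to see whether the local memcached is actually taking reads is the same trick used on mc2036 earlier in the day. A minimal sketch, assuming the on-host instance listens on the standard port 11211 on localhost (adjust host/port if it is bound elsewhere) and is a memcached recent enough to support `watch`:

    # stream a few of the keys currently being fetched from the local instance
    # (-q5 and head are only there so the watch terminates on its own)
    echo "watch fetchers" | nc -q5 localhost 11211 | head -n 20
    # compute the local get hit ratio from the stats output, the same number
    # tracked for mc1036 above (0.9566)
    echo stats | nc -q1 localhost 11211 \
      | awk '/STAT get_hits /{h=$3} /STAT get_misses /{m=$3} END{printf "local hit ratio: %.4f\n", h/(h+m)}'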