[06:48:55] good morning
[06:49:06] mc1036 keeps showing good results
[06:49:08] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-3h&to=now
[06:49:43] - with 1.5 and +20G of memory it is now storing ~47M keys (was: 28M)
[06:50:28] - evictions started happening when we reached max memory allocation, but their frequency is way lower than the rest of the pool
[06:52:48] - get hit-ratio keeps growing, now 0.9566 (it was around this number even before)
[06:54:56] from the slab metrics it looks like we are good (at least, afaict)
[07:01:35] effie: --^
[07:05:54] _joe_ - ssh to mc2036 and run 'echo "watch fetchers" | nc localhost 11211'
[07:06:38] https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L1097
[07:06:40] :O
[07:07:14] we could replace that memkeys script in theory
[07:07:27] (the one that dumps keys to disk)
[07:26:44] <_joe_> oh good
[07:30:28] <_joe_> elukey: we should've restarted a jessie machine too
[07:30:51] as control?
[07:37:27] we have one control
[07:37:49] we have mc2037 which replaced the dead memcached shard in august
[07:38:05] but we have added more memory so it is a slightly unfair race
[07:40:30] yes yes but I think that Joe meant a control with the same eqiad traffic
[07:40:36] or similar
[07:41:36] we can try to do it for the next shard, but I think we'll not get surprises
[07:41:48] the new memcached is simply way better than the other one
[07:51:21] let's declare victory and wait for december to roll this out
[07:51:34] we will merge the onhost patch too, soon
[07:51:49] that should lighten the traffic
[07:52:31] yep another big one
[09:03:41] <_joe_> effie: I doubt we will be allowed to do the upgrades in december though
[09:03:57] <_joe_> did we have any news about the deployment freeze?
[09:09:14] on memcached?
[09:10:52] december will be thin in deployments, but that should not prevent the memcached upgrade, provided the redis part is sorted
[09:11:33] that has been my understanding
[09:19:33] <_joe_> we need to clear this stuff up the chain I guess :)
[09:54:02] is there a way to search icinga for multiple patterns at once?
[09:54:21] i'm currently using something like this: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^db2
[09:54:37] but i'd like to be able to also match db1/pc[12]/es[12] etc
[09:55:47] if you find one, let me know
[09:56:50] hmm. it does accept _some_ globbing. `^db[12]` does work.
[09:59:07] this works! https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\)[12]
[10:00:47] the full form for what i want is https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29
[10:01:22] which shows all non-OK services for all db hosts, _ignoring_ downtimes, so i can tell if anything will alert if a DT expires \o/
[10:15:13] oh that's good!
[10:17:50] maybe I'll just bookmark that, heh
[11:23:50] FYI all, i have merged an update to the pcc cli tool which automatically posts to the gerrit change being tested: https://gerrit.wikimedia.org/r/c/operations/puppet/+/636652
[11:25:06] jbond42: nice!
[11:28:06] lovely
[12:34:36] if we have a cname pointing to a production host, is there a mechanism for getting that cname included as a SAN in the puppet cert for that host?
[12:57:32] jbond42: nice, thanks!
[12:59:48] kormat: in theory you can set `profile::base::puppet::dns_alt_names` but i don't think it's used anywhere and it probably requires manual steps to regenerate the cert
[13:00:16] jbond42: mm, ok. thanks.
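A rough sketch of what the `profile::base::puppet::dns_alt_names` approach above could look like in practice. Only the hiera key name comes from the conversation; the value format, the `zarcillo.eqiad.wmnet` CNAME, and the re-signing commands are assumptions (Puppet 5-era syntax; Puppet 6+ uses `puppetserver ca` instead), so treat it as an illustration rather than a verified procedure.

    # hiera for the host (value format assumed; presumably this ends up as
    # dns_alt_names in the agent's puppet.conf):
    #   profile::base::puppet::dns_alt_names:
    #     - 'zarcillo.eqiad.wmnet'    # hypothetical CNAME for the active server
    #
    # re-signing, roughly: revoke the old cert on the CA, wipe the agent's SSL
    # state, request a new cert, and sign it allowing alt names.
    puppet cert clean db1115.eqiad.wmnet                       # on the puppetmaster/CA
    rm -rf /var/lib/puppet/ssl                                 # on the agent (path depends on packaging)
    puppet agent --test                                        # agent submits a new CSR including the SANs
    puppet cert sign --allow-dns-alt-names db1115.eqiad.wmnet  # back on the CA
    puppet agent --test                                        # agent fetches the signed cert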
[14:59:49] <_joe_> kormat: what is the problem you're trying to solve?
[15:00:29] <_joe_> btw the alt names stuff was used in the past, we've just kinda moved away
[15:00:46] <_joe_> because we prefer to declare certs in code as much as possible via cergen
[15:01:05] _joe_: currently there's a bunch of scripts that have `ZARCILLO=db1115` at the top. it would be nice if we could have a `zarcillo` cname that points to the active server. but this will break ssl without a SAN
[15:01:05] <_joe_> esp because you can create ecdsa certs in that case, which is a great perf advantage
[15:01:38] <_joe_> kormat: so that's for connecting to mysql, correct?
[15:01:43] yes
[15:01:55] <_joe_> so you want the cert to contain both the local hostname AND the SAN, correct?
[15:01:59] correct
[15:02:12] <_joe_> so until jbond42 gifts us the next great thing
[15:02:24] <_joe_> I would go with profile::base::puppet::dns_alt_names
[15:02:36] <_joe_> but yes, it means you'll have to re-sign the hosts in puppet
[15:02:41] <_joe_> and remove their current cert
[15:03:10] that doesn't sound awful
[15:05:10] <_joe_> kormat: not 100% sure though, maybe you'd be better off managing them with cergen like everything else
[15:05:43] <_joe_> kormat: actually sorry, we were leading you off base I think
[15:07:44] worry not. it's stevie-in-2-weeks' problem :)
[15:08:00] <_joe_> ack :)
[16:56:12] last maps update for a while, promise (does it even matter now that paging is fixed?): eqiad is stable and has a properly synced OSM database. codfw's sick node (maps2002) has been replaced with maps2005, the cluster is at full strength, and all codfw hosts have regained some disk
[16:56:45] awesome, thanks hnowlan!
[17:08:43] jbond42: thank you again for adding the 'full diff' feature to pcc <3
[17:08:45] <3
[17:12:25] np :)
[18:52:07] I'm seeing mw1379 as pooled, but alerting with status 503 for a couple of days now. Anyone working on this host?
[18:53:47] oh, good catch
[18:54:43] https://logstash.wikimedia.org/goto/328d1b73c5e60e068cd52e62360e06f5
[18:54:46] does not really look healthy
[18:55:11] it's depooled now
[18:55:21] thanks!
[18:55:35] I'll file a ticket
[18:59:06] ... huh, I didn't expect a restart-php7.2-fpm to clear things, but it looks like it did
[18:59:58] it renders the enwiki main page fine now, too
[19:00:26] repooling
[19:01:13] cool :)
[19:03:37] hmmm.
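For reference, the mw1379 depool / restart / repool sequence above, as a rough shell sketch run on the affected appserver. Only `restart-php7.2-fpm` is named in the log; the `depool`/`pool` helpers and the curl spot-check are assumptions about the local tooling rather than a transcript of what was actually run.

    sudo depool                          # take mw1379 out of the serving pools
    sudo restart-php7.2-fpm              # the restart wrapper mentioned above
    # spot-check that the enwiki main page renders before putting it back
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Host: en.wikipedia.org' 'http://localhost/wiki/Main_Page'
    sudo pool                            # repool once it looks healthy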
[20:02:59] effie: regarding memc-on-host-tier, do I understand correctly that the route exists on all appservers today, but is transparent/an alias to the default behaviour on all but debug/canary?
[20:03:22] so we'll roll it out by host from mcr/puppet config rather than e.g. by wiki or by host from mw config
[20:03:36] and use appserver latency metrics or some such to gauge parser cache perf impact?
[20:05:33] in terms of capacity, how do they compare to the main cluster - is there a doc or task with more details on e.g. how it is configured and/or what we're thinking of tuning? the tasks I'm finding are mostly about how it works internally etc, which we're almost done with :)
[20:12:31] Krinkle: so the route exists on all servers, but right now it is exactly the same as /*/mw
[20:12:58] but we have enabled onhost cache on mwdebug1001 and one codfw server
[20:13:28] where this route checks the local memcached first and on a miss fetches from the memcache cluster
[20:14:09] what we are looking for is basically to reduce Rx traffic from the memcached cluster
[20:14:34] in my tests I didn't observe a significant latency improvement
[20:15:40] now in terms of capacity
[20:16:10] I looked a bit at RAM usage across our mediawiki clusters
[20:16:37] so we will start by dedicating 1/4 of a server's RAM to onhost memcached
[20:16:58] after we roll out to canaries (1 canary on monday for starters)
[20:17:28] we will have a better idea if it works or not
[20:17:58] as far as docs go, we have the meeting notes from our meetings
[20:18:46] and the tasks
[20:19:18] https://phabricator.wikimedia.org/T263958
[20:19:32] https://phabricator.wikimedia.org/T244340
[20:19:57] Krinkle: does that answer your questions?
[20:24:24] oh, last but not least
[20:24:41] when we deploy the mediawiki config change
[20:25:28] it will take effect only on servers with the onhost feature flag enabled
[20:25:40] so we can control the rollout via puppet
[20:39:51] * Krinkle had to step out for a minute
[20:40:37] effie: thanks, perfect.
[20:40:51] :)
[20:40:58] effie: ack, I don't expect improvements, although it'd be a nice win if it does. we're actually more concerned about latency increases.
[20:42:58] we'll see!
[20:43:29] once we're through all the stages incl wanobjectcache (which receives many more calls within a web req), I'd like to toggle it all back off as well to compare side by side.
[20:44:27] for now I think it's just a matter of monitoring for functional stuff through each stage, parser cache tends to be called at most once in a whole web req so that should be negliclble
[20:44:43] where the spellling of negligible is also considered negligible
[21:00:27] lol
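Once a canary has the onhost tier enabled, a quick way to see whether the local memcached is actually taking reads is the same trick used on mc2036 earlier in the day. A minimal sketch, assuming the on-host instance listens on the standard port 11211 on localhost (adjust host/port if it is bound elsewhere) and is a memcached recent enough to support `watch`:

    # stream a few of the keys currently being fetched from the local instance
    # (-q5 and head are only there so the watch terminates on its own)
    echo "watch fetchers" | nc -q5 localhost 11211 | head -n 20
    # compute the local get hit ratio from the stats output, the same number
    # tracked for mc1036 above (0.9566)
    echo stats | nc -q1 localhost 11211 \
      | awk '/STAT get_hits /{h=$3} /STAT get_misses /{m=$3} END{printf "local hit ratio: %.4f\n", h/(h+m)}'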