[04:09:09] <Krinkle>	 dpifke: looking at stuff in-flight, xhgui prometheus
[04:09:19] <Krinkle>	 checking /metrics, works in beta, but times out in prod?
[04:09:27] <Krinkle>	 might be due to a slow query?
[04:10:53] <Krinkle>	 currently it gets cut off by the web proxy, so it might actually succeed internally. if so, I suppose that's fine for now, but we might want to make it private in that case, e.g. denying from apache like we do for POST /import etc. does prometheus crawl directly from its central place, or is there a local daemon, so e.g. 127/::1 would be fine?
[04:11:17] <Krinkle>	 or I suppose maybe denying from public apache entirely which would not need a whitelist.
[04:15:12] <Krinkle>	 caught it at https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1107&hours=1
[04:15:30] <Krinkle>	 SELECT COUNT(*) AS profiles, MAX(request_ts) AS latest, SUM(LENGTH(profile)) AS bytes FROM xhgui /* db1107 xhgui 65s */
[04:26:24] <Krinkle>	 commented at T256039 for now
[04:26:25] <stashbot>	 T256039: Prometheus exporter for XHGui - https://phabricator.wikimedia.org/T256039
[04:26:30] <Krinkle>	 nn o/
[13:18:51] <gilles>	 Krinkle: https://drive.google.com/file/d/1WzvqRNEpg9ndTq9RLHSzCH43vAOiQnUn/view
[13:33:17] <gilles>	 phedenskog: I can't find the video of your we love speed talk online, do you have a link?
[13:41:50] <gilles>	 ah, found it
[13:46:39] <gilles>	 I've tried to compile all of our talks' videos into a folder on the team drive, let me know if I'm missing any
[13:46:46] <gilles>	 https://drive.google.com/drive/folders/1V51ctkoSGndOXT4pWKBHpxHBPoRp-mIM
[13:51:22] <gilles>	 ori's old video from 2014 about the state of performance back then is interesting. apparently we were at 5 seconds median PLT on mobile at that point. today we're around 1
[14:29:56] <phedenskog>	 Aha cool
[15:14:01] <Krinkle>	 gilles: marvellous, thank you!
[16:06:19] <gilles>	 https://blog.cloudflare.com/browser-beta/
[19:20:35] <Krinkle>	 AaronSchulz: what is the current state of redis replication in prod? is it something we do ad-hoc by hand during switch overs? or something that mostly kinda works by itself but with known issues during a switch over? what I'm really asking is - would it be expected that the codfw switchover effectively wiped its data?
[19:20:50] <Krinkle>	 and does this mean random old values will pop back into existence when we switch back?
[19:21:29] * Krinkle digging through cookbooks
[19:21:35] <Krinkle>	 ( \cc cdanis  in case you know the answer )
[19:40:27] <cdanis>	 I do not, but rzl might
[19:47:53] <rzl>	 we explicitly flip the direction of replication during the read-only period of the switch, I think it's step 5 or so? wikitech:Switch_datacenter will have a pointer but I'm on my phone atm
[19:48:43] <rzl>	 I believe that means we can expect redis to be pretty much caught up
[19:49:30] <Krinkle>	 rzl: thx, ok, I'll rule that out for now then, thanks!
[22:26:46] <AaronSchulz>	 Krinkle: yeah, there is the master/replica musical chair dance as part of DC switchover steps. What problem are you looking at?
[22:28:33] <Krinkle>	 AaronSchulz: about depooling a redis server for buster upgrade
[22:28:46] <Krinkle>	 e.g. whether loss of a shard is already the norm or not.
[22:29:13] <Krinkle>	 which I understand now, it is not. just wanted to be sure whether it's worth looking into or not. since that can be a rather deep rabbit hole.
[23:44:57] <AaronSchulz>	 Krinkle: btw, did we forget about https://gerrit.wikimedia.org/r/c/mediawiki/core/+/589465 ?
[23:46:29] <Krinkle>	 AaronSchulz: I forgot what it connects to currently?
[23:46:36] <Krinkle>	 as an objective I mean
[23:46:40] <AaronSchulz>	 Krinkle: key size metrics
[23:46:47] <AaronSchulz>	 (by key class)
[23:47:25] <Krinkle>	 hm.. I thought that was to help inform multi-dc main stash decisions, which we've already done since, right?
[23:58:30] <AaronSchulz>	 it's still useful in general