[04:09:09] dpifke: looking at stuff in-flight, xhgui prometheus
[04:09:19] checking /metrics, works in beta, but times out in prod?
[04:09:27] might be due to a slow query?
[04:10:53] currently it gets cut off by the web proxy, so it might actually succeed internally. if so, I suppose that's fine for now, but we might want to make it private in that case, e.g. denying from apache like we do for POST /import etc. does prometheus crawl directly from its central place, or is there a local daemon, so e.g. 127/::1 would be fine?
[04:11:17] or I suppose maybe denying from public apache entirely which would not need a whitelist.
[04:15:12] caught it at https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1107&hours=1
[04:15:30] SELECT COUNT(*) AS profiles, MAX(request_ts) AS latest, SUM(LENGTH(profile)) AS bytes FROM xhgui /* db1107 xhgui 65s */
[04:26:24] commented at T256039 for now
[04:26:25] T256039: Prometheus exporter for XHGui - https://phabricator.wikimedia.org/T256039
[04:26:30] nn o/
[13:18:51] Krinkle: https://drive.google.com/file/d/1WzvqRNEpg9ndTq9RLHSzCH43vAOiQnUn/view
[13:33:17] phedenskog: I can't find the video of your we love speed talk online, do you have a link>
[13:33:18] ?
[13:41:50] ah, found it
[13:46:39] I've tried to compile all of our talks' videos into a folder on the team drive, let me know if I'm missing any
[13:46:46] https://drive.google.com/drive/folders/1V51ctkoSGndOXT4pWKBHpxHBPoRp-mIM
[13:51:22] ori's old video from 2014 about the state of performance back then is interesting. apparently we were at 5 seconds median PLT on mobile at that point. today we're around 1
[14:29:56] Aha cool
[15:14:01] gilles: marvellous, thank you!
[16:06:19] https://blog.cloudflare.com/browser-beta/
[19:20:35] AaronSchulz: what is the current state of redis replication in prod? is it something we do ad-hoc by hand during switch overs? or something that mostly kinda works by itself but with known issues during a switch over?
what I'm really asking is - would it be expected that the codfw switchover effectively wiped its data?
[19:20:50] and does this mean random old values will pop back into existence when we switch back?
[19:21:29] * Krinkle digging through cookbooks
[19:21:35] ( \cc cdanis in case you know the answer )
[19:40:27] I do not, but rzl might
[19:47:53] we explicitly flip the direction of replication during the read-only period of the switch, I think it's step 5 or so? wikitech:Switch_datacenter will have a pointer but I'm on my phone atm
[19:48:43] I believe that means we can expect redis to be pretty much caught up
[19:49:30] rzl: thx, ok, I'll rule that out for now then, thanks!
[22:26:46] Krinkle: yeah, there is the master/replica musical chair dance as part of DC switchover steps. What problem are you looking at?
[22:28:33] AaronSchulz: about depooling a redis server for buster upgrade
[22:28:46] e.g. whether loss of a shard is already the norm or not.
[22:29:13] which I understand now, it is not. just wanted to be sure whether it's worth looking into or not. since that can be a rather deep rabbit hole.
[23:44:57] Krinkle: btw, did we forget about https://gerrit.wikimedia.org/r/c/mediawiki/core/+/589465 ?
[23:46:29] AaronSchulz: I forgot what it connects to currently?
[23:46:36] as an objective I mean
[23:46:40] Krinkle: key size metrics
[23:46:47] (by key class)
[23:47:25] hm.. I thought that was to help inform multi-dc main stash decisions, which we've already done since, right?
[23:58:30] it's still useful in general