[00:50:14] Traffic, DNS, Matrix, Operations, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (CDanis) Hey, sorry for the delay, we should be able to deploy this tomorrow.
[05:08:26] netops, Operations, ops-eqiad, User-Kormat, User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (Marostegui)
[07:09:35] <_joe_> dear traffic people, I submit https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/631686 for your review
[08:11:48] _joe_: lgtm, though if we end up having to change the value various times it might be worth turning it into a config setting
[08:27:21] <_joe_> ema: I'm too lazy for that
[08:27:26] <_joe_> but agreed :P
[08:40:23] hi traffic - as lvs1016 seems "not so broken" anymore today. I would like to continue and remove another two LVS endpoint when you're okay with that
[08:43:10] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles)
[08:43:44] https://phabricator.wikimedia.org/T264398
[08:46:40] jayme: lvs1016 should be fine now judging from T264227, so +1
[08:46:41] T264227: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227
[08:48:41] gilles: we did move from varnish 5 to varnish 6 during the last couple of weeks
[08:49:10] gilles: see T263557 for the timeline of upgrades
[08:49:11] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[08:49:12] ema: yeah, I figured that would be a suspect. were any new features enabled, or is the config identical?
[08:50:32] the configuration (as in VCL) hasn't changed, but yeah there have been code changes
[08:52:02] some responseStart increases during the rolling restarts are normal due to caches getting emptied, you'd expect things to get back to as they were before after a while though
[08:52:55] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) The most logical suspect is the rollout of Varnish 6 {T263557}
[08:53:03] gilles: I was looking at response-time-by-host during the upgrades and it seemed to me that things were alright
[08:53:37] yeah I saw some earlier alerts during the rollout and figured it would do that
[08:53:42] but it's stayed elevated
[08:53:53] the caches should be warm by now
[08:54:01] definitely
[08:56:34] gilles: can you see the regression on individual hosts too?
[08:58:16] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) [[ https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1 | It seems to affect North America before Europe ]], and the timing lines up with the ro...
[08:58:28] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) p:Triage→High
[08:58:29] I'll look
[08:58:41] I've update the task with geographical correlation
[08:58:46] *updated
[08:58:50] thanks
[09:00:22] would it be hard to roll it back on eg. half of esams?
[09:00:26] to verify
[09:00:57] nope, I'd rather identify one single host and roll back only there if possible though!
[09:01:45] s/identify/identify the issue on/
[09:01:53] I think the per-host dashboard is misleading because it's comparing to previous day
[09:01:56] let me switch to previous week
[09:01:59] ack
[09:12:14] ema: the difference is easier to see when looking at all hosts: https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?orgId=1&var-dc=esams&var-host=All
[09:12:41] there are a lot of spikes on individual hosts that make the difference harder to see outside of spikes
[09:12:48] but hovering you can check the values for sure
[09:13:39] there's a performance issue in loading the dashboard! :D
[09:18:36] gilles: I don't really see any clear difference on, for example, cp3054 (upgraded 2020-09-29 14:24:08) https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?viewPanel=5&orgId=1&var-dc=esams&var-host=cp3054
[09:23:00] I know, but the pattern is visible for the DC as a whole, right?
[09:28:52] gilles: it is on navigation-timing-by-continent, not on response-time-by-host
[09:28:58] or at least I don't see it :)
[09:29:29] This link: https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?orgId=1&var-dc=esams&var-host=All
[09:35:06] I don't know if I need a workstation refresh, but the heatmap is killing my chrome
[09:39:26] gilles: ok so I added https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?viewPanel=6&orgId=1&var-dc=esams&var-host=All
[09:39:43] that would be "now - last_week"
[09:40:58] and yeah I can see a pattern, though not very clearly to be honest
[09:41:10] anyways, let's downgrade some nodes next week and see!
[09:49:45] gilles: it would be good to have a sort of SLO on which to base decisions I think, rather than "I see a pattern" :)
[09:51:43] but that's a broader conversation that needs to happen between performance and sre, nothing we're gonna solve today
[10:33:47] ema: I've improved the dashboard, check it out
[10:34:11] looking at 7 days a 5m rolling average was too small, this is no 1h and it makes the difference stand out clearly
[10:34:18] *on 1h now
[10:35:34] I've also made the heatmap collapsed by default which should help with the dashboard perf ;)
[10:36:20] you can see the difference on individual hosts now
[10:44:58] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) I've improved the per-DC/host dashboard: https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host The change is clearly visible on Esams on 2020-09-29 and on Eqsin...
[10:45:19] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) a:ema
[10:46:17] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) @ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify that this is what's causing the issue.
[11:07:38] gilles: I can indeed!
[11:12:13] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6512038, @Gilles wrote: > @ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify t...
[13:01:41] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (BBlack) Just throwing in some random points/counterpoints to ponder: * It's possible it does take more than a day or three for the frontend caches to settle into an optimal patter...
[13:12:19] Traffic, Operations, Performance-Team: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (BBlack) Eh maybe a few more to think about too: * The train this week caused a fair amount of churn with the rollout + rollback of 1.36.0-wmf.11. Is there any chance the train i...
[14:47:25] Traffic, DNS, Matrix, Operations, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (CDanis) This is live now @bcampbell -- have Element give it a shot and let us know?
[15:32:22] Traffic, DNS, Matrix, Operations, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (bcampbell) Thanks all. It's working. https://federationtester.matrix.org/#foundation.wikimedia.org
[15:33:13] Traffic, DNS, Matrix, Operations, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (CDanis) Open→Resolved 🎉
[20:06:37] Traffic, Operations, Patch-For-Review: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (ssingh)