[05:00:40] <_joe_> the problem with that page is - how is a single wdqs server a backend for some front-facing service?
[05:02:20] <_joe_> just read the task and I truly *hate* the idea we have a service with no uptime guarantees *and without aggressive timeouts* served by the same CDN that needs to serve our website
[05:03:10] <_joe_> maybe we need to think of reinstating the old "misc" cache cluster, served only from eqiad/codfw on dedicated servers, a pair per dc is probably enough, where we put all the "important but no uptime guarantees" stuff
[05:03:39] <_joe_> and keep the configuration of those *simple*, like we only use trafficserver, in the simplest config we can get away with
[07:22:29] In case people are not seeing this, phaultfinder is creating a lot of tickets for all sorts of hosts for mgmt not being reachable
[07:31:06] cc claime (is on clinic duty)
[07:35:52] of the five servers I checked all of them are in codfw/row D, so this might be some kind of network issue
[08:45:34] topranks, XioNoX: ^ when you have a moment, could you have a look please?
[08:46:20] * topranks looking sry just online now
[08:48:05] no rush, it's just the mgmts
[09:12:21] yeah I was more worried about the VC-link in eqiad row C, though it settled down a while ago
[09:22:09] thanks for the heads-up effie
[09:27:36] jnuche: -operations is too noisy, train is done to my understanding ?
[09:28:57] effie: yeah, done and looking healthy
[09:29:59] grand thank you
[09:35:35] "grand" - so Irish I love it :)
[09:51:30] haha
[09:52:19] I noted it in the incident doc yesterday but we might need to have a conversation about continuing to use -operations as the default place to discuss stuff
[09:52:40] I know traditionally it's where everything happens, but "move to -sre" in the midst of incident response is a bit jarring
[09:54:01] I think we need to clean up -operations
[09:54:15] like having all the merges log in the channel is not... great
[09:54:47] I think we're a bit more lax about flapping alerts than we used to be too
[09:55:09] probably yeah
[09:58:22] FYI, I'm disabling Puppet in codfw and the edges for approx 15 mins for a Puppet server reboot
[10:02:44] claime: hashar started a task about cleaning up -operations iirc, but some of the comments were a bit negative so it got closed
[10:05:32] claime: https://phabricator.wikimedia.org/T384804
[10:08:23] p858snake|cloud: ty
[12:13:30] hnowlan: for what it's worth, we moved the conversation to sre quite early, ie there were only a couple of comments by the deployer that this is prolly related to the ongoing deployment
[12:14:32] cleaning up -operations is a lot of work, while making -sre the de facto channel is easier
[12:15:17] when the teams were smaller and everything was smaller, -operations made sense, now it is unrealistic to expect anyone to be able to keep up with this SNR
[12:17:34] at least, if -sre were the de facto channel, there is a chance, with a wee bit of scrolling up, to know if something happened the day/night before or not, while even with the best hoovering in the world of -operations, it would still be a no-go
[12:19:35] what we could consider adding here, would be echoing cumin executions and confctl changes
[12:20:20] and potentially a bot that could !log from here to SAL and fw the message to -operations too
[14:00:10] the main problem is that -sre is a silo, most devs and deployers aren't here. We could change the culture around that but it's a bit of work
[14:06:09] <_joe_> I think we can experiment with moving wikibugs to a secondary channel we all use as an actual feed whenever we are interested in it
[14:06:30] <_joe_> it's useful to have SAL and alerts in a channel we're discussing incidents in
[14:06:55] <_joe_> even if I keep reiterating I think we should start doing calls during incidents
[14:08:32] agreed on having the context in the discussion
[14:09:33] I'm not opposed to having calls but I think they're terrible for context loading so the IC needs to be extremely focused on ICing (which, as we've previously discussed, is not a bad thing)
[14:12:42] when deploying I find it very useful to have wikibugs report updates for the patches being deployed, but yeah, some of that could be sent elsewhere
[14:13:08] the other thing imho worth considering is whether some of the alert noise on -operations could be sent to individual teams or not be sent to IRC at all
[14:20:58] <_joe_> taavi: more of the former maybe, but alerts are useful during incident response. no solution will be perfect i think
[14:22:48] fwiw as noted on the ticket, we actually *haven't* seen an increase in overall line count in -operations. However I think the quality of the output has definitely decreased
[14:24:55] there are alerts that flap on a daily basis without anyone taking action. I think I remember observability doing a bit audit/cleanup of alerts a year or two ago that improved things a lot
[15:28:35] <_joe_> yeah well :)
[15:28:47] <_joe_> I mean it's a much longer discussion
[15:29:27] yeah, just chumming the waters
[16:57:01] legoktm: taavi just fyi. I added you as reviewers of adding paws to codesearch https://gerrit.wikimedia.org/r/c/labs/codesearch/+/1165049 (let me know if I should tag someone else)