[05:00:40] <_joe_>	 the problem with that page is - how is a single wdqs server a backend for some front-facing service?
[05:02:20] <_joe_>	 just read the task and I truly *hate* the idea we have a service with no uptime guarantees *and without aggressive timeouts* served by the same CDN that needs to serve our website
[05:03:10] <_joe_>	 maybe we need to think of reinstating the old "misc" cache cluster, served only from eqiad/codfw on dedicated servers, a pair per dc is probably enough, where we put all the "important but no uptime guarantees" stuff
[05:03:39] <_joe_>	 and keep the configuration of those *simple*, like we only use trafficserver, in the simplest config we can get away with
[07:22:29] <Amir1>	 In case people are not seeing this, phaultfinder is creating a lot of tickets for all sorts of hosts for mgmt not being reachable
[07:31:06] <effie>	 cc claime (is on clinic duty)
[07:35:52] <moritzm>	 of the five servers I checked all of them are in codfw/row D, so this might be some kind of network issue
[08:45:34] <moritzm>	 topranks, XioNoX: ^ when you have a moment, could you have a look please?
[08:46:20] * topranks looking sry just online now 
[08:48:05] <moritzm>	 no rush, it's just the mgmts
[09:12:21] <topranks>	 yeah I was more worried about the VC-link in eqiad row C, though it settled down a while ago 
[09:22:09] <claime>	 thanks for the heads-up effie
[09:27:36] <effie>	 jnuche: -operations is too noisy, train is done to my understanding ?
[09:28:57] <jnuche>	 effie: yeah, done and looking healthy
[09:29:59] <effie>	 grand thank you
[09:35:35] <topranks>	 "grand" - so Irish I love it :) 
[09:51:30] <effie>	 haha
[09:52:19] <hnowlan>	 I noted it in the incident doc yesterday but we might need to have a conversation about continuing to use -operations as the default place to discuss stuff
[09:52:40] <hnowlan>	 I know traditionally it's where everything happens, but "move to -sre" in the midst of incident response is a bit jarring 
[09:54:01] <claime>	 I think we need to clean up -operations
[09:54:15] <claime>	 like having all the merges log in the channel is not... great
[09:54:47] <hnowlan>	 I think we're a bit more lax about flapping alerts than we used to be too
[09:55:09] <claime>	 probably yeah
[09:58:22] <moritzm>	 FYI, I'm disabling Puppet in codfw and the edges for approx 15 mins for a Puppet server reboot
[10:02:44] <p858snake|cloud>	 claime: hashar started a task about cleaning up -operations iirc, but some of the comments were a bit negative so it got closed
[10:05:32] <p858snake|cloud>	 claime: https://phabricator.wikimedia.org/T384804
[10:08:23] <claime>	 p858snake|cloud: ty
[12:13:30] <effie>	 hnowlan: for what it's worth, we moved the conversation to sre quite early, ie there were only a couple of comments by the deployer that this is prolly related to the ongoing deployment
[12:14:32] <effie>	 cleaning up -operations is a lot of work, while making -sre the de facto channel is easier 
[12:15:17] <effie>	 when the teams were smaller and everything was smaller, -operations made sense, now it is unrealistic to expect anyone to be able to keep up with this SNR
[12:17:34] <effie>	 at least, if -sre were the de facto channel, there is a chance with a wee bit of scrolling up, to know if something happened the day/night before or not, while even with the best hoovering in the world of -operations, it would still be a no-go
[12:19:35] <effie>	 what we could consider adding here, would be echoing cumin executions and confctl changes 
[12:20:20] <effie>	 and potentially a bot that could !log from here to SAL and fw the message to -operations too 
[14:00:10] <hnowlan>	 the main problem is that -sre is a silo, most devs and deployers aren't here. We could change the culture around that but it's a bit of work 
[14:06:09] <_joe_>	 I think we can experiment with moving wikibugs to a secondary channel we all use as an actual feed whenever we are interested in it
[14:06:30] <_joe_>	 it's useful to have SAL and alerts in a channel we're discussing incidents in
[14:06:55] <_joe_>	 even if I keep reiterating I think we should start doing calls during incidents
[14:08:32] <hnowlan>	 agreed on having the context in the discussion
[14:09:33] <hnowlan>	 I'm not opposed to having calls but I think they're terrible for context loading so the IC needs to be extremely focused on ICing (which, as we've previously discussed, is not a bad thing) 
[14:12:42] <taavi>	 when deploying I find it very useful to have wikibugs report updates for the patches being deployed, but yeah, some of that could be sent elsewhere
[14:13:08] <taavi>	 the other thing imho worth considering is whether some of the alert noise on -operations could be sent to individual teams or not be sent to IRC at all
[14:20:58] <_joe_>	 taavi: more of the former maybe, but alerts are useful during incident response. no solution will be perfect i think
[14:22:48] <hnowlan>	 fwiw as noted on the ticket, we actually *haven't* seen an increase in overall line count in -operations. However I think the quality of the output has definitely decreased 
[14:24:55] <hnowlan>	 there are alerts that flap on a daily basis without anyone taking action. I think I remember observability doing a big audit/cleanup of alerts a year or two ago that improved things a lot
[15:28:35] <_joe_>	 yeah well :)
[15:28:47] <_joe_>	 I mean it's a much longer discussion
[15:29:27] <hnowlan>	 yeah, just chumming the waters 
[16:57:01] <dcaro>	 legoktm: taavi just fyi. I added you as reviewers of adding paws to codesearch https://gerrit.wikimedia.org/r/c/labs/codesearch/+/1165049 (let me know if I should tag someone else)