[07:21:30] <Emperor>	 probably-stupid question: the envoy telemetry dashboard has changed, and now it doesn't seem possible to get it to show me just swift? https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=now-1h&to=now&timezone=browser&var-datasource=000000005&var-site=codfw&var-origin=$__all&var-origin_instance=$__all&var-destination=$__all
[07:21:48] <Emperor>	 origin cluster / instance both only allow selecting "all" or nothing.
[07:22:24] <Emperor>	 For a long time I've used this dashboard to check on the envoy/swift state in eqiad/codfw and it now looks like that's no longer possible/easy?
[07:49:25] <elukey>	 hello folks, I am going to upgrade aux-k8s-eqiad in a few
[07:52:33] <elukey>	 Emperor: o/ I see in the history that ha*shar added the thanos source and reworked a little the dashboard in late March, I can see the dropdown if I select "thanos" and not "codfw"
[07:59:26] <Emperor>	 ah, thank you, that's helpful (if a little non-obvious!)
[09:20:24] <elukey>	 jelto, godog - confd status cleared on dns hosts, we should be good, apologies again for the noise
[09:20:56] <jelto>	 ack thanks for checking and for cleaning this up :)
[09:39:37] <godog>	 elukey: \o/
[12:24:39] <claime>	 Emperor: re: swift in service proxy, do you think it'd be reasonable to also add a retry_policy on 5xx?
[12:27:19] <Emperor>	 claime: I kind of feel that should be a client decision? [where client might be mediawiki]
[12:27:57] <Emperor>	 e.g. if the rise in 5xx is because we're being scraped/DoS'd, retrying will make matters worse
[12:28:05] <claime>	 Fair
[12:44:48] <elukey>	 trying to upgrade aux-k8s-eqiad to 1.31 :)
[13:11:19] <elukey>	 updated!
[13:11:43] <elukey>	 federico3: something to report - zarcillo.wikimedia.org seems broken to me, not sure if anything needs to happen after the upgrade or not
[13:12:10] <federico3>	 oh indeed, looking
[13:16:48] <federico3>	 the timing is suspicious but after the upgrade it was running - I suspect it simply lost connection to the database, maybe something in egress
[13:19:34] <elukey>	 ah ok now it works
[13:19:39] <elukey>	 sorry I thought it was totally down
[13:19:44] <Amir1>	 I don't know who owns aux-k8s but we are getting bonkers amount of logs from it suddenly https://logstash.wikimedia.org/goto/d323ce516658d2cedbaaf94f90be1280
[13:20:01] <elukey>	 my team, I just did the upgrade
[13:20:02] <elukey>	 lemme check
[13:20:12] <federico3>	 elukey: how long ago?
[13:20:47] <elukey>	 federico3: say 10/15 mins, I checked right after the upgrade
[13:20:52] <volans>	 certificate issue?
[13:21:48] <elukey>	 it seems jaeger yes
[13:21:49] <Amir1>	 thanks for looking into it
[13:22:30] <elukey>	 very diplomatic
[13:22:39] <elukey>	 I fear the real thoughts behind that sentence
[13:22:53] <elukey>	 "$%$%!!!!$#$#@$@$ Luca !#@!@#!@$!@#!#!#"
[13:23:22] <XioNoX>	 nah that's his password
[13:23:30] <Emperor>	 haha
[13:23:41] <Amir1>	 :D
[13:24:16] <marostegui>	 it could also be swearing in italian
[13:24:31] <jynus>	 marostegui: don't confuse swearing with gesturing
[13:24:38] <elukey>	 :D
[13:24:52] <marostegui>	 🤌
[13:24:56] <elukey>	 so I see the istio ingress gw complaining
[13:24:57] <elukey>	 upstream_reset_before_response_started{remote_connection_failure|TLS_error:|268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end}
[13:28:39] <elukey>	 so jaeger-collector-grpc.svc.eqiad.wmnet's TLS cert seems off
[13:28:40] <elukey>	 mmm
[14:20:19] <elukey>	 Amir1: should be better now, there was a problem with the SANs
[14:23:19] <Amir1>	 Thank you <3
[14:49:32] <sukhe>	 elukey: hi! just checking to make sure nothing was missing on our end. I still can't parse it properly, what happened with gdnsd check conf failing?
[14:51:54] <elukey>	 sukhe: o/ mostly my bad - I committed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268902 and then I used the sre k8s pool/depool cookbook, which as a consequence tried to set codfw pooled for k8s-ingress-aux-rw, but puppet hadn't run yet on the dns servers. gdnsd didn't like it much, so I reverted the change, ran puppet and cleaned up the stale confd leftovers.
[14:52:30] <sukhe>	 ah ok!
[14:52:41] <sukhe>	 no worries then, was just checking if something is missing on our end or in the docs
[14:52:46] <sukhe>	 since I also managed to trigger this two weeks ago :)
[14:54:40] <claime>	 elukey: Hmm, what's the idea behind putting aux-k8s-rw as a/a?
[14:57:36] <elukey>	 claime: I probably misunderstood its meaning, I thought it was a leftover from when the aux cluster was only in eqiad. 
[14:57:54] <claime>	 the usual convention is that -ro is a/a and has both DCs pooled, and -rw is a/p and has only the main DC pooled
[14:57:58] <elukey>	 os-reports for example (read-only) was a cname for the rw endpoint, etc..
[14:58:08] <elukey>	 yeah I didn't have that bit
[14:58:15] <elukey>	 os-report has been moved to -ro
[14:58:18] <claime>	 That way you get dc-local using the -ro name
[14:58:18] <elukey>	 (thanks Jelto)
[14:58:26] <claime>	 And main-dc using the -rw name
[14:58:33] <claime>	 (makes sense?)
[14:59:39] <claime>	 We could make it more explicit, I'm not even sure we actually documented that convention anywhere