[07:21:30] probably-stupid question: the envoy telemetry dashboard has changed, and now it doesn't seem possible to get it to show me just swift? https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=now-1h&to=now&timezone=browser&var-datasource=000000005&var-site=codfw&var-origin=$__all&var-origin_instance=$__all&var-destination=$__all
[07:21:48] origin cluster / instance both allow to select "all" or nothing.
[07:22:24] For a long time I've used this dashboard to check on the envoy/swift state in eqiad/codfw and it now looks like that's no longer possible/easy?
[07:49:25] hello folks, I am going to upgrade aux-k8s-eqiad in a few
[07:52:33] Emperor: o/ I see in the history that ha*shar added the thanos source and reworked a little the dashboard in late March, I can see the dropdown if I select "thanos" and not "codfw"
[07:59:26] ah, thank you, that's helpful (if a little non-obvious!)
[09:20:24] jelto, godog - confd status cleared on dns hosts, we should be good, apologies again for the noise
[09:20:56] ack thanks for checking and for cleaning this up :)
[09:39:37] elukey: \o/
[12:24:39] Emperor: re: swift in service proxy, do you think it'd be reasonable to also add a retry_policy on 5xx?
[12:27:19] claime: I kindof feel that should be a client decision? [where client might be mediawiki]
[12:27:57] e.g. if the rise in 5xx is because we're being scraped/DoS'd, retrying will make matters worse
[12:28:05] Fair
[12:44:48] trying to upgrade aux-k8s-eqiad to 1.31 :)
[13:11:19] updated!
[13:11:43] federico3: something to report - zarcillo.wikimedia.org seems broken to me, not sure if anything needs to happen after the upgrade or not
[13:12:10] oh indeed, looking
[13:16:48] the timing is suspicious but after the upgrade it was running - I suspect it simply lost connection to the database, maybe something in egress
[13:19:34] ah ok now it works
[13:19:39] sorry I thought it was totally down
[13:19:44] I don't know who owns aux-k8s but we are getting bonkers amount of logs from it suddenly https://logstash.wikimedia.org/goto/d323ce516658d2cedbaaf94f90be1280
[13:20:01] my team, I just did the upgrade
[13:20:02] lemme check
[13:20:12] elukey: how long ago?
[13:20:47] federico3: say 10/15 mins, I checked right after the upgrade
[13:20:52] certificate issue?
[13:21:48] it seems jaeger yes
[13:21:49] thanks for looking into it
[13:22:30] very diplomatic
[13:22:39] I fear the real thoughts behind that sentence
[13:22:53] "$%$%!!!!$#$#@$@$ Luca !#@!@#!@$!@#!#!#|
[13:23:22] nah that's his password
[13:23:30] haha
[13:23:41] :D
[13:24:16] it could also be swearing in italian
[13:24:31] marostegui: don't confuse swaring with gesturing
[13:24:38] :D
[13:24:52] 🤌
[13:24:56] so I see the istio ingress gw complaining
[13:24:57] upstream_reset_before_response_started{remote_connection_failure|TLS_error:|268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end}
[13:28:39] so jaeger-collector-grpc.svc.eqiad.wmnet's TLS cert seems off
[13:28:40] mmm
[14:20:19] Amir1: should be better now, there was a problem with the SANs
[14:23:19] Thank you <3
[14:49:32] elukey: hi! just checking to make sure nothing was missing on our end. I still can't parse it properly, what happened with gdnsd check conf failing?
[14:51:54] sukhe: o/ mostly my bad - I committed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268902 and then I used the sre k8s pool/depool cookbook that had as consequence trying to set codfw pooled for k8s-ingress-aux-rw, but puppet didn't run yet on the dns servers. gdns didn't like it much, I reverted the change, ran puppet and cleaned up the stale confd leftovers.
[14:52:30] ah ok!
[14:52:41] no worries then, was just checking if something is missing on our end or in the docs
[14:52:46] since I also managed to trigger this two weeks ago :)
[14:54:40] elukey: Hmm, what's the idea behind putting aux-k8s-rw as a/a?
[14:57:36] claime: I probably misunderstood its meaning, I thought it was a leftover from when the aux cluster was only in eqiad.
[14:57:54] the usual convention is that -ro is a/a and has both DC pooled, and -rw is a/p and has only the main DC pooled
[14:57:58] os-reports for example (read-only) was a cname for the rw endpoint, etc..
[14:58:08] yeah I didn't have that bit
[14:58:15] os-report has been moved to -ro
[14:58:18] That way you get dc-local using the -ro name
[14:58:18] (thanks Jelto)
[14:58:26] And main-dc using the -rw name
[14:58:33] (makes sense?)
[14:59:39] We could make it more explicit, I'm not even sure we actually documented that convention anywhere
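
Note on the -ro/-rw convention described at 14:57:54: a minimal Python sketch of the rule as stated in the channel (-ro is active/active with both DCs pooled, -rw is active/passive with only the main DC pooled). This is an illustration only, not how the discovery DNS or confd tooling is implemented. The function names, the DATACENTERS list, the choice of eqiad as main DC, and the record name k8s-ingress-aux-ro are assumptions made for the example.

```python
from typing import List, Sequence

# Assumption: the two core datacenters mentioned in the log.
DATACENTERS = ("eqiad", "codfw")


def pooled_datacenters(record: str, main_dc: str,
                       datacenters: Sequence[str] = DATACENTERS) -> List[str]:
    """Return the DCs that should be pooled for a record, per the stated convention.

    -ro records are active/active: every DC is pooled.
    -rw records are active/passive: only the main DC is pooled.
    """
    if main_dc not in datacenters:
        raise ValueError(f"unknown main DC: {main_dc}")
    if record.endswith("-ro"):
        return list(datacenters)
    if record.endswith("-rw"):
        return [main_dc]
    raise ValueError(f"record {record!r} does not follow the -ro/-rw convention")


def resolved_datacenter(record: str, client_dc: str, main_dc: str) -> str:
    """Which DC a client ends up talking to (hypothetical helper).

    A client gets its local DC if that DC is pooled for the record,
    otherwise it is sent to the main DC.
    """
    pooled = pooled_datacenters(record, main_dc)
    return client_dc if client_dc in pooled else main_dc


if __name__ == "__main__":
    # With eqiad assumed as the main DC, a codfw client stays local on -ro
    # and is sent to the main DC on -rw.
    for name in ("k8s-ingress-aux-ro", "k8s-ingress-aux-rw"):
        print(name, pooled_datacenters(name, "eqiad"),
              resolved_datacenter(name, "codfw", "eqiad"))
```

Under this rule, pooling codfw for k8s-ingress-aux-rw (the situation described at 14:51:54) would make the -rw record active/active, which is what the convention is meant to avoid.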