[09:12:47] hey folks, I repooled the k8s pods for kartotherian
[09:13:10] It turns out it was my fault, I forgot to add the LVS IP to the loopback interface on the k8s workers
[09:13:35] so when they were pooled, no traffic went to them due to the misconfig (but the health check worked fine)
[09:14:07] this is almost surely the root cause of Monday's maps outage
[09:14:30] the poor bare metal maps nodes were left handling traffic at reduced capacity, and they didn't like it
[09:15:03] (they may have started to fail more gracefully buuuuut it doesn't remove the blame from me :D)
[09:15:42] I am going to leave things as they are now (both bare metals and k8s pods) for a couple of hours, and hopefully later on I'll remove some bare metals from LVS again
[09:16:01] Last but not least, I'll do the incident report
[09:19:59] elukey: excellent job with the blameless postmortem ;-)
[09:20:14] thank you! <3
[09:20:18] only one to blame here
[09:20:20] git
[09:20:45] ^^^
[09:22:11] kamila_: ahahha I was the only one involved in breaking and fixing, I just took responsibility :D
[09:22:50] good point, we need some more stickers :-D
[09:24:23] jokes aside, the way I see the blameless post-mortem is that we can discuss what broke in precise detail without feeling horrible about it, since nobody will point the finger to make you feel bad. Also I like when it is clear that an error was understood and follow-ups taken :)
[09:24:58] the root cause is that maps has no owner, everything else is just a ripple effect
[09:26:18] yep to both
[09:41:48] could/should we have tooling to help avoid that sort of mistake?
[09:42:16] (I mean the "forgot to add the LVS IP to loopback" mistake, not the "run unowned systems in prod" mistake)
[09:42:50] ...because calling management "tooling" is probably going to get me fired :D
[09:54:52] Emperor: it is a difficult use case, since I didn't follow exactly the procedure that states that step: I was doing something more "custom" and thought I recalled all the steps. So surely "follow the docs" is one lesson :D We did have the prometheus network probe failing intermittently, not enough to trigger a real notification, but I noticed it while debugging
[09:55:19] hopefully we'll not need any mixed/weird/custom setup in the future
[09:55:33] and the prometheus check for standard use cases is really clear
[09:58:48] 👍
[10:19:28] elukey: if it's any consolation, I made the *exact* same mistake when migrating the thumbor pods with the same method
[10:23:22] hnowlan: ahahhaha niceee
[10:24:09] hopefully this is the last service that needs this procedure
[10:37:07] <_joe_> I think so, at least until we start moving stuff to aux
[10:37:16] <_joe_> but that's stuff that's less critical
[10:37:40] <_joe_> btw, I do think people tend to give the wrong interpretation of "blameless" in blameless postmortems
[10:39:10] <_joe_> it doesn't mean we don't identify which action(s) by whom caused the outage, but rather that we don't use the people involved as scapegoats, as elukey said
[10:40:32] +1 yes
[10:43:11] on-callers - I am going to attempt again to remove some bare metal capacity from kartotherian codfw
[10:43:18] I'll do it VERY slowly :D
[10:43:25] started with one node
[10:43:38] 🚢
[10:47:20] I'm getting consistent CI failures with git failing, e.g. https://integration.wikimedia.org/ci/job/alerts-pipeline-test/2324/console on https://gerrit.wikimedia.org/r/c/operations/alerts/+/1120923 - have you seen the same too?
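For context on the misconfiguration mentioned at 09:13: with LVS direct routing, each realserver (here, a k8s worker) needs the service VIP bound to its loopback interface so it accepts traffic addressed to that IP without answering ARP for it. A minimal sketch of that step, assuming a generic LVS-DR setup; the VIP below is a placeholder (the real kartotherian service IP is not in the log), and in production this is driven by configuration management rather than run by hand:

    # placeholder VIP; substitute the real LVS service IP
    VIP="198.51.100.10"
    # bind the VIP to loopback so the worker accepts traffic for it
    sudo ip address add "${VIP}/32" dev lo
    # with direct routing the realserver must not answer ARP for the VIP,
    # otherwise it would steal it from the load balancer
    sudo sysctl -w net.ipv4.conf.all.arp_ignore=1
    sudo sysctl -w net.ipv4.conf.all.arp_announce=2
    # verify the address is now present
    ip -brief address show dev lo

Skipping the `ip address add` step is exactly the kind of gap the health checks did not catch, presumably because they target the node's own address rather than the VIP.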
[10:48:15] ah same
[10:48:24] I think
[10:48:50] nope, sorry, mine was another issue
[10:48:59] (two hosts removed in codfw, one removed in eqiad too)
[10:55:59] godog: I was seeing the same yesterday
[10:56:01] I'll file a bug
[10:56:45] ah, already seen: https://phabricator.wikimedia.org/T386784
[10:56:56] re: kartotherian - we are now in the same situation as Monday (-2 bare metal nodes from each DC, pooled=inactive) and so far nothing is complaining
[10:57:23] tailing logs on maps1010 and no timeouts registered
[10:58:17] hnowlan: nice find, thank you!
[11:39:24] going afk for lunch, if anything happens with kartotherian feel free to ping me on the phone and I'll get back immediately
[11:39:29] (for the on-callers)
[11:45:10] hey on-callers, there is a background noise of 50x errors for maps.w.o starting this morning: https://logstash.wikimedia.org/app/dashboards#/view/59147710-1f9e-11ec-85b7-9d1831ce7631?_g=h@9ff60ba&_a=h@6069507
[14:26:09] so those are almost surely the kube workers
[14:26:16] going to depool them, sigh
[14:28:35] back to bare metals only
[14:28:59] (right URL: https://logstash.wikimedia.org/goto/f16fbe1d1d5cc2822c75ec9163fcdb0a)
[14:31:46] aw that's a shame
[14:33:16] I didn't think to check there too, and I see a recovery, so somehow those are k8s-related
[14:33:28] kartotherian logs are pretty terse heh :(
[14:33:45] oh wait, it was only eqiad that was pooled
[14:35:41] Dear oncallers: I'm going to restart Cassandra on sessionstore presently (if there are no objections), to apply a JDK update. No badness is expected.
[14:35:45] in theory both eqiad and codfw (at least, the kartotherian discovery)
[14:36:36] urandom: ack
[14:42:32] elukey: looks like there's some difference in using the mesh to access the mw api?
[14:43:34] hnowlan: did you find anything specific?
[14:45:05] the code was restructured to allow the localhost mesh calls, because the version running on bare metal doesn't use them
[14:45:14] nothing really came up from all the tests
[14:45:38] also kartotherian on k8s doesn't seem to be screaming about anything in particular
[14:47:06] elukey: in eqiad: `kubectl logs kartotherian-main-67744c6484-22j67 kartotherian-main | grep ERROR.*6500`
[14:47:27] checking, I hoped they were also on logstash
[14:47:46] wait, no, you see those on metal too
[14:49:20] the other thing that makes me wonder a bit is https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-default-1-7.0.0-1-2025.02.19?id=EEWcHpUB1cXMpx2mEUGO
[14:49:46] actually no, I might be onto something: you see timeouts connecting to mwapi on metal, but you don't see those on k8s. You *just* see ERRORs saying "connection failure"
[14:49:48] the vast majority of 503s are int-front, IIRC that means generated straight from the varnish frontend
[14:50:17] so I wonder if kartotherian retries or gives a different error on a timeout that it doesn't give when it just gets a 503 from envoy (because of a hidden timeout)
[14:52:27] no, ok, sorry, the 503s are not only int-front
[14:56:25] the majority seem to be 502s, interesting: https://logstash.wikimedia.org/goto/b3b13c2735aaee6d8f2053abcd742d94
[14:56:45] the 503s have the same ttfb, probably the CDN timing out while waiting for kartotherian
[14:56:56] (and that explains the int-front)
[14:59:17] it is true that the kartotherian pods run a single nodejs worker without a supervisor, whereas each bare metal has nproc workers
[14:59:31] but I don't see clear signs of distress on the pods (throttling etc.)
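A couple of quick ways to check for the pod "distress" mentioned at 14:59. This is a sketch: the pod name is the one quoted earlier in the log, and the cgroup path differs between cgroup v1 and v2:

    # CPU/memory usage snapshot, if cluster metrics are available
    kubectl -n kartotherian top pods
    # CPU throttling counters for one pod (cgroup v2 path shown;
    # on cgroup v1 it would be /sys/fs/cgroup/cpu/cpu.stat)
    kubectl -n kartotherian exec kartotherian-main-67744c6484-22j67 -c kartotherian-main -- \
      cat /sys/fs/cgroup/cpu.stat

A non-zero and growing nr_throttled counter would point at the CPU limit rather than at the mesh or routing.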
[15:04:17] <_joe_> the most obvious reason would be some egress rule not being correctly defined
[15:06:14] I tried to hit a single wikikube worker with some of those URIs but can't see anything weird, HTTP 200
[15:06:33] I thought about egress too but I'd expect to see a repro here and there
[15:07:04] <_joe_> so what I see from the data
[15:07:17] <_joe_> elukey: when did you depool kartotherian in k8s?
[15:08:00] _joe_ when you see the errors dropping
[15:08:13] <_joe_> elukey: I'm looking at the envoy telemetry rn
[15:08:35] good point, didn't check it yet
[15:09:04] more or less 14:20 UTC
[15:09:13] <_joe_> nothing in it seems to indicate any kind of timeout
[15:10:32] I have the feeling that I messed up something basic in routing again
[15:11:39] different layer now :D
[15:13:10] the majority are 50x, all misses from xcache but with very quick ttfb
[15:13:32] so it seems as if ATS tries to contact kartotherian.discovery.wmnet and gets garbage back immediately
[15:16:32] is there anything special needed on the ingress side for a wikikube service?
[15:16:44] to be able to be served via LVS
[15:18:51] <_joe_> elukey: I assume you have a nodeport service, correct?
[15:18:57] correct yes
[15:19:02] <_joe_> and you have the LVS ip on the worker nodes?
[15:19:07] <_joe_> then you should be all set
[15:19:39] yep yep, all checks out
[15:25:38] my impression so far is that something is still wrong between the CDN and LVS, because that would explain why we have all those 502s with short ttfb
[15:32:17] I am going to pool a single k8s worker in codfw to run some tests
[15:34:06] <_joe_> elukey: maybe the cert is different?
[15:37:27] _joe_ good point, I checked via 'curl -i "https://kartotherian.svc.codfw.wmnet:6543/img/osm-intl,6,52.15,5.3,290x332@2x.png"' (this is a URI that returned 502) but even the responses from k8s are good
[15:37:44] (I checked the server: main-tls response header)
[15:37:56] <_joe_> elukey: check with host: maps.wikimedia.org
[15:39:16] works fine yeah
[15:39:58] arnaudb: I think we stepped on each other's toes on the private repo
[15:40:09] you committed my change and I ran git checkout -f on yours
[15:40:13] I can restore them
[15:40:21] aha
[15:40:41] please and thank you! otherwise I can redo them brouberol
[15:41:06] you seem to have vim open on the file, so I'll let you, if that's alright?
[15:41:30] done!
[15:43:27] <_joe_> elukey: so, trying to connect to a random k8s worker
[15:43:34] <_joe_> it hangs on
[15:45:44] <_joe_> well no sorry, what hangs is openssl s_client
[15:50:56] <_joe_> elukey: uhm how are certs set up for kartotherian?
[15:51:14] <_joe_> looks like you're not using certmanager, I'm confused
[15:55:16] _joe_ I was under the assumption that I was using it, lemme check
[15:55:26] (due to the use of the mesh as TLS terminator)
[15:58:40] <_joe_> elukey: so I just checked the actual secret in prod
[15:58:52] <_joe_> kubectl -n kartotherian get secrets kartotherian-main-tls-proxy-certs -o json | jq .data | jq '.["tls.crt"]' | sed s/\"//g - | base64 -d | openssl x509 -text
[15:59:04] <_joe_> and... DNS:kartotherian.discovery.wmnet, DNS:kartotherian.svc.codfw.wmnet, DNS:kartotherian-main-tls-service.kartotherian.svc.cluster.local
[15:59:21] * elukey cries in a corner
[15:59:40] I didn't add the SAN for maps
[16:01:25] I don't get why openssl gets stuck, but anyway
[16:01:29] sending a patch :(
[16:01:44] <_joe_> ah ok, I just had one. Please go on
[16:02:05] please go then!
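Since the missing SAN only showed up once real CDN traffic reached the endpoint, it is worth checking the served certificate directly. A sketch reusing the port and hostnames from the log; it assumes the client trusts the internal CA that signed the certificate:

    # print the SANs actually presented by the endpoint (SNI set to the CDN-facing name)
    echo | openssl s_client -connect kartotherian.svc.codfw.wmnet:6543 \
        -servername maps.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
    # or let curl validate the certificate as maps.wikimedia.org while still
    # connecting to the svc endpoint; a missing SAN makes this fail loudly
    curl -sv --connect-to maps.wikimedia.org:443:kartotherian.svc.codfw.wmnet:6543 \
        "https://maps.wikimedia.org/" -o /dev/null

Either check, run right after a deploy, would have surfaced the missing maps.wikimedia.org entry before repooling.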
[16:03:19] <_joe_> elukey: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121035/1/helmfile.d/services/kartotherian/values.yaml
[16:03:31] <_joe_> this should IIRC do the trick
[16:03:47] my soul is extremely crushed at this point
[16:03:56] <_joe_> elukey: so yeah this is usually the last pitfall when adding a new service to the CDN :D
[16:04:11] <_joe_> I made this mistake a couple of times, and valentin saved me once
[16:04:29] _joe_ thanks a lot, I really hope Valentin doesn't see this since he told me a million times to check the SANs, and I usually do
[16:04:57] couldn't we have some CI to check those and/or some helper script to create all the changes needed for a new service?
[16:05:28] <_joe_> volans: sadly it depends on various factors that are not easy to codify in CI
[16:05:56] <_joe_> we already have some helper scripts FWIW
[16:06:06] * vgutierrez sees nothing
[16:06:11] ahahahaha
[16:06:13] <_joe_> vgutierrez: go away
[16:06:26] <_joe_> I hoped you wouldn't be pinged by "Valentin"
[16:06:45] vgutierrez: please, if you want to go ahead with some remark, I will take it, today I deserve it
[16:06:52] :D
[16:07:20] "zero days since I checked a SAN properly"
[16:07:39] <_joe_> elukey: my patch needs refinement
[16:07:58] _joe_ yeah, for staging, I was about to tell you
[16:08:05] maybe empty extraFQDN?
[16:08:17] seems to be what we do in other configs
[16:08:53] yeah... my name pings me
[16:09:00] <_joe_> damn
[16:09:01] O:)
[16:09:04] <_joe_> sorry
[16:09:07] no problem
[16:09:29] <_joe_> seems like you like spice in your life, maybe elukey can interest you in a spicerack
[16:09:59] <_joe_> with lots of cumin, I promise. It will help ease the pain every time you run the reimage cookbook
[16:10:27] doesn't work anymore, the next best way is to craft a phabricator task with the right tags (cumin, etc.)
[16:11:14] then you ask an LLM to create a weird bug report with an obscure python stacktrace
[16:11:59] and you have a certain SRE completely stalled for a day :D
[16:13:11] human DoS :-P
[16:13:22] :D
[16:18:24] <_joe_> elukey: you mean you and marostegui broke him?
[16:18:27] <_joe_> kudos
[16:19:32] nono, I just saw a task being opened for a little cumin bug that caught somebody's interest very quickly :D
[16:22:50] _joe_ taking over the patch to figure out the staging thing so you are free
[16:23:16] <_joe_> it's good to go now
[16:27:38] okok, I recalled we had something like "staging: true" to auto-set those, but it seems to be only for ingress certs
[16:27:50] anyway, good for me, thanks again!
[16:28:05] <_joe_> please +2/deploy at your will
[16:28:12] doing it now
[16:28:24] <_joe_> I'm sure we will find other issues with kartotherian on k8s, but the big one will be solved
[16:34:35] fingers crossed
[16:34:41] deployed and checked the SANs
[16:34:45] repooling the k8s nodes
[16:39:21] (done)
[16:52:51] <_joe_> not looking great
[16:53:55] <_joe_> but at least traffic is flowing there: https://grafana.wikimedia.org/d/d821ac19-02c5-49ac-bf18-58d2e27fdf19/kartotherian?orgId=1
[16:55:04] yes yes, I depooled, 503s registered as well: https://logstash.wikimedia.org/goto/d37213fd2ef44a76f63ad48aa013c2b8
[16:55:18] mostly timeout-related I guess
[16:55:38] <_joe_> yes
[16:55:44] <_joe_> looks like it's underprovisioned?
[16:56:08] I am thinking the same, yes, we probably need double the current pods
[16:57:21] to make it work properly a pod has 5k millicores assigned, and a single nodejs worker
[16:57:38] (mapnik etc. need a fair amount of compute to operate)
[16:58:02] probably even more memory
[16:58:48] yeah, they're getting OOMKilled
[16:59:15] depooled everything a couple of mins ago, metrics recovered
[17:00:03] I'll make some calculations tomorrow and adjust the replicas and the pod settings
[17:00:14] <_joe_> elukey: I was thinking memory, as hnowlan pointed out
[17:00:21] definitely yes
[17:00:46] we tried to load test a single pod but we never reached that consumption, so probably some use case was missing
[17:49:55] going afk, thanks to all for the help!
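Before recalculating replicas and pod settings, it can help to confirm the OOM kills and capture the current baseline. A sketch; the Deployment name kartotherian-main is an assumption inferred from the pod names in the log:

    # which containers were last terminated, and whether the reason was OOMKilled
    kubectl -n kartotherian get pods \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
    # current replica count and per-container requests/limits, as the baseline for the resize
    kubectl -n kartotherian get deployment kartotherian-main \
      -o jsonpath='{.spec.replicas}{"\n"}{.spec.template.spec.containers[0].resources}{"\n"}'

Given that each bare metal node runs nproc node.js workers while a pod runs a single worker, matching the bare metal capacity is mostly a matter of scaling replicas, with the memory limit raised enough to stop the OOM kills.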