[09:12:47] hey folks, I repooled the k8s pods for kartotherian
[09:13:10] It turns out it was my fault, I forgot to add the LVS IP to the loopback interface on the k8s workers
[09:13:35] so when they were pooled, no traffic went to them due to the misconfig (but the health check worked fine)
[09:14:07] this is almost surely the root cause of Monday's maps outage
[09:14:30] the poor bare metal maps nodes were left handling traffic at reduced capacity, and they didn't like it
[09:15:03] (they may have started to fail more gracefully buuuuut it doesn't remove the blame from me :D)
[09:15:42] I am going to leave things as they are now (both bare metals and k8s pods) for a couple of hours, and hopefully later on I'll remove some bare metals from LVS again
[09:16:01] Last but not least, I'll do the incident report
[09:19:59] elukey: excellent job with the blameless postmortem ;-)
[09:20:14] thank you! <3
[09:20:18] only one to blame here
[09:20:20] git
[09:20:45] ^^^
[09:22:11] kamila_: ahahha I was the only one involved in breaking and fixing, I just took responsibility :D
[09:22:50] good point, we need some more stickers :-D
[09:24:23] jokes aside, the way I see the blameless post-mortem is that we can discuss what broke in precise detail without feeling horrible about it, since nobody will point the finger to make you feel bad. Also I like when it is clear that an error was understood and follow-ups taken :)
[09:24:58] the root cause is that maps has no owner, everything else is just a ripple effect
[09:26:18] yep to both
[09:41:48] could/should we have tooling to help avoid that sort of mistake?
[09:42:16] (I mean the "forgot to add the LVS IP to loopback" mistake, not the "run unowned systems in prod" mistake)
[09:42:50] ...because calling management "tooling" is probably going to get me fired :D
[09:54:52] Emperor: it is a difficult use case, since I didn't follow exactly the procedure that states that step: I was doing something more "custom" and thought I recalled all the steps. So surely "follow the docs" is one lesson :D We did have the prometheus network probe failing intermittently, not enough to trigger a real notification, but I noticed it while debugging
[09:55:19] hopefully we'll not need any mixed/weird/custom setup in the future
[09:55:33] and the prometheus check for standard use cases is really clear
[09:58:48] 👍
[10:19:28] elukey: if it's any consolation, I made the *exact* same mistake when migrating the thumbor pods with the same method
[10:23:22] hnowlan: ahahhaha niceee
[10:24:09] hopefully this is the last service that needs this procedure
[10:37:07] <_joe_> I think so, at least until we start moving stuff to aux
[10:37:16] <_joe_> but that's stuff that's less critical
[10:37:40] <_joe_> btw, I do think people tend to give the wrong interpretation of "blameless" in blameless postmortems
[10:39:10] <_joe_> it doesn't mean we don't identify which action(s) by whom caused the outage, but rather that we don't use the people involved as scapegoats, as elukey said
[10:40:32] +1 yes
[10:43:11] on-callers - I am going to attempt again to remove some bare metal capacity from kartotherian codfw
[10:43:18] I'll do it VERY slowly :D
[10:43:25] started with one node
[10:43:38] 🚢
[10:47:20] I'm getting consistent CI failures with git failing, e.g. https://integration.wikimedia.org/ci/job/alerts-pipeline-test/2324/console on https://gerrit.wikimedia.org/r/c/operations/alerts/+/1120923 - have you seen the same too?
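For context on the misconfiguration mentioned at 09:13: with LVS direct routing, each realserver (here, a k8s worker) needs the service VIP bound to its loopback interface so it accepts traffic addressed to that IP without answering ARP for it. A minimal sketch of that step, assuming a generic LVS-DR setup; the VIP below is a placeholder (the real kartotherian service IP is not in the log), and in production this is driven by configuration management rather than run by hand:

    # placeholder VIP; substitute the real LVS service IP
    VIP="198.51.100.10"
    # bind the VIP to loopback so the worker accepts traffic for it
    sudo ip address add "${VIP}/32" dev lo
    # with direct routing the realserver must not answer ARP for the VIP,
    # otherwise it would steal it from the load balancer
    sudo sysctl -w net.ipv4.conf.all.arp_ignore=1
    sudo sysctl -w net.ipv4.conf.all.arp_announce=2
    # verify the address is now present
    ip -brief address show dev lo

Skipping the `ip address add` step is exactly the kind of gap the health checks did not catch, presumably because they target the node's own address rather than the VIP.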
[10:48:15] ah same
[10:48:24] I think
[10:48:50] nope, sorry, mine was another issue
[10:48:59] (two hosts removed in codfw, one removed in eqiad too)
[10:55:59] godog: I was seeing the same yesterday
[10:56:01] I'll file a bug
[10:56:45] ah, already seen: https://phabricator.wikimedia.org/T386784
[10:56:56] re: kartotherian - we are now in the same situation as Monday (-2 bare metal nodes from each DC, pooled=inactive) and so far nothing is complaining
[10:57:23] tailing logs on maps1010 and no timeouts registered
[10:58:17] hnowlan: nice find, thank you!
[11:39:24] going afk for lunch, if anything happens with kartotherian feel free to ping me on the phone and I'll get back immediately
[11:39:29] (for the on-callers)
[11:45:10] hey on-callers, there is a background noise of 50x errors for maps.w.o starting this morning: https://logstash.wikimedia.org/app/dashboards#/view/59147710-1f9e-11ec-85b7-9d1831ce7631?_g=h@9ff60ba&_a=h@6069507
[14:26:09] so those are almost surely the kube workers
[14:26:16] going to depool them, sigh
[14:28:35] back to bare metals only
[14:28:59] (right URL: https://logstash.wikimedia.org/goto/f16fbe1d1d5cc2822c75ec9163fcdb0a)
[14:31:46] aw that's a shame
[14:33:16] I didn't think to check there too, and I see a recovery, so somehow those are k8s-related
[14:33:28] kartotherian logs are pretty terse heh :(
[14:33:45] oh wait, it was only eqiad that was pooled
[14:35:41] Dear oncallers: I'm going to restart Cassandra on sessionstore presently (if there are no objections), to apply a JDK update. No badness is expected.
[14:35:45] in theory both eqiad and codfw (at least, the kartotherian discovery)
[14:36:36] urandom: ack
[14:42:32] elukey: looks like there's some difference in using the mesh to access the mw api?
[14:43:34] hnowlan: did you find anything specific?
[14:45:05] the code was restructured to allow the localhost mesh calls, because the version running on bare metal doesn't use them
[14:45:14] nothing really came up from all the tests
[14:45:38] also kartotherian on k8s doesn't seem to be screaming about anything in particular
[14:47:06] elukey: in eqiad: `kubectl logs kartotherian-main-67744c6484-22j67 kartotherian-main | grep ERROR.*6500`
[14:47:27] checking, I hoped they were also on logstash
[14:47:46] wait, no, you see those on metal too
[14:49:20] the other thing that makes me wonder a bit is https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-default-1-7.0.0-1-2025.02.19?id=EEWcHpUB1cXMpx2mEUGO
[14:49:46] actually no, I might be onto something: you see timeouts connecting to mwapi on metal, but you don't see those on k8s. You *just* see ERRORs saying "connection failure"
[14:49:48] the vast majority of 503s are int-front, IIRC that means generated straight from the varnish frontend
[14:50:17] so I wonder if kartotherian retries or gives a different error on a timeout that it doesn't give when it just gets a 503 from envoy (because of a hidden timeout)
[14:52:27] no, ok, sorry, the 503s are not only int-front
[14:56:25] the majority seem to be 502s, interesting: https://logstash.wikimedia.org/goto/b3b13c2735aaee6d8f2053abcd742d94
[14:56:45] the 503s have the same ttfb, probably the CDN timing out while waiting for kartotherian
[14:56:56] (and that explains the int-front)
[14:59:17] it is true that the kartotherian pods run a single nodejs worker without a supervisor, whereas each bare metal has nproc workers
[14:59:31] but I don't see clear signs of distress on the pods (throttling etc.)
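A couple of quick ways to check for the pod "distress" mentioned at 14:59. This is a sketch: the pod name is the one quoted earlier in the log, and the cgroup path differs between cgroup v1 and v2:

    # CPU/memory usage snapshot, if cluster metrics are available
    kubectl -n kartotherian top pods
    # CPU throttling counters for one pod (cgroup v2 path shown;
    # on cgroup v1 it would be /sys/fs/cgroup/cpu/cpu.stat)
    kubectl -n kartotherian exec kartotherian-main-67744c6484-22j67 -c kartotherian-main -- \
      cat /sys/fs/cgroup/cpu.stat

A non-zero and growing nr_throttled counter would point at the CPU limit rather than at the mesh or routing.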
[15:04:17] <_joe_> the most obvious reason would be some egress rule not being correctly defined
[15:06:14] I tried to hit a single wikikube worker with some of those URIs but can't see anything weird, HTTP 200
[15:06:33] I thought about egress too but I'd expect to see a repro here and there
[15:07:04] <_joe_> so what I see from the data
[15:07:17] <_joe_> elukey: when did you depool kartotherian in k8s?
[15:08:00] _joe_ when you see the errors dropping
[15:08:13] <_joe_> elukey: I'm looking at the envoy telemetry rn
[15:08:35] good point, didn't check it yet
[15:09:04] more or less 14:20 UTC
[15:09:13] <_joe_> nothing in it seems to indicate any kind of timeout
[15:10:32] I have the feeling that I messed up something basic in routing again
[15:11:39] different layer now :D
[15:13:10] the majority are 50x, all misses from xcache but with very quick ttfb
[15:13:32] so it seems as if ATS tries to contact kartotherian.discovery.wmnet and gets garbage back immediately
[15:16:32] is there anything special needed on the ingress side for a wikikube service?
[15:16:44] to be able to be served via LVS
[15:18:51] <_joe_> elukey: I assume you have a nodeport service, correct?
[15:18:57] correct yes
[15:19:02] <_joe_> and you have the LVS ip on the worker nodes?
[15:19:07] <_joe_> then you should be all set
[15:19:39] yep yep, all checks out
[15:25:38] my impression so far is that something is still wrong between the CDN and LVS, because that would explain why we have all those 502s with short ttfb
[15:32:17] I am going to pool a single k8s worker in codfw to run some tests
[15:34:06] <_joe_> elukey: maybe the cert is different?
[15:37:27] _joe_ good point, I checked via 'curl -i "https://kartotherian.svc.codfw.wmnet:6543/img/osm-intl,6,52.15,5.3,290x332@2x.png"' (this is a URI that returned 502) but even the responses from k8s are good
[15:37:44] (I checked the server: main-tls response header)
[15:37:56] <_joe_> elukey: check with host: maps.wikimedia.org
[15:39:16] works fine yeah
[15:39:58] arnaudb: I think we stepped on each other's toes on the private repo
[15:40:09] you committed my change and I ran git checkout -f on yours
[15:40:13] I can restore them
[15:40:21] aha
[15:40:41] please and thank you! otherwise I can redo them brouberol
[15:41:06] you seem to have vim open on the file, so I'll let you, if that's alright?
[15:41:30] done!
[15:43:27] <_joe_> elukey: so, trying to connect to a random k8s worker
[15:43:34] <_joe_> it hangs on
[15:45:44] <_joe_> well no sorry, what hangs is openssl s_client
[15:50:56] <_joe_> elukey: uhm how are certs set up for kartotherian?
[15:51:14] <_joe_> looks like you're not using certmanager, I'm confused
[15:55:16] _joe_ I was under the assumption that I was using it, lemme check
[15:55:26] (due to the use of the mesh as TLS terminator)
[15:58:40] <_joe_> elukey: so I just checked the actual secret in prod
[15:58:52] <_joe_> kubectl -n kartotherian get secrets kartotherian-main-tls-proxy-certs -o json | jq .data | jq '.["tls.crt"]' | sed s/\"//g - | base64 -d | openssl x509 -text
[15:59:04] <_joe_> and... DNS:kartotherian.discovery.wmnet, DNS:kartotherian.svc.codfw.wmnet, DNS:kartotherian-main-tls-service.kartotherian.svc.cluster.local
[15:59:21] * elukey cries in a corner
[15:59:40] I didn't add the SAN for maps
[16:01:25] I don't get why openssl gets stuck, but anyway
[16:01:29] sending a patch :(
[16:01:44] <_joe_> ah ok, I just had one. Please go on
[16:02:05] please go then!
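Since the missing SAN only showed up once real CDN traffic reached the endpoint, it is worth checking the served certificate directly. A sketch reusing the port and hostnames from the log; it assumes the client trusts the internal CA that signed the certificate:

    # print the SANs actually presented by the endpoint (SNI set to the CDN-facing name)
    echo | openssl s_client -connect kartotherian.svc.codfw.wmnet:6543 \
        -servername maps.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
    # or let curl validate the certificate as maps.wikimedia.org while still
    # connecting to the svc endpoint; a missing SAN makes this fail loudly
    curl -sv --connect-to maps.wikimedia.org:443:kartotherian.svc.codfw.wmnet:6543 \
        "https://maps.wikimedia.org/" -o /dev/null

Either check, run right after a deploy, would have surfaced the missing maps.wikimedia.org entry before repooling.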
[16:03:19] <_joe_> elukey: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121035/1/helmfile.d/services/kartotherian/values.yaml
[16:03:31] <_joe_> this should IIRC do the trick
[16:03:47] my soul is extremely crushed at this point
[16:03:56] <_joe_> elukey: so yeah this is usually the last pitfall when adding a new service to the CDN :D
[16:04:11] <_joe_> I made this mistake a couple of times, and valentin saved me once
[16:04:29] _joe_ thanks a lot, I really hope Valentin doesn't see this since he told me a million times to check the SANs, and I usually do
[16:04:57] couldn't we have some CI to check those and/or some helper script to create all the changes needed for a new service?
[16:05:28] <_joe_> volans: sadly it depends on various factors that are not easy to codify in CI
[16:05:56] <_joe_> we already have some helper scripts FWIW
[16:06:06] * vgutierrez sees nothing
[16:06:11] ahahahaha
[16:06:13] <_joe_> vgutierrez: go away
[16:06:26] <_joe_> I hoped you wouldn't be pinged by "Valentin"
[16:06:45] vgutierrez: please, if you want to go ahead with some remark, I will take it, today I deserve it
[16:06:52] :D
[16:07:20] "zero days since I checked a SAN properly"
[16:07:39] <_joe_> elukey: my patch needs refinement
[16:07:58] _joe_ yeah, for staging, I was about to tell you
[16:08:05] maybe empty extraFQDN?
[16:08:17] seems to be what we do in other configs
[16:08:53] yeah... my name pings me
[16:09:00] <_joe_> damn
[16:09:01] O:)
[16:09:04] <_joe_> sorry
[16:09:07] no problem
[16:09:29] <_joe_> seems like you like spice in your life, maybe elukey can interest you in a spicerack
[16:09:59] <_joe_> with lots of cumin, I promise. It will help ease the pain every time you run the reimage cookbook
[16:10:27] doesn't work anymore, the next best way is to craft a phabricator task with the right tags (cumin, etc.)
[16:11:14] then you ask an LLM to create a weird bug report with an obscure python stacktrace
[16:11:59] and you have a certain SRE completely stalled for a day :D
[16:13:11] human DoS :-P
[16:13:22] :D
[16:18:24] <_joe_> elukey: you mean you and marostegui broke him?
[16:18:27] <_joe_> kudos
[16:19:32] nono, I just saw a task being opened for a little cumin bug that caught somebody's interest very quickly :D
[16:22:50] _joe_ taking over the patch to figure out the staging thing so you are free
[16:23:16] <_joe_> it's good to go now
[16:27:38] okok, I recalled we had something like "staging: true" to auto-set those, but it seems to be only for ingress certs
[16:27:50] anyway, good for me, thanks again!
[16:28:05] <_joe_> please +2/deploy at your will
[16:28:12] doing it now
[16:28:24] <_joe_> I'm sure we will find other issues with kartotherian on k8s, but the big one will be solved
[16:34:35] fingers crossed
[16:34:41] deployed and checked the SANs
[16:34:45] repooling the k8s nodes
[16:39:21] (done)
[16:52:51] <_joe_> not looking great
[16:53:55] <_joe_> but at least traffic is flowing there: https://grafana.wikimedia.org/d/d821ac19-02c5-49ac-bf18-58d2e27fdf19/kartotherian?orgId=1
[16:55:04] yes yes, I depooled, 503s registered as well: https://logstash.wikimedia.org/goto/d37213fd2ef44a76f63ad48aa013c2b8
[16:55:18] mostly timeout-related I guess
[16:55:38] <_joe_> yes
[16:55:44] <_joe_> looks like it's underprovisioned?
[16:56:08] I am thinking the same, yes, we probably need double the current pods
[16:57:21] to make it work properly a pod has 5k millicores assigned, and a single nodejs worker
[16:57:38] (mapnik etc. need a fair amount of compute to operate)
[16:58:02] probably even more memory
[16:58:48] yeah, they're getting OOMKilled
[16:59:15] depooled everything a couple of mins ago, metrics recovered
[17:00:03] I'll make some calculations tomorrow and adjust the replicas and the pod settings
[17:00:14] <_joe_> elukey: I was thinking memory, as hnowlan pointed out
[17:00:21] definitely yes
[17:00:46] we tried to load test a single pod but we never reached that consumption, so probably some use case was missing
[17:49:55] going afk, thanks to all for the help!
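Before recalculating replicas and pod settings, it can help to confirm the OOM kills and capture the current baseline. A sketch; the Deployment name kartotherian-main is an assumption inferred from the pod names in the log:

    # which containers were last terminated, and whether the reason was OOMKilled
    kubectl -n kartotherian get pods \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
    # current replica count and per-container requests/limits, as the baseline for the resize
    kubectl -n kartotherian get deployment kartotherian-main \
      -o jsonpath='{.spec.replicas}{"\n"}{.spec.template.spec.containers[0].resources}{"\n"}'

Given that each bare metal node runs nproc node.js workers while a pod runs a single worker, matching the bare metal capacity is mostly a matter of scaling replicas, with the memory limit raised enough to stop the OOM kills.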