[05:14:10] alright, something isn't quite right in k8s in codfw, which I suspect may be related to the steadily increasing numbers of KubernetesCalicoDown alerts, which I'm starting to look at now [05:14:21] if other folks are around to lend a hand, a that would be greatly appreciated [05:22:59] kubectl -n kube-system get po -l k8s-app=calico-kube-controllers [05:23:00] NAME READY STATUS RESTARTS AGE [05:23:00] calico-kube-controllers-57c75d4867-cgzpc 0/1 CrashLoopBackOff 21 (4m19s ago) 21d [05:24:58] 2024-10-09 05:17:48.105 [ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded [05:24:58] 2024-10-09 05:17:48.105 [FATAL][1] main.go 124: Failed to initialize Calico datastore error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded [05:34:33] summarizing what I've seen so far: [05:34:33] 1. typha appears to be fine, or at least available [05:34:33] 2. k8s API calls made by the calico controller appear to be timing out [05:34:33] 3. we appear to be bleeding calico node pods as a result of #2? [05:34:57] <_joe_> So etc issues? [05:35:05] <_joe_> etcd [05:35:18] <_joe_> sorry still half an hour out [05:35:48] <_joe_> scott please phone Alex and Janis [05:36:27] _joe_: ack, will do [05:36:31] <_joe_> wake them up, I’m on a treadmill and I’ll get home as fast as I can [05:36:47] SGTM, and thank you [05:36:55] <_joe_> and let’s consider a switchover [05:40:27] ο/ [05:41:06] catching up to backlog [05:41:13] hmmm [05:41:13] thanks, akosiaris! [05:41:40] looking at the etcd load angle now [05:44:32] steady-state op rates and memory (presumably ~ key counts) have 2-3x'd over the course of the last 2h or so, but latency is looking fine (and rates certainly below what we see during deployments): https://grafana.wikimedia.org/goto/AwmnLlkNg?orgId=1 [05:44:58] hello - can I help? [05:45:02] !incidents [05:45:02] 5300 (ACKED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [05:45:03] 5302 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [05:45:03] 5303 (ACKED) ProbeDown sre (ip4 probes/service codfw) [05:45:03] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [05:45:03] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [05:45:28] I see liveness probes of calico-node pods failing [05:46:14] good morning [05:46:54] looking at per host stats, there is some increase in CPU usage but I don't yet see any CPU pegged boxes [05:47:50] so the CPU usage increase is probably due to the restart of pods [05:47:54] do we have an incident doc? [05:47:56] I just wrote Jayme but not sure if he is awake already [05:48:15] akosiaris: no, I but I can start one right now [05:48:18] I can start one, one sec [05:48:24] ah ok great thanks [05:48:28] jelto: call him maybe? [05:48:47] looking into calico pods logs, nothing eye catching yet [05:49:01] do we have a rough idea of any user impact? 
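As a side note to the triage above, the manual checks (kubectl get po for calico-kube-controllers and calico-node, eyeballing restarts) can also be scripted. The sketch below is a minimal Python version using the kubernetes client; it assumes a local kubeconfig with read access to the affected cluster and is purely illustrative, not an existing WMF tool.

```python
"""Rough triage helper mirroring the manual checks above: summarize the
ready/not-ready split of calico pods in kube-system."""
from kubernetes import client, config


def summarize(core: client.CoreV1Api, label_selector: str) -> None:
    pods = core.list_namespaced_pod("kube-system", label_selector=label_selector)
    ready = not_ready = 0
    for pod in pods.items:
        statuses = pod.status.container_statuses or []
        if statuses and all(cs.ready for cs in statuses):
            ready += 1
        else:
            not_ready += 1
            restarts = sum(cs.restart_count for cs in statuses)
            print(f"NOT READY: {pod.metadata.name} (restarts={restarts})")
    print(f"{label_selector}: {ready} ready, {not_ready} not ready")


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when run in-cluster
    api = client.CoreV1Api()
    summarize(api, "k8s-app=calico-kube-controllers")
    summarize(api, "k8s-app=calico-node")
```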
[05:50:04] jayme is one the way to his computer but needs some time [05:50:14] thank you jelto [05:50:47] I am here too [05:50:54] just dropped the doc link in _security [05:53:42] The calico/typha pods are hitting resource limits, https://grafana.wikimedia.org/d/2AfU0X_Mz/jayme-calico-resources?orgId=1&from=now-3h&to=now [05:53:42] Can we increase the limits maybe? [05:54:34] it would proly get us as far [05:54:40] jelto: go ahead [05:55:12] I will take a look what happened between 3-4 UTC that may have triggered this [05:56:39] there's clearly a sizable upward trend in get / list ops on the k8s API starting some time between 3 and 4 UTC, yeah - hard to tell if that's a result of things failing and restarting, though (https://grafana.wikimedia.org/goto/ZZZQY_kHg?orgId=1) [05:57:00] effie: I think it started earlier, at least 8 hours ago? [05:57:42] On the calico-node level, there is nothing weird in logs. The weirdest thing I found was [05:57:50] bird: I/O loop cycle took 7808 ms for 6 events [05:59:25] question_mark: so far betwwn 3-4 seems like a good starting point [05:59:32] resource bump like that? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1078791 [05:59:53] jelto: +1ed, let's try it out to see if it stops the bleeding [06:00:00] want to deploy it? or should I? [06:00:22] effie: I'm guessing the KubernetesAPILatency alerts earlier were then unrelated? [06:00:25] I can deploy when I got a +2 from jenkins [06:00:34] jelto: +2 yourself and proceed [06:00:37] ok [06:00:38] forget about CI right now [06:02:09] submitted, I'll apply once it's on deploy2002 and the diff looks as expected [06:02:22] jelto: the 1.5Gi thing, might backfire https://github.com/kubernetes/kubectl/issues/1426 [06:02:28] question_mark: looking [06:02:38] if it does, just manually switch to G on deploy2002 and proceed [06:04:02] first time I see some warnings in calico-node logs: 2024-10-09 06:03:00.473 [WARNING][144778] felix/health.go 211: Reporter is not ready. name="int_dataplane" [06:04:29] we are 25 nodes with fully functional calico-node pods [06:04:44] <_joe_> akosiaris: I think it's a positive feeback [06:04:49] unless the patch that jelto has works, we might want to switchback to eqiad [06:05:11] <_joe_> we should move traffic for everything but mw-rw to eqiad right now and maybe destroy all deployments but the mw ones [06:05:17] <_joe_> so that the system can try to recover [06:05:26] o/ [06:05:30] <_joe_> akosiaris: only doubt about switching back is - do DBAs need to do anything? [06:05:40] diff looks ok, running helmfile -e codfw --selector name=calico apply -i --context 5 [06:05:55] I can get started moving -ro services to eqiad [06:06:01] we're still provisioned for that [06:06:23] _joe_: hmmm, the circular replication thing? [06:06:31] <_joe_> yes [06:06:37] <_joe_> so we need to call someone in that team too [06:06:52] swfrench-wmf: yeah, that will allow us to serve at least some uses [06:06:55] <_joe_> anyways, I'm still soaked in sweat and panting, don't trust a word I say rn :P [06:07:23] 212 calico-nodes ready [06:07:24] what? [06:07:27] are we recovering? [06:07:28] <_joe_> lol [06:07:30] apply still running, [06:07:32] <_joe_> I guess that was jelto [06:07:36] lol [06:07:39] yes helmfile still doing it work [06:07:39] holding [06:07:48] but a lot of fresh new healthy pods [06:09:06] jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
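For future reference, waiting for a rollout like the calico apply above to converge can be done programmatically instead of re-running kubectl by hand. A minimal sketch, assuming the DaemonSet is named calico-node in kube-system (the usual upstream name, not verified against deployment-charts):

```python
import time

from kubernetes import client, config


def wait_for_daemonset(name="calico-node", namespace="kube-system", timeout=600):
    """Poll a DaemonSet until every scheduled pod reports ready, or time out."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ds = apps.read_namespaced_daemon_set(name, namespace)
        ready = ds.status.number_ready or 0
        desired = ds.status.desired_number_scheduled or 0
        print(f"{name}: {ready}/{desired} ready")
        if desired and ready == desired:
            return True
        time.sleep(10)
    return False
```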
[06:09:23] <_joe_> I still see a lot of app-level issues [06:09:26] shall I call someone in d/p for the question of whether we can switch to eqiad right now? [06:09:30] <_joe_> but calico has recovered? [06:09:42] I think we are going to bet getting out of the woods soon [06:09:45] <_joe_> question_mark: I'd wait a couple minutes to see if we recover [06:09:49] alright [06:10:08] <_joe_> great catch jelto [06:11:10] yeah, good job [06:11:40] now, what on earth happened and this got triggered like this? [06:11:55] resource usage of calico looks better to my untrained eye [06:11:57] <_joe_> we're out of the woods I'd say [06:12:09] great - I'm catched up [06:12:35] <_joe_> lol [06:12:56] other things talking to the apiserver seem to behave strange as well ... like helm-state-metrics in crashloop [06:13:24] precipitous drop in k8s API calls and etcd ops ... yeah, that has the smell of resource saturation -> feedback loop [06:13:50] and holes in metrics, like https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=calico-kube-controllers-57c75d4867-cgzpc&var-container=All [06:14:01] <_joe_> swfrench-wmf: yeah as I guessed, it was probably calico down -> reallocate pods -> more strain on calico [06:14:46] I linked the previous incident "2024-04-03 calico/typha down" in the new doc as well which was quite similar i think [06:14:53] <_joe_> ok, seems like things are reasonably stable [06:15:45] <_joe_> jelto: yeah and tbh I'm a bit worried by this positive feedback mechanism [06:16:06] <_joe_> because while I wasn't here during that incident, it seems that the dynamics are also similar [06:17:13] funnily enough, it takes a while to realize that the effects on MediaWiki are kinda hidden in https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&from=now-1h&to=now [06:17:30] the best sign is 200s dropping by 25%, from 4k to 3k for ~5 minutes [06:18:20] latencies increased too, but for a short period of time [06:18:36] this is weird btw, cause a death of calico-node doesn't reallocate pods [06:20:09] so far the only thing I have found standing out was mw-script jobs starting [06:20:32] <_joe_> effie: I doubt that could be the cause [06:20:37] but unsure yet if it is related [06:20:39] <_joe_> unless it was like 900 jobs at the same time [06:21:12] <_joe_> then maybe we need to add some poolcounter support to mwscript-k8s to limit the number of jobs that can be started :) [06:21:47] alright, thank you _so_ very much, folks, for popping in to assist [06:22:00] awkward time of day for bad stuff to happen! [06:22:32] thank you as well scott especially, it's quite late for you [06:23:40] there is this sharp increase in rx traffic on nodes running calico typha [06:24:01] ~7MB/s [06:24:10] starting around 4:00 [06:24:59] which matches the increased GET rate...okay [06:25:42] etcd SLO dashboard looks partially broken [06:26:16] they don't apply to etcd v3 anyways IIRC [06:26:35] question_mark: that's for conftool, not kubernetes API etcd [06:26:44] https://grafana.wikimedia.org/d/iyumW7LGz/etcd-slos?orgId=1&from=now-6h&to=now [06:26:55] and it works, but there is a second one confusingly named similarly [06:27:11] thanks [06:27:12] and the correct one has a wrong date range set as default [06:27:25] probably a result of the last reporting round [06:27:27] I'm not sure if the resource increase was what really resolved the issue. 
It was probably more related to the redeploy of all calico pods at the same time. But that's just a hypothesis :) [06:27:30] (i'll be talking about SLOs in the P&T all staff meeting later today :) [06:28:11] jayme: looking at etcd objects in codfw, I see configmaps(?!) roughly double from 6k to 12k between 2 and 6? [06:28:16] jelto: did you see calico-node pods being killed? [06:28:53] swfrench-wmf: yeah...I was looking at the same thing [06:29:07] might be jobs effie mentioned? [06:29:26] just crashlooping, like `calico-node-s89nl 0/1 CrashLoopBackOff 20 (3m46s ago) 133d` and a lot Running but not ready [06:29:39] there is nothing in SAL fwiw around that timeframe [06:29:53] ah, yeah ... perfectly correlates with jobs.batch objects ... and their associated network policies! [06:30:23] ? [06:30:32] * akosiaris trying to put 2+2 together [06:30:36] https://grafana.wikimedia.org/goto/4N6Qs_zHR?orgId=1 [06:30:55] kubectl -n mw-script get jobs |wc -l [06:30:56] 1791 [06:30:59] hoooooooolyyyy [06:31:03] there we go [06:31:04] heh, precisely [06:31:27] <_joe_> yeah [06:31:28] so, we have a pretty wild amount of objects now [06:31:35] <_joe_> ok so [06:31:40] the increased apiserver GET rate is mostly for node and endpoint object - that is probably just inflicted by calicos [06:31:44] durations less than 8h for all [06:31:55] <_joe_> we might need to make mw networkpolicy names not depend on the deployment? [06:32:11] for maint scripts, yeah [06:32:15] it doesn't make sense [06:32:18] <_joe_> so that for that many jobs we still have just one networkpolicy? [06:32:31] creating a netpol for every script invocation isn't worth it [06:32:39] <_joe_> yup [06:32:39] kubectl -n mw-script get cm |wc -l [06:32:41] 12610 [06:32:58] kubectl -n mw-script get netpol |wc -l [06:32:58] 1803 [06:33:07] the configmaps are most likely also redundant [06:33:12] <_joe_> yes [06:33:25] ok, that explains why kube-calico-controllers also started getting throttled [06:33:35] <_joe_> so we will need to make it so that configmaps only change for canaries, maybe [06:33:44] will be be seeing this in eqiad as well soon-ish? [06:33:49] there are no canaries in mw-scripts? [06:34:08] btw, this isn't timer based, right? Those haven't even been specced out yet [06:34:18] so this is just someone runnings mwscript-k8s manually [06:34:29] and ofc they aren't to blame, but imagine this on a timer/cron [06:35:16] maybe this should be on sal too [06:35:21] we should probably have manual things in separate namespace with tight resource quota [06:35:33] jayme: I believe the script effectively prevents creating jobs in the non-primary dc [06:35:45] if there are already 200 jobs running, we probably don't want anyone to create another 200 [06:35:48] we 'll need to work a bit on the mwscript front for sure. Pool netpols, pool cms, pool secrets. 
Overall all objects, we can't be creating all of these from everything script invocation [06:35:55] swfrench-wmf: ok [06:35:57] for every* [06:36:30] jayme: Reuven pointed out that the plan for cron stuff is to go under mw-cron [06:36:43] so, I think we are all on the same page on that one [06:36:43] ah, okay [06:37:03] mw-script ns should be strictly for manual stuff [06:37:04] so it's more about tightening the quota on mw-script rn [06:37:14] <_joe_> I can add poolcounter support to mwscript so we limit the scripts running to N at each time [06:37:18] and probably some aggressive GC [06:37:27] I doubt there is a point in keepings around all these jobs [06:37:53] <_joe_> akosiaris: yeah tbh I'd make mwscript-k8s run in --follow mode by default [06:38:00] so what do we think got calico unstuck? resource limit increase or not? [06:38:01] <_joe_> and GC explicitly once the job is done [06:39:04] question_mark: yes, resource limit increase, but that need was driven by an increased creation of extra Network Policy resources (and other resources) due to the new mwscript-k8s rollout [06:39:22] yes, understood [06:39:30] once we sort that out and not be creating all these k8s resources for every invocation of mwscript, we 'll be able to revert the increase [06:39:37] if we want to by then ofc. [06:40:10] as a side note, ofc this is going to happen again if mwscript continues on being run [06:40:27] unless we implement at least some aggressive GC [06:40:37] and these were manually run jobs? [06:40:40] yes [06:40:54] my very rough guess is that this is some foreachwiki effort [06:41:05] 1791 of them, due to foreachwiki? [06:41:15] something like that [06:41:21] I 'll have to dig to verify that guess [06:41:29] but I have to run some errands first [06:41:31] i'll update the summary in the incident doc with that for now [06:41:34] it's going to be a long and interesting day [06:42:14] There was a thread on wikitech-l of a maint script being started on every wiki question_mark [06:42:27] That was 00:37 utc + 1 [06:42:44] thank you [06:42:59] Should we issue a warning to deployers to avoid using mwscript while we sort these things out? [06:43:19] <_joe_> just ask not to run it in batches? [06:43:28] with an outage at stake at any possible time due to a wrong invocation that spaws way too many jobs, feels like some action is justified [06:43:28] alright, stepping away for real now - thanks, again, all! [06:43:37] bb swfrench-wmf tx! [06:43:44] swfrench-wmf: have a good night ! Thanks for responding! [06:43:49] we should also add some observability there too [06:44:12] yeah, I started looking into the statsd part yesterday, after mszab.o's ping [06:44:22] thanks swfrench-wmf! [06:44:38] have a good night o/ [06:44:42] we definitely need at least MW level metrics and probably some k8s job related metrics so that we don't get caught in a surprise again [06:44:49] there might be something we can put together with what we have now, [06:44:52] who gets the t-shirt btw? 
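On the "aggressive GC" idea raised above, a rough sketch of what a cleanup pass over mw-script could look like is below. The 8-hour retention and the dry-run default are assumptions for illustration, not decisions from this discussion; setting spec.ttlSecondsAfterFinished on the Job template would achieve the same thing server-side without a separate sweeper.

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

# Hypothetical retention window; pick whatever actually fits mwscript usage.
RETENTION = timedelta(hours=8)


def gc_finished_jobs(namespace="mw-script", dry_run=True):
    """Delete Jobs that completed more than RETENTION ago."""
    config.load_kube_config()
    batch = client.BatchV1Api()
    cutoff = datetime.now(timezone.utc) - RETENTION
    for job in batch.list_namespaced_job(namespace).items:
        done = job.status.completion_time
        if done is None or done > cutoff:
            continue  # still running, or finished too recently
        print(f"{'would delete' if dry_run else 'deleting'} {job.metadata.name}")
        if not dry_run:
            # Foreground propagation also removes the Job's pods and, via
            # ownerReferences, any per-job objects that are owned by it.
            batch.delete_namespaced_job(
                job.metadata.name, namespace, propagation_policy="Foreground"
            )
```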
[06:45:56] kubectl -n mw-script get netpol |wc -l [06:45:56] 1858 [06:46:08] so, another 55 added since I last checked [06:46:15] whatever it is, it is still running [06:46:20] "Working theory: [06:46:22] Some manually run mediawiki jobs spawned thousands of jobs using mwscript-k8s, which created unique Kubernetes network policies for each job, and corresponding NetworkPolicy objects, triggering Calico resource limits [06:46:24] " [06:46:30] feel free to correct, I don't know any details :) [06:47:28] usr/bin/python3 /usr/local/bin/mwscript-k8s -f --comment T363966 -- extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=mrwiki --throttle --video --key=144p.mjpeg.mov --missing [06:47:28] T363966: Videos still unplayable on Safari in iOS 11 and 12 - https://phabricator.wikimedia.org/T363966 [06:47:29] there we go [06:48:21] <_joe_> yeah brooke said she was launching scripts with it on wikitech-l [06:48:34] yeah, I 'll reply there [06:48:40] I have to step out [06:48:48] thanks effie [06:51:16] so action items so far [06:51:33] more observability on mwscript-k8s jobs [06:51:54] a single network policy for all mwscript-k8s jobs? [06:56:33] given that we're clearly out of the woods, know the culprit and are working on actionables, I'm closing the incident [06:56:43] agreed [06:56:48] Thanks for taking over the IC role! [06:57:18] thanks everyone for responding :) [06:57:35] so brooke did announce this on sal, it was just hours before the incident [07:14:59] I 've replied to the wikitech-l thread fwiw [07:43:09] fwiw I've updated https://grafana-rw.wikimedia.org/d/2AfU0X_Mz/jayme-calico-resources to allow selecting the DC and/or cluster instead of just global values and will graduate the dashboard out of user dashboards in a bit [08:04:02] * arturo reads the incident backscroll and learns quite a few things [08:15:25] <_joe_> jayme: maybe rename it too, it's of general usefulness :D [08:15:49] that's what I meant by graduating :) [08:16:29] <_joe_> ahah ok gotcha :) [08:17:01] <_joe_> i thought just moving it out of "user dashboards" [08:17:22] yeah, I wasn't precise...to me that includes removing my name prefix :) [08:22:15] <_joe_> yeah it was kinda obvious, sorry [08:22:36] I posted the summary on the status page [08:23:01] <_joe_> thanks jynus [09:10:17] I am checking rpki2002, that is not actionable by us, right? [09:11:01] Oh, I see it is an old alert. Downtime must have expired [09:13:50] jynus: ah right, we just need to decom 2002, we're now using 2003 [09:13:59] I see, thanks [09:14:19] can I extend the downtime, e.g. for a month? [09:21:07] wow I missed a lot. Thank you everyone for all the great work. FWIW switching over to eqiad requires that prep cookbook but also beforehand all db maint needs to be cancelled too. Thankfully it's not needed but if it'll be, call me and I deal with the mess [09:21:43] I would argue that in an emergency, circular replication is optional [09:22:05] (assuming codfw is being depooled) [09:22:42] yeah, depending on "how emergency" [09:22:45] Amir1, jynus: can we think of an easy way to signal when a main dc is ready to be switched to, or not? 
(ideally: true/false) [09:23:09] question_mark: it should be always, what it is not ready always is to flop in and out [09:23:28] and to stop running maintenance [09:23:33] in a safe way [09:23:34] One sign is that https://orchestrator.wikimedia.org/web/clusters shouldn't have any red or yellow hosts [09:23:39] (for maint) [09:23:53] currently there is a red one for s8 which is the revision table alter table happening [09:23:57] right, so I guess while we're running maintenance, the answer is "FALSE - unless" [09:24:16] https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance [09:24:20] This should help too [09:24:21] right now basically noone is confident to do a switchover unless a DBA is present :) [09:25:01] what I meant is that, should an emergency happened, I still think we could do it at any time [09:25:14] related: I'm working (currently just gathering data) on a dashboard to help oncall people to decide if it's worth to depool/repool a DC or not: https://phabricator.wikimedia.org/T376787 [09:25:21] but do we need DBAs present for it or not? [09:25:24] if someone wants to add some data [09:25:27] *insert slack +1 emoji to Jaime's point* [09:25:45] question_mark: not really, but I understand the confidence issue [09:25:47] question_mark: better to have DBAs but if everything is on fire, nope [09:25:56] actually, I am not so sure [09:26:13] because maybe the cookbook would fail, it supposes a non-emergency [09:26:23] we would need a cookbook with less checks [09:26:52] it does check for masters in the new dc to have catch up from the old ones indeed [09:27:08] but if there was someone around with mw knowledge, it should be able to be done to workaround it [09:27:10] an assumes reachability of the old DC anyway for many things [09:27:31] I would create an emergency failover with less checks [09:27:50] (ignore replication, assume codfw is unusable) [09:28:10] we discussed this in the past, it's a very large and complex problem because you can design it only for some failure scenarios and not all the possible ones [09:28:20] question_mark: so we would lack the automation, but in terms of readyness, the idea from the beginning was to "always be ready" [09:28:49] so I think the problem of automatic switchovers when one of the DCs is not cooperative/unavailable is a separate (and very complex one) [09:29:11] yeah and on those cases, you have way bigger problems than lack of circular replication [09:29:13] but the case we had today, and in a way even this weekend (not necessarily for DB stuff) is, people right now are not sure if a data center is in a state to be switched to or pooled [09:29:18] but in a way, it is actually easier "we swichover no matter the issues" [09:29:33] so perhaps we can think of better ways to capture this with automated systems [09:29:45] until we have that, we won't have the confidence to do it unless most people are present/reachable [09:30:06] (not sure whether switching over would have helped us much today, but that's another matter :) [09:30:13] I would say for automation: blackhole network, ignore alerts, and switch locations with a dedicated emergency cookbook [09:30:48] I am talking about db layer, I cannot speak for other layers [09:30:53] db/mw [09:30:58] slightly unrelated, we also need an emergency primary database switchover too, I started it but haven't finished it yet [09:31:13] the chance of a master going down and not coming back online is not neglible [09:31:18] Amir1: I know and is something we should talk about when you have a moment 
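To make the requested true/false signal concrete, even something as small as the sketch below would give on-callers a single answer before reaching for switchdc. It is only an illustration: the Orchestrator /api/problems endpoint and the schema-change lookup are assumptions stubbed in for the example, not existing tooling.

```python
import requests

ORCHESTRATOR = "https://orchestrator.wikimedia.org"


def ongoing_schema_changes() -> list[str]:
    """Placeholder: would come from whatever tracks in-flight ALTERs."""
    return []


def switchdc_ready() -> bool:
    """Return True when nothing obvious blocks a normal switchover."""
    problems = requests.get(f"{ORCHESTRATOR}/api/problems", timeout=10).json()
    if problems:
        print(f"orchestrator reports {len(problems)} problem instance(s)")
    alters = ongoing_schema_changes()
    if alters:
        print(f"schema changes in flight: {', '.join(alters)}")
    return not problems and not alters


if __name__ == "__main__":
    print("OK to switchdc" if switchdc_ready() else "call a DBA first")
```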
[09:31:28] the main issue would be ongoing maintenace, I'd say, what Amir says [09:31:34] it's in my todo list to talk to you about that script ;) [09:31:43] awesome volans! [09:32:12] as for the automation maybe we should have something that is aable to check for any ongoing maintenance and knows how to stop it safely [09:32:32] volans: the problem is it may not be easy/clean [09:32:37] or gives indications on what to do, like for a schema change maybe we're ok to leave that host out for a while [09:32:57] but indicate to the tool that it needs to stop after that host [09:33:00] if there is an alter table running for 1 hour, it can be cancelled, but it may take a long time to rollback [09:33:15] I don't know if those are run on the secondary on primaries [09:33:27] yup, the current one is taking 1 day and 19 hours, cancelling it gonna take a while [09:33:45] if they are not (or should not be) we maybe should stop doing those so we are available 24/7 [09:34:30] volans: what we should be doing starting today is fix the ROW issue [09:34:31] jynus: I'm not sure, we have a lot of schema changes we need to do, if we do it only when were are around, we won't catch up [09:34:55] Amir1: that is ok, but in that case we should do what mark says and set something as "unable to failover" [09:34:55] I think jaime meant on masters [09:34:59] (specially since due to active/active, we need to depool/repool all replicas) [09:35:17] yes, the only blocker would be the master, we can switchover if it is on 1 replica [09:35:29] if a replica is doing a schema change it will stay depooled and continue, who cares [09:35:38] if you mean masters, we never schema changes on those automatically, it's always done manually [09:35:43] so yeah, knowing we are active active [09:35:54] it's stressful enough on its own [09:35:54] we are mostly 24/7 ready for switchover [09:36:28] <_joe_> so, adding a pre-flight check step to our switchover to check db status automatically would be great, imho [09:36:41] yes [09:36:44] and it can be conservative [09:36:55] volans: Manuel wasn't keen on that during "normal" times, we stop all maint during the prep period but that I think is just a cautious action [09:36:58] if there is any complications, have a block/warning and then indicate "DBAs should get involved" [09:36:58] _joe_: what I mean is switchover as it is now would fail [09:36:59] <_joe_> it should be volans-strict yes :) [09:37:04] but then at least that gives confidence to everyone the rest of the time [09:37:13] as I am assuming primary is unreachable [09:37:31] I want a failover script, that sure, does other checks [09:37:40] but don't assume primary is accessible [09:38:03] (but still does it cleanly) [09:38:07] it might not be the case, e.g. 
if today's issue required us to do a dc switchover [09:38:23] but sometimes yes, master won't be available [09:38:32] yes, but we could do a switchover without setting up circular replicatin beforehand [09:38:42] yup [09:38:48] that is not contemplated in the regular one [09:39:06] so it should just drop some checks, but it would be very similar [09:39:21] yes, but the case of having a master or the entire dc unavailable isn't really what we're talking about here [09:39:26] otoh, if you do that, you'll need a DBA to clean up the secondary dc databases anyway [09:39:28] that's something we should try to solve also, but much more complex :) [09:39:49] let's focus on the basics, we have two data centers up and running, dbs available everywhere, but there app level issues [09:39:51] sadly circular replication setup takes a lot of time, and DBAs don't like it because it prevents maintenace [09:39:51] yeah [09:40:04] that's what happened today, problems in k8s infrastructure [09:40:06] question_mark: I am literally answering that [09:40:13] but noone dared switching over as not sure if the DBs were ready [09:40:39] we create a new cookbook for a fast switchover, withot needing dba intervention [09:41:33] I think Amir got what I meant, if we meet question_mark and volans you will get what I mean, too (it is not scare or dangerous or requires a lot of work) [09:41:42] *scary [09:41:55] just not easy to explain in text [09:41:59] ok [09:42:02] without drawings :-D [09:42:11] it's clear to me [09:42:41] volans: and should be just chaning a couple of operations of the existing cookbook- drop 1 check and stop replication instead [09:43:07] I think it might even be just a flag on the existing switchdc cookbooks [09:43:16] that will adapt/skip some steps [09:43:21] sounds good to me [09:43:30] and then we can create a nice dashboard with "ok to use switchdc" "ok to use fast switch dc" [09:43:38] yeah, that works too [09:43:52] switchdc and switchdc --fast :) [09:43:58] yes we need some current maintenance ongoing check [09:44:09] yep, sounds like a plan [09:44:32] thank you :) [09:44:44] although I wonder if other layer will have some issues with an emergency [09:44:50] *layers [09:44:54] it's not unique to DBs for sure [09:45:01] ideally we would have that everywhere [09:45:02] migrating maintenance jobs seems slow [09:45:23] same sort of thing is true for caching dcs and doing maintenance there or being down, like we discussed on monday [09:46:01] I wonder what is a good forum to explain the details around db architecture ? 
[09:46:07] related to switchovers [09:46:50] as I think for most people those are like magic "we don't touch it, only the dbas do" :-/ [09:48:02] volans: we would also need a followup script for setting up circular replication aferwards [09:48:03] that is exactly the problem, even with a lot of effort you're not going to be able to explain the db architecture and their implications to a group of 50 SREs to a level that they feel confident they can do complicated things [09:48:19] what they need is "is it save to run this switchover script right now myself, or do I need to contact a DBA" :) [09:48:35] not at low level, but I think we can do it at high lelvel + cookbooks [09:48:43] yeah that never hurts [09:49:07] It doesn't help we are missing our GOAT DBA [09:49:26] :-) [09:50:08] plus a dashboard [09:51:34] I see a status page with "replication codfw -> eqiad (can only run switch dc --emergency)" / "circular replication codfw -> /<- eqiad enabled (can run switch dc)" [09:52:02] i think the cookbook should tell/warn you (with override available) :) [09:52:29] they already do :-), those are the checks I meant to disable for a fast version [09:52:30] (sure, let's also have a dashboard, but not depend on looking at it) [09:53:06] [a fast version that don't exist yet] [09:53:18] *doesn't [09:56:11] as a note, I haven't been involved with db stuff for some years, but feel we can do better as I designed the arch intra-dcs (and we are progressing, see automatic setup of circular replication) [10:04:07] so, switchdc -> fails certain checks -> do you want to do it in an emergency (yes/no) could be a workflow [10:20:03] hello folks! Is it a good time for a swift-proxy roll restart? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078380 [10:20:34] elukey: I am not aware of any ongoing blocker [10:21:22] so +1 from me [10:21:38] super thanks! It is just to avoid rate limit for docker registry, shouldn't touch anything else [10:21:41] proceeding [10:40:30] ping us when done :-) [10:49:18] done! [10:49:34] nothing on fire afaics from https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-3h&to=now-1m [10:54:02] yep, I noticed that there was some small amount of errors on swift during the outage [11:05:39] those errors line up quite closely with thumbor 5xx rate, so thumbor contributed at least some of the swift errors [11:06:03] <_joe_> hnowlan: I would guess they're the main source of swift errors :) [11:06:10] <_joe_> at least at the time [11:06:14] ah, I see [11:06:19] <_joe_> and some were related to maps maybe [11:06:37] I was not understanding at first the relation with swift proper [11:06:51] vs full upload stack [11:08:30] for monitoring purposes they're one and the same, a 5xx from thumbor will always look like a 5xx on request to swift [11:10:29] I get it now [11:10:38] I was confused for a while [12:33:48] I seem to be back on a stable connection [14:03:16] swfrench-wmf: jhathaway I believe you are both fully aware, but see updates on: https://docs.google.com/document/d/1w5x8-_KTEzL1ARuyVeROlslAMNI68g1JRqVyqGKr_jY/edit#heading=h.95p2g5d67t9q [14:03:48] jynus: thanks! yes, alas, quite aware :) [14:06:51] thanks jynus [14:07:53] debmonitor will be unavailable for ~ 5 minutes [14:36:40] puppetboard will be unavailable for ~ 5 mins [14:53:16] Hi! quick cookbooks/clush related question. When defining a pre/post script in a SREBatchRunnerBase cookbook, can I assume that $(..) commands would be execued as in bash? 
Ex: when executed on cephosd1001.eqiad.wmnet, would `return ["ceph osd set-group noout $(hostname)"]` excute the `ceph osd set-group noout cephosd1001` command? [14:54:08] brouberol: it depends [14:54:38] if you override pre_scripts and post_scripts then yes [14:54:44] it runs confirm_on_failure(hosts.run_async, script) [14:54:45] I do, yes [14:55:19] I have a WIP patch I can link to if that can help [14:55:26] your $hostname will be resolved as the cumin host [14:55:35] because is not escaped to be send over [14:55:37] as is [14:56:04] you can test your patch with https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging [14:59:12] indeed, I am! But the command is logged before execution, meaning the $(command) appears in the log message. [14:59:34] actually sorry I misremembered one detail [14:59:40] r.run_sync("echo $(hostname)") [14:59:47] returns the hostnmae of the target host [14:59:56] however, I'm able to see the following ceph log, which definitely validates your point [14:59:56] Oct 09 14:58:59 cephosd1001 ceph-mon[1370]: mon.cephosd1001 mon.0 244959 : from='client.? 10.64.130.13:0/1790667226' entity='client.admin' cmd='[{"prefix": "osd set-group", "flags": "noout", "who": ["cephosd1001"]}]': finished [15:00:12] cf the `who: [cephosd1001]` [15:00:13] what hostname do you want there? [15:00:53] cephosd1001. That ceph log confirms that the $(command) is executed on the target host and does what I want it to do! [15:00:55] all good, thanks! [15:01:13] if you need the cumin host you can use spicerack.current_hostname [16:01:10] please hold off on running authdns-update or netbox DNS cookbook [16:01:34] resolving a broken state on a DNS host that was rebooting and failed for some reason (that I will figure out later once I restore it) [16:17:03] mutante: hm.. the php8.3 patch still isn't live for codesearch despite git pull and docker build/restart. Is it running on a different instance and not codesearch9? [16:17:26] I looked at it for an hour or so yesterday and gave up eventually [16:17:36] now, the instance appears to be down as of ~2h ago. [16:17:48] https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=codesearch&from=now-2d&to=now [16:17:55] can't reach it over ssh. [16:18:07] volans: around? [16:18:16] I need your +1 for a --force commit on sre.dns.netbox [16:18:25] sukhe: shoot [16:18:26] in a meeting [16:18:28] going to --force 95858bae44a2ccae5e7fb1fe793cd3bbc7ed9c6b [16:18:28] but I can look [16:18:37] what happened? [16:19:02] volans: dns2004 was depooled for a reboot and a race condition that we haven't figured out yet [16:19:08] but want to get this to a working state first and then look [16:19:29] didn't happen with the other hosts for some reason but just this [16:20:07] 95858bae44a2ccae5e7fb1fe793cd3bbc7ed9c6b approved sukhe [16:20:09] is the last one [16:20:11] AFAICT [16:20:11] thanks <3 [16:25:17] ok all resolved [16:25:30] topranks: your changes should be live. sorry about the wait :) [16:25:43] I will try to figure out now what happened [16:26:06] thanks guys... sorry for the headache! [16:26:40] sukhe: so you ran the cookook with --force to force sync the updates to every dns box? [16:26:41] not your fault at all :D [16:26:43] makes sense [16:26:48] (or at least the fix makes sense) [16:26:57] topranks: yep, just need to make sure that the hash matches
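Closing the loop on the cookbook question from 14:53: because pre/post scripts are passed as plain strings to hosts.run_async, shell substitution happens on the target hosts, so $(hostname) expands to each Ceph node's own name rather than the cumin host's. A minimal sketch, assuming SREBatchRunnerBase is importable from cookbooks.sre as in the WMF cookbooks repo; the class name and docstring here are made up for illustration.

```python
from cookbooks.sre import SREBatchRunnerBase


class CephNooutRunner(SREBatchRunnerBase):
    """Set/unset the noout group flag for each host while it is worked on."""

    @property
    def pre_scripts(self):
        # $(hostname) is left unexpanded for the remote shell: on
        # cephosd1001 this runs `ceph osd set-group noout cephosd1001`.
        return ["ceph osd set-group noout $(hostname)"]

    @property
    def post_scripts(self):
        return ["ceph osd unset-group noout $(hostname)"]
```

If the cumin host's own name is ever needed instead, spicerack.current_hostname (mentioned at 15:01) is the way to get it without relying on shell expansion.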