[07:18:52] <brouberol> welcome back jayme. I'll have a look!
[07:21:05] <jayme> thanks :)
[07:57:03] <brouberol> I'm caught up after a couple of days of OOO. can you point me to a gerrit URL showing the failure?
[08:08:39] <brouberol> > You may run rake check_deployments[diff,'dse-k8s-services/postgresql-airflow-analytics-product'] to quickly repdoduce
[08:08:39] <brouberol> Sigh, seems like I'm not caffeinated enough
[08:09:03] <jayme> won't be very helpful though as it just says 'Template did not render correctly (HEAD of origin/master).'
[08:09:07] <jayme> https://integration.wikimedia.org/ci/job/helm-lint/24562/consoleFull
[08:16:22] <brouberol> I have an idea what might be causing this: we have isolated the PG releases in their own helmfile, itself with an associated kubeconfig file owned by root:root, to avoid accidental deletion by an unprivileged user
[08:16:35] <brouberol> so there might be a missing yaml file in CI, that we need to download
[08:18:39] <jayme> that sounds weird. Download from where?
[08:20:33] <brouberol> not sure, I'm struggling to run the rake tasks atm
[08:21:20] <brouberol> either in or out of docker, nothing seems to work
[08:23:44] <brouberol> alright, I think I should be able to reproduce. I'll report back when I know more
[08:24:40] <brouberol> hmm, it seems to work, with a diff regarding a newline
[08:24:48] <brouberol> would that newline diff fail the job though?
[08:26:12] <brouberol> nvm me. I had a rough night, I can't seem to be reading correctly this morning.
[08:26:12] <brouberol> > +Template did not render correctly (HEAD of local branch).
[08:36:12] <brouberol> hmm, it's difficult to debug this without any additional debug information
[08:42:46] <brouberol> when running `rake "check_deployments[diff,dse-k8s-services/postgresql-airflow-analytics-product]"`, I'm seeing
[08:42:46] <brouberol> helmfile lint output:
[08:42:46] <brouberol> ----------------
[08:42:46] <brouberol> err: no releases found that matches specified selector() and environment(aux-k8s-eqiad), in any helmfile
[08:44:32] <jayme> I don't see that one :)
[08:45:26] <brouberol> do you see anything more than "Template did not render correctly" ?
[08:46:02] <jayme> no, unfortunately the CI does not output the actual error. CI will do something like 'helmfile -e dse-k8s-eqiad template' on both git revisions
[08:46:23] <jayme> execution error at (cloudnative-pg-cluster/templates/cluster.yaml:96:5): The s3.accessKey and s3.secreyKey values were not provided
[08:46:56] <brouberol> ok so that indicates a missing secret file
[08:46:57] <jayme> is what I get. So I would assume you need to provide some (additional) fixtures
[08:47:39] <brouberol> that's what I meant by "missing some YAML file that we might need to download"
[08:48:30] <brouberol> the file itself is /etc/helmfile-defaults/private/dse-k8s_services/postgresql-airflow-platform-eng/{{ .Environment.Name }}.yaml
[08:49:12] <brouberol> brouberol@deploy1003:~$ sudo cat /etc/helmfile-defaults/private/dse-k8s_services/postgresql-airflow-platform-eng/dse-k8s-eqiad.yaml
[08:49:12] <brouberol> ---
[08:49:12] <brouberol> s3:
[08:49:12] <brouberol>   accessKey: XXX
[08:49:12] <brouberol>   secretKey: XXX
[08:49:12] <jayme> I think you can provide those values in .fixtures.yaml
[08:49:52] <brouberol> nice, I'll whip up a patch for that
[08:49:58] <brouberol> how did you get the error message btw?
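The failure discussed above comes down to a required-value check: the chart refuses to render when the S3 credentials are empty, and in CI the private values file that normally supplies them is absent. The Python sketch below only models that logic and is not the chart's template nor the CI code. The function names, the empty-string defaults and the dummy fixture values are all assumptions made for illustration; the only detail taken from the log is the `s3.accessKey` / `s3.secretKey` structure.

```python
import copy


def deep_merge(base, override):
    """Return a new dict in which override's keys win; nested dicts are merged."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def missing_s3_credentials(values):
    """List the required s3 keys that are absent or empty in the merged values."""
    s3 = values.get("s3") or {}
    return [key for key in ("accessKey", "secretKey") if not s3.get(key)]


# Assumption: without the private file, the credentials are simply empty.
chart_defaults = {"s3": {"accessKey": "", "secretKey": ""}}

# Hypothetical fixture content: placeholder values, never real secrets.
fixture_values = {"s3": {"accessKey": "dummy-access-key", "secretKey": "dummy-secret-key"}}

print(missing_s3_credentials(chart_defaults))
print(missing_s3_credentials(deep_merge(chart_defaults, fixture_values)))
```

With only the defaults, both keys are reported missing, which corresponds to the render error; once fixture values are layered on top, nothing is missing and the template can render in CI.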
[08:50:01] <jayme> so helmfile.d/dse-k8s-services/postgresql-airflow-analytics-product/.fixtures.yaml
[08:50:16] <jayme> I changed CI code, which forces a 'rake all' run
[08:50:19] <brouberol> I got lost in rake/ruby, which I'm not super familiar with
[08:50:53] <brouberol> aah, we had these .fixtures.yaml files, we just didn
[08:50:59] <elukey> o/
[08:51:04] <brouberol> 't port them to the new helmfile PG dir
[08:51:09] <brouberol> ./facepalms
[08:51:15] <elukey> as FYI I started a chain of changes to upgrade all charts to mesh.configuration:1.13
[08:51:18] <elukey> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144454
[08:51:21] <elukey> first batch of 20
[08:51:32] <elukey> so if you need to do something similar, please sync with me first :D
[09:03:53] <brouberol> jayme https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144477 seems to work
[09:06:38] <jayme> brouberol: the diff is expected because the pg setup moves from the airflow deployments to the postgres-airflow ones?
[09:14:40] <brouberol> yep
[09:15:01] <jayme> cool. +1 then
[09:15:02] <jayme> thanks
[09:15:10] <brouberol> we separated the airflow and PG deployments, to harden the permissions on the airflow kubeconfig files, to make them only deployable/deletable by SREs
[09:15:16] <brouberol> thanks for the help!
[11:21:08] <jayme> brouberol: dse-k8s-services/airflow-wmde/dse-k8s-eqiad seems to still be broken
[11:21:18] <jayme> https://integration.wikimedia.org/ci/job/helm-lint/24687/console
[11:39:44] <brouberol> hmm, that's odd
[11:40:34] <brouberol> oh, I see why. I'll send a patch to btullis
[11:43:36] <brouberol> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144514 was merged.
You should be fine after a rebase jayme
[12:07:04] <elukey> brouberol: o/ lemme know if https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144454 is ok when you have a moment :)
[12:07:22] <brouberol> looking
[12:07:24] <elukey> I'll use it as the first real test if you are ok (and if so, lemme know if we can roll it out)
[12:07:52] <elukey> the new config auto injects a custom histogram config for all the envoys basically
[12:08:07] <elukey> so we can reduce what we ingest on Prometheus
[12:08:36] <brouberol> this is only changing the statsd->prom exporter config for envoy itself, right?
[12:10:31] <elukey> the main change yes, but it may bring more due to the module update
[12:10:57] <elukey> ah no wait it is not statsd->prom, it is related to envoy's histogram bucket config
[12:11:25] <elukey> the rest in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144454/1/charts/airflow/templates/vendor/mesh/configuration_1.13.0.tpl is basically what is already running elsewhere
[12:12:11] <elukey> in the diff it is under "histogram_bucket_settings"
[12:51:09] <brouberol> yep, that looks good to me, in the sense that I trust you on the histogram config and I'm not seeing any airflow config being affected
[12:51:23] <brouberol> do you want to try this out on a specific airflow instance?
[12:52:55] <elukey> ideally yes, no idea what's best etc..
[12:53:57] <brouberol> I can checkout your patch locally and deploy it on airflow-test-k8s if you want
[12:57:24] <elukey> that would be great thanks!
[12:57:47] <elukey> the test is to check whether the envoy metrics have the histogram buckets stated or not
[12:59:45] <brouberol> sure, let me do this right now
[13:03:09] <elukey> <3
[13:10:50] <brouberol> I have to perform a bit of maintenance in that ns, I'll ping you when I get to deploying the patch
[13:11:08] <elukey> yes please but if you have time, I didn't mean to brutally nerd snipe you :D
[13:11:31] <brouberol> np!
[13:20:23] <brouberol> hmm somehow, I'm only seeing a diff related to the chart version
[13:22:29] <elukey> when it happens to me, I just run puppet (that forces some gz creation of new chart's versions etc..)
[13:22:46] <elukey> in this case, we didn't merge so maybe it doesn't work
[13:22:57] <elukey> it should be ok to merge and then test directly in my opinion
[13:24:43] <brouberol> I tweaked the helmfile so that it would use a locally checked out version of the chart with your changes in them, so that should work
[13:25:01] <brouberol> but sure, if you want to merge and deploy, I'm not seeing any change atm, so I'm ok with that!
[13:49:00] <elukey> merged!
[13:52:53] <brouberol> alright, I'll deploy
[13:53:41] <brouberol> ok, I'm indeed seeing a config change this time
[13:54:44] <brouberol> aaand it's deployed
[13:56:34] <elukey> nice!
[13:56:54] <elukey> what namespace? I'll check the envoy metrics after some meetings
[14:40:24] <brouberol> airflow-test-k8s
[14:43:12] <elukey> yep just tested on dse-worker 1009, it seems to be working! I'll ask Filippo to confirm
[14:44:56] <brouberol> nice!
[14:47:23] <elukey> Filippo confirms, it is safe to be deployed in other airflows! Thanks a lot
[14:49:31] <brouberol> anytime :)
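The verification described above ("check whether the envoy metrics have the histogram buckets stated or not") can be approximated by reading a metrics dump in the Prometheus text exposition format and listing the `le` bucket bounds per histogram. The sketch below is one possible way to do that, not the check that was actually run on the worker. The metric name and bucket values in `SAMPLE` are made up for illustration, and the regexes are a simplification that assumes label values contain no `}` character.

```python
import re
from collections import defaultdict

# One histogram bucket sample, e.g.: some_metric_bucket{le="0.5"} 3
BUCKET_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)_bucket\{(?P<labels>[^}]*)\}\s+\S+'
)
# The "le" label holds the upper bound of the bucket.
LE_RE = re.compile(r'(?:^|,)\s*le="([^"]+)"')


def bucket_bounds(metrics_text):
    """Map each histogram name to the sorted list of its distinct bucket bounds."""
    bounds = defaultdict(set)
    for line in metrics_text.splitlines():
        match = BUCKET_RE.match(line.strip())
        if not match:
            continue  # comments, counters, _sum and _count lines are skipped
        le = LE_RE.search(match.group("labels"))
        if le:
            bounds[match.group("name")].add(float(le.group(1)))
    return {name: sorted(values) for name, values in bounds.items()}


# Hypothetical sample; real envoy metric names and bounds will differ.
SAMPLE = """
# TYPE envoy_example_rq_time histogram
envoy_example_rq_time_bucket{cluster="a",le="5"} 1
envoy_example_rq_time_bucket{cluster="a",le="50"} 4
envoy_example_rq_time_bucket{cluster="a",le="+Inf"} 6
envoy_example_rq_time_sum{cluster="a"} 120
envoy_example_rq_time_count{cluster="a"} 6
"""

for name, values in bucket_bounds(SAMPLE).items():
    print(name, values)
```

Comparing the returned bounds before and after the deploy would show whether the configured "histogram_bucket_settings" took effect.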