[02:46:49] FIRING: [2x] PuppetFailure: Puppet has failed on thanos-be2006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:49:11] elukey: db2226 seems to have a degraded RAID but I cannot find any task automatically created for it, I am wondering if there've been some modifications to that script that could have broken that part? https://phabricator.wikimedia.org/P77203
[04:50:13] Created https://phabricator.wikimedia.org/T396319
[06:46:49] FIRING: [2x] PuppetFailure: Puppet has failed on thanos-be2006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:49:04] marostegui: I'm not sure if l.uca will be around today. I tried to repro but as you saw it all worked fine and the task was created. Why it didn't in the first place needs a bit more investigation.
[06:50:04] if you don't mind a bit of IRC spam I can force the icinga status to go OK and then let it automatically re-trigger and see if it opens a new task.
[07:01:05] volans: Yeah, I saw the task and updated it
[07:01:19] volans: I don't mind the spam no
[07:01:33] great, thx, will try to repro now
[07:01:40] thank you
[07:23:56] Starting a schema change on s6
[07:25:54] I've silenced that thanos alert for 3 days
[07:34:07] marostegui: End-of-spam, I couldn't make it work even changing the description, added o11y too, task updated, alert back in its original critical state, I've re-acked it linking the task
[07:38:15] Yeah I saw the task
[07:38:23] Thanks for working on it volans
[07:38:33] I just learned that T395688 already existed
[07:38:33] T395688: Icinga event handler 'raid_handler' failed to create a Phabricator task - https://phabricator.wikimedia.org/T395688
[07:43:49] marostegui, volans o/ - thanks for working on it, I'll check the follow-up tasks
[07:43:52] sigh
[07:46:11] elukey: Grazie bambino
[07:46:23] lol
[10:11:56] marostegui: Amir1: can I start s4 in codfw?
[10:12:20] yep
[10:13:03] tnx, started
[10:16:43] marostegui: can I start it in eqiad?
[10:17:01] s4?
[10:21:08] yes, codfw is done
[10:21:14] go for it
[10:21:56] ok tnx
[10:47:30] elukey: https://phabricator.wikimedia.org/T396340 this one just arrived too, is this part of the testing?
[10:47:43] yes yes!
[10:47:50] I think I found the issue
[10:47:52] Thanks, I will merge it with the other one
[10:53:48] Just saw the fix...haha
[10:54:27] yeah and Riccardo got it right at first, but tried "-" which caused a similar issue
[10:54:48] then I tried to remove "-" but it didn't work, and that was because our manual test of the handler had changed the owner of the log file
[10:55:52] marostegui: going to merge, run puppet on alerts1002 and hopefully we'll see the last task being cut
[10:56:01] excellent, thank you!
[10:56:38] I couldn't think of a better way to start my week than debugging an icinga parsing issue for hours
[10:58:30] <3 thanks
[10:58:36] elukey: You are welcome
[11:06:58] marostegui: https://phabricator.wikimedia.org/T396341
[11:06:59] \o/
[11:09:08] elukey: Nice!!!
[11:09:14] I will close it then as a duplicate
[11:09:26] Ah you did already
[11:09:27] Thanks
[11:10:03] I didn't
[11:10:16] I am closing the other phab tasks (created by humans :D)
[11:10:20] yeah, I got the notifications from the other task
[11:10:23] I will close this one
[12:07:59] FYI I'm starting to upgrade clouddbs to mariadb 10.11, starting from clouddb1015 T394372
[12:07:59] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372
[12:10:40] dhinus: Great!
[12:11:41] if I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148297 it will not actually install the package, until I remove the 106 package, correct?
[12:12:39] or should I merge only _after_ uninstalling the previous package?
[12:12:52] (splitting the patch into 1 patch per host)
[12:13:03] dhinus: It will attempt to install it, yes
[12:13:09] And probably puppet will fail
[12:13:18] ok so I will split the patch
[12:13:19] So I'd recommend doing it only on the hosts you're going to do now
[12:13:21] Yeah
[12:13:24] thanks
[12:58:55] marostegui: can I get a +1 on the stack of patches? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148297
[12:59:15] done
[12:59:27] thanks!
[13:00:56] dhinus: I've checked and +1ed all of them
[14:09:26] I upgraded clouddb1015, but when I restarted mariadb it wasn't using the right socket path
[14:09:52] it's using /run/mysqld/mysqld.sock instead of /run/mysqld/mysqld.s4.sock
[14:10:22] I started it with "systemctl start mariadb@s4"
[14:17:32] it did read the port correctly though: "mysqld[1299]: Version: '10.11.13-MariaDB' socket: '/run/mysqld/mysqld.sock' port: 3314 MariaDB Server"
[14:17:43] which is strange because I think the port setting and the socket setting are in the same config file
[14:17:58] /etc/mysql/mysqld.conf.d/s4.cnf
[14:20:27] I'll try stopping the unit and starting it again
[14:22:11] interesting, restarting the unit did fix it: "mysqld[6974]: Version: '10.11.13-MariaDB' socket: '/run/mysqld/mysqld.s4.sock'"
[14:27:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:31:15] clouddb1015 is upgraded and repooled!
[14:59:28] marostegui: referring to https://phabricator.wikimedia.org/T384212#10862972 where can I put the passwords for the users in the private puppet repo?
[15:00:03] e.g. in hieradata/role/common/deployment_server/kubernetes.yaml as they are needed by the k8s pod, or more like under mysql as they are db-related creds?
[15:00:17] federico3: Probably put it in the same place where Amir1 put the switchovermaster and all the other tooling, I am not sure where that is
[15:00:27] ok
[15:00:49] federico3: But for now, you can leave it in my home on cumin1002
[15:00:54] and I can start creating the users etc
[15:00:59] I'm not sure it's in the priv repo
[15:01:37] Can you put in the task the username you want, the range and the grants?
[15:01:46] So it is not scattered across multiple comments
[15:02:16] yet I see other teams putting mysql passwords under hieradata/role/common/deployment_server/kubernetes.yaml
[15:02:34] 3 usernames, do we have a policy on their names?
[15:03:02] 3?
[15:04:48] yep, I'll write a diagram
[15:08:09] marostegui: is there no automation that will create the users on MariaDB from the secret git repo?
[15:08:21] federico3: no
[15:08:58] federico3: why do you need 3 RO users?
[15:09:07] I suggest we give the users some very grep-friendly names
[15:10:14] 1 read-only that can run SHOW REPLICA STATUS and read `heartbeat` on core DBs
[15:10:14] 2 with read-write access only on the `zarcillo` DB on db1215 and on `zarcillo_preprod`, respectively
[15:11:42] maybe we can use the same RW user for both dbs?
[15:23:01] federico3: Let's continue this chat on the task, so we can have this conversation async
[15:24:03] I was looking at how other projects do that but I'm not seeing "standard" patterns
[15:26:00] the point of having 2 users was to enforce a split between prod and preprod so there's no risk of touching the wrong thing even by mistake, but maybe we are ok with just one user
[15:27:46] ah I see what you mean
[15:28:07] Ok, let's do that. Please add all the info on the task (usernames, ranges, grants etc) and I will get it done
[15:30:15] I think we can go for hieradata/role/common/deployment_server/kubernetes.yaml, how about doing tokenization in the usernames and passwords?
[15:33:26] federico3: if you go for plain hiera you can't pick a random file, they are loaded based on https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/%2B/production/modules/puppetmaster/files/hiera/production.yaml
[15:34:12] random file? I'm not sure I follow
[15:34:30] does your software have the deployment_server role?
[15:34:31] I'm talking about putting a subtree under zarcillo at the bottom of hieradata/role/common/deployment_server/kubernetes.yaml
[15:35:39] it has its entry under hieradata/role/common/deployment_server/kubernetes.yaml
[15:37:16] you mean under profile::kubernetes::deployment_server_secrets::services?
[15:37:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:37:45] hm, there are both sections in hieradata/role/common/deployment_server/kubernetes.yaml
[15:38:10] see also https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Add_private_data/secrets_(optional)
[15:38:12] profile::kubernetes::deployment_server_secrets::admin_services:
[15:38:14] and
[15:38:17] profile::kubernetes::deployment_server_secrets::services:
[15:39:04] I don't think you need admin_services
[15:39:15] but check with serviceops if in doubt
[15:39:56] is the service running in the aux-k8s cluster?
[15:40:00] yes
[15:40:09] is the service running in the aux-k8s cluster? <- yes to this I mean
[15:40:33] there is an aux-k8s block towards the end for jaeger for example
[15:40:38] at the very bottom of the file
[15:40:46] I think you should put it there
[15:40:48] I was going to say the same, just similar to jaeger
[15:41:14] I'm talking about putting a subtree under zarcillo at the bottom of hieradata/role/common/deployment_server/kubernetes.yaml
[15:41:28] that wasn't clear :)
[15:42:17] admin_services is definitely something not needed; if you need to have pods running with env variables that reference k8s secrets, adding something under config->private is the way to go
[15:42:47] otherwise simple secrets is fine as well
[15:45:00] some services appear to have a structure:
[15:45:00] "secrets" key with a list of items
[15:45:30] keep in mind that those will be rendered in a separate helmfile config on deploy1003, and a template is needed to load them up and create resources etc.
[15:45:33] each item has "name", "data", sometimes "type" and "annotations"
[15:46:18] ...with data being key/values
[15:47:18] yep yep, what I was saying is that adding them to puppet private will create files like /etc/helmfile-defaults/private/aux-k8s_services/jaeger/aux-k8s-eqiad.yaml on deploy1003
[15:47:33] helmfile.yaml is configured to look into that usually
[15:47:51] but the chart needs to be able to do something with secrets, like creating a k8s Secret resource etc.
[15:48:22] iirc that's the k8s native Secret but many services are not using it
[15:49:13] sure, what I mean is that the chart needs to have a template that looks for secrets and renders them as k8s Secret resources
[15:50:31] for example, python-webapp has templates->secret.yaml, but that one afaics is related to config->private
[15:50:37] I'm aware, I'm looking to see if the python-webapp chart template has it already built in
[15:51:10] aha
[15:51:22] # prod-specific secrets, controlled by SRE
[15:51:22] - "/etc/helmfile-defaults/private/aux-k8s_services/zarcillo/{{ .Environment.Name }}.yaml"
[15:51:28] exactly
[15:51:37] could this be automatically populated with all available secrets?
[15:51:56] yes, whatever you put in hieradata/role/common/deployment_server/kubernetes.yaml
[15:52:00] gets rendered there
[15:52:06] see the jaeger example above
[15:53:16] what I don't recall is if we have something specific that picks up a plain "secret" config and creates k8s secrets
[15:58:31] federico3: I think that the best way for you is to use config->private, that will automatically create the secrets for you, and expose env variables referencing them (that you can skip or use)
[16:01:21] let me update the wiki docs
[16:01:26] ahhhh no wait
[16:01:37] uh?
[16:01:46] I totally forgot about the "secrets" chart
[16:01:58] there you go yes
[16:02:32] to recap, two ways:
[16:02:41] 1) config->private etc. as we discussed
[16:03:13] 2) you add to helmfile.yaml the "wmf-stable/secrets" chart (see what jaeger does) and then you can use the "secrets:" config
[16:03:32] the only difference is that the former also creates the env variables that pods can reference
[16:07:43] looking good? https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Add_private_data/secrets_(optional)
[16:15:15] I think it's missing the last two bits that I added above, which could drain a lot of time if one doesn't know
[16:15:51] if you simply add "secrets" without the related chart in helmfile.yaml the config won't do much
[16:16:13] I can add all the bits later on if you want
[16:16:26] yes, feel free
[16:54:05] marostegui: I created the file in your home, but what ipaddr range do we want to enable? All of the k8s aux cluster?
[16:54:16] federico3: you tell me :)
[16:54:36] I'd like to limit it as much as possible, by any definition of "as possible"
[16:55:29] federico3: As I've mentioned before, let's please gather all this info on the task: ranges, usernames, grants etc
[16:55:49] (except the password :D)
[16:56:06] the passwords are in your home dir, the other stuff I'll put in the task
[16:56:27] I'm looking at further reducing the ipaddr range where possible
[16:57:22] thank you
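To make the two options from the recap above concrete, here is a minimal sketch of what the zarcillo private values could look like once rendered to something like /etc/helmfile-defaults/private/aux-k8s_services/zarcillo/aux-k8s-eqiad.yaml. The key names, secret name and placeholder values are illustrative assumptions, not the actual WMF layout; the real structure should be checked against the jaeger block in kubernetes.yaml and the wikitech page linked above.

```yaml
# Hypothetical sketch only: key names, secret name and values are placeholders.

# Option 1: config -> private, which the chart's secret template turns into a
# k8s Secret and also exposes to the pods as env variables.
config:
  private:
    zarcillo_db_password: "dummy-not-a-real-password"

# Option 2: a plain "secrets" list, which only does something if helmfile.yaml
# also pulls in the wmf-stable/secrets chart (as jaeger does) to render these
# items as k8s Secret resources.
secrets:
  - name: zarcillo-db-credentials        # illustrative name
    data:                                # key/value pairs, per the discussion above
      username: zarcillo_ro
      password: "dummy-not-a-real-password"
```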
[20:17:32] hello, data-persistence - question about alert routing:
[20:17:32] * we recently added a routing rule [0] and receiver for data-persistence "task" severity alerts, as part of the migration of periodic jobs (e.g., the parsercache purge jobs) to k8s
[20:17:32] * as configured, this routes all tasks to #DBA. however, there will soon be other data-persistence alerts with task severity, which do not make sense to send there.
[20:17:32] given that, would it be acceptable to switch the target project to #data-persistence?
[20:17:32] if not, we can overcome that with some additional configuration complexity, but I figured I'd ask first :)
[20:17:32] [0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135418
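For context on the routing question, a rule of roughly this shape is presumably what [0] added; this is a minimal Alertmanager-style sketch, with the receiver name and label values as assumptions rather than the actual WMF configuration in that change.

```yaml
# Hypothetical sketch of a route sending data-persistence "task" severity
# alerts to a Phabricator-task receiver; names and labels are illustrative.
route:
  routes:
    - receiver: phabricator-data-persistence   # assumed receiver name
      matchers:
        - team="data-persistence"
        - severity="task"
```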