[07:59:50] Morning!
[08:04:51] greetings
[08:35:45] morning!
[09:53:07] dcaro: just saw that the alerts ObjectStorageSizeQuotaFull are working, nice!
[09:53:28] yep :), I created a couple tasks and commented in the silence
[09:54:00] one downside is that they might stay in that disabled state indefinitely, though I guess it does not bother/pollute too much the view
[09:55:58] I wonder if we could remove team=wmcs and use project={projectname} instead
[09:56:13] like we do for example for https://alerts.wikimedia.org/?q=project%3Dwikistats
[09:58:11] that one comes from metricsinfra
[09:58:24] yes, but we can still tag however we like, don't we?
[09:59:14] we could potentially have alerts from prod being tagged with "project=foobar" and without any "team", though that might break some convention
[09:59:17] it requires changing the prometheus rules in production to handle the non-production alerts/tags, users won't be able to see them though.
[09:59:35] or we can just add the extra label I guess xd
[10:00:07] yeah forgot alerts.w.o is behind ldap auth
[10:00:14] so users will not see them
[10:00:33] could still be useful for admins to see "all alerts for project x"
[10:01:12] admins means us right?
[10:01:29] alerts.w.o alerts can be shown on grafana.w.o btw
[10:01:38] nice
[10:01:47] dcaro: yes admins meant us
[10:01:55] not project admins
[10:02:58] godog: oh, for some reason I thought grafana.w.o was private :/, that simplifies things, we can do something else and expose that info
[10:03:39] indeed
[10:07:16] hmm.... we should probably not just remove the team though, that might confuse the sre oncall
[10:07:33] yes you're right that would default to sre I think
[10:08:41] I think it might be enough to expose the quota in a graph and document it?
[10:08:56] (maybe add the link from the tool overview in the grafana.wmcloud)
[10:09:09] yes maybe that's a good solution
[10:09:12] to the grafana.wikimedia dashboard for quotas
[10:09:41] maybe we can add a link in horizon as well
[10:10:05] or just a good wiki page for project admins with all the links
[10:14:51] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1184494
[10:42:22] * dcaro lunch
[11:30:23] ^ready for review :)
[11:31:01] I also have https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184484 , adding the `disable-ssl = true` to the mysql client config
[12:11:04] LGTM
[12:11:50] I was looking at duplicated tasks for PuppetDisabled https://phabricator.wikimedia.org/T403515 and https://phabricator.wikimedia.org/T403517
[12:11:51] thanks! I'll test in toolsbeta first
[12:12:03] is that intended?
[12:12:39] dcaro: np, feel free to add me as a reviewer in gerrit and the notifications emails hit my inbox so I see them
[12:13:26] I think that the current behavior is to create a task for the specific host, if there's only one host, with the host in the title, and create an aggregated task if there's more than one alert of the same, it might be that it created the single-host one and little after the other nodes started failing too, and created the aggregated one
[12:14:07] and iirc it matches the tasks by the title, so it does not reuse the first task
[12:14:26] would be nice if it did though
[12:14:37] I think it matches the task by title
[12:14:46] (so the node-specific one does not get reused)
[12:16:08] yes indeed deduplication is based on task title
[12:16:23] you are the main dev no? :)
[12:16:49] I am not, we have contributed to phalerts though upstream is someone else
[12:16:57] okok
[12:18:00] thank you for the explanation dcaro ! now I get it single vs multiple tasks
[12:18:39] np :), note that it's a bug, not a feature xd
[12:20:31] mmhh ok, what's the intended behavior ?
[12:20:48] would be nice if it reuses the same task I think
[12:21:23] dhinus did a review of those some time ago, maybe he has a better idea
[12:21:50] ack ok, I'll add a note to discuss at the team meeting tomorrow
[12:22:00] 👍
[12:53:15] I'm going through old tasks and boldly resolving old ones / where there isn't really a clear action
[13:08:56] 👏
[13:29:53] The Broadcom 10/25G NIC for cloudcephosd1052 arrived today. I wanted to verify that it was never put into use with the Intel NIC. I would like to install it tomorrow morning, if possible.
[13:31:33] jclark-ctr: I can confirm it was not put into use (it's installed, but unused)
[13:32:00] thanks!
[13:32:26] jclark-ctr: it's tomorrow morning europe timezone?
[13:33:23] it will be 1pm utc tomorrow
[13:33:29] morning my time sorry
[13:34:02] okok, I say because andrewbogot.t is on PTO, but at that time I'm still around if needed
[13:34:37] I think a.ndrew will be back tomorrow (according to his calendar)
[13:36:01] xd, then even better
[13:36:37] ohhh, I think it was joanna syncing each of our calendars to the team calendar for ptos and such.... that's why I don't see it anymore :/
[13:37:18] godog: dcaro: re: duplicated phaultfinder tasks, a few months ago I did some tweaks in the alert titles to minimize duplication, but it's still happening in certain cases
[13:37:39] I think it's tricky to eliminate it completely, but it could be possible, maybe with some tweaks to phalerts
[13:37:51] dcaro: the appscript runs in volans' account now, so sync should be working
[13:38:06] dhinus: mmmhh ok thank you, happy to discuss tomorrow at the team meeting
[13:39:03] godog: ohh, so then it's not working :/, can you see the new PTOs in the WMCS Team calendar?
[13:39:36] dcaro: I don't have the calendar atm, do you have the url/id handy ?
[13:40:04] I've fixed only the SRE one
[13:40:31] I can duplicate the script for WMCS and make it work, but I don't see the WMCS-related script in my shared scripts
[13:40:54] it might just be identical just using a different calendar ID
[13:41:29] I think the calendar is https://calendar.google.com/calendar/embed?src=c_7b9ad6d28760abb302f0909412d1ed85b8d1db6ade03cbf2242fededb17164f1%40group.calendar.google.com&ctz=Europe%2FZurich
[13:43:45] looks like it
[13:46:31] let me see if it works
[13:46:35] (not sure about permissions)
[13:47:06] 🤞
[13:57:42] dcaro: I think I made it run but didn't import any new event in the calendar, do you have an example of missing vacation event?
[13:57:56] or can you create a fake one for next sunday for example ;)
[13:59:33] volans: just created one for tomorrow
[13:59:42] (I don't see andre.w pto yet though)
[13:59:50] do you see it in his calendar?
[14:00:00] the script can't create them out of thin air :D
[14:00:43] I do yes, it's a full day event for two days
[14:02:28] mmmh, the script checks for some keywords
[14:02:42] "not reall, just testing" can't make it :D
[14:03:09] I see andrew PTO's until today, and maybe it doesn't take today's events, checking
[14:03:10] \o/ yay
[14:03:22] yep, might also be a syncing issue, might refresh in a bit
[14:04:03] right now I'm running it manually
[14:04:16] once it works I set the trigger every hour like the other one
[14:06:21] changed the name to 'Out of office'
[14:06:57] Importing: [dcaro] Out of office
[14:07:01] Error attempting to import event: GoogleJsonResponseException: API call to calendar.events.import failed with error: You need to have writer access to this calendar.. Skipping.
[14:07:10] I can't write to that cal
[14:07:14] if you can add me it should work
[14:07:23] let me try, I think I don't have rights either though
[14:08:04] I can only read
[14:19:37] volans: you are the one that needs write access right? or is it using some service account?
[14:21:52] myself
[14:21:54] for now
[14:22:09] I'll ask levi
[14:22:10] then we might find a better solution for the longer run, but let's keep it simple for now
[14:22:13] thx
[14:22:18] maybe mark has access
[14:25:12] ok, levi says that it's locked, so we should create a new one, maybe we can wait until next week and see how to organize, given that things are shifting right now
[14:43:22] ack makes sense
[15:48:16] Raymond_Ndibe: I recreated lima-kilo once now (ensuring I have latest main) and it worked ok, running ldap before maintain-kubeusers
[15:48:26] (not sure what I ran before :/)
[16:21:37] Raymond_Ndibe: rebuilt it again, making sure that ldap stuff runs before the k8s stuff (latest main), and it worked ok
[16:43:40] dcaro: those queries that I killed to get clouddb unstuck... well they're interesting :) T403639
[16:43:41] T403639: [wikireplicas] slow query runs every hour, but never completes - https://phabricator.wikimedia.org/T403639
[16:49:15] interesting
[16:51:10] I'm not sure how to track where that query is actually running (maybe a cloud vps vm?)
[16:52:09] going offline, I'll think about it tomorrow :)
[16:52:59] if you have the local ip/port, you can lsof/ss and get the other side of the connection I think
[16:53:07] or even tcpdump
[16:53:18] cya!
[16:53:23] yeah, a bit of a hassle but doable. I can try tomorrow!
[16:53:38] * dhinus off
[17:36:46] * dcaro off
[17:36:48] cya!
[22:39:59] hey folks o/ i'm trying to provision a magnum k8s cluster via tofu and seeing the error `Failed to create trustee or trust for Cluster`. the cluster template is successfully created, but not the cluster itself