[07:15:11] greetings
[08:32:40] morning
[09:03:58] hello!
[09:14:36] * volans back
[09:15:34] as I'm still going through backlogs, has anyone taken over or made additional progress on the tracing bits I left last week before going off? Just to avoid any double work if anyone has worked on that
[09:19:59] I'm not aware of anyone, no
[09:20:43] I don't think anybody worked on that
[09:23:14] dcaro: regarding T408387, I'll let you handle the project creation if that's ok, as mine is the only +1 at the moment on the task
[09:23:15] T408387: CloudVPS instance for ProVe - https://phabricator.wikimedia.org/T408387
[09:23:55] unless you think we need further discussion about that request
[09:26:47] dhinus: 👍
[09:26:59] I'll do it
[09:35:58] ack, thanks. I see the MR is still pending a review (toolforge-deploy 1040), if anyone has time :)
[09:41:59] * dcaro has to clean up lima-kilo xd
[09:45:59] see my last comment for the latest context :D
[09:46:44] 👍
[09:51:15] * volans doesn't currently recall what he wrote, but recalls having written something that might be useful :-P
[09:52:58] I need a couple of quick +1s: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1078 and https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/279
[09:53:46] first done
[09:54:35] the second one is mine, not sure if I can review myself :D
[10:07:13] dhinus: +1d the last
[10:07:44] thanks both!
[10:30:59] volans: +1d the MR
[10:31:05] (tested locally and such)
[10:32:04] dhinus: do you want to meet this afternoon for the refinement? I have a slot I used to use to pair with raymond, but he's out this week
[10:34:51] dcaro: <3 thanks a lot, I'll get back to you with some questions about the credentials for toolsbeta in a bit (busy with other things right now)
[10:34:53] dcaro: ok, that slot at 3:30 works for me!
[10:35:20] tofu is currently failing on codfw (but working fine on eqiad1). I opened T410265
[10:35:20] T410265: [tofu-infra] tofu failing to retrieve DNS zones on codfw - https://phabricator.wikimedia.org/T410265
[11:02:59] I just re-enabled Puppet again on apt1002.wikimedia.org (the second time in seven days); don't keep Puppet disabled on this server for longer than half an hour at most, as that breaks changes to the reimage config for any new system (any preseed.yaml will be in vain)
[11:04:09] https://phabricator.wikimedia.org/T408777#11378287 is the direct result of that
[11:18:28] I'm looking into T409668, do we have other cloud-vps projects where we defined similar alerts?
[11:18:28] T409668: [alerting] Create alerts for cloud-vps/VideoCutTool app - https://phabricator.wikimedia.org/T409668
[11:27:08] dhinus: it is up to the requestors to come up with the exact alerting rules instead of us guessing, imho
[11:31:16] I'll try to get a more precise definition from them
[11:31:46] I was hoping we had some other projects they/we could copy from, but it looks like no other projects ever requested custom alerts?
[11:33:00] I don't see any project-specific ones in metricsinfra-controller-2 (the "alerts" table in mariadb), except for "supported" projects like toolforge or quarry
[11:44:30] I think there are some custom ones in the DB of metricsinfra, but I think we created them ourselves directly (without prompting)
[11:45:31] * dcaro lunch
[11:46:38] I don't see anything that matches what they need (cpu, ram or disk space...) I replied in the task, and cc-ed you there
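(For illustration: a requestor-provided "low disk space" rule like the one discussed above might look roughly like this. A minimal sketch only — the metric names are standard node_exporter ones, but the Prometheus host, label filters, and 10% threshold are assumptions, not what metricsinfra actually ships.)

```sh
# Hedged sketch: try a candidate PromQL expression against a Prometheus
# HTTP API before adding it to the alerts table. Host, labels, and the
# threshold below are assumptions for illustration.
curl -sG 'http://prometheus.svc:9090/api/v1/query' \
  --data-urlencode 'query=100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes < 10'
```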
[11:54:18] where are the global alerts in /srv/prometheus/cloud/rules/alerts_global.yml coming from?
[11:56:21] dhinus: we have created custom rules for non-WMCS-managed projects in the past (although adoption has been quite limited because this is not a well-advertised feature), when those projects have shown up with the exact Prometheus expressions to configure the alerts from.
[11:57:34] taavi: sounds fair, I tried asking the user to provide promql rules, if they do it would be easy to add them to the alerts table
[11:58:28] I wonder if we could have some global alerts for all projects (low disk space could be one). I'm also not finding where the current "global" ones are defined
[11:58:49] there is a separate table for them iirc
[11:59:32] if we did default alerts like that imho we'd need a way to disable/customize them per project
[12:00:20] ah separate table, gotcha!
[12:00:40] I was looking everywhere in git but didn't think of a separate table :)
[12:02:02] re: disable/customize, we do already have a few global alerts firing on several projects https://alerts.wikimedia.org/?q=%40cluster%3Dwmcloud.org
[12:02:15] would it be a problem to have some more, without any disabling?
[12:04:39] also, just having alerts generally has no value unless the people maintaining a project know about them, which is usually not the case
[12:05:40] the current global alerts are supposed to represent states that are unambiguously problems (e.g. puppet is broken), which is not as easy to define for something like disk space
[12:06:29] true, although I'd be tempted to say that disk space <10% is quite unambiguous... CPU and RAM would be trickier
[12:07:46] it's not, for example the tools-prometheus servers operate with a disk-based data retention scheme, so having 5% free on the /srv partition is the normal state for those
[12:09:01] fair enough
[14:35:39] moritzm: sorry about leaving puppet off (again). I was working on T407586 and it's easy to get distracted when waiting for a reimage to fail :/
[14:35:39] T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586
[14:37:43] ok, attempt 2 at fixing cas in codfw1dev: https://gerrit.wikimedia.org/r/1206382 https://gerrit.wikimedia.org/r/1206383
[14:40:21] taavi: Think we should just move that bespoke CAS service to a ganeti host? That would make this edge case go away.
[14:40:29] And more closely resemble the thing that it's meant to test
[14:40:47] that is definitely an option as well, yes
[14:42:30] slyngs, moritzm: we've been chasing issues with the codfw1dev cas due to it living on the same host as other services; it would be easy to move it to its own ganeti host, wouldn't it?
[14:43:13] taavi: I think I'd prefer to do that unless you think your patches are also useful in the general case.
[14:44:43] do you have the time for that? the last patches fix the problem just as well
[14:46:50] sure, we can easily create new ganeti VMs if that helps
[14:48:07] taavi: I am naturally wary about adding an edge case to envoy, but if that doesn't bother traffic people (or whoever it is that cares most about envoy) then I'll get over it
[14:48:17] moritzm: context is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1206382/1
[14:49:59] what is the current use case of cloudidp-dev, it provides OIDC for the test environment of horizon.wikimedia.org?
[14:50:26] yes
[14:53:15] who are the users of this test environment, just a few internal WMCS staff members?
[14:53:47] not strictly internal people, but yes, just a handful
[14:54:17] given that horizon.w.o runs against the prod IDPs, we could also simply authenticate the horizon test environment against idp-test.w.o?
[14:54:52] the codfw1dev cluster is backed by a different test/dev ldap
[14:54:54] that uses a different LDAP tree
[14:55:05] ok
[14:55:37] separate Ganeti seems fine then
[14:57:51] moritzm: do you prefer that to patching envoy to support the status quo? (maybe you don't have opinions about envoy)
[15:01:24] I don't have a strong opinion, but this setup already seems full of corner cases, so simplifying seems preferable
[15:32:52] I would probably need a bit of support for rebuilding, at least with CR -- moritzm, should that be you, or slyngs, or someone else?
[15:33:30] I have some time if you need it :-)
[15:33:43] Well, not right now, but in general
[15:33:50] great, thank you!
[15:45:04] FYI I'm going ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203383 and will set the cluster noout and norebalance as per the procedure in https://phabricator.wikimedia.org/T399180
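(For reference, the noout/norebalance step mentioned above boils down to setting the flags before the change and clearing them afterwards — a minimal sketch, assuming direct `ceph` CLI access on an admin host; the exact procedure lives in T399180.)

```sh
# Before maintenance: stop Ceph from marking OSDs out or rebalancing data.
ceph osd set noout
ceph osd set norebalance

# ... apply the change (e.g. the puppet patch linked above) ...

# After maintenance: clear the flags and confirm cluster health.
ceph osd unset norebalance
ceph osd unset noout
ceph -s
```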
[15:58:57] all done, quite anticlimactic, which is exactly what I was after
[15:59:25] you like to live dangerously :D
[16:01:46] haha! been on an anxiety-reducing quest for some time now
[16:03:38] lol
[16:04:52] 🎉
[16:05:03] nice!
[16:14:54] there's currently a spike of requests coming to toolforge being denied
[16:19:52] https://zhdeletionpedia.toolforge.org/ we think
[16:20:34] what exactly are you seeing happening?
[16:47:41] T410288
[17:10:40] * dhinus off
[17:10:57] any easy way for me to get developer accounts/usernames from phabricator names? silly ldap search did not find anything useful
[17:12:06] dcaro: if people have linked their accounts you'd see that on the profile page (e.g. https://phabricator.wikimedia.org/p/taavi/)
[17:14:54] there is a conduit endpoint I wrote to do the other way -- ldap to phab -- https://phabricator.wikimedia.org/conduit/method/user.ldapquery/
[17:15:45] hmm, "LDAP User: Unknown" I guess they didn't finish creating the developer account?
[17:17:42] it means they have not linked their account to phabricator
[17:17:58] they might or might not have an account, but just have not taken the step to explicitly link those two
[17:18:06] ack
[17:19:15] thanks
[17:21:45] That conduit endpoint could use better docs. :/ The lookup value is the LDAP account's `cn`, not the `uid` as one might expect. So a lookup for "bd808" will fail, but "BryanDavis" will find my @bd808 Phab account.
[17:23:10] There is a partial mapping in Striker's local database too. That one gets populated by OAuth between Striker as the client and Phab as the server.
[17:24:38] * bd808 figures out this question was about the ProVe project request
[17:27:25] yep :), it was about that, the dig into the conduit endpoint was still useful
[18:44:24] * dcaro off
[18:44:26] cya
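(Footnote on the `cn` vs `uid` gotcha bd808 describes above — a hedged ldapsearch sketch; the replica host and base DN follow the usual WMF LDAP layout but should be treated as assumptions.)

```sh
# The conduit endpoint above wants the cn ("BryanDavis"), not the
# uid ("bd808"). Listing both attributes for an entry shows the mapping.
# Server and base DN are assumptions based on the standard WMF layout.
ldapsearch -x -H ldaps://ldap-ro.eqiad.wikimedia.org \
  -b 'ou=people,dc=wikimedia,dc=org' '(cn=BryanDavis)' cn uid
```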