[07:43:13] hmm.... I've started getting errors trying to silence alerts from cookbooks running on my laptop
[07:43:15] Failed to POST to http://alertmanager-eqiad.wikimedia.org/api/v2/silences: 403 Client Error: Forbidden for url: http://alertmanager-eqiad.wikimedia.org/api/v2/silences
[07:43:24] anyone know if anything changed there?
[07:46:41] ooohhh, the proxy is not starting, and for whichever reason debug logs are not being printed :/
[07:48:12] yep, that's it
[08:01:32] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1075843 fixes it (quick review)
[08:04:31] 👀
[08:06:01] +1'd
[08:12:49] thanks, hmpf... can't bootstrap a new ceph osd while reimaging another as it fails the network checks when reimaging one is rebooting xd
[08:23:35] mmm, is that a new dependency chain?
[08:23:44] what network check is it failing?
[08:40:11] before bootstrapping a node we do a set of pings with and without jumbo frames to all the other osds in the cluster, as the new nodes are already "defined" in hiera, it also tries pinging those for the check and fails if it's reimaging/rebooting
[08:51:47] moritzm: I was able to reimage cloudcephosd1039,
[08:51:53] but cloudcephosd1040 is stuck with cert issues
[08:52:07] (I tried manually downgrading puppet to 5, but did not help xd)
[08:58:33] I'll try reimaging again :/
[09:07:49] I can have a look?
[09:07:55] or did you already kick off the reimage?
[09:09:08] I did xd, sorry, let's see if it works, I'll ping you again if it fails
[09:10:52] dcaro arturo I think this one can be merged, WDYT?
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075609
[09:11:37] ack
[09:12:06] dhinus: in a meeting, looks ok, if you want I'll look deeper after
[09:13:53] let's wait a few hours, we can merge after lunch
[09:36:04] dhinus: LGTM, +1d
[09:36:58] arturo: thanks
[09:37:23] dhinus: also, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075859
[09:37:43] I've already added myself to the reviewers for that one :)
[09:37:49] ok
[09:37:53] :-)
[09:38:02] I was waiting for the tests to pass, but looks good
[09:39:00] hard to say if it covers ALL related bits, it's always hard to track hiera keys scattered across different files
[09:39:38] the CI is complaining about missing defaults in cloud.yaml
[09:40:11] oh interesting, never seen that before
[10:10:38] when does tofu kick in?
[10:11:07] (keystone was kicking in right after project creation)
[10:11:15] it is run manually at the moment
[10:11:40] so now every time we create a project we have to run tofu too?
[10:11:49] yes, we do project creation via tofu now
[10:12:07] oh, that was not clear to me, what happens with the cookbook?
[10:12:25] the cookbook now displays a message "do this via tofu" or similar
[10:14:58] I don't see that in the code :/
[10:15:10] oh https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1069994 was never merged :-(
[10:16:32] how are quotas managed?
[10:16:46] (and users in the project)
[10:17:01] unfortunately some quotas can't be managed via tofu, for example trove ones
[10:17:21] in general, quotas has not been migrated to tofu, see T371391
[10:17:22] T371391: Cloud VPS: extend tofu-infra to cover quotas - https://phabricator.wikimedia.org/T371391
[10:17:36] the cookbook did that before when creating the project
[10:17:36] and regarding project memberships, see T371393
[10:17:38] T371393: Cloud VPS: extend tofu-infra to cover projects, users and roles - https://phabricator.wikimedia.org/T371393
[10:18:24] Is the docs anywhere on "how to add a project" in tofu?
(as now it's more steps than before)
[10:19:32] I don't think that specific doc exists. Managing all resources via tofu is similar workflow: create a patch, run tofu
[10:19:49] but currently it's not, as you have to manually do the quotas and user access
[10:19:56] in-between we need to know how to do it no?
[10:20:18] there are docs on how to do manually memberships and quotas
[10:20:30] some regression to our workflows are expected while we fully migrate into tofu
[10:20:34] maybe the cookbook should tell you "send the new project patch to tofu and click ok, then I'll set the quotas and add the access"
[10:21:02] patches welcome
[10:21:05] but not being anywhere defined "what do I have to do to create a project?" makes it easy to forget about all that
[10:22:46] I'm quite busy right now with ceph and such, I can try to do something, but I'd really appreciate that if you remove/change a team workflow document the alternative so we don't have to discover later on the fly
[10:28:58] added a note to the patch, I'll take over if you don't want to do it
[10:29:20] * dcaro lunch
[11:41:16] dcaro: turns out, I had updated the documentation on how to create a new openstack project a month ago, to reflect the new tofu reality
[11:41:18] https://wikitech.wikimedia.org/w/index.php?title=Portal%3ACloud_VPS%2FAdmin%2FProjects_lifecycle&diff=2221049&oldid=2160445
[11:50:11] good, then that should be linked there instead
[11:51:21] there's no info on how to add a new project to tofu though, just pointers to the generic tofu help that I can see
[11:54:27] well, you are a smart person, it will take you 30 seconds to figure out that all you need to do is to add the project definition to a yaml file, following the other hundred examples in the same repo
[11:56:46] there are no docs on wikitech how to add a flavor either, and this MR went through just fine https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/15
[11:58:31] after asking around, and looking
into the code to try to understand what's doing yes
[11:58:52] excellent then
[11:59:40] I would appreciate if you relax this level of scrutiny over my work, and acknowledge that we are in a transition period. Which I also explicitly shared information about in yesterday email to cloud-admin@
[12:00:48] here's the change I'm requesting https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1075892
[12:01:19] is not only that, it is also yesterday's comments on the type checking
[12:01:29] a more constructive approach would be to patch/edit stuff yourself, instead of pointing with the finger stuff you don't like from the back seat
[12:02:35] I just sent a patch
[12:09:37] arturo: I was looking at https://phabricator.wikimedia.org/T375259 and struggling a bit with nftables. Once a packet towards 185.15.56.57 is received on cloudgw1002, what does the host do with it ?
[12:10:10] XioNoX: let me see
[12:11:16] arturo: https://phabricator.wikimedia.org/T375259#10179065 left the first part of the investigation
[12:11:23] XioNoX: nothing, just accepts it. There is no NAT for this address here
[12:11:31] arturo: Don't get me wrong, the tofu work is great, and really needed. I think it's going to help a lot. I would love to have time to work on it too, but there's too much stuff going on. That does not mean that I will not try to improve it when my opinion is requested (like yesterday), we all want the same thing.
[12:12:08] arturo: accept it, but the IP is not on the host itself, so where does the packet goes ? it forwards it to its default gw ?
[12:12:40] XioNoX: routing, see `ip route list table cloudgw`
[12:12:40] so back to cloudsw1-d5 (10.64.20.1) ?
[12:12:50] no, it routes it to neutron
[12:13:05] ahh, I forgot there was a specific cloud table
[12:13:21] XioNoX: there is a VRF here in these boxes
[12:13:22] dcaro: ack
[12:17:34] arturo: cool yeah, following the thread, where is 185.15.56.238 / cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org. ?
[12:17:53] XioNoX: in a linux netns in cloudnet boxes
[12:18:00] that is the IP of the neutron virtual router
[12:18:33] see
[12:18:35] https://www.irccloud.com/pastebin/8SCNeqMA/
[12:22:01] arturo: thx, that's active/passive or active/active ? could cloudgw have tried to send packets to cloudnet1005 ?
[12:22:43] there is VRRP going on, instrumented by neutron itself. Active/passive
[12:23:23] ok! so all good so far
[12:23:56] last hop I guess is the VM, but when c8 went down the VMs were migrated to the other racks, no
[12:23:58] ?
[12:23:59] XioNoX: I think I checked VRRP in both sides and they were correctly detecting the switch being down, that's what motivated me to suspect about CR/cloudsw, but maybe I got confused
[12:24:29] yes, last step is the VM, which I believe there was none on that rack at that time
[12:24:38] * arturo brb
[12:25:13] alright, so no smoking gun :(
[12:40:35] I could double check the keepalived logs to see if I see something different
[12:40:47] I'm not sure if we still have the logs from that day
[12:47:24] arturo: a bit more hope, I left another comment with a possible way to investigate it further
[12:49:10] ok
[13:04:33] XioNoX: I added the keepalived logs to the ticket
[13:04:39] will follow up later, thanks for working on this
[13:05:20] arturo: looks like it failed over properly yeah
[13:33:38] arturo: just found out about this tool that lets you automate importing resources from openstack into terraform https://github.com/GoogleCloudPlatform/terraformer
[13:33:50] it has limited support for openstack: https://github.com/GoogleCloudPlatform/terraformer/blob/master/docs/openstack.md
[13:34:46] you've already done the work for security groups, maybe we can consider it if we want to import nova instances in the future (e.g. for toolforge)
[14:03:43] 👀
[14:05:00] tracking VMs in tofu will be the next big milestone :-P
[17:10:29] * dhinus offline
[17:25:21] * dcaro off
[17:25:26] cya
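
Note on the 08:40:11 message: the pre-bootstrap check is described there as a set of pings, with and without jumbo frames, to every other OSD defined in hiera. The sketch below is not the cookbook's code. It only illustrates that kind of check, assuming Linux iputils `ping`, a 1500/9000 MTU pair, and a hypothetical `skip` parameter for peers known to be mid-reimage (the log does not say the cookbook has one). The function names are made up for illustration.

```python
import subprocess
from typing import Iterable, List

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header
MTUS = (1500, 9000)  # assumed: standard and jumbo-frame MTU


def ping_cmd(host: str, mtu: int) -> List[str]:
    """Build a single ping that only succeeds if a full-MTU packet fits."""
    payload = mtu - IP_HEADER - ICMP_HEADER
    # -M do forbids fragmentation, so an undersized path MTU makes the ping fail
    return ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host]


def build_checks(peers: Iterable[str], skip: Iterable[str] = ()) -> List[List[str]]:
    """One ping per peer and MTU, leaving out peers listed in skip (hypothetical)."""
    skipped = set(skip)
    return [ping_cmd(host, mtu) for host in peers if host not in skipped for mtu in MTUS]


def run_checks(cmds: Iterable[List[str]]) -> List[List[str]]:
    """Run each ping and return the commands that failed."""
    return [c for c in cmds if subprocess.run(c, capture_output=True).returncode != 0]
```

With the two hosts named in the log, `build_checks(["cloudcephosd1039", "cloudcephosd1040"], skip=["cloudcephosd1040"])` would yield only the two pings for cloudcephosd1039, so a peer that is rebooting would not fail the whole check. Whether skipping is safe is a design question for the cookbook owners, since it also hides a peer that is down for other reasons.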