[01:25:26] I am running the safe_reboot cookbook on all cloudvirts. Every time it reboots one we get an email about a neutron agent being down; it should be fine to ignore those.
[09:43:03] dcaro: reviewed and left a comment in https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/96
[09:43:17] this other one has some rebase conflicts: https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93
[10:25:00] dhinus: thanks! replied on the first and rebased the second (testing the second now, but feel free to review already)
[11:51:53] dhinus: taavi: Using the tables catalog of maintain-views: https://phabricator.wikimedia.org/T363581#10948409 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163846
[11:52:19] the ruby sorting worked locally but not in PCC
[11:52:53] to de-risk the change, should I start removing dropped tables from fullViews right now?
[11:56:28] taavi: for the deployment, it turns out that currently the default mount option is all, so you should be able to deploy your tools
[12:02:28] Amir1: do you mean splitting the patch in 2, first modifying the list, then replacing the list with the catalog?
[12:02:58] the PCC sorting bug is annoying. I also have another question: are there any dropped views that we should inform users about?
[12:04:17] dhinus: Raymond_Ndibe btw. I'm in the coworking space if you want to pair on something, for ~40 though (got another meeting)
[12:05:22] dhinus: I mean adding a new patch that would remove dropped tables from fullViews so the diff would be smaller
[12:05:36] I wrote a script to analyse the pcc, the result is in the comment
[12:05:58] but I'm checking if we are removing any full views from tables that actually exist
[12:06:14] Almost all are dropped tables so it's noop for them
[12:06:20] dcaro: wait, does components-api set a different default than jobs-api?
[12:06:34] taavi: that's the jobs-api default, it's the jobs-cli that sets it
[12:06:55] (we should probably change that to move to the API instead)
[12:07:15] Amir1: yep I think it's worth adding an extra patch. and probably an announcement to cloud-announce with the list of views being dropped (I can write that one).
[12:07:59] do you think it'd be needed to announce them? if they are dropped tables they were not accessible regardless
[12:08:06] it's also possible some existing views are broken yes
[12:08:57] taavi: there's also the filelog/mount option combo check there, that should be moved too imo (if it's not there already, looking)
[12:09:07] are you sure that the dropped "fullviews" are all for tables that no longer exist in any wiki?
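A minimal sketch of the cross-check Amir1 describes at 12:05:58 (making sure none of the fullViews entries slated for removal still point at a table that exists in production). The inputs and names here are illustrative only, not the actual script that analysed the PCC output:

    # Hypothetical inputs: the views the patch would drop, and the tables that
    # information_schema still reports for the wiki sections in production.
    views_to_drop = {
        "flaggedpage_pending",
        "wikilove_log",
        "namespaces",
        # ... the rest of the candidate fullViews
    }
    tables_in_prod = {
        "page",
        "revision",
        "wikilove_log",
        # ... everything that actually exists in production
    }

    # Any overlap means we would be dropping a view over a table that still exists.
    still_existing = sorted(views_to_drop & tables_in_prod)
    if still_existing:
        print("views over tables that still exist:", still_existing)
    else:
        print("all candidate views point at dropped tables, safe to remove")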
[12:09:25] 80 to 90% of these tables don't exist anymore:
[12:09:26] {'flaggedpage_pending', 'article_assessment', 'updates', 'wikilove_log', 'machine_vision_provider', 'article_feedback_ratings', 'machine_vision_safe_search', 'geo_killlist', 'article_feedback_stats', 'article_assessment_pages', 'links', 'msg_resource_links', 'flaggedrevs_stats2', 'article_feedback_properties', 'article_assessment_ratings', 'machine_vision_freebase_mapping', 'geo_updates', 'article_feedback', 'watchlist_count',
[12:09:27] 'flaggedrevs_stats', 'article_feedback_revisions', 'namespaces', 'hashs', 'machine_vision_image', 'article_feedback_pages', 'wikilove_image_log', 'page_broken', 'machine_vision_label', 'imagelinks_old', 'article_feedback_stats_types', 'machine_vision_suggestion', 'flaggedtemplates'}
[12:09:45] I'm sure of 80% of them, writing a script to make sure what's left
[12:09:51] (I dropped a lot of them myself)
[12:10:36] sounds good, let's see what's in the remaining 10 to 20%
[12:14:09] dcaro: huh. indeed in https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/39 I even added a "# TODO: remove default from the API", which was then lost as a part of https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/67 :(
[12:15:16] how easy is changing the default in components-api now? imo we should get that to default to none as well before launch (so that we don't break all the tools by changing it later)
[12:15:41] the default changes depending on the type of job you create in the jobs-cli
[12:15:55] should not be hard though, but I wonder if that logic should be in the jobs-api instead
[12:16:09] as in, if you use buildservice based, it defaults to none, otherwise to all
[12:16:48] dhinus: okay, from the list of tables that we're removing from the list of fullViews, nothing is in the cataloged tables (e.g. would have been set to partially public so it wouldn't show up). From the list of tables that exist in production but not cataloged yet (https://phabricator.wikimedia.org/P77827) only wikilove_log is in fullViews
[12:17:42] One last thing, I think there are a couple of tables that only exist in wmcs. is there a list of them somewhere? e.g. namespaces
[12:18:29] I'd be up to moving the default to the api (or just making the field required), but that's a breaking change and I don't think we can do it before monday
[12:19:04] the default in jobs-cli depends on whether it's a buildservice image or not. all components-api jobs are currently based on build service images, so the default could for now be hardcoded there
[12:19:05] moving it to the api should be ok imo
[12:19:51] Amir1: hmm I think the _tables_ list should match the one in prod, it's only the _views_ that are different?
unless some tables were created directly on clouddbs and are not replicating from sanitariums
[12:20:15] yeah, I think there was one at least
[12:20:37] since we rebuilt the sanitarium and it broke
[12:20:42] I'll dig
[12:22:07] taavi: this should do it
[12:24:25] * taavi assumes you mean https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[12:24:39] hahahah, yep, sorry
[12:25:20] Amir1: "namespaces" does not seem to exist in clouddbs, it's in the yaml but not in the actual dbs, I checked all sections
[12:25:27] dcaro: that if condition looks wrong, I think even if you'd set mount=none it'd set mount=all because the `not self.mount` check fails
[12:25:44] anyway, I'm going for lunch now, I can have a closer look once I'm back
[12:26:18] dcaro: can you merge https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/101 into the parent one?
[12:26:37] dhinus: I'd prefer doing it after, and merge it to main by itself
[12:26:46] ah I see, I got the ordering wrong :)
[12:26:57] 👍
[12:27:12] taavi: you are right, fixed
[12:27:54] dcaro: both are approved now
[12:27:59] thanks!
[12:28:25] I'll merge in a bit (want to merge /test each individually)
[12:30:33] ack
[12:38:12] dcaro: I see two "high-prio" MRs left, are they both ready for review? (I see you pushed new changes)
[12:38:40] dhinus: yep, I just rebased on the latest merge
[12:39:11] ack. I'll try to review both before the team meeting
[12:39:25] thanks
[12:40:22] dhinus: there will also be the jobs-api change above, I'm testing it but I'll flag it with high-prio too
[12:43:31] ack
[13:10:32] dcaro: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175#note_149385
[13:24:35] dcaro: left two comments in https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93
[13:31:11] dcaro: the ceph cluster in codfw1dev has some 11000 objects which never seem to rebalance properly. I'm guessing this is because the crushmap is unhappy with the distribution of osd hosts, does that seem right to you?
[13:31:43] This has been true since I removed cloudcephosd1001-dev for maintenance; I thought re-adding it yesterday would cause everything to balance out but that seems to not be true
[13:37:18] taavi: replied there
[13:38:17] thanks, approved
[13:39:04] I'll change the suggestion though
[13:40:08] dhinus: the openapi.yaml file is autogenerated, so we should not manually change anything, it might be a bug/weird behavior upstream though, but I would just go ahead with the changes and not block on the openapi generator unless it breaks things
[13:49:57] dcaro: ack, sgtm
[13:50:06] does anyone happen to know if the php 7.4 webservice that the maintain-harbor toolforge tool runs to serve a toolinfo file is still needed? I thought we moved that all from a tool to a component
[13:50:22] I also seemed to remember it was autogenerated, but I couldn't find anything in your changes that would explain that diff
[13:51:21] added a comment there, might be a change in pydantic/fastapi
[13:51:42] ack, thanks
[13:51:44] approved!
[13:51:49] taavi: that tool should not be used anymore I think, Raymond_Ndibe ?
[13:53:03] dcaro: I migrated the ircservserv tool to components (T397929).
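For context on taavi's `not self.mount` remark at 12:25:27: the jobs-api code itself isn't quoted in the log, but the general pitfall is using truthiness to decide whether the caller provided a mount option, which conflates "not provided" with an explicit choice. A made-up illustration (MountOption, resolve_mount and uses_buildservice are hypothetical names, not the real API):

    from enum import Enum
    from typing import Optional

    class MountOption(str, Enum):
        ALL = "all"
        NONE = "none"  # the string "none", which is truthy, unlike Python's None

    def resolve_mount(requested: Optional[MountOption], uses_buildservice: bool) -> MountOption:
        # Pitfall: `if not requested: ...` (or an inverted variant of it) cannot
        # tell "the caller never set a mount option" apart from an explicit value,
        # so an explicit choice can end up silently replaced by the default.
        # Only fall back to a default when nothing was provided at all:
        if requested is None:
            return MountOption.NONE if uses_buildservice else MountOption.ALL
        return requested

    # e.g. an explicit "none" must survive:
    assert resolve_mount(MountOption.NONE, uses_buildservice=False) is MountOption.NONE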
it seems to be fine, I can't set the email config but !93 will fix that
[13:53:03] T397929: Migrate ircservserv to components - https://phabricator.wikimedia.org/T397929
[13:53:22] \o/
[13:53:28] also I noticed the built image name is different than the builds-cli default (`tools-harbor.wmcloud.org/tool-ircservserv/ircservserv:latest` versus `tools-harbor.wmcloud.org/tool-ircservserv/tool-ircservserv:latest`), but that's probably maybe fine?
[13:54:10] taavi: other than running some scripts and doing personal tests, that tool is no longer in use yes
[13:54:28] can I disable that tool to avoid confusion then?
[13:54:55] I do have a number of scripts there (I guess I can copy that to local if we want to get rid of it)
[13:54:57] Yes you can
[13:55:57] Just give it maybe 1 week before doing that. Need to make sure there is nothing of importance there
[13:56:07] i mean if it's still in use then it can be kept. but if it's only used for totally unrelated testing, that could just as well be done with some other tool and cause much less confusion
[13:56:12] But the webservice can be taken down now
[13:56:19] * taavi does
[13:56:35] I support taking it down
[13:56:49] There is nothing it does that can’t be done with the other test tools we have
[13:58:15] there was also a 'while true; do date; sleep 1; done' test cronjob, which I also just removed
[13:59:17] Yeaa all just tests, nothing important. Basically any kind of job the tool is running can be taken down rn. Won’t be missed
[13:59:48] I spawn jobs like that sometimes to test xd, yep
[14:00:41] yeah so do I, but in a tool with 'test' in the name or something :D
[14:01:10] Raymond_Ndibe: I created T397933, can you please comment on it when you've checked everything you want to check? (or just disable it directly via Striker then)
[14:01:10] T397933: Disable tools.maintain-harbor - https://phabricator.wikimedia.org/T397933
[14:02:05] I sometimes abuse wm-lol (one of my tools) and sample-complex-app I think
[14:02:37] that works, or just do what I did and create a dedicated testing tool
[14:45:26] andrewbogott: did you end up doing what was needed to disable the old vlan network from new VMs?
[14:48:35] taavi: I didn't. I turned out to be less sure about how to do it properly, so I want to test marking the network unshared in codfw1dev and then see what results from that, if anything.
[14:49:03] That seems better than doing it in the UI (which also turns out to be not entirely trivial)
[14:49:30] ^ is that a good enough answer for now, or is that blocking things?
[14:49:46] yes, I just don't want to have this getting completely lost
[14:50:23] andrewbogott: looking into ceph on codfw, it seems that the new nodes got into the rack B1 (that did not exist before)
[14:50:35] will we have enough hosts for 4 racks/HA zones?
[14:50:48] (the new osd has more space than all the others together xd)
[14:50:50] After I finish additions and removals there will be exactly 4 pretty big OSD nodes.
[14:51:13] They won't be in physically different racks but I was hoping we could simulate one-per-rack in the crush map
[14:52:26] dcaro: here is what ought to happen in coming days:
[14:52:26] https://phabricator.wikimedia.org/T393614#10948977
[14:53:30] okok
[14:53:35] the command to move stuff is `root@cloudcephmon2006-dev:~# ceph osd crush move cloudcephosd2001-dev rack=C8D5`
[14:54:11] just moved that node, things should start rebalancing
[14:54:33] I remember though at some point some stuff got stuck also when the cluster was very full
[14:55:04] and I had to force things around by reweighting osds to push data out of them
[14:56:01] the fact that we have a huge node with a lot of weight might be making things weird too, I'll let it run and think more about it if it does not rebalance ok xd
[14:56:34] the new nodes are not as huge as that one but still pretty huge
[14:56:44] I mean, the not-yet-installed ones
[14:57:12] we might want to use the same weight for all then to avoid rack imbalances :/
[14:57:44] everyone: I'm about to reboot cloudvirtlocal1xxx hosts. That will cause toolforge and toolsbeta etcd clusters to drop from 3 to 2 during reboots, should not affect performance.
[14:57:55] yeah, and just not use all the space in the extra big one, I agree.
[14:58:37] HA should be smart enough to make sure to have copies in the others, but might end up getting all the 'copies' in just one rack xd
[14:59:05] yeah
[14:59:30] indeed, looks like it's rebalancing now. I'll try to rack things sensibly as I bring in the new nodes and we'll see what happens. Thanks
[15:30:34] topranks, we have two new cloudcephosds that need to be attached to cloud-storage1-b-codfw; is that something you can do? dcops referred me to you :) cloudcephosd2005-dev (xe-0/0/25) and cloudcephosd2006-dev (xe-0/0/24)
[15:30:42] there will be a third but it's not fully racked yet
[15:33:56] andrewbogott: no probs I set the vlans on those ports in Netbox and pushed the change to the switch
[15:34:01] you should be good to go
[15:34:05] thank you!
[15:34:33] oh yeah, now it shows 'state UP'. splendid
[15:56:20] andrewbogott: just noticed that norebalance was set in the cluster
[15:56:39] I think that's because I'm running a bootstrap cookbook
[15:58:16] oh my, sorry xd
[15:59:15] tell me about 'C8D5' -- does that mean something like 'really in c8 but pretending to be d5' or similar?
[15:59:59] so as we did not have 4 HA zones in codfw, I created 3, one for each rack in eqiad, where one is the combination of c8 and d5
[16:00:33] the name would not have mattered though, I think that at the time I was still playing with some of the scripts that might have expected specific names for the racks
[16:00:35] ok, so it's entirely fictional as far as codfw1dev is concerned
[16:00:37] yep
[16:00:56] great, I'll stick with what you have then and won't worry about where the servers really are
[16:01:52] dcaro: I just approved https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97
[16:02:19] dhinus: I saw thanks :)
[16:02:38] I think that was the last high-prio \o/
[16:02:41] dhinus ++
[16:04:45] andrewbogott: it finished with the degraded pgs :)
[16:05:05] great! just in time for me to force another huge rebalance :)
[16:15:19] 'ceph osd crush move' sure is a lot simpler process than I was expecting
[16:15:57] is 'HEALTH_WARN 38 pgs not deep-scrubbed in time; 38 pgs not scrubbed in time' anything I should care about, or is that just leftover from when things were stuck yesterday?
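dcaro answers just below that those scrub warnings are leftover and should clear once the rebalance settles. A rough way to double-check that is to confirm no PGs are still moving before worrying about the scrub health checks; this sketch shells out to the ceph CLI used above, but the JSON field names are from memory rather than from this cluster, so verify against `ceph status -f json-pretty` before relying on it:

    import json
    import subprocess

    # Sketch: is the "pgs not (deep-)scrubbed in time" warning just leftover from
    # the rebalance, or is the cluster otherwise clean and still not scrubbing?
    status = json.loads(subprocess.check_output(["ceph", "status", "--format", "json"]))

    pgs_by_state = {s["state_name"]: s["count"] for s in status["pgmap"]["pgs_by_state"]}
    still_moving = sum(count for state, count in pgs_by_state.items() if "clean" not in state)

    checks = status["health"].get("checks", {})
    scrub_checks = [c for c in ("PG_NOT_SCRUBBED", "PG_NOT_DEEP_SCRUBBED") if c in checks]

    if still_moving:
        print(f"{still_moving} PGs still rebalancing; expect scrub warnings to clear afterwards")
    elif scrub_checks:
        print("cluster is clean but scrub warnings persist:", scrub_checks)
    else:
        print("no scrub warnings")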
[16:18:44] it's leftover, once the cluster stabilizes it should resume triggering the scrubs
[16:18:48] scrubbings?
[16:18:52] xd
[16:32:31] cool
[16:33:27] Thank you dcaro, I think I'm all set for now. Next steps are waiting, and then decomming two of the old OSD nodes to free up switch space.
[16:34:38] yw :)
[17:35:37] * dcaro off, cya on monday!
[20:38:45] andrewbogott: I have a new to me error coming out of tofu talking to the OpenStack API. It is suddenly giving me a 401 Unauthorized response when looking up data on the debian-12.0-bookworm image.
[20:39:48] The same client/credentials are working to call lots of other OpenStack API endpoints in the same session. It's just this one that is suddenly angry: https://openstack.eqiad1.wikimediacloud.org:29292/v2/images?name=debian-12.0-bookworm&sort=name%3Aasc&status=active
[20:40:31] that's interesting, I'm /just about/ to update the way that glance checks credentials but I haven't merged the change yet
[20:40:36] anticipatory failure
[20:41:03] let's see if I can produce that on the cli... do you know what account tofu is using to query?
[20:41:15] ZuulDevOpsBot
[20:41:47] thx
[20:43:04] `sudo wmcs-openstack image list` is angry on cloudcontrol1007. Possibly related?
[20:43:09] wow, I get a 401 using novaadmin as well. How...
[20:43:10] yes
[20:43:19] so I must've broken it. But... how?
[20:44:32] ¯\_(ツ)_/¯ I do see a snapshot queued for "cloudstore-nfs-02-shelved" in the horizon UI
[20:45:09] I also get 3 "Error: Unable to retrieve the project." toast messages when I go to https://horizon.wikimedia.org/project/images
[20:51:37] ok, at least the problem wasn't a result of the thing that it shouldn't have resulted from...
[20:51:56] btw, the same issue is present in codfw1dev. So it's definitely something I did just now...
[20:52:34] * bd808 plays rubber duck until the "oh!" moment
[20:56:57] do we know for sure this hasn't been broken for days? It doesn't break VM creation I think...
[20:57:23] let me see when this pipeline last worked...
[20:59:01] well, this probably won't be extremely helpful. It looks like I last ran this in a logged way successfully on the 20th.
[21:00:42] that's kind of good because that range includes some policy.yaml changes.
[21:00:43] this query would be part of any tofu plan that is validating that a VM exists I think.
[21:00:49] although not to the glance policies
[21:01:49] I take it things /have/ worked since https://gerrit.wikimedia.org/r/c/operations/puppet/+/1141980 ?
[21:02:55] yeah, for sure. That's way back at the start of May
[21:03:04] yeah
[21:06:54] my backscroll in the tmux window where I run this doesn't have history prior to today. :/
[21:07:40] So worked Friday, doesn't work today is the narrowest I can prove.
[21:16:23] oh great, it's started working without me changing anything that should've mattered
[21:20:14] zuuldevopsbot is still being told to go away if that makes you feel better
[21:20:34] seems like it works on one cloudcontrol and fails on the other two
[21:22:48] how about now?
[21:23:13] * bd808 tries the script
[21:23:42] yup. it works now
[21:23:54] which loose wire did you jiggle?
[21:24:10] I... turned debug logging on and then puppet turned it back off again
[21:24:35] that sounds... sketchy :)
[21:24:42] yeah!
[21:24:56] I also adjusted some role assignments but if that did something the effect was delayed.
[21:26:44] One of the things I did today was to create an ldap account named 'glance'.
This has the same name as the linux user 'glance' that runs the glance service on the cloudcontrols.
[21:26:59] But... there shouldn't be any ldap effects on a bare metal server should there?
[21:27:07] Or are there still group lookups and such?
[21:27:13] * andrewbogott surprised he does not know the answer to that
[21:28:18] the ldap account would show up on all VMs, but not on a box like cloudcontrol1007.eqiad.wmnet
[21:28:41] right
[21:28:50] the ldap->prod connection for NSS is just the admin module in puppet
[21:28:50] so there shouldn't have been any interaction with the existing glance service
[21:29:33] wait, tell me more about that connection?
[21:32:44] we manually copy uid values from Developer accounts into the admin module's data.yaml file
[21:41:59] I think I just lost my saving throw vs adding a local puppetserver in the zuul project. sigh
[21:50:42] andrewbogott: do you have a good intuition about where in puppet new Cloud VPS project specific things should go? I'm going to need to write some custom puppet code for the zuul project.
[21:51:18] would role::wmcs::zuul::whatever sound right to you or something else?
[21:51:39] let me look...
[21:52:01] it's not super clear to me if "role::wmcs" is a namespace for the Cloud Services team or for Cloud VPS generally at this point
[21:52:51] we have legacy "role::labs" things but I'd rather not add new there
[21:53:35] yeah, not clear to me either. I sort of thought we had a tree for project-specific things but I don't really see it now
[21:54:20] is zuul going to have bare-metal puppet code as well?
[21:54:29] I'm seeing that quarry just has a top-level profile dir
[21:54:49] modules/profile/manifests/quarry
[21:54:57] I would hate for there to be modules/profile/manifests/zuul
[21:55:11] unless /you/ need to distinguish between what deploys on metal vs. not
[21:55:20] oops, typo
[21:55:32] I wouldn't hate for there to be modules/profile/manifests/zuul <- what I was trying to type
[21:56:16] well... there are going to be zuul things in prod too, so yeah it probably needs a "this is cloud stuff" scope
[21:56:55] 'wmcs' doesn't seem right to me (partly because of impending name changes)
[21:57:08] role::zuul and some subs already exist
[21:57:23] how would you feel about breaking new ground and creating a 'cloudprojects' tree or similar?
[21:57:28] ..."partly because of impending name changes"... o_0
[21:57:54] are they renaming you back to labs to complete my erasure from history?
[22:05:41] Just, after the split we don't know what the two halves will be called. Maybe neither will be wmcs.