[09:10:34] good morning
[09:11:12] dcaro: do you plan to follow up with robh regarding hardware for ceph?
[09:11:30] (email)
[09:12:49] done already
[09:13:03] added a note in the network meeting too to discuss the switch space and such
[09:13:42] there's not much, so if we want to spread the nodes we'll have to either shift other hosts around, or if QoS is ready, then we can try the single port setup
[09:22:16] ok, thanks
[09:22:51] dcaro: there was this plan to create a document with the different ceph HA strategies we had been exploring. Do you know if it was finally created?
[09:23:18] The document was created, but it has not yet been filled in (PTO + other more urgent work)
[09:25:12] https://docs.google.com/document/d/1UtMK8ZLLfn1CFbcgBccBvzTlIuab244XAs98_gTuBlg/edit?tab=t.0#heading=h.qqx9jzis1t6 feel free to add anything if you have time
[09:26:54] ok, thanks
[10:14:03] arturo: thanks, added a bit more structure if you want to keep adding stuff
[10:14:13] ack
[11:38:21] * dcaro lunch
[12:38:49] heads up I'm testing the NodeDown alerts by shutting down cloudvirt1063 (which is already in maintenance)
[12:45:44] ack
[13:06:21] ack
[13:11:37] we're starting the toolsbeta upgrade with Raymond_Ndibe
[13:16:40] dhinus: ack, I'm in a public place without headphones xd, let me know if you need anything from me
[13:28:43] ok!
[13:41:01] dhinus: Raymond_Ndibe there's one worker node (nfs-7) in toolsbeta that's scheduling disabled, is that you?
[13:41:02] hmm not sure, unless it's the cookbook doing it?
[13:41:02] what can cause the worker node to be "scheduling disabled"?
[13:41:02] we're also seeing one error in functional tests (haven't started the upgrade yet)
[13:41:02] the instructions usually start from the control nodes
[13:41:03] I think it might be manually set (or with the cookbook)
[13:41:03] we are at line 22 of https://etherpad.wikimedia.org/p/k8s-1.27-to-1.28-upgrade
[13:41:03] ummm no, that shouldn't be happening. We are about to begin toolsbeta upgrade but so far the only command we've executed is `sudo cookbook wmcs.toolforge.k8s.prepare_upgrade --cluster-name toolsbeta --src-version 1.27.16 --dst-version 1.28.14 --cluster-name toolsbeta --task-id `
[13:41:04] we should make the tests pass before upgrading though
[13:41:04] what's the faiure?
[13:41:04] *failure
[13:41:04] I don't expect that command to have that effect
[13:41:04] btw. I'll uncordon the node, see if it works ok
[13:41:04] dcaro: ok
[13:41:04] Raymond_Ndibe: I don't think the command did it, it should have been someone doing it manually or an old run of the cookbook or something
[13:41:04] the failure is coming from `get components health without auth works`
[13:41:05] looking, the new node seems to be unable to start pods :/
[13:41:05] I wasn't expecting that to be an issue until we get to tools but yea, I'll fix it before continuing
[13:41:05] not really, the node is running a build ok
[13:41:05] aaaahhh, the pod was a test....
[13:41:05] nm, the node seems to be working ok
[13:41:05] looking into components, I think it might be the old image issue
[13:41:05] hmm, it's failing to pull the new image, but the old one is still running, what's the output of that test?
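A minimal sketch of the kind of checks being discussed above, assuming plain kubectl access to the cluster; the node, namespace and pod names are placeholders, not taken from the log:

    # a cordoned node shows up as Ready,SchedulingDisabled
    kubectl get nodes
    # re-enable scheduling on it
    kubectl uncordon <node-name>
    # pods stuck on an image pull show ImagePullBackOff / ErrImagePull
    kubectl -n <namespace> get pods
    # the Events section at the bottom shows the failing pull and why it fails
    kubectl -n <namespace> describe pod <pod-name>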
[13:41:05] yeaa it is
[13:41:06] https://www.irccloud.com/pastebin/VV23og7o/
[13:41:06] hmm, sometimes it works, sometimes it does not :/
[13:41:06] https://www.irccloud.com/pastebin/CTeRUCeV/
[13:42:07] I'll update the components-api deployment by hand to an image that exists :/
[13:42:41] I don't expect that to solve this particular problem
[13:43:38] using tools-harbor.wmcloud.org/toolforge/components-api:image-0.0.27-20241030094558-9279575a
[13:43:47] it should be able to create all the pods correctly now
[13:44:22] Raymond_Ndibe: why do you think so?
[13:45:06] this seems to be working now
[13:45:10] https://www.irccloud.com/pastebin/WyMsRUUc/
[13:45:19] dcaro: is that a newer or older version?
[13:45:23] newer
[13:45:36] latest one I found in tools harbor
[13:45:46] what was the previous version?
[13:45:52] 0.0.19
[13:45:56] the previous version was 0.0.19
[13:45:57] yes
[13:46:53] are the tests passing now?
[13:47:36] seems to be working. {"status": "OK"} vs {data: {"status": "OK"}...}
[13:48:05] yep, it should not care as long as it's a 200
[13:48:10] and json parseable
[13:48:12] iirc
[13:48:17] I thought we were running 0.0.42 after this MR? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/564
[13:48:58] looks like it should yep
[13:49:02] was that deployed?
[13:50:02] that's what's in the code, so yep, maybe it was never deployed?
[13:50:23] (we should use the same for tools btw)
[13:50:24] I didn't run the deploy 🤦🏽
[13:50:45] ack, that makes sense then :)
[13:50:57] we have a pending task to add some kind of check for things not having been deployed xd
[13:51:20] yes we didn't add it for tools because we wanted to discuss it with you dcaro. But if you think we should use the same version for tools that would be great
[13:51:46] yep, we can bump to the latest, though we might want to have the 'disable' flag before that just in case
[13:52:30] do you think we should deploy 0.42 to toolsbeta before or after the 1.28 upgrade?
[13:52:51] my suggestion is to continue the toolsbeta upgrade, then I add the disable flag, then we deploy that for components-api before we do the tools upgrade next week
[13:54:07] we can go with this upgrade, and deploy later
[13:56:18] the upgrade cookbook is running :rocket:
[13:56:51] \o/
[14:09:06] * dcaro has to relocate again, be back in a bit
[14:48:50] control plane looks all upgraded and well, are tests passing?
[15:01:36] the whole cluster is upgraded now, Raymond_Ndibe how are tests going?
[15:02:21] all tests are passing
[15:02:46] going to let it run a bit more but everything looks good
[15:04:16] \o/
[16:18:06] Raymond_Ndibe: is everything done for the toolsbeta upgrade then? Anything missing I can help with?
[16:18:55] arturo: do you know if cloudcontrol2006-dev can be powered down? see last comment in T370401
[16:18:55] T370401: cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401
[16:19:08] dhinus: I'll reply soon
[16:19:20] but the answer is yes
[16:19:20] ack
[16:19:32] thanks!
[16:58:02] * arturo offline
[17:50:16] dcaro: yes, everything is done for toolsbeta. tests ran for approx 3 hrs without errors so yea
[17:51:27] obviously there are still some minor things left, like removing the flag thing and some version upgrades for components, but those can be handled
[17:56:17] Raymond_Ndibe: ack, kudos! :)
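Updating the components-api deployment image by hand, as mentioned at 13:42, would look roughly like this; only the image reference comes from the log, the namespace and container name are assumptions:

    # point the deployment at an image that actually exists in the registry
    kubectl -n <namespace> set image deployment/components-api \
        components-api=tools-harbor.wmcloud.org/toolforge/components-api:image-0.0.27-20241030094558-9279575a
    # wait for the rollout to finish and the new pods to become ready
    kubectl -n <namespace> rollout status deployment/components-api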
[18:02:46] dhinus: re T375101#10278327, we have a cookbook for this, which also includes the meta_p update run which you missed
[18:02:47] T375101: Prepare and check storage layer for nrwiki - https://phabricator.wikimedia.org/T375101
[18:05:27] taavi: thanks, I followed https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas/DNS which does not mention the cookbook, I will update the wiki
[18:05:55] i recommend following https://wikitech.wikimedia.org/wiki/Add_a_wiki#Maintain_views instead
[18:06:04] that wiki page is about the dns mechanism specifically, not the general process
[18:06:46] this section seemed to match what I needed... https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas/DNS#Adding_a_new_wikidb_alias
[18:07:00] "When a new wikidb is added"
[18:07:38] let me add a colored box pointing to the other docs
[18:07:44] thanks :)
[18:08:59] I noticed that the script run updated a few other aliases, I wonder if some other wikis were added in the past and the cookbook was not run
[18:09:19] how can we make sure all the DNS records are in sync?
[18:09:36] by running the script and seeing if it notices any not in sync :-)
[18:10:53] I'm running the cookbook now, let's see what the output is
[18:11:12] though the cookbook takes a single wiki as an argument
[18:11:23] and I don't want to run it once for each wiki :)
[18:11:45] I think the DNS script will check all the aliases
[18:11:50] but what about meta_p?
[18:12:06] if you mean checking if the DNS is in sync, you do that with the wikireplicas-dns script
[18:12:32] i don't think the meta-p script has an option to detect mis-syncs, it just reloads the data
[18:12:37] yes sorry, my question was incomplete
[18:12:50] I will do a full manual run of the DNS script, and that should sort out DNS
[18:13:11] didn't you say you did that already?
[18:13:25] anyhow, none of this would be a problem if everyone used the cookbook :-)
[18:13:25] only with --shard s5
[18:15:42] the cookbook runs the dns script without the "--shard" option so that will sort out things
[18:15:49] even if it takes longer....
[18:16:22] I though I was smarted and saved some time by adding "--shard s5" but karma is getting me
[18:16:28] *thought
[18:16:32] *smarter
[18:16:40] * dhinus cannot type
[18:17:55] that script could probably be made much faster by listing all of the existing records at once instead of calling the designate API once per wanted record to get the current state :-)
[18:25:39] last question (I hope): from a quick look at the maintain-meta_p script, it looks like it's only adding the wiki you specify in the args... I worry that if the same mistake of not using the cookbook was made in the past, we might have some wikis missing from the meta_p.wiki table?
[18:26:50] or maybe I'm reading it wrong and it's looping? I see there's also --purge
[18:27:44] i think it also has a flag to re-create the entire table, but yes, if you only ran it with a wiki specified now, it's entirely possible that it's missing something
[18:28:27] so I should probably run that script manually with --all-databases --purge
[18:28:42] what's relying on that table?
[18:29:11] all the bots that need to iterate each wiki
[18:29:17] and things like https://db-names.toolforge.org/
[18:30:00] gotcha
[18:34:00] any concerns before I run "maintain-meta_p --purge --all-databases" on clouddb1014 and clouddb1018?
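Putting the pieces above together, the full resync being discussed would look roughly like this; the script names and flags are the ones mentioned in the log, but the exact invocations, hosts to run on and sudo requirements are assumptions:

    # resync all the wikireplica DNS aliases (no --shard filter, so every section is checked)
    sudo wikireplicas-dns
    # on clouddb1014 and clouddb1018: rebuild the meta_p.wiki table for every wiki,
    # checking with a dry run first
    sudo maintain-meta_p --dry-run --purge --all-databases
    sudo maintain-meta_p --purge --all-databases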
[18:36:34] the cookbook completed successfully in the meantime
[18:37:42] I'm testing maintain-meta_p with --dry-run
[18:38:52] it fails with "WARNING failed request for https://labtestwikitech.wikimedia.org"
[18:39:06] I'll postpone this to tomorrow :)
[18:39:35] thanks taavi for the help, and for spotting I had missed the meta_p update!
[18:40:08] * dhinus logs off
[18:51:39] * dcaro off
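On the remark at 18:17:55 about making the DNS script faster: the idea is a single bulk read of the zone instead of one designate API call per wanted record. A rough sketch of that idea using the OpenStack CLI (zone name hypothetical; the real script talks to the designate API directly):

    # one call fetches the current recordsets for the whole zone...
    openstack recordset list <zone-name> > current-recordsets.txt
    # ...then the wanted aliases are compared against that local copy, and only
    # the recordsets that are missing or stale are created or updated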