[08:00:51] morning
[08:01:16] o/
[08:49:00] morning
[09:31:36] hey folks - quick question on cloudbackup* hosts
[09:32:06] specifically cloudbackup2004 is in rack d7 in codfw... which we are due to do some network maintenance in later
[09:32:22] we're moving the hosts to newer switches, so there is an interruption to each for ~10s as we move the cable
[09:41:09] topranks: that should be ok!
[09:41:32] arturo: ok great :)
[09:41:49] I'll do all the basic connectivity checks afterwards anyway
[09:42:12] task is T373105 by the way
[09:42:15] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105
[09:45:42] also guys I might close this one?
[09:45:43] T373986
[09:45:43] T373986: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986
[09:46:00] things seem stable since the work Tuesday?
[09:46:01] yes I think so, thanks!
[09:46:32] obviously keep an eye on it. thanks.
[11:06:23] topranks: yep, so far so good :)
[11:06:26] * dcaro lunch
[11:06:37] I might be a bit late to the coworking space
[11:06:49] glad to hear :)
[11:54:32] fyi, gnmi stats for cloudsw switches are almost there; if you have knowledge about cfssl maybe you can help with the last blocker - https://phabricator.wikimedia.org/T375179
[11:55:29] quick sneak peek: https://grafana.wikimedia.org/d/5p97dAASz/network-interface-queue-and-error-stats?from=now-30m&var-device=cloudsw1-c8-eqiad&var-interface=All&var-site=eqiad%20prometheus%2Fops&orgId=1 (that's running it manually)
[11:57:19] nice!
[11:57:45] unfortunately, I have no experience with cfssl :/
[12:24:52] arturo: I've been meaning to get back into the 'NetTalk BOF' sessions, however today I have switch maintenance right after that I need to prepare for, so I think I need to cancel today. Plan to discuss the "mini pop" architecture in two weeks' time instead, if that's ok?
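For anyone picking up the cfssl blocker in T375179: the usual shape of issuing a certificate with cfssl is signing a CSR against an existing CA using a signing profile. A minimal sketch only; all filenames and the profile name below are hypothetical, not taken from the task:

```shell
# Sign a server certificate against an existing CA with cfssl.
# ca.pem / ca-key.pem: the issuing CA; config.json: signing profiles;
# csr.json: the certificate request definition. Names are hypothetical.
cfssl gencert \
  -ca ca.pem \
  -ca-key ca-key.pem \
  -config config.json \
  -profile server \
  csr.json | cfssljson -bare server
# cfssljson -bare writes server.pem and server-key.pem
```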
[12:25:15] topranks: yep, no problem, thanks for the heads up
[12:28:51] I'm seeing widespread puppet failures in several cloudvps projects
[12:30:17] it seems there's a missing entry in cloud.yaml
[12:30:22] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::firewall::ferm_status_script' (file: /srv/puppet_code/environments/production/modules/profile/manifests/firewall.pp, line: 21) on node tools-elastic-6.tools.eqiad1.wikimedia.cloud
[12:30:36] classic
[12:31:00] see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074113
[12:33:24] fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074166
[12:36:12] +1'd
[12:45:46] dhinus: I brain-dumped here about the default security groups thing and tofu: https://phabricator.wikimedia.org/T375111#10160484
[13:12:10] arturo: I saw your comments in the task and I'm browsing the openstack provider github issues to see if I can find some useful suggestions... I will comment in the task
[15:02:10] arturo: I've just posted an idea to the task, not sure it will work though
[15:05:33] thanks!
[16:00:19] arturo: I think all buster hosts are ceph hosts, so they're already tracked in T309789
[16:00:21] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[16:00:29] I'll create a task for upgrading bullseye hosts
[16:00:42] thanks! I'll add it to the notes
[16:00:47] thanks!
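The `lookup()` error above means hiera has no value for `profile::firewall::ferm_status_script` in the cloud realm. The actual fix is the linked gerrit change; as a sketch, this class of failure is typically resolved by adding a default for the key to the relevant hieradata file. The value and file placement below are purely illustrative, not the real ones:

```yaml
# hieradata sketch only -- the real value and path are in gerrit change 1074166
profile::firewall::ferm_status_script: '/usr/local/sbin/ferm-status'
```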
[16:00:52] * arturo offline
[16:01:11] EOL is 31 Aug 2026 so we have almost 2 years
[16:03:33] oh, so maybe that one is not that urgent
[16:10:17] yep, I've created a task anyway: https://phabricator.wikimedia.org/T375217
[16:10:21] and set priority to low :)
[16:10:28] 👍
[16:13:34] it's LTS support rather than standard support, not sure if we should worry about it https://wiki.debian.org/LTS
[16:23:09] hmm, tools-k8s-worker-nfs-24 has no network :/ -- as in, I rebooted it and it still does not have any network configured, weird, looking
[16:24:55] it has a network on openstack (`| addresses | lan-flat-cloudinstances2b=172.16.6.225`)
[16:27:43] hmm, it got a timeout while waiting for the network to be configured; that's dhcp and such :/, cloudnet iirc
[16:28:45] everything seems up (`openstack network agent list`)
[16:41:54] * dhinus offline
[17:12:49] * dcaro off
[17:34:27] * dcaro paged
[17:56:56] * dcaro back off, the page was not really a page
[17:57:01] (as in, expected downtime)
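The tools-k8s-worker-nfs-24 debugging above amounts to a few checks from the OpenStack side. A rough sketch of the same steps with the standard openstack CLI (the server name is from the chat; the grep pattern is just a guess at what a dhcp timeout would log):

```shell
# 1. Confirm the instance still has a port/address assigned in neutron
openstack server show tools-k8s-worker-nfs-24 -f value -c addresses

# 2. Check the network agents (dhcp/l3, hosted on the cloudnet nodes) are alive
openstack network agent list

# 3. Look for a dhcp/network-wait timeout in the instance console output
openstack console log show tools-k8s-worker-nfs-24 | grep -i -E 'dhcp|timeout'
```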