[12:44:59] * dcaro paged
[12:46:23] got TektonDown and NodeNotReady in tools, looking
[12:48:30] kubectl get nodes shows all nodes up
[12:50:34] "Kubernetes cluster has no data about nodes marked as ready", it seems an artifact of prometheus, looking
[12:50:57] same thing with tekton, no data is getting into prometheus, maybe a cert expired?
[12:51:04] (that rings a bell)
[12:51:58] yep T309782
[12:51:59] T309782: toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782
[13:18:45] created a silence for all the *puppet* alerts as we are getting no data yet
[13:24:58] okok, data is starting to flow in
[13:25:14] tekton is detected up
[13:25:16] https://usercontent.irccloud-cdn.com/file/l4tfvpIt/image.png
[13:26:36] The alerts are clearing up, I declare the incident over, will re-check in a bit to see if all of them are gone but that's all for today
[13:32:58] shoot... the cloud puppetserver got out of space also
[13:33:04] https://www.irccloud.com/pastebin/AEHnsGH7/
[13:33:13] (unrelated to the previous issue)
[13:35:52] created T366406... looking
[13:35:52] T366406: [cloudvps] 2024-05-01 cloudinfra puppetserver got out of space - https://phabricator.wikimedia.org/T366406
[13:45:22] okok, freed some space and the runs are working again, I'll silence the alerts and check again in a bit
[13:46:40] * dcaro back in 1h or so