[12:46:56] godog: moritzm was saying you're having problems getting prometheus7001 installed?
[12:47:24] godog, denisse: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028818 should address T364354. sorry about the noise last night
[12:47:25] T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354
[12:47:40] topranks: hey, yes that's right, I'm running makevm and I see the VM up on ganeti7004, though no console
[12:47:50] vgutierrez: neat, thank you! will take a look shortly
[12:50:15] yeah, he explained it's not working
[12:50:43] the cookbook is at the 'waiting for reboot' stage, is it?
[12:50:55] yes, that's right
[12:50:57] [120/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for prometheus7001.magru.wmnet not found yet, keep polling for it: unable to get uptime
[12:50:58] I'll have a quick look to see if I can spot the issue with install7001; if not, we can revert to install1004 and try again
[12:51:27] thank you topranks! I'll leave makevm running, so if we reboot the VM and it starts working then we should be good
[12:58:41] yeah, so we haven't tried any host with install7001 fwiw
[12:58:46] all Traffic hosts we did were with 1004
[13:00:57] godog: out of curiosity, can you share the makevm command you ran?
[13:01:34] er, the reimage cookbook I guess, since it abstracts that
[13:02:22] sukhe: for sure, I'm using makevm indeed
[13:02:23] cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 125 --network private --os bullseye -p 7 --cluster magru02 --group B4 prometheus7001
[13:02:30] on cumin1002 FWIW
[13:04:14] yeah, weird!
[13:06:52] ikr?
[13:16:26] so we reverted to install1004, but we only half-reverted back to install7001
[13:16:46] homer?
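[Editor's note: the `[120/240, retrying in 10.00s]` messages above come from a polling loop that waits for the host to come back with a fresh boot time. A minimal Python sketch of that pattern, assuming a hypothetical `get_uptime` callable; this is an illustration, not the actual Spicerack implementation:]

```python
import time


def wait_for_reboot(get_uptime, reboot_time, attempts=240, interval=10.0):
    """Poll until the host reports a boot time later than reboot_time.

    get_uptime returns seconds since boot and may raise OSError while
    the host is still down ("unable to get uptime"). Hypothetical
    helper names; a sketch of the retry pattern, not Spicerack code.
    """
    for attempt in range(1, attempts + 1):
        try:
            uptime = get_uptime()
        except OSError:
            uptime = None  # host unreachable yet: keep polling
        # boot time = now - uptime; accept only boots after reboot_time
        if uptime is not None and time.time() - uptime >= reboot_time:
            return attempt
        time.sleep(interval)
    raise TimeoutError(f"Reboot not found after {attempts} attempts")
```

The cookbook above was stuck at attempt 120 of 240 because the VM never completed its PXE install, so `get_uptime` kept failing and the loop eventually timed out.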
[13:16:48] TL;DR: puppet was set to configure install7001, but the switches still had the install1004 IP to relay to
[13:17:04] yeah - and it was run, but the cumin host hadn't pulled the merged patch first
[13:17:09] ah
[13:17:15] godog: I guess we should abort that and try again
[13:17:28] I can't promise it'll work, mind, but I'm doing some traces to try to catch it if it fails
[13:17:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[13:18:02] topranks: ok! will try rebooting the host now, which should PXE boot
[13:18:09] the *VM*, not the host
[13:18:28] sure, yeah, that works
[13:19:34] ok, booting into d-i afaics
[13:20:06] yep, tons of traffic on the interface :)
[13:20:28] cheers topranks! will keep an eye on the console
[13:20:29] https://www.irccloud.com/pastebin/OihYmaEF/
[13:21:27] yeah, loading initrd.gz, let's see if it can do it
[13:22:40] FIRING: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[13:23:39] topranks: to confirm, prometheus7001 should be loading files from / talking to install7001 and not install1004?
[13:24:11] I take that back, it can't be, since install7001 isn't a thing atm
[13:24:54] and I take that back too, install7001.w.o not install7001.magru.wmnet
[13:25:07] no, install7001 is a thing
[13:25:28] it's on bullseye now and working; for some reason we had an issue with bookworm
[13:25:37] yeah, I figured it out eventually
[13:25:39] I/F will take that offline, it's fine on bullseye for now
[13:27:09] ok! yeah, makevm eventually timed out. I'm wondering if I can kick off a reimage now, or whether I should decom and makevm again?
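[Editor's note: the "half-revert" diagnosed above is a classic configuration-drift situation: two systems (Puppet and the switches' DHCP relay target) disagreeing about which install server is live. A generic sketch of a drift check, with made-up data; none of this reflects the actual Puppet or Homer tooling:]

```python
def find_drift(desired, actual):
    """Return keys whose value differs between two config views.

    Both arguments are plain dicts mapping a site to its configured
    install server; the names and data below are illustrative only.
    """
    return {
        key: (desired.get(key), actual.get(key))
        for key in set(desired) | set(actual)
        if desired.get(key) != actual.get(key)
    }


# What Puppet was set to configure vs what the switches still relayed to
puppet_view = {"magru": "install7001"}
switch_view = {"magru": "install1004"}
assert find_drift(puppet_view, switch_view) == {
    "magru": ("install7001", "install1004")
}
```

The fix in the log was simply to re-run Homer after the cumin host had pulled the merged patch, so both views agreed again.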
[13:27:16] moritzm: maybe you know? ^
[13:27:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[13:29:07] godog: in the past I have simply run the cookbook again and it worked fwiw
[13:29:29] hah! thanks sukhe, that's even easier, I'll try that
[13:29:56] it will remove the DNS records and such, but that's expected because it will assign them again
[13:30:56] indeed
[13:31:38] yeah, no: it added another entry and didn't remove the old one. I'll abort, decom and makevm again
[13:32:29] weird!
[13:32:49] sorry about that; in two cases it did roll back the changes, but I guess yours didn't hit the rollback step at all
[13:33:23] sure np, yes that's possible
[14:08:50] ok, mystery (to me) solved on why loading the kernel/initrd is slow: they are loaded from apt.w.o, whereas I was imagining they would be loaded from the local install server
[14:16:02] yeah, in the past these were co-hosted on the same install server, but with the creation of the split design these were kept on the main apt repo server
[14:16:31] otherwise we'd need to sync these separately from the main repo, which provided little gain for quite some complexity, given that installs are infrequent
[14:18:33] got it, thanks moritzm
[14:19:11] godog: Feel free to let me know if I can assist you on that.
[14:19:27] denisse: thank you! will let you know for sure
[16:35:01] I can confirm that if makevm gets interrupted and you run it again, it will use a new IP, and you get into some kind of cycle where the old IP isn't removed
[16:35:44] afair
[16:36:22] Yeah, I recall I ran into something similar once and I had to manually (?) remove the generated IP and certificate.
[16:36:26] I think the fix was that I just used the "next" hostname number
[16:36:34] and later removed the unused IP from netbox
[16:39:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:40:03] depends on when makevm gets interrupted
[16:41:17] makevm does roll back everything up until the VM is created, denisse, mutante; after that there is no point in rolling it back
[16:41:30] if the reimage fails you can re-run the reimage itself, no need to re-run makevm
[16:41:53] the only step after the VM creation is to run the reimage cookbook
[16:42:24] there is no need for any manual cleanup
[16:42:26] Oh, that's good to know. :o
[16:46:10] if you want, I've sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1028867 to make it more explicit
[16:47:55] volans: Thanks for taking a look at it. I gave it a +1.
[16:48:10] One small question regarding that patch: is there a reason not to use f-strings there?
[16:49:40] which dynamic info would you like to have?
[16:51:26] volans: Ah, none. That makes sense now. :)
[16:53:45] we could add the VM name if that helps, or any other info; feel free to modify it at will
[20:39:55] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:40:20] ^ Taking a look.
[20:40:30] ay yi yi
[20:40:37] Looks like it's clearing up slowly
[20:41:04] Yeah, the lag seems to be going down steadily.
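[Editor's note: the rollback behaviour volans describes above — undo every provisioning step up until the VM exists, nothing after — is an undo-stack pattern: each successful step registers its inverse, and on failure the inverses run in reverse order. A minimal sketch under those assumptions; step names are hypothetical and this is not the makevm code:]

```python
class RollbackStack:
    """Collect undo actions as steps succeed; run them LIFO on failure."""

    def __init__(self):
        self._undo = []

    def add(self, action):
        self._undo.append(action)

    def rollback(self):
        while self._undo:
            self._undo.pop()()  # undo the most recent step first


def provision(steps, undos):
    """Run (step, undo) pairs; on any failure, roll back completed steps.

    Mirrors the described behaviour: once the final step (VM creation)
    has succeeded, nothing is rolled back and the error propagates to
    the operator, who just re-runs the next cookbook (the reimage).
    """
    stack = RollbackStack()
    for step, undo in zip(steps, undos):
        try:
            step()
        except Exception:
            stack.rollback()
            raise
        stack.add(undo)
```

A design note from the log: rolling back past VM creation would be pointless, because at that point the only remaining work is the reimage cookbook, which is independently re-runnable.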
[20:41:17] I've ACK'd the alert.
[20:41:27] I think it'll self-resolve.