[12:46:56] godog: moritzm was saying you're having problems getting prometheus7001 installed?
[12:47:24] godog, denisse: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028818 should address T364354. sorry about the noise last night
[12:47:25] T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354
[12:47:40] topranks: hey, yes that's right, I'm running makevm and I see the VM up on ganeti7004, though no console
[12:47:50] vgutierrez: neat, thank you! will take a look shortly
[12:50:15] yeah, he explained it's not working
[12:50:43] the cookbook is at the 'waiting for reboot' stage, is it?
[12:50:55] yes, that's right
[12:50:57] [120/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for prometheus7001.magru.wmnet not found yet, keep polling for it: unable to get uptime
[12:50:58] I'll have a quick look to see if I can spot the issue with install7001; if not, we can revert to install1004 and try again
[12:51:27] thank you topranks! I'll leave makevm running, so if we reboot the VM and it starts working then we should be good
[12:58:41] yeah, so we haven't tried any host with install7001 fwiw
[12:58:46] all Traffic hosts we did were with 1004
[13:00:57] godog: out of curiosity, can you share the makevm command you ran?
[13:01:34] er, the reimage cookbook I guess, since it abstracts that
[13:02:22] sukhe: for sure, I'm using makevm indeed
[13:02:23] cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 125 --network private --os bullseye -p 7 --cluster magru02 --group B4 prometheus7001
[13:02:30] on cumin1002 FWIW
[13:04:14] yeah, weird!
[13:06:52] ikr?
[13:16:26] so we reverted to install1004, but we only half-reverted back to install7001
[13:16:46] homer?
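[Editor's note: the `[120/240, retrying in 10.00s]` messages above come from a polling loop that waits for the host to come back with a fresh boot time. A minimal Python sketch of that pattern, assuming a hypothetical `get_uptime` callable; this is an illustration, not the actual Spicerack implementation:]

```python
import time


def wait_for_reboot(get_uptime, reboot_time, attempts=240, interval=10.0):
    """Poll until the host reports a boot time later than reboot_time.

    get_uptime returns seconds since boot and may raise OSError while
    the host is still down ("unable to get uptime"). Hypothetical
    helper names; a sketch of the retry pattern, not Spicerack code.
    """
    for attempt in range(1, attempts + 1):
        try:
            uptime = get_uptime()
        except OSError:
            uptime = None  # host unreachable yet: keep polling
        # boot time = now - uptime; accept only boots after reboot_time
        if uptime is not None and time.time() - uptime >= reboot_time:
            return attempt
        time.sleep(interval)
    raise TimeoutError(f"Reboot not found after {attempts} attempts")
```

The cookbook above was stuck at attempt 120 of 240 because the VM never completed its PXE install, so `get_uptime` kept failing and the loop eventually timed out.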
[13:16:48] TL;DR: puppet was set to configure install7001, but the switches still had the install1004 IP to relay to
[13:17:04] yeah - and it was run, but the cumin host hadn't pulled the merged patch first
[13:17:09] ah
[13:17:15] godog: I guess we should abort that and try again
[13:17:28] I can't promise it'll work, mind, but I'm doing some traces to try to catch it if it fails
[13:17:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[13:18:02] topranks: ok! will try rebooting the host now, which should PXE boot
[13:18:09] the *VM*, not the host
[13:18:28] sure, yeah, that works
[13:19:34] ok, booting into d-i afaics
[13:20:06] yep, tons of traffic on the interface :)
[13:20:28] cheers topranks! will keep an eye on the console
[13:20:29] https://www.irccloud.com/pastebin/OihYmaEF/
[13:21:27] yeah, loading initrd.gz, let's see if it can do it
[13:22:40] FIRING: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[13:23:39] topranks: to confirm, prometheus7001 should be loading files from / talking to install7001 and not install1004?
[13:24:11] I take that back, it can't be, since install7001 isn't a thing atm
[13:24:54] and I take that back too, install7001.w.o not install7001.magru.wmnet
[13:25:07] no, install7001 is a thing
[13:25:28] it's on bullseye now and working; for some reason we had an issue with bookworm
[13:25:37] yeah, I figured it out eventually
[13:25:39] I/F will take that offline, it's fine on bullseye for now
[13:27:09] ok! yeah, makevm eventually timed out. I'm wondering if I can kick off a reimage now, or whether I should decom and makevm again?
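[Editor's note: the "half-revert" diagnosed above is a classic configuration-drift situation: two systems (Puppet and the switches' DHCP relay target) disagreeing about which install server is live. A generic sketch of a drift check, with made-up data; none of this reflects the actual Puppet or Homer tooling:]

```python
def find_drift(desired, actual):
    """Return keys whose value differs between two config views.

    Both arguments are plain dicts mapping a site to its configured
    install server; the names and data below are illustrative only.
    """
    return {
        key: (desired.get(key), actual.get(key))
        for key in set(desired) | set(actual)
        if desired.get(key) != actual.get(key)
    }


# What Puppet was set to configure vs what the switches still relayed to
puppet_view = {"magru": "install7001"}
switch_view = {"magru": "install1004"}
assert find_drift(puppet_view, switch_view) == {
    "magru": ("install7001", "install1004")
}
```

The fix in the log was simply to re-run Homer after the cumin host had pulled the merged patch, so both views agreed again.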
[13:27:16] moritzm: maybe you know? ^
[13:27:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[13:29:07] godog: in the past I have simply run the cookbook again and it worked fwiw
[13:29:29] hah! thanks sukhe, that's even easier, I'll try that
[13:29:56] it will remove the DNS records and such, but that's expected because it will assign them again
[13:30:56] indeed
[13:31:38] yeah, no: it added another entry and didn't remove the old one. I'll abort, decom and makevm again
[13:32:29] weird!
[13:32:49] sorry about that; in two cases it did roll back the changes, but I guess yours didn't hit the rollback step at all
[13:33:23] sure np, yes that's possible
[14:08:50] ok, mystery (to me) solved on why loading the kernel/initrd is slow: they are loaded from apt.w.o, whereas I was imagining they would be loaded from the local install server
[14:16:02] yeah, in the past these were co-hosted on the same install server, but with the creation of the split design these were kept on the main apt repo server
[14:16:31] otherwise we'd need to sync these separately from the main repo, which provided little gain for quite some complexity, given that installs are infrequent
[14:18:33] got it, thanks moritzm
[14:19:11] godog: Feel free to let me know if I can assist you on that.
[14:19:27] denisse: thank you! will let you know for sure
[16:35:01] I can confirm that if makevm gets interrupted and you run it again, it will use a new IP, and you get into some kind of cycle where the old IP isn't removed
[16:35:44] afair
[16:36:22] Yeah, I recall I ran into something similar once and I had to manually (?) remove the generated IP and certificate.
[16:36:26] I think the fix was that I just used the "next" hostname number
[16:36:34] and later removed the unused IP from netbox
[16:39:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:40:03] depends on when makevm gets interrupted
[16:41:17] makevm does roll back everything up until the VM is created, denisse, mutante; after that there is no point in rolling it back
[16:41:30] if the reimage fails you can re-run the reimage itself, no need to re-run makevm
[16:41:53] the only step after the VM creation is to run the reimage cookbook
[16:42:24] there is no need for any manual cleanup
[16:42:26] Oh, that's good to know. :o
[16:46:10] if you want, I've sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1028867 to make it more explicit
[16:47:55] volans: Thanks for taking a look at it. I gave it a +1.
[16:48:10] One small question regarding that patch: is there a reason not to use f-strings there?
[16:49:40] which dynamic info would you like to have?
[16:51:26] volans: Ah, none. That makes sense now. :)
[16:53:45] we could add the VM name if that helps, or any other info; feel free to modify it at will
[20:39:55] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:40:20] ^ Taking a look.
[20:40:30] ay yi yi
[20:40:37] Looks like it's clearing up slowly
[20:41:04] Yeah, the lag seems to be going down steadily.
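[Editor's note: the rollback behaviour volans describes above — undo every provisioning step up until the VM exists, nothing after — is an undo-stack pattern: each successful step registers its inverse, and on failure the inverses run in reverse order. A minimal sketch under those assumptions; step names are hypothetical and this is not the makevm code:]

```python
class RollbackStack:
    """Collect undo actions as steps succeed; run them LIFO on failure."""

    def __init__(self):
        self._undo = []

    def add(self, action):
        self._undo.append(action)

    def rollback(self):
        while self._undo:
            self._undo.pop()()  # undo the most recent step first


def provision(steps, undos):
    """Run (step, undo) pairs; on any failure, roll back completed steps.

    Mirrors the described behaviour: once the final step (VM creation)
    has succeeded, nothing is rolled back and the error propagates to
    the operator, who just re-runs the next cookbook (the reimage).
    """
    stack = RollbackStack()
    for step, undo in zip(steps, undos):
        try:
            step()
        except Exception:
            stack.rollback()
            raise
        stack.add(undo)
```

A design note from the log: rolling back past VM creation would be pointless, because at that point the only remaining work is the reimage cookbook, which is independently re-runnable.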
[20:41:17] I've ACK'd the alert.
[20:41:27] I think it'll self-resolve.