[09:36:46] I broke puppet on prometheus1005, looking (missing hiera value)
[09:38:41] ack dcaro
[09:42:41] this should do it https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071573
[09:48:46] godog: quick review? ^ (I'm open to moving the hiera entry anywhere else too)
[09:49:58] dcaro: yes we can have the key in a single place I think that'll be more maintainable
[09:50:05] if you are in a rush like that is fine too
[09:50:28] also it needs to be a fqdn I assume?
[09:51:11] and surely it'll fail in codfw ?
[09:51:26] it does :-)
[09:51:49] I was just about to report the failure (since I had forced a puppet run to drop libruby2.7 on 2005)
[09:53:16] godog: true, prometheus will run on codfw too, I moved it to eqiad, let me see if we can manage it being an fqdn for prometheus, but just a hostname for cloudcontrol
[09:54:38] hmm, it's being compared against $facts['fqdn'], but it was working before, is that not the fqdn?
[09:55:13] https://www.irccloud.com/pastebin/j4NEwYNP/
[09:55:21] it seems so, how was/is that working
[09:55:23] ?
[09:55:28] dcaro: I'm not sure I understand, as a fqdn in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071573/5/hieradata/eqiad.yaml should work
[09:55:45] that value is used in three different places
[09:56:28] oooohhh, 🤦, it was already fqdn xd
[09:57:43] now, ready
[09:58:55] lgtm
[09:59:06] ack, will merge
[10:04:05] hmm... that was not enough it seems, puppet is still failing on prometheus1005
[10:18:49] moritzm: it seems puppet is not reading the common.yaml and eqiad.yaml, I think I'm missing something xd, any ideas?
[10:20:32] I'll revert the patches for now, and do a nicer one properly tested
[10:21:58] where in the puppet tree does the lookup happen? there's multiple resolution strategies configurable, it might match at an upper layer and then never reach those
[10:22:15] our hierarchy is rather complex: https://wikitech.wikimedia.org/wiki/Puppet/Hiera
[10:24:38] just add me to reviewers to the new patch after the current revert, happy to have a deeper look
[10:24:58] there's two points where lookup happens, and both fail to resolve, one under profile::prometheus::cloud (https://puppet-compiler.wmflabs.org/output/1071573/3913/prometheus1005.eqiad.wmnet/change.prometheus1005.eqiad.wmnet.err) and one under profile::wmcs::services::maintain_dbusers (https://puppet-compiler.wmflabs.org/output/1071573/3913/cloudcontrol1005.eqiad.wmnet/change.cloudcontrol1005.eqiad.wmnet.err)
[10:25:06] from https://puppet-compiler.wmflabs.org/output/1071573/3913/
[10:29:52] submitted the revert to unblock puppet runs, puppet is now passing
[10:30:03] thank you dcaro
[10:46:52] I think I got a winner https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071577
[10:46:56] godog: ^
[12:02:45] dcaro: nice, LGTM
[12:02:57] ah yes, the old hieradata/common.yaml pitfall
[13:28:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:33:35] RESOLVED: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[19:08:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:18:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:23:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:28:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag