[08:45:15] <Emperor> ^-- should that email address work? It's what's listed at https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Infrastructure_Foundations/Contact too
[08:52:05] <moritzm> Emperor: it's already sorted, I sent a mail to the users, we'll create a new, separate idm-help@w.o alias for such issues and https://phabricator.wikimedia.org/T382226 to prevent that user error in the future
[08:52:16] <moritzm> user, not users
[08:52:51] <Emperor> 👠
[14:20:00] <Krinkle> (caveat: navtiming.py is unowned after the reorg which remains unresolved and I shouldn't be looking at this)
[14:20:02] <Krinkle> I noticed:
[14:20:07] <Krinkle> > FIRING: [3x] NavtimingStaleBeacon: No Navtiming CpuBenchmark messages in 80d 16h 50m 46s - https://wikitech.wikimedia.org/wiki/Performance/Runbook/Webperf-processor_services -
[14:20:34] <Krinkle> but it seems fine at the in-take:
[14:20:35] <Krinkle> https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=CpuBenchmark&from=now-6M&to=now
[14:21:04] <Krinkle> https://grafana.wikimedia.org/d/000000505/eventlogging?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-topic=eventlogging_CpuBenchmark&from=now-90d&to=now-5m
[14:21:21] <Krinkle> The alert comes from the intermediary processing service that takes it from kafka to prometheus
[14:21:57] <Krinkle> but the prometheus output of webperf_cpubenchmark_* metrics looks fine as well. https://grafana-rw.wikimedia.org/d/cFMjrb7nz/cpu-benchmark?orgId=1&viewPanel=15&editPanel=15
[14:22:38] <Krinkle> ref https://www.mediawiki.org/wiki/Developers/Maintainers#Services_and_administration
[14:29:52] <elukey> Krinkle: the definition of the alert should be https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-perf/webperf.yaml#13
[14:32:16] <Krinkle> https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%28time%28%29%20-%20webperf_latest_handled_time_seconds%29%20%2F%203600%22,%22range%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
[14:32:19] <Krinkle> I see.
[14:32:21] <Krinkle> Okay, so...
[14:32:26] <Krinkle> basically processing switched from eqiad to codfw
[14:32:43] <elukey> so time() - max(webperf_latest_handled_time_seconds{schema="CpuBenchmark"}) is 41.etc..
[14:32:57] <Krinkle> after /3600, it's <1 for the codfw ones
[14:33:00] <Krinkle> the eqiad ones are increasing
[14:33:10] <Krinkle> which is expected since there is an etcd lock to only be active in one DC at a given time
[14:33:17] <Krinkle> it follows the MW switch over automatically
[14:33:21] <Krinkle> or is supposed to anyway
[14:33:33] <Krinkle> in theory max() should hide that
[14:34:18] <elukey> if you check in https://thanos.wikimedia.org/ you can see the values for both clusters (webperf_latest_handled_time_seconds{schema="CpuBenchmark"})
[14:34:20] <Krinkle> https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%28time%28%29%20-%20max%28webperf_latest_handled_time_seconds%29%20by%20%28schema%29%29%20%2F%203600%22,%22range%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
[14:34:36] <Krinkle> Yeah, my first link above keeps each schema and site separately
[14:34:46] <Krinkle> the second does the max by schema, thus picking the "latest" site implicitly.
[14:35:02] <Krinkle> Does the alert not use thanos and/or does it run separately by site?
[14:35:14] <Krinkle> This second link shows no value above 1.0h
[14:35:23] <Krinkle> yet it alerts claiming to be over 80 days
[14:37:36] <elukey> Krinkle: if you check https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DNavtimingStaleBeacon you'll see the label "eqiad"
[14:37:55] <elukey> so yes it should be separate by site
[14:38:12] <Krinkle> ok, what if the business requirement is to not separate it by site?
[14:38:38] <Krinkle> I'm guessing there's a suppression from a year ago holding back the codfw variant
[14:40:15] <Krinkle> the only solution I'm aware of is to move this (back) to alerting via Grafana, which I'd probably prefer anyway, but I suppose that's up to the next owner to decide.
[14:41:14] <elukey> I think that observability can help, but about the ownership - since when is navtiming unowned? Was it communicated to SRE? (First time that I hear it)
[14:44:35] <Krinkle> elukey: since the perf team was disbanded.
[14:45:15] <Krinkle> most of the things that CPT and Perf owned prior to July 2023 have been unowned since then.
[14:45:19] <elukey> okok, it would be nice if a reorg took care of things like those though
[14:45:33] <elukey> anyway, I know it is not the perf team's fault :)
[14:48:40] <cdanis> Krinkle: if we have a prometheus metric indicating the active MW DC, this is pretty easy
[14:52:07] <Krinkle> cdanis: hm.. something like {site=$active_dc}?
[14:53:34] <Krinkle> or is there some alertmanager-level way to make it conditional from one metric value to another?
[14:53:50] <Krinkle> I assumed that if the former was an option, max() would suffice
[14:54:05] <Krinkle> which we use already, but it's alerting on the raw ops source instead of via thanos.
[14:54:24] <Krinkle> in grafana I can set the alert to use the Thanos data source as we do for most MW-related alerts already
[15:14:10] <godog> Krinkle: I skimmed the backlog; if you need alerts.git evaluated by thanos rather than prometheus you can use # deploy-tag: global in the alert file
[15:14:47] <cdanis> TIL!
[15:14:51] <Krinkle> ok, and we can split the yaml file then I guess along that axis?
[15:14:59] <Krinkle> one for global and one for ops/dc
[15:15:18] <Krinkle> arclamp runs active-active for example
[15:16:32] <godog> Krinkle: yes that would work
[15:40:08] <elukey> TIL global as well!
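As a rough illustration of godog's suggestion: a globally evaluated (Thanos-backed) variant of the NavtimingStaleBeacon rule could look like the sketch below. The expression is the max-by-schema one from Krinkle's second explore link; the threshold, `for:` duration, labels, and annotation keys are illustrative assumptions, not the actual contents of team-perf/webperf.yaml.

```yaml
# deploy-tag: global   # evaluated by Thanos (cross-site) rather than each site's Prometheus
groups:
  - name: webperf-global
    rules:
      - alert: NavtimingStaleBeacon
        # max by (schema) collapses the eqiad and codfw series into one per schema,
        # so the alert follows whichever DC is currently active instead of firing
        # for the idle one after a MW switchover.
        expr: (time() - max by (schema) (webperf_latest_handled_time_seconds)) / 3600 > 1
        for: 15m
        labels:
          team: perf
          severity: warning
        annotations:
          summary: "No Navtiming {{ $labels.schema }} messages recently"
          runbook: https://wikitech.wikimedia.org/wiki/Performance/Runbook/Webperf-processor_services
```

Per-site rules for services that genuinely run active-active (arclamp, for example) would then stay in a separate file without the global deploy-tag, as discussed below.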
[15:42:28] <inflatador> last 2 hosts I reimaged (cloudelastic1011 and 1012, both Bullseye) came up with Puppet 5, even though I specified Puppet 7. I didn't set the hiera as recommended by the cookbook, but I thought we were defaulting to Puppet 7 regardless now? Just wanted to make sure it doesn't have anything to do with UEFI
[15:45:53] <_joe_> maybe we should just retire arclamp, if there is no investment in it.
[15:46:08] <elukey> inflatador: o/ you need to set hiera yes :)
[15:48:18] <volans> elukey: but hieradata/role/common/elasticsearch/cloudelastic.yaml:profile::puppet::agent::force_puppet7: true
[15:48:30] <volans> and insetup also has P7 by default, so that seems weird
[15:49:13] <volans> they are insetup::search_platform
[15:49:20] <inflatador> maybe I messed up the regex or something?
[15:52:21] <moritzm> insetup::search_platform defaults to P7
[15:52:21] <elukey> volans: I didn't check, I thought from what inflatador wrote that no hiera was set, this is why I assumed it was missing
[15:52:26] <elukey> weird
[15:52:31] <inflatador> FWiW, I recently reimaged wdqs1025 (also EFI) and it didn't have this problem AFAICT
[15:55:35] <inflatador> yeah, on Dec 6th, I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1101095 which enabled EFI for wdqs1025 and I reimaged it while it was in insetup. The reimage finished successfully at 2024-12-09 18:06:06 based on cumin2002 logs
[15:58:31] <inflatador> note that cloudelastic1011/1012 are the 1st hosts to use a partman recipe I wrote ( see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103367 ), but that shouldn't matter, right?
[16:00:54] <inflatador> these are also our first Supermicro chassis. Anyway, the hosts aren't in service so if y'all wanna try and reimage again LMK
[16:04:50] <elukey> inflatador: in theory vendor and partman recipe shouldn't matter
[16:04:56] <volans> inflatador: what makes you say it has puppet 5?
[16:05:09] <volans> ii puppet 7.23.0-1~deb11u1 all transitional dummy package
[16:07:08] <inflatador> volans I fixed it myself via `install-console`... they came up w/out the Puppet 7 repo, I had to manually add it, run `puppet agent`, sign the CSR etc
[16:07:28] <volans> you shouldn't do those things manually
[16:07:53] <volans> elukey: I wonder if this is the boot order problem where it actually did d-i twice
[16:08:09] <inflatador> I agree, I prefer not to
[16:08:24] <jhathaway> hmm, could be
[16:09:36] <elukey> volans: in theory though every time I saw double d-i then at the CSR step the reimage procedure got stuck
[16:09:39] <elukey> or failed
[16:09:56] <elukey> inflatador: did it complete without errors? The reimage I mean
[16:10:19] <elukey> IIUC you fixed it manually and killed the reimage right?
[16:10:38] <elukey> if so it is probably double d-i, and it would fit
[16:14:21] <elukey> inflatador: if this re-happens (namely, reimage stuck etc.. at first try) just control-C the cookbook and then kick off a new reimage
[16:14:41] <fabfur> mmm something definitely wrong on some nodes
[16:14:52] <fabfur> `Error: Failed to apply catalog: Cannot convert '""' to an integer for parameter: number_of_facts_soft_limit`
[16:15:38] <fabfur> maybe related to 4b79506d1159d85cdd630116b098001170aece76 ?
[16:16:11] <fabfur> andrewbogott: wdyt?
[16:16:56] <andrewbogott> fabfur: that's for sure my patch but I don't know why it's happening to you... What host is it happening on?
[16:17:37] <fabfur> currently have alerts for cp7005 and cp5028
[16:18:00] <fabfur> other hosts are confirmed working fine
[16:18:09] <fabfur> (running puppet fine)
[16:18:16] <andrewbogott> huh, ok, looking...
[16:18:34] <elukey> inflatador: if you have time can you try to reimage the cloudelastic nodes again?
[16:18:56] <andrewbogott> fabfur: can I get a fqdn? I don't think I know what 7xxx means :)
[16:19:11] <fabfur> cp7005.magru.wmnet
[16:19:27] <fabfur> cp5028.eqsin.wmnet
[16:19:44] <volans> also others at https://puppetboard.wikimedia.org/nodes?status=failed
[16:19:53] <volans> but not too many, so not that widespread
[16:20:07] <fabfur> yeah those two popped up on traffic chan
[16:21:17] <andrewbogott> so what would result in
[16:21:19] <andrewbogott> lookup('profile::puppet::agent::facts_soft_limit', {'default_value' => 2048})
[16:21:22] <andrewbogott> evaluating to "" ?
[16:24:08] <andrewbogott> oh, those hosts got stuck in a transitional state somehow, the puppet catalog is actually correct
[16:25:32] <fabfur> fabfur@cp7005:~$ puppet lookup 'profile::puppet::agent::facts_soft_limit'
[16:25:32] <fabfur> fabfur@cp7005:~$ echo $?
[16:25:32] <fabfur> 1
[16:27:30] <andrewbogott> the fix is just
[16:27:30] <andrewbogott> sudo sed -i 's/^number_of_facts_soft_limit = $//g' /etc/puppet/puppet.conf
[16:27:33] <andrewbogott> on affected hosts
[16:28:08] <fabfur> why are some hosts affected while others aren't?
[16:28:33] * andrewbogott waves hands
[16:28:43] <andrewbogott> something to do with puppet altering its own config having a race? I don't know
[16:28:45] <moritzm> hmmh, is that some race where the puppet agent picks up its config while it's being updated
[16:28:49] <andrewbogott> :D
[16:29:15] <moritzm> I haven't seen that, but then we also rarely change central settings in the puppet.conf
[16:29:16] <fabfur> puppet is too efficient
[16:31:10] <andrewbogott> So in theory my sed is safe to run everywhere since it only clobbers that config line if it's already broken. What do you think?
[16:31:58] <andrewbogott> I guess if there's really a race in applying then it could happen again after removing that line
[16:32:01] <andrewbogott> but we can iterate :p
[16:32:48] <volans> it's just a handful of servers, I'd rather run it only on those
[16:33:26] <volans> or use https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[16:33:50] <andrewbogott> oh, nice
[16:34:57] <andrewbogott> hm, I don't immediately see how to combine that with a sed
[16:35:32] <volans> right
[16:35:36] <volans> the list of hosts is:
[16:35:36] <volans> bast4005.wikimedia.org,cephosd2003.codfw.wmnet,cp7014.magru.wmnet,elastic2088.codfw.wmnet,es2026.codfw.wmnet,logstash[1027,1036].eqiad.wmnet,mc1051.eqiad.wmnet,wikikube-worker2144.codfw.wmnet,wikikube-worker[1002,1080].eqiad.wmnet
[16:35:43] <andrewbogott> oh nice, thank you
[16:36:04] <volans> just grepped for that line on all of them and got the ones where it's empty
[16:36:22] <andrewbogott> for future reference... does "run-puppet-agent -q --failed-only" return true/false depending on whether the -q is activated?
[16:38:29] <andrewbogott> I ran that sed, now re-running puppet on the 11 hosts
[16:40:19] <andrewbogott> ok, should be all fixed. Thanks fabfur & volans
[16:40:30] <volans> thx
[16:40:31] <fabfur> thanks to you for the fix!
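For reference, the targeted cleanup could be expressed as a single Cumin invocation that combines volans' host list, andrewbogott's sed, and the "run only if the last run failed" helper from the wikitech page linked above. This is only a sketch of one way to do it, not what was actually run; the quoting and the use of run-puppet-agent here are assumptions.

```sh
# Minimal sketch, run from a cumin host. Cumin executes the command as root on each
# target, so no inner sudo is needed. The sed is andrewbogott's fix; \$ is escaped so
# the remote shell sees a literal end-of-line anchor. run-puppet-agent --failed-only
# then re-runs the agent only where the previous run had actually failed.
sudo cumin 'bast4005.wikimedia.org,cephosd2003.codfw.wmnet,cp7014.magru.wmnet,elastic2088.codfw.wmnet,es2026.codfw.wmnet,logstash[1027,1036].eqiad.wmnet,mc1051.eqiad.wmnet,wikikube-worker2144.codfw.wmnet,wikikube-worker[1002,1080].eqiad.wmnet' \
  "sed -i 's/^number_of_facts_soft_limit = \$//g' /etc/puppet/puppet.conf && run-puppet-agent --failed-only"
```

The --failed-only guard keeps the agent run a no-op on any host whose previous Puppet run had already succeeded, which is the property volans was pointing at.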
[16:40:43] * andrewbogott thinks it's too much to expect the pcc to say "this host compiles without errors 99.7% of the time"
[17:22:00] <_joe_> andrewbogott: not sure I follow
[17:22:53] <_joe_> compiling a catalog is deterministic, do you mean some history of changes?
[17:23:08] <andrewbogott> _joe_: We ran into an issue that seems to be an occasional race with applying puppet config, I was just making a crack that there's no way the pcc could have predicted it.
[17:23:26] <_joe_> andrewbogott: pcc does nothing for applying the config, it just compiles the catalog
[17:23:32] <andrewbogott> exactly :)
[17:23:35] <_joe_> so other things you won't know are dependency cycles
[17:23:55] <_joe_> ahh ok I thought it was a feature request :P
[17:24:35] <andrewbogott> Nope, we could definitely not have tested for this ahead of time :D
[17:30:43] <inflatador> elukey sorry to ghost, minor failure emergency which is over now ;) . both hosts' reimages stalled/failed at `The puppet server has no CSR for ${fqdn}`
[17:30:54] <inflatador> err...family emergency that is
[17:31:06] <inflatador> anyway, I can reimage cloudelastic1012 and let you know how it goes
[19:30:27] <elukey> inflatador: np! Yeah the failure in CSR is a sign of double debian install, same issue that we are seeing on other supermicro nodes sigh
[19:30:39] <elukey> I'll reimage 1011 tomorrow as well to check!
[19:32:07] <elukey> all info in https://phabricator.wikimedia.org/T381919