[08:45:15] <Emperor> ^-- should that email address work? It's what's listed at https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Infrastructure_Foundations/Contact too
[08:52:05] <moritzm> Emperor: it's already sorted, I sent a mail to the users, we'll create a new, separate idm-help@w.o alias for such issues and https://phabricator.wikimedia.org/T382226 to prevent that user error in the future
[08:52:16] <moritzm> user, not users
[08:52:51] <Emperor> 👠
[14:20:00] <Krinkle> (caveat: navtiming.py is unowned after the reorg which remains unresolved and I shouldn't be looking at this)
[14:20:02] <Krinkle> I noticed:
[14:20:07] <Krinkle> > FIRING: [3x] NavtimingStaleBeacon: No Navtiming CpuBenchmark messages in 80d 16h 50m 46s - https://wikitech.wikimedia.org/wiki/Performance/Runbook/Webperf-processor_services -
[14:20:34] <Krinkle> but it seems fine at the in-take:
[14:20:35] <Krinkle> https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=CpuBenchmark&from=now-6M&to=now
[14:21:04] <Krinkle> https://grafana.wikimedia.org/d/000000505/eventlogging?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-topic=eventlogging_CpuBenchmark&from=now-90d&to=now-5m
[14:21:21] <Krinkle> The alert comes from the intermediary processing service that takes it from kafka to prometheus
[14:21:57] <Krinkle> but the prometheus output of webperf_cpubenchmark_* metrics looks fine as well. https://grafana-rw.wikimedia.org/d/cFMjrb7nz/cpu-benchmark?orgId=1&viewPanel=15&editPanel=15
[14:22:38] <Krinkle> ref https://www.mediawiki.org/wiki/Developers/Maintainers#Services_and_administration
[14:29:52] <elukey> Krinkle: the definition of the alert should be https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-perf/webperf.yaml#13
[14:32:16] <Krinkle> https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%28time%28%29%20-%20webperf_latest_handled_time_seconds%29%20%2F%203600%22,%22range%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
[14:32:19] <Krinkle> I see.
[14:32:21] <Krinkle> Okay, so...
[14:32:26] <Krinkle> basically processing switched from eqiad to codfw
[14:32:43] <elukey> so time() - max(webperf_latest_handled_time_seconds{schema="CpuBenchmark"}) is 41.etc..
[14:32:57] <Krinkle> after /3600, it's <1 for the codfw ones
[14:33:00] <Krinkle> the eqiad ones are increasing
[14:33:10] <Krinkle> which is expected since there is an etcd lock to only be active in one DC at a given time
[14:33:17] <Krinkle> it follows the MW switch over automatically
[14:33:21] <Krinkle> or is supposed to anyway
[14:33:33] <Krinkle> in theory max() should hide that
[14:34:18] <elukey> if you check in https://thanos.wikimedia.org/ you can see the values for both clusters (webperf_latest_handled_time_seconds{schema="CpuBenchmark"})
[14:34:20] <Krinkle> https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%28time%28%29%20-%20max%28webperf_latest_handled_time_seconds%29%20by%20%28schema%29%29%20%2F%203600%22,%22range%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
[14:34:36] <Krinkle> Yeah, my first link above keeps each schema and site separately
[14:34:46] <Krinkle> the second does the max by schema, thus picking the "latest" site implicitly.
[14:35:02] <Krinkle> Does the alert not use thanos and/or does it run separately by site?
[14:35:14] <Krinkle> This second link shows no value above 1.0h
[14:35:23] <Krinkle> yet it alerts claiming to be over 80 days
[14:37:36] <elukey> Krinkle: if you check https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DNavtimingStaleBeacon you'll see the label "eqiad"
[14:37:55] <elukey> so yes it should be separate by site
[14:38:12] <Krinkle> ok, what if the business requirement is to not separate it by site?
[14:38:38] <Krinkle> I'm guessing there's a suppression from a year ago holding back the codfw variant
[14:40:15] <Krinkle> the only solution I'm aware of is to move this (back) to alerting via Grafana, which I'd probably prefer anyway, but I suppose that's up to the next owner to decide.
[14:41:14] <elukey> I think that observability can help, but about the ownership - since when is navtiming unowned? Was it communicated to SRE? (First time that I hear it)
[14:44:35] <Krinkle> elukey: since the perf team was disbanded.
[14:45:15] <Krinkle> most of the things that CPT and Perf owned prior to July 2023 have been unowned since then.
[14:45:19] <elukey> okok, it would be nice if a reorg took care of things like those though
[14:45:33] <elukey> anyway, I know it is not the perf team's fault :)
[14:48:40] <cdanis> Krinkle: if we have a prometheus metric indicating the active MW DC, this is pretty easy
[14:52:07] <Krinkle> cdanis: hm.. something like {site=$active_dc}?
[14:53:34] <Krinkle> or is there some alertmanager-level way to make it conditional from one metric value to another?
[14:53:50] <Krinkle> I assumed that if the former was an option, max() would suffice
[14:54:05] <Krinkle> which we use already, but it's alerting on the raw ops source instead of via thanos.
[14:54:24] <Krinkle> in grafana I can set the alert to use the Thanos data source as we do for most MW-related alerts already
[15:14:10] <godog> Krinkle: I skimmed the backlog; if you need alerts.git evaluated by thanos rather than prometheus you can use # deploy-tag: global in the alert file
[15:14:47] <cdanis> TIL!
[15:14:51] <Krinkle> ok, and we can split the yaml file then I guess along that axis?
[15:14:59] <Krinkle> one for global and one for ops/dc
[15:15:18] <Krinkle> arclamp runs active-active for example
[15:16:32] <godog> Krinkle: yes that would work
[15:40:08] <elukey> TIL global as well!
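As a rough illustration of godog's suggestion: a globally evaluated (Thanos-backed) variant of the NavtimingStaleBeacon rule could look like the sketch below. The expression is the max-by-schema one from Krinkle's second explore link; the threshold, `for:` duration, labels, and annotation keys are illustrative assumptions, not the actual contents of team-perf/webperf.yaml.

```yaml
# deploy-tag: global   # evaluated by Thanos (cross-site) rather than each site's Prometheus
groups:
  - name: webperf-global
    rules:
      - alert: NavtimingStaleBeacon
        # max by (schema) collapses the eqiad and codfw series into one per schema,
        # so the alert follows whichever DC is currently active instead of firing
        # for the idle one after a MW switchover.
        expr: (time() - max by (schema) (webperf_latest_handled_time_seconds)) / 3600 > 1
        for: 15m
        labels:
          team: perf
          severity: warning
        annotations:
          summary: "No Navtiming {{ $labels.schema }} messages recently"
          runbook: https://wikitech.wikimedia.org/wiki/Performance/Runbook/Webperf-processor_services
```

Per-site rules for services that genuinely run active-active (arclamp, for example) would then stay in a separate file without the global deploy-tag, as discussed below.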
[15:42:28] <inflatador> last 2 hosts I reimaged (cloudelastic1011 and 1012, both Bullseye) came up with Puppet 5, even though I specified Puppet 7. I didn't set the hiera as recommended by the cookbook, but I thought we were defaulting to Puppet 7 regardless now? Just wanted to make sure it doesn't have anything to do with UEFI
[15:45:53] <_joe_> maybe we should just retire arclamp, if there is no investment in it.
[15:46:08] <elukey> inflatador: o/ you need to set hiera yes :)
[15:48:18] <volans> elukey: but hieradata/role/common/elasticsearch/cloudelastic.yaml:profile::puppet::agent::force_puppet7: true
[15:48:30] <volans> and insetup also has P7 by default, so that seems weird
[15:49:13] <volans> they are insetup::search_platform
[15:49:20] <inflatador> maybe I messed up the regex or something?
[15:52:21] <moritzm> insetup::search_platform defaults to P7
[15:52:21] <elukey> volans: I didn't check, I thought from what inflatador wrote that no hiera was set, this is why I assumed it was missing
[15:52:26] <elukey> weird
[15:52:31] <inflatador> FWiW, I recently reimaged wdqs1025 (also EFI) and it didn't have this problem AFAICT
[15:55:35] <inflatador> yeah, on Dec 6th, I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1101095 which enabled EFI for wdqs1025 and I reimaged it while it was in insetup. The reimage finished successfully at 2024-12-09 18:06:06 based on cumin2002 logs
[15:58:31] <inflatador> note that cloudelastic1011/1012 are the 1st hosts to use a partman recipe I wrote ( see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103367 ), but that shouldn't matter, right?
[16:00:54] <inflatador> these are also our first Supermicro chassis. Anyway, the hosts aren't in service so if y'all wanna try and reimage again LMK
[16:04:50] <elukey> inflatador: in theory vendor and partman recipe shouldn't matter
[16:04:56] <volans> inflatador: what makes you say it has puppet 5?
[16:05:09] <volans> ii puppet 7.23.0-1~deb11u1 all transitional dummy package
[16:07:08] <inflatador> volans I fixed it myself via `install-console`... they came up w/out the Puppet 7 repo, I had to manually add it, run `puppet agent`, sign the CSR etc
[16:07:28] <volans> you shouldn't do those things manually
[16:07:53] <volans> elukey: I wonder if this is the boot order problem where it actually did d-i twice
[16:08:09] <inflatador> I agree, I prefer not to
[16:08:24] <jhathaway> hmm, could be
[16:09:36] <elukey> volans: in theory though every time I saw double d-i then at the CSR step the reimage procedure got stuck
[16:09:39] <elukey> or failed
[16:09:56] <elukey> inflatador: did it complete without errors? The reimage I mean
[16:10:19] <elukey> IIUC you fixed it manually and killed the reimage right?
[16:10:38] <elukey> if so it is probably double d-i, and it would fit
[16:14:21] <elukey> inflatador: if this re-happens (namely, reimage stuck etc.. at first try) just control-C the cookbook and then kick off a new reimage
[16:14:41] <fabfur> mmm something definitely wrong on some nodes
[16:14:52] <fabfur> `Error: Failed to apply catalog: Cannot convert '""' to an integer for parameter: number_of_facts_soft_limit`
[16:15:38] <fabfur> maybe related to 4b79506d1159d85cdd630116b098001170aece76 ?
[16:16:11] <fabfur> andrewbogott: wdyt?
[16:16:56] <andrewbogott> fabfur: that's for sure my patch but I don't know why it's happening to you... What host is it happening on?
[16:17:37] <fabfur> currently have alerts for cp7005 and cp5028
[16:18:00] <fabfur> other hosts are confirmed working fine
[16:18:09] <fabfur> (running puppet fine)
[16:18:16] <andrewbogott> huh, ok, looking...
[16:18:34] <elukey> inflatador: if you have time can you try to reimage the cloudelastic nodes again?
[16:18:56] <andrewbogott> fabfur: can I get a fqdn? I don't think I know what 7xxx means :)
[16:19:11] <fabfur> cp7005.magru.wmnet
[16:19:27] <fabfur> cp5028.eqsin.wmnet
[16:19:44] <volans> also others at https://puppetboard.wikimedia.org/nodes?status=failed
[16:19:53] <volans> but not too many, so not that widespread
[16:20:07] <fabfur> yeah those two popped up on traffic chan
[16:21:17] <andrewbogott> so what would result in
[16:21:19] <andrewbogott> lookup('profile::puppet::agent::facts_soft_limit', {'default_value' => 2048})
[16:21:22] <andrewbogott> evaluating to "" ?
[16:24:08] <andrewbogott> oh, those hosts got stuck in a transitional state somehow, the puppet catalog is actually correct
[16:25:32] <fabfur> fabfur@cp7005:~$ puppet lookup 'profile::puppet::agent::facts_soft_limit'
[16:25:32] <fabfur> fabfur@cp7005:~$ echo $?
[16:25:32] <fabfur> 1
[16:27:30] <andrewbogott> the fix is just
[16:27:30] <andrewbogott> sudo sed -i 's/^number_of_facts_soft_limit = $//g' /etc/puppet/puppet.conf
[16:27:33] <andrewbogott> on affected hosts
[16:28:08] <fabfur> why are some hosts affected while others aren't?
[16:28:33] * andrewbogott waves hands
[16:28:43] <andrewbogott> something to do with puppet altering its own config having a race? I don't know
[16:28:45] <moritzm> hmmh, is that some race where the puppet agent picks up its config while it's being updated
[16:28:49] <andrewbogott> :D
[16:29:15] <moritzm> I haven't seen that, but then we also rarely change central settings in the puppet.conf
[16:29:16] <fabfur> puppet is too efficient
[16:31:10] <andrewbogott> So in theory my sed is safe to run everywhere since it only clobbers that config line if it's already broken. What do you think?
[16:31:58] <andrewbogott> I guess if there's really a race in applying then it could happen again after removing that line
[16:32:01] <andrewbogott> but we can iterate :p
[16:32:48] <volans> it's just a handful of servers, I'd rather run it only on those
[16:33:26] <volans> or use https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[16:33:50] <andrewbogott> oh, nice
[16:34:57] <andrewbogott> hm, I don't immediately see how to combine that with a sed
[16:35:32] <volans> right
[16:35:36] <volans> the list of hosts is:
[16:35:36] <volans> bast4005.wikimedia.org,cephosd2003.codfw.wmnet,cp7014.magru.wmnet,elastic2088.codfw.wmnet,es2026.codfw.wmnet,logstash[1027,1036].eqiad.wmnet,mc1051.eqiad.wmnet,wikikube-worker2144.codfw.wmnet,wikikube-worker[1002,1080].eqiad.wmnet
[16:35:43] <andrewbogott> oh nice, thank you
[16:36:04] <volans> just grepped for that line on all of them and got the ones where it's empty
[16:36:22] <andrewbogott> for future reference... does "run-puppet-agent -q --failed-only" return true/false depending on whether the -q is activated?
[16:38:29] <andrewbogott> I ran that sed, now re-running puppet on the 11 hosts
[16:40:19] <andrewbogott> ok, should be all fixed. Thanks fabfur & volans
[16:40:30] <volans> thx
[16:40:31] <fabfur> thanks to you for the fix!
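For reference, the targeted cleanup could be expressed as a single Cumin invocation that combines volans' host list, andrewbogott's sed, and the "run only if the last run failed" helper from the wikitech page linked above. This is only a sketch of one way to do it, not what was actually run; the quoting and the use of run-puppet-agent here are assumptions.

```sh
# Minimal sketch, run from a cumin host. Cumin executes the command as root on each
# target, so no inner sudo is needed. The sed is andrewbogott's fix; \$ is escaped so
# the remote shell sees a literal end-of-line anchor. run-puppet-agent --failed-only
# then re-runs the agent only where the previous run had actually failed.
sudo cumin 'bast4005.wikimedia.org,cephosd2003.codfw.wmnet,cp7014.magru.wmnet,elastic2088.codfw.wmnet,es2026.codfw.wmnet,logstash[1027,1036].eqiad.wmnet,mc1051.eqiad.wmnet,wikikube-worker2144.codfw.wmnet,wikikube-worker[1002,1080].eqiad.wmnet' \
  "sed -i 's/^number_of_facts_soft_limit = \$//g' /etc/puppet/puppet.conf && run-puppet-agent --failed-only"
```

The --failed-only guard keeps the agent run a no-op on any host whose previous Puppet run had already succeeded, which is the property volans was pointing at.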
[16:40:43] * andrewbogott thinks it's too much to expect the pcc to say "this host compiles without errors 99.7% of the time"
[17:22:00] <_joe_> andrewbogott: not sure I follow
[17:22:53] <_joe_> compiling a catalog is deterministic, do you mean some history of changes?
[17:23:08] <andrewbogott> _joe_: We ran into an issue that seems to be an occasional race with applying puppet config, I was just making a crack that there's no way the pcc could have predicted it.
[17:23:26] <_joe_> andrewbogott: pcc does nothing for applying the config, it just compiles the catalog
[17:23:32] <andrewbogott> exactly :)
[17:23:35] <_joe_> so other things you won't know are dependency cycles
[17:23:55] <_joe_> ahh ok I thought it was a feature request :P
[17:24:35] <andrewbogott> Nope, we could definitely not have tested for this ahead of time :D
[17:30:43] <inflatador> elukey sorry to ghost, minor failure emergency which is over now ;) . both hosts' reimages stalled/failed at `The puppet server has no CSR for ${fqdn}`
[17:30:54] <inflatador> err...family emergency that is
[17:31:06] <inflatador> anyway, I can reimage cloudelastic1012 and let you know how it goes
[19:30:27] <elukey> inflatador: np! Yeah the failure in CSR is a sign of double debian install, same issue that we are seeing on other supermicro nodes sigh
[19:30:39] <elukey> I'll reimage 1011 tomorrow as well to check!
[19:32:07] <elukey> all info in https://phabricator.wikimedia.org/T381919