[08:29:29] Anyone have an idea what would cause PCC runs on insetup hosts to throw "Unexpected error running run_host: Unable to find fact file for: ml-serve1009.eqiad.wmnet under directory /var/lib/catalog-differ/puppet"? The machine has had several successful puppet runs in the past. Other machines in the same batch (ml-serve1010, …11) work fine
[08:49:07] <_joe_> klausman: have you tried to manually update the facts? https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes
[08:49:29] looking
[08:49:56] <_joe_> it's possible the cron only ran before the first successful puppet run on that host
[08:50:30] ack, will try the manual upload
[08:57:16] Nope, that didn't help
[09:02:41] mmmmmh.
[09:02:51] $ ssh pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud sudo -u jenkins-deploy /usr/local/sbin/pcc_facts_processor
[09:02:53] Stdio forwarding request failed: Session open refused by peer
[09:02:55] Connection closed by UNKNOWN port 65535
[09:08:36] I have no host by that name in any of my known_hosts files
[09:10:54] klausman: if the change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1068657/2/manifests/site.pp I think that ml-serve1009 is not yet in the catalog, so the exporter is probably unaware of any fact related to it
[09:11:21] But run-puppet-agent works fine on it
[09:11:47] ah no wait sorry, ml-serve1009 already has insetup, didn't see it at first
[09:12:27] The docs about manually adding facts also seem outdated: I cannot find the pcc-db host mentioned in them
[09:12:39] likely yes
[09:13:42] What's also odd is that ml-serve1010/11 work fine, but the DSE worker also fails
[09:18:03] the right db host is 1002, I am running the command now
[09:18:09] ty!
[09:24:32] Fixed the wikitech page accordingly
[09:25:34] All other mentions of it on WT seem to be News pages and the SAL, so not touching those
[09:26:56] I already fixed the page :)
[09:29:22] Hum. I hit reload, and it still said 1001
[09:30:11] klausman: https://wikitech.wikimedia.org/w/index.php?title=Help%3APuppet-compiler&diff=2220177&oldid=2207285
[09:30:21] Ah, my fixes were in the prose and the cloud command
[09:35:11] Looks good now. Thanks for your help, Luca
[09:35:25] (and j_oe)
[09:36:22] klausman: the script is still running, looks good == pcc runs for ml-serve1009 now?
[09:37:35] yes
[09:37:45] ditto for the DSE host
[09:38:04] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3791/console is the output
[09:40:07] okok so the facts have been updated then
[13:09:05] folks I left some comments on https://phabricator.wikimedia.org/T373527#10106218 about some memory pressure scenarios that we are seeing on puppetserverXXXX hosts. If you want to read and chime in, please feel free (disclaimer: it involves JVM tuning)
[13:12:33] <_joe_> elukey: you were going so well until the disclaimer
[13:12:37] <_joe_> I almost went to look
[13:13:11] I generally find a flamethrower to be an effective JVM tuning tool :)
[13:13:32] preferably use it on the hardware the JVM was deployed on, just to make sure the infection doesn't spread!
[13:14:25] I figured that it wouldn't have been appealing, so I didn't trick people into checking :D
[13:15:21] jokes aside, I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1069185 as a possible bandaid
[13:16:07] it seems like a reasonable step to take, and fairly safe to try!
[13:16:27] not today of course! :D On Monday for sure
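A minimal sketch of the kind of heap bandaid being discussed, assuming the stock puppetserver packaging where the JVM flags live in /etc/default/puppetserver; the sizes are purely illustrative and not taken from the linked change, which isn't quoted in the log:

    # /etc/default/puppetserver (Debian-style; RHEL uses /etc/sysconfig/puppetserver)
    # Pinning -Xms equal to -Xmx makes the JVM reserve a fixed, predictable heap up front
    # instead of growing toward the host's memory limit under load.
    JAVA_ARGS="-Xms8g -Xmx8g"   # example sizes only; keep whatever other flags are already set

In production this would of course be managed through puppet rather than edited by hand, and the service needs a restart before new heap bounds take effect.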
[13:17:58] thanks for the feedback :)
[13:18:31] (of course there may be some JRuby horror that I am not aware of behind the memory usage etc., still haven't dug that deep yet)
[13:33:09] gehel ^^ any suggestions re: JVM tuning? We could/should run this by our Search Platform devs
[13:33:58] usually there is no short answer, it depends on the case/workload/etc.. (and we also have jruby on top, which could complicate things)
[13:34:27] I am always scared when JVM tuning is needed, since predicting exactly how the JVM will behave is very difficult
[13:34:39] in this case changing the min/max for the heap size seems easy enough
[13:35:00] (and given the importance, I'd buy extra ram for those nodes..)
[13:37:07] we could also potentially fan out to more puppetmaster frontend nodes or something, if needs be
[13:37:42] IIRC we fan out to 3x per core site via SRV today?
[13:37:53] or apply the flamethrower treatment to puppet ;)
[13:38:10] bblack: yeah
[13:47:37] bblack: so you mean updating the SRV records to include all puppetservers, regardless of the DC (like eqiad/codfw have 6 in total)
[13:48:34] no, I meant like adding more puppetservers in each DC
[13:48:48] ahh okok, new nodes, didn't get it sorry
[13:49:13] definitely yes, buying ram seemed easier outside capex etc..
[13:49:19] I assume eqiad has more load, so spreading it globally might actually smooth that a bit, but I'm not sure how much that extra x-dc latency would cost in terms of painfully-slower puppet runs and excess pointless x-dc traffic, etc
[13:49:54] yep yep exactly, same doubt that I had
[13:55:21] would it help to run puppet less often? Dunno how often we run it by default
[14:02:38] OnCalendar=*:7/30:00
[14:04:59] (which means every 30 minutes)
[14:05:46] inflatador: it doesn't seem to be related to too much traffic hitting the puppetservers, more about how much memory is required to run puppetserver on JRuby.. the traffic is not that high, the main issue is that we allocate a big heap and we get close to the host's limit (and in my opinion the hosts should have at least +32G of ram)
[14:08:22] elukey ACK, sounds like additional RAM is probably the best thing then
[14:10:01] maybe, not 100% sure, but I am pessimistic and maybe in the future we'll need to scale the JVM heap size up to handle more servers, and having extra room for expansion would surely be good
[14:13:46] I come from a cloud background..."throw more hw at the problem" is my preferred 1st step ;P
[14:23:50] my thinking on the "more servers" thing was that the mem needs of the puppetservers might be related to how many distinct clients they're handling and some per-client state
[14:24:17] but then again, I guess with the current 5m SRV record fanning, probably the clients hit random servers on each run, so eventually they'd still all see them all.
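As an aside, the OnCalendar spec quoted above can be sanity-checked with systemd itself; systemd-analyze ships with systemd, so this should work on any of these hosts:

    # Expand and validate the timer spec. *:7/30:00 means second 0 of minute 7, then every
    # 30 minutes after, i.e. :07:00 and :37:00 of every hour, so two agent runs per hour per host.
    systemd-analyze calendar "*:7/30:00"

The command also prints the next elapse time, which is handy when considering a longer run interval.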
[14:24:20] <_joe_> it's most probably more related to the average size of the puppet manifest
[14:24:52] <_joe_> which I suppose has increased a lot over the last few years
[14:24:57] yeah, assuming no leaks
[14:25:03] <_joe_> in the past I'd analyze and trim it regularly
[14:25:25] <_joe_> it's very easy to pass large data structures around without thinking about it
[14:25:51] I mean, truly-bad leaks would just grow very quickly over time and we'd have an obvious problem
[14:26:20] <_joe_> bblack: I'm not thinking of leaks, but rather of objects not garbage-collected that might be massive
[14:26:26] but if it's a more subtle leak that grows with the number of distinct clients over time (some per-client-host state that sticks around even though it probably shouldn't, but is refreshed next time they connect) that's different
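A minimal sketch for telling those two cases apart (steady old-gen growth vs. large objects that do get collected after each compilation burst), assuming the JDK tools jstat/jcmd are installed on the puppetserver hosts and that the service runs as the puppet user:

    # Find the puppetserver JVM (assumes 'puppetserver' appears on its command line)
    pid=$(pgrep -f puppetserver | head -n1)

    # Sample GC stats every 5s: the O column is old-gen occupancy in percent. A leak-like
    # pattern shows O climbing across samples; healthy behaviour drops back after full GCs.
    sudo -u puppet jstat -gcutil "$pid" 5000

    # See which classes dominate the live heap (may trigger a full GC and a brief pause),
    # useful for spotting the "massive objects not yet collected" case.
    sudo -u puppet jcmd "$pid" GC.class_histogram | head -n 20

Both commands only read JVM state (apart from that possible GC pause), so they should be reasonably safe to run against a production puppetserver.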