[07:58:21] I need very simple help with hiera, something probably stupid I am missing:
[07:58:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/521852#message-661efec4336533f5eaf8b64a4753d92198efd2a1
[08:04:10] jynus: side node, I recommend you use `lookup()` instead
[08:06:08] https://wikitech.wikimedia.org/wiki/Puppet_coding#Profiles says to use hiera
[08:07:55] true
[08:07:59] but also https://puppet.com/docs/puppet/5.2/hiera_use_function.html
[08:08:03] the wikitech docs are unfortunately outdated in that regard: https://phabricator.wikimedia.org/T220820
[08:11:08] thanks for that heads up, I had no idea
[08:13:04] side note*, also
[08:13:18] jynus: do you have the full PCC run, does it only miss $mysql_password or e.g. also $mysql_host?
[08:13:28] only the password
[08:14:20] and with lookup now the host fails
[08:14:49] jynus: is that key in labs/private.git?
[08:14:54] yes
[08:15:10] and what's in labs/private.git looks entirely correct
[08:15:27] in any case, I had the same issue on production
[08:15:39] it got the password as the empty string
[08:16:08] this is the original failure https://puppet-compiler.wmflabs.org/compiler1002/17312/prometheus1004.eqiad.wmnet/change.prometheus1004.eqiad.wmnet.err
[08:16:56] and this is the failure if lookup is used instead: https://puppet-compiler.wmflabs.org/compiler1001/17313/prometheus1003.eqiad.wmnet/change.prometheus1003.eqiad.wmnet.err
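(Editor's note: a minimal sketch of the lookup()-based profile pattern being recommended above. The class and hiera key names here are illustrative, not the actual code from the Gerrit change; only the lookup() call itself reflects the documented Puppet 5 API.)

```puppet
# Illustrative profile: class and key names are made up for the example.
# Each parameter default is an explicit lookup(), the supported replacement
# for the deprecated hiera() function. The keys resolve through the same
# hierarchy, so for PCC and cloud compiles they also need matching entries
# in labs/private.git.
class profile::example_exporter (
    String $mysql_host     = lookup('profile::example_exporter::mysql_host'),
    String $mysql_password = lookup('profile::example_exporter::mysql_password'),
) {
    file { '/etc/example-exporter/my.cnf':
        ensure    => present,
        mode      => '0400',
        show_diff => false,
        content   => "[client]\nhost=${mysql_host}\npassword=${mysql_password}\n",
    }
}
```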
[09:26:51] FYI, I
[09:27:10] FYI, I'll disable puppet fleet-wide in about 5 mins for reboots of the puppetdb hosts
[09:33:50] thanks, please announce again when it's re-enabled, as there is something I need to do (I am not in a hurry, take as much time as you want)
[09:35:38] ack, will do
[09:59:19] jynus: puppet is back on
[09:59:27] thanks!
[12:00:04] I've added a note to that section https://wikitech.wikimedia.org/wiki/Puppet_coding#Profiles about using lookup() instead of hiera() but have not updated the docs proper
[12:35:13] ... there is both https://wikitech.wikimedia.org/wiki/Puppet_coding/Testing and https://wikitech.wikimedia.org/wiki/Puppet_coding/testing and they are different pages 😱
[12:39:58] cdanis: and that they're both outdated, I suppose
[12:40:43] ema: one of them is both marked as a work-in-progress draft AND likely outdated
[20:29:06] herron, are you around? I'm seeing a class conflict on the cloud mx servers — iirc you built them initially
[20:29:18] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Exim4] is already declared in file /etc/puppet/modules/standard/manifests/mail/sender.pp:2; cannot redeclare at /etc/puppet/modules/profile/manifests/mail/smarthost.pp:39 at
[20:29:18] /etc/puppet/modules/profile/manifests/mail/smarthost.pp:39:5 on node mx-out01.cloudinfra.eqiad.wmflabs
[20:29:51] andrewbogott: hey, hmm
[20:29:54] looking
[20:30:11] I'm digging through the git log trying to see what changed… Exim4 has been defined in both those places for ages
[20:30:41] ed2d0e202980d50b5fbc62505f081a7580b7fd9b was merged right about when it broke but it's not clear to me that that's the culprit
[20:32:51] hm, I bet it's 94882711a45769126d2cf54f55f9bf7161865993
[20:37:41] going to see if adding 'profile::standard::has_default_mail_relay: false' to the instance hiera solves it
[20:38:12] yes, i think that is it, i was about to say that
[20:38:17] herron: yeah, I bet it's just that
[20:38:17] that is if they are full mailservers
[20:38:20] I just found that too :)
[20:38:22] and not satellite systems
[20:38:24] yep that did the trick! now to see if the flood of updates breaks something new
[20:38:27] you switch between those with hiera
[20:39:49] ok, so it's just that the switch was renamed, simple enough
[20:39:50] thank you
[20:40:11] yep seems like, it had standard::has_default_mail_relay: false in place which stopped having an effect
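(Editor's note: a rough sketch of how a hiera switch like this typically works, to show why the rename mattered; this is only the shape of the pattern, not the actual profile::standard code. With the flag true the profile pulls in the default relay client, which declares Class['exim4'], so a host that also applies profile::mail::smarthost, which declares exim4 itself, hits the duplicate declaration above unless the flag is set to false.)

```puppet
# Sketch only, not the real class. After the switch moved from
# 'standard::has_default_mail_relay' to 'profile::standard::has_default_mail_relay',
# hiera entries still using the old key name silently stop having any effect.
class profile::standard (
    Boolean $has_default_mail_relay = lookup('profile::standard::has_default_mail_relay', Boolean, 'first', true),
) {
    if $has_default_mail_relay {
        # Declares Class['exim4'] as a satellite/relay client. Full mailservers
        # (e.g. hosts applying profile::mail::smarthost, which also declares
        # exim4) must set the flag to false to avoid a duplicate declaration.
        include standard::mail::sender
    }
}
```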
[20:47:43] i tried to use wmf-auto-reimage and it failed waiting for the initial puppet run. wait_puppet_run: Timeout reached. the cumin.out log file shows "(1/1) of nodes failed to execute command 'source /usr/loca...PUPPET_SUMMARY}"':" and then it aborts because of 0% success ratio running the command on nodes
[20:49:36] the puppet run did not get to the part where it adds ssh users, but it also seems to be past the part where install-console works, so i can't get on the host
[20:49:59] will try to just repeat the same thing again first
[20:50:52] it's cumin1001:/var/log/wmf-auto-reimage/201907102053_dzahn_114135_restbase1017_eqiad_wmnet*
[20:56:01] Does anyone know what exactly constitutes failure for a systemd timer? I have a job that returns '1' but systemd just says that it's 'active' and offers no complaint
[20:59:03] A likely clue is that it shows the time for the next run as 'n/a'
[21:12:56] andrewbogott: i found systemctl list-timers --all has a column "PASSED" and that says it is "since the timer ran" but i find that still ambiguous, could mean "it tried to run" or "it finished". and others point out the command could still be running
[21:13:45] so i'm not sure if there is a clear "failed" state for them. but one could grep the output of journalctl for "failed"
[21:13:58] journalctl --unit=$name should have something?
[21:29:49] it looks like maybe the timer isn't active, which is surely part of the problem
[21:33:15] n/a n/a Mon 2019-06-03 15:10:04 UTC 1 months 7 days ago designate_floating_ip_ptr_records_updater.timer designate_floating_ip_ptr_records_updater.service
[21:33:24] So I guess it's not even trying...
[21:34:15] cdanis: in both cases journalctl just says '-- No entries --'
[21:34:31] yeah, that definitely sounds like it's not active and has never been run
[21:35:22] except it shows up in list-timers
[22:16:04] cdanis: ok, we figured out the issue (in #wikimedia-cloud-admin). The script launched by the timer hung forever, causing systemd to never reschedule. The question now is if we can configure a timer to notice that and regard it as a failure if it times out waiting for a return code
[22:17:31] ahhh
[22:39:59] JobTimeoutSec=, JobRunningTimeoutSec=
[22:40:02] when i asked if it looks at return codes i was told "what if it's still running". but this ^
[22:40:49] and then there is "write another timer that is scheduled to run at the shutdown times you would like. They can run a service which stops the one you want to stop, by running ExecStart=/bin/systemctl stop other.service in the service file called by your shutdown timer"
[22:42:22] but if it times out that also does not mean it goes into "failed" state. "When a job for this unit is queued, a timeout JobTimeoutSec= may be configured. Similarly, JobRunningTimeoutSec= starts counting when the queued job is actually started. If either time limit is reached, the job will be cancelled, the unit however will not change state or even enter the "failed" mode."
[22:43:29] ah, you can kill it with "JobTimeoutAction= optionally configures an additional action to take when the timeout is hit"
[22:44:49] so you could have a script that does both, kill the process and tell monitoring about it and have that as your TimeoutAction command
[22:53:17] mutante: that's promising!
[22:56:53] andrewbogott: yep. there is also FailureAction= but that does not apply to timers then because of the above
[23:05:31] repeating my wmf-auto-reimage attempt. install was fast but it failed to detect the puppet run was finished. it is at "Still waiting for Puppet after 70.0 minutes" but actually puppet wasn't running. this time i was able to ssh to the machine and just use it though. that is different from last time
[23:06:01] ran puppet again manually and it just took 16 seconds
[23:23:03] that's very strange mutante
[23:23:49] yep, it's done but the script does not notice. "Still waiting for Puppet after 90.0 minutes", even after manual runs
[23:24:12] but i can use it as if it worked. so there's that. yesterday i couldn't
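(Editor's note, circling back to the timer question from earlier in the evening: as quoted above, JobTimeoutSec=/JobRunningTimeoutSec= only cancel the queued job and never move the unit into the "failed" state. One alternative, an assumption on the editor's part rather than anything settled in the channel, is to bound the run of the oneshot service itself: for Type=oneshot units, TimeoutStartSec= covers the whole ExecStart run, and if it is exceeded systemd kills the process and marks the unit failed, which monitoring can then pick up. A minimal Puppet-managed drop-in sketch, with the unit name taken from the list-timers paste above and the timeout value as a placeholder:)

```puppet
# Sketch: bound the runtime of the oneshot service behind the timer with a
# systemd drop-in. If the script hangs past the limit, systemd kills it and
# the unit ends up in the 'failed' state, unlike the JobTimeoutSec= family,
# which only cancels the queued job without changing unit state.
$unit = 'designate_floating_ip_ptr_records_updater'

file { "/etc/systemd/system/${unit}.service.d":
    ensure => directory,
}

file { "/etc/systemd/system/${unit}.service.d/timeout.conf":
    ensure  => present,
    content => "[Service]\nTimeoutStartSec=30min\n",  # placeholder limit
    require => File["/etc/systemd/system/${unit}.service.d"],
    notify  => Exec['systemd-daemon-reload'],
}

exec { 'systemd-daemon-reload':
    command     => '/bin/systemctl daemon-reload',
    refreshonly => true,
}
```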