[07:22:38] <moritzm>	 FYI, I'm disabling puppet on the install* servers for ~20-30 mins, testing something in d-i
[07:23:42] <jijiki>	 moritzm: ping me after this to merge the poolcounter change
[07:23:51] <jijiki>	 poolcounters*
[07:24:42] <moritzm>	 ack!
[08:16:26] <moritzm>	 puppet is re-enabled on the install* servers
[09:19:48] <moritzm>	 headsup: Rebooting grafana1001 in a few minutes, should be quick since it's on Ganeti
[10:12:41] <moritzm>	 I'll reboot icinga1001 in ~ 5mins unless there are any objections
[11:01:09] <jijiki>	 objection!
[11:01:16] <jijiki>	 joking
[11:02:12] <apergos>	 good because that was a lot longer than 5 minutes!
[11:08:01] <jijiki>	 haha, I know :p
[11:10:26] <arturo>	 :-P
[11:23:53] <vgutierrez>	 moritzm: are you rearming keyholder in icinga1001?
[11:26:21] <moritzm>	 ah, right, forgot about that, doing it now
[11:26:34] <vgutierrez>	 I took the liberty of doing it myself
[11:27:12] <moritzm>	 ack, thx
[15:22:05] <shdubsh>	 I'm running into a problem with the PCC: [ 2019-07-17T15:15:19 ] ERROR: Unable to find facts for host cp1075.eqiad.wmnet, skipping
[15:22:30] <shdubsh>	 following: https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs#FAQ I don't have access to the puppet-diffs project
[15:22:42] <shdubsh>	 anyone can help?
[15:23:21] <bblack>	 cp1075 isn't new, that's odd
[15:25:02] <bblack>	 shdubsh: I'm assuming you mean https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/17434/console and the other one right after it
[15:25:18] <shdubsh>	 yes
[15:25:26] <bblack>	 error: prune died of signal 15
[15:25:27] <bblack>	 error: failed to run prune
[15:25:41] <bblack>	 something going wrong with the basic git operations, maybe disk/network issues for the PCC runners?
[15:26:03] <bblack>	 --
[15:26:04] <bblack>	 error: The last gc run reported the following. Please correct the root cause
[15:26:07] <bblack>	 and remove /var/lib/catalog-differ/production/.git/gc.log.
[15:26:10] <bblack>	 Automatic cleanup will not be performed until the file is removed.
[15:26:36] <bblack>	 I don't know much about where to go hacking on that though
[15:26:44] <bblack>	 compiler1001.puppet-diffs.eqiad.wmflabs ?
[15:26:57] <cdanis>	 shdubsh: added you to the puppet-diffs project in horizon
[15:27:23] <shdubsh>	 thanks!
[15:27:41] <cdanis>	 (should probably be done as part of SRE onboarding imo)
[15:28:17] <shdubsh>	 first sign of the prune issue: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17425/console
[15:30:48] <bblack>	 try manual git gc there?
[15:31:59] <bblack>	 or just remove the gc.log file, I donno
[15:32:12] <bblack>	 maybe it was a one-off issue and it's just stuck on the existence of the logfile of the signal termination
[16:08:01] <shdubsh>	 it looks like compiler1002 is ok, compiler1001 has the issue (the failing job succeeds on 1002)
[17:27:29] <jbond42>	 i removed the log file and it seems to have worked https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17443/console https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17444/console
[17:33:39] <cdanis>	 mutante: btw, I wanted to point out to you https://gerrit.wikimedia.org/r/c/operations/puppet/+/523248/10/modules/nrpe/manifests/monitor_service.pp
[17:33:51] <cdanis>	 and also mention that I've since had the thought that... maybe this functionality should just be part of monitoring::service
[17:37:58] <mutante>	 cdanis: ah! adding a default is a good idea. but also it's so close to done now there are only 3 or 4 left before i can make the parameter mandatory
[17:38:19] <cdanis>	 mutante: yeah, it is needed to make the URL mushing-together work properly
[17:38:33] <cdanis>	 but I think monitoring::service should support $dashboard_links and know how to DTRT
[17:40:44] <mutante>	 Sure, i mean, historically just actual grafana checks had grafana links but if there are good dashboards for general Icinga checks, why not
[17:42:02] <mutante>	 $dashboard_link would stay optional though, right. while $notes_url becomes required.
[17:46:54] <cdanis>	 +1
[17:47:07] <cdanis>	 we support up to 4 dashboard links btw, which I think is useful
[17:47:21] <cdanis>	 but for instance -- the patch I linked also adds a link to the grafana host console for disk space NRPEs
[17:47:32] <cdanis>	 which is, like, one of those obvious-in-retrospect thing
[17:47:38] <cdanis>	 s/$/s/
[17:48:05] <mutante>	 heh, yea, that's true
[17:50:32] <mutante>	 i had a very few cases where i was looking for a notes_url and didn't have anything on wikitech but a Phabricator link to a ticket or workboard (where do i report this issue / who can i ping for it) seemed not a bad option. so maybe you even have cases for dashboard_link to a ticket or a dashboard of tickets. from there there are links to 'members'
[17:51:01] <mutante>	 and if you then use the office wiki contact page to translate nick names you get to phone numbers, heh
[17:53:59] <cdanis>	 heh heh
[17:54:23] <cdanis>	 we need to look in the service catalog for who is responsible for finishing the service catalog
[17:54:47] <mutante>	 well. this was actually a thing i meant to do right now
[17:54:52] <mutante>	 wiki pages for services
[17:55:00] <cdanis>	 yeah it's a good idea
[17:55:02] <mutante>	 i kind of want to merge just the part of your change that adds the default
[17:55:29] <mutante>	 and you are right.. then add the parameter to monitoring::service instead of having to build it here
[17:56:06] <cdanis>	 sadly i made doing this a bit more work by merging the change to nrpe::monitor_service before having the thought it should just be more generic
[17:57:23] <mutante>	 ah, i see. it's merged. well that is also nice. it means there is the default now, so i can rebase/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830
[17:57:32] <mutante>	 and then get back to the wiki thing
[17:57:57] <mutante>	 instead of wondering about the 4 remaining missing links that is
[17:59:04] <mutante>	 or just 2 ... modules/profile/manifests/proxysql.pp  , modules/postgresql/manifests/slave/monitoring.pp , modules/nrpe/tests/monitor_service.pp , modules/nrpe/spec/defines/monitor_service_spec.rb
[17:59:16] <mutante>	 proxysql and postgres have no runbooks
[17:59:42] <mutante>	 and won't be easy to write one besides "make a ticket"
[22:20:45] <XioNoX>	 if I have service Y running on host X, and I make the service Y icinga check paging. Will it page if the host X goes down first?
[22:22:15] <XioNoX>	 Eg. I think https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/bird/anycast_monitoring.pp#L15 should page, but I also want it to page if the host check (line 11) fails too. Should I make both paging?
[22:23:43] <mutante>	 XioNoX: yes, use contact_group => 'sms' on both
[22:24:03] <XioNoX>	 mutante: critical => true actually
[22:24:03] <mutante>	 or critical => true
[22:24:14] <XioNoX>	 ok, thx!
[22:24:19] <mutante>	 yes, that adds that sms group too
[22:24:36] <XioNoX>	 we need to fix our paging :)
[22:24:52] <mutante>	 i like the change to stop paging for mgmt :)
[22:25:08] <mutante>	 for that we can add some that actually should page in return
[22:25:12] <XioNoX>	 haha for sure
[22:26:57] <XioNoX>	 bblack: let me know if you agree that the two recdns checks (VIP/service) on https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/bird/anycast_monitoring.pp should page
[22:27:23] <XioNoX>	 as it means there is no recdns server working anywhere
[22:27:33] <mutante>	 i think we also want 2 contacts each in Icinga. one that sends SMS and one that does not
[22:27:49] <mutante>	 and then you can do contactgroups that are "paging but only subteam X"
[22:28:15] <mutante>	 after all "make it page" is actually with contacts and their notification options and not really with the service or host
[22:28:21] <XioNoX>	 yeah, it should map the service owners
[22:28:39] <mutante>	 we never had more than one group but also SMS notifications
[22:28:41] <mutante>	 yep
[22:29:34] <mutante>	 owner (team) gets SMS, others get email/irc/web
[22:31:24] <XioNoX>	 maybe in scope for the observability team goal "Improve our alerting capabilities"