[07:22:38] FYI, I'm disabling puppet on the install* servers for ~20-30 mins, testing something in d-i
[07:23:42] moritzm: ping me after this to merge the poolcounter change
[07:23:51] poolcounters*
[07:24:42] ack!
[08:16:26] puppet is re-enabled on the install* servers
[09:19:48] headsup: Rebooting grafana1001 in a few minutes, should be quick since it's on Ganeti
[10:12:41] I'll reboot icinga1001 in ~ 5mins unless there are any objections
[11:01:09] objection!
[11:01:16] joking
[11:02:12] good because that was a lot longer than 5 minutes!
[11:08:01] haha, I know :p
[11:10:26] :-P
[11:23:53] moritzm: are you rearming keyholder in icinga1001?
[11:26:21] ah, right, forgot about, doing that now
[11:26:34] I took the liberty of doing it myself
[11:27:12] ack, thx
[15:22:05] I'm running into a problem with the PCC: [ 2019-07-17T15:15:19 ] ERROR: Unable to find facts for host cp1075.eqiad.wmnet, skipping
[15:22:30] following: https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs#FAQ I don't have access to the puppet-diffs project
[15:22:42] anyone can help?
[15:23:21] cp1075 isn't new, that's odd
[15:25:02] shdubsh: I'm assuming you mean https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/17434/console and the other one right after it
[15:25:18] yes
[15:25:26] error: prune died of signal 15
[15:25:27] error: failed to run prune
[15:25:41] something going wrong with the basic git operations, maybe disk/network issues for the PCC runners?
[15:26:03] --
[15:26:04] error: The last gc run reported the following. Please correct the root cause
[15:26:07] and remove /var/lib/catalog-differ/production/.git/gc.log.
[15:26:10] Automatic cleanup will not be performed until the file is removed.
[15:26:36] I don't know much about where to go hacking on that at though
[15:26:44] compiler1001.puppet-diffs.eqiad.wmflabs ?
[15:26:57] shdubsh: added you to the puppet-diffs project in horizon
[15:27:23] thanks!
[15:27:41] (should probably be done as part of SRE onboarding imo)
[15:28:17] first sign of the prune issue: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17425/console
[15:30:48] try manual git gc there?
[15:31:59] or just remove the gc.log file, I donno
[15:32:12] maybe it was a one-off issue and it's just stuck on the existence of the logfile of the signal termination
[16:08:01] it looks like compiler1002 is ok, compiler1001 has the issue (the failing job succeeds on 1002)
[17:27:29] i removed the log file and it seems to have worked https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17443/console https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17444/console
[17:33:39] mutante: btw, I wanted to point out to you https://gerrit.wikimedia.org/r/c/operations/puppet/+/523248/10/modules/nrpe/manifests/monitor_service.pp
[17:33:51] and also mention that I've since had the thought that... maybe this functionality should just be part of monitoring::service
[17:37:58] cdanis: ah! adding a default is a good idea. but also it's so close to done now there are only 3 or 4 left before i can make the parameter mandatory
[17:38:19] mutante: yeah, it is needed to make the URL mushing-together work properly
[17:38:33] but I think monitoring::service should support $dashboard_links and know how to DTRT
[17:40:44] Sure, i mean, historically just actual grafana checks had grafana links but if there are good dashboards for general Icinga checks, why not
[17:42:02] $dashboard_link would stay optional though, right. while $notes_url becomes required.
[17:46:54] +1
[17:47:07] we support up to 4 dashboard links btw, which I think is useful
[17:47:21] but for instance -- the patch I linked also adds a link to the grafana host console for disk space NRPEs
[17:47:32] which is, like, one of those obvious-in-retrospect thing
[17:47:38] s/$/s/
[17:48:05] heh, yea, that's true
[17:50:32] i had a very few cases where i was looking for a notes_url and didn't have anything on wikitech but a Phabricator link to a ticket or workboard (where do i report this issue / who can i ping for it) seemed not a bad option. so maybe you even have cases for dashboard_link to a ticket or a dashboard of tickets. from there there are links to 'members'
[17:51:01] and if you then use the office wiki contact page to translate nick names you get to phone numbers, heh
[17:53:59] heh heh
[17:54:23] we need to look in the service catalog for who is responsible for finishing the service catalog
[17:54:47] well. this was actually a thing i meant to do right now
[17:54:52] wiki pages for services
[17:55:00] yeah it's a good idea
[17:55:02] i kind of want to merge just the part of your change that adds the default
[17:55:29] and you are right..then add the parameter to monitoring::service instead of having to build it here
[17:56:06] sadly i made doing this a bit more work by merging the change to nrpe::monitor_service before having the thought it should just be more generic
[17:57:23] ah, i see. it's merged. well that is also nice. it means there is the default now, so i can rebase/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830
[17:57:32] and then get back to the wiki thing
[17:57:57] instead of wondering about the 4 remaining missing links that is
[17:59:04] or just 2 ... modules/profile/manifests/proxysql.pp , modules/postgresql/manifests/slave/monitoring.pp , modules/nrpe/tests/monitor_service.pp , modules/nrpe/spec/defines/monitor_service_spec.rb
[17:59:16] proxysql and postgres have no runbooks
[17:59:42] and won't be easy to write one besides "make a ticket"
[22:20:45] if I have service Y running on host X, and I make the service Y icinga check paging. Will it page if the host X goes down first?
[22:22:15] Eg. I think https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/bird/anycast_monitoring.pp#L15 should page, but I also want it to page if the host check (line 11) fails too. Should I make both paging?
[22:23:43] XioNoX: yes, use contact_group => 'sms' on both
[22:24:03] mutante: critical => true actually
[22:24:03] or critical => true
[22:24:14] ok, thx!
[22:24:19] yes, that adds that sms group too
[22:24:36] we need to fix our paging :)
[22:24:52] i like the change to stop paging for mgmt :)
[22:25:08] for that we can add some that actually should page in return
[22:25:12] haha for sure
[22:26:57] bblack: let me know if you agree that the two recdns checks (VIP/service) on https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/bird/anycast_monitoring.pp should page
[22:27:23] as it mean there is no recdns server working anywhere
[22:27:33] i think we also want 2 contacts each in Icinga. one that sends SMS and one that does not
[22:27:49] and then you can do contactgroups that are "paging but only subteam X"
[22:28:15] after all "make it page" is actually with contacts and their notification options and not really with the service or host
[22:28:21] yeah, it should map the service owners
[22:28:39] we never had more than one group but also SMS notifications
[22:28:41] yep
[22:29:34] owner (team) gets SMS, others get email/irc/web
[22:31:24] maybe in scope for the observability team goal "Improve our alerting capabilities"
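
Note on the 22:20-22:24 paging exchange: the advice given was to set critical => true on both the host check and the service check, since that also adds the sms contact group. A minimal Puppet sketch of what that might look like follows. It is not copied from anycast_monitoring.pp: the resource titles, the address, the check command and the runbook URL are hypothetical placeholders, and it assumes monitoring::host accepts a critical parameter in the same way monitoring::service does.

```puppet
# Sketch only. All names and values below are placeholders, not the
# contents of modules/profile/manifests/bird/anycast_monitoring.pp.

# Host check for the anycast VIP. Marking it critical is meant to make
# a host-down event page as well (assumption: monitoring::host supports
# the critical parameter).
monitoring::host { 'example-vip.anycast.wmnet':
    ip_address => '192.0.2.1',  # placeholder address (TEST-NET-1)
    critical   => true,
}

# Service check on the same VIP. Per the log, critical => true also
# adds the sms contact group, so contact_group => 'sms' is not needed.
monitoring::service { 'example-vip recursive DNS':
    host          => 'example-vip.anycast.wmnet',
    description   => 'Recursive DNS on anycast VIP',
    check_command => 'check_dns_example',  # hypothetical check command
    critical      => true,
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Example',  # placeholder runbook link
}
```

Marking both resources matters for the case raised in the question: if only the service check pages and the host goes down first, the page may not arrive through the service check alone.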