[10:16:44] I am doing some maintenance on bacula databases - alerts may happen due to renaming of jobs (eg. 0 jobs found), you can ignore those
[10:17:07] as old jobs will not be detected, and the new ones haven't run yet
[10:23:47] It only shows a warning for now: No backups: 2 (dbprov2001, ...),
[12:02:31] fyi, icinga is full of alerts, I'm working on clearing the ones I can but could use a hand
[12:04:16] XioNoX: critical or warning?
[12:04:28] volans: all the colors
[12:07:32] boron is for the docker-reporter-k8s-images.service
[12:07:40] it failed to update
[12:07:41] docker-registry.wikimedia.org/wikimedia/mediawiki-services-chromium-render:2020-02-05-224151-production[FAIL]
[12:10:04] I'm looking at the netbox reports
[12:15:33] XioNoX: aye I'll take a look today too
[12:16:27] <_joe_> the report on boron... the docker daemon having issues?
[12:17:10] _joe_: apparently it failed to run for that image
[12:17:16] not sure, didn't dig yet
[12:17:22] <_joe_> yeah the logs should have more info
[14:29:26] really sorry for the post mortem meeting people, I forgot to decline since it overlaps with another one
[14:29:31] just realized in the calendar
[14:29:32] np elukey
[14:29:33] :(
[14:29:47] cdanis: I closed my incident reports, didn't update the gdoc
[15:48:05] cdanis: one thing that I wanted to ask during the meeting if I was there would have been - who manages https://phabricator.wikimedia.org/project/board/2143/ ?
[15:48:21] I believe greg-g
[15:48:45] yeah, ish :)
[15:49:13] yeah I guessed that it was a best effort shared among people :)
[15:49:36] I am asking since the tasks that I checked today, linked in an incident report, were months old
[15:49:43] and no sign of action was taken
[15:49:55] it's not consistently used for incident actionables, despite the note in the document template to do so
[15:50:53] I don't want to blame the author of course :) I only wondered about a more proactive "we do follow up on problems to avoid their recurrence in the future"
[15:51:13] otherwise the whole incident report thing that we do is only half useful
[15:52:55] I've struggled with that a lot myself
[15:53:05] in a few different ways --
[15:53:44] there's followups that don't happen, there's bigger-picture things (say, planning to redesign a component) we could do someday that we don't write down
[15:54:03] there's the issue of potentially spending more time tracking things and paper-pushing than it's worth
[15:54:47] yes you are right
[15:54:58] I have similar feelings
[15:55:00] I have many more questions than answers here :)
[15:55:25] and even spending more time following up on actionables than it's worth! a lot of actionables are "everyone agrees this would prevent recurrence, but it's a bigger project than we have time for, and we'd rather accept the risk"
[15:56:01] sure, if that's the case we could simply close the task and mark it as "Declined"
[15:56:21] but there might be some easy ones that can be done if we push the authors :D
[15:56:37] like updating docs, creating an alarm, etc.
[15:57:02] this is why it would be great to have a sort of clinic duty for that workboard
[15:57:07] best effort of course
[15:57:19] am I crazy?
[15:57:33] no, I don't think so
[15:57:39] and there's been other things where...
[15:57:56] e.g. I went back and made https://wikitech.wikimedia.org/wiki/Category:Maps_outages,_2018_and_later
[15:58:22] to demonstrate "there are deeper, ongoing issues with this service; there's a cost to not staffing it"
[15:59:33] yeah that sounds smart -- IIRC one of the benefits of the incident review group is supposed to be "someone has enough persistent context to notice when we have the same incident over and over"
[16:00:24] (in a meeting still, will read backlog)
[16:01:29] take your time, we're about to start one ;)
[17:58:52] in netbox, it would be a nice touch if the custom field "procurement ticket" could contain a link to click on. currently it's just plain text like "RT #8786"
[18:00:03] kind of a combo of the device links and custom fields
[18:03:07] Oh
[18:03:16] look at the top of the device view
[18:03:19] at the right
[18:03:36] chaomodus: oooh! thank you :) had not noticed
[18:03:42] that works
[18:23:39] yeah it's not possible to do it in-place unfortunately
[18:24:09] kudos to volans for the extra links
[18:25:19] :)
[18:52:46] cdanis: elukey rlazarus: agreed with the above, I should (will try to) clean up that workboard this week/next week (my half day is pretty filled with meetings, and I have new kid duties in the afternoon now). But really it's self-explanatory how to sort/triage that board. I appreciate the one list of follow-ups from incidents as it at least gives a clue to triagers/prioritizers when looking at their own backlogs: "huh, this is tagged as an incident follow-up, that might raise the priority". And, I would also *love* to do more regular reviews of those follow-ups on a per-incident basis with the people involved.
[18:53:14] greg-g: in the middle of a possible outage rn but will get back to you later :)
[18:53:19] shit, sorry
[18:53:30] dang, mutante I had two different gerrit tabs open and merged the wrong one, reverting with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/580109/
[18:53:39] np, not a bad one
[19:00:20] andrewbogott: ok, np
[19:02:54] * greg-g is going afk for lunch, and then only coming back for a quick 5-10 minute sync with the perf team before their team meeting
[21:12:10] heya shdubsh yt?
[21:12:31] ottomata: hey, what's up
[21:12:33] trying to get grafana set up with the GC histogram stuff you changed for the nodejs gc metrics
[21:12:46] am not sure what was intended by the grafana node service template I started with
[21:13:08] working on this
[21:13:09] https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s?orgId=1&refresh=1m&from=1584306779395&to=1584393179395&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventstreams
[21:13:16] at the bottom there are 3 GC related graphs
[21:15:50] looking...
[21:17:51] ottomata: to confirm, the "Garbage Collections" graph is a count of GCs per $interval?
[21:17:56] i guess?
[21:18:07] i don't have a $interval, i just edited that one to get something there
[21:18:27] those 3 charts came from a pre-canned service-runner grafana template
[21:18:35] ah, ok
[21:18:41] iirc
[21:19:42] ah this one shdubsh https://grafana.wikimedia.org/d/stpmz_7Wz/template-dashboard?orgId=1&refresh=1m&from=1584389977398&to=1584393577398&var-dc=eqiad%20prometheus%2Fk8s&var-service=REPLACEME
[21:19:45] is what I copied it from
[21:24:31] heh, you just saved what I came up with, except irate vs rate
[21:24:43] oh sorry
[21:24:50] yeah am editing too sorry about that
[21:25:01] irate is better?
[21:26:00] iirc, irate compensates for resets of the counter
[21:27:30] i think maybe i have something for gc quantiles...
[21:28:01] eh, ^^ is not true. irate should only be used when graphing volatile, fast-moving counters. Use rate for alerts and slow-moving counters, as brief changes in the rate can reset the FOR clause, and graphs consisting entirely of rare spikes are hard to read.
[21:28:22] yeah, both rate and irate compensate for counter resets
[21:28:25] ok saved my gc quantiles change
[21:28:30] irate is more meaningful for anything bursty
[21:29:32] hm actually i think i need queries for each gc type / event
[21:31:03] ok saved again
[21:31:56] but for duration i'm not so sure
[21:31:59] am trying stuff
[21:32:07] ottomata: looking good!
[21:34:27] hmm doesn't seem right
[21:35:38] 0.5 microseconds
[21:35:38] ah
[21:35:42] shdubsh: what is the unit here?
[21:35:54] for the duration?
[21:36:14] OH it says it
[21:36:17] seconds :p
[21:36:23] just not in the histogram metrics in prometheus
[21:36:27] by default, the gc library uses nanoseconds. the pr did the conversion to seconds
[21:36:30] but on the client it is converted to seconds?
[21:36:31] ah k
[21:36:34] great
[21:42:12] ok shdubsh does what I have now make sense?
[21:43:32] looks good to me!
[21:45:40] thank you for the proofreading! :)
[21:48:39] np! glad to see it all come together :)
[21:55:18] shdubsh: why do we want sum(rate(...)) of the gc buckets?
[21:55:56] that gives us the total amount of time spent GCing? per bucket per second?
[21:55:59] ?
[21:56:22] if you remove the sum(), it will probably split on instance
[21:56:30] would avg make more sense
[21:56:34] or, avg and max?
[21:57:16] avg probably would make more sense in the buckets, unless you're looking for cumulative time spent across instances.
[21:57:57] ok i think so too, maybe avg and max are interesting as separate charts
[21:58:00] sum seems less interesting
[21:58:09] same with the total GCs per second count
[21:58:14] avg seems more interesting
[21:58:18] than summing them
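For reference, a minimal sketch of the kind of panel queries discussed above. It assumes the nodejs GC histogram is exposed to Prometheus as nodejs_gc_duration_seconds (with the usual _count and _bucket series) carrying a per-GC-type label named "gctype" and a "service" label; the real metric and label names depend on the prom-client / service-runner setup and may differ.

  # GCs per second, split by GC type and averaged across instances
  # (avg rather than sum, per the 21:57 discussion)
  avg by (gctype) (rate(nodejs_gc_duration_seconds_count{service="eventstreams"}[5m]))

  # p99 GC pause duration in seconds, per GC type, aggregated across instances;
  # the "le" label must be kept in the sum for histogram_quantile to work
  histogram_quantile(0.99, sum by (gctype, le) (rate(nodejs_gc_duration_seconds_bucket{service="eventstreams"}[5m])))

rate rather than irate is used here since these are dashboard/alerting-style queries over slow-moving counters; swapping in irate would only make sense for very bursty, fast-moving series, as quoted from the Prometheus docs above.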