[07:57:42] it would be nice to add emoji reactions to phab comments, like in GitHub
[08:53:00] Anyone with prometheus experience around? We are having some issues on our instance, with an exploding (in size) WAL directory, any pointers are appreciated
[08:55:47] dcaro: I'm about to jump in a meeting, can help later though
[08:57:22] godog: that'd be appreciated :), it's kinda contained, but it's growing back (slower now)
[08:57:34] godog: let me know when you have some time
[08:59:16] XioNoX jbond42 moritzm akosiaris can you check https://phabricator.wikimedia.org/T276448#6993361 - thanks!
[10:31:29] dcaro: back, so the issue is a big WAL ?
[10:35:32] godog: hey, yep, increasing little by little (until it gets out of space, more info T279990)
[10:35:32] T279990: [tools] prometheus out of space - https://phabricator.wikimedia.org/T279990
[10:36:02] since I pinged you it has grown from ~1G to >3G
[10:41:23] dcaro: ack, yeah right off the bat my hunch is lots more metrics to ingest, could that be? e.g. a "spammy" service creating lots of labels/metrics ?
[10:43:11] maybe? how do I find out?
[10:45:24] dcaro: sth like 'topk(10, count by (__name__)({__name__=~".+"}))
[10:45:26] '
[10:47:57] I think it died xd
[10:49:52] yep, it crashed the prometheus server, trying with a shorter time span
[10:51:20] yeah not necessarily a time span to start with, i.e. the "console" tab not "graph"
[10:52:49] console gives me '3'
[10:56:50] mmh? just "3" ?
[11:00:57] godog: it's kinda dying again :/
[11:04:26] that number is wrong yes, it's for another metric (it just did not update the page)
[11:06:43] godog: I got to go, but I'll be back at it later, can you give me some things I can look at to not come empty handed next time? (the machine died, will take a bit to come up)
[11:08:09] dcaro: sure! I have to go shortly as well, other things to check: whether there are older/stale wal files that could be removed (with data loss of course)
[11:09:06] dcaro: and/or nuke the wal altogether
[11:10:53] the other step of course is to identify the "big" metrics
[11:10:55] gotta go
[11:10:55] I kinda did that already, it went down, but it's growing again
[11:11:13] ack, thanks!
[14:01:54] what was the channel for folks on clinic duty again? I have an email alias question
[14:02:01] re https://phabricator.wikimedia.org/T280026 that is
[14:02:22] godog: #wikimedia-clinic
[14:03:08] thank you rzl
[14:03:25] I'll add that to the wikitech clinic duty
[14:03:35] was it not on there?? thanks!
[14:04:11] lol, I thought it was there
[14:04:49] yeah I was surprised too
[14:05:18] now I'm second-guessing whether it was supposed to be Super Secret, but if it was I guess I've already leaked it :)
[14:06:04] that's ok, we'll move to #wikimedia-clinic-hunter2
[15:29:57] Hello SRE friends. Can someone +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/677676 please?
[15:47:12] oops, I missed that in my queue
[15:50:50] Thanks !
[15:53:23] Error: Facter: error while resolving custom fact "block_devices": A JSON text must at least contain two octets!
[15:53:51] dancy: done, and ran puppet on mwlog1001
[15:54:39] Gracias
[17:06:53] legoktm: the block_devices spam is an issue which has affected jessie instances for a few weeks, it'll only be fixed by removal of the remaining jessies :-)
[17:07:17] gotcha :D
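
Note on the Prometheus thread above: the cardinality check suggested at 10:45:24 can also be sent to the Prometheus HTTP API (`/api/v1/query`) rather than typed into the web UI. The sketch below is only an illustration of that idea, not what was run in the channel: the function names (`build_query_url`, `parse_series_counts`, `top_metrics`) and the base URL in the usage note are hypothetical. The query is the one quoted in the chat, and since it crashed the server there, it may be just as heavy when issued this way.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Query quoted in the chat: the ten metric names with the most series.
TOP_METRICS_QUERY = 'topk(10, count by (__name__)({__name__=~".+"}))'


def build_query_url(base_url, query=TOP_METRICS_QUERY):
    """Return the instant-query URL for a Prometheus server (hypothetical helper)."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


def parse_series_counts(payload):
    """Turn an instant-query response into (metric name, series count) pairs, largest first."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    rows = []
    for sample in payload["data"]["result"]:
        name = sample["metric"].get("__name__", "<no name>")
        # Each instant-vector sample value is [unix_timestamp, "value as string"].
        rows.append((name, int(float(sample["value"][1]))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def top_metrics(base_url, timeout=30):
    """Run the query against a live server; base_url is an assumption of this sketch."""
    with urlopen(build_query_url(base_url), timeout=timeout) as response:
        return parse_series_counts(json.load(response))
```

Usage would look like `top_metrics("http://localhost:9090")`, where the address is a placeholder for wherever the instance listens. Keeping the parsing separate from the HTTP call means the result handling can be checked against a saved response without touching an already struggling server.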