[06:05:38] arturo dcaro let me know when you are around and available to do https://phabricator.wikimedia.org/T279657
[08:04:46] marostegui: /me here
[08:07:12] I'll downtime labstore1004 and stop puppet, while arturo joins
[08:07:25] dcaro: excellent
[08:07:30] Let me know when I can proceed then :)
[08:30:05] marostegui: I'm around
[08:30:21] Can I go then?
[08:30:22] sorry I'm a bit late
[08:30:31] yes, go marostegui
[08:30:38] ok going for it!
[08:30:52] thanks dcaro !
[08:31:35] all done
[08:32:03] that was fast marostegui :-)
[08:33:05] :)
[08:33:43] I can log in on wikitech just fine
[08:34:00] Edits also work
[08:34:55] dcaro arturo everything looks good from my side
[08:35:47] 👍
[08:39:35] ack
[08:43:05] thanks marostegui !!
[08:43:13] thank you guys :)
[09:46:39] godog: hey, when you have a moment, can you help me try to troubleshoot the tools-prometheus-03 space usage? we have ~2 days until it fills up the space again xd
[09:56:27] dcaro: sure, were you able to run the query to get high cardinality metrics ?
[09:58:39] yep, I'll try to rerun it again though, for fresher data
[10:01:01] there were ~6 with ~140k, all nginx_ingress_controller_*, there's node_systemd_unit_state with a similar value, then kube_resourcequota with ~60k, and the next is ~30k already
[10:01:09] (no idea if that's a lot or not)
[10:02:54] btw. I got this dashboard with the space left (and days before running out of it), and the expected WAL size according to https://devops.stackexchange.com/questions/9298/how-to-calculate-disk-space-required-by-prometheus-v2-2 (https://grafana-labs.wikimedia.org/goto/J6pJJyuMk)
[10:04:31] nice!
[10:04:47] I'm taking a look
[10:05:49] dcaro: looking at the free space graph it looks like space started significantly decreasing on april 5th, do you know what might have happened that day ?
[10:07:55] let me recheck
[10:10:10] my apologies, I think I have overloaded the server by loading the dashboard above for the last 15d
[10:14:00] np
[10:14:33] it's probably the top10k query though
[10:15:43] I don't see anything special that day :/, horizon was upgraded to wallaby on the secondary DC (very small deployment), and gitlab-test was resized, but not much either (afaict)
[10:21:46] dcaro: ack, yeah the best thing is probably to run the big metrics query (only instant, not over time) say april 5th morning and compare the results with say top cardinality today
[10:21:53] and see if there are obvious offenders
[10:23:42] ack, will try (when the server comes back xd)
[10:29:25] ok!
[10:33:02] is there a way to configure grafana/prometheus to just give up on a query if it takes too long? It seems that the server is getting overwhelmed and hanging instead of just aborting
[10:35:17] yeah IIRC you can configure limits on the prometheus side, to e.g. not run queries that would load more than X samples
[10:36:13] --query.timeout=2m --query.max-samples=50000000 specifically
[10:44:19] okok, got the query, the results are very similar, the top10 are the same, with slightly smaller numbers (<10% difference)
[10:45:52] ah! the other possibility then is "metric spam"
[10:46:27] for example if there is sth creating metrics with "ids" in them, e.g. metric_1, metric_2, etc
[10:47:26] gotta go now, it should be possible to check which metrics are being created/polled though
[10:47:31] hmm... there are some CORS error logs in the js console of the browser... will check later
[10:47:46] ok! going to lunch
[10:47:48] do you know where I can look it up?
[10:48:01] ack, bon app!
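A minimal sketch of the cardinality check discussed above, assuming a Prometheus API reachable at localhost:9090 (a placeholder endpoint; the topk query shape is a common cardinality pattern, not copied from the log, while the --query.* flags are the ones quoted at 10:36:13):

    # Instant query for the highest-cardinality metric names, the same kind
    # of check run above (host/port are placeholders for tools-prometheus):
    curl -sG 'http://localhost:9090/api/v1/query' \
         --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))'

    # Server-side safety limits mentioned at 10:36:13, so a heavy query
    # aborts instead of hanging the server:
    #   --query.timeout=2m --query.max-samples=50000000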
[11:02:57] godog: (for later) checking the 'total series' value, it's growing ~20k/week (currently ~1.83M), but there's no change in the rate on Apr 5 or around it, trying to get the list of metrics...
[11:26:43] the prometheus service crashed before (when it was unresponsive), and that flushed the WAL it seems, so it freed all the space, I'm guessing that has been the pattern so far
[12:00:04] dcaro: yeah for the list of metrics what I did was click the dropdown in the web UI, though I don't see any obvious spam there
[12:13:35] dcaro: I'm looking on the host, sth I think worth doing is removing the old wal segments from april 15th
[12:13:44] april 14th even
[12:15:32] dcaro: in addition to that perhaps limiting on disk space rather than time would work in this case
[12:55:31] dcaro: one other thing that might help is analyzing the blocks already written by prometheus with https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality
[13:22:44] ack, the shift from time to size I think would help yep
[13:31:23] I see no incidents listed in the SRE Monday update
[13:32:01] I don't think that we've had 0 incidents in the past month :)
[13:32:34] so please try to complete that list before today's meeting
[13:37:05] godog: about limiting on disk space, what I find is settings for the tsdb as a whole (I think, --storage.tsdb.retention.size, experimental), is that what you mean? (might work though xd, as we don't want the rest to grow either)
[14:30:11] dcaro: yeah that's right!
[14:37:58] ack, working on a patch, will add you as reviewer if you don't mind :)
[14:40:07] for sure! no problem
[17:01:26] godog, my question was because I recently did a "check --icinga" and "check --prometheus web service", and I would prefer to maintain only 1 asap :-)
[17:02:44] jynus: hehe yeah, largely depends on the use case at this point, probably the best bet is to use check_prometheus_metric in icinga, from there it is relatively straightforward to make a prometheus native alert
[17:04:19] gotta go
[17:04:31] thank you, filippo
[17:12:05] I wonder what disk space alerts look like, they aren't really about some service (mw parsoid etc), ... I think I need to know more about what we mean by services, more about when a puppet failure to run is noteworthy, and when we coalesce alerts because there's a zillion across all the servers and likely a single source of the problem, etc
[17:12:25] very glad to hear the new setup can be public though
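A sketch of the two follow-ups discussed above (block analysis and size-based retention). The data directory path and the size/time values are placeholders, and depending on the Prometheus version the block analyzer is the standalone tsdb tool from the linked article rather than promtool:

    # Inspect churn/cardinality in the blocks already written to disk
    # (the data directory path is a placeholder):
    promtool tsdb analyze /srv/prometheus/tools/metrics

    # Cap the TSDB by disk usage in addition to time; the size flag was
    # still marked experimental at the time, and 300GB is only an example:
    prometheus --storage.tsdb.retention.size=300GB \
               --storage.tsdb.retention.time=90d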