[06:05:38] arturo dcaro let me know when you are around and available to do https://phabricator.wikimedia.org/T279657
[08:04:46] marostegui: /me here
[08:07:12] I'll downtime labstore1004 and stop puppet, while arturo joins
[08:07:25] dcaro: excellent
[08:07:30] Let me know when I can proceed then :)
[08:30:05] marostegui: I'm around
[08:30:21] Can I go then?
[08:30:22] sorry I'm a bit late
[08:30:31] yes, go marostegui
[08:30:38] ok going for it!
[08:30:52] thanks dcaro !
[08:31:35] all done
[08:32:03] that was fast marostegui :-)
[08:33:05] :)
[08:33:43] I can log in on wikitech just fine
[08:34:00] Edits also work
[08:34:55] dcaro arturo everything looks good from my side
[08:35:47] 👍
[08:39:35] ack
[08:43:05] thanks marostegui !!
[08:43:13] thank you guys :)
[09:46:39] godog: hey, when you have a moment, can you help me try to troubleshoot the tools-prometheus-03 space usage? we have ~2 days until it fills up the space again xd
[09:56:27] dcaro: sure, were you able to run the query to get high cardinality metrics ?
[09:58:39] yep, I'll try to rerun it again though, for fresher data
[10:01:01] there were ~6 with ~140k, all nginx_ingress_controller_*, there's node_systemd_unit_state with a similar value, then kube_resourcequota with ~60k, and the next is ~30k already
[10:01:09] (no idea if that's a lot or not)
[10:02:54] btw. I got this dashboard with the space left (and days before running out of it), and the expected WAL size according to https://devops.stackexchange.com/questions/9298/how-to-calculate-disk-space-required-by-prometheus-v2-2 (https://grafana-labs.wikimedia.org/goto/J6pJJyuMk)
[10:04:31] nice!
[10:04:47] I'm taking a look
[10:05:49] dcaro: looking at the free space graph it looks like space started significantly decreasing on april 5th, do you know what might have happened that day ?
[10:07:55] let me recheck
[10:10:10] my apologies, I think I have overloaded the server by loading the dashboard above for the last 15d
[10:14:00] np
[10:14:33] it's probably the top10k query though
[10:15:43] I don't see anything special that day :/, horizon was upgraded to wallaby on the secondary DC (very small deployment), and gitlab-test was resized, but not much either (afaict)
[10:21:46] dcaro: ack, yeah the best thing is probably to run the big metrics query (only instant, not over time) say april 5th morning and compare the results with say top cardinality today
[10:21:53] and see if there are obvious offenders
[10:23:42] ack, will try (when the server comes back xd)
[10:29:25] ok!
[10:33:02] is there a way to configure grafana/prometheus to just give up on a query if it takes too long? It seems that the server is getting overwhelmed and hanging instead of just aborting
[10:35:17] yeah IIRC you can configure limits on the prometheus side, to e.g. not run queries that would load more than X samples
[10:36:13] --query.timeout=2m --query.max-samples=50000000 specifically
[10:44:19] okok, got the query, the results are very similar, the top10 are the same, with slightly smaller numbers (<10% difference)
[10:45:52] ah! the other possibility then is "metric spam"
[10:46:27] for example if there is sth creating metrics with "ids" in them, e.g. metric_1, metric_2, etc
[10:47:26] gotta go now, it should be possible to check which metrics are being created/polled though
[10:47:31] hmm... there are some CORS error logs in the js console of the browser... will check later
[10:47:46] ok! going to lunch
[10:47:48] do you know where I can look it up?
[10:48:01] ack, bon app!
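A minimal sketch of the cardinality check discussed above, assuming a Prometheus API reachable at localhost:9090 (a placeholder endpoint; the topk query shape is a common cardinality pattern, not copied from the log, while the --query.* flags are the ones quoted at 10:36:13):

    # Instant query for the highest-cardinality metric names, the same kind
    # of check run above (host/port are placeholders for tools-prometheus):
    curl -sG 'http://localhost:9090/api/v1/query' \
         --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))'

    # Server-side safety limits mentioned at 10:36:13, so a heavy query
    # aborts instead of hanging the server:
    #   --query.timeout=2m --query.max-samples=50000000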
[11:02:57] godog: (for later) checking the 'total series' value, it's growing ~20k/week (currently ~1.83M), but there's no change in the rate on Apr 5 or around it, trying to get the list of metrics...
[11:26:43] the prometheus service crashed before (when it was unresponsive), and that flushed the WAL it seems, so it freed all the space, I'm guessing that has been the pattern so far
[12:00:04] dcaro: yeah for the list of metrics what I did was click the dropdown in the web UI, though I don't see any obvious spam there
[12:13:35] dcaro: I'm looking on the host, sth I think worth doing is removing the old wal segments from april 15th
[12:13:44] april 14th even
[12:15:32] dcaro: in addition to that perhaps limiting on disk space rather than time would work in this case
[12:55:31] dcaro: one other thing that might help is analyzing the blocks already written by prometheus with https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality
[13:22:44] ack, the shift from time to size I think would help yep
[13:31:23] I see no incidents listed in the SRE Monday update
[13:32:01] I don't think that we've had 0 incidents in the past month :)
[13:32:34] so please try to complete that list before today's meeting
[13:37:05] godog: about limiting on disk space, what I find is settings for the tsdb as a whole (I think, --storage.tsdb.retention.size, experimental), is that what you mean? (might work though xd, as we don't want the rest to grow either)
[14:30:11] dcaro: yeah that's right!
[14:37:58] ack, working on a patch, will add you as reviewer if you don't mind :)
[14:40:07] for sure! no problem
[17:01:26] godog, my question was because I recently did a "check --icinga" and "check --prometheus web service", and I would prefer to maintain only 1 asap :-)
[17:02:44] jynus: hehe yeah, largely depends on the use case at this point, probably the best bet is to use check_prometheus_metric in icinga, from there it is relatively straightforward to make a prometheus native alert
[17:04:19] gotta go
[17:04:31] thank you, filippo
[17:12:05] I wonder what disk space alerts look like, they aren't really about some service (mw parsoid etc), ... I think I need to know more about what we mean by services, more about when a puppet failure to run is noteworthy, and when we coalesce alerts because there's a zillion across all the servers and likely a single source of the problem, etc
[17:12:25] very glad to hear the new setup can be public though
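A sketch of the two follow-ups discussed above (block analysis and size-based retention). The data directory path and the size/time values are placeholders, and depending on the Prometheus version the block analyzer is the standalone tsdb tool from the linked article rather than promtool:

    # Inspect churn/cardinality in the blocks already written to disk
    # (the data directory path is a placeholder):
    promtool tsdb analyze /srv/prometheus/tools/metrics

    # Cap the TSDB by disk usage in addition to time; the size flag was
    # still marked experimental at the time, and 300GB is only an example:
    prometheus --storage.tsdb.retention.size=300GB \
               --storage.tsdb.retention.time=90d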