[08:10:44] cdanis: data has already finished replicating afaics, although yes, deleting partitions no longer present on the decom hosts takes longer
[09:49:46] looks like tomorrow's pad isn't there yet, creating it
[09:52:28] 01/05 and 08/05 that is
[10:55:43] godog: AFAIK tomorrow's meeting has been cancelled
[10:55:50] it's a bank holiday for the majority of us
[11:02:30] is it
[11:02:36] i live in the wrong country it seems
[11:03:42] or you're just in the wrong team :-P
[11:04:13] that's certain
[11:06:06] but we are so lovely
[11:46:10] godog: ah, are you using the dispersion numbers to assert the former? I'm still unsure whether or not I trust those haha
[12:05:53] volans: ah yeah, I remember it was cancelled but I assumed we'd be doing it async, either way works
[12:06:56] I'm fine either way
[12:07:33] cdanis: partly dispersion yeah, but as you noticed it is only a partial measure; the other thing I was looking at is that disk utilization on non-decom hosts has stopped growing "fast"
[12:09:11] mmm
[12:09:29] does it make sense to crank up object-replicator allowed concurrency on hosts that are being decommed btw?
[12:16:20] I'm skeptical of Swift here, perhaps unfairly, but I've never been really comfortable with the lack of an actual report on which objects are known, replicated properly, misplaced, etc
[12:19:27] that's fair yeah; covering 100% of objects in our case we could report on them, but I think it'll take a long time. anyways, re: concurrency, yes, cranking that up will speed things up
[12:27:53] godog: so for Ie4a82eb09 I'm going to disable puppet on any host matching R:File = /srv/prometheus -- so bast[3002,4002,5001].wikimedia.org,labmon[1001-1002].eqiad.wmnet,prometheus[2003-2004].codfw.wmnet,prometheus[1003-1004].eqiad.wmnet -- and then run puppet on one or two, inspect the result, and then slowly run puppet on the others, SGTY?
[12:30:30] cdanis: sounds good to me! FWIW I use R:prometheus::server as a selector
[12:30:39] ah, I was trying to find the proper role one
[12:31:02] same set of hosts but much nicer, ty
[12:32:28] is prometheus::server a define? :D
[12:32:58] oh, it is
[12:33:50] cdanis: ah, you'll find bast3002 with puppet disabled, that's me doing the prometheus v2 migration, although feel free to re-enable
[12:34:15] re-enabling won't break anything for you?
[12:34:44] not atm, no
[13:01:41] bad news, the new values don't save us from OOMs
[13:08:00] sigh
[13:08:16] I think I figured out the problematic queries though
[13:11:00] it was with only 30 days of history this time as well; my change very well might have made things worse!
[13:18:31] ouch, yeah I see prometheus1004 still using ~all memory
[13:19:05] yeah it's using a ton but i think not so much that it will crash
[13:19:31] i'm reloading 1003 with the previous/default setting and will try it with 2 weeks of history :|
[13:19:39] once 1004 is back to normal-ish
[14:05:45] godog, herron: ready when you are. this is my proposed plan https://phabricator.wikimedia.org/P8459
[14:07:02] jbond42: ok, will be ready in a few
[14:07:14] kafaka?
[14:07:45] jbond42: enable-puppet needs the reason
[14:07:53] also you can just run run-puppet-agent --enable "reason"
[14:07:57] to have it all in one command
[14:08:13] and if you use multiple commands in cumin you need to specify the -m/--mode
[14:08:55] volans: ack, cheers
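A minimal sketch of the cumin workflow discussed above (the R:prometheus::server selector from 12:30 and volans' run-puppet-agent --enable tip from 14:07). The disable-puppet wrapper name, batch/sleep values, and reason strings are illustrative assumptions, not commands taken from the paste:

```
# disable puppet on the Prometheus servers via the resource selector godog
# suggested (assuming a disable-puppet wrapper paired with enable-puppet;
# the reason string is illustrative)
sudo cumin 'R:prometheus::server' 'disable-puppet "cdanis: rolling out Ie4a82eb09"'

# after checking the result on one or two hosts, re-enable and run puppet in
# one command, batched and spaced out with -b/-s for a slow rollout
sudo cumin -b 2 -s 60 'R:prometheus::server' 'run-puppet-agent --enable "cdanis: rolling out Ie4a82eb09"'
```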
[14:08:56] quick question, do we need to force the puppet run everywhere? seems a bit of overkill
[14:09:06] given that in 30m it will run anyway everywhere
[14:09:26] I would rather pick some hosts and enable it to test if all works
[14:09:47] and then just re-enable puppet everywhere and let the normal cycle do its course
[14:09:48] herr.on or god.og wanted the feature enabled slowly so we can monitor load. the -s and -b flags may need to be adjusted to make this a much slower rollout
[14:10:07] monitor load on kafka
[14:10:21] how slowly is slow? would we be leaving puppet disabled across the fleet for hours?
[14:10:32] yeah, we talked about that yesterday in the monitoring meeting, I asked if we needed to tweak kafka for the 1.3k clients
[14:10:35] (I'm also worried about overlapping puppet-disables)
[14:10:53] cdanis: first reason wins
[14:11:16] disable foo; disable bar keeps it disabled with foo, so when you re-enable with bar it's a noop
[14:11:25] jbond42: re: phaste, looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/505737 should be added to the list too
[14:11:53] I'm ok btw with just re-enabling puppet and it'll DTRT over 30m
[14:13:53] godog: you are right on the change, that's the only one that needs a slow rollout. the other two only affect cumin hosts and the url_downloaders (the only systems with firewall::logging)
[14:13:59] godog: I can't conclusively say anything either way about a difference in memory consumption running the 'bad' queries with max-samples changed vs not, so, I'm just going to proceed rolling it out
[14:14:02] i have updated the paste https://phabricator.wikimedia.org/P8459
[14:14:45] cdanis: ack, sounds good
[14:16:48] jbond42: last thing, line two is missing a double quote at the end, other than that LGTM
[14:17:51] godog: good catch, updated; will just wait for herron then will start
[14:18:09] let's list the hosts we want to test before re-enabling across the fleet
[14:19:11] we can start with something low profile like people1001
[14:19:59] indeed, also the "misc-nonprod" cumin alias
[14:20:51] and also in the puppet disable reason let's include a contact person
[14:21:27] or use spicerack :D
[14:21:55] * volans has mixed feelings about using spicerack from a REPL
[14:23:06] updated https://phabricator.wikimedia.org/P8459, let me know if you want more hosts tested
[14:25:02] jbond42: looks good to me
[14:25:26] godog: can you give the paste one more pass and i will start
[14:25:32] jbond42: sure, lgtm!
[14:25:44] great, starting
[14:28:40] jbond42: is it okay if I enable & run puppet on a small number of hosts (just five of them) to complete my prometheus rollout?
[14:29:15] cdanis: let me test on one host first and then you should be ok
[14:33:25] running on people1001 now
[14:37:45] this was a no-op on people1001 and has been applied to bast4002. herron or godog, can you check bast4002 and just make sure things look sane to you before progressing
[14:38:03] jbond42: sure
[14:38:48] cdanis: i think you are testing on the prometheus servers; if so my change should be a noop there, so you should be good to go. let me know if that's not the case
[14:39:12] jbond42: rgr!
[14:39:18] will watch the output
[14:39:26] thanks
[14:39:38] bast4002 was one of the ones on my list as well ;)
[14:39:45] did you see the prometheus change there?
[14:39:59] yeah, prometheus on bast4002 was restarted because of cdanis' change, other than that lgtm jbond42
[14:40:05] ahh yes i did, do you want the output in a paste?
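As an illustration of the "first reason wins" semantics volans describes at 14:10-14:11, assuming the enable-puppet/disable-puppet wrappers mentioned above (the reason strings are made up):

```
disable-puppet "prometheus rollout - cdanis"   # puppet disabled, this reason is recorded
disable-puppet "logging change - jbond42"      # no effect: the first reason is kept
enable-puppet "logging change - jbond42"       # noop: reason does not match, puppet stays disabled
enable-puppet "prometheus rollout - cdanis"    # matches the recorded reason, puppet is re-enabled
```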
[14:40:27] ack, great, thanks godog. i'll move forward with the other test cases
[14:42:56] no, that's fine, ty jbond
[14:47:51] ok, i have rolled out to all the test hosts, i'll wait 10 mins before re-enabling puppet just in case something creeps up
[14:48:15] i'm continuing with the few hosts i have left
[14:48:18] jbond42: nice, sounds good
[14:48:21] yes, that's fine
[14:57:54] ok, i'm going to re-enable puppet now
[15:28:21] jbond42: lgtm so far
[15:29:23] cool, it should be applied across the fleet by now, so that's good
[15:33:26] cdanis, godog: are all the "varnish backend child restarted" UNKNOWNs on Icinga expected?
[15:33:35] http://prometheus.svc.eqiad.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0)
[15:34:00] same for other checks too
[15:34:25] 95% of the ~70 unknowns have the same message as above
[15:34:37] could be the new query limit?
[15:34:42] volans: not expected, no, unless prometheus is in trouble
[15:34:59] looking
[15:36:32] cannot repro when I run an affected command by hand on icinga1001
[15:36:32] most of them are 1/3, so it seems they are transient failures
[15:36:40] compatible with rate-limiting
[15:36:46] that isn't a rate limit
[15:36:54] yeah sorry, bad wording :)
[15:36:59] it's the max number of tsdb points a given query can 'look at' once it starts executing
[15:37:11] mmmh
[15:37:21] then a query should either succeed or fail
[15:37:25] not transient, right?
[15:37:28] it shouldn't cause the same query on mostly the same data to intermittently fail, yeah
[15:38:03] next thing I can think of is mod_proxy on apache and its timeout
[15:39:11] but nothing in apache logs, so no
[15:39:44] apache on prom1003 did serve 503s
[15:39:57] Apr 30 15:31:42 prometheus1003 prometheus@ops[21443]: level=warn ts=2019-04-30T15:31:42.782646042Z caller=main.go:467 msg="Received SIGTERM, exiting gracefully..."
[15:40:27] ah yeah, the restart is likely it then
[15:40:58] did we just restart it?
[15:41:26] looks like puppet did, nine minutes ago
[15:41:37] so expected :D
[15:41:45] oh, that's a Puppet run undoing my temporary undo of the 10M max-samples limit
[15:41:55] right, I re-enabled puppet on the host but did not trigger a run
[15:42:41] is there a technique for propagating the 'liveness' of an apache-fronted service to LVS?
[15:43:04] that would avoid blips like these; the ops prometheus instances in the big DCs take ~2-3 minutes to restart
[15:52:29] not a standard technique afaik; what comes to mind is a url that 200s only if all prometheus instances on the host also 200, then pybal proxyfetch would DTRT
[18:02:17] godog: you still around?
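A rough sketch of the aggregate health URL godog describes at 15:52, so pybal's proxyfetch could depool a prometheus host while its instances restart. This is only a sketch under stated assumptions: the CGI approach, the instance ports, and the Prometheus v2 /-/healthy endpoint path are all illustrative, not an existing check:

```
#!/bin/bash
# Return HTTP 200 only if every local Prometheus instance answers its own
# health endpoint; otherwise return 503 so pybal/LVS depools the host.
# The port list is an illustrative placeholder for the per-host instances.
for port in 9900 9901; do
    if ! curl -sf --max-time 2 "http://localhost:${port}/-/healthy" >/dev/null; then
        printf 'Status: 503\r\nContent-Type: text/plain\r\n\r\ninstance on port %s unhealthy\n' "$port"
        exit 0
    fi
done
printf 'Status: 200\r\nContent-Type: text/plain\r\n\r\nall instances healthy\n'
```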