[08:10:44] cdanis: data has already finished replicating afaics, although yes, deleting partitions no longer present on the decom hosts takes longer
[09:49:46] looks like tomorrow's pad isn't there yet, creating it
[09:52:28] 01/05 and 08/05 that is
[10:55:43] godog: AFAIK tomorrow's meeting has been cancelled
[10:55:50] it's a bank holiday for the majority of us
[11:02:30] is it
[11:02:36] i live in the wrong country it seems
[11:03:42] or you're just in the wrong team :-P
[11:04:13] that's certain
[11:06:06] but we are so lovely
[11:46:10] godog: ah, are you using the dispersion numbers to assert the former? I'm still unsure whether or not I trust those haha
[12:05:53] volans: ah yeah, I remember it was cancelled but I assumed we'd be doing it async, either way works
[12:06:56] I'm fine either way
[12:07:33] cdanis: partly dispersion yeah, but as you noticed it is only a partial measure; the other thing I was looking at is that disk utilization on non-decom hosts has stopped growing "fast"
[12:09:11] mmm
[12:09:29] does it make sense to crank up object-replicator allowed concurrency on hosts that are being decommed btw?
[12:16:20] I'm skeptical of Swift here, perhaps unfairly, but I've never been really comfortable with the lack of an actual report on which objects are known, replicated properly, misplaced, etc
[12:19:27] that's fair yeah; covering 100% of objects in our case we could report on them, but I think it'll take a long time. anyways, re: concurrency, yes, cranking that up will speed things up
[12:27:53] godog: so for Ie4a82eb09 I'm going to disable puppet on any host matching R:File = /srv/prometheus -- so bast[3002,4002,5001].wikimedia.org,labmon[1001-1002].eqiad.wmnet,prometheus[2003-2004].codfw.wmnet,prometheus[1003-1004].eqiad.wmnet -- and then run puppet on one or two, inspect the result, and then slowly run puppet on the others, SGTY?
[12:30:30] cdanis: sounds good to me! FWIW I use R:prometheus::server as a selector
[12:30:39] ah, I was trying to find the proper role one
[12:31:02] same set of hosts but much nicer, ty
[12:32:28] is prometheus::server a define? :D
[12:32:58] oh, it is
[12:33:50] cdanis: ah, you'll find bast3002 with puppet disabled, that's me doing the prometheus v2 migration, although feel free to re-enable
[12:34:15] re-enabling won't break anything for you?
[12:34:44] not atm, no
[13:01:41] bad news, the new values don't save us from OOMs
[13:08:00] sigh
[13:08:16] I think I figured out the problematic queries though
[13:11:00] it was with only 30 days of history this time as well; my change very well might have made things worse!
[13:18:31] ouch, yeah I see prometheus1004 still using ~all memory
[13:19:05] yeah it's using a ton but i think not so much that it will crash
[13:19:31] i'm reloading 1003 with the previous/default setting and will try it with 2 weeks of history :|
[13:19:39] once 1004 is back to normal-ish
[14:05:45] godog, herron: ready when you are. this is my proposed plan https://phabricator.wikimedia.org/P8459
[14:07:02] jbond42: ok, will be ready in a few
[14:07:14] kafaka?
[14:07:45] jbond42: enable-puppet needs the reason
[14:07:53] also you can just run run-puppet-agent --enable "reason"
[14:07:57] to have it all in one command
[14:08:13] and if you use multiple commands in cumin you need to specify the -m/--mode
[14:08:55] volans: ack, cheers
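A minimal sketch of the cumin workflow discussed above (the R:prometheus::server selector from 12:30 and volans' run-puppet-agent --enable tip from 14:07). The disable-puppet wrapper name, batch/sleep values, and reason strings are illustrative assumptions, not commands taken from the paste:

```
# disable puppet on the Prometheus servers via the resource selector godog
# suggested (assuming a disable-puppet wrapper paired with enable-puppet;
# the reason string is illustrative)
sudo cumin 'R:prometheus::server' 'disable-puppet "cdanis: rolling out Ie4a82eb09"'

# after checking the result on one or two hosts, re-enable and run puppet in
# one command, batched and spaced out with -b/-s for a slow rollout
sudo cumin -b 2 -s 60 'R:prometheus::server' 'run-puppet-agent --enable "cdanis: rolling out Ie4a82eb09"'
```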
[14:08:56] quick question, do we need to force the puppet run everywhere? seems a bit of overkill
[14:09:06] given that in 30m it will run anyway everywhere
[14:09:26] I would rather pick some hosts and enable it to test if all works
[14:09:47] and then just re-enable puppet everywhere and let the normal cycle do its course
[14:09:48] herr.on or god.og wanted the feature enabled slowly so we can monitor load. the -s and -b flags may need to be adjusted to make this a much slower rollout
[14:10:07] monitor load on kafka
[14:10:21] how slowly is slow? would we be leaving puppet disabled across the fleet for hours?
[14:10:32] yeah, we talked about that yesterday in the monitoring meeting, I asked if we needed to tweak kafka for the 1.3k clients
[14:10:35] (I'm also worried about overlapping puppet-disables)
[14:10:53] cdanis: first reason wins
[14:11:16] disable foo; disable bar keeps it disabled with foo, so when you re-enable with bar it's a noop
[14:11:25] jbond42: re: phaste, looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/505737 should be added to the list too
[14:11:53] I'm ok btw with just re-enabling puppet and it'll DTRT over 30m
[14:13:53] godog: you are right on the change, that's the only one that needs a slow rollout. the other two only affect cumin hosts and the url_downloaders (the only systems with firewall::logging)
[14:13:59] godog: I can't conclusively say anything either way about a difference in memory consumption running the 'bad' queries with max-samples changed vs not, so, I'm just going to proceed rolling it out
[14:14:02] i have updated the paste https://phabricator.wikimedia.org/P8459
[14:14:45] cdanis: ack, sounds good
[14:16:48] jbond42: last thing, line two is missing a double quote at the end, other than that LGTM
[14:17:51] godog: good catch, updated; will just wait for herron then will start
[14:18:09] let's list the hosts we want to test before re-enabling across the fleet
[14:19:11] we can start with something low profile like people1001
[14:19:59] indeed, also the "misc-nonprod" cumin alias
[14:20:51] and also in the puppet disable reason let's include a contact person
[14:21:27] or use spicerack :D
[14:21:55] * volans has mixed feelings about using spicerack from a REPL
[14:23:06] updated https://phabricator.wikimedia.org/P8459, let me know if you want more hosts tested
[14:25:02] jbond42: looks good to me
[14:25:26] godog: can you give the paste one more pass and i will start
[14:25:32] jbond42: sure, lgtm!
[14:25:44] great, starting
[14:28:40] jbond42: is it okay if I enable & run puppet on a small number of hosts (just five of them) to complete my prometheus rollout?
[14:29:15] cdanis: let me test on one host first and then you should be ok
[14:33:25] running on people1001 now
[14:37:45] this was a no-op on people1001 and has been applied to bast4002. herron or godog, can you check bast4002 and just make sure things look sane to you before progressing
[14:38:03] jbond42: sure
[14:38:48] cdanis: i think you are testing on the prometheus servers; if so my change should be a noop there, so you should be good to go. let me know if that's not the case
[14:39:12] jbond42: rgr!
[14:39:18] will watch the output
[14:39:26] thanks
[14:39:38] bast4002 was one of the ones on my list as well ;)
[14:39:45] did you see the prometheus change there?
[14:39:59] yeah, prometheus on bast4002 was restarted because of cdanis' change, other than that lgtm jbond42
[14:40:05] ahh yes i did, do you want the output in a paste?
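As an illustration of the "first reason wins" semantics volans describes at 14:10-14:11, assuming the enable-puppet/disable-puppet wrappers mentioned above (the reason strings are made up):

```
disable-puppet "prometheus rollout - cdanis"   # puppet disabled, this reason is recorded
disable-puppet "logging change - jbond42"      # no effect: the first reason is kept
enable-puppet "logging change - jbond42"       # noop: reason does not match, puppet stays disabled
enable-puppet "prometheus rollout - cdanis"    # matches the recorded reason, puppet is re-enabled
```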
[14:40:27] ack, great, thanks godog. i'll move forward with the other test cases
[14:42:56] no, that's fine, ty jbond
[14:47:51] ok, i have rolled out to all the test hosts, i'll wait 10 mins before re-enabling puppet just in case something creeps up
[14:48:15] i'm continuing with the few hosts i have left
[14:48:18] jbond42: nice, sounds good
[14:48:21] yes, that's fine
[14:57:54] ok, i'm going to re-enable puppet now
[15:28:21] jbond42: lgtm so far
[15:29:23] cool, it should be applied across the fleet by now, so that's good
[15:33:26] cdanis, godog: are all the "varnish backend child restarted" UNKNOWNs on Icinga expected?
[15:33:35] http://prometheus.svc.eqiad.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0)
[15:34:00] same for other checks too
[15:34:25] 95% of the ~70 unknowns have the same message as above
[15:34:37] could be the new query limit?
[15:34:42] volans: not expected, no, unless prometheus is in trouble
[15:34:59] looking
[15:36:32] cannot repro when I run an affected command by hand on icinga1001
[15:36:32] most of them are 1/3, so it seems they are transient failures
[15:36:40] compatible with rate-limiting
[15:36:46] that isn't a rate limit
[15:36:54] yeah sorry, bad wording :)
[15:36:59] it's the max number of tsdb points a given query can 'look at' once it starts executing
[15:37:11] mmmh
[15:37:21] then a query should either succeed or fail
[15:37:25] not transient, right?
[15:37:28] it shouldn't cause the same query on mostly the same data to intermittently fail, yeah
[15:38:03] next thing I can think of is mod_proxy on apache and its timeout
[15:39:11] but nothing in apache logs, so no
[15:39:44] apache on prom1003 did serve 503s
[15:39:57] Apr 30 15:31:42 prometheus1003 prometheus@ops[21443]: level=warn ts=2019-04-30T15:31:42.782646042Z caller=main.go:467 msg="Received SIGTERM, exiting gracefully..."
[15:40:27] ah yeah, the restart is likely it then
[15:40:58] did we just restart it?
[15:41:26] looks like puppet did, nine minutes ago
[15:41:37] so expected :D
[15:41:45] oh, that's a Puppet run undoing my temporary undo of the 10M max-samples limit
[15:41:55] right, I re-enabled puppet on the host but did not trigger a run
[15:42:41] is there a technique for propagating the 'liveness' of an apache-fronted service to LVS?
[15:43:04] that would avoid blips like these; the ops prometheus instances in the big DCs take ~2-3 minutes to restart
[15:52:29] not a standard technique afaik; what comes to mind is a url that 200s only if all prometheus instances on the host also 200, then pybal proxyfetch would DTRT
[18:02:17] godog: you still around?
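A rough sketch of the aggregate health URL godog describes at 15:52, so pybal's proxyfetch could depool a prometheus host while its instances restart. This is only a sketch under stated assumptions: the CGI approach, the instance ports, and the Prometheus v2 /-/healthy endpoint path are all illustrative, not an existing check:

```
#!/bin/bash
# Return HTTP 200 only if every local Prometheus instance answers its own
# health endpoint; otherwise return 503 so pybal/LVS depools the host.
# The port list is an illustrative placeholder for the per-host instances.
for port in 9900 9901; do
    if ! curl -sf --max-time 2 "http://localhost:${port}/-/healthy" >/dev/null; then
        printf 'Status: 503\r\nContent-Type: text/plain\r\n\r\ninstance on port %s unhealthy\n' "$port"
        exit 0
    fi
done
printf 'Status: 200\r\nContent-Type: text/plain\r\n\r\nall instances healthy\n'
```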