[15:34:15] godog, shdubsh: the relaxed checks didn't seem to have helped much unfortunately :(
[15:38:18] volans: the patch isn't merged yet though, I'll merge it in 10-15 minutes
[15:38:47] assuming you are talking about https://gerrit.wikimedia.org/r/c/operations/puppet/+/580327
[15:39:34] adjusting check_ping to 2 packets looks to be helping, at least average check execution time has gone down considerably
[15:39:39] https://usercontent.irccloud-cdn.com/file/4YZLGHrP/Screen%20Shot%202020-03-17%20at%2011.37.07%20AM.png
[15:39:40] lol, I thought it was merged and saw a temporary dip in the graph
[15:40:01] herron: that's not really affecting the host though
[15:40:12] as that's the time a check runs in a subprocess
[15:40:25] it doesn't really matter, in particular for pings, where most of the time is just waiting
[15:40:38] it's not affecting the single-core-bound main application process
[15:40:49] it's correct to think of the main icinga process as a single-threaded event loop, right?
[15:40:55] yes
[15:40:59] basically a while true
[15:41:00] basically all it is doing is fork/wait/read/write?
[15:41:02] fork
[15:41:27] what counts more should be the number of forks
[15:41:51] and this also seems to fit the fact that icinga2001 is in better shape because it has fewer cores but a higher-frequency CPU
[15:42:04] the total # of checks correlates pretty well with latency
[15:44:51] the two CPUs are these, for the curious: https://ark.intel.com/content/www/us/en/ark/compare.html?productIds=83354,123547
[15:45:15] the E5 is icinga2001
[15:46:13] we could potentially fail over icinga to 2001
[15:46:32] yes, that's an option we've also discussed a bit
[15:46:39] in yesterday's meeting
[15:46:48] you're sure you don't want to join those meetings? :-P
[15:47:06] I had to skip yesterday, but I usually join
[15:47:38] * volans trolling
[15:49:10] failing over makes sense to me
[15:49:57] definitely, especially if relaxing checks doesn't help a whole lot
[15:49:59] also, has anyone looked at optimizing the icinga1001 config for CPU frequency yet? I'm not seeing any of the cores at the 3 GHz "turbo boost" frequency listed for this CPU model. if not, we could experiment with the config on icinga1001 while it's not live
[15:50:07] patch is merged btw, should be fully active in ~30min
[15:50:23] IME failing over is reasonably easy as long as puppet hasn't been broken on icinga2001 for weeks ;)
[15:50:32] (and currently it's fine, says puppetboard)
[15:51:09] and it's even documented
[15:51:11] and/or get some better hardware for icinga1001
[15:59:38] shdubsh: something that might be interesting for prometheus-icinga-exporter: not just the total # of checks, but the number of 'work-units/second' the checks imply (number of checks / check interval)
[16:00:49] or number of forks :D
[16:05:05] cdanis: I'm not sure I follow. AFAIK check interval is configurable. We would have to aggregate check intervals somehow first?
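(A minimal, hypothetical sketch of the fork-per-check loop volans describes above — "basically a while true ... fork" — not Icinga's actual code. It illustrates why the wall-clock time of an individual check, which is mostly spent waiting on ping in a subprocess, barely loads the single-threaded main process, while the number of forks it has to perform per second does.)

```python
import os
import time

def run_check(command):
    """Fork a child that exec()s the check plugin; the parent returns immediately."""
    pid = os.fork()
    if pid == 0:
        # Child process: becomes the plugin (e.g. check_ping). All of the
        # plugin's wait time is spent here, outside the main process.
        os.execvp(command[0], command)
    return pid

def scheduler_loop(due_checks):
    """Single-threaded 'while true' loop: fork due checks, reap finished ones."""
    while True:
        for cmd in due_checks():  # checks whose interval has elapsed
            run_check(cmd)
        # Reap any finished children without blocking the loop.
        try:
            while True:
                pid, status = os.waitpid(-1, os.WNOHANG)
                if pid == 0:
                    break  # children exist but none has exited yet
                # ...read and record the plugin result for this pid here...
        except ChildProcessError:
            pass  # no children at the moment
        time.sleep(0.1)
```

(The load on a loop like this scales with how many checks come due per unit of time, which is what the 'work-units/second' idea and the exchange that follows try to quantify.)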
[16:05:53] sum(1/check_interval) over each host+service check
[16:06:07] that gives you the checks per minute
[16:08:12] that is what's really hammering icinga, or, if it's easier to gather, how many forks the main icinga process does
[16:09:10] the latter I'm imagining would be a constant like "all of them" :D
[16:09:28] volans: we could package and deploy https://github.com/cloudflare/ebpf_exporter to count that ;)
[16:09:51] the relationship between checks per minute scheduled and forks might be interesting
[16:11:34] although, if forks are flatlined that likely means we've reached or exceeded capacity
[16:14:28] so far it's going down, let's see where it settles
[16:22:57] aye, looks like the recent stock market
[16:30:22] it seems to be settling around 25s, let's keep an eye on it for a bit, but it's surely a much better place to be in
[16:30:36] we should add a low-frequency check on that metric
[16:41:56] aye, I got a patch out
[16:47:21] volans: better than 3 months ago
[18:15:33] does anyone remember the URL of the cloud instance of netbox?
[18:15:50] XioNoX: https://af-netbox.wmflabs.org/
[18:16:03] volans: ok, does it load for you?
[18:16:10] I thought we migrated away from it
[18:16:18] the host is af-nb-fe.automation-framework.eqiad.wmflabs
[18:16:18] or something
[18:16:46] no, it doesn't, I thought we kept the same hostname publicly but just changed the hosts underneath, chaomodus ^^^
[18:17:21] yep that's the case
[18:25:01] any idea where to look? is it the wmcs proxy or the Netbox server?
[18:27:35] chaomodus: could you have a look or tell me how to look? it's blocking two of my Q goals :(
[18:34:29] tcpdump sees the queries from the proxy, so I think the proxy is fine
[18:34:32] 18:34:06.444558 IP proxy-01.project-proxy.eqiad1.wikimedia.cloud.57978 > af-nb-fe.automation-framework.eqiad.wmflabs.http: Flags [S], seq 1328600185, win 29200, options [mss 1460,nop,nop,TS val 3937258561 ecr 0,nop,wscale 9], length 0
[18:35:29] sure thing
[18:36:20] thx
[18:48:31] interesting
[18:48:54] a good or a bad interesting?
[18:48:59] bad.
[18:49:11] hm
[18:49:27] may I ask, what changed?
[18:52:34] it has been through some debugging for the slowness and the previous upgrade
[18:55:01] hrm, weirdly even if I runserver it isn't hitting it
[18:56:12] have you checked horizon?
[18:56:19] maybe it's just the config that's wrong
[18:56:57] maybe
[18:57:09] I did try nc against the proxy port and nope, so
[19:03:50] shdubsh: FYI latency got back to ~1m average for a while
[19:04:02] then down again, let's see how much it fluctuates
[19:05:57] That's my thought too. I'm curious to see the new pattern.
[19:26:47] XioNoX: okay, it should work well enough to do testing, I'm still investigating the exact cause but it isn't getting the security policy or whatever that makes the ferm rule open up port 80 for the proxy
[19:32:56] chaomodus: the proxy only has https open iirc
[19:34:22] thx, looks up for me
[19:39:47] [reminder] to update the pad for tomorrow's meeting ;)
[19:54:47] chaomodus: is it broken again?
[20:09:32] erm
[20:10:24] yep one sec
[20:14:59] okay, should be fixed in a future-proof kind of way
[20:39:11] thanks!
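(For reference, a minimal sketch of the 'checks per minute' aggregation from the 16:05-16:08 exchange above. The input format — a list of host/service/interval tuples — is illustrative only and is not prometheus-icinga-exporter's actual data model.)

```python
def checks_per_minute(checks):
    """sum(1/check_interval) over all host and service checks.

    checks: iterable of (host, service, check_interval_minutes) tuples.
    Returns the average number of check executions (forks) scheduled per minute.
    """
    return sum(1.0 / interval for _host, _service, interval in checks if interval > 0)

# Example: three 1-minute checks plus two 5-minute checks schedule
# 3 * 1 + 2 * 0.2 = 3.4 check executions per minute on average.
example = [
    ("host1", "ssh", 1),
    ("host1", "https", 1),
    ("host2", "ssh", 1),
    ("host1", "disk", 5),
    ("host2", "disk", 5),
]
print(checks_per_minute(example))  # ~3.4 checks scheduled per minute
```

(If the exporter exposed per-check intervals as a metric, the same aggregation could presumably be done as a Prometheus query instead of in the exporter itself, but that is an assumption rather than something decided in the discussion above.)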