[13:53:09] XioNoX: I just quickly cross-checked netflow data (src IP: text/upload LB v4 and v6 in each DC) vs their transit/peering bills, and it's reasonably close now
[13:53:43] cdanis: fyi we've been working with Luca to have pmacct send netflow to kafka over TLS
[13:53:57] yeah I saw the task get created
[13:54:14] it's done in ulsfo, we had some weird bugs but now it looks smooth
[13:54:50] nice!
[13:56:07] cdanis: so esams netflow is close to its librenms stats?
[13:56:34] 753e9 bytes/5 minutes =~ 20 Gbps egress
[13:57:18] woot!
[13:57:21] awesome
[13:57:36] cdanis: what was the issue?
[14:00:00] looks like it was the absurd flow limits
[14:00:12] here's esams: https://w.wiki/Lm8
[14:02:05] you can see the instant where i enabled it on cr2, lol
[14:03:16] ah yeah!
[14:03:29] I thought even after that there was a difference in numbers
[14:06:14] there's still some
[14:07:24] eqiad is showing ~6.2 Gbps, which is still a bit short after subtracting away eqord from https://librenms.wikimedia.org/bill/bill_id=16/
[14:08:47] for our use of it, it's still significant progress
[14:09:19] yeah subtracting away eqord I see 9 Gbps expected and we're showing 6.2
[14:09:24] at 00:00
[14:11:22] codfw, eqsin, ulsfo, esams look reasonable, though
[14:11:29] within 10% or better
[16:24:21] fyi here is my onboarding on officewiki https://office.wikimedia.org/wiki/Technology/Onboarding/Checklists/John_Bond
[16:24:44] I do like the "who starts it" part of the officewiki template
[16:25:00] moritzm: ^^^ as an example
[16:26:00] ack, will have a closer look when onboarding Janus
[16:26:16] actually John's one might be old, see the template
[16:26:18] https://office.wikimedia.org/wiki/Technology/Onboarding/Checklists/Template
[16:26:37] yes i think i was an early adopter ;)
[17:13:11] volans: hello, icinga expert, do you have a minute for a quick question?
[17:13:15] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-esams&service=Router+interfaces
[17:13:47] this alert has been acked for a while, but ofc we care if a 3rd interface fails -- that would invalidate the ack and cause a new critical, because the check output message changed, right?
[17:22:56] cdanis: sure
[17:23:27] cdanis: no it would not with current config
[17:23:35] but it's possible to make it do that
[17:23:45] because of how the ack was set up, or because of how icinga is set up?
[17:24:57] * volans rewinding memory
[17:28:20] cdanis: https://phabricator.wikimedia.org/T173806
[17:28:34] hah
[17:28:35] okay
[17:28:56] I don't recall if it overrides an ACK thoudh
[17:28:58] *though
[17:29:02] surely it does re-alert
[17:30:22] that's past volans from 2017 so don't trust him :)
[17:34:38] XioNoX: ok a proposal: it looks like we use check_ifstatus_nomon, how about we edit the description of the broken esams interfaces to include 'no-mon', make sure alert clears, and remove the ack?
[17:35:52] cdanis: to test it?
[17:36:28] XioNoX: no, intentionally, since AIUI from what volans says, an ack will mean we don't get re-notified on a 3rd interface failing
[17:37:27] smart
[17:38:19] if it sounds good to you i'll edit right now
[17:39:04] cdanis: yeah, and we can keep the cr3 state the same
[17:39:43] oh? i was thinking edit there as well
[17:43:45] that works for me too
[17:44:42] ok, I'll change cr3 first and make sure alert clears
[17:47:55] thanks
[17:49:03] +1
[18:09:54] OK: host '91.198.174.245', interfaces up: 87, down: 0, dormant: 0, excluded: 2, unused: 0
[18:10:06] made the task # also part of the interface description btw
[18:14:09] {{done}} on both routers
[18:14:30] and of course the ack auto-cleared, so, we'll get a new CRITICAL if we go N+0 in esams
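
For reference, the bandwidth arithmetic from earlier in the log (13:56:34 and 14:09:19) can be checked with a short sketch. The function names here are hypothetical and not taken from any tool mentioned in the chat, and the sketch assumes decimal units (1 Gbps = 10^9 bits per second) and that the byte count covers the full 5-minute window.

```python
def bytes_to_gbps(total_bytes: float, window_seconds: float) -> float:
    """Average rate in Gbps (10^9 bits/s) for total_bytes sent over the window."""
    return total_bytes * 8 / window_seconds / 1e9


def shortfall(observed: float, expected: float) -> float:
    """Fraction by which the observed figure falls below the expected one."""
    return (expected - observed) / expected


# esams: 753e9 bytes in a 5-minute netflow bucket (figure from the chat)
esams_gbps = bytes_to_gbps(753e9, 5 * 60)

# eqiad: 6.2 Gbps seen in netflow vs 9 Gbps expected from the bill (figures from the chat)
eqiad_gap = shortfall(6.2, 9.0)

print(f"esams: {esams_gbps:.2f} Gbps")
print(f"eqiad shortfall: {eqiad_gap:.0%}")
```

Under those assumptions the esams figure works out to roughly 20 Gbps, matching the estimate in the chat, while the eqiad gap is about 31%, well outside the "within 10% or better" seen at the other sites.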