[09:58:56] netops, Infrastructure-Foundations: BGP peers with missing descriptions - https://phabricator.wikimedia.org/T387220#10581935 (cmooney) Open→Resolved a:cmooney I had a quick look and added these.
[10:17:49] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10581998 (cmooney) >>! In T384731#10579181, @ayounsi wrote: >>> And what happens if peer_descr is mis...
[10:56:31] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287 (ayounsi) NEW
[11:01:17] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582145 (ayounsi) I forked the discussion to {T387287} and {T387288} as that task was becoming more...
[11:01:35] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10582150 (ayounsi)
[11:01:38] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582148 (ayounsi)
[11:25:01] we got a page regarding http_aux_k8s_eqiad_kube_apiserver_ip6 probe failing
[11:33:02] auto-resolved right?
[11:33:44] yes
[11:33:51] up to you if this requires further investigation :)
[11:34:55] makes sense yes, thanks!
[13:56:42] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10582948 (ayounsi) Another question is how to name those new metrics ?
One suggestion, to stay generic as well, is to do something like `gnmi_bgp_...
[14:03:48] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10582973 (fgiunchedi) Thank you for kickstarting this @ayounsi! I think I like `remote_instance` though don't feel strongly. re: `:0` in `instanc...
[14:55:14] netops, Infrastructure-Foundations, observability, Patch-For-Review: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583319 (cmooney) >>! In T387287#10582948, @ayounsi wrote: > Another question is how to name those new metrics ? > One sugg...
[14:56:20] XioNoX: I had a stab at some bgp dashboards in grafana
[14:56:34] definitely a million ways to approach it we will likely end up with more but it's a start
[14:56:37] https://grafana.wikimedia.org/goto/pS45GMtHg
[14:56:49] https://grafana.wikimedia.org/goto/pc0pGMtNR
[15:11:46] topranks: https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&theme=dark&var-site=eqiad&var-device=lsw1-e1-eqiad&var-bgp_group=PyBal&var-bgp_neighbor=All&viewPanel=868
[15:11:59] topranks: something is telling me that that data has some issues O:)
[15:12:15] not it's just how reliable PyBal is
[15:12:25] Liberica better be pretty good to compete :P
[15:12:26] hmm...
[15:12:39] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583404 (lmata)
[15:12:57] topranks: that's liberica already
[15:13:08] lvs1013 is running liberica since November
[15:13:13] it happens on magru as well
[15:13:26] see https://grafana.wikimedia.org/goto/lKeKnMtNR?orgId=1
[15:13:49] presume Jan 1970
[15:14:06] that's 55 years away already, damn we are old
[15:14:08] at a guess they are exporting different number of digits in the unix timestamps....
[15:14:16] I'll have to try and think how to deal with it
[15:30:11] topranks: BTW.. related to BGP metrics, I'm enabling metrics on gobgpd as we speak
[15:30:32] that's a list of the available bgp metrics https://www.irccloud.com/pastebin/D8HaZ6MJ/
[15:32:14] so those will be available for every liberica instance
[15:38:41] oh very nice that's great :)
[15:47:37] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583538 (cmooney) @fgiunchedi I wonder if you might have any ideas on this. Our routers and our switches are exporting timestamps with different number of digits: ` gn...
[15:48:20] vgutierrez: yeah we're getting stats with different scales: https://phabricator.wikimedia.org/T369384#10583538
[15:48:49] that's a nice gnmi bug IMHO
[15:50:28] shouldn't be any room for error, the YANG model says it should be "timestamp in nanoseconds"
[15:50:30] https://github.com/openconfig/public/blob/b0d7b808f02c99a5307368604a638d52b98dd593/release/models/bgp/openconfig-bgp-neighbor.yang#L340
[15:50:48] good old Juniper keeping it consistent
[15:51:46] yeah I'd bet on Junos not following specs
[15:54:26] topranks: the detailed dashboard is nice!!
[16:01:08] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.02.10 - 2025.02.28): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10583616 (xcollazo) @cmooney, should we move forward with this patch sometime soon?
[16:01:28] the great thing is neither output format is what Grafana wants
[16:01:37] you need to multiply what the switch sends by 1,000 to get there
[16:01:43] or divide what the MX sends by 100,000
[16:01:45] :D
[16:03:14] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583630 (ayounsi) > I'm wondering what the benefit is to having the additi...
[16:14:19] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583672 (cmooney) >>! In T387287#10583630, @ayounsi wrote: > No strong fee...
[16:33:26] vgutierrez: I solved it with some funny maths courtesy of a robot friend :)
[16:34:42] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583733 (cmooney) My robot friend suggested this which works to adjust the result of the promql to the right units: ` gnmi_bgp_neighbor_last_established{instance="$devi...
[16:38:14] topranks: lovely 🤖
[19:49:55] FIRING: MaxConntrack: Max conntrack at 82.03% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:54:55] RESOLVED: MaxConntrack: Max conntrack at 82.03% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:01:55] FIRING: MaxConntrack: Max conntrack at 85.14% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:06:55] RESOLVED: MaxConntrack: Max conntrack at 81.31% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
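Note on the timestamp-scale thread (15:14 to 16:34): the PromQL that was actually used is truncated in the log, so the following is only a rough sketch of the general idea, not the query from T369384. It assumes the unit of an epoch timestamp can be guessed from its magnitude (10 digits for seconds, 13 for milliseconds, 16 for microseconds, 19 for nanoseconds) and converts everything to milliseconds. The function name `epoch_to_millis` and the thresholds are hypothetical, chosen for illustration.

```python
def epoch_to_millis(value: int) -> int:
    """Guess the unit of an epoch timestamp from its size and return milliseconds.

    Assumption: the value is a present-day date, so its digit count
    identifies the unit. The thresholds are a heuristic, not a spec.
    """
    magnitude = abs(value)
    if magnitude < 10**11:       # about 10 digits: seconds
        return value * 1_000
    if magnitude < 10**14:       # about 13 digits: milliseconds
        return value
    if magnitude < 10**17:       # about 16 digits: microseconds
        return value // 1_000
    return value // 1_000_000    # about 19 digits: nanoseconds


if __name__ == "__main__":
    # The same instant expressed in seconds and in nanoseconds
    print(epoch_to_millis(1_740_000_000))
    print(epoch_to_millis(1_740_000_000_000_000_000))
```

A magnitude check like this is only safe while all values are recent dates; a genuinely small timestamp (for example a zero reported for a never-established session) would be treated as seconds.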