[00:00:49] hmm [00:00:54] hieradata/role/common/cirrus/cloudelastic.yaml:profile::lvs::realserver::ipip::enabled: true [00:02:25] and further, include profile::lvs::realserver::ipip you mentioned talking to 10.64.166.2 above. from where though? [00:08:45] now the problem is that we have Puppet disabled on the LVSes [00:09:26] I am really not sure what the best course of action is [00:12:45] understanding is one thing, I am missing a lot of context here too :P [00:26:23] ok I am going to revert this patch. I can't seem to make sense of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/948551b7a98e299d5acd6c811bbff670cfeba75a%5E%21/#F0 looking at https://netbox.wikimedia.org/dcim/devices/3654/interfaces/ [00:26:58] and since I lack the full context and understanding and given everything was working fine and that lvs1020 is the backup host, it makes sense to revert and let puppet run everywhere else [00:27:16] it's late and I need to head out and I am quite sure I won't solve this right now [00:27:51] going to send an email to catha.l and v.g so that they can look at it in their AM [00:28:43] also since _in theory_, nothing was broken, it's safest to revert this for the backup LVS [00:28:56] (something should be broken but I don't know what and where?) [00:35:41] ok, NOOP on eqiad LVS hosts (except lvs1020) as expected. for lvs1020, -vlans added, so back to where we were [00:49:36] 06Traffic: Update libvmod-netmapper to 1.10 - https://phabricator.wikimedia.org/T392533#10814578 (10BCornwall) [07:09:40] 10netops, 06Infrastructure-Foundations, 07sre-alert-triage: Alert in need of triage: BGP status (instance cr2-drmrs) - https://phabricator.wikimedia.org/T393991 (10LSobanski) 03NEW [07:11:41] 10netops, 06Infrastructure-Foundations, 06SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#10814904 (10ayounsi) a:05ayounsi→03Papaul Re-assigning it to Papaul to do the change on `ulsfo` and `eqsin`.
It is a good training opportunity, and would remove moving p... [07:12:08] topranks: how you wanna proceed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144690 aka E8/F8 vlans [07:49:30] 10netops, 06Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996 (10ayounsi) 03NEW [08:00:18] vgutierrez: not sure actually, it confused me and Sukhbir last night [08:00:50] as in, we disabled puppet on all but lvs1020, but when we ran puppet on it the patch didn't seem to make any edits to create the new vlan interfaces [08:01:03] there may be something related to ipip toggle we missed though [08:01:12] what do you recommend? [08:01:31] I believe it's needed on lvs1019 anyway, the cirrussearch boxes sit behind that afaik [08:05:05] topranks: so it was a NOOP in puppet terms? [08:05:51] the IPIP toggle shouldn't mess with vlan interfaces AFAIK [08:05:57] no it did make one change on lvs1020, it added "vlan1061" (the correct new one) to the list of interfaces in /lib/systemd/system/ipip-multiqueue-optimizer.service [08:06:17] yeah looking at the puppet code I didn't see that the toggle should stop them being created [08:07:19] IPIP puppetization just takes care of deploying ipip-multiqueue-optimizer, ipip0, ipip60 interfaces and clsact qdisc (required to be able to attach ipip-multiqueue-optimizer to the various network interfaces) [08:08:01] makes sense [08:08:06] (that's on the LBs, in the realservers it takes care of enabling MSS clamping of course) [08:08:09] why is the clsact qdisc needed? [08:08:19] (^^ you can ignore this one for now btw, just being curious :P) [08:08:31] ok yep [08:08:37] that's how you send the egress traffic to eBPF code [08:08:44] ah ok [08:08:50] (the alternative would be XDP) [08:09:13] Ok TIL. I had assumed we used XDP / it was the only way. 
[08:09:17] actually as soon as we switch to katran we won't need ipip-multiqueue-optimizer anymore as it's included as a core feature there [08:10:44] you have the two options... for katran ipip multiqueue optimization happens in XDP... cause it gets an inbound packet, performs LB duties and then it applies the IPIP mq optimization before leaving the packet again on the NIC queue [08:11:15] with IPVS i can't do that [08:11:32] I need to let IPVS do its magic and afterwards apply the IPIP mq optimization [08:11:45] ah ok cool, so this is an alternate approach in the between time when we are using IPIP and IPVS [08:11:51] yes [08:11:56] makes sense, thanks for the explanation! [08:12:22] so.. back to E8/F8 patch, let's see what PCC tells us and debug it [08:12:42] good call [08:18:24] topranks: https://puppet-compiler.wmflabs.org/output/1145098/3803/lvs1020.eqiad.wmnet/index.html this matches what you described [08:20:25] hmm ok yeah [08:20:48] I wonder is it something in puppet changed how "interface::tagged" is working? 
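A rough sketch of the mechanism vgutierrez describes above: a clsact qdisc gives an interface ingress and egress attach points, and a tc egress filter is how outbound traffic gets handed to eBPF code (the alternative, XDP, only hooks ingress). The interface name and eBPF object file below are placeholders, not the actual ipip-multiqueue-optimizer deployment; DRYRUN=1 (the default here) just records and prints the commands, since the real ones need root and a compiled eBPF program.

```shell
# Sketch only: attaching eBPF to the egress path via a clsact qdisc.
# Interface name and object file are hypothetical placeholders.
DRYRUN=${DRYRUN:-1}
cmds=""
run() {
    cmds="$cmds$*;"
    if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

iface=vlan1061                        # placeholder interface name
run tc qdisc add dev "$iface" clsact  # adds ingress + egress attach points
# attach an eBPF classifier on egress; optimizer.o is a placeholder object
run tc filter add dev "$iface" egress bpf direct-action obj optimizer.o sec tc
```

This is why the egress hook is needed with IPVS: the kernel makes the load-balancing decision first, and the optimizer has to run afterwards on the way out, whereas katran does the equivalent work inside its XDP program on ingress.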
[08:22:14] * 2f3d53a85e - interface: add explicit Augeas lens (2024-02-01 15:25:20 +0000) [08:22:22] that's the latest change on interface::tagged [08:23:41] 10netops, 06Infrastructure-Foundations, 07sre-alert-triage: Alert in need of triage: BGP status (instance cr2-drmrs) - https://phabricator.wikimedia.org/T393991#10815152 (10ayounsi) a:03ayounsi sent an email to PCH [08:24:10] hmm, think we definitely added some since then [08:25:07] yep [08:25:20] but vlan1061 configuration isn't happening at all [08:25:29] and I think that everything happens via augeas [08:26:11] yeah, I'm not too familiar with augeas, other than knowing it's a little old fashioned at this point :P [08:28:08] from the lvs puppetization, ipip-mq-optimizer config happens in profile::lvs [08:28:15] and the interface config itself happens on profile::lvs::tagged_interface [08:28:25] yeah [08:28:39] which should in turn call interface::tagged [08:28:50] both interface::tagged and interface::clsact [08:29:23] maybe I can ask Jesse if he understands it a little better later [08:29:23] and even if augeas tells the change is a NOOP, we should have interface::tagged resources for the new vlans on the change catalog [08:29:47] and that's definitely not true [08:31:18] I must have missed something else [08:31:44] we get a "Profile::Lvs::Tagged_interface[private1-e7-eqiad]" (or similar) for each vlan/subnet in the PCC output [08:31:47] but not the new ones [08:32:01] let me double check, maybe this vlan needs to be included somewhere else also [08:32:30] hieradata/role/eqiad/lvs/balancer.yaml [08:32:33] doh! [08:33:06] :) [08:33:31] let me amend that revert CR :) [08:36:24] I submitted a new one [08:36:35] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145099 [08:37:20] ok, we need to abandon one of those then :D [08:38:02] full diff looks good: https://puppet-compiler.wmflabs.org/output/1145099/3806/lvs1020.eqiad.wmnet/fulldiff.html [08:39:05] yeah we can abandon the old one...
my bad my git fu escaped me I didn't realise we could still use the old one [08:39:19] I've abandoned my CR [08:39:52] ok cool [08:39:58] +1ed yours, nice catch :D [08:40:05] so now we can maybe try again? will I follow the same pattern? [08:40:21] disable puppet on lvs1017-1019, merge, then do a puppet run on lvs1020? [08:41:11] sounds good, how you usually proceed with these? rebooting the lvs after adding the new interface? [08:41:36] tbh I've not done it so often, I was hoping we'd run on lvs1020 and it would add the interfaces without a reboot [08:41:56] or worst case "ifup vlan1061" would add it without a reboot [08:42:19] that's not enough, we need to restart ipip-mq-optimizer too [08:42:32] and that requires depooling the instance [08:43:05] ok... never as easy as one hopes [08:43:14] that's why in the past those were restarted [08:43:26] yeah we may as well restart if we have to do the depool dance [08:43:36] +1 [08:43:40] that's just a matter of stopping puppet and pybal service, waiting for failover, then reboot? [08:44:04] yes [08:44:22] I usually stare at the dashboards to see the pps graph flatlining to 0 [08:44:38] yeah, and the connections drop off ? [08:44:42] I can take care of those if needed [08:44:50] connections is gonna be way slower [08:44:52] but yes [08:45:30] yeah the connections is just the expire time right? I was never too sure it was 100% needed but erred on the side of caution [08:46:05] yep [08:46:23] LVS only see the inbound side of the traffic so they don't really know much about TCP connection state [08:46:40] 10netops, 06Infrastructure-Foundations, 13Patch-For-Review: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10815286 (10ayounsi) 05Open→03Stalled Current status: * `fasw`: done * `pfw`: {T393996} should be the last step * `mr`: waiting on JTAC case 2025-0506-688713 [08:48:51] ok thanks for confirming [08:49:19] vgutierrez: how about I do lvs1020 and lvs1019, you do the other two then?
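The "depool dance" discussed above, condensed into a hedged sketch. The disable-puppet wrapper invocation, the service unit names, and the drain wait are assumptions pieced together from the conversation, not a runbook; DRYRUN=1 just prints the steps.

```shell
# Sketch of the LVS restart sequence described above: stop puppet and pybal,
# wait for BGP failover (watch the pps graph flatline to 0 and connections
# drain via their expire timers), then reboot so the new vlan interface and
# ipip-multiqueue-optimizer come up cleanly. Names are illustrative.
DRYRUN=${DRYRUN:-1}
cmds=""
run() {
    cmds="$cmds$*;"
    if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run disable-puppet "restart for new vlan interfaces"  # assumed wrapper name
run systemctl stop pybal.service  # routes withdrawn, traffic fails over to backup
run sleep 300                     # assumed drain window; in practice, watch dashboards
run systemctl reboot
```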
[08:49:33] topranks: I can take care of the 4 lvs if needed [08:49:40] just let me know [08:50:04] if you're sure? [08:50:23] that'd be a help if you could thanks [08:51:06] * vgutierrez notices that today is Tuesday 13th... [08:51:18] ok.. let's break some LVS :) [08:51:23] famous for its good luck :P [08:54:44] topranks: could you add `lvs:` in the beginning of the commit message? [08:54:51] and I'll take it from there [08:55:33] yep one sec [08:59:17] ok done [09:12:17] o/ I have another small change for the trafficserver gateway routing stuff. I'm adding a regex to tighten the criteria for the ignorelist on enwiki, essentially allowing "A.*" articles to bypass restbase. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144581 [09:12:40] it's a little ugly but it'll get less ugly as we start to allow more traffic through throughout the day [09:23:53] vgutierrez: when you have a minute I'd be interested to chat about LVS BGP session alerting [09:24:17] sure [09:24:31] hnowlan: lua regex is always fun [09:29:35] (topranks is probably interested too) [09:29:35] vgutierrez, the current/old Icinga alerting that we're slowly removing in favor of Prometheus checks for all BGP sessions on the routers/switches. So if the session on, for example, cr3-ulsfo to lvs4008 goes down it shows the alert under cr3-ulsfo. Now that T387287 is done, it's possible instead to attach the router side alert to the host itself (so lvs4008). That way service owners can set their alerting as they wish. The rationale is [09:29:35] that the network device is usually not responsible for a server BGP session going down.
[09:29:35] T387287: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287 [09:30:30] TIL T387287 [09:32:42] so for BGP sessions like LVS, before we remove the old icinga alert, I'd like to make sure there is something that satisfies the traffic team in place [09:33:23] the alternative is to not care about the network side, and only alert if the server side BGP session goes down, as it's almost impossible that only one side goes down and not the other one [09:34:07] where are we with that one actually? [09:34:24] when I look in thanos for series under gnmi_bgp_neighbor_session_state{}, the only "instances" I see are routers/switches [09:34:42] but we merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122957 [09:34:54] remote_instance:gnmi_bgp_neighbor_session_state [09:35:30] yep [09:35:30] https://w.wiki/E6SC [09:35:32] thanks [09:36:33] but yeah +1 from me to change our alerts to not include the PyBal group, and then have separate alerting defined as is appropriate for the LVS from remote_instance:gnmi_bgp_neighbor_session_state [09:36:54] the obvious big benefit of using the latter series is that a downtime should suppress those alerts [09:38:25] pybal already has its own BGP alerts [09:38:46] re liberica.. we got the metrics in there but apparently I haven't added any alerts yet [09:39:43] see https://grafana.wikimedia.org/goto/JTiTHSaHg?orgId=1 && https://grafana.wikimedia.org/goto/Nd0THI-NR?orgId=1 [09:41:34] vgutierrez: thanks! I'll do the usual puppet toggle dance to do the first cp host [09:41:42] hnowlan: cool [09:53:14] XioNoX, topranks from what you're saying I wouldn't even need to set alerts based on our own metrics [09:54:34] vgutierrez: yeah there is definitely limited reason to alert on the data from both sides (host and router) [09:55:31] if picking only one side to alert on, the question is which is better?
makes little difference, I'd probably err on the router side but if there are already alerts set up for the host side I'm not sure it's worth much effort to change [09:56:44] XioNoX: sounds like it might be safe for us to not alert for the PyBal group/ASN in alertmanager/Icinga? [09:57:02] and leave traffic to decide whether to use their existing ones or remote_instance:gnmi_bgp_neighbor_session_state? [09:57:22] the other consideration is the severity of the alert [09:57:37] for the pybal alerts we set the severity to p.a.g.e [09:58:04] and AFAIK at the moment network side BGP alerts don't page [09:58:21] (I hope that didn't trigger any IRC based ping) [10:01:03] do you have the granularity to be able to trigger a p.a.g.e. for BGP sessions going down in the router side for the pybal group? [10:01:17] (but not other groups if that's not needed) [10:03:10] 06Traffic, 06Data-Engineering, 06Data-Engineering-Radar, 10Observability-Logging: Clean-up varnishkafka webrequest leftovers in Hadoop-world - https://phabricator.wikimedia.org/T394011 (10JAllemandou) 03NEW [10:03:55] 06Traffic, 06Data-Engineering, 06Data-Engineering-Radar, 10Observability-Logging: Shutdown varnishkafka instances - https://phabricator.wikimedia.org/T393772#10815510 (10JAllemandou) Hey folks, I'm sorry I was OoO last week and missed the timeline. I'm happy for VK to be shut-down when you wish. we don't g... [10:05:51] topranks: yep, totally agreed [10:07:19] vgutierrez: it's all in Prometheus, so yeah [10:07:55] we don't rewrite the instance for the number of received prefixes though, only status for now iirc [10:08:32] ack [10:08:35] SGTM
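The remote_instance:gnmi_bgp_neighbor_session_state series being discussed can be inspected ad hoc through the standard Prometheus HTTP API. The endpoint URL below is a placeholder (not a real WMF host), and the label filter is just an illustrative guess at how one might narrow to LVS hosts; DRYRUN=1 prints the command instead of hitting a server.

```shell
# Sketch: query the recording rule mentioned above via the Prometheus
# HTTP API. Endpoint and label matcher are hypothetical.
DRYRUN=${DRYRUN:-1}
cmds=""
run() {
    cmds="$cmds$*;"
    if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

# router-side BGP session state, re-attached to the host's instance label
query='remote_instance:gnmi_bgp_neighbor_session_state{instance=~"lvs.*"}'
run curl -sG "http://prometheus.example:9090/api/v1/query" \
    --data-urlencode "query=$query"
```

Because the series carries the host's instance label, a normal host downtime/silence matches it, which is the suppression benefit mentioned above.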
[10:36:20] 06Traffic, 10RESTBase, 10RESTBase Sunsetting, 06serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10815665 (10MSantos) [10:54:38] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10815724 (10cmooney) Link remains stable that I can see, there are no errors reported in either the switch or host side stats. For the record the device is a BCM57414 NIC, in PCIe... [10:57:11] 06Traffic, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06SRE, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10815745 (10Ifrahkhanyaree_WMDE) 05Open→03Resolved Closing the ticket as all t... [11:03:22] vgutierrez: thanks for taking care of those LVS reboots!! [11:03:35] I'll feed back to the search team and see if that resolves their issue [11:26:02] 06Traffic, 10Data-Engineering (Q4 2025 April 1st - June 30th): Clean-up varnishkafka webrequest leftovers in Hadoop-world - https://phabricator.wikimedia.org/T394011#10815806 (10JAllemandou) [11:26:12] 06Traffic, 10Data-Engineering (Q4 2025 April 1st - June 30th): Clean-up varnishkafka webrequest leftovers in Hadoop-world - https://phabricator.wikimedia.org/T394011#10815807 (10JAllemandou) a:05Fabfur→03JAllemandou [12:13:37] 10netops, 06Infrastructure-Foundations, 06SRE: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021 (10cmooney) 03NEW p:05Triage→03Medium [12:14:06] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10816007 (10cmooney) [12:14:07] 10netops, 06Infrastructure-Foundations, 06SRE: Stage and configure new Juniper switches in codfw rows E/F - 
https://phabricator.wikimedia.org/T394021#10816008 (10cmooney) [12:27:19] vgutierrez: I'm going to start working on the cr3-eqsin upgrade, will depool the site shortly [12:27:49] ack [12:30:44] vgutierrez: I've never used the cookbook, that's the proper command ? `sudo cookbook --dry-run sre.dns.admin -t T364092 -r "cr3-eqsin upgrade" depool eqsin` (without dry-run, but dry-run is working fine) [12:30:45] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:31:10] yes [12:31:15] cool beans [12:36:13] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10816116 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a82fd52b-494d-4956-9f75-7cd844fe0007) set by ayounsi@cumin1002 for 2:00:00 on 1 host(s) and their servic... [12:41:39] XioNoX: it will be nice to check how gobgp handles cr3 going away [12:46:39] funny.. I was expecting `cr3-eqsin# set protocols bgp graceful-shutdown sender` to bring the BGP sessions down [12:47:36] vgutierrez: so this only tells the peers to downpref the route it learns as much as possible [12:47:42] if the peers are compatible [12:47:47] and the other way around [12:48:10] for gobgp it looks like a NOOP with the current configuration [13:06:43] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10816238 (10ssingh) Thanks @RobH! @cmooney: yeah, I updated the task description to reflect that but we though we should get this checked out anyway, since it's the integrated NIC.... [13:08:51] actually interested to see does gobgp do anything with the "graceful-shutdown sender" community? 
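For context on the knob being discussed: `graceful-shutdown sender` tags outbound routes with the well-known GRACEFUL_SHUTDOWN community (65535:0, defined in RFC 8326). A receiver that honours it needs something along these lines; this is a hedged Junos-style sketch with a made-up policy name, not the actual router config:

```
policy-options {
    /* well-known GRACEFUL_SHUTDOWN community from RFC 8326 */
    community GRACEFUL_SHUTDOWN members 65535:0;
    policy-statement IMPORT-GSHUT {
        term gshut {
            from community GRACEFUL_SHUTDOWN;
            then {
                /* deprefer routes learned from a peer being drained */
                local-preference 0;
            }
        }
    }
}
```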
[13:09:21] vgutierrez: for reference it relates to RFC8326 [13:09:33] I don't think it does anything by default [13:10:17] the idea is to give operators a simple configuration mode they can enable which will make the router it's applied on reduce local-pref on all routes it learns, plus add a specific community on all routes it sends [13:10:44] most vendors have added support so if they receive routes with this community they will reduce the local-pref automatically, it's been quite useful for us [13:11:06] XioNoX: yeah thinking it through we don't send routes to the LB so it's irrelevant [13:11:09] sorry ignore me [13:15:12] rebooting [13:19:03] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10816308 (10ssingh) a:05RobH→03BCornwall [13:22:20] back up on console waiting for it to be reachable externally [13:25:31] fully up [13:27:35] https://www.irccloud.com/pastebin/usLLoICs/ [13:28:03] and as expected the session got restored [13:28:08] https://www.irccloud.com/pastebin/hzCDqSqj/ [13:30:12] but liberica dashboard could use some love...
it doesn't show the BGP session bump [13:40:04] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10816412 (10ayounsi) [13:47:35] eqsin repooled [14:07:34] XioNoX: thx [14:53:46] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10816875 (10Jhancock.wm) @cmooney got them swapped for you [15:20:58] FIRING: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:21:20] FIRING: SLOMetricAbsent: varnish-combined drmrs - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:23:25] FIRING: SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:26:20] FIRING: [2x] SLOMetricAbsent: varnish-combined drmrs - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:28:25] FIRING: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:30:58] RESOLVED: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:31:20] RESOLVED: [2x] SLOMetricAbsent: varnish-combined drmrs - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:33:25] RESOLVED: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:33:44] 06Traffic, 06DC-Ops, 10ops-codfw, 06SRE: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - 
https://phabricator.wikimedia.org/T393968#10817471 (10Jhancock.wm) a:03Jhancock.wm idrac said DIMM_B3 was at fault. i swapped it with DIMM_A3. if reseating it did not fix the issue, we should be able to... [16:36:52] 10netops, 06DC-Ops, 06Infrastructure-Foundations: Upgrade management switches to Junos 21.4 - https://phabricator.wikimedia.org/T390814#10817487 (10Papaul) We don't have the updated version pushed to msw1-codfw (JUNOS 21.4R3.15 built 2022-09-03 07:18:28 UTC) we're supposed to use (JUNOS 21.4R3-S10.9 built 2025... [16:48:31] FIRING: SLOMetricAbsent: trafficserver-combined - https://slo.wikimedia.org/?search=trafficserver-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:50:12] So what's going on with these? [16:51:47] I remember https://phabricator.wikimedia.org/T369854 [16:53:31] RESOLVED: SLOMetricAbsent: trafficserver-combined - https://slo.wikimedia.org/?search=trafficserver-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:54:02] re-opened the ticket [17:00:59] 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): LVS hosts: Monitor/alert when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702#10817615 (10bking) 05Open→03Resolved a:03bking Per IRC conversation with @cmooney, it seems that once all LVS poo... [17:19:07] 10netops, 06DC-Ops, 06Infrastructure-Foundations: Upgrade management switches to Junos 21.4 - https://phabricator.wikimedia.org/T390814#10817789 (10Papaul) 05Open→03Resolved This is complete [17:29:19] 06Traffic, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887#10817838 (10bd808) If I'm understanding @ssingh's comments...
[17:30:12] 06Traffic, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887#10817841 (10bd808) [17:35:36] 06Traffic, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887#10817876 (10ssingh) >>! In T358887#10817838, @bd808 wrote:... [17:44:40] 06Traffic, 06Data-Engineering-Radar, 10Observability-Logging: Shutdown varnishkafka instances - https://phabricator.wikimedia.org/T393772#10817926 (10Ahoelzl) [17:52:13] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10817962 (10cmooney) 05Open→03Resolved a:03cmooney Super @Jhancock.wm that all looks good now and links are working :) ` c... [17:53:59] 06Traffic, 06Data-Engineering-Radar, 10Observability-Logging: Shutdown varnishkafka instances - https://phabricator.wikimedia.org/T393772#10817970 (10Fabfur) Sorry the first date has been postponed to tomorrow (May 13th) [17:56:05] Hi Fabfur, I noticed the Varnishkafka hosts are about to be shut-down. Do you know if they could be related to this issue? 
T394080 [17:56:06] T394080: Duplicate SLOMetricAbsent rules generated by Pyrra for varnish-combined-eqsin-cache_upload.yaml - https://phabricator.wikimedia.org/T394080 [17:56:36] fabfur* [17:56:46] hi denisse, I don't think that this activity could be related, we still haven't start it [17:56:52] 06Traffic, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887#10817983 (10bd808) >>! In T358887#10817876, @ssingh wrote:... [17:57:08] so the varnishkafka is still active and sending data everywhere [17:58:02] Thank you, I didn't ask the right question. I was mostly wondering if that alert is related to the varnishkafka hosts to see whether or not I should work on it or prioritize something else (in case the hosts are going to be shutdown, the duplicate alert would be gone). [17:58:08] 06Traffic, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887#10817987 (10ssingh) >>! In T358887#10817983, @bd808 wrote:... [17:59:14] I'm mostly asking because I'm unfamiliar with the varnish and varnishkafka hosts. 
[18:00:27] denisse: Note that the search team is also getting those alerts [18:01:16] 06Traffic, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887#10818002 (10ssingh) a:03ssingh [19:25:28] 06Traffic: Update libvmod-netmapper to 1.10 - https://phabricator.wikimedia.org/T392533#10818210 (10BCornwall) 05In progress→03Resolved [19:25:42] 06Traffic: Move varnish pseudo-headers to vmod_var variables - https://phabricator.wikimedia.org/T373550#10818213 (10BCornwall) 05Open→03In progress [21:28:13] 06Traffic, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 07fr-acoustic: Acoustic SMS: Domain needed for short links - https://phabricator.wikimedia.org/T379318#10818585 (10greg) @ssingh update on our end. We identified the domain we'd like to use and Chuck R from Legal has already acquired it for this... [21:49:59] sukhe just a heads-up on the patches we talked about last week: https://gerrit.wikimedia.org/r/c/operations/dns/+/1145276 https://gerrit.wikimedia.org/r/c/operations/dns/+/1145277 . No time pressure here, we won't be ready until tomorrow anyway [21:50:53] err...I guess that was yesterday? Me is losing track of time [21:54:00] I've also cc'd you on the related puppet patches, such as https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145278 [22:21:55] inflatador: happy to look tomorrow (and will do)