[03:31:17] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10194593 (10Papaul) @ayounsi I will soon be setting up interfaces and assigning them to VLAN's. I wanted to know if we are... [03:42:55] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10194601 (10Papaul) @ayounsi @cmooney I have been working on the migration process and put together the proposal below. I also had a meetin... [04:01:01] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10194641 (10Papaul) [06:34:03] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10194752 (10Aklapper) > the line mentioning that enwp.org is in widespread use. If it is (it'd be good to see some stats If nobody provides such stats I again propose to decline this task. Folks are welcome to use https... [07:09:37] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10194783 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94132346-5cb8-4ed8-b2f6-868a8962928b) set by vgutierrez@cumin1002 for 4:00:00 on 2 host(s) and their services with reaso... 
[08:06:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3066&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [08:06:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp3067:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [08:07:38] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3066 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS [08:11:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3066&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [08:11:40] FIRING: [3x] VarnishHighThreadCount: Varnish's thread count on cp3067:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [08:12:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3066 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS [08:16:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp3067:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - 
https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [08:21:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp3067:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [08:36:40] RESOLVED: [2x] VarnishHighThreadCount: Varnish's thread count on cp3070:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [08:56:15] topranks: can I pick your brain a few seconds? [08:56:27] or peek lo [08:56:29] lol [08:56:41] vgutierrez: just be gentle :P [08:56:46] hahaha [08:57:20] so apparently gobgp doesn't track in memory details about exported prefixes [08:58:08] hmm... it ought to have some sort of structure for the adj-rib-out (the formal name for that afaik) [08:58:12] hmm scratch that [08:58:16] yeah [08:58:19] https://www.irccloud.com/pastebin/ZO6D6LmW/ [08:58:38] so I do have the AS_PATH prepending on adj-out [08:59:04] that's actually pretty cool [08:59:20] on many routing platforms it doesn't show the full properties of the routes _after_ policy [08:59:26] given that AS PATH is applied per prefix [08:59:27] i.e. if you have a pre-pend set you can't see it [08:59:44] if a liberica instance gets promoted or demoted [08:59:51] should the as-path here not be empty though? [09:00:05] I need to withdraw all the prefixes and send them again [09:00:13] as-path is not a good way to do this in our infra - as we use EBGP internally these days [09:00:36] so the "shorter" as-path may still seem less-good from a switch at the other side of the network [09:00:51] we should use an agreed set of communities to express any kind of preference if we need it [09:01:06] oh ok [09:01:15] I went with AS prepending based on the liberica design doc [09:01:19] and a comment you made there [09:01:37] what would be the desired way of implementing it then? 
[09:03:04] best way is to attach a community string (tag basically) to the routes to express intent [09:03:12] and then on the switches we can have a policy to match them [09:03:38] the communities stay on the routes wherever they go, so say a "backup" route can be seen as such regardless of what device and what location on the network it is [09:03:44] this is kind of what we do now for lvs [09:03:44] https://phabricator.wikimedia.org/T348446#9254207 [09:04:09] also somewhat relates to this task: https://phabricator.wikimedia.org/T354839 [09:04:25] what we would need to do is agree on a set of "states" we want to support [09:04:50] something like preferred/normal/backup or similar [09:05:03] and then we can agree on a BGP community string to use to represent each [09:05:37] the problem with as-paths is they get modified as the routes propagate through the network [09:06:06] so say on lsw1-b2-codfw it might receive a route from a Liberica host with as-path "64600 64600 64600 64600" [09:06:13] that should be backup and not used [09:06:36] but the route it sees from the primary might be "64600 " [09:06:47] so actually is longer at that point [09:06:56] gotcha [09:07:11] so for lvs you set a community string based on MED [09:07:22] yeah, that is a bit of a hack [09:07:25] with liberica we should set the community string directly [09:07:35] it would be better for PyBal to set the community, but we didn't want to touch the PyBal code [09:07:36] and let the switches take care of the rest, right? [09:07:42] yep exactly [09:07:50] cool... I'll work on that [09:08:51] Our current setup just has a single community in the policy [09:08:59] 14907:0 - which we call 'avoided path' [09:09:06] so that can be used [09:09:24] but if we want to add more levels, or a "preferred path" to be above "normal" or something we can also do that [09:09:31] so 14907:0 is the community that a backup instance should send? 
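Editor's sketch of the AS-path problem described above. This is a toy model, not switch configuration: the transit ASNs 65001-65004 and the local-pref values are made up for illustration, and only 64600 and 14907:0 come from the log. It shows a prepended backup route winning on raw AS-path length at a nearby switch, and a community-matching policy picking the primary.

```python
from dataclasses import dataclass, field

AVOIDED_PATH = "14907:0"  # the 'avoided path' community named in the log


@dataclass
class Route:
    name: str
    as_path: list
    communities: set = field(default_factory=set)


def best_by_as_path(routes):
    """Pick the route with the shortest AS path, as if no policy applied."""
    return min(routes, key=lambda r: len(r.as_path))


def best_by_community(routes, default_pref=100, avoided_pref=50):
    """Hypothetical switch policy: tagged routes get a lower local-pref.

    Local-pref is compared before AS-path length, so the higher
    preference wins first and path length only breaks ties.
    """
    def sort_key(route):
        tagged = AVOIDED_PATH in route.communities
        pref = avoided_pref if tagged else default_pref
        return (-pref, len(route.as_path))

    return min(routes, key=sort_key)


# View from a switch next to the backup host. The primary's route has
# crossed several internal EBGP hops (65001-65004 are invented ASNs).
primary_far = Route("primary", [65001, 65002, 65003, 65004, 64600])

# Backup signalled by prepending: still shorter than the far primary.
backup_prepended = Route("backup", [64600, 64600, 64600, 64600])
prepend_winner = best_by_as_path([primary_far, backup_prepended])

# Backup signalled by community: the tag survives every hop.
backup_tagged = Route("backup", [64600], {AVOIDED_PATH})
community_winner = best_by_community([primary_far, backup_tagged])
```

In this model `prepend_winner` is the backup route, which is the failure mode described in the log, while `community_winner` is the primary regardless of how many hops its path has accumulated.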
[09:09:39] yep [09:12:22] just replied to your comment on the liberica doc to reflect the current approach [09:24:12] vgutierrez: ah ok, I'd forgotten about that one [09:24:17] also followed up with a comment [09:25:27] cheers [09:25:54] it shouldn't be a huge change on liberica code [09:26:21] I'm guessing I need to modify the current policy to set a community rather than setting AS prepending [09:27:38] yeah.. it looks like that [09:37:00] ok [09:37:12] we will also need to adjust the policy on the routers/switches [09:37:32] do you have a live session working right now from one of the test hosts? [09:37:42] nope [09:37:53] I'm testing with a pair of gobgpd instances [09:38:00] one running on 127.0.0.1 and the other one on 127.0.0.2 [09:38:10] ha nice [09:38:19] I'll let you know as soon as I'm ready to talk to a big metal box [09:38:30] yeah, it's not at all tricky to do will only take a few mins [09:38:48] makes sense we create a new bgp group for the junipers and the appropriate policy [09:38:54] whenever you're ready just let me know [09:49:58] === RUN TestEnableCommunities [09:49:58] --- PASS: TestEnableCommunities (1.03s) [09:49:58] === RUN TestDisableCommunities [09:49:58] --- PASS: TestDisableCommunities (1.02s) [09:50:11] at least gobgpd in ::1 is happy about it [09:53:29] \o/ [09:53:38] policy is as idiotic as Community: add[14907:0] [09:54:00] cool! no need to make it complex :) [09:54:02] totally configurable of course, we will have a list of communities [09:54:06] on the config file [10:36:45] 06Traffic: Test liberica BGP support - https://phabricator.wikimedia.org/T375464#10195341 (10cmooney) >>! In T375464#10170262, @ayounsi wrote: >> it should be enough to add lvs1013 to the list of eqiad lvs_neighbors in homer or do we need something on top of that? > > In eqiad they're still configured manually... [11:02:09] https://www.irccloud.com/pastebin/G89gUGcD/ [11:02:15] topranks: ^^ [11:02:44] nice! 
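Editor's sketch of what an "add community" action with a configurable list amounts to. A standard BGP community is a 32-bit value written as two 16-bit halves, so 14907:0 is 14907 in the high half and 0 in the low half. The function names and the `configured` list are hypothetical and are not taken from the liberica or gobgp code.

```python
def encode_community(text):
    """Encode 'ASN:value' as the 32-bit integer form of a standard community."""
    high, low = (int(part) for part in text.split(":"))
    if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError(f"community out of range: {text}")
    return (high << 16) | low


def decode_community(value):
    """Turn the 32-bit integer back into 'ASN:value' notation."""
    return f"{value >> 16}:{value & 0xFFFF}"


def apply_add_policy(existing, configured):
    """Model an 'add' action: keep existing communities, append missing ones."""
    result = list(existing)
    for community in configured:
        if community not in result:
            result.append(community)
    return result


# Hypothetical stand-in for the list of communities in the config file.
configured = ["14907:0"]

encoded = [encode_community(c) for c in configured]
round_trip = [decode_community(v) for v in encoded]
after_policy = apply_add_policy([], configured)
```

Applying the action twice leaves the list unchanged, which is the behaviour one would want when the same policy is re-evaluated on a resend.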
[11:03:06] and the receiving peer sees the communities as well [11:03:10] https://www.irccloud.com/pastebin/0IYZ6Tgd/ [11:03:33] great to be able to see that - like I said many network vendors don't let you see the modified prefixes post-policy [11:03:57] memory/performance concerns I guess [11:04:13] cool, the second paste is the received routes on the other gobgpd process? [11:04:18] yes [11:04:24] nice, looks correct to me! [11:05:31] so.. given this is applied by prefix if liberica changes its communities configuration it must withdraw the old prefixes and send them again, right? [11:06:04] no I don't think so, a normal BGP announce message is actually called UPDATE [11:06:22] if the attribute changes liberica can just send a new UPDATE for the same prefix, with the newer attributes [11:06:28] so I need to see how I can trigger than update [11:06:29] which the switch then processes [11:06:30] *that [11:06:59] the policy is defined in a file on disk right? [11:07:24] on a juniper or cisco it normally does it as soon as you modify the policy, but that's slightly different [11:08:14] nope.. policy is just sent via the grpc API [11:08:35] ok... well perhaps it will apply it as soon as it's sent then? [11:08:51] gobgpd configuration is the bare minimum required to get gobgpd up and running [11:08:53] with bird - where you change a config file - you can issue "configure" to it to tell it to re-read the file [11:09:01] https://www.irccloud.com/pastebin/KgJQD0lm/ [11:09:11] as well as "reload out " to force it to re-send routes with new policy [11:09:29] yeah.. probably a softreset of the neighbor should be enough [11:09:38] ok yep.. 
so conf file is unlikely to change [11:09:43] current concern is https://github.com/osrg/gobgp/blob/master/docs/sources/policy.md#policy-and-soft-reset [11:09:55] yeah 'softreset' sounds like using route-refresh to send updates to all routes [11:09:58] When you change an import policy and reset the inbound routing table (aka soft reset in), a withdraw for a route rejected by the latest import policies will be sent to peers. However, when you change an export policy and reset the outbound routing table (aka soft reset out), even if a route is rejected by the latest export policies, a withdraw for the route will not be sent. [11:11:46] the first part of that is fine, relates to routes being learnt inbound and propagated to other bgp routers - not a scenario we have [11:11:58] the second line doesn't make for great reading :( [11:12:51] effectively they are saying that changes to outbound policy - at least changes that remove some routes that were announced by the previous policy - do not cause a WITHDRAW and thus don't properly take effect with a soft reset [11:13:11] which I guess means it needs a protocol reset, i.e. BGP session teardown/re-establishment :( [11:14:14] so... [11:14:22] that's probably not relevant for the community discussion [11:14:22] on policy update [11:14:38] it will send UPDATE with the community if the policy is modified to add community to a route [11:15:10] but if we modify the policy to not allow certain routes - previously allowed - then it won't take effect as the router will still have the old route [11:16:35] sorry.. got interrupted by a phone call :) [11:16:53] on policy update, gobgp already reflects the community update on its local rib table [11:17:22] the other side still reports the community [11:18:08] a softresetout doesn't seem to fix that [11:18:43] the example policy update in this case [11:18:46] sorry [11:18:50] it changes the policy to include the community?
[11:18:55] it fixes the issue [11:18:55] or changes it to remove the community? [11:19:05] it should yes [11:19:08] removes/updates the policy [11:19:12] soft reset out re-sends ALL routes [11:19:19] receiving side reports this [11:19:21] and if a community should be on one it should send that [11:19:35] https://www.irccloud.com/pastebin/OFLFRTUh/ [11:19:44] "Implicit withdrawal of old path, since we have learned new path from the same peer" [11:19:51] the problem, as I read it, is they don't compare the previous policy to the new one, calculate what routes were previously announced but not by the new policy, and generate WITHDRAWS [11:20:01] so it performs an BGP UPDATE as expected [11:20:13] yep [11:20:24] cool.. I need to teach liberica how to do that :) [11:20:29] the question in my mind is whether the soft reset is needed at all, or if it will send an UPDATE when policy changes anyway [11:20:31] (the softreset) [11:20:42] topranks: it's needed, it's not implicit [11:20:50] ok [11:21:04] that really sucks if one were to use this as say an internet router [11:21:18] one community added to one route - we need soft reset and to resend 1 million routes :( [11:21:23] not an issue here thankfully [11:21:50] well.. the only difference is triggering the update manually or automatically on policy update, right? [11:22:14] most bgp speakers work that if the route changes in the local RIB (i.e. 
any attributes change), it'll send an UPDATE to anyone it was announcing it to [11:22:32] gobgp sends the UPDATE message [11:22:38] but only when you ask it to do it [11:22:44] no the difference is a soft reset out is telling it to resend everything - it means "resend all routes to this neighbor" [11:23:12] it seems to be smart in that [11:23:14] at least in the bgp implementations I'm familiar with that's what soft out means [11:23:21] see that the receiving side reports an update message [11:23:22] {"Key":"127.0.0.1","Topic":"Peer","attributes":[{"type":1,"value":0},{"type":2,"as_paths":[{"segment_type":2,"num":1,"asns":[64600]}]},{"type":3,"nexthop":"10.64.130.16"}],"level":"debug","msg":"received update","nlri":[{"prefix":"208.80.154.232/32"}],"time":"2024-10-02T11:18:26Z","withdrawals":[]} [11:23:27] let me get a traffic capture on port 179 :) [11:23:50] in this case it's no issue anyway, we don't have too many routes so processing all the UPDATES on the switch is fine [11:28:32] UPDATE message to add 14907:0 https://usercontent.irccloud-cdn.com/file/lKYELQYB/image.png [11:28:40] slide 68 here seems to allude to this: https://blog.netravnen.com/wp-content/uploads/2019/08/ixbrforum10day3gobgptutorial-161205210258.pdf [11:29:11] from 2016 though [11:29:48] traffic capture seems to confirm that's sending an UPDATE message rather than withdrawing all the prefixes [11:31:30] yeah the docs/code are clear that it will never send any WITHDRAWs on an export policy change + soft reset [11:31:45] that's a big issue if we change the policy to not send routes we previously were [11:31:55] but not relevant to the community [11:32:10] as the route is allowed (now with community, or now without community) and the soft reset resends all routes [11:33:04] the inefficiency is that a softresetout - resending all routes - is required before it will send an UPDATE if the export policy has changed [11:33:37] on say a Juniper it will process the new policy, and send UPDATEs for any 
routes that have attributes different from before [11:33:54] reading the docs/some github issues that does make sense though [11:34:43] I don't think an issue for us here as the number of routes are small though [11:34:47] For gobgpd it looks like the softresetout is smarter than juniper's one [11:34:54] eh..... [11:35:04] Cause it's sending the update and not withdrawing the routes first [11:35:15] if by "smarter" you mean "practically unworkable" then yeah :P [11:35:22] Uh? [11:35:23] Juniper wouldn't send a WITHDRAW would it? [11:35:51] routing breaks if you no longer wish to announce a route - but fail to tell your neighbor to WITHDRAW it [11:35:59] Right, gobgp isn't sending it either [11:36:24] gobgp doesn't store the previous set of what it was doing, to compare to what the new policy says, and only send the appropriate WITHDRAWs/UPDATEs to implement the new policy [11:36:46] it can only be told "this is the new policy" and then do softresetout, which sends everything the new policy says to [11:37:00] Hmm it actually stores the exported communities [11:37:21] I think that the only thing that doesn't track is the list of rejected prefixes by the other side [11:38:01] no router can ever know if the other side accepted it [11:38:09] this is the thread I was basing that on mostly: https://github.com/osrg/gobgp/pull/1910 [11:38:19] for our use it seems ok [11:38:42] 1) we have a small number of routes so requiring a softresetout is not a big deal (not huge headache for switch to receive everything again) [11:39:04] 2) we don't expect to stop announcing routes do we? [11:39:10] ^^^ this second one I'm not sure on [11:39:21] if a health-check fails we want to stop announcing the route presumably? 
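Editor's sketch contrasting the two behaviours discussed above: a soft reset out that resends whatever the new export policy allows, and a speaker that remembers what it sent and emits only the difference. This is a simplified model of the behaviour as described in the log and the linked gobgp docs, not gobgp code. The prefix 192.0.2.1/32 (documentation range) and both policy functions are invented for illustration.

```python
def soft_reset_out(rib, export_policy):
    """Resend every route the new policy allows; never emits WITHDRAW."""
    messages = []
    for prefix, communities in rib.items():
        exported = export_policy(prefix, communities)
        if exported is not None:
            messages.append(("UPDATE", prefix, exported))
    return messages


def diff_based_export(previous_out, rib, export_policy):
    """Compare against what was sent before and emit only the changes."""
    messages = []
    new_out = {}
    for prefix, communities in rib.items():
        exported = export_policy(prefix, communities)
        if exported is None:
            continue
        new_out[prefix] = exported
        if previous_out.get(prefix) != exported:
            messages.append(("UPDATE", prefix, exported))
    for prefix in previous_out:
        if prefix not in new_out:
            messages.append(("WITHDRAW", prefix, None))
    return messages, new_out


def old_policy(prefix, communities):
    """Allow everything unchanged."""
    return communities


def new_policy(prefix, communities):
    """Tag routes with 14907:0, but reject one previously allowed prefix."""
    if prefix == "192.0.2.1/32":
        return None
    return communities + ("14907:0",)


rib = {"208.80.154.232/32": (), "192.0.2.1/32": ()}
previously_sent = {p: old_policy(p, c) for p, c in rib.items()}

reset_messages = soft_reset_out(rib, new_policy)
diff_messages, now_sent = diff_based_export(previously_sent, rib, new_policy)
```

In this model `reset_messages` carries the community change but says nothing about 192.0.2.1/32, so the peer would keep the stale route. `diff_messages` contains both the UPDATE and the WITHDRAW.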
[11:40:13] For anycast services we will stop announcing routes [11:40:21] ok [11:40:27] that's a headache then [11:40:46] And of course if a service gets decommissioned we will stop announcing its route [11:41:19] The only way we can get the previously-announced route removed from the switch routing table is to tear down the BGP session (which will cause the switch to remove all routes learnt over it) [11:41:39] then restart it, and gobgpd no longer sends the route so switch doesn't have it in its table [11:41:46] but that's quite disruptive [11:42:00] I don't believe that's the case [11:42:32] A withdraw of that specific route should be enough? [11:42:32] ah - maybe that's only if we block route by policy? [11:42:55] yes to do what we need without disruption a WITHDRAW needs to be sent [11:43:03] what action do we take when a health-check fails? [11:43:25] if we remove the VIP from the system loopback or something that probably causes a WITHDRAW to get sent [11:48:45] Hopefully we won't be something drastic like that [11:48:59] *won't need [11:49:30] Ok [11:49:52] it's clear that we can't simply filter the outbound routes in our policy and make it disappear on the switch side [11:49:57] without resetting the session [11:50:12] so we need some mechanism for gobgpd to consider the route invalid or "not there" [11:50:30] causing a normal WITHDRAW outside of the policy framework [11:51:21] perhaps using a policy on what is imported from the kernel-routing table to gobgpd rib would have the same effect [11:53:56] "ip addr del dev lo" doesn't seem like a bad approach to me however [12:04:20] gobgpd doesn't track system interfaces, at least not by default [12:05:29] Also removing the VIP from lo puts the load balancer in a state where it's unable to handle traffic for that service [12:06:51] Removing the path from gobgpd should be enough to trigger that withdraw [12:06:57] I'll test it after lunch [12:34:18] cool yep that’s what we need to do whatever way it’s implemented
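Editor's sketch of the approach proposed above: on a failed health check, remove the path from the speaker's RIB so a normal WITHDRAW goes out, instead of filtering it in the export policy. `TinySpeaker` and `on_health_check` are hypothetical names for a toy in-memory model and do not correspond to the gobgp API or to liberica code.

```python
class TinySpeaker:
    """Toy stand-in for a BGP speaker's local RIB and its outgoing messages."""

    def __init__(self):
        self.rib = {}
        self.sent = []

    def add_path(self, prefix, communities=()):
        """Install the path locally and announce it with an UPDATE."""
        self.rib[prefix] = tuple(communities)
        self.sent.append(("UPDATE", prefix, tuple(communities)))

    def delete_path(self, prefix):
        """Remove the path locally; a WITHDRAW is sent only if it existed."""
        if self.rib.pop(prefix, None) is not None:
            self.sent.append(("WITHDRAW", prefix))


def on_health_check(speaker, prefix, healthy, communities=()):
    """Announce on recovery, withdraw on failure, do nothing if unchanged."""
    announced = prefix in speaker.rib
    if healthy and not announced:
        speaker.add_path(prefix, communities)
    elif not healthy and announced:
        speaker.delete_path(prefix)


vip = "208.80.154.232/32"  # prefix taken from the debug line in the log
speaker = TinySpeaker()
on_health_check(speaker, vip, healthy=True)
on_health_check(speaker, vip, healthy=False)
on_health_check(speaker, vip, healthy=False)  # repeated failure: no new message
on_health_check(speaker, vip, healthy=True, communities=("14907:0",))
```

The VIP stays configured on the host throughout, so only the announcement changes and the load balancer can still handle traffic that reaches it.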
[12:34:36] if the source of the route isn’t from the system interface then yeah, that’s not gonna work :D [12:38:35] 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10195673 (10aborrero) >>! In T375847#10187153, @cmooney wrote: > @aborrero the network assignment is incorrect also. > [[ https://netbo... [12:46:16] 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10195699 (10aborrero) I guess next bits to test with neutron would be to enable north-south traffic, meaning working on these two ticke... [12:49:58] 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10195719 (10cmooney) >>! In T375847#10195699, @aborrero wrote: > I guess next bits to test with neutron would be to enable north-south... [12:58:24] 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10195740 (10cmooney) At a high level I think we need to: * Create an aggregate policy on //cloudsw1-b1-codfw// to generate 2a02:ec80:a100::/48 if par... [12:58:50] topranks: nope.. 
the source of the route is an API call [12:59:53] yeah, so if you use an equivalent API call to remove it that should work fine [13:00:14] think that means that overall we are good here [13:00:21] if we change the export policy we need to do a softresetout [13:00:41] but we are not using the export policy to block routes, so lack of WITHDRAW in that case doesn't affect us [13:02:04] nope, no such thing so far [13:21:23] topranks: removing the path triggers the WITHDRAW as expected [13:21:38] great :) [13:22:57] and as expected re-adding it triggers an UPDATE with the new prefix [13:29:00] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10195922 (10Dzahn) One domain/line more in an existing list like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1069643/12/modules/ncredir/files/nc_redirects.dat won't make a big difference either way. But we als... [14:18:19] 06Traffic, 06DC-Ops, 10ops-esams, 10ops-magru, 06SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10196133 (10ssingh) Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks! [15:22:23] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10196335 (10DLynch) > If nobody provides such stats I again propose to decline this task. Folks are welcome to use https://w.wiki/ instead. Not really comprehensive, but just scanning [these google results](https://www... [15:27:37] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10196360 (10violetwtf) Popping in to mention that I haven't spoken to Thomas since March 2023 when I first opened this thread. Happy to reach back out if WMF reaches a decision to take the domain though. As of then, h... 
[15:32:44] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291 (10cmooney) 03NEW p:05Triage→03Medium [15:32:47] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196414 (10cmooney) [15:32:53] 06Traffic, 10Prod-Kubernetes, 06serviceops, 07Kubernetes, 13Patch-For-Review: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171#10196415 (10cmooney) [15:38:51] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196446 (10cmooney) [15:39:48] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196447 (10cmooney) [15:40:55] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196467 (10cmooney) [15:42:24] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196463 (10Volans) Is there plan to try to get away from the very long hardcoded lists in hiera? How often do you expect the data to change? This mi... [15:48:33] 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10196558 (10aborrero) here is a proposal: * 2a02:ec80:a100:fe01::/64 - cr1-codfw uplink * 2a02:ec80:a100:fe02::/64 - cr2-codfw uplink * 2a02:ec80:a10... [15:49:03] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196562 (10cmooney) >>! 
In T376291#10196463, @Volans wrote: > Is there plan to try to get away from the very long hardcoded lists in hiera? I'm mor... [15:49:46] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196564 (10cmooney) [15:50:29] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196561 (10CDanis) >>! In T376291#10196463, @Volans wrote: > Is there plan to try to get away from the very long hardcoded lists in hiera? No idea... [15:58:12] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10196613 (10RobH) [15:59:25] 06Traffic, 06DC-Ops, 10ops-esams, 10ops-magru, 06SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10196625 (10RobH) >>! In T373993#10196133, @ssingh wrote: > Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks! The panels we... [15:59:49] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10196610 (10RobH) 05Open→03Resolved Thanks @Vgutierrez for the assist, I was ready to go to bed and they took over supporting the remote tech doing the cpu thermal paste swaps. This is no... [16:03:36] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10196649 (10ayounsi) No interface range as each switch will be independent. 
[16:04:19] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10196661 (10Papaul) Thanks [16:06:54] 06Traffic, 10Prod-Kubernetes, 06serviceops, 07Kubernetes, 13Patch-For-Review: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171#10196688 (10CDanis) OK, one weird issue I've found which is confounding but not fatal: the NodePort isn't working on v6. ` Capturing on 'eno1' 1 0.00... [18:24:58] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197377 (10Dzahn) >>! In T332220#10195922, @Dzahn wrote: > One more domain/line .. won't make a big difference I have to add an important part here. Redirecting a domain (for example the various typo domains) to the... [18:25:44] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197378 (10BCornwall) > Not really comprehensive, but just scanning [these google results](https://www.google.com/search?q=%22enwp.org%22) I see a decent amount of usage of it across a range of applications, not just e... [18:28:05] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197381 (10Dzahn) >>! In T332220#10196335, @DLynch wrote: > I see a number of academic papers using it. This could be seen as unfortunate but to me it's a very good pro argument to take it over and ensure it keeps wo... [18:33:15] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197398 (10violetwtf) > Folks are welcome to use https://w.wiki/ instead. > But using it as an active URL shortener AND not breaking existing URLs that are already in use is a whole project that isn't that cheap. When... 
[18:41:23] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197419 (10Dzahn) > enwp.org does not operate like w.wiki and covers a separate use-case. Thanks! That's an important distinction. Indeed, if it's possible to rewrite everything just with a simple rewrite rule from... [19:04:42] 06Traffic, 06Movement-Insights: Investigating unique devices traffic data - https://phabricator.wikimedia.org/T375562#10197533 (10Hghani) @Vgutierrez Just an FYI if you have any thoughts on the above findings? [19:09:30] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197554 (10DLynch) Yeah, calling it a "shortener service" is very misleading really. In practice it's literally just a way to not have to type out "en.wikipedia.org/wiki" because you can replace it with "enwp.org". The... [19:14:29] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197583 (10Dzahn) Just keep in mind that, as far as I can tell, we wouldn't want the combination where WMF owns the domain while it points to Thomas' servers. So I think it's either Thomas keeps running the service a... [19:33:09] 06Traffic: Write a cookbook that performs a rolling restart of the pdns-recursor service on the DNS hosts - https://phabricator.wikimedia.org/T374891#10197653 (10CDobbins) The above change has a cookbook that's used for both haproxy and pdns-recursor. [19:34:20] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197663 (10BCornwall) Sorry for the confusion, and thanks for pointing out that the domain is a simple redirection and not a shortener. What a detail to miss! Indeed, this should be simple enough to fit into our infras... 
[19:34:24] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197665 (10BCornwall) 05Open→03In progress [19:43:57] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10197677 (10ssingh) Thanks for filing this task! 1. So it seems like there is a possibility that this list (or rather, these lists) can be maintaine... [19:48:44] 10Domains, 06Traffic, 06SRE, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197699 (10violetwtf) I've reached out to Thomas and will notify here if/when I get a reply. I've also made a WMF developer account to comment on Gerrit to ensure we support c.enwp.org, which was... [19:51:13] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10197703 (10ssingh) Re: the point regarding snippets and using `INCLUDE`: I think that's not optional anyway -- we have to keep only one `10.in-addr.... [19:53:04] 10Domains, 06Traffic, 06SRE, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197705 (10BCornwall) I've updated the CR to include c.enwp.org. [19:54:33] 10Domains, 06Traffic, 06SRE, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197706 (10BCornwall) [20:16:21] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10197756 (10Papaul) The recommended Junos version for srx300 is 23.4R2-S2 as for 2024-9-10. Are going for version 23 or 22? [21:04:33] 06Traffic, 06Infrastructure-Foundations, 06SRE: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10197890 (10cmooney) >>! 
In T376291#10197677, @ssingh wrote: > * It seem the network data in `dns_reverse_zones.yaml` and the corresponding reverse... [22:28:39] 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10198161 (10cmooney) That seems fine to me @aborrero thanks! [23:39:50] 10netops, 06Infrastructure-Foundations, 06SRE-OnFire, 10Sustainability (Incident Followup): Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10198393 (10Dzahn)