[08:52:25] netops, Infrastructure-Foundations: Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005 (ayounsi) NEW
[08:52:38] netops, Infrastructure-Foundations: Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10186833 (ayounsi)
[09:02:59] netops, Infrastructure-Foundations: Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10186931 (cmooney) > We should write a cookbook that iterate over all network devices and run the safe command request system configuration rescue save > Anot...
[09:55:44] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10187090 (cmooney) >>! In T375847#10182667, @aborrero wrote: > I see the dhcp6 packets from my test VM arriving into neutron: > > `...
[10:06:06] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10187153 (cmooney) @aborrero the network assignment is incorrect also. 2a02:ec80:a100::/56 is the entire public IPv6 allocation for...
[10:34:58] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10187215 (cmooney) Guys I would propose the following: * We delegate the allocated 'public' and 'private' ranges to the codf...
[11:49:41] netops, Infrastructure-Foundations: Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10187345 (ayounsi) From JTAC : > You can periodically take a vmhost snapshot of the device to avoid losing configurations. > On the device, back up the snapsho...
[12:58:58] netops, Infrastructure-Foundations: Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10187510 (cmooney) Ok. Well that's not something we can realistically do after every homer run. Cookbook might be an idea? Although the interactive yes/no p...
[14:04:02] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10187693 (Jhancock.wm) Open→Resolved taken care of!
[15:02:47] netops, Infrastructure-Foundations, SRE: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10187945 (Papaul)
[15:16:39] netops, Infrastructure-Foundations, SRE: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10188045 (ssingh) I am assuming this means a site depool during this period?
[15:20:57] netops, Infrastructure-Foundations: Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10188101 (ssingh) Question: > Cookbook might be an idea? Although the interactive yes/no prompt could be annoying to deal with. The 'yes/no' prompts I am gu...
[15:33:35] netops, Infrastructure-Foundations, SRE: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10188140 (Papaul) @ssingh no site depool only management will not be available during the maintenance
[15:34:01] netops, Infrastructure-Foundations, SRE: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10188142 (ssingh) >>! In T369504#10188140, @Papaul wrote: > @ssingh no site depool only management will not be available during the maintenance Thanks Papaul!
[15:44:21] Traffic, Patch-For-Review: Support RFC 8914 [Extended DNS Errors] in internal recursors - https://phabricator.wikimedia.org/T375414#10188182 (ssingh) Open→Resolved a:ssingh This has been rolled out to all DNS hosts.
[16:29:25] netops, Infrastructure-Foundations: Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10188431 (Dzahn) I am wondering if you can (one-time) configure the devices to "never boot rescue" config and then not have to worry about running the "save" c...
[16:55:30] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10188560 (RobH) Entered ticket 00981959 for this swap to take place today: > Support, > > We would like to request remote hands to assist in retriving a s...
[17:15:09] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10188649 (RobH) Ticket accepted and they've retrieved the replacement shipment, should get pinged from them shortly to start the swap.
[18:02:49] FIRING: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs4008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[18:03:01] ^ ok to ignore, ulsfo is depooled and cr is being replaced
[18:03:02] I will downtime
[18:05:01] thanks
[18:27:35] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10189124 (RobH) Swap completed and @papaul confirms they can attach via serial console. The onsite portion of this troubleshooting and repair should now be...
[20:21:14] Traffic, Movement-Insights: Investigating unique devices traffic data - https://phabricator.wikimedia.org/T375562#10189598 (Hghani) Another observation looking at the unique devices data: We notice there are a lot of actor signatures that don't appear as automated traffic (i.e., not triggering threshold...
[21:18:28] FIRING: NodeTextfileStale: Stale textfile for ncredir1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:42:50] sukhe, ChrisDobbins901_ any of you tested the ferm prometheus exporter on ncredir1001?
[21:43:51] Yes, the other day. I forgot to comment out the part that writes the file
[21:44:07] Ok.. that explains the alert
[21:44:17] Could you delete the offending file?
[21:44:27] Yes, will do. Sorry about that
[21:48:01] sorry folks just saw
[21:48:03] thanks!
[21:49:29] done
[21:52:26] thx
[21:53:28] RESOLVED: NodeTextfileStale: Stale textfile for ncredir1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:50:50] FIRING: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs4008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[23:55:49] RESOLVED: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs4008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable