[10:35:17] netops, Operations, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (aborrero) >>! In T245606#5910457, @Krenair wrote: > I've been reading the linked proposal and noticed this: > "the internal flat network CIDR. This is 1...
[10:54:44] Traffic, Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez)
[10:55:03] Traffic, Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez) p:Triage→Medium
[10:55:12] Traffic, Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez)
[10:55:15] Traffic, Operations, Pybal: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (Vgutierrez)
[12:07:02] netops, Operations: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (ayounsi) Open→Resolved a:ayounsi All BGP are now re-enabled.
[12:24:59] Traffic, Operations, Pybal: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (Vgutierrez) Open→Resolved a:Vgutierrez
[12:25:22] Traffic, Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez)
[12:28:51] netops, Operations, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (aborrero) NOTE: apparently the neutron BGP implementation doesn't support ingesting routes using BGP, only advertising. In our neutron setup, the defaul...
[12:47:04] netops, Operations, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (ayounsi) For the routers to cloudnet hosts traffic, we should only establish the BGP sessions over the transport network, doing it over the hosts vlan w...
[14:55:26] netops, Operations: Prefering AS13030 instead of AS13335 - https://phabricator.wikimedia.org/T245998 (jbond) p:Triage→Medium
[14:56:03] netops, Operations: Prefering AS13030 instead of AS13335 - https://phabricator.wikimedia.org/T245998 (jbond)
[15:29:23] netops, DC-Ops, Operations, Patch-For-Review: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (ayounsi) From JTAC: > To answer your question “For the list of devices/serials that are decommissioned and we don't own anymore, is there a process so they don't sho...
[16:41:48] netops, Operations: Prefering AS13030 instead of AS13335 - https://phabricator.wikimedia.org/T245998 (ayounsi) Open→Declined Indeed, and it can be for many reasons (cost, list saturation, etc...). As long as it's a minority and not the norm it's not an issue. Thanks for reporting it though, it's...
[18:47:39] netops, DC-Ops, Operations, Patch-For-Review: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (RobH) I can provide a list of sold network gear from before 2017, otherwise all network gear is still in our sites in storage, even decom gear. I'm not exactly sure...
[19:11:06] Traffic, Analytics, Operations, Research, WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (Miriam)
[19:11:26] Traffic, Analytics, Operations, Research, WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (Miriam)
[19:21:44] bblack: interested in few random ferm reload failures due to DNS resolution timeout? I'm fixing the underlying cause, but maybe you're interested to look at the DNS side of it
[19:45:55] netops, DC-Ops, Operations, Patch-For-Review: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (ayounsi) Ok! From https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States I thought that if a device was not in netbox it was not in our possession anymore. So I...
[20:08:29] netops, Operations: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (ayounsi) a:ayounsi→RobH From JTAC (over the phone), tl;dr; If the FPC reboot didn't solve the issue, we need to re-seat all linecards and SCBs (should solve the issue in 90% of the cases). And only if that doesn...
[20:15:41] volans: got a link/ticket/etc?
[20:16:20] volans: (or more info dump, whatever)
[20:16:54] netops, Operations: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (ayounsi) Said one thing over the phone, but sent a RMA email. Gave him the same details as the cr2-esams linecard.
[20:24:51] netops, Operations: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (ayounsi) FPC1 only have 2 ports currently: ` et-1/0/0 up up Core: asw2-esams:et-6/0/50 {#20049} et-1/0/1 up up Core: asw2-esams:et-6/0/51 {#20042} `
[20:26:31] netops, Operations, ops-eqiad: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (wiki_willy)
[20:34:35] netops, Operations, ops-esams: cr3-esams:fpc1 RMA - https://phabricator.wikimedia.org/T245825 (RobH)
[20:41:02] bblack: not really, just ferm logging into syslog:
[20:41:03] DNS query for 'ms-be2035.codfw.wmnet' failed: query timed out
[20:41:32] having a look I discovered we were doing 2k DNS resolutions on ferm reload, hence I prepared:
[20:41:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/574426
[20:42:22] volans: does @resolve in this case happen on the target host while booting, or on the puppetmaster while compiling?
[20:42:37] on the target on ferm restart/reload
[20:42:43] by "happen" I mean execute a dns query
[20:42:44] ok
[20:42:55] the definition in /etc/ferm/conf.d/* are hostnames injected by puppet
[20:43:00] and ferm resolve them into IPs
[20:43:19] it's entirely possible it's just dropping some on send, or on receive
[20:43:26] (on either side, really)
[20:44:03] the correct-est general solution is probably to move forward with some kind of systemd-resolved solution for local caching, combined with a reduction of internal-only TTLs
[20:44:03] Traffic, Operations: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (Reedy) Can you access https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue or https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue and follow the instructions?
[20:44:11] that's in a ticket, somewhere, I think
[20:44:37] for now that patch in this specific case should reduce from ~2k to ~90 resolutions
[20:45:07] basically this morning a patch that changed an unrelated ferm rule was merged into puppet and 4~5 ms-be hosts got ferm failing to reload for this
[20:47:44] Traffic, Operations: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (JoeHebda) Microsoft Windows [Version 10.0.18363.657] (c) 2019 Microsoft Corporation. All rights reserved. C:\Users\Admin>tracert en.wikipedia.org Tracing route to en.wikipedia.org [91.198.174.192] over a ma...
[20:51:30] Traffic, netops, Operations: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (Dzahn)
[23:33:10] Traffic, netops, Operations: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (JoeHebda) Now connected...running OK. Ticket can be closed. Done
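Note on the ferm/DNS thread (20:41 to 20:45): the log says a ferm reload triggered about 2k DNS resolutions and that the linked patch should bring this down to about 90, but it does not say how the patch does so. One generic way to get that kind of reduction is to resolve each distinct hostname once and reuse the answer for every rule that mentions it. The Python below is only a sketch of that idea, not ferm or puppet code; `resolve_unique`, `system_resolver`, the injected `resolver` callable, and the `*.example` hostnames are all hypothetical names introduced for illustration.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve_unique(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Resolve each distinct hostname once; return {hostname: addresses}.

    `resolver` is injected so the lookup strategy (system resolver,
    local cache, test stub) is an assumption of the caller, not of
    this function.
    """
    cache: Dict[str, List[str]] = {}
    for name in hostnames:
        if name not in cache:
            # Only the first reference to a name costs a query.
            cache[name] = resolver(name)
    return cache


def system_resolver(name: str) -> List[str]:
    """Look up `name` with the system resolver (performs real DNS queries)."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # info[4] is the socket address tuple; its first element is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Stub resolver so the demo runs without network access.
    queries: List[str] = []

    def stub_resolver(name: str) -> List[str]:
        queries.append(name)
        return ["192.0.2.1"]  # documentation-range address, not a real host

    # Hypothetical rule set: many rules referencing the same few hosts.
    references = ["host-a.example", "host-b.example", "host-c.example"] * 500
    resolved = resolve_unique(references, stub_resolver)
    print(f"{len(references)} references, {len(queries)} queries issued")
```

Passing `system_resolver` instead of the stub would issue real queries, one per distinct name. This is separate from the local caching (systemd-resolved) suggested at 20:44:03, which would address repeated lookups across reloads rather than within one.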