[08:57:15] Krenair: https://phabricator.wikimedia.org/P8566 SNI prevalidation looking good on acmechief-test1001
[09:10:45] 10Acme-chief, 10HTTPS, 10Traffic, 10Operations: acme-chief: Validate that configured certificates can be actually issued - https://phabricator.wikimedia.org/T220518 (10Maintenance_bot)
[09:22:00] Krenair: required config change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/512871
[13:43:15] hi all, i have just enabled some ulogd logging and i'm seeing a lot of packets coming from the core routers hitting rhenium on port 2100. they look like this:
[13:43:18] 13:42:50.914549 IP 208.80.154.197.50101 > 208.80.154.52.2100: UDP, length 100
[13:43:53] they mainly come from 208.80.154.197 but i have seen them from 208.80.154.196 as well
[13:44:06] anyone know what it may be?
[13:45:08] jbond42: netmon hosts store a copy of all network device configs, maybe rhenium's ip is still in there somewhere
[13:47:04] godog: i think i may be missing something. is rhenium an old netmon box? Either way it looks like the routers are sending data to rhenium, so not sure if that helps?
[13:48:39] also i noticed that 208.80.154.197 sends packets constantly, at least 4pps, whereas 208.80.154.196 sends a burst of 8 packets every minute. i'm guessing 208.80.154.196 is the passive router, but looking at the raw packets i can't tell what the traffic is
[13:53:28] jbond42: sorry if that wasn't clear, I meant grepping for rhenium's name or ip in one of the netmon hosts' /var/lib/rancid/core should reveal if/how it is still used
[13:53:49] godog: ahh ack thanks
[13:58:03] godog: yes i see some config for flow-server and i was getting hints it was sFlow traffic, so that looks like the culprit. i'll raise a ticket
[14:04:58] 10Traffic, 10Operations: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10jbond) p:05Triage→03Normal
[15:10:45] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-eqiad: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10Maintenance_bot)
[15:29:19] some stuff to keep in mind with various ICMP / PMTU strangenesses, some of it (maybe not what we've seen recently, but other related things) could be part of attempts to use PMTU to execute a different attack:
[15:29:23] https://indico.dns-oarc.net/event/31/contributions/692/attachments/660/1115/fujiwara-5.pdf
[15:29:53] when adding a new ip_blockXXX to hieradata/common/lvs/configuration.yaml, internal ips (10.x.x.x) seem to be used sequentially, but where should i look to choose a public ip (for fronting cloudelastic100X.wikimedia.org)?
[15:33:07] ebernhardson: generally operations/dns reverse zone files are the source of truth on public address allocations, but there are some considerations as to which subnet that IP should even be in (whether it should be behind LVS or not, and otherwise what row it might be homed in, etc)
[15:33:57] (eventually netbox will be where it happens, but I don't think we're quite there yet)
[15:34:41] bblack: ok i'll review that and try and figure something out, thanks!
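A minimal sketch of the check godog describes above, assuming the rancid copies on the netmon host are plain-text device configs under /var/lib/rancid/core; the script and its invocation are illustrative, not an existing tool:

```python
#!/usr/bin/env python3
"""Walk a rancid config directory and report every line that still
references a given hostname or IP (e.g. rhenium / 208.80.154.52)."""
import pathlib
import sys

RANCID_DIR = pathlib.Path("/var/lib/rancid/core")

def find_references(terms):
    """Yield (path, line number, line) for every config line matching a term."""
    for path in sorted(p for p in RANCID_DIR.rglob("*") if p.is_file()):
        try:
            lines = path.read_text(errors="replace").splitlines()
        except OSError:
            continue
        for lineno, line in enumerate(lines, start=1):
            if any(term in line for term in terms):
                yield path, lineno, line.strip()

if __name__ == "__main__":
    # usage (hypothetical): ./find_refs.py rhenium 208.80.154.52
    for path, lineno, line in find_references(sys.argv[1:]):
        print(f"{path}:{lineno}: {line}")
```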
[15:35:22] if it ends up behind prod LVS in eqiad, it would be in the section marked: ; - 208.80.154.240/28 (240-255) LVS high-traffic2 (Multimedia & Misc)
[15:35:34] at the bottom of templates/154.80.208.in-addr.arpa
[15:36:14] looks like 3 of the 4 servers are also in that file, the 4th is in 155.80.208.in-addr.arpa
[15:36:16] If it's not behind LVS, then the servers all have to be in the same row to have a shared/failable public IP, and it would be in one of the per-row public subnets
[15:36:38] (or we have to come up with some new strategies about these things vs the cases we have today)
[15:37:10] tbh i'm not sure on the tradeoff between LVS or not. I was assuming LVS; essentially the goal is simply to have a single name for usage, and have it handle depooling and whatnot relatively easily (via pybal i imagine)
[15:40:06] looks like the split is because we have 4 servers, one racked in each dc row. Seems that means it has to be lvs and can't have a shared/failable public ip
[15:40:44] yeah, I think there was some discussion on the ticket, but I haven't looked back at it yet and don't recall all the context, I can maybe look again later today
[15:40:56] or maybe that convo was just on IRC
[15:41:17] but I think someone thought LVS wouldn't be an option for this, and then we said there's no reason it shouldn't work fine... and I'm not sure if that whole thing got resolved or not
[15:41:23] ok, i'm mostly trying to get a "most of the way there" patch put together, figuring someone who knows what they are doing can review and help fix up the minor parts
[15:41:44] maybe assume LVS for now, and use the LVS subnet mentioned above to allocate an IP from
[15:41:46] bblack: i was told by someone lvs wouldn't work, but now, a few months later when finally setting it up, i can't remember who said that, and it seems worth trying at least...
[15:42:42] (and when doing reviews on that patch, we may end up moving the IP around... it might be good to look at the git history and see which were most recently used for some other purpose and avoid those, etc)
[15:42:48] i think there was some early confusion regarding how the networking was going to work too, maybe the lvs problem was when we were pondering the labsupport network instead of public networking
[15:43:16] yeah I think that's what's going on. The way it looks now, with where cloudelastic100x are currently, LVS should be an option.
[15:43:59] alright, seems i have some things to go on. thanks!
[15:45:07] np!
[16:16:53] jbond42: those packets are netflow
[16:18:18] jbond42: I enabled sampling on cr2 to troubleshoot an issue so it's sending real data, while cr1 is only sending keepalive-like data
[16:18:31] (called templates)
[16:23:52] I see that you opened a task
[16:24:15] and netflow got moved to a different server? https://phabricator.wikimedia.org/T212011
[16:24:22] XioNoX: thanks, yes i created a ticket. i assume that rhenium still needs to be removed as it's a spare?
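A rough sketch of checking which addresses in the LVS high-traffic2 range discussed above (208.80.154.240/28, last octets 240-255) already have PTR records in templates/154.80.208.in-addr.arpa. It assumes a zone-file layout where each PTR line starts with its last-octet label, so the real file may need a smarter parser, and as noted above the git history should still be checked for recently freed entries:

```python
#!/usr/bin/env python3
"""List last octets in 240-255 that have no PTR record yet in a
reverse zone file (layout assumption: records start with the octet label)."""
import re
import sys

LVS_HIGH_TRAFFIC2 = set(range(240, 256))   # 208.80.154.240/28

def used_octets(zone_text):
    used = set()
    for line in zone_text.splitlines():
        # e.g. "240  1H  IN PTR  some-lb.eqiad.wikimedia.org."  (illustrative)
        m = re.match(r"^(\d{1,3})\s+.*\bPTR\b", line)
        if m:
            used.add(int(m.group(1)))
    return used

if __name__ == "__main__":
    zone = open(sys.argv[1]).read()   # path to templates/154.80.208.in-addr.arpa
    free = sorted(LVS_HIGH_TRAFFIC2 - used_octets(zone))
    print("unallocated last octets in 240-255:", free)
```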
[16:24:39] I was not aware of any of those changes
[16:24:49] I'll have to update the router configs
[16:24:53] ahh ok
[16:25:19] it's not a priority for me, it's just causing noise and i was looking :)
[16:25:28] please CC netops and/or me for any network related tasks, otherwise I'll miss them
[16:25:52] ahh ok, i thought adding traffic was enough, will do in future
[16:26:53] I'll do it this week ideally if it's not urgent
[16:27:15] thanks for the CR reviews btw :) I'll get to them asap too
[16:28:23] yes, not urgent, and most comments i made are very minor and personal style, so feel free to ignore if you disagree :)
[16:32:17] I'll tell you if I disagree, but from a quick glance it looks like a lot of valuable comments :)
[16:33:24] ack cheers
[16:39:42] 10netops, 10Operations: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (10elukey) p:05Triage→03Normal
[16:45:22] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) Hello here is the sample output. There are several inconsistencies that I can see the fix for that I'd already attempte...
[18:03:30] 10netops, 10Operations: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi) p:05Triage→03High
[18:04:33] paravoid, mark, bblack, could you review https://phabricator.wikimedia.org/T224511 and especially the next steps?
[18:05:36] 10netops, 10Operations: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi)
[18:14:19] 10netops, 10Operations: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10BBlack) Plan seems reasonable based on the info in the description! Maybe wait longer than 2h after the linecard is restarted? Or do we suspect that any recurrence is much less likely with no traffic?
[18:32:21] 10netops, 10Operations: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi) I picked 2h for the sake of picking a number that //sounds// right, but it's not backed by anything. Any value works for me.
[18:34:50] new version of LibreNMS supports receiving and alerting on SNMP traps - https://github.com/librenms/librenms/pull/10136 - https://github.com/librenms/librenms/releases
[19:21:31] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) Note that this already helped find inconsistencies: [x] Some devices had status active while they shouldn't (cr3-esams...
[19:25:14] 10Traffic, 10Operations: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10ayounsi)
[19:25:16] 10netops, 10Operations: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi)
[19:26:54] 10netops, 10Operations: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi) task description says sulfur.wikimedia.org but the link is to sodium ( https://netbox.wikimedia.org/dcim/devices/1171/ ) Which one is correct?
[19:36:05] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10Performance: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) Adding #performance because handshakes on TLS 1.3 are 100ms faster and also it caches handshakes (https://kinsta.com/blog/tls-1-3/). Hope that's fine for you.
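As an aside to the TLSv1.3 task above (T170567), a small illustrative check of which protocol version a server actually negotiates; the default host argument is just an example, and the snippet assumes a Python/OpenSSL build with TLS 1.3 support:

```python
#!/usr/bin/env python3
"""Print the TLS protocol version negotiated with a server."""
import socket
import ssl
import sys

def negotiated_version(host: str, port: int = 443) -> str:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # e.g. "TLSv1.3" or "TLSv1.2", depending on what both sides offer
            return tls.version()

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "en.wikipedia.org"
    print(host, negotiated_version(host))
```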
[19:39:53] 10Traffic, 10Operations: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10ayounsi) a:03ayounsi Network devices have to have their target changed, see T212011. Note that only cr2-eqiad is actively sending netflow, the other routers are only sending keepalives.
[20:13:50] 10netops, 10Operations: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10MoritzMuehlenhoff) >>! In T212011#5218478, @ayounsi wrote: > task description says sulfur.wikimedia.org but the link is to sodium ( https://netbox.wikimedia.org/dcim/devices/1171/ ) > Which one is...
[20:57:59] !log phab1003 / phab2001 - removing 'apache restart' from root's crontab (gerrit:512977) (T187790)
[20:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:05] T187790: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790
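On the templates/keepalive distinction mentioned in the rhenium task above: a NetFlow v9 export packet is a sequence of FlowSets, where IDs 0 and 1 carry (options) templates and IDs of 256 or higher carry actual flow records, so a router with no active sampling periodically re-sends only templates. A minimal parser sketch following RFC 3954 (not an existing tool here, and the function name is illustrative) could count the two kinds:

```python
#!/usr/bin/env python3
"""Classify the FlowSets in one NetFlow v9 export packet (raw UDP payload)."""
import struct

def flowset_summary(payload: bytes):
    """Return (template_sets, data_sets) counts for one NetFlow v9 packet."""
    version, count = struct.unpack_from("!HH", payload, 0)
    if version != 9:
        raise ValueError(f"not a NetFlow v9 packet (version={version})")
    offset = 20                      # fixed v9 header is 20 bytes
    templates = data = 0
    while offset + 4 <= len(payload):
        flowset_id, length = struct.unpack_from("!HH", payload, offset)
        if length < 4:               # malformed or padding, stop parsing
            break
        if flowset_id in (0, 1):     # template / options-template FlowSet
            templates += 1
        elif flowset_id >= 256:      # data FlowSet referencing a template ID
            data += 1
        offset += length
    return templates, data
```

A packet from a router that only sends "keepalive-like" exports would show template FlowSets and no data FlowSets, while cr2's sampled traffic would show data FlowSets as well.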