[04:26:02] 10netops, 10Operations, 10observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (10ayounsi) Good news, this is already implemented with: https://github.com/librenms/librenms/pull/9879 Bad news, for unknown reasons so far, the switches don't expose the proper interface data. For... [07:07:00] 10netops, 10Operations: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10elukey) The link went down again: ` elukey@re0.cr2-eqiad> show interfaces descriptions | match down xe-4/1/3 up down Transport: cr2-esams:xe-0/1/3 (Level3, BDFS2448,... [07:25:14] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10elukey) 05Resolved→03Open Couple of notes about the anycast-healthchecker: 1) the `anycast-healthchecker` is not in jessie-wikimedia, so puppet on lithium/wezen is currently broken: ` r... [07:25:30] 10Traffic, 10netops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10elukey) [08:20:52] 10Traffic, 10Operations, 10Security-Team: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10ema) @dduvall: any reason not to proceed with the removal? [08:31:20] 10Traffic, 10Operations, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Volans) @Joe any feedback on the above proposal? I'd really like to split the users ASAP given that dbctl is being deployed. [09:16:12] 10Traffic, 10Operations: fifo-log-tailer: evergrowing memory usage - https://phabricator.wikimedia.org/T229414 (10ema) I've been digging a bit further and reproduced this on my workstation with the following program: `lang=go // growmem.go package main import ( "io/ioutil" "os" ) func main()... 
[09:20:06] bblack: I did a bit of digging to find out why/how the go runtime made fifo-log-tailer panic before the OOM killer started shooting https://phabricator.wikimedia.org/T229414#5383743 [10:02:17] nice --^ [10:18:04] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10fgiunchedi) Thanks @elukey ! Indeed anycast-healthchecker isn't in jessie-wikimedia, lithium is being decom'd and if wezen gets reinstalled it'll be buster, and I installed anycast-healthche... [11:20:59] 10Traffic, 10Operations, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Joe) >>! In T97972#5353056, @Volans wrote: >>>! In T97972#5352851, @Joe wrote: >> IIRC we already have an account specialized for accessi... [12:12:21] 10netops, 10Operations: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10CDanis) That was scheduled maintenance in Centurylink's ticket 16820717, should be resolved as of about two hours ago. [12:45:51] hi traffic, couple of questions; 1. why is cp1008 in the wikimedia.org domain and not eqiad.wmnet? 2. is there any reason why there is no AAAA record for cp1008.wikimedia.org? [12:51:29] jbond42: IIRC 1. is to directly hit it as it's the test host for all-things-traffic [12:51:42] as for 2. I dunno, sorry [12:52:00] but I'll let traffic give you an official answer :) [12:52:04] cheers volans [14:01:07] jbond42: no good reasons at present, it's all just legacy stuff we haven't made time to deal with. cp1008's hardware is also irredeemably-ancient and there's a ticket to use a less-ancient leftover machine to replace it so we can decom it (and the new one would be in private space and eqiad.wmnet). 
[14:01:36] that ticket is https://phabricator.wikimedia.org/T202966 [14:03:00] I think the last non-trivial thing we were semi-stalled on there (trivial things being time and running reimage commands, etc), is whether to use cp1099 at all. [14:03:09] yeah let's not, it's also ancient [14:03:13] 2013 [14:03:15] as mm points out 71-74 are actually better candidates [14:03:18] 6½ years old [14:03:45] but 71-74 are also still "reserved" as separate-from-production ATS test clusters (like a similar set of 4 in codfw) [14:03:53] fwiw we're probably going to push quarterly goals later in the year to drop ancient hardware [14:04:14] probably multiple quarterly goals, with some progression in % or years-old etc. [14:04:21] we're not using them for that at present, we got past that stage with the upload backends conversion. But there was some ??? last we talked, about whether we'd need it again during the text-backend testing or not. [14:05:02] ema: and I don't even currently remember if we decided we'd need the old 4-node ATS test clusters for text or not last time we talked about it. [14:05:06] maybe you do! :) [14:06:06] (vs just relying on the upload backends, which are technically configured to be shared backends for both clusters, just not being used in that other capacity for prod traffic yet) [14:07:04] really, we should also revisit whether we even need pinkunicorn at this stage or not, I guess [14:07:45] it has and does serve some various, mixed-up convenient testing purposes. but most of those could be accomplished in other ways without a separate physical machine. [14:08:09] the only really clear rationale for it in the past was having something that could test TLS termination from the outside world with nothing else in the way. 
[14:09:09] thanks for the explanation bblack [14:09:44] (but if we're willing to put cp1099 or whatever in the private net behind LVS for TLS termination testing, which is mostly about software config rather than hardware and doesn't necessarily even need the real prod certs to accomplish that.... it could just as well be a ganeti instance for LVS testing or whatever). [14:10:13] err, I meant ganeti instance for *TLS* testing [14:11:00] XioNoX: do you have time to chat about the anycast distance metrics topic today? [14:12:32] bblack: yep! [14:14:39] XioNoX: mostly I'm just trying to understand what the plan is, if any, especially before doing anything broader with deploy in eqiad for recdns. I know normal iBGP can't really solve this problem automagically. I know we use OSPF for a lot of internal routing stuff which can solve it, but ... yeah [14:15:22] is there a not-hacky plan for how to fix it? [14:19:55] bblack: I've been thinking about it and can't think of a way to use the actual OSPF metrics. I also experimented some in a Juniper lab (redistribute the OSPF metric in BGP), with no luck. I can keep digging, but one thing we could do, that doesn't rely on metrics, is to have different export policies for CORE->POP / POP->POP, and POP->CORE. So we simply don't advertise the anycast subnets from POP to CORE. [14:21:41] yeah that seems like a reasonable possibility to explore. How would that interact with the network-only pops like eqdfw and eqord? Could e.g. eqiad still learn about codfw's anycast advert if the only remaining redundant path came through one of those? 
[14:22:23] (I guess treat them as CORE in this scenario as well, which works out ok because they don't have local anycast servers of their own anyways) [14:23:08] really we don't need the advert from POP->POP either [14:24:01] in the simplest possible terms, all we really need is for the real core anycast recdns in eqiad and codfw to be visible to everywhere, and for the other real pops (esams, ulsfo, eqsin) to not export their anycast out to anything remote. [14:28:59] so yeah, that makes sense [14:29:02] I think! [14:29:43] (the cores back each other up, they don't need anything else, as we're dead in the water either way if we lose both cores. The pops in network-link terms and every other sense of redundancy/failback really just fall back towards cores for any kind of redundancy) [14:29:53] on a BGP level, eqdfw is part of codfw and eqord is part of eqiad so it's the same, yeah [14:30:04] yeah ok [14:30:22] it would only be for latency reasons to have POPs talking to other POPs [14:30:52] as far as comfort-levels for rollouts, the stuff I've done so far I feel ok with, but coming soon would be turning it on for larger swaths (or all) of codfw then eqiad. [14:31:42] the old setup there under LVS recdns had a sane-ish fallback with relatively low latency (to the other core), but in the current anycast setup, the fallback if they lose both local recdns could be very high latency out to some random edge site for some requests, which could be a problem (at least, a regression in fallback capabilities from previous) [14:32:14] yeah I see [14:32:31] XioNoX: do any of our pops have better latency to another pop than the core? I guess maybe a special case there is that eqsin has a backup tunnel that goes to ulsfo instead of straight to codfw or eqiad? 
[14:32:39] I mean sharing anycast between POP->POP [14:32:55] yeah, currently I don't think it matters much [14:32:58] but for the most part our pops vs cores looks like a star [14:33:09] it's also to pave the way to edge computing :) [14:33:52] we have a much smaller and more-constrained set of software running in the edges too. I doubt DNS latency is as big an issue there for better-designed stuff. [14:34:04] (in the failure scenario anyways) [14:34:13] and we can always depool a whole edge if it were [14:34:19] yep [14:34:47] it's in the cores where I feel like we have more of a wild west where we can never guarantee some one of many critical services isn't spamming and blocking on DNS reqs and falls over on a significant latency bump [14:36:38] ok so recap: current best and simplest plan is basically: edge sites shouldn't export the anycast recdns to remote peers. core sites do export anycast recdns to all peers and let it flood out through the network as fallback routes for any site that loses all its local recdns. [14:36:51] yep [14:38:13] is there a subtask or whatever about this somewhere I can track for when it's done (or make one please!) [14:38:37] bblack: I don't think we need 71-74 for ATS testing, no. We can convert a couple of real text nodes (in eqiad for now as not all origins have TLS yet) [14:39:41] bblack: so partly re-do what has been cleaned up with https://phabricator.wikimedia.org/T227808 [14:40:45] ema: ack. [14:41:03] 10Traffic, 10Operations: fifo-log-tailer: evergrowing memory usage - https://phabricator.wikimedia.org/T229414 (10ema) 05Open→03Resolved The new `fifo-log-tailer` has now been running for one day and shows reasonable memory usage: ` 14:39:53 ema@cp1080.eqiad.wmnet:~ $ ps u -q `pidof fifo-log-tailer` USER... [14:41:15] ema: what do you think about just killing pinkunicorn entirely and not reviving it with some other node? Is it still too useful? 
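(For illustration, the recapped plan — cores export the anycast recdns prefix to all peers, edges keep theirs local — could be expressed as a Junos export-policy sketch roughly like the following. The policy names and the prefix are hypothetical placeholders, not the actual WMF configuration:)

```
/* Hypothetical sketch of the recap above. On core-site routers the
 * anycast recdns prefix is accepted into the BGP export policy; on
 * edge-site (POP) routers it is rejected, so an edge's anycast route
 * never floods out to remote peers. Names/prefix are made up. */
policy-options {
    policy-statement BGP_EXPORT_CORE {
        term anycast-recdns {
            from route-filter 10.3.0.0/24 orlonger;  /* placeholder prefix */
            then accept;
        }
    }
    policy-statement BGP_EXPORT_POP {
        term anycast-recdns {
            from route-filter 10.3.0.0/24 orlonger;  /* placeholder prefix */
            then reject;
        }
    }
}
```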
[14:42:04] bblack: I personally haven't used it in a long time, mostly testing stuff in labs instead [14:42:11] right [14:42:23] so +1 from my side to sunset it [14:42:29] vgutierrez: any thoughts, esp wrt to ATS TLS stuff? I assume you haven't been using it for that either. [14:42:52] bblack: I added an item to my todo to open a task, feel free to as well [15:26:06] yeah [15:26:09] we can kill it [15:29:54] ok cool [15:30:21] regarding ATS TLS, the current puppetization would allow deploying it on any cp node without messing with nginx [15:30:32] so we could use any subset of cp nodes for testing it [15:31:12] I got some negative feedback on my latest PRs to ATS [15:31:34] 10Traffic, 10Operations, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10BBlack) 05Open→03Declined We had a quick discussion and a small informal vote and decided we don't really need this functionality (pinkunicorn) anymore, so we're going to retire it... [15:31:37] 10netops, 10Operations, 10decommission, 10ops-eqiad: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10BBlack) [15:32:14] so I need to rework them to get them accepted by upstream [15:40:33] bblack: I don't think there is a decom task for cp1008, should there be? [15:41:03] bblack: there is https://phabricator.wikimedia.org/T208584, but it's not clear if it's part of that, or if that one is resolved [15:45:22] paravoid: I'm making one now, for all the old cp nodes in eqiad that are left (cp1008, cp1071-4, cp1099) [15:45:33] ah cool [15:45:35] thanks :) [15:55:57] 10Traffic, 10netops, 10Operations, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) Hi all, I just stumbled upon this task while investigating something else. It's something I'm happy to progress, however i wanted to consider if the... 
[16:09:13] jbond42: lol you've stumbled on the mother of all insanely complex tech debt tickets sitting open for years :) [16:09:32] I'm slowly composing a response intermittently now, it may take a little while! [16:09:49] that's how he rolls :P [16:09:50] bblack: :D thanks and i don't think there is any rush [16:10:08] (a response on the ticket I meant, not here) [16:10:15] yes that's what i assumed [16:19:32] actually now that I'm halfway through it, maybe it's better to just leave it alone honestly [16:20:07] I mean I can record a bunch of historical context on why things are done the now-seemingly-silly ways they are, and record evidence of vague memories of past problems we were trying to solve with some of our hacky partial solutions. [16:20:38] but that's perhaps not productive for anyone in the long run, vs just letting fresh ideas grow on their own without probably-irrelevant baggage. [16:22:29] jbond42: probably the only really useful and universal thing I could say is: when you come up with an answer for all of these addressing woes in our infra, make sure it's a holistic view that takes into account how we manage server lifecycles, installs/reimages, IPv4+IPv6, etc. Don't just hack on a solution to the smaller problems with IPv6, but make a new thing that makes sense for all of it :) [16:25:10] (especially provisioning and automatic DNS management, etc) [16:26:38] bblack: I guess the main thing I'm missing is why SLACC didn't work for you. My gut feeling is it was probably a lack of some kernel thingy which required hacks. That is likely no longer an issue so slack may just work [16:27:12] SLAAC I think? [16:27:28] yes that one :) [16:27:37] I think at least one of the thoughts going into the SLAAC thing was we wanted persistently-reliable addresses. [16:28:08] in other words, we don't do SLAAC for DNS/config-recorded host addresses for the same reason we don't do random DHCP pool assignment for them for IPv4. 
[16:28:16] SLAAC is tied to the MAC so it would be persistently-reliable as long as the NIC doesn't get replaced [16:28:22] (which it does!) [16:28:30] (I mean, NICs do get replaced) [16:28:34] I think whatever we use for server bootstrapping using netbox with v4 we should do the same with v6 [16:28:57] and I thought the latest privacy-enhanced SLAAC actually rotated the IP fairly frequently? [16:28:57] yeah I don't think slaac is the way to go [16:29:13] bblack: we can disable that [16:29:35] the privacy enhancements are off in our install [16:29:38] yeah we don't use privacy stuff for SLAAC at all right now [16:29:59] but still, if you look at the puppet repo, there are hardcoded IPs all over it [16:31:04] if you wanted to move towards some more-ideal state where SLAAC and auto-managed DNS records were running the world, you'd first probably want to eradicate the hardcoding of IPs for various silly reasons in puppet config, etc [16:31:11] ok i can have a think about this, perhaps we could add it to the scope of the netbox DNS generation stuff [16:32:09] but assuming we end up sticking to a worldview where physical servers have manually-assigned and persistent IPs for IPv4, there's no good reason not to also assign a matching static IPv6 at the same time through the same mechanisms, IMHO. [16:32:14] I'm not tied to using SLAAC, in fact i have mostly switched it off in previous roles as it causes too many issues [16:32:34] the problem we've run into in the past going down that sort of road though, is that SLAAC is on by-default and hard to get rid of cleanly. 
[16:33:14] hence the "token" approach as a hackaround at least back in jessie or so, where you're abusing SLAAC advert but telling the kernel to use explicitly-mapped lower64 [16:34:29] but I think it would be cleaner to make it so that SLAAC was dead-dead-dead on the hosts from the very first moment (early in installer, and permanent/default sysctl settings from there onwards), and have netbox/dns/installer manage deploying a fixed and matching pair of v4+v6. [16:35:08] but those are just my personal thoughts. I haven't stared at this problem in a long time, and the world is always changing too! [16:35:49] i can dig through some old config to see what we did at ICANN to switch it off [16:37:19] thanks for the history i will have a think about this tomorrow and add some thoughts to the ticket [16:51:10] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10greg) a:05greg→03mmodell >>! In T226044#5380942, @greg wrote: >>>! In T226044#5380759... [17:22:46] ema: doing some random cleanup of old outstanding commits... https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473299/ - do you remember there being a reason not to merge this one? it seems trivially-ok, maybe I got hung up on regex sanity review and just forgot it [17:36:25] ebernhardson: ping, the cloudelastic LVS patch LGTM, we need to be careful with the deploy and do a couple manual things on the LVS side. Do you want to be around in case whatever on the cloudelastic side and/or ready for it? 
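(For reference, the "SLAAC dead from the very first moment" approach sketched above usually comes down to a few sysctls, with the same settings passed early in the installer. This is an illustrative fragment, not our actual puppetization:)

```
# Illustrative sysctl.conf fragment: ignore router advertisements and
# disable SLAAC address autoconfiguration, so the statically assigned
# v4+v6 pair are the only addresses the host ever has.
net.ipv6.conf.all.accept_ra = 0
net.ipv6.conf.default.accept_ra = 0
net.ipv6.conf.all.autoconf = 0
net.ipv6.conf.default.autoconf = 0
```

(The "token" hackaround mentioned above is the halfway point: RAs still supply the /64 prefix, but the lower 64 bits are pinned explicitly, e.g. `ip token set ::208:80:154:241 dev eno1` — interface name hypothetical.)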
( https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/512925/ ) [17:39:48] bblack: sure i can [17:40:20] bblack: there isn't anything particularly dangerous on my side, nothing will start sending on its own until i turn it on manually [17:41:00] ebernhardson: ok, gonna merge shortly and work on the LVS side [17:47:05] heh, minor syntax error not caught by compiler (because it's off in ferm rules stuff, not puppet-level stuff) [17:47:13] working on a quick fixup patch [17:56:06] still working through some other issue with the cloudelastic LVS_SERVICE_IPS definition, I'm not even sure what's going on there yet [17:56:20] they're showing: [17:56:23] hmm [17:56:23] LVS_SERVICE_IPS="cloudelasticlb6=>2620:0:861:1:208:80:154:241} {cloudelasticlb4=>208.80.154.241" [17:56:32] something odd with the realserver IPs puppetization [17:56:42] should just be the two addresses, space-separated [18:08:48] at a glance it seems like, reading the LVS service IPs puppetization stuff, there's like 8 possible different ways you could configure it, and existing examples of at least 5 of them, and they only differ on the surface in annoying meta-terms about role-vs-profile and variables-vs-hieradata, etc... [18:09:16] and it's easy to configure something that looks like it should work, but is a bad cross-section of 3 different pathways to configuring it otherwise-successfully. [18:25:40] :S Indeed i looked at other examples and guessed a bit. I think anything with ip4 and ip6 address should be fine. 
I doubt it even needs ip6, but it seems time that everything should get ip6 :) [18:26:53] I think I've got it now [18:27:23] (using an older pathway that requires a WMF style-guide ignore comment heh) [18:28:16] I think, there aren't many examples outside of traffic meta-services like the public caches, which puppetize LVS for ipv4+ipv6, and thus such examples aren't well-accounted for in any of multiple recent partial refactorings of all this [18:34:09] ah and they have no add_ip6_mapped either... [18:34:55] (so LVS can't check the v6 on them for pooling) [18:34:58] that can be fixed too [18:35:02] it's not happy with the certs (i'll fix that) but otherwise it seems to be sending requests through [18:36:01] (only ipv4) [18:36:47] oh! indeed they don't have ipv6 addresses assigned to the interfaces. i should have noticed that... [18:37:23] oh maybe it does...i should read closer :P [18:37:25] well [18:37:36] they do have SLAAC, but not fixed ones that can be looked up in DNS [18:37:50] I'll push a couple patches and fix that part up in a few, but I gotta run through a meeting first and then pop back [18:38:00] ok, thanks for the help! [18:55:39] 10Wikimedia-Apache-configuration, 10Commons, 10SDC General, 10WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (10Smalyshev) a:03Smalyshev [19:25:45] ok back, I think we just need mapped v6 for the hosts now, trying [19:37:59] 10netops, 10Operations, 10ops-eqiad: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (10ayounsi) p:05Triage→03Normal [19:46:40] 10Traffic, 10netops, 10Operations, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10ayounsi) This should probably wait on T219908. Whatever solution we find to configure IPv4 based on Netbox data, IPv6 should be the same. 
[19:49:12] I think check_https_on_port, as used by cloudelastic in hieradata/common/lvs/configuration.yaml, doesn't work as expected [19:50:04] or well maybe that does technically work right, but the host definition that gets created for it is wrong [19:53:01] hmmm I think I found it, maybe! [19:57:20] 10netops, 10Operations, 10ops-codfw: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10ayounsi) 05Open→03Resolved This is done. [19:57:23] 10netops, 10Operations, 10ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10ayounsi) [19:57:46] 10netops, 10Operations, 10ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10ayounsi) [20:04:02] 10netops, 10Operations, 10observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (10ayounsi) Service Request ID 2019-0801-0611 has been created. [20:57:32] 10Traffic, 10Discovery-Search, 10Elasticsearch, 10Operations: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10BBlack)