[07:52:54] 10netops, 10Operations, 10ops-eqiad: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (10ayounsi) 05Resolved>03Open Re-opening as we're seeing errors again (at a lower rate, but errors nonetheless) https://librenms.wikimedia.org/graphs/to=1536910800/id=6828/type=port_er... [08:12:26] 10Traffic, 10Operations: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10ayounsi) > Oct 14 19:05:27 asw1-eqsin craftd[1962]: Minor alarm cleared, FPC 0 PEM 0 is not powered > Oct 14 19:05:28 asw1-eqsin craftd[1962]: Minor alarm cleared, FPC 1 PEM 0 is not powered So... [08:28:04] 10netops, 10Operations: relabel switch interfaces formerly saiph.frack.codfw.wmnet to frpig2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T207035 (10ayounsi) 05Open>03Resolved a:03ayounsi Renamed. [10:43:23] 10netops, 10Operations, 10Patch-For-Review: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985 (10ayounsi) I installed Quagga in a VM to verify the commands, but there will most likely be differences with the office Quagga. Only things that might need to be added is updating ipt... [11:10:37] 10netops, 10Operations, 10ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) [11:18:21] 10netops, 10Operations, 10ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) @Cmjohnson all yours to populate with optics and ship to eqord, @Papaul will be there on Oct. 23rd. If we don't have enough optics, please use the ones received for cr2-eqsin in T205487 and let m... [12:17:21] 10netops, 10Operations, 10Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (10ayounsi) 05Open>03Resolved As this was the troubleshooting task and follow up tasks have been open for the software upgrades thems... [12:49:51] bblack: if you have a moment I've a question for you regarding the zone linter [12:50:45] ok [12:51:29] so, in the wikimedia.org zone we have both hosts (dbmonitor1001) and service records (gerrit) that have direct A/AAAA records (not DYNA) and I don't see anything obvious to tell them apart [12:51:44] so right now the tool is reporting as error if 2 services share the same IP [12:51:57] example: [12:52:13] ERROR: Found 2 name(s) for IP '208.80.155.11', expected 1: [12:52:14] wikimedia.org:161 civi1001.wikimedia.org. A 208.80.155.11 [12:52:14] wikimedia.org:812 civicrm.wikimedia.org. A 208.80.155.11 [12:52:27] we have 16 of them [12:52:49] yeah it's a tricky subject :) [12:53:08] is that checking cross-zone too? [12:54:02] it loads all the data first, so IIRC yes, but we're checking only 'wmnet', 'wikimedia.org' zones + reverse [12:54:10] I see [12:55:00] yeah it's really a pretty bikeshedd-y subject at the end of the day [12:55:09] for now, as those are the "infra" zones, and all the others have either DYNA or some custom logic hard to check [12:56:08] the way that would be cleanest from a linting/verification perspective, would be 1:1 matching of PTR<->A for whatever we consider the more-important name (which is ???), and a CNAME for the other. [12:56:18] but then especially with public IPs I donno about the ??? [12:56:57] e.g. 
if 208.80.155.11 were the primary interface IP of machine civi1001.wikimedia.org on a public vlan, we'd want A/PTR on that hostname, and civicrm.wikimedia.org to be the CNAME record [12:57:36] yeah I was exactly wondering if the "service" one could just be a CNAME [12:57:44] but you could also make the argument that the service name civicrm.wikimedia.org is the important/canonical use of the IP and the revdns should match for it, which leaves the actual host interface hostname/IP as a CNAME, which also seems wrong. [12:57:44] but maybe there are $reasons why it's not [12:58:05] regardless of the duplicate-A-record lint, it's hard to answer that question (which does the PTR point at?) [12:58:28] the PTR in this specific case points to civicrm.wikimedia.org. [12:59:06] yeah, which seems "wrong" if that IP is also the primary interface IP of some machine civi1001 that lives on a public vlan (which might not be the case here, but might in others) [12:59:20] but seems more-right in the public-view sense [12:59:57] either way, then there's two other follow-on questions: [13:00:21] 1) Assume we choose whichever of A or B should have the matching PTR record: should the other be a direct A-alias, or a CNAME? [13:00:46] 2) If it stays as a direct A alias, should we have multiple PTRs? (I think I've seen this before, but I also vaguely recall some RFC or other saying it's bad practice) [13:01:43] sure [13:02:03] on question 1, the basic tradeoffs are this: [13:02:26] also ignore what the tool does now, and let's focus on what we'd like it to do :) [13:03:05] yes of course, but linting is important. I see this as kind of like "it's actually ok to change your C code to please a linter, because being able to cleanly run that linter has lots of long term benefits" [13:03:29] anyways, the pro/con list on the CNAME vs duplicate-A thing: [13:04:27] 1) pro-CNAME: reduces duplication in the "code duplication" sense: nobody will accidentally update one copy of the IP and not the other. This only makes sense if you're manually managing the file though. Ideally should be templating in those IPs from a single source anyways, but not today! [13:04:56] 2) pro-CNAME: allows enforcing this 1:1 rule on PTR matching, which does make lint/validation simpler. [13:05:32] 3) con-CNAME: bloats repsonse packet size, but honestly who cares in most cases; if everything else is well-managed it won't matter in practice. [13:06:13] 4) con-CNAME: the possibility in the long term that if your workflows/patterns are CNAME-centric, you end up building deep CNAME chains accidentally, which are awful for bloating and for processing. [13:06:28] that last point, I think is something an additional linter rule could protect against [13:06:44] chained CNAMEs, I guess so [13:06:49] as in, validate that we don't have CNAME->CNAME->CNAME chains greater than depth X (cross-zone in all our data), for a very small X [13:06:59] ideally like, 2 :) [13:07:03] yeah [13:09:26] so I think the arguments on that level lean in favor of: one of those names should be a CNAME, and we should lint against overdeep CNAME chains. [13:09:54] but picking one to be the canonical A/PTR pair and the other to be the CNAME alias, is tricky, and independent of the CNAME question regardless. 
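A minimal sketch, assuming the zone files have already been parsed into simple (name, type, rdata) tuples, of the two lint rules discussed above: flagging IPs with more than one A/AAAA owner, and flagging CNAME chains deeper than a small limit. The record shape and function names are assumptions, not the real tool.

```python
from collections import defaultdict

# Assumed input shape: ("civicrm.wikimedia.org.", "A", "208.80.155.11");
# how the real linter loads and normalizes zone data is not shown here.

def duplicate_a_records(records):
    """Return IPs that have more than one A/AAAA name (the 'expected 1' error)."""
    names_by_ip = defaultdict(set)
    for name, rtype, rdata in records:
        if rtype in ("A", "AAAA"):
            names_by_ip[rdata].add(name)
    return {ip: names for ip, names in names_by_ip.items() if len(names) > 1}

def overdeep_cnames(records, max_depth=2):
    """Return (name, depth) for CNAME chains longer than max_depth, following
    targets across all loaded zones; cycles are cut off via the seen set."""
    cname = {name: target for name, rtype, target in records if rtype == "CNAME"}
    bad = []
    for start in cname:
        depth, cur, seen = 0, start, set()
        while cur in cname and cur not in seen:
            seen.add(cur)
            cur = cname[cur]
            depth += 1
        if depth > max_depth:
            bad.append((start, depth))
    return bad
```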
[13:10:14] and in the end on a case-by-case basis demanded to the service owner [13:10:32] yeah I think so [13:10:44] arguably, the problem runs a bit deeper than the DNS level here [13:11:06] in that we shouldn't have "service" hostnames (public or private) that merely alias to a machines own IP for the machine's hostname [13:11:23] we have also trickier cases in which none of the 2 IPs is a host [13:11:38] recursor1.wikimedia.org and dns-rec-lb.esams.wikimedia.org for examples [13:11:56] recursor1 is I think legacy bullshit we don't use, but we've never tracked that down [13:12:04] git-ssh.eqiad.wikimedia.org. and git-ssh.wikimedia.org. [13:12:18] things like that :) [13:12:21] that one is a little trickier! [13:12:46] so the high-level pattern for Big Services That Are Well-Configured is something like this: [13:13:06] mw1221.eqiad.wmnet has a singular hostname+IP that's singular in DNS and irrelevant to service consumers. [13:13:44] there's a service hostname like appservers.svc with its own A/PTR, but that lives in an LVS-specific (not row-specific) address space, and LVS balances out the traffic. [13:14:05] or ditto for en.wikipedia.org -> LVS (not row specific) public IP -> fans out to various internal cache hostnames [13:14:30] in the smaller cases, we have some other (anti-?)patterns: [13:15:07] 1) Obviously, having services contacted directly on a machine hostname is Bad (e.g. is configured to contact a hostname like mw1221.eqiad.wmnet) [13:15:45] 2) The next step up in abstraction is someone makes a specialhostname.svc.eqiad.wmnet which just copies mw1221's IP or CNAMEs to mw1221's IP, which is also bad, but common. [13:16:29] 3) An improvement is to at least make a separate IP for specialhostname.svc.eqiad.wmnet, which happens to be allocated from the same row as mw1221's IP, and configure that separately on mw1221's interfaces. [13:17:11] (but the downside is, when we replace that or fail it over, to mw1222, and mw1222 happens to be in a different row, the virtual service IP breaks and nobody realizes it until afterwards and then has to go fix config/dns anyways) [13:17:25] * volans sees himself writing a table of highlight current checks of the tool and warning/error level, to be integrated with TODOs [13:17:48] (this problem is way worse for public services using IPs straight out of our public vlan rows, I think) [13:17:56] right [13:18:32] 4) Even better, is to allocate the service IP from the row-neutral address spaces, which today at least is synonmous with "LVS service IP ranges" [13:19:27] (but then: it won't get routed to your instance by default: you either need to put it behind LVS (which you need to do anyways if >1 machine per DC), or you need some other special BGP/router support to route the IP to the instance (which is the case with e.g. authdns, which is not LVS-managed, but uses LVS space + special rules on router today) [13:20:42] and then there's the cross-DC stuff: git-ssh == git-ssh.eqiad basically means we have some kind of x-dc failover capability baked in, but it's more or less hardcoded instead of geodns and/or discovery -automated. [13:21:50] there's a lot of cleanup we could/should be doing here, to define what the standard acceptable patterns are and enforce it [13:23:13] one of the defining constraints of the problem is we don't do x-row VLANs at all, of course, and that's something we could revisit just for service IP ranges (not host-primary) [13:23:53] e.g. 
if it's a common thing to want a non-LVS service to have a row-portable IP, we could define a new subnet that's eqiad-row-neutral-service-IPs, and have it available on all switches, for the service hosts' to use with a vlan-tagged interface. [13:24:22] that allows for that sort of "one machine at a time with manual row-netural failover" case, if we really want to support that. [13:24:38] indeed [13:24:48] but it's a PITA on other levels [13:25:09] it breaks our ethernet-level row isolation we have today, which is nice for all kinds of resiliency reasons, and requires we hook up all kinds of cross-row ethernet links [13:25:49] it would seem better to standardize around something more like: if it's a service IP, it goes through LVS and uses LVS service IP ranges, even if you only have one machine. [13:28:19] (but then configuring new LVS services is currently Painful on the puppetization end, and requires pybal restarts) [13:28:36] so there's a lot of deep things to chase down here to reach a cleaner state [13:30:21] yep I knew it from the start, that's why I concentrated on the relatively smaller space of infra/hosts only [13:30:26] but the wikimedia.org zone has both :D [13:35:30] gehel: does the RPS thing seem to make a diff in dropped packets in practice? [13:35:57] bblack: not sure, I haven't seen much packet drop lately [13:36:00] (there could be other causes than the single CPU at the NIC level thing, of course) [13:36:03] ok [13:36:20] the full story is that there contention between NIC and Blazegraph for CPU [13:36:52] so as far as I understand, packet loss is also related to high CPU usage from Blazegraph, which we also addressed (to some extent) [13:37:35] I'll have a better understanding after a week of data [13:38:09] ok, sounds good :) [13:38:59] I've been thinking about whether we should just put interface::rps into some base class (+numa_networking by default), at least for the cases where the host has only one primary interface configured, and the driver's on a whitelist (which now, our whitelist would cover the vast majority of production) [13:39:48] known-supported is bnx2, bnx2x, bnxt_en, and tg3. which are all broadcom, but I think between them covers ~90%+ of prod primary interfaces. [13:40:18] sounds reasonable to me [13:40:40] what are the drawback? [13:40:58] none really that I know of, yet [13:41:14] I've read something about packet reordering being more expensive when processing is spread over multiple CPU, or something like that [13:42:16] well, the way it's supposed to work is that the card divides up the incoming traffic into Flows first (which means IP:port <-> IP:port, basically TCP connections) [13:42:26] and hash flows into queues. So for a given connection, all packets land in the same queue. [13:42:57] (and thus for tcp reordering or fragmentation, not an issue) [13:43:14] but the caveat there is how you handle having more CPUs than queues [13:43:38] e.g. right now, that common tg3 card (your wdqs use case, and many others) only has 4x hardware receive queues. [13:44:19] interface::rps behavior in this scenario is that if you have 8 available cpu cores (physical, not HT, and on the same numa domain as the card. So "8" might be the value for a host that has 32 logical HT cpu cores total) [13:44:35] it will map the first queue to both of the first two CPUs, etc.... 
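A toy illustration (explicitly not the NIC's real RSS hash) of the flow-hashing point above: the hash input is the connection 4-tuple, so every packet of a given TCP connection lands in the same receive queue, and spreading queues across CPUs cannot reorder packets within a connection.

```python
import hashlib

NUM_QUEUES = 4  # what a tg3 card exposes in hardware, per the discussion above

def rx_queue_for(src_ip, src_port, dst_ip, dst_port):
    """Toy stand-in for the card's per-flow hash: same 4-tuple, same queue."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha1(flow).digest()[:4], "big") % NUM_QUEUES

# All packets of one connection map to one queue, regardless of how many CPUs
# end up servicing the queues:
assert rx_queue_for("10.64.0.1", 33000, "10.64.0.2", 9243) == \
       rx_queue_for("10.64.0.1", 33000, "10.64.0.2", 9243)
```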
[13:44:38] each queue gets two CPUs [13:45:08] You could make the argument that's a bad default behavior, and maybe we should switch its behavior to not even send a queue to more than one cpu [13:45:32] * gehel is wondering how much of that is micro optimization [13:46:10] (all the cases we've cared about hard before, we have more than enough queues to do 1:1 so it's not an issue. I think we put in the math bits to do 1:n queue:cpu kind of arbitrarily just to cover the case, but never really thought about it much) [13:46:51] it's definitely out in the realm of crazy micro-optimization for the general case [13:47:04] it matters more on important high-traffic nodes that can see huge influxes though, e.g. lvs/cp [13:47:17] yeah, I can buy it for those! [13:47:19] but if we've bothered to do the legwork for them and it's generic and automated, why not apply elsewhere? [13:47:30] unrelated, can I pick your brain about T207195 ? [13:47:31] T207195: Configure LVS endpoints for new elasticsearch clusters - https://phabricator.wikimedia.org/T207195 [13:47:56] and the related https://gerrit.wikimedia.org/r/c/operations/puppet/+/467684 [13:48:08] yes. I read a little about it yesterday, but I didn't quite see it all clearly [13:48:21] yeah, it's not all that clear in my head yet [13:48:31] and the context is spread over a number of tickets [13:48:41] for the high level context, we're working on multiple elasticsearch instances on the same server [13:49:00] I guess my first question: is this just about HTTPS flows into these split clusters, or also other protocols LVS is handling? [13:49:01] the goal is to reduce the number of shards in a single cluster, since this has an impact on cluster wide operations [13:49:10] I don't even remember if es has other protocols [13:49:11] just HTTPS [13:49:50] We expose HTTP as well, but since we're changing the endpoints, we should take the opportunity to drop it completely [13:50:14] ok [13:50:18] the goal is to have one large cluster for large / high traffic shards, spreading over all nodes [13:50:23] so this is search.svc.(eqiad|codfw).wmnet [13:50:27] ? [13:50:32] (in LVS terms) [13:50:34] and 2 small cluster, over half of the ndoes each [13:50:41] yes, search.svc [13:51:11] and varnish doesn't map into this either I don't think, just internal direct svc->svc traffic [13:51:18] yep [13:51:53] we're also thinking about taking this opportunity to introduce discovery endpoints, but there are also a couple of loose ends on that side, so maybe better to do it in another step [13:51:54] so right now, for the HTTPS traffic, it's all using https://search.svc.(eqiad|codfw).wmnet/ in terms of SNI and Host: headers? [13:52:12] yep [13:53:01] so, there's a few ways you could do this I guess, at the LVS+backend layer, that seem semi-reasonable: [13:53:50] 1) Configure new listeners on new ports on all the backend nodes for some other hostname like search2.svc, and configure that as a separate LVS service and DNS entry and service IP, etc. [13:54:50] 2) Re-use the same ports and same IPs and same LVS service, and just create a new hostname search2.svc aliasing the old one, and differentiate in the backends' HTTPS configuration by splitting on SNI (e.g. apache server stanzas for the different hostnames, with different certs?) [13:55:12] (but then I suspect you're using puppet certs, I'm not sure if we can even make different certs with different SNIs for the same puppet host?) 
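A small diagnostic sketch for the SNI question being raised here: connect to the same IP:port with different SNI names and report which certificate is served, which is a quick way to verify an SNI-split backend (option 2). It assumes the third-party cryptography package; the hostnames in the usage comment are only the examples from this discussion.

```python
import socket
import ssl

from cryptography import x509
from cryptography.hazmat.backends import default_backend

def cert_subject_for_sni(host, port, sni_name):
    """Return the subject of the certificate a server presents for a given SNI."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False          # must be disabled before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE     # we only want to *see* the cert here
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=sni_name) as tls:
            der = tls.getpeercert(binary_form=True)
    cert = x509.load_der_x509_certificate(der, default_backend())
    return cert.subject.rfc4514_string()

# e.g. compare what comes back for the two names on the shared endpoint:
#   cert_subject_for_sni("search.svc.eqiad.wmnet", 9243, "search.svc.eqiad.wmnet")
#   cert_subject_for_sni("search.svc.eqiad.wmnet", 9243, "search2.svc.eqiad.wmnet")
```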
[13:56:37] I'm suspicious of SNI support in client libraries (I've been bitten before) [13:56:38] the main functional difference (aside from 2 being easier to configure if it works at all) is that with (1) you'd have multiple service defs at the lvs+etcd level to depool machines in and out of independently (the same set of machines, but two distinct lists) [13:56:57] and with (2) they all stay together in terms of shared pooling-state [13:57:24] SNI is pretty ancient at this point, I'd be shocked if anything we can still legitimately run at all doesn't support it [13:58:20] also, since not all the elasticsearch clusters spread over all nodes, we probably need different LVS endoint anyway [13:58:36] yes, I know strange setup [13:58:56] but they can share ports, or is it a given that each cluster must have unique listen ports when they share machines? [13:59:30] there is an nginx doing ssl termination on the nodes, so we can abstract that via nginx [14:00:03] but on the elastic side, we're probably opening a world of pain if we try to have them listen on the same port [14:00:18] we can probably play with lo:aliases or something like that [14:00:19] for some contextual reference: end-user agents which lack SNI support that ssllabs.com bothers to report on are: Android 2.x, IE6-8 on XP, Java6. [14:00:48] ok [14:01:11] So SNI probably OK, unless proven otherwise [14:01:26] yeah but pointless if you don't want shared lists of backends for all clusters [14:01:37] yep [14:01:55] unless there is some magic that I don't know about at LVS level [14:02:11] there's not, because it doesn't see inside protocols for things like SNI or Host: headers [14:02:22] it just routes traffic based on IPs and ports in headers [14:02:24] that's what I thought, not enough magic :) [14:02:40] so we need either different IPs or different ports [14:02:43] or both [14:03:22] right, currently you have 1 IP for the whole DC, flowing port 9243 through LVS for HTTPS [14:03:31] correct [14:03:50] if you keps the same listen IP and differentiated solely on port, that would have to be true on the client end as well [14:04:07] yep, that's not an issue [14:04:11] e.g. a client would have to know to connect to search.svc:9243 for cluster1 and search.svc:9343 for cluster2 [14:04:30] and you'd get separate node lists to configure and/or depool nodes from independently [14:04:57] or you can make it a separate new IP and hostname like search2.svc, keep your same standardized port number. [14:05:14] from that POV, it's all basically the same to LVS, either way works. whatever seems more intuitive or cleaner for client config. [14:05:42] oh except if we keep the same IP+port, then you're back to SNI at the apache level to tell the difference [14:05:44] So if we differentiate on ports we can still have different host list for the same service name but different ports? I'm sure that's no problem on the LVS side, but do you know if there are some assumptions on the puppet side of things? [14:05:59] err sorry, that's not true, I got lost [14:06:41] let me redo the above, clearer: [14:07:21] 1) You could keep your existing search.svc hostname and its IP, and differentiate solely on port number. Two distinct LVS services, with distinct node lists that may happen to overlap, with separate depooling. [14:08:17] 2) You could keep your port number constant, and set up a search2.svc hostname + separate IP to differentiate on. Still two distinct LVS services. 
Your apache listeners now also differentiate on listen-IP instead of listen-Port from (1), and you still have the distinct, overlappable nodelists. [14:08:31] (1) seems very slightly simpler [14:09:13] (in that you don't have to go make DNS changes and allocate new IPs every time you make new clusters) [14:09:48] and find new names for those clusters (you can't imagine the long discussions we've already had just on that topic :) [14:10:08] there should be no lvs/puppet/pybal-level issue with that. It's essentially the same pattern we use for e.g. supporting HTTP+HTTPS for public wiki endpoint, or HTTP/HTTPS for your current nginx on search.svc [14:10:34] you can think of 9243 and 9343 (or whatever) as being like http+https [14:10:47] except the node list is different [14:10:49] in terms of how puppet->lvs/pybal->nginx sees the listening/forwarding setup [14:11:02] the node list is always different. even for http+https we depool separately [14:11:14] ok, so all good (I have not checked the code lately) [14:11:23] search-http and search-https are the same [14:11:38] Ok, that's a very nice transition for my second question: https://gerrit.wikimedia.org/r/c/operations/puppet/+/467684 [14:11:46] (they use separate confd pools "elasticsearch" and "elasticsearch-ssl") [14:12:12] since we want to have SSL termination on different ports, and that's not supported by our current puppet code, it needs a fix [14:12:26] I'm looking for a reviewer and you seem to have written a good chunk of that class [14:13:36] * bblack disavows all knowledge [14:13:58] * gehel can past a `git blame` if needs be [14:14:13] lol [14:14:19] I think the only thing you have to be careful of, is if you define multiple tlsproxy listeners that differentiate on port instead of ip, you can't use tlsproxy's "redir_port" to auto-redirect port 80. [14:14:29] probably, that didn't work for non-default TLS ports even in the singular [14:14:55] yeah we talked about that yesterday [14:15:09] apparently redir_port is not used, right gehel ? [14:15:17] it's not used anywhere, yet [14:15:24] yep, not used anywhere as far as I can grep :) [14:15:31] it's planned to be used in tlsproxy's primary use-case on the edge caches though [14:15:49] once we get all the non-canonical domains securely redirected from elsewhere, which in turn is dependent on certcentral :P [14:15:59] it's been a very long time coming heh [14:16:05] but there isn't any sane way to redirect if there are multiple TLS prots for a single HTTP port [14:16:11] I see two options, either go towards forcing to have only one port if redir is used, or move it to an hash of redir ports [14:16:22] dest => src [14:16:38] right [14:17:06] I think (not tested) that the current code would work already if you have a 1:1 mapping of redir_port:tls_port [14:17:07] the only problem puppet wise is to check consistency as this is a define [14:17:31] gehel: it wouldn't, because the template for the redirect doesn't even try to set a new port in the redirect destination URI [14:17:43] it juse uses https://$servername$1 or whatever [14:18:14] you could fix that, where you plug in a :port variable there if the tls port isn't 443 [14:18:17] Oh right, so it even only works ifthe TLS port is 443? 
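On the redirect problem just raised, a plain-Python sketch of the URL-building half only (the real thing is an nginx template in the tlsproxy module): the redirect target has to carry an explicit port whenever the TLS listener is not on 443. This does nothing about the separate port-80 listener conflict that comes up next.

```python
def redirect_location(server_name, request_uri, tls_port=443):
    """Build the HTTPS redirect target a port-80 vhost would emit; the current
    template effectively always produces the 443 branch."""
    if tls_port == 443:
        return f"https://{server_name}{request_uri}"
    return f"https://{server_name}:{tls_port}{request_uri}"

assert redirect_location("search.svc.eqiad.wmnet", "/", 9243) == \
       "https://search.svc.eqiad.wmnet:9243/"
```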
[14:18:39] but then the conflict you still run into is two such localssl's with only a port differential conflict in their port 80 listeners [14:19:15] since almost surely nobody will ever actually do any of this in practice [14:19:22] unless you have redir_port != 80 (which does not really make any sense) [14:19:35] 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team, 10ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10ayounsi) [14:19:40] probably the simplest approach would be to put a short snippet in the define that makes it break/fail is the tls_port != 443 and redir_port is defined [14:19:59] and anyway, how would you do the routing to the right port? [14:20:04] yeah, I can add that [14:20:21] and then we don't have to mess with the redirect template, and in the multi-port-listeners case one of them has to be non-443 and it would break that way too [14:20:46] +1 to start simple and just fail if overused [14:23:27] ok, change ammended [14:23:41] the other pathway is complicated: we'd have to template in the non-443 redirect destination, and also require unique redir_ports if defined (by I guess making a $redir_port_str that says "undefined" if not defined, and putting that in the notify hack too) [14:23:59] or I guess, actually in a separate notify hack, or something, since that one only covers the default_server [14:24:08] I don't know [14:24:24] and we don't even know of a use case that would exercise that for real [14:25:18] someday port 80 everywhere will just return ECONNREFUSED [14:26:02] but more-seriously: are you merging that today? (the tlsproxy patch) [14:26:27] it's super trivial, but any change to tlsproxy carries Risks like "omg you broke nginx on all the public terminators" [14:27:01] nice jedi mental trick on gehel, bblack [14:27:02] at least pcc-check on some random cpNNNN hosts probably, and maybe have someone around to do a quick puppet run post-merge on a cache to validate it went fine. [14:27:57] these are not the redir_ports you are looking for [14:28:12] I'll make sure there is someone around when merging! [14:28:28] for now, meeting time [14:28:58] time for me to sort through the 5 different half-done things I have in different open terminals and tabs heh [14:29:20] good luck! [14:29:36] * gehel has no idea what bblack is talking about [14:29:54] oh self-reminder: maybe fix interface-rps for low queue counts [14:30:02] it will change your case too, gehel. [14:30:20] but yeah, it kinda doesn't make sense to splatter-pin small queue counts out across large cpu count, like it does today [14:33:06] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/interface/files/interface-rps.py#223 [14:33:37] ^ it's all in that function. It also splatters when there's an uneven match, e.g. 20 queues + 16 cpus = first 16 queues are 1:1, final 4 get mapped to 4 cpus each. [14:33:53] it's probably a better trade to just always go 1:1 and leave things "uneven" [14:34:39] (for the 10G cards that matter more like bnx2x/bnxt_en, we explicitly set the queue count to the applicable CPU count anyways, forcing 1:1 at a different level) [14:35:06] (but for bnx2 and tg3, we just get whatever queue count the hardware gives (8 for bnx2, 4 for tg3)) [14:40:53] afternoon vgutierrez [14:41:21] hi Krenair [14:41:47] do we want to ask vol.ans and _j.oe_ to re-review? 
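A rough sketch of the queue-to-CPU assignment behaviour described above for interface-rps.py, next to the proposed always-1:1 alternative. This is only the shape of the logic for the CPUs-greater-than-queues case (the tg3/wdqs one), not the production code.

```python
def current_style_map(num_queues, num_cpus):
    """Current behaviour, roughly: with more CPUs than queues, every CPU gets
    used, so 4 hardware queues on an 8-core NUMA node means 2 CPUs per queue."""
    mapping = {q: [] for q in range(num_queues)}
    for cpu in range(num_cpus):
        mapping[cpu % num_queues].append(cpu)
    return mapping

def one_to_one_map(num_queues, num_cpus):
    """Proposed alternative: exactly one CPU per queue, and leave the surplus
    CPUs out of RX processing rather than splattering queues across them."""
    return {q: [q] for q in range(min(num_queues, num_cpus))}

# tg3 example from the discussion: 4 RX queues, 8 usable cores
print(current_style_map(4, 8))  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
print(one_to_one_map(4, 8))     # {0: [0], 1: [1], 2: [2], 3: [3]}
```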
[14:41:57] Krenair: we already have a +1 by bblack [14:42:12] so IMHO I'd go with it [14:42:13] yeah [14:42:30] I've already rebased the authdns one [14:42:43] and added the labs/private secret for authdns-certcentral ssh key [14:42:50] (it was causing pcc to fail in that CR) [14:43:23] ok [14:43:37] regarding the authdns one, I'd move the ssh public key to the secret repo, to be consistent with other ssh keys [14:43:45] yeah I dunno why they're like that [14:43:50] but okay [14:44:05] I'll take care of deployment-prep's key for it later [14:44:48] and regarding https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459809/13/hieradata/role/common/certcentral.yaml [14:44:55] I'm probably missing something there [14:45:11] I'm aware that authdns_servers in common.yaml already contains the sync_dns_servers [14:45:27] but hiera() lookup function only works if the value is a string [14:45:49] wait really? [14:46:08] "The hiera() lookup function performs a Hiera lookup, using its input as the lookup key. The result of the lookup must be a string; any other result will cause an error. [14:46:12] from https://puppet.com/docs/hiera/3.2/variables.html#the-hiera-lookup-function [14:46:36] maybe I'm missing something [14:46:56] don't we use hashes and stuff everywhere? [14:47:32] here's some existing stuff [14:47:45] modules/profile/manifests/elasticsearch.pp: Hash[String, Elasticsearch::InstanceParams] $instances = hiera('profile::elasticsearch::instances'), [14:47:45] modules/profile/manifests/lvs/realserver.pp: Hash $pools = hiera('profile::lvs::realserver::pools'), [14:47:45] modules/profile/manifests/mediawiki/hhvm.pp: Hash $extra_fcgi = hiera('mediawiki::hhvm::extra_fcgi', {}), [14:48:08] of course [14:48:37] I mean the hiera() function to interpolate a whole hash [14:48:47] from one data file to another [14:48:50] not from puppet code [14:48:52] oh [14:50:09] okay so [14:50:26] what it'd be nice to do is get rid of the sync_dns_servers bit from hieradata/role/common/certcentral.yaml [14:50:42] yep, as it is duplicated data [14:50:43] and just reference the 'authdns_servers' key [14:51:21] but the hiera interpolation stuff only works on strings [14:51:25] so we can't do that, right? [14:51:43] not without going through puppet code to do that [14:51:56] sounds correct to me [14:59:49] while we're on the certcentral stuff in general, another longer-term thing we'll have to eventually solve, related to the waiting-for-clock-skew period and such: [15:00:13] is handling client-side integration of OCSP stapling reliably. Which is going to mean shipping the client multiple certs when going through a renewal. [15:00:14] vgutierrez, how about we merge the main commit and come back to the follow-up commit? [15:00:29] (the old solution doesn't do any of this of course) [15:00:51] sure.. let's burn cercentral1001.eqiad [15:00:55] O:) [15:00:57] the idea is that for OCSP stapling to work reliably, the client-https-service has to have the cert ahead of time, to run a local OCSP fetch and populate. [15:01:08] burn? 
[15:01:20] so you can imagine a hypothetical model like: [15:02:25] CC manages it more like a "set" of overlapping certs: the old one it's renewing well ahead of time, and the new renewal it has just-recently fetched from LE (but is perhaps not "live" yet, because we're waiting a day for clock skew before deployment) [15:02:49] so [15:02:58] it could ship the whole set to the client (multiple files), and an indicator as to which is live [15:03:28] so the old one would still be "live" initially while we wait clock skew, but having the new one sent over lets the client start prefetching OCSP data for it, so that later when it becomes live the OCSP is ready to go. [15:03:53] there's probably like 10 ways to factor that out, and ? about how much of the state is managed client-side or CC-side [15:04:06] I think we already have the distinction between 'new' and 'live' certs, we just don't wait any time after getting it signed by the CA [15:04:14] indeed [15:04:22] we could have it: [15:04:29] yeah we already made a ticket about clock-skew holding for later [15:04:45] a) wait a while before deploying to live, if there is already a trusted live one [15:05:00] b) the API expose the new one anyway for clients to request if they want [15:05:30] and have the clients do their OCSP thing with that [15:06:02] yeah there's some nits around how we handle that, e.g. the interface with puppet file resources, and how we flag which of the multiple we send is live or not [15:06:05] but yeah [15:06:12] something like that :) [15:06:30] just run puppet in certcentral1001 [15:06:39] some minor issues [15:06:42] Error: Cannot create /etc/certcentral/accounts/6e01c693ed6e9d9a6b5930923ecef104; parent directory /etc/certcentral/accounts does not exist [15:06:45] Error: /Stage[main]/Certcentral::Server/File[/etc/certcentral/accounts/6e01c693ed6e9d9a6b5930923ecef104]/ensure: change from absent to directory failed: Cannot create /etc/certcentral/accounts/6e01c693ed6e9d9a6b5930923ecef104; parent directory /etc/certcentral/accounts does not exist [15:07:35] don't we create /accounts somewhere? [15:07:51] I thought either in packaging or puppet... [15:07:53] hm [15:08:23] it's missing here: https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/458554/18/debian/dirs [15:08:41] hm [15:08:45] yeah it's not in debian/dirs [15:08:50] and puppet only does the subdirectories of that [15:08:55] yup [15:09:00] we need to fix the .deb package [15:09:19] besides that, puppet failed to restart nginx [15:09:27] well, mkdir for now heh [15:09:37] see what the next issue is :) [15:10:17] https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/467992 [15:11:33] vgutierrez, nginx failing to start? [15:12:15] Oct 17 15:11:46 certcentral1001 nginx[17121]: nginx: [emerg] BIO_new_file("/etc/ssl/dhparam.pem") failed (SSL: error:02001002:system library:fopen:No such file or directory:fopen('/etc/ssl/dhparam.pem','r [15:12:19] Oct 17 15:11:46 certcentral1001 nginx[17121]: nginx: configuration file /etc/nginx/nginx.conf test failed [15:12:22] Oct 17 15:11:46 certcentral1001 systemd[1]: nginx.service: Control process exited, code=exited status=1 [15:12:46] hm [15:12:57] apparently that one comes from sslcert::dhparam ? 
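Going back to the OCSP stapling idea sketched above: a hypothetical data shape and client loop for the "set of overlapping certs plus a live pointer" model. None of these field names or functions exist in certcentral today; prefetch_ocsp and activate stand in for whatever the client side would actually do.

```python
# Hypothetical API payload per certificate name: both the old (still live)
# cert and the freshly issued one waiting out the clock-skew delay.
cert_set = {
    "live": "old",
    "certs": {
        "old": {"fullchain_pem": "...", "key_ref": "old"},
        "new": {"fullchain_pem": "...", "key_ref": "new"},
    },
}

def refresh(cert_set, prefetch_ocsp, activate):
    """Prefetch an OCSP staple for every cert in the set, then serve whichever
    one is marked live -- so when 'new' flips to live a day later, its staple
    is already warm and stapling never goes dark during the switch."""
    for name, cert in cert_set["certs"].items():
        prefetch_ocsp(name, cert)
    activate(cert_set["certs"][cert_set["live"]])
```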
[15:13:34] I imagine this error was introduced when we added the ssl_ciphersuite function to the config [15:13:54] yup [15:14:26] will upload a fix [15:16:19] of course we don't have a proper cercentral config right now, so it fails to start [15:16:24] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467997 [15:16:27] the code needs at least 1 configured certificate [15:16:57] right [15:17:06] we should do the DNS thing before defining a test cert there [15:17:12] indeed [15:17:23] yeah [15:17:33] sslcert::dhparam gets included in all the other foo::cert things [15:17:54] modules/letsencrypt/manifests/init.pp: require sslcert::dhparam [15:18:00] modules/sslcert/manifests/certificate.pp: require sslcert::dhparam [15:18:07] 15:17:23 modules/certcentral/manifests/server.pp:98 wmf-style: class 'certcentral::server' includes sslcert::dhparam from another module [15:18:12] Okay [15:18:12] basically "if your thing installs a cert, it should also require that" [15:18:14] Seriously [15:18:15] Guys [15:18:34] This is already required outside the sslcert module [15:18:58] yes, it is. we can debate the philosophy of the puppet style enforcement some other time and place though [15:19:04] just V+2 it :P [15:19:31] right, pcc looks happy, I'm V+2 it [15:22:21] and nginx starts now :) [15:22:58] nice, I'm gotta go for a few minutes cause I need to eat something.. my brain thinks it's 17:22 and I didn't have lunch [15:23:03] I'll be back soon [15:25:07] it is 17:22 there isn't it? [15:25:54] 16:22 [15:26:03] I'm in Morocco for the week [15:28:18] BTW, we need to come with some cercentral checks for icinga :) [15:28:40] oh yes I remember you mentioning [15:28:52] yeah [15:29:43] I'm gonna create a task for that [15:29:55] and also one for bblack's OCSP thing [15:31:10] 10Certcentral, 10Operations, 10monitoring: Create icinga checks - https://phabricator.wikimedia.org/T207294 (10Krenair) [15:31:16] 10Certcentral, 10Operations, 10monitoring: Create icinga checks for certcentral - https://phabricator.wikimedia.org/T207294 (10Krenair) [15:33:57] 10Certcentral: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295 (10Krenair) [15:55:44] vgutierrez, [15:55:52] I just noticed that your snakeoil modules/secret/secrets/ssh/authdns-certcentral [15:55:59] it says RSA instead of OPENSSH [16:06:08] https://gerrit.wikimedia.org/r/#/c/labs/private/+/468012/ [16:13:52] Krenair: SSH actually allows your typical PEM as private key format [16:15:46] hm so it does [16:16:51] bblack, vgutierrez: I'm ready to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/467684 (puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/13014/) [16:17:09] but waiting for your OK and quick check on cp hosts as needed [16:17:23] gehel: ok go for it [16:17:26] vgutierrez, looks like locally my ed25519 key has OPENSSH and my rsa key has RSA, interesting [16:17:33] * gehel is going for it! [16:18:49] bblack, vgutierrez: merged [16:19:36] yeah seems fine, nothing broke on a random cache [16:19:37] Notice: tlsproxy::localssl instance unified on port 443 with server name www.wikimedia.org is the default server. [16:19:40] Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Notify[tlsproxy localssl default_server on port 443]/message: defined 'message' as 'tlsproxy::localssl instance unified on port 443 with server name www.wikimedia.org is the default server.' [16:19:55] nice [16:20:04] kool! 
[16:20:21] nothing broke on elastic either [16:20:36] thanks for the help! [16:27:31] 10Traffic, 10Operations, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10RobH) [16:27:36] 10Traffic, 10Operations, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) 05Open>03Resolved synced with @faidon, sv9 is the virtual site for the eq peering link and thus won't have real notices. the notices for... [16:29:51] Krenair: so volans is telling me that we should use keyholder to manage the SSH key :) [16:31:56] hm [16:32:09] I wonder if keyholder provides us any advantages here [16:32:53] IIRC it solves the problem of letting a group use an SSH key for connections, while not actually exposing the key material to them? [16:33:19] indeed [16:33:32] in this case it's just a service user [16:34:02] it allows to use keys with passphrases as opposed to passwordless keys [16:34:26] in a trivial way, including icinga check for unarmed keyholder, etc... [16:34:40] so that way the key can be stored encrypted in puppet-private, copied to the host, and then ops arms keyholder on the host using the passphrase stored elsewhere? [16:35:13] yes, neeeds to be re-armed on reboot [16:35:20] https://wikitech.wikimedia.org/wiki/Keyholder [16:40:02] adopting it seems pretty easy [16:40:45] yeah [16:40:47] so [16:40:51] from puppet's perspective [16:40:59] we add a Keyholder::Agent resource [16:41:24] point it at the path we deployed the key file to [16:41:33] set the trusted groups list appropriately [16:42:03] change the key file to be only readable by root [16:42:07] and change the script to use the agent instead of interacting with the key itself [16:42:30] indeed [16:42:46] we also move the secrets to keyholder in the labs/private repo [16:42:47] oh it looks like keyholder::agent will handle the file resource for us too? [16:42:50] ok [16:43:13] as for the group... [16:43:43] think it's just the certcentral group [16:43:45] indeed [16:44:04] alright [16:48:28] will upload a new PS [16:48:51] ook [16:49:25] I'm generating the prod ssh key right now [16:51:31] it's mostly 3 lines of code in puppet IIRC [16:53:37] yeah [16:53:49] just trying to remember how to access it once you've got it in keyholder [16:54:01] was it SSH_AGENT_SOCK=/run/keyholder/proxy.sock ? [16:54:04] yes [16:54:13] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh ... [16:55:14] so why doesn't that work for me on deployment-deploy01 :| [16:56:04] have you restarted the keyholder-proxy? [16:56:04] mwdeploy@deployment-deploy01:~$ SSH_AGENT_SOCK=/run/keyholder/proxy.sock ssh-add -l [16:56:04] Could not open a connection to your authentication agent. [16:56:11] is it armed? [16:56:15] sudo keyholder arm [16:56:21] yes [16:56:30] sudo keyholder status [16:56:32] is that ok? [16:56:45] looks ok [16:57:02] https://phabricator.wikimedia.org/P7686 [16:57:23] I don't see certcentral [16:57:29] yeah [16:57:36] haven't got that far yet [16:57:42] still trying to figure out how to use keyholder [16:57:54] you wouldn't find certcentral keys on that host anyway [16:58:08] you have to be in the auth group for those keys [16:58:34] uid=603(mwdeploy) gid=603(mwdeploy) groups=603(mwdeploy) [16:58:40] krenair@deployment-deploy01:~$ sudo cat /etc/keyholder-auth.d/mwdeploy.yml [16:58:40] --- [16:58:40] wikidev: [mwdeploy] [16:58:40] mwdeploy: [mwdeploy] [16:58:41] ops: [mwdeploy] [16:58:56] shy you do ssh-add -l? 
[16:58:57] no need [16:58:59] just ssh [16:59:02] to the target host [16:59:04] I tried ssh [16:59:06] keyholder is an agent [16:59:11] already [16:59:19] * volans about to enter another meeting [16:59:21] Permission denied (publickey). [17:00:30] try to restart the keyholder-proxy, sometimes it fixes the issue, although this one seems more a config one at first sight [17:00:35] figured a good first step would be to confirm I could at least list what was in the keyholder agent [17:00:38] sorry in a meeting, cannot check details [17:01:32] doesn't seem to have changed anything [17:05:36] secrets uploaded :) [17:10:36] volans, vgutierrez: I uploaded a new version, haven't tested [17:10:47] running pcc now [17:12:33] pcc is not happy: https://puppet-compiler.wmflabs.org/compiler1002/13015/certcentral1001.eqiad.wmnet/change.certcentral1001.eqiad.wmnet.err [17:12:45] right.. authdns-certcentral [17:12:48] instead of authdns_certcentral [17:19:31] vgutierrez, you wanna fix that in labs/private or the reference in puppet? [17:19:39] reference in puppet plz [17:19:46] cause the secret in prod has the same name [17:19:50] ok [17:19:52] and it's already there [17:21:27] thx [17:21:33] let's see what thinks pcc now [17:24:36] Krenair: you've missed modules/certcentral/manifests/server.pp:125 [17:25:16] wasn't that what I changed? [17:25:52] nope [17:25:55] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459809/16/modules/certcentral/manifests/server.pp [17:26:07] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459809/15..16/modules/certcentral/manifests/server.pp [17:26:28] hmmm [17:26:37] oh wait so you did use the hyphen [17:26:39] not the other way around [17:26:45] indeed [17:26:45] but then why was it [17:26:57] so... puppet must be converting it to the underscore [17:27:29] yeah... everything there uses underscores instead of hyphens [17:27:34] so I'm renaming it, sorry! [17:28:12] np at least what's used in puppet will match the filenames [17:32:12] done in labs and prod [17:32:16] pcc seems happy now: https://puppet-compiler.wmflabs.org/compiler1002/13017/ [17:33:08] Krenair: BTW, do you think that we could include the username in the gdns-sync.py script on the SSH commandline? [17:33:21] yeah [17:33:38] feel like someone suggested that and then I forgot to actually do it [17:33:38] even as a parameter with certcentral as default value [17:33:45] yup.. it was me :) [17:33:49] :) [17:36:31] nice...so I'd suggest letting the CR rest till tomorrow and see if we can get some reviews [17:36:48] meanwhile I'm merging the debian branch change [17:36:59] yeah [17:37:00] ok [17:37:11] that change does actually touch the authdns servers [17:37:11] so [17:37:12] hmmm but we should bump the version somehow [17:37:21] 0_1-2 I guess [17:37:26] 0.1-2 I guess [17:37:31] worth being more cautious than with the fresh unused boxes earlier :) [17:37:56] yup, breaking the authdns servers isn't an ideal outcome [17:38:49] Krenair: could you add a new version in the changelog for the debian package? 
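A sketch combining the two gdns-sync.py changes discussed above: talk to the armed keyholder agent via its proxy socket instead of reading the key file, and take an --ssh-user option defaulting to certcentral. The real script's structure is assumed, not known; only the socket path and the default username come from the discussion.

```python
import argparse
import os
import subprocess

KEYHOLDER_SOCK = "/run/keyholder/proxy.sock"

def ssh_via_keyholder(user, host, remote_cmd):
    """Run a remote command with ssh, authenticating through the keyholder
    proxy agent: only SSH_AUTH_SOCK changes, no private key file is read."""
    env = dict(os.environ, SSH_AUTH_SOCK=KEYHOLDER_SOCK)
    return subprocess.run(["ssh", f"{user}@{host}", remote_cmd], env=env, check=True)

parser = argparse.ArgumentParser(description="push certs to the authdns hosts")
parser.add_argument("--ssh-user", default="certcentral",
                    help="username for the ssh connection to the authdns servers")
# args = parser.parse_args()
# ssh_via_keyholder(args.ssh_user, "<authdns host>", "<sync command>")
```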
[17:39:01] yeah [17:39:03] thx [17:40:50] hmmm [17:40:56] being pedantic it should be 0.1-2 [17:41:00] same code [17:41:05] different packaging cod [17:41:07] true [17:41:07] *code [17:41:08] O:) [17:56:38] we should probably have tasks for the current open certcentral commits [18:27:03] 10Traffic, 10Operations, 10Performance-Team: Investigate 200-300ms increase in responseStart.p75 - https://phabricator.wikimedia.org/T207315 (10Krinkle) [18:27:37] do you know by any chance if "check_sslxNN" is still used for any checks? grepping through prod icinga config and puppet repo.. it seems no check is using this checkcommand? [18:27:49] want to determine if i have to worry about making it work on stretch [18:28:15] it's only about the "xNN" varierty , not regular check_ssl [18:28:42] ah, i found a commit " Replace check_sslxNN with check_ssl_unified [18:43:07] removed [19:38:17] 10netops, 10Analytics, 10Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) p:05Triage>03High [19:38:28] 10netops, 10Analytics, 10Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) [22:29:11] 10Traffic, 10Operations, 10Performance-Team (Radar): Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (10Krinkle) [22:33:04] 10Traffic, 10Operations, 10Performance-Team (Radar): Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (10Bawolff) Further comments on irc seem to suggest that its related to Virgin Media's "WebSafe" proxy feature that's supposed to... [22:34:20] 10Traffic, 10Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327 (10RobH) [23:51:06] 10netops, 10Analytics, 10Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10bd808) Most of the cloud infrastructure hosts either are in the public vlan or are moving there as we update and replace hardware. The labsdb10...