[01:15:15] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (Dzahn) ` 20:09 <+icinga-wm> PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100% ... cp1078 login: [33059.724815] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeo...
[06:47:33] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (Vgutierrez) Thanks for handling cp1078 @Dzahn. It looks like 4.9.144 is also affected
[07:13:24] netops, Operations: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (elukey) p: Triage→High
[07:40:53] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (Vgutierrez) so far we've seen crashes in the following servers: * cp1078 (twice) * cp1080 * cp1084 * cp1085 on the Dell community forum there is a [[ https://www.dell...
[07:45:45] netops, Operations: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (Peachey88)
[07:47:10] netops, Operations, monitoring: create a test for multicast relay - https://phabricator.wikimedia.org/T82038 (Peachey88)
[08:01:22] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (MoritzMuehlenhoff) The reports in that thread are for RHEL 7, which uses 3.10 as the base layer kernel (but with backports for all kinds of drivers, so it's hard to te...
[08:51:54] netops, Operations: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (elukey)
[11:23:42] Traffic, Operations, Proton, Readers-Web-Backlog, and 2 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Jhernandez) @phuedx and I talked about this. We need some documentation about how proton interacts with load balancers an...
[11:30:25] Traffic, Operations, Proton, Readers-Web-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Joe)
[11:31:35] Traffic, Operations, Proton, Readers-Web-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Joe) @Jhernandez I'm happy to explain to you whatever you might want to know about our load-balancing infrastructure, and...
[14:33:13] Certcentral: certcentral is incompatible with the current python3-acme version shipped in stretch-backports - https://phabricator.wikimedia.org/T213820 (Vgutierrez)
[14:34:26] Certcentral: certcentral is incompatible with the current python3-acme version shipped in stretch-backports - https://phabricator.wikimedia.org/T213820 (Krenair) Wasn't this https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/459866/ ?
[14:35:37] UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4365: ordinal not in range(128)
[14:35:46] something crapped out when doing the authdns-update
[14:35:51] * akosiaris investigating
[14:46:49] Certcentral, Patch-For-Review: certcentral is incompatible with the current python3-acme version shipped in stretch-backports - https://phabricator.wikimedia.org/T213820 (Vgutierrez) @Krenair yeah, same issue, the proper patch is https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/484438...
[14:53:51] Åland Islands
[14:53:52] ?
[14:55:09] I am gonna go on a limb here and say we don't need utf-8 chars in our geo-maps file
[14:55:14] bblack: ^
[14:56:48] this is one thing I hate about python and locales
[14:57:26] this only broke for me because I force a locale of C for my account on wmf servers
[14:57:48] changing it to en_us.utf-8 fixed it of course
[14:59:17] https://gerrit.wikimedia.org/r/#/c/operations/dns/+/484445
[15:07:20] Krenair: BTW, last night certcentral automatically renewed the first certificate that reached the renewal threshold
[15:07:32] successfully?
[15:07:35] indeed
[15:07:37] cool
[15:07:38] root@certcentral1001:~# openssl x509 -noout -in /var/lib/certcentral/live_certs/pinkunicorn.rsa-2048.crt -dates
[15:07:38] notBefore=Jan 15 04:00:10 2019 GMT
[15:08:16] root@certcentral1001:~# openssl x509 -noout -in /var/lib/certcentral/live_certs/pinkunicorn.ec-prime256v1.crt -dates
[15:08:16] notBefore=Jan 15 04:00:18 2019 GMT
[15:08:24] both versions got renewed as expected
[15:08:41] are there any renewed ones that are actually being served?
[15:08:47] not yet
[15:09:20] librenms one is going to get renewed at the end of the week
[15:09:49] notAfter=Feb 17 15:50:45 2019 GMT
[15:09:54] so in 2 days
[15:10:07] k
[15:13:11] akosiaris: yeah seems sane
[15:20:15] (but it also seems crappy that user locale settings should affect our software execution at all)
[15:21:36] I could kind of understand if it crapped out showing the diff to the user I guess, but you'd think locale for non-display things would be a fixed value even if not explicitly specified, hmmm
[15:22:13] why do you force C locale?
[15:22:46] vgutierrez: btw as we get into the bigger public certs and all that, one of the things to be sorted (it might be in a ticket already?) is to get OCSP integrated at the puppet level.
[15:23:31] we have some OCSP offline fetch stuff that integrates for tlsproxy for static manual certs, but I think it probably has hardcoded paths and such and won't drop right in for the certcentral pathnames, etc?
[15:24:51] only OCSP ticket we have that I see with a simple ctrl+F is https://phabricator.wikimedia.org/T207295
[15:25:02] (in the certcentral phab project)
[15:25:03] is that it?
[15:25:29] well I guess that's a concern, too, and maybe should factor into design decisions
[15:25:40] so... no then. ok
[15:26:06] gtg
[15:26:09] but even if we ignore the whole "it needs to be ocsp stapled before it goes live" thing, right now our OCSP puppetization that's used by tlsproxy (a) doesn't mix with legacy or new LE case and (b) hardcodes /etc/ssl/localcerts/ paths
[15:26:49] so we can't ocsp them at all yet. but as soon as we fix the basic integration there, our next problem would be making sure OCSP is fetched for a new cert before it goes live.
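For context on the UnicodeDecodeError discussed above (14:35–15:21): in Python 3, open() without an explicit encoding argument falls back to locale.getpreferredencoding(False), so a forced C locale turns every text read into ASCII-only. A minimal sketch of that failure mode and the locale-independent fix; the "geo-maps" filename here is just illustrative:

```python
import locale

# Under LANG=C / LC_ALL=C this typically reports 'ANSI_X3.4-1968' (i.e. ASCII)
# on Debian, which is what open() defaults to without an explicit encoding:
print(locale.getpreferredencoding(False))

# So reading a UTF-8 file containing "Åland Islands" without an encoding blows
# up with: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
# broken = open('geo-maps').read()

# Passing the encoding explicitly makes the read independent of the caller's locale:
with open('geo-maps', encoding='utf-8') as f:
    text = f.read()
```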
[15:27:22] relatedly and mentioned in the above ticket is the clock skew problem, which takes days anyways
[15:28:17] so if we do basic OCSP integration (just get the pathname issues fixed up on the OCSP puppetization side to handle CC cert paths sanely), and then we do what we need to do for clock skew, the rest will take care of itself (since OCSP will fix itself before clock skew window is over)
[15:29:41] what we need to do for clock skew is: when it's a renewal (not a first-ever copy of a cert, but an actual renewal), after we fetch the new cert, we need to hold it for some configurable time period (days) before putting it in live use. But it should be held on the client side too, not just CC-server side.
[15:29:52] (so that the client can take care of its OCSP)
[15:29:57] yeah.. first step is being able to keep a cert for a few days as new instead of live to avoid clock skew issues and have time to get ocsp stapled and so on
[15:30:16] right
[15:31:13] the only existing case where we have OCSP all working smoothly is the manual unified wildcard certs from GS+Digicert
[15:31:40] and in those cases, what we do is put timestamps (well, year numbers) in the certificate filenames and deploy overlapping sets and choose the live one.
[15:32:56] so e.g. the current working cert might be "globalsign-2018-rsa-unified.crt", and when we issue the next one we immediately push it to the TLS terminators as an additional new cert "globalsign-2019-rsa-unified.crt" which is dropped in the certs directory and configured for the OCSP stapler, but not live for nginx config
[15:33:15] and then later after the clock skew delay is done, the nginx config switches to the 2019 cert filename.
[15:33:54] we don't necessarily have to mirror that same design/system. We could hack up the OCSP stapler itself to operate differently, or have a different way of rotating certs into use, etc.
[15:34:19] but at the end of it all, we need the renewals to be reliably stapled and out of clock skew before they hit nginx config.
[15:44:51] (ocsp config needs to be able to remove old ones too I guess, or it will keep stapling all past certs forever)
[15:46:38] ... all that being said: so long as we're not pushing the LE cert into live use for the unified case, and just deploying it as a manual backup cert we could switch to, we don't have to solve that for it.
[15:47:04] but it might be simpler in the long run to solve it first and deploy it right the first time
[16:00:23] bblack: currently OCSP stapling happens individually on every server that acts as TLS terminator, mainly cp servers, right?
[16:03:19] netops, Operations: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (ayounsi) a: Papaul Papaul, can you please verify the status of msw-d7-codfw and its link to msw1-codfw, and replace any faulty part if necessary. Thank you.
[16:06:15] vgutierrez: correct. It's possible to centralize it for the LE case independently of our manual stuff (by doing the stapling of "live" and "staged renewal" certs on the CC server, and then pushing the stapling data file out to clients too just like we do certs).
[16:06:30] but then I'd note these caveats before you just dive in:
[16:06:40] yeah, python3-cryptography 2.4 already implements OCSP
[16:06:49] so I was thinking in doing it from certcentral servers
[16:06:54] well
[16:07:23] anyways, continuing like I didn't hear that:
[16:08:32] 1) OCSP data is fairly ephemeral. Over the past few years from a few providers, we've seen lifetimes ranging from ~4h to ~7d. Latest/best practice seems to be these more week-like values, but honestly I don't even know what kind of lifetimes LE gives on OCSP data. Even if we knew the present value (might be helpful), it can change on the provider's side at any time.
[16:09:43] 2) If you centralize it, you're responsible for keeping it live and correct and monitoring it. Everything breaks if it gets stale or invalid, and it will break for all the servers you've centralized it for.
[16:11:57] and continuing like I did hear that:
[16:12:49] 3) The current ocsp stapler script we use is a python script that doesn't use python3-cryptography, and instead drives the openssl command. It's been pretty stable lately, but it went through a number of bugfixes along the way.
[16:13:37] So, you might want to start with just re-using that, but on the CC server instead of per-tls-terminator, and then think about replacing it with something based on python3-cryptography and mirroring all of the validity checks and re-testing that it all works sanely
[16:14:12] I guess it depends on your confidence level and time estimates
[16:14:32] but replacing the "update-ocsp" script with something new sounds like unnecessary risk+time at this point.
[16:16:37] the nice thing is since you have a daemon at all, you can schedule OCSP fetches more intelligently than the current cronjobs
[16:17:12] (as in, look at the time to OCSP expiry and fetch it when it's half-dead, and retry more aggressively as it nears real expiry if warranted)
[16:17:49] the current cronjobs just fire every 12 hours, which only works because we happen to know the lifetimes the operators are using
[16:20:27] err I guess digicert is currently a week, but globalsign is currently 4d
[16:21:13] ----
[16:21:58] another option might be to do both (leave the local redundant ocsp cronjobs on the TLS terminators, and also deploy an initial ocsp fetched by CC with any new/renew cert?)
[16:22:43] but it might be tricky to puppetize that (to only deploy the ocsp file once when it's new)
[16:24:40] right, I'm not a big fan of the current update-ocsp, but as you just said, it's stable and works :)
[16:27:18] yeah the shelling out isn't ideal for sure, and the fact that it's a one-shot script
[16:27:27] but it might turn into quite a rabbithole to replace it initially, too
[17:06:20] netops, Operations: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (Papaul) Open→Resolved looks like the mgmt switch froze have to unplug and plug the power back. Switch is back up
[17:08:18] bblack: thoughts about https://gerrit.wikimedia.org/r/#/c/operations/dns/+/484509/ - https://phabricator.wikimedia.org/T213561 ?
[17:22:37] ottomata: well there's probably some things to talk about re what "round robin" really means and/or accomplishes. But if it's just a bootstrap list, and the clients are smart enough to randomize and/or iterate the list themselves instead of just doing gethostbyname(), maybe ok.
[17:23:13] bblack: ya they are
[17:23:27] they usually take a list themselves for bootstrap
[17:23:44] and will try each one round robin style until they find one that responds with the actual brokers to use
[17:23:59] right, but do they have the DNS-level code built in to fetch such a list? most software when given a hostname uses some API that devolves down to gethostbyname(), which only returns one address.
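A minimal sketch of the distinction bblack draws here: socket.gethostbyname() hands back a single IPv4 address, while socket.getaddrinfo() exposes the full A/AAAA set behind a round-robin name, which a client could then shuffle and iterate itself. This is only an illustration of what a "smart" bootstrap client would need to do, not how librdkafka actually resolves names (that is exactly the open question in the discussion); the hostname in the usage comment is hypothetical.

```python
import random
import socket

def bootstrap_addrs(name, port=9092):
    # socket.gethostbyname(name) would return just one IPv4 address;
    # getaddrinfo() returns every A/AAAA record the resolver hands back.
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    addrs = sorted({info[4][0] for info in infos})
    random.shuffle(addrs)  # don't rely on caches/libc to randomize for us
    return ['%s:%d' % (addr, port) for addr in addrs]

# e.g. bootstrap_addrs('kafka-bootstrap.eqiad.wmnet')  # hypothetical round-robin name
```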
[17:24:15] https://github.com/edenhill/librdkafka/issues/1232#issuecomment-311307806
[17:24:19] clients that are smarter have to be configured with the recursive cache IP that you'd normally find in /etc/resolv.conf and use it directly and speak DNS
[17:24:55] ^^ seems like it should work?
[17:25:28] "librdkafka uses the standard libc resolver" means gethostbyname() I'm pretty sure
[17:26:07] hm
[17:26:31] bblack: lvs would work here also, do you think that would be more likely to work?
[17:27:30] what are we trying to accomplish?
[17:28:30] or I guess I can go read the ticket, I'm kind of in half-ass multitasking mode right now
[17:28:42] np no hurry at all
[17:28:49] i just want a single alias for kafka bootstrapping
[17:29:11] since k8s stuff doesn't have access to puppet/hiera configs
[17:29:17] i don't want to maintain more lists of servers
[17:30:34] netops, Operations, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) Scheduling the work on Tuesday January 22nd, 16:00UTC, scope is Amsterdam only. That gives us the remaining of the week to monitor for any issue. Then collect/ana...
[17:32:03] what does k8s have access to for cfg mgmt?
[17:33:24] an LVS service (or anything like a real service as opposed to a set of metadata) seems overkill for a bootstrapping problem
[17:33:55] the "right" solution would probably be if kafka supported either a SRV lookup, or had better than "use the standard libc resolver", if you want to store the list in DNS.
[17:34:23] but assuming kafka doesn't do SRV lookups, without modifying kafka code to make things better, the configured explicit list you have seems like the best you could do
[17:34:55] but e.g. an LVS service as a replacement for lack of cfg mgmt, seems overkill.
[17:35:05] yeah it doesn't; i'd have to do it in my code manually and provide it to the client
[17:35:11] and DNS "Round Robin" generally won't do what you want it to do, even if it mostly works most of the time.
[17:35:14] akosiaris: thoughts about T^^^
[17:35:16] ?
[17:37:02] (the problem is there's one or more DNS caches and then libc sitting between that long list of A/AAAA records and kafka, and they won't reliably randomize things. they could just as well end up returning just the first IP all the time, especially when it matters most).
[17:38:15] ottomata: but still, there must be some config management in some sense for k8s, if nothing else you're going to plug in this roundrobin hostname there in this scheme.
[17:38:30] it's just a question of plugging in one hostname or a list of hostnames/IPs
[17:40:05] indeed
[17:40:12] maybe all this isn't worth it, just sucks to maintain more lists of hostnames
[17:42:29] Traffic, Operations, Proton, Readers-Web-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Pchelolo) With request rate as low as this endpoint is expected to have, the Varnish hit rate would probably be very close...
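For reference, a sketch of the SRV-lookup idea bblack floats at 17:33, assuming the dnspython library is available; per the discussion above, Kafka clients don't do this out of the box, so something like this would have to live in the application's own bootstrap code. The record name is invented for illustration.

```python
import dns.resolver  # python3-dnspython; an external dependency, not stdlib

def srv_bootstrap(name='_kafka._tcp.example.wmnet'):  # hypothetical SRV record
    answers = dns.resolver.query(name, 'SRV')
    # Crude ordering by priority then weight; real SRV selection does a
    # weighted shuffle within each priority level (RFC 2782).
    records = sorted(answers, key=lambda r: (r.priority, -r.weight))
    return ['%s:%d' % (r.target.to_text().rstrip('.'), r.port) for r in records]
```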
[17:45:00] yeah that's what I don't get is the CM disconnect
[17:45:16] surely there's some repo somewhere that counts as CM and can de-duplicate things like server list
[17:45:25] bblack, there's the helm deployment-charts
[17:45:32] but those are just templates with overridable values
[17:45:48] but the values are per service deployment
[17:45:55] dunno if they have plans to have some shared config m stuff
[17:46:12] who knows maybe they could add hiera support for the templates? dunno
[17:46:17] fsero: ^ ?:)
[17:47:01] yeah something like a bridge between helm's "Values" and hieradata might be helpful
[17:51:10] netops, Operations: network device audit - https://phabricator.wikimedia.org/T213843 (RobH) p: Triage→High
[17:51:17] netops, Operations: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (RobH)
[18:57:55] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Tomorrow (Jan 15) we have a meeting with some SRE folks to revisit this. We've got the cloud-analytics Hadoop...
[18:59:25] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Some links: - https://prestodb.io/docs/current/security/ldap.html - https://prestodb.io/docs/current/connector...
[20:19:24] ottomata: if the issue is you don't want to add a list of servers to values.yaml files then SRV RRs it is (assuming the software supports them). For what it's worth we are looking into helm values management solutions as we've identified the problem as well. We are still in the early stages though
[20:20:52] ok akosiaris, if yall are also looking into better ways
[20:20:58] i'll hardcode for now and hope for a better future :)
[20:21:30] I am not at all sure yet how we will bridge from puppet to helm btw
[20:21:51] best way I can think of is puppet generating some part of the values.yaml file
[20:22:23] that would work....or if puppet hieradata is just cloned on deployment host, maybe some kind of hiera aware templating?
[20:23:01] bridging directly into hieradata is almost certainly not going to happen
[20:23:04] aye
[20:23:22] using it to populate something, that's possible
[23:32:59] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (ayounsi) I think there is a distinction to make here when saying "prod", as it's made of several vlans/networks, especial...
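A rough illustration of the "puppet generating some part of the values.yaml file" idea from 20:21–20:23, boiled down to a standalone Python sketch. The file paths, the kafka_brokers key and the output layout are all invented, and any real bridge would presumably live on the puppet side rather than in a script like this.

```python
import yaml  # PyYAML

# Hypothetical hiera-style input, e.g.:
#   kafka_brokers:
#     - kafka1001.eqiad.wmnet:9092
#     - kafka1002.eqiad.wmnet:9092
with open('hieradata/common.yaml') as f:
    hiera = yaml.safe_load(f)

# Render the broker list into a values fragment a chart could consume,
# e.g. via `helm install -f values-kafka.yaml ...`.
fragment = {'kafka': {'bootstrap_servers': ','.join(hiera['kafka_brokers'])}}
with open('values-kafka.yaml', 'w') as f:
    yaml.safe_dump(fragment, f, default_flow_style=False)
```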