[05:57:18] https://tls.ulfheim.net/
[06:06:42] netops, Operations, ops-eqiad: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (ayounsi) p:Triage>High
[06:39:41] Traffic, Operations: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (ayounsi) Confirmed that all the network devices are back to a healthy state. And we received a completion notice, should be safe to repool the site. >>! In T206861#4664498, @faidon wrote: > - How come the bottom ha...
[06:56:11] netops, Operations, Patch-For-Review: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (ayounsi) Open>Resolved All set.
[06:56:28] netops, Operations, Patch-For-Review: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (ayounsi) a:RobH>ayounsi
[09:08:52] netops, Operations, Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (ayounsi) Resolved>Open The IPv6 pings eqiad alert keeps flapping, I downtimed it for 2 days and emailed the RIPE.
[10:09:14] Traffic, DNS: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (MarcoAurelio)
[10:09:34] Traffic, DNS, Operations: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (MarcoAurelio)
[13:01:40] netops, Cloud-VPS, Operations, cloud-services-team, ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (ayounsi)
[13:02:40] Traffic, DNS, Operations, Wiki-Setup (Rename): Redirect dk.wiktionary and dk.wikibooks to da.wiktionary and da.wikibooks respectively. - https://phabricator.wikimedia.org/T17357 (MarcoAurelio)
[13:03:53] volans, vgutierrez
[13:03:55] "I didn't say that, but ofc it's a possibility, although quite ugly :)
[13:03:57] Ah and I missed the hiera call in the nginx erb template.
I know we have few in the repo, but we shouldn't use them either.
[13:04:02] The obvious solution is to pass a variable, but I agree that might be quite painful to add it in any place that will call this define to generate the certs and those places might be deep inside modules, pretty far from the profiles that will be able to do the hiera call."
[13:04:35] If you're not going to allow hiera calls inside modules, there's no way anyone is going to allow a production hostname to be hardcoded inside one
[13:05:51] For the reasons you state, don't want to pass it in as a parameter either
[13:05:59] That leaves the current option as the best one - hiera inside module.
[13:08:34] I'm only willing to drop lint:ignore:wmf_styleguide stuff where you're genuinely proposing a better solution
[14:13:37] netops, Operations: relabel switch interfaces formerly saiph.frack.codfw.wmnet to frpig2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T207035 (Jgreen)
[14:18:02] Krenair: https://puppet-compiler.wmflabs.org/compiler1002/12917/certcentral1001.eqiad.wmnet/change.certcentral1001.eqiad.wmnet.err
[14:18:09] ppc is not happy with the change
[14:18:18] *pcc
[14:19:06] hieradata/role/common/certcentral/server.yaml:profile::certcentral::server::accounts:
[14:19:07] modules/profile/manifests/certcentral/server.pp: Hash[String, Hash[String, String]] $accounts = hiera('profile::certcentral::server::accounts'),
[14:19:29] yup
[14:19:43] so the role is certcentral instead of certcentral::server
[14:20:02] and the hieradata lies in hieradata/role/common/certcentral/server.yaml
[14:20:14] we should rename one of those I guess
[14:20:31] how is the role relevant though?
[14:20:34] oh
[14:20:36] oops
[14:20:50] netops, Operations, monitoring, User-fgiunchedi: Backfill librenms data in graphite with historical RRDs - https://phabricator.wikimedia.org/T173698 (fgiunchedi) Open>declined We're one year of librenms data in Graphite already, I'm declining this since we'll eventually reach librenms ret...
[14:20:52] netops, Operations, monitoring, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167 (fgiunchedi)
[14:24:37] thanks
[14:24:45] re-running pcc...
[14:25:15] Krenair: in any case.. maybe it's sane to provide empty defaults in the profile
[14:25:40] vgutierrez, so when including the profile in the role you add in params from hiera?
[14:25:54] is that even allowed?
[14:27:11] right now in modules/profile/manifests/certcentral.pp we have the hiera calls to fetch accounts, certificates and challenges
[14:27:14] netops, Operations, Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (faidon) @ayounsi, what's the current status of this task? Last update is from over a year ago, but I think some of our latest woes wit...
[14:28:42] ok then, we're missing the definition of certcentral::cert::certcentral_host for the production environment
[14:29:03] uh
[14:29:12] oops, one sec
[14:47:18] Krenair: hmmm pcc doesn't seems satisfied by PS62 --> Error: Evaluation Error: Error while evaluating a Function Call, Could not find data item certcentral::cert::certcentral_host in any Hiera data file and no default supplied at /srv/jenkins-workspace/puppet-compiler/12922/change/src/modules/profile/manifests/certcentral.pp:43:25 on node certcentral1001.eqiad.wmnet
[14:47:51] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (review/alex_monk/central-cert-service)$ grep certcentral::cert::certcentral_host hieradata/common.yaml
[14:47:51] certcentral::cert::certcentral_host: 'certcentral1001.eqiad.wmnet'
[14:48:06] I know
[14:48:08] you sure you gave it PS62 ?
[14:49:04] yup
[14:49:04] HEAD is now at d6bb87d4a9...
[14:49:19] that matches the commit id for PS62
[14:50:24] so why is it not happy with the common.yaml addition?
[15:07:24] hmmm maybe cause the compiler catalog is deprecated?
[15:07:35] somehow I lost access to compiler02.puppet3-diffs.eqiad.wmflabs *sigh*
[15:10:45] didn't that project get deleted vgutierrez?
[15:12:03] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (Ottomata) Documentation updated at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
[15:12:26] Krenair: that could explain it :)
[15:13:34] there is compiler1001.puppet-diffs.eqiad.wmflabs and compiler1002.puppet-diffs.eqiad.wmflabs
[15:13:42] and you appear to have access there
[15:36:08] yup, updating the facts didn't help
[15:36:21] so we've some issue with the certcentral role I guess
[15:36:47] node 'certcentral1001.eqiad.wmnet' {
[15:36:47] role(certcentral)
[15:37:06] modules/role/manifests/certcentral.pp, class role::certcentral, does include ::profile::certcentral
[15:37:33] hieradata/common.yaml, included everywhere, says certcentral::cert::certcentral_host: 'certcentral1001.eqiad.wmnet'
[15:38:05] modules/profile/manifests/certcentral.pp, class profile::certcentral, does class { '::certcentral::server':
[15:38:40] accounts => $accounts,
[15:38:41] active => hiera('certcentral::cert::certcentral_host') == $::fqdn, # lint:ignore:wmf_styleguide
[15:39:12] I don't get it
[15:39:30] why is it saying it can't find that
[15:40:45] netops, Operations, ops-eqiad: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (ayounsi) PEM is dead, RMA# R200206473 created.
[15:46:40] Krenair: BTW, it should be certcentral_host or something like that just to keep common.yaml naming conventions
[15:47:06] or just "certcentral" mimicking debmonitor config
[15:47:08] doesn't that just lead to yet another wmf_styleguide violation?
[15:47:12] # Debmonitor instance
[15:47:12] debmonitor: debmonitor.discovery.wmnet
[15:47:20] I don't think that's how you default hiera() anyways
[15:47:36] bblack, what?
[15:47:49] active => hiera('certcentral::cert::certcentral_host', $::fqdn)
[15:47:50] ?
[15:47:57] ... no?
we don't want to do that
[15:48:13] we want to check that the result of hiera('certcentral::cert::certcentral_host') is the current host
[15:48:34] if the result is true, this is the active host
[15:48:35] ok maybe I'm missing context
[15:48:39] if the result is false, this is not the active host
[15:48:51] so set the active param to true or false
[15:48:56] I thought active host was only used to set the puppet file-fetching source anyways?
[15:49:16] bblack, at some point someone decided that only one host should be requesting certs at any given time
[15:49:29] the other host sits there and does nothing
[15:49:49] that won't work in the long run, I remember explicitly saying something in the other direction at some past time
[15:50:09] yeah. anyway
[15:50:41] the reasoning there is that if we wait till we switch actives to flip on the other certcentral, a few problems arise:
[15:50:43] problem we've got right now is certcentral::cert::certcentral_host is used in a few locations (not just the one above) but should be globally defined everywhere
[15:50:47] so stuck it in common.yaml
[15:50:57] but there's errors like Could not find data item certcentral::cert::certcentral_host in any Hiera data file and no default supplied
[15:51:26] (1) client puppetization will fail temporarily until all the certs are fetched for the first time, unless we can stage out "activate the node" then later "activate it for puppet fetches", but even then that only covers a planned switch.
[15:52:11] (2) when we have lots of usage, we're likely to fail LE ratelimiters trying to fetch everything immediately, vs the normal smooth spacing over ~3 months.
[15:56:22] what's the link again?
[15:56:40] hm?
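The exchange at 15:47-15:48 turns on the difference between giving a lookup a default value and comparing the looked-up value with the node's own FQDN. The Python sketch below only models that difference; `lookup`, `MissingKey` and the second hostname are hypothetical stand-ins for illustration and are not Puppet's or Hiera's actual implementation.

```python
class MissingKey(KeyError):
    """Raised when a key has no value and no default was supplied."""


_NO_DEFAULT = object()
KEY = "certcentral::cert::certcentral_host"


def lookup(data, key, default=_NO_DEFAULT):
    """Toy stand-in for a hiera()-style lookup over one flat dict."""
    if key in data:
        return data[key]
    if default is _NO_DEFAULT:
        # Roughly the situation behind "no default supplied" in the log.
        raise MissingKey(key)
    return default


def is_active(data, fqdn):
    """Compare the configured host with this node: yields True or False."""
    return lookup(data, KEY) == fqdn


def active_with_default(data, fqdn):
    """Defaulting variant: yields a hostname string, not a boolean."""
    return lookup(data, KEY, fqdn)


data = {KEY: "certcentral1001.eqiad.wmnet"}
print(is_active(data, "certcentral1001.eqiad.wmnet"))
print(is_active(data, "other-host.example"))
print(active_with_default({}, "other-host.example"))
```

With a default in place, a node that cannot see the key silently gets its own name back, which is why the two forms are not interchangeable for an `active` flag.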
[15:56:51] nvm
[15:56:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/441991/
[15:57:01] well
[15:57:03] yeah there's that
[15:57:08] but we weren't having that discussion
[15:57:17] we were trying to figure out why it fails puppet compiler
[15:57:36] yeah that's what I'm looking at now
[15:57:47] so first off, clearly common.yaml has no other examples of namespaced things
[15:58:02] so something's probably amiss with this pattern in one direction or the other
[15:58:10] sure but I don't think that matters? common.yaml should be included on all hosts and it should work the same way as any other hieradata yaml file?
[15:58:27] yes yes, but if nothing else it's confusing as the only counter-example
[15:58:35] the namespaced ones are meant to match where they're used
[15:58:44] and parameter defaulting, etc
[15:58:57] anyways, that was just the first thing I noticed in the first file.
[15:59:47] blerg meeting starts!
[16:00:14] sometimes I wonder if code reviews should be scheduled as meetings
[16:00:47] for PS62 on something complex, possibly! :)
[16:01:09] sigh :)
[16:01:22] anyway gonna try removing the namespacing and asking vgutierrez to re-run pcc
[16:01:29] Krenair: ack
[16:01:37] btw
[16:01:42] we should really fix the access control on PCC
[16:01:44] like seriously
[16:04:00] well..
[16:04:02] now it worked ;)
[16:05:01] https://puppet-compiler.wmflabs.org/compiler1002/12927/
[16:05:08] I don't understand why it worked
[16:05:40] but let's move
[16:05:41] on
[16:07:47] netops, Operations, Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (ayounsi) No real update since a year ago. All switch stacks have been upgraded to a version that doesn't have this specific bug (14.1X...
[16:09:41] Traffic, Operations, Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (faidon)
[16:14:54] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (BBlack) p:Triage>Normal
[16:15:27] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (BBlack)
[16:15:30] Traffic, Operations, Patch-For-Review: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (BBlack)
[16:17:51] Traffic, Operations: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (BBlack)
[16:18:12] Traffic, Operations: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (BBlack)
[16:18:17] Traffic, Certcentral, Operations, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (BBlack)
[16:34:15] Traffic, Operations: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (Krenair) So, in scope: apt.wikimedia.org archiva.wikimedia.org dumps.wikimedia.org librenms.wikimedia.org lists.wikimedia.org mirrors.wikimedia.org netbox.wikimedia.org...
[16:39:23] Traffic, Analytics, Operations, User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927 (mforns) p:Normal>Low
[16:41:57] netops, Operations, ops-eqiad: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (Cmjohnson) swapped it with one from a spare switch....leaving ticket open to enter RMA details
[16:54:51] gehel: see T206105#4667533 . I'll poke at it a little on first, gimme a little bit!
[16:54:52] T206105: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105
[16:55:35] bblack: sure! At least I'm glad to see that the base idea wasnt entirely crazy :)
[16:55:50] as far as I can see, there is already 4 RX queues with the current config on wdqs
[16:56:20] right, but only 1 TX queue. I need to double-check on some non-prod tg3 host how some things work out, specifically:
[16:56:47] 1) Whether we can turn on the 4x TX queues successfully via ethtool, and whether that looks like a universal limit across all our tg3 on the fleet in cumin
[16:57:13] 2) Whether, if we puppetize that, it blips the card the first time you turn on interface::rps for a host (and make sure it doesn't blip if param unchanged, at least, in that case)
[16:57:40] 3) In either case (1TX or 4TX), validate that the rest of the RPS script actually works right on tg3. It looks like it would, but we've never verified.
[16:58:48] for bnx2/bnx2x, we do modify tx/rx queue counts via ethtool, and if they need to be changed (first time or right after a reboot), it does blip the card's link off/on
[17:03:59] anyways, scanning for a testbed tg3 right quick
[17:04:36] wow, 74% of production hosts use tg3 ethernet cards heh
[17:04:55] Krenair: BTW, regarding to CNs length, if the CN length is > 63 chars, LE it's going to refuse the CSR, so we should detect that and put those hostnames only in the SAN list
[17:06:02] vgutierrez, is that LE-specific or in the ACME/X509 specs?
[17:06:14] X.509 spec actually
[17:06:19] well
[17:06:27] shouldn't it just error about that?
[17:06:31] the config is static
[17:06:38] https://community.letsencrypt.org/t/ssl-for-a-63-character-max-number-of-characters-domain-name-s/36387/17
[17:06:58] yup...
that's another way to proceed
[17:07:01] but at least we should document it :)
[17:07:19] alright
[17:07:26] I'll stick in a certcentral bug for this
[17:07:50] thx
[17:08:16] https://phabricator.wikimedia.org/T207062
[17:08:36] also gonna get some wikibugs config to send certcentral task notifications here
[17:10:09] relatedly, even if the CN > 63 problem was gone, there's a separate implicit limit from ACME for DNS-01 challenges
[17:10:42] because domainnames in DNS are limited to 255 bytes total (encoded on the wire), but whatever CN/SAN you're authorizing also has to have room within that limit for prepending "_acme-challenge."
[17:11:33] (which steal 17 bytes in encoded form)
[17:12:23] you're right
[17:12:34] another limit to take into account :)
[17:12:35] so SANs are limited to 238?
[17:12:56] well, the meaning of "238" is tricky to translate back to the ASCII domainname world, a bit.
[17:13:18] and the limit only applies to DNS-01. You could create SANs without ACME, or with ACME HTTP/S or TLS-SNI challenges that went past that
[17:13:27] (in theory)
[17:14:30] re: translating the wire limit: if we assume all the bytes of all the DNS labels are plain ascii [-A-Za-z0-9] and don't need special escaping/encoding/etc
[17:16:26] then "www.example.com.": count those bytes (3 + 1 + 7 + 1 + 3 + 1) = 16, +1 extra = 17. In wire-encoded form, "www.example.com" is a 17-byte name.
[17:16:58] another way to think of it: you add up the data bytes of all the labels "www", "example", "com", +1 byte per label, +1 byte at the end.
[17:17:17] it gets trickier if the ascii form has multibyte escapes or whatever in it
[17:17:27] okay so
[17:17:41] we need certcentral to do all the necessary encoding etc. and then check the result is okay
[17:18:16] ?
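For the CN length point raised at 17:04:55 (detect over-long names and carry them only in the SAN list), a minimal Python sketch could look like this. The function name, the return shape and the choice of "first hostname that fits" are assumptions made for illustration; the 63-character threshold is simply the figure quoted in the discussion, kept as a parameter.

```python
MAX_CN_LEN = 63  # threshold quoted in the chat; an assumption, adjust as needed


def split_cn_and_sans(hostnames, max_cn_len=MAX_CN_LEN):
    """Return (common_name, san_list) for a certificate request.

    Every hostname is listed as a SAN. The CN is the first hostname
    short enough to fit, or None when none of them fits.
    """
    cn = next((h for h in hostnames if len(h) <= max_cn_len), None)
    return cn, list(hostnames)


long_name = "a" * 60 + ".example.org"   # 72 characters, too long for a CN
cn, sans = split_cn_and_sans([long_name, "short.example.org"])
print(cn)
print(len(sans))
```

Whether a request with no usable CN should be rejected outright or sent without a CN is left open here, as it was in the conversation.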
[17:18:30] yeah, or alternatively you can just ignore it and let it fail, and probably nobody will ever even try to make such a long CN/SAN
[17:19:34] it's pretty easy to just take the string length "www.example.com." = 16, add 1, and that's 17. That's what you're comparing to the 238 limit.
[17:19:41] it just sucks if you have to deal with escapes
[17:20:54] which is of course really up to your input method or whatever, it wouldn't be the DNS standards in that case.
[17:21:20] if you haven't defined any way to escape special/meta characters into CN/SAN info, then you don't have to deal with it either, and just call them unsupported :)
[17:21:58] in a dns zonefile, you can do stupid things like "w\.ww.example.com",
[17:22:18] (a literal dot in a label, which can be legal, but probably would never work right or be intuitive anywhere and seems like a horrible idea)
[17:25:31] I would probably opt for just explicitly supporting only legitimate hostname characters and maybe the underscore in domainname input and fail and tell people to file a bug otherwise
[17:25:42] [-_A-Za-z0-9]
[17:25:49] (and dots between labels)
[17:26:05] probably nobody will file the bug
[17:28:04] and then you can just do: if (challenge_method == DNS-01 && strlen(domainname)+1 <= 238) { error }
[17:28:08] err
[17:28:12] and then you can just do: if (challenge_method == DNS-01 && strlen(domainname)+1 > 238) { error }
[17:28:15] :)
[17:28:58] I seem to have used a different constant though, checking re: "238"
[17:30:11] yeah I miscounted above on IRC: the length of the string "_acme-challenge" is 15 bytes. So as a label it adds 16 bytes to the wire-encoded form.
[17:30:14] so:
[17:30:36] and then you can just do: if (challenge_method == DNS-01 && strlen(domainname)+1 > 239) { error }
[17:32:59] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (Nuria) Open>Resolved
[17:52:54] netops, Operations, Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (faidon) I just looked briefly at T172459 and it looks like the last update there was to attempt this during the switchover period whic...
[17:57:30] Traffic, DNS, Operations, User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (Urbanecm)
[17:58:58] Traffic, DNS, Operations, Patch-For-Review, User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (Urbanecm) p:Triage>Low
[18:00:25] Traffic, DNS, Operations, Patch-For-Review, User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (Dzahn) a:Urbanecm>Dzahn
[18:00:56] Traffic, DNS, Operations, Patch-For-Review, User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (Dzahn) is MobileFrontend enabled?
[18:01:51] Traffic, DNS, Operations, Patch-For-Review, User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (Urbanecm) Yes, per https://wikimedia.org.ve/wiki/Especial:Versi%C3%B3n.
[18:06:52] Traffic, DNS, Operations, Patch-For-Review, User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (Dzahn) ``` host ve.m.wikimedia.org ve.m.wikimedia.org has address 208.80.154.224 ve.m.wikimedia.org has IPv6 address 2620:0:861:ed1a::1 ``` htt...
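The DNS-01 length check worked out between 17:10 and 17:30 can be sketched in Python as follows. This is an illustration under the assumptions stated in the chat (plain ASCII labels from `[-_A-Za-z0-9]`, no escapes); the function names are hypothetical and nothing here is taken from certcentral itself. Counting labels directly avoids depending on whether the input string carries a trailing dot, which the `strlen(domainname)+1` shortcut does depend on.

```python
import re

MAX_WIRE_NAME = 255              # DNS limit on a wire-encoded name, in bytes
MAX_LABEL = 63                   # DNS limit on a single label, in bytes
ACME_LABEL = "_acme-challenge"   # 15 bytes of data, 16 on the wire
LABEL_RE = re.compile(r"[-_A-Za-z0-9]+")


def wire_length(domainname):
    """Wire-encoded length: data bytes + 1 per label + 1 for the root."""
    if domainname.endswith("."):
        domainname = domainname[:-1]
    labels = domainname.split(".")
    for label in labels:
        if not LABEL_RE.fullmatch(label) or len(label) > MAX_LABEL:
            raise ValueError("unsupported label: %r" % label)
    return sum(len(label) + 1 for label in labels) + 1


def fits_dns01(domainname):
    """True if "_acme-challenge.<domainname>" still fits in 255 bytes."""
    return wire_length(ACME_LABEL + "." + domainname) <= MAX_WIRE_NAME


print(wire_length("www.example.com"))
print(fits_dns01("www.example.com"))
```

Rejecting anything outside the allowed character set up front matches the "call them unsupported" approach suggested in the log and keeps the length arithmetic trivial.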
[18:45:25] Traffic, DNS, Operations, Patch-For-Review, User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (Urbanecm) Open>Resolved
[18:55:01] gehel: task updated with a patch. I should push that in general, and then we can move on to wdqs use-case (which might need some slight amend)
[18:58:45] bblack: thanks a lot, I'm mostly off to today, will look more tomorrow
[18:59:31] ok!
[19:04:51] bblack: the patch seems to make sense to me. We're only going to enable this on wdqs (at least atm) and we can handle a 2'' loss of connection (or do a rolling deploy and depool)
[19:05:45] so once your patch is deployed, as far as I can see, I just have to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/465624 and then magic happens
[19:06:13] I'll test first on one of the servers for a few days, I don't understand all that enough to completely trust it :)
[19:18:00] gehel: yeah. the only minor things on the patch is (1) the interface argument is redundant, it will default from $title, and (2) We may want to turn on $numa_networking for wdqs as well over in per-node hieradata (it will eventually be the default behavior, but for now it's a weird global hieradata thing)
[19:18:10] gehel: which will confine the CPUs used to the same NUMA node the card is attached to