[06:18:07] Traffic, Analytics, Operations, Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (Aklapper)
[06:19:02] Traffic, Analytics, Operations, Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (Aklapper) Introduced in T138027; T262882 might be a dup?
[06:26:05] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Aklapper) For anyone running into this, please follow https://www.mediawiki.org/wiki/How_to_report_a_bug#Reporting_a_connectivity_issue (but please note that this t...
[11:01:31] wikiworkshop.org expiry warning, known and/or can be acked?
[11:39:01] godog: the cert is issued by letsencrypt, so I think it should auto-renew, but maybe let's wait for bblack to confirm before ack'ing
[12:23:02] ema: ack, makes sense
[13:58:23] yeah the wikiworkshop.org situation is fine in practice, but need to search for a ticket first about resolving this so it doesn't alert us again
[13:58:38] it has already auto-renewed, but is in the 7 day staging wait before it switches certs
[13:59:21] I'm not sure if that's because our renewal times + staging exceeds the monitoring threshold, or maybe it's that acme was offline for a few days sometime recently and thus pushed some timelines out?
[14:08:14] bblack: quick q: I see that the various *-lb.$dc.wikimedia.org AAAA records are defined in the DNS but not actually used in the LVS hosts (unless I missed them). Were those just pre-allocation for future work or leftovers from an original plan that was superseded by real life?
[14:08:30] with * {text,test,ncredir,upload}
[14:08:59] what do you mean by "not actually used in the LVS hosts"?
[14:09:59] mmhmh let me double check one thing first
[14:10:11] I think you missed them, but I'm curious as to the source of confusion, as there might be some sense in which they're not configured somewhere the v4 is, even though things work?
[14:10:36] root@lvs1013:~# ipvsadm -Ln |grep TCP|grep ::
[14:10:36] TCP [2620:0:861:ed1a::1]:80 sh
[14:10:36] TCP [2620:0:861:ed1a::1]:443 sh
[14:10:42] [...]
[14:11:09] yeah I see them in ip addr
[14:11:33] as I said, double checking why they are showing up in my diff, give me 5
[14:11:39] I spoke too soon apparently :)
[14:11:46] ok :)
[14:12:26] but yeah they're not "real" in some sense, but neither are the v4 versions of those hostnames. Those are mostly debugging hostnames.
[14:12:33] (or allocation hostnames, if you prefer!)
[14:12:33] sure
[14:12:45] now we can have allocation even without dns :D
[14:13:15] it could even be the case that e.g. the reverse dns for the ipv6 has an error or something
[14:13:32] (in which case LVS wouldn't care anyways, but it might cause confusion for your tooling)
[14:13:49] yes is exactly what I'm double checking
[14:14:34] ok got it
[14:15:04] they use a non-standard netmask
[14:15:17] /111 for the origin
[14:15:41] do you mean in the commentary in the zonefile?
[14:15:54] oh the origin splits
[14:15:58] the $ORIGIN
[14:16:21] they should be a /128 (VIPs) the /111 is a "container" iirc
[14:16:25] yeah so the records are actually generated but with our default /64 origin/split
[14:16:36] well
[14:16:43] it's all a matter of semantics
[14:16:44] because $someone told me that "hey for v6 we use only /64 at most" :D
[14:16:53] so the diff fails to see that they are the same
[14:17:03] but it's actually generated correctly
[14:17:15] technically ORIGIN is just a domainname hierarchy thing, not actually a subnet definition (e.g. you could use $ORIGIN at a completely different boundary than true "subnet" just to save typing on some common parts of the address)
[14:17:17] with some definition of "correctly" :D
[14:17:32] but yeah they're VIPs and there is no subnet
[14:17:45] but there's a /110 (in eqiad's case) reserved to those VIPs
[14:18:51] our current limitation is that we have a /64 granularity for inclusions of v6 records auto-generated
[14:19:08] although it's notable that in the eqsin case (most-recent), we did mark the whole /64 as VIP space
[14:19:21] eqsin commentary reads like:
[14:19:21] ; 2001:df2:e500:ed1a::/64 - LVS Public Service IPs (allocated)
[14:19:22] ; - 2001:df2:e500:ed1a::0:0/110 (::0:0 - ::3:ffff -- LVS Public Service IPs (in use)
[14:19:25] ; -- 2001:df2:e500:ed1a::0:0/111 (::0:0 - ::1:ffff --- LVS high-traffic1 (Text)
[14:19:34] but then we still do the /110 origins to save typing
[14:19:56] sure, that just means that if we want to migrate those records to the auto-generated ones we'll have to do the whole /64 at once
[14:20:01] we can "fix" that by using the /64 origin everywhere, and just typing more zeros or whatever on the individual ones
[14:20:08] nbd
[14:20:09] then it will all match up now
[14:20:37] it's just those 4 records per DC, I'll check them manually and we're happy I think
[14:20:46] we'll anyway re-check the diff if/when we want to migrate them
[14:20:46] I hate manually :)
[14:21:16] but renaming the ORIGINs will be manual too :)
[14:24:46] yeah but it's once, and then no more warnings for this process
[14:25:35] anyways, up to you
[14:25:54] but a mechanical patch to the v6 reverse zones to set them up as just /64 origins is fine too
[14:26:29] and I can proof-test it locally running the diff :)
[14:47:14] ema: bblack: do I have it right that `sub deliver_synth_` in wikimedia-frontend.vcl.erb gets run on every HTTP response, and would be a fine place to add NEL/Report-To headers based on req.http.Host?
[14:59:01] cdanis: yup, all responses go through either vcl_deliver or vcl_synth, hence through our deliver_synth_ https://book.varnish-software.com/4.0/_images/simplified_fsm.svg
[14:59:19] thanks!
[15:00:13] well, or vcl_pipe, but that's just TCP proxying
[15:00:34] bblack: https://gerrit.wikimedia.org/r/c/operations/dns/+/627524/ for the above
[15:08:18] volans: can you make the comments on the 4 older ones look like eqsin's comments (where it calls out the /64 alloc then the 111->110)
[15:08:47] sure
[15:20:44] {done}
[15:21:13] I've moved the ORIGIN just below the /64, hope it makes it more clear but lmk if you preferred how it was
[15:30:09] nice, thanks
[15:30:46] it's "safe" to merge anytime or requires some coordination?
[15:36:16] volans: should be safe afaik
[15:36:33] it doesn't change any responses on the wire, and in any case nothing infrastructure-wise relies on those PTRs
[15:37:07] ack, let me do it now, I'll check the PTRs before/after
[15:43:41] * volans merging
[15:44:13] * volans testing
[15:45:50] all looks good so far
[15:48:36] I've tested them @localhost -p 5353 and all looks unchanged
[15:49:43] netops, Operations, ops-eqiad: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (ayounsi)
[15:50:27] netops, DBA, Operations, ops-eqiad, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi)
[17:06:17] FYI there is a bit of cron-spam from Cron /bin/systemctl reload acme-chief
[17:06:23] acme-chief.service is not active, cannot reload.
[17:06:36] seems every hour
[17:08:47] netops, DC-Ops, Operations, ops-eqiad: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (Cmjohnson) a:Cmjohnson→RobH @robh cross-connect has been connected 15/16 and matched the serial number equinix provided.
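Editor's sketch for the $ORIGIN discussion above (14:15 to 15:48): the point that moving a reverse zone's $ORIGIN changes only how much of each PTR owner name is typed, not the name itself, can be checked with Python's standard `ipaddress` module. The helper names `split_reverse_name` and `vip_range` are hypothetical and are not part of the DNS repository or its tooling. The split is restricted to nibble boundaries because each ip6.arpa label is one hex digit; how the real zonefiles lay out their $ORIGIN lines is not shown in this log.

```python
import ipaddress


def split_reverse_name(addr: str, origin_prefix_len: int = 64) -> tuple[str, str]:
    """Split an IPv6 PTR owner name into (relative part, $ORIGIN).

    Hypothetical helper: the full owner name is identical wherever the
    split is made; only the amount typed per record changes.
    """
    if not 0 < origin_prefix_len < 128 or origin_prefix_len % 4 != 0:
        raise ValueError("split must fall on a nibble (4-bit) boundary")
    # reverse_pointer yields 32 nibble labels followed by 'ip6.arpa'
    labels = ipaddress.IPv6Address(addr).reverse_pointer.split(".")
    host_nibbles = (128 - origin_prefix_len) // 4
    relative = ".".join(labels[:host_nibbles])
    origin = ".".join(labels[host_nibbles:]) + "."
    return relative, origin


def vip_range(cidr: str) -> tuple[str, str]:
    """Return the first and last address of a block, as in the zonefile comments."""
    net = ipaddress.ip_network(cidr)
    return str(net[0]), str(net[-1])


if __name__ == "__main__":
    # VIP taken from the ipvsadm output quoted in the log
    rel64, origin64 = split_reverse_name("2620:0:861:ed1a::1", 64)
    rel112, origin112 = split_reverse_name("2620:0:861:ed1a::1", 112)
    # Same fully qualified owner name regardless of where $ORIGIN sits
    print(f"{rel64}.{origin64}" == f"{rel112}.{origin112}")
    # Ranges from the eqsin commentary
    print(vip_range("2001:df2:e500:ed1a::/110"))
    print(vip_range("2001:df2:e500:ed1a::/111"))
```

With a /64 split the relative part is sixteen nibbles ("typing more zeros"), which matches the trade-off described in the conversation.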
[17:10:09] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) >>! In T262869#6461316, @Aklapper wrote: > For anyone running into this, please follow https://www.mediawiki.org/wiki/How_to_report_a_bug#Reporting_a_connec...
[17:21:58] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Dzahn) >>! In T262869#6461316, @Aklapper wrote: > but please note that this ticket is public so you may not want to post your IP and other personal data If you are...
[17:41:55] netops, DC-Ops, Operations, ops-eqiad: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (RobH) No light on the connection, you may have to roll the fiber at the dmarc panel.
[17:46:49] netops, DC-Ops, Operations, ops-eqiad: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (ayounsi) There is no light in. I emailed Telia to let them know we're ready. Updated the X-connect ID/ports, etc in the termination A https://netbox.wikimedia.org/circuits/circuits/95/ Nex...
[17:48:07] netops, DC-Ops, Operations, ops-eqiad: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (ayounsi) From Telia: > We currently see this link down on our end. Receiving low power. > Tx Power: 0.65160 mW (-1.86019 dBm) > Rx Power: 0.00260 mW (-25.85027 dBm)
[18:10:12] netops, DC-Ops, Operations, ops-eqiad: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (RobH) a:RobH→ayounsi So I am a bit confused: * I wasn't included in the above email to Telia, I have no idea what is going on for this. Please include me in communications on link...
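Editor's sketch for the Telia readings quoted above: optical power in dBm is the power in milliwatts expressed on a logarithmic scale relative to 1 mW, so each mW/dBm pair in the ticket can be cross-checked with a one-line conversion. The function names are illustrative only and do not come from any tooling mentioned in the log.

```python
import math


def mw_to_dbm(power_mw: float) -> float:
    """Convert optical power from milliwatts to dBm (decibels relative to 1 mW)."""
    if power_mw <= 0:
        raise ValueError("power must be positive to express it in dBm")
    return 10.0 * math.log10(power_mw)


def dbm_to_mw(power_dbm: float) -> float:
    """Inverse conversion: dBm back to milliwatts."""
    return 10.0 ** (power_dbm / 10.0)


if __name__ == "__main__":
    # Values as reported by Telia in T261791
    for label, mw in (("Tx", 0.65160), ("Rx", 0.00260)):
        print(f"{label}: {mw} mW = {mw_to_dbm(mw):.5f} dBm")
```

Both readings are Telia's own transmit and receive levels, which travel in opposite directions, so their difference should not be read as the loss of a single fiber path.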
[18:31:00] netops, DC-Ops, Operations, ops-eqiad: Telia IC-361191) cross-connection - https://phabricator.wikimedia.org/T261791 (RobH) a:ayounsi→Cmjohnson Ok, summary of what we know and what we need to check: TX light to Telia: * Telia's RX of our light is at 0.00260 mW (-25.85027 dBm) * Our TX on...
[18:31:20] netops, DC-Ops, Operations, ops-eqiad: Telia IC-361191) cross-connection - https://phabricator.wikimedia.org/T261791 (RobH)
[18:31:25] netops, DC-Ops, Operations, ops-eqiad: Telia IC-361191) patch - https://phabricator.wikimedia.org/T261791 (RobH)
[20:05:03] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Andyrom75) >>! In T262869#6460813, @CDanis wrote: > Today we had reports of an issue from @Andyrom75 that was happening all the time on their Wind (AS1267) mobile c...
[20:38:11] Traffic, Operations, Platform Engineering, Product-Infrastructure-Team-Backlog, and 4 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (eprodromou) OK, we'll take a look at getting this figured out. @Pchelolo , let's consult on what's needed.
[23:17:14] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Andyrom75) ~15min ago connection has been restored. I'll test it again tomorrow.