[00:06:11] Traffic, Operations: Update certspotter - https://phabricator.wikimedia.org/T204993 (faidon) >>! In T204993#4610222, @MoritzMuehlenhoff wrote: > Adding the Debian maintainer :-) This seems fixed in 0.9-1 so updating stretch-backports to 0.9 could fix this. This is now done :)
[05:54:50] Traffic, Analytics, Operations, Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (Pchelolo) > Ok, @Pchelolo gets the persistence award! Yay! I've got the award! > Let me understand: are there other headers we would need besides the accept one...
[08:32:12] Traffic, Analytics, Operations, Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (phuedx) ☝️ Best I could do at short notice…
[08:33:47] Traffic, Operations: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Vgutierrez) @Papaul that's right, reinstalling both servers would be the fastest/safest approach :)
[09:10:56] bblack: split the replacement of bohrium with matomo1001 in three steps:
[09:11:05] 1) add the matomo1001 backed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464110/
[09:11:19] 2) replace the backend in the piwik's director https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464112/
[09:11:26] 3) cleanup https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464113/
[10:06:43] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (Marostegui) Are we still good for Thursday at 16:00 UTC for row B?
[12:05:17] vgutierrez, did you find any other normalisation needed?
[12:09:37] Krenair: nope
[12:10:21] maybe we could consider idn hostnames in the future
[12:13:54] When the subjectAltName extension contains a domain name system
[12:13:54] label, the domain name MUST be stored in the dNSName (an IA5String).
[12:13:54] The name MUST be in the "preferred name syntax", as specified by
[12:13:54] Section 3.5 of [RFC1034] and as modified by Section 2.1 of
[12:13:54] [RFC1123]. Note that while uppercase and lowercase letters are
[12:13:55] allowed in domain names, no significance is attached to the case.
[12:14:11] [blah blah blah]
[12:14:19] Rules for
[12:14:20] encoding internationalized domain names are specified in Section 7.2.
[12:15:10] Wikimedia does have some IDNs, they're all parked
[12:15:29] I don't think it's a priority right now
[12:15:42] (or just not in our DNS lol)
[12:15:46] yeah it's not
[12:16:07] doesn't look like any are set up to redirect so
[12:25:32] IDNs should parse like regular domainnames anyways, they're only really a concern when you want to display them to a user (arguably you might display them in logs or error messages in certcentral, but it's probably better to still treat them as ASCII in that case anyways, less confusing)
[12:25:52] yes
[13:40:26] bblack: yup... I was thinking about letting the user configure them in a readable way
[13:45:49] eh
[13:46:16] or take into consideration that the user could put a domain like that in the configuration
[13:46:16] "readable" is in the eye of the beholder. For a browser UI or URL bar, maybe displaying UTF-8 is more-readable.
[13:46:44] indeed
[13:46:45] but for technical / infrastructure tools, it seems more likely to cause problems
[13:47:01] (vs just using the ASCII representation)
[13:47:57] somewhere deep in the bowels of some python library, that decision to pass around a UTF-8 string and/or try to normalize it will bite :)
[13:51:06] all we need it for is an equality check
[13:52:19] yeah, but you'd first have to normalize by IDNA rules
[13:52:51] and IDNA is a standard that evolves, because yeah utf-8 and normalization and not fooling users is hard stuff
[13:53:38] there was IDNA2003, then IDNA2008, which has like 5 different RFCs
[13:54:32] https://tools.ietf.org/html/rfc5895 is an attempt at an informational overview of the latest of the whole thing
[13:55:11] also from unicode.org, the 2003->2008 transition in brief:
[13:55:12] Q: What is IDNA2008?
[13:55:12] A: It is a revision of IDNA2003, approved in 2010. For most Unicode characters it produces the same results as IDNA2003, but there are important classes of characters for which it is not backwards compatible with IDNA2003. See [RFC Numbers at http://www.unicode.org/reports/tr46/#IDNA2008].
[13:56:05] my gut feeling is trying to do anything other than "use the ascii encoding only and ignore IDN" in a technical tool is only going to be a long-term rabbithole of pain
[13:59:58] http://xn--rksmrgs-5wao1o.josefsson.org/ notes a real-world problem caused by that 2003->8 transition for someone's hobby domain
[14:35:35] Traffic, Operations: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Papaul) @Vgutierrez let me know when I have green light to start working on this.
[14:39:36] right
[14:40:03] vgutierrez, have you tested this cert subjects change code?
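A minimal sketch of the "use the ascii encoding only and ignore IDN" approach from the 13:51-13:56 exchange above, assuming the equality check only ever sees names already in their ASCII form (including `xn--` labels). The function names `normalize_hostname` and `same_hostname` are hypothetical and are not taken from certcentral; the sketch only lowercases, drops one trailing dot, and rejects non-ASCII input instead of attempting any IDNA conversion.

```python
def normalize_hostname(name: str) -> str:
    """Return a lowercase ASCII form of a DNS name for comparison.

    Hypothetical helper: non-ASCII input is rejected rather than
    converted, so no IDNA2003/IDNA2008 decision is made here.
    """
    name = name.strip()
    # A single trailing dot only marks the name as fully qualified.
    if name.endswith("."):
        name = name[:-1]
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError(f"non-ASCII hostname not supported: {name!r}")
    # DNS names carry no significance in letter case.
    return name.lower()


def same_hostname(a: str, b: str) -> bool:
    """Case-insensitive equality check on ASCII hostnames."""
    return normalize_hostname(a) == normalize_hostname(b)


if __name__ == "__main__":
    print(same_hostname("Example.ORG", "example.org."))
    print(same_hostname("XN--RKSMRGS-5WAO1O.josefsson.org",
                        "xn--rksmrgs-5wao1o.josefsson.org"))
```

Rejecting non-ASCII input keeps the comparison independent of whichever IDNA revision a library implements; a user who wants an IDN would have to put the `xn--` form in the configuration.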
[14:42:01] Krenair: I think the best way should be to deliver a full test case rather than manual testing
[14:42:15] it's going to be a little bit complex though
[14:43:06] Traffic, Analytics, Analytics-Kanban, Operations, Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (Nuria) a:Ottomata
[14:44:52] Traffic, Operations, Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Vgutierrez) @Papaul as soon as https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464156/ gets merged :)
[14:51:11] Krenair: right now is not behaving as expected though :)
[14:58:00] that itself is not entirely unexpected :)
[14:58:11] Krenair: https://gerrit.wikimedia.org/g/operations/software/certcentral/+/refs/changes/82/460382/11/tests/test_certcentral.py#1059
[14:58:18] Krenair: my bad, it's working as expected :)
[14:58:24] and that test shows it
[14:59:22] arg.. meeting in <60 secs :)
[14:59:33] Krenair: let me know if you are happy with that test
[15:00:46] vgutierrez, can we have the _with_config_change function call the normal one first?
[15:02:27] I can refactor both of them
[15:02:37] to have a function to be called by them
[15:02:44] and avoid the copy&pasteed code
[15:02:48] *pasted
[15:03:29] but having one test depending on the other doesn't feel right
[15:03:33] it looks like the first 4 steps are identical?
[15:03:35] I suppose
[15:03:46] yup, they do exactly the same
[15:04:13] actually we could get rid of the first one
[15:04:21] could just get rid of the first one vgutierrez?
[15:04:55] indeed
[15:07:34] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi) Still good, here is the list of hosts currently on the new asw2-b-eqiad that will be impacted by Thursday 4th 16:00UTC 2h maintenance window (with a worse case of a 30min do...
[15:22:11] netops, Operations, ops-ulsfo, Patch-For-Review: cr2-ulsfo crash - https://phabricator.wikimedia.org/T204782 (RobH) We can resolve this since we decommissioned cr2 and are getting rid of it, right?
[15:25:26] netops, Operations, ops-ulsfo, Patch-For-Review: cr2-ulsfo crash - https://phabricator.wikimedia.org/T204782 (ayounsi) Open>Resolved a:ayounsi Yep.
[15:56:05] netops, Operations, Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) Optic replaced yesterday and confirmed no more issues. Steps for today: [] Verify cr2-eqiad is VRRP master [] Disable interfaces from cr1-eqiad:ae1 to as...
[16:12:56] Krenair: I got rid of the first test as you mentioned
[16:33:11] vgutierrez, so how do we want to do this?
[16:33:17] we both contributed to this commit, so... :|
[16:33:26] netops, Operations, Goal: Increase network capacity (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T199142 (ayounsi)
[16:33:32] Traffic, netops, Operations, ops-ulsfo, Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (ayounsi) stalled>Resolved
[16:35:05] Krenair: volans is out today, but I can ask him tomorrow for a quick review :)
[16:35:15] ok
[16:35:15] as our in-house python guru
[16:36:40] vgutierrez, I just noticed that you removed the sorted() call
[16:36:56] oh but they became sets
[16:37:15] so should be ok
[16:59:24] Traffic, Horizon, Operations, Upstream: Horizon Designate dashboard not allowing creation of NS records - https://phabricator.wikimedia.org/T204013 (Krenair) I got projectadmin in the `openstack` tenant back and made an instance called labs-t204013-osdev. Then I followed https://docs.openstack.or...
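A small illustration of the 16:36-16:37 point above that dropping `sorted()` is fine once the values are sets: set equality ignores ordering, so no sort is needed before comparing. The helper name `subjects_changed` and the sample hostnames are hypothetical and only stand in for whatever the certcentral patch actually compares.

```python
def subjects_changed(old_names, new_names):
    """Return True if two collections of certificate names differ.

    Hypothetical helper: converting to sets makes the comparison
    independent of ordering (and of duplicate entries).
    """
    return set(old_names) != set(new_names)


if __name__ == "__main__":
    old = ["a.example.org", "b.example.org"]
    # Same names in a different order: not a change.
    print(subjects_changed(old, ["b.example.org", "a.example.org"]))
    # An added name: a change.
    print(subjects_changed(old, old + ["c.example.org"]))
```

One consequence worth keeping in mind is that sets also collapse duplicates, so a list that merely repeats a name compares equal to one that lists it once.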
[17:10:40] Traffic, Operations, Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Papaul) puppet certs for lvs2009 and lvs2010 delete from master for OS reinstall
[17:34:50] Traffic, Operations, ops-ulsfo, Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (RobH) a:RobH>fgiunchedi Ok, updates from IRC sync up and followup actions: * @robh updated (per @fgiunchedi's instruction) @bblack's patchset https://gerrit.wikimedia.org...
[17:47:33] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi) Steps to migrate asw2-b-eqiad to a supported topology. {F26293194} Step 1) [] Enable all VC ports on FPC2 and FPC7 ``` request virtual-chassis vc-port set pic-slot 0 port...
[17:51:12] Traffic, Operations, ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327 (RobH) Open>Resolved All new systems are in place, resolving this task.
[17:51:15] Traffic, Operations, Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (RobH)
[18:01:35] bblack: are there any non prod traffic servers in row A that could be moved to asw2 ?
[18:14:24] XioNoX: I don't think there are?
[18:14:49] just curious, not an issue
[18:15:16] asw2-a-eqiad is now between asw and cr1 and passing traffic, everything looks stable so far
[18:15:24] oh wait, cp1008?
[18:16:38] there's some others that are to-be-decom, but then they don't make great tests either
[18:18:20] Traffic, Operations, Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Papaul) The reason the OS reinstall is taking long is that when all 4 NIC's are plugged in, the server can not auto configure the first NIC so I have to to in the BIOS and disa...
[18:20:28] Also I wrote the step by step instructions for asw2-b over there if you want to proofread it: https://phabricator.wikimedia.org/T201039#4639293
[18:26:18] netops, Operations, Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) Open>Resolved This is now stable. Back to T187960 for the remaining steps.
[18:29:03] XioNoX: how confident do you feel that we won't lose access (e.g. create temporary islands of B that can't see each other, or can't see upstream routers) at various points in that process?
[18:30:48] XioNoX: and I guess my other thing would be: in case there's instability during all the middle steps that eventually gets better at the end, can we minimize the total time by having all the cables pre-run for the final config (as opposed to the possibility that some switches are isolated/broken and we're waiting around for physical work routing cables to finish)
[18:30:57] Traffic, Operations, Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Papaul) LVS 2009 ``` root@lvs2009:~# fdisk -l Disk /dev/sda: 223.6 GiB, 240057409536 bytes, 468862128 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physic...
[18:31:33] bblack: physically if the steps are respected, there should always be at least one leaf between the two spines, but I can't be sure of how the fabric will route between the two. It worked previously when we didn't have the 7m DAC, where 2 single homed leafs (on different spines) were able to communicate via other leafs
[18:33:19] previously when we didn't have the 7m DAC: https://phabricator.wikimedia.org/T201145#4567915
[18:33:33] Cabling has been done out of order, but end result is there. (minus the 7m DAC).
[18:33:36] During the re-cabling, the fabric was very unstable: frequent disconnects between members, adjacencies appearing/disappearing.
[18:33:39] A reboot of the fabric didn't solve the issue.
[18:33:42] Disabling the FPC3-FPC2 link made everything stable again, and stayed stable after re-enabling it.
[18:34:40] bblack: good points, Chris already ran the 7m fibers to save time, we can double check that everything that can be run is done
[18:34:42] which makes me think we could easily again see a scenario of "very unstable" until at least we get to the end of the cabling steps and/or take some kind of reset.
[18:35:24] Traffic, Operations, Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Papaul) - Re-enable the other 3 NIC's - First puppet run complete LVS2009 is ready
[18:35:40] bblack: yeah, doing it in order should lessen that risk but it's only a guess,
[18:35:49] ok
[18:36:13] I just want to be realistic. Luckily there's not as much traffic as normal in eqiad in general. but there is some, and there are some eqiad-only things too.
[18:39:47] XioNoX: is https://phabricator.wikimedia.org/T201039 the ticket for this? it has the most-recent updates anyways
[18:39:53] yeah, we don't have many other options though than warn people and try to minimize downtime, doing it after the switchback would be more impactful, and we know the end state is stable
[18:40:22] bblack: for the recabling, yeah
[18:40:53] and https://phabricator.wikimedia.org/T183585 for the overall moving to asw2-b-eqiad
[18:40:54] sorry maybe I've lost all context
[18:41:45] starting here is just today talking about tomorrow's thing and the server list: https://phabricator.wikimedia.org/T201039#4637947
[18:42:02] before that we're back at like mid-august?
[18:43:34] bblack: yeah, there is an email thread about it too, named "eqiad row D switch upgrade"
[18:43:57] this is B! :)
[18:44:13] yeah, the topic drifted...
[18:44:34] ok
[18:44:46] so yeah, assuming that email thread is the announcement
[18:45:13] and I mentioned it during the meeting
[18:45:48] yeah I'm just wondering if someone who cares about that long list of servers is going to be surprised
[18:46:38] I see DBA + Cloud respond on the email thread
[18:47:31] yes, hopefully they pay attention in meetings!
[18:47:42] elukey also replied on IRC about the analytics hosts "loosing all those hadoop nodes is not a huge deal but it is kinda concerning (bad case scenario)"
[18:47:59] ok
[18:53:42] bblack: I can send an email to ops about it to catch any last minute blocker
[19:00:27] XioNoX: yeah might be best just in case
[19:05:45] Traffic, Operations, Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Papaul) lvs2010 ``` root@lvs2010:~# fdisk -l Disk /dev/sda: 223.6 GiB, 240057409536 bytes, 468862128 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical)...
[19:08:19] Traffic, Operations, Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Vgutierrez) Thanks @papaul!
[19:08:44] Traffic, Operations: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Vgutierrez) Open>Resolved a:Papaul
[19:55:26] Traffic, Community-Tech, MediaWiki-Parser, Operations, and 3 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (MaxSem) a:MaxSem
[19:57:11] sent
[20:48:59] netops, Operations, ops-ulsfo: Interface errors on cr4-ulsfo:et-0/0/1 - https://phabricator.wikimedia.org/T205937 (RobH) Ok this was odd and I had to sync with @ayounsi via IRC. The spare optic is the same model, but made in China where the other 4 40g optics were made in Malaysia. The Chinese vers...
[23:27:23] Traffic, DNS, Operations, WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent)