[07:55:36] anyone around to sanity check https://gerrit.wikimedia.org/r/#/c/operations/dns/+/451585/? [08:12:18] ema: done [08:26:02] hmmm interesting [08:26:13] lvs2009 & lvs2010 come with bnxt_en NICs [08:28:01] 10Traffic, 10Operations, 10Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (10ema) 05Open>03Resolved [08:28:04] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [08:59:38] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [08:59:47] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [09:01:05] oh doh! cp2003 is running jessie :) [09:02:47] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp2003.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-aut... [09:03:51] gotta love the ifaces naming with bnxt [09:04:01] enp59s0f1d1 [09:04:03] wtf is that! [09:04:10] yesterday you said you miss eth0 [09:04:14] yup [09:04:20] perhaps one day you'll miss enp59s0f1d1 too! :) [09:04:25] hahahah [09:04:33] I'm not even able to say it out loud for god's sake! [09:04:56] "hey jack! have you checked if everything is alright with enp59s0f1d1?" 
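Note on the interface names above: a name like enp59s0f1d1 appears to follow the systemd "predictable" PCI naming layout, en[P<domain>]p<bus>s<slot>[f<function>][d<dev_port>]. The sketch below is only an illustration of that layout; the function name parse_pci_ifname is hypothetical and nothing like it is mentioned in the channel.

```python
import re

# Assumed layout: en[P<domain>]p<bus>s<slot>[f<function>][d<dev_port>]
_PCI_NAME = re.compile(
    r"^en(?:P(?P<domain>\d+))?p(?P<bus>\d+)s(?P<slot>\d+)"
    r"(?:f(?P<function>\d+))?(?:d(?P<dev_port>\d+))?$"
)


def parse_pci_ifname(name):
    """Split a PCI-style interface name into its numeric parts.

    Returns a dict with only the fields present in the name,
    or None when the name does not match the assumed layout.
    """
    match = _PCI_NAME.match(name)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items() if value is not None}


# Example with the name quoted in the channel.
parts = parse_pci_ifname("enp59s0f1d1")
```

Read this way, enp59s0f1d1 would be PCI bus 59, slot 0, function 1, device port 1, which is at least easier to say out loud.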
[09:06:02] this reminded me once that we had an issue with a very old server and the person in the DC was able to get it back online only with an auto-generated temporary interface [09:06:21] that he had to dictate to us on the phone as he was physically on the server console [09:06:25] it was "fun" ;) [09:08:09] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10Vgutierrez) [09:11:08] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul in lvs2009 on board NICs need to be disabled in the BIOS (in lvs2010 they're already disabled): ```name=lvs2009 root@lvs2009:~# dmesg |grep tg3 [ 2... [09:24:36] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2003.codfw.wmnet'] ``` and were **ALL** successful. [09:24:52] 10Traffic, 10Operations, 10decommission, 10ops-codfw: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10Vgutierrez) [09:25:42] w00t, cp2003 successful at first attempt \o/ [09:25:49] nice [09:26:01] cause you ran it from sarin this time [09:26:02] :P [09:26:35] lol [09:26:44] of course not, I never use sarin [09:27:07] hahahhaha [09:27:25] I jut logged in today for the first time since I've joined WMF [09:27:29] *just [09:27:41] * volans uses 95% of the time sarin [09:31:13] hey look, wikipedia via ATS! [09:31:16] curl -v -H "Host: en.wikipedia.org" -H "X-Forwarded-Proto: https" http://cp2003.codfw.wmnet:3129/wiki/Main_Page [09:34:02] nice! 
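Note on the curl test above: the same check can be scripted by sending the request to the cache host directly while overriding the Host header. This is a minimal sketch using only the Python standard library; build_backend_request is a hypothetical helper name, and the host, port and path are simply the ones from the curl command.

```python
import urllib.request


def build_backend_request(backend, host, path, forwarded_proto=None):
    """Build a plain-HTTP request aimed at one backend, with an overridden Host header."""
    headers = {"Host": host}
    if forwarded_proto:
        # Optional, mirrors curl's -H "X-Forwarded-Proto: https"
        headers["X-Forwarded-Proto"] = forwarded_proto
    return urllib.request.Request("http://" + backend + path, headers=headers)


request = build_backend_request(
    "cp2003.codfw.wmnet:3129", "en.wikipedia.org", "/wiki/Main_Page"
)
# urllib.request.urlopen(request) would actually send it; that only works
# from a machine that can reach the backend, so it is left out here.
```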
[09:35:10] cool [09:39:05] well actually the X-Forwarded-Proto part is just muscle memory and unnecessary :) [09:39:07] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @ayounsi interface naming in lvs2009 and lvs2010: |current name| lvs2009|lvs2010| |nic1|enp59s0f0|enp59s0f0| |nic2|enp59s0f1d1|enp59s0f1d1| |nic3|enp175s0f0|e... [09:42:43] 10Traffic, 10Core-Platform-Team, 10Operations, 10Performance-Team, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) [09:42:44] adding lvs dns entries should be considered torture [09:43:35] 10Traffic, 10Core-Platform-Team, 10Operations, 10Performance-Team, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) >>! In T201409#4489746, @Catrope wrote: > In addition to that, do we have a task for "every service should incl... [09:46:40] phabricator works too, but this time X-Forwarded-Proto is needed to avoid the apache rewrite rule [09:46:43] curl -v -H "X-Forwarded-Proto: https" -H "Host: phabricator.wikimedia.org" http://cp2003.codfw.wmnet:3129/T199720 [09:47:44] you don't like --resolve? :P [10:08:54] 10Traffic, 10Core-Platform-Team, 10Operations, 10Performance-Team, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) >>! In T201409#4484792, @Imarlier wrote: > I've been investigating the use of an [[ http://opentracing.io/ | Op... 
[10:53:14] I could definitely use an extra pair of eyes here: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/451607/ [11:20:42] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410 (10Tgr) [11:23:34] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410 (10Tgr) 05Open>03Resolved [12:23:44] 10Traffic, 10Core-Platform-Team, 10Operations, 10Performance-Team, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Imarlier) >>! In T201409#4491050, @mobrovac wrote: > > > Perfect -- to be clear, I wasn't making any objection, j... [12:31:37] vgutierrez, so where are we with puppet versions? [12:31:43] all wikimedia stuff is currently on 4.8 right? [12:33:37] I think so [12:34:26] BTW, I'm not saying we must refactor that from yaml to pson [12:34:38] it simply came to my attention while reviewing the API documentation [12:35:02] I think you're probably right [12:35:53] it looks like our puppet clients support yaml anyway but yeah [12:36:19] do you know if there are plans for puppet 5 here? [12:36:34] because if so there's json support [12:37:13] hmm do you know if the client signals that via http headers? 
[12:37:26] I'm thinking in the Accept HTTP header [12:37:49] I don't, worth checking though [12:37:53] we could reply in json/pson/yaml as instructed by the Accept HTTP Header [12:37:57] defaulting to PSON [12:38:21] I guess that depends whether you prefer to be compliant with the puppet API docs or what the client asks for :) [12:41:04] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @BBlack Could use a hand from you/someone on your team. I've generated... [12:54:50] ema: btw, were you able to access those codfw ATS hosts under role(test)? or just reimage straight into the new role to get into them? [12:55:16] ema: I put cp1071-4 under role(test) as well, but after imaging into it, no access methods really worked (no working ssh for new_install or my keys, etc, anyways) [12:55:23] bblack: I was able to access them as role(test) too [12:55:30] huh [12:55:56] see eg, cp2009.codfw.wmnet now [12:56:01] bblack: with the reimage script? [12:56:09] volans: yeah [12:56:14] so cumin should have access ;) [12:56:16] the reimages succeeded [12:56:44] yeah, cumin does, oddly [12:56:49] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10BBlack) So a few things: 1) We'll have to hack in the rewrite manually in VCL, bu... [12:57:12] bblack: ok to reimage 2009/2015/2021 or do you wanna check what happened there? 
:) [12:57:46] go ahead [12:57:55] you can take 1071-4 too whenever [12:58:56] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` ['cp2009.codfw.wmnet', 'cp2015.codfw.wmnet', 'cp2021.codfw.wmnet'] ``` T... [13:00:02] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10ArielGlenn) >>! In T199252#4491444, @BBlack wrote: > So a few things: > 3)... [13:04:04] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) >>! In T199252#4491444, @BBlack wrote: > So a few things: > 1) We'll ha... [13:06:35] bblack, do you know if there are plans for puppet 5? [13:07:08] no idea, and the main person who would know is now officially on vacation [13:08:25] but in general I think it's fair to assume that: (a) We're deeply invested in puppet at present, and I don't see us having resources/will to suddenly be able to migrate all that to an alternative within the next year or three + (b) Forward progress has to happen for various raisins, so we probably eventually end up on whatever the latest Puppet is, even if a few years behind [13:08:52] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1071.eqiad.wmnet', 'cp1072.eqiad.wmnet'] ``` The log can be foun... 
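Note on the earlier Accept header idea (12:37): replying in json/pson/yaml as the client asks, and defaulting to PSON otherwise, could look roughly like the sketch below. The media type strings in SUPPORTED are assumptions made for illustration, not values confirmed in the channel or checked against what puppet agents really send, and pick_format is a hypothetical name.

```python
# Assumed media types; real puppet clients may advertise different strings.
SUPPORTED = {
    "application/json": "json",
    "text/pson": "pson",
    "text/yaml": "yaml",
}
DEFAULT_FORMAT = "pson"


def pick_format(accept_header):
    """Pick a reply format from an Accept header, falling back to PSON."""
    if not accept_header:
        return DEFAULT_FORMAT
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        media_type, _, params = item.strip().partition(";")
        quality = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        candidates.append((-quality, position, media_type.strip().lower()))
    # Highest q first; ties keep the order the client listed them in.
    for negative_quality, _, media_type in sorted(candidates):
        if negative_quality < 0 and media_type in SUPPORTED:
            return SUPPORTED[media_type]
    return DEFAULT_FORMAT
```

Anything the sketch does not recognise (including wildcards such as */*) falls through to the PSON default, which matches the "defaulting to PSON" suggestion.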
[13:10:00] TL;DR T184564 [13:13:18] oh, so yeah it's already apparently planned, but not in progress :) [13:13:28] no timeline AFAIK [13:21:44] bblack: eqiad is still depooled, do we want to keep it like that till we find out more about the network situation? [13:27:50] yeah I think so [13:27:59] at least until we have a Plan [13:28:29] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1071.eqiad.wmnet', 'cp1072.eqiad.wmnet'] ``` and were **ALL** successful. [13:36:44] ema: thoughts on https://gerrit.wikimedia.org/r/c/operations/puppet/+/451535 appreciated too. We had another one of those "huge bursts of cache_upload outbound bandwidth to few clients" scenarios not long after our real network problems yesterday, which was confusing and alarming, and due to eqiad depool it saturated eqiad<->codfw transport for a bit. [13:40:45] interesting, thanks [13:44:31] bblack: it looks reasonable, I wonder if/how we can monitor these things [13:45:11] bblack: some grafana dashboard with "fast flows" (per host? cluster?) would be interesting [13:45:50] meanwhile, I think the reason why we couldn't access the eqiad ats cluster is lack of IPv6 [13:46:13] this should do: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451628/ [13:48:33] oh duh, should've thought of that [13:49:00] long day yesterday! [13:49:09] I've noticed! :) [13:56:24] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [13:56:49] I'll push the flow maxrate thing in a little bit, before the traffic meeting. 
I tested what happens on cp1008 earlier and it doesn't blip the interface apparently, but still probably better to let it roll on natural puppet runs in case of minor flow disturbances from re-setting fq [13:56:58] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) a:03ema [13:57:29] ok [14:13:06] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1073.eqiad.wmnet', 'cp1074.eqiad.wmnet'] ``` The log can be foun... [14:44:19] 10Traffic, 10Operations, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1073.eqiad.wmnet', 'cp1074.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['cp1073.eqiad.wmnet', 'cp1074.eqi... [14:48:21] ok, pushing the maxrate thing shortly [14:49:17] bblack: did you have to change something in the BIOS for the new bnxt NIC cards? [14:51:55] vgutierrez: I think 1 or 2 out of 16, I had to go turn on PXE [14:52:00] vgutierrez: but in general no [14:52:02] why? [14:52:19] the new lvs2009 and lvs2010 come with those cards [14:52:28] instead of bnx2x like lvs1016 [14:52:36] right [14:52:50] it should be fine, it will need a final reboot after all puppetization is done I think [14:53:13] I've figured out the naming and open a CR with all the DNS stuff for lvs2007-2010 [14:53:18] I updated all the interface hardware tweak stuff to support bnxt_en, for both cp and lvs, while working on the new cps [14:53:54] (in puppet I mean. 
most of it was pretty trivial, interface-rps already automagically supported it based on work we did before for the general case) [14:54:17] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10Cmjohnson) @bblack The DIMM has been replaced with new, please resolve task once satisified Return Tracking USPS 9202 3946 5301 2439 4635 97 FEDEX 9611918 23... [14:54:28] we should probably hold on going live with bnxt_en lvses until the other driver-level issue is investigated more [14:54:44] ack [14:54:58] (5/16 of the cp servers, during eqiad network mess yesterday, their bnxt_en driver crashed on a transmit timeout and didn't come back without a reboot) [14:55:09] I think I'll let them live as spare systems till then [14:55:10] :/ [14:56:52] I think there's upstream bugfixes for bnxt_en our stretch 4.9 doesn't have yet, but I still have some digging to do on that [14:57:19] (or possibly, we need to flash newer firmware, I donno) [14:58:12] bblack: oh and I've asked papaul to disable the onboard NICs in lvs2009, they were enabled [14:58:37] ok [14:58:50] we've talked before about possibly actually using one, but never really committed to it IIRC [14:58:53] they were already disabled in lvs2010 though [14:59:02] it would mean some strange puppetization fixups that apply only to newer ones where we're doing it [14:59:18] (as in, keep the onboard 1G as the host-primary interface, and have all the 4x10G be traffic-only) [14:59:26] kinda a management port [14:59:29] meeting :) [14:59:32] yeah [15:00:14] I'll be like 5m late, but I'll be there! [15:16:57] all dns recursors rebooted to the latest kernel (eqiad ones were already up-to-date thanks to recent reimage) and upgraded to SSBD-enabled microcode (except eqiad ones, will fix that next week). 
nescio and maerlant have too old CPUs for which Intel didn't release microcode updates so far (but they're up for hw replacement anyway) [15:17:37] thx moritzm :D [15:37:04] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10Cmjohnson) [15:44:16] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10Cmjohnson) @ayounsi assigning this to you. Everything has been updated on the switch, verified what they were and disabled the ports. asw-b-eqiad.mgmt.eqiad.wmnet ge-4/0... [16:14:24] I've configured ATS to use outbound TLS with the appservers [16:14:31] hmmm [16:14:32] it's been hard but I've managed: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451654/ [16:14:34] what about TLS parameters? [16:15:06] tls version, cipher suites... you know the drill :P [16:16:08] lol [16:16:47] yeah I think ATS has some generic parameters to configure outbound TLS params [16:16:57] we should probably limit to 1.2 and nice ciphers, etc [16:17:03] https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html?highlight=tls#client-related-configuration [16:17:24] patch ssl_ciphersuite to support ATS too l) [16:17:27] ;) [16:17:35] * volans hides [16:19:17] note docs have some client settings not in the client-related-configuration section [16:19:20] https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html?highlight=tls#proxy-config-ssl-client-cipher-suite [16:19:57] ssl_ciphersuite doesn't really have client-side in mind [16:20:06] hmmm we need some additional stuff there.. like signature hashing algorithms and maybe EC curves... [16:20:10] (but I guess it could!) [16:20:41] we don't set sigalgs in nginx either, I think it has sane defaults? 
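Note on "limit to 1.2 and nice ciphers": for a generic TLS client the idea can be sketched with Python's ssl module as below. This is not ATS configuration and says nothing about how the puppet patch does it; the cipher string is an illustrative assumption, and the sketch does not cover signature algorithms or curve lists.

```python
import ssl


def make_outbound_tls_context():
    """Client-side TLS context: certificate checks on, TLS 1.2 or newer, ECDHE AEAD ciphers."""
    context = ssl.create_default_context()  # enables cert and hostname verification
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Assumed "nice ciphers" for TLS 1.2; TLS 1.3 suites are governed separately by OpenSSL.
    context.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    return context


context = make_outbound_tls_context()
```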
[16:20:52] we do set EC curves to nondefault x25519:secp256r1 [16:21:21] openssl doesn't have sane defaults about that :) [16:22:32] I meant nginx [16:22:50] not all of nginx's use of openssl to set settings is exposed as nginx-level settings [16:23:09] reminds me, nginx is on the list of repos I need to re-clone since I switched laptops [16:23:13] no git grep :/ [16:24:58] https://github.com/nginx/nginx/search?q=sigalgs&unscoped_q=sigalgs --> 0 results [16:25:10] https://www.openssl.org/docs/man1.1.0/ssl/SSL_CTX_set1_sigalgs_list.html [16:25:21] it should match something with sigalgs :_) [16:28:52] hmmm [16:29:03] oh wait, really it's a client-side concern for us, right? [16:29:18] server-side, sigalgs is implicitly limited by the set of deployed certs [16:29:26] (and lack of caring about client certs) [16:29:52] so if we're a server like our nginx terminators, and we don't use client certs, we don't care about that setting [16:30:03] but yeah for outbound TLS from ATS, maybe we do [16:30:09] yup [16:30:13] we did care for librdkafka [16:30:38] so nothing to worry about for tlsproxy necessary [16:30:41] just for ATS outbound [16:30:59] yup, I was thinking about ATS as the TLS client, sorry [16:32:25] can I get a sanity check on the cache_misc TTL patch? https://gerrit.wikimedia.org/r/#/c/operations/dns/+/451656/ [16:33:47] no just after the nginx search, I was worrying heh [16:33:49] I was just confused [16:33:54] are you ignoring labtest* on purpose? [16:34:04] those are pointing to misc as well [16:34:14] ah, lovely [16:34:37] I was sedding 600s, that's 1H [16:34:46] damn humans [16:34:46] 10netops, 10Operations, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Suggested plan to have asw2-a-eqiad similar to codfw: {F24734668} Need to be added: fpc1-fpc3 fpc3-fpc4 fpc5-fpc6 Need to be removed: fpc2-fpc4 fpc6-fpc... 
[16:34:50] ;P [16:34:58] bblack, mark, https://phabricator.wikimedia.org/T201145#4492126 thoughts? [16:36:02] ema: [16:36:04] templates/wikimedia.org:labtestwikitech 1H IN DYNA geoip!misc-addrs [16:36:08] templates/wikimedia.org:labtesthorizon 1H IN DYNA geoip!misc-addrs [16:36:11] templates/wikimedia.org:labtesttoolsadmin 1H IN DYNA geoip!misc-addrs [16:36:14] templates/wmfusercontent.org: 600 IN DYNA geoip!misc-addrs [16:36:18] wmfusercontent.org is for the root of that domain I think [16:36:30] I have no idea why those labtests are 1H, but they should've been 600 anyways [16:40:00] ema: I don't see actual patches in the parsoid/graphoid/citoid/ores window [16:40:14] maybe wait for them to start up on the hour to see if any last-minute patches show [16:40:23] if not, use their slot for this change [16:40:30] since ores is affected at least [16:40:45] (for the misc->text part) [16:40:58] sure [16:54:21] 10Traffic, 10Operations, 10monitoring: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (10ema) [16:55:48] 10Traffic, 10Operations, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) [17:07:10] bblack: doh! https://phabricator.wikimedia.org/T201518 [17:12:52] so https://gerrit.wikimedia.org/r/#/c/operations/dns/+/451659/ is ready for review/merge, time to eat now here. I'll drop by after dinner [17:25:13] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) Suggestion: maybe just have these generate long-term to the generic sha... 
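Note on the missed labtest* records: matching on the literal TTL text ("600") skips records whose TTL is written as "1H". A hedged sketch of matching on the record target instead and normalising the TTL is below. It assumes single-unit BIND-style suffixes (s, m, h, d, w) and whitespace-separated fields as in the lines pasted above; both function names are hypothetical.

```python
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def ttl_seconds(ttl):
    """Convert a TTL such as '600' or '1H' to seconds (single unit suffix only)."""
    ttl = ttl.strip().lower()
    if ttl and ttl[-1] in _UNIT_SECONDS:
        return int(ttl[:-1]) * _UNIT_SECONDS[ttl[-1]]
    return int(ttl)


def misc_addrs_ttls(zone_lines):
    """Return (label, ttl_in_seconds) for every 'DYNA geoip!misc-addrs' record."""
    results = []
    for line in zone_lines:
        fields = line.split()
        if fields[-2:] != ["DYNA", "geoip!misc-addrs"] or "IN" not in fields:
            continue
        ttl_index = fields.index("IN") - 1
        if ttl_index < 0:
            continue  # no explicit TTL on this record
        # A record with no label before the TTL is treated as the zone root.
        label = fields[0] if ttl_index > 0 else "@"
        results.append((label, ttl_seconds(fields[ttl_index])))
    return results


records = misc_addrs_ttls([
    "labtestwikitech 1H IN DYNA geoip!misc-addrs",
    " 600 IN DYNA geoip!misc-addrs",
])
```

Compound TTLs such as "1H30M" are not handled by this sketch and would raise ValueError.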
[17:35:17] ema: np, I can merge it a little later [17:44:32] ema: whenever (not urgent), but take a peek at /dev/sd[cdefg] on cp2009 ATS, seems like some strange iDRAC virtual devices [17:47:43] XioNoX: looking at calendar, we have Morning SWAT coming up on the hour at 18:00, and then MW train deploy (version bump to the big wikis) at 19:00 [17:48:33] let's aim for after mw train deploy to do actual work. and then if we need to we can cancel their evening swat if things get dodgy (better than canceling the impending morning swat or train) [17:48:44] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1800 [17:49:03] I don't think they'll take their full 2h, we can watch for when they're done [17:49:25] gives time to think / eval / plan too, if we need chris to be ready as well with phy cable moves, etc [17:49:56] bblack: let's connect the links in a disabled state now then, so we don't block Chris? [17:50:10] right [17:50:15] https://phabricator.wikimedia.org/T201145#4492126 -> [17:50:37] Connect fpc3-fpc4 (keep disabled) [17:50:37] connect fpc5-fpc6 (keep disabled) [17:50:52] we can have them pre-disabled before connect and should be high probability zero impact, right? [17:51:08] (and then everything else is software changes?) [17:52:30] correct [17:53:03] ok [17:53:13] log it all super verbosely anyways :) [17:53:37] but yes, +1 let's connect the disabled cables now [17:54:03] random cp3037 ipsec alert... [17:54:07] yep [17:56:22] ok [17:56:45] I'm going to poke at cp1080 and at ema's geoip misc->text patch in the short-term meantime [17:57:20] Will sync up with Chris in -dcops [17:57:26] ok [18:00:38] alright, it's done, and still show as down as expected [18:00:55] everything else can be done remotely now [18:08:08] ok [18:12:48] bblack: thanks, I'll call it a day then! 
Will check the scsi mystery on cp2009 tomorrow o/ [18:14:49] ema: cya [19:21:58] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10ayounsi) Switch ports descriptions updated. [20:18:02] ema: FYI I did a quick hack-around your numa_networking disables for the 21 that remain. I removed all their hieradata settings, then ran puppet on them all to catch up on changes (esp ipsec), then re-disabled puppet with the same message and reverted the hieradata settings back into place [20:18:24] ema: so you shouldn't really notice it, but just in case! [20:27:25] XioNoX: so I think we're more-or-less clear, I've chased down other minor things that might annoy or confuse too [20:27:35] ok! [20:28:17] so, we should probably keep an eye out on whatever it is that shows that crazy 1.2 Mpps of multicast at the router, it seems like a good smell test that things might be going crazy [20:29:08] another thing m.ark mentioned earlier was the possibility of not doing multicast forward on some of the links (e.g. only do it from cr2 but not cr1 to the A switches, or only 1/2 of the inter-fabric links) [20:30:02] I donno if that sounds riskier, or maybe just a backup plan if multicast seems crazy again [20:30:26] worst case we could even just stop multicasting while we get through a transition. things will "work" without it, we just lose cache invalidations, basically. 
[20:30:30] (which is better than melting) [20:30:39] yeah, that's more backup [20:30:51] as well as disabling the link to cr1 [20:33:49] alright, ready to start in 5min [20:34:36] ok I made some announces in -ops to hold other things and changed topic etc too [21:32:14] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [21:43:33] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [21:43:52] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) a:03Cmjohnson