[01:38:01] With our IRC ad service you can reach a global audience of entrepreneurs and fentanyl addicts with extraordinary engagement rates! https://williampitcock.com/ [01:38:04] I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [01:38:08] Read what IRC investigative journalists have uncovered on the freenode pedophilia scandal https://encyclopediadramatica.rs/Freenodegate [01:38:11] A fascinating blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/ [07:19:59] <_joe_> sigh [07:41:51] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10TheDJ) So apparently Google detects JS based redirects these days. I also suspect... [07:53:22] ema: it looks like the cp stretch hosts are cron-spamming a little bit [09:33:51] vgutierrez: that's https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864242, opened by our dear godog [09:37:05] good times [09:46:18] ha, the network activity graphs on varnish-machine-stats are borked on stretch [09:46:41] we're filtering for devices named eth.* :) [09:49:05] ema: that's expected [09:55:20] how could we survive in those times of uncertain interface names, really I don't know [09:55:29] yeah..
dark times [10:44:26] !log reboot cp1045 for kernel update [10:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:16] bblack: what would be the best way of removing AES128-SHA from ssl_ciphersuite.rb? just getting rid of it and leaving "compat" list empty? [12:00:41] that would make mid == compat [12:26:58] but this could give us some problems for servers with OS < debian jessie [13:33:34] 10Traffic, 10DNS, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Cmjohnson) [13:59:10] vgutierrez: yeah it's an interesting question. we could also cumin-audit for ssl_ciphersuite use on (I don't think cumin supports checking a parser function directly, but we could check all the classes that use the parser function) [14:00:20] although "mid" will auto-downgrade to "compat" as well; the whole reason for that was the dhe_ok bit, for apaches https://news.ycombinator.com/item?id=17659983 "WireGuard is submitted for Linux kernel inclusion" [14:08:54] there's also the non-web stuff to consider. mail exchangers use ssl_ciphersuite as well I think, and I don't know how bad this change is for them, we don't really track it [14:10:15] oh scratch that. we never did get the smtp stuff switched to ssl_ciphersuite anyways [14:53:02] bblack: according to puppetboard we've only a few < jessie hosts (ignoring the labs environment) [14:58:04] we also have some pretty funny cases like...
[14:58:09] modules/puppetmaster/manifests/init.pp:125: $ssl_settings = ssl_ciphersuite('apache', 'compat') [15:05:20] yeah [15:05:33] clearly, internal cases shouldn't need "compat", they can be "high" [15:06:11] also, tlsproxy is a focal point, it uses "compat" because it was originally developed for the edge servers, but it's unparameterized for ciphersuite and tlsproxy is used by a ton of other things now, which are all internal as well [15:06:17] well, some of them are internal, I mean [15:06:26] maybe we can clean all of that up after, separately [15:11:48] I'm running down some cumin intersections to find out if the dhe_ok code can just be removed (apache + ssl_ciphersuite + trusty) [15:12:11] volans: was there any magic trick to perform in case remote ipmi fails? [15:12:19] ipmitool -I lanplus -H cp5006.mgmt.eqsin.wmnet -E chassis power status [15:12:27] Error: Unable to establish IPMI v2 / RMCP+ session [15:12:35] ema: yes [15:12:43] https://wikitech.wikimedia.org/wiki/Management_Interfaces [15:13:06] doc?! [15:13:08] volans: awesome, ty! [15:13:31] yeah I wrote them down recently [15:13:36] sent an email about it :) [15:14:17] bblack: is now a good time to push mtu 1450 to codfw ipsec? [15:14:21] XioNoX: +1 [15:14:54] cool, fyi change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/449526 [15:21:41] volans: interesting, so this diff command was showing output https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_remote_IPMI_enabled? [15:21:55] volans: tried --commit, waited a bit, still nothing [15:22:13] volans: reset the password, still no remote IPMI [15:22:19] the diff is empty now? [15:22:23] yup [15:22:32] for both ones? [15:22:53] yes [15:23:07] racadm racreset [15:23:09] ? [15:23:40] I'm in [15:23:50] it works for me [15:24:04] does remote IPMI work for you? 
[15:24:19] sorry, tried ssh [15:24:21] not yet ipmi [15:24:48] vgutierrez: I audited, I believe completely-thoroughly, and there are no cumin hosts that are Chassis Power is on [15:24:51] yes it works [15:25:06] I'm following up on one or two strange cases while I'm auditing though, just a sec [15:25:28] ema: did you reset it? [15:25:31] volans: this fails for me from neodymium `ipmitool -I lanplus -H cp5006.mgmt.eqsin.wmnet -E chassis power status` [15:25:47] I did from sarin [15:25:47] sudo ipmitool -I lanplus -H "cp5006.mgmt.eqsin.wmnet" -U root -E chassis power status [15:27:15] fascinating [15:27:26] mtu change pushed to cp2026, waiting to see if any issues [15:27:56] volans: it seems to fail if you `sudo -i` first [15:28:54] ema: have you tried with -U root ? [15:28:58] I always use it [15:29:30] the fragility of this shit is mind blowing [15:30:29] volans: yes, '-U root' fixes it [15:30:53] volans: this also works -> `sudo ipmitool -I lanplus -H cp5006.mgmt.eqsin.wmnet -E chassis power status` [15:31:14] what does not work is: `sudo -i ; ipmitool -I lanplus -H cp5006.mgmt.eqsin.wmnet -E chassis power status` [15:31:57] volans: thanks :) [15:32:05] ema: from the man page is even better [15:32:06] -U [15:32:06] Remote server username, default is NULL user. [15:32:10] :D [15:33:25] XioNoX: could your change be related to the puppet run failures on codfw nodes?
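[Editor's aside] The failure mode untangled above comes down to two ipmitool defaults: `-E` reads the password from the `IPMI_PASSWORD` environment variable (which a bare `sudo -i` strips, since it starts a fresh environment), and without `-U` ipmitool falls back to the NULL user. A tiny, hypothetical wrapper — the helper name is made up, not anything in the puppet repo — that bakes in the fix:

```ruby
# Hypothetical helper capturing the lessons above: always pass an explicit
# user (ipmitool defaults to the NULL user without -U), and keep relying on
# -E / IPMI_PASSWORD only where the environment is known to survive.
def ipmi_power_status_cmd(mgmt_host, user: 'root')
  ['ipmitool', '-I', 'lanplus', '-H', mgmt_host,
   '-U', user, '-E', 'chassis', 'power', 'status']
end

puts ipmi_power_status_cmd('cp5006.mgmt.eqsin.wmnet').join(' ')
# => ipmitool -I lanplus -H cp5006.mgmt.eqsin.wmnet -U root -E chassis power status
```

Returning an argv array (rather than a shell string) also sidesteps quoting issues if this were ever fed to `system` or an SSH layer.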
[15:33:39] from mc2035 puppet.log [15:33:42] Aug 1 15:25:20 mc2035 puppet-agent[22634]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for mc1035.eqiad.wmnet Resolv::DNS::Resource::IN::AAAA at /etc/puppet/modules/strongswan/manifests/init.pp:14:25 on node mc2035.codfw.wmnet [15:34:20] uh, yeah [15:34:40] we've a bunch of them failing already [15:35:42] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp5006.eqsin.wmnet', 'cp5007.eqsin.wmnet'] ``` The log can be found in `/var/l... [15:36:09] https://puppetboard.wikimedia.org/nodes?status=failed [15:36:45] vgutierrez: I think I know what the issue is, but rolling back [15:36:54] ack [15:40:16] done, and running puppet on failed hosts [15:43:35] <3 nice [15:43:59] I think the issue is with `$dest_ip6 = ipresolve($dest_host,6)` if $dest_host doesn't have an IPv6, ipresolve breaks, instead of returning null [15:44:17] yup, looks like that [15:44:26] not sure how to fix it yet though [15:45:38] one stretched workaround is to configure v6 everywhere :) [15:45:46] XioNoX: it looks like patching ipresolve [15:46:09] (everything recovered) [15:46:24] that sounds like fun (patching ipresolve) [15:48:07] https://github.com/wikimedia/puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/ipresolve.rb#L104 [15:49:06] returning nil there should do the trick, and at least strongswan init.pp is able to handle nil values [15:49:07] vgutierrez: https://gerrit.wikimedia.org/r/q/topic:ciphersuite_stuff - can you review/think on the sanity there? The first two we can merge today, the last is for a little later when we're sanitizing the mid/compat use-cases, etc, but explains why to keep the "compat" argument at all. 
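[Editor's aside] The fix being discussed — make ipresolve return nil when a host has no AAAA record, instead of raising, so callers like strongswan's init.pp can handle the missing-v6 case — could look roughly like the sketch below. This is a standalone illustration using Ruby's stdlib resolver, not the actual wmflib `ipresolve.rb`; the function name and the injectable `resolver` argument are assumptions added so the behavior can be exercised without the network.

```ruby
require 'resolv'

# Sketch only: return the first A/AAAA address for `name`, or nil when the
# requested record type doesn't exist — rather than raising the way the
# puppet function did before the fix discussed above.
def ipresolve_or_nil(name, family = 4, resolver = Resolv::DNS.new)
  type = family == 6 ? Resolv::DNS::Resource::IN::AAAA : Resolv::DNS::Resource::IN::A
  records = resolver.getresources(name, type)
  records.empty? ? nil : records.first.address.to_s
end
```

`getresources` already returns an empty array (instead of raising) when there is no matching record, which is what makes the nil-return shape natural here.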
[15:50:01] ack [15:52:56] vgutierrez: thanks for the pointer, question then is how to make sure that change would not break anything else? Look at all the other occurrences of ipresolve? [15:53:45] XioNoX: I'm afraid that's the way to go [15:53:52] okay! [15:54:19] bblack: on the first change, it could make sense to verbosely warn that ssl_ciphersuite no longer supports <= jessie? [15:56:46] phabricator.w.o, grafana.w.o, and config-master.w.o can be tested through pinkunicorn [15:57:19] add `208.80.154.42 phabricator.wikimedia.org` and similar to /etc/hosts and see if stuff breaks (it shouldn't) [16:08:37] vgutierrez: you mean still check for it? or alternatively, we could assert it and let puppet fail [16:09:19] but yeah, good idea either way [16:10:03] hmmm letting puppet compilation fail would be the safest approach [16:10:06] I think labs uses/used Xenial too [16:10:10] ? [16:10:37] which we assume is ~ jessie+ for these purposes in terms of apache/nginx package versions [16:10:48] but it would fail a debian>=jessie check [16:12:11] dunno about labs honestly [16:12:35] yeah me either [16:12:45] BTW, I'm tempted to suggest moving the only DHE ciphersuite to compat [16:13:21] but as it doesn't weaken the other configs (like AES128-SHA is doing right now) I guess it doesn't matter [16:13:34] it might make sense, as part of the 3rd change or later, but we need to develop a better plan around what we're doing to all the different ssl_ciphersuite/tlsproxy consuming cases first [16:14:03] e.g.
there could be some service configured as "mid" now that actually cares about DHE (because android 2.x or openssl-0.9.8) [16:14:19] bblack: no, I'm not aware of WMCS using xenial [16:14:29] oh ok, I thought I saw that somewhere once [16:14:45] IIRC it was under consideration but all went the stretch path [16:14:56] they do have trusties that aren't being considered by my cumin audit I guess [16:15:48] are you trying to find Cloud VPS instances affected or just the WMCS servers in the production domain? [16:15:49] bblack: you can apply your audit to WMCS too if you want [16:15:58] there's a separate cumin setup for the former [16:16:15] but not with smart puppetdb queries, has to be something to check on the hosts [16:16:16] moritzm: well, anything we might break via prod puppet code, which might include instances that puppetize? [16:16:20] as WMCS doesn't have puppetdb [16:16:39] this is sounding complicated anyways [16:16:57] I re-ordered the commits to just pull AES128-SHA out of compat first and leave the rest for after [16:17:36] ok, you can also simply re-run your Cumin query on the Cumin/WMCS server, then: https://wikitech.wikimedia.org/wiki/Cumin#WMCS_Cloud_VPS_infrastructure [16:17:48] it used puppetdb [16:17:51] but yeah [16:17:53] but I doubt any VPS instances use that anyway [16:18:26] well, I think there are some special cases, like toollabs proxies that are instances, which indirectly end up using ssl_ciphersuite [16:18:27] we should have a code-level way to declare that a given class/role is not suitable for WMCS [16:19:34] vgutierrez: in any case, with the risks of some "omg labs broke" revert, better in this order so we can just do the AES128-SHA bit today and leave everything else for later [16:20:06] bblack: right [16:21:25] vgutierrez: rebased and added bug link to the aes128-sha change [16:21:39] vgutierrez: I'd say push it whenever you're ready (just the first change) [16:22:08] vgutierrez: and when we get some time in the next few weeks, we
should audit all the rest of this situation more-thoroughly and figure out what to change with all the other ssl_ciphersuite consumers [16:22:22] bblack: the goal task is T192555, maybe it's worth mentioning it as well [16:22:23] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [16:22:34] thx stashbot :* [16:22:58] at this point I'm inclined to think our standard should be: "strong" for any internal-only case, "compat" for TLSv1.0 for the public cache termination, and "mid" for everything else that's public-facing but not the prod wikis (with TLSv1.2+ and the current mid list) [16:23:42] vgutierrez: updated [16:24:34] (and then maybe from that point above, we can consider pushing DHE down to compat, effectively stripping it from everywhere but the cache terminators) [16:25:06] I'm not 100% sure about mid [16:25:43] if we drop TLS1.1 and TLS1.0 in mid, only some older Safari would use those ciphersuites instead of the ones listed under strong [16:26:01] (according to ssllabs data) [16:26:48] so public-facing but non prod wikis could go to "strong" as well [16:27:36] taking into account of course that actually "non prod wikis" is everything that's not behind our cache layer [16:27:47] right [16:28:01] the curious case is the older safaris mostly [16:28:09] that do TLSv1.2, but not GCM [16:28:40] anyways, it can at least be a second step after we see how the first goes [16:28:46] yep [16:29:18] pushing DHE down will just kill android 2.x and openssl-0.9.8 and other things of that era afaik [16:29:32] Java 6u45 as well [16:29:32] which should be fine everywhere but the caches I think [16:29:45] java6 is already dead because we require 2048-bit DHE [16:29:50] sure [16:30:22] but android 2.x and openssl-0.9.8 are not able to talk TLS 1.2 [16:30:37] so not only pushing DHE down will kill'em [16:30:58] making mid TLS1.2+ only will kill'em too [16:31:18] oh right [16:31:37] so may as well move DHE down to compat at
the same time as the tls version change [16:31:56] indeed [16:32:20] so I guess keep the empty list in the first commit too, then [16:33:16] looks saner than removing and readding it :) [16:36:16] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5007.eqsin.wmnet', 'cp5006.eqsin.wmnet'] ``` and were **ALL** successful. [16:36:32] vgutierrez: re-uploaded [16:36:46] nice [16:37:11] so do the honors and turn AES128-SHA off whenever you want :) [16:37:37] I was going to say you do it [16:37:44] oh please, you're the boss :P [16:38:03] * ema votes for valentin too! [16:38:25] if I'm the boss I get to decide who does it anyways :) [16:38:34] damn... [16:38:37] he's right as usual [16:38:38] ha! [16:38:46] s/usual/always/g [16:39:22] so... let's go FS only :) [16:39:42] I don't even remember if the terminators will auto-reload the config [16:39:44] \o/ [16:47:13] bblack: nope, we need to reload the config manually [16:49:49] ok [16:50:05] https://www.ssllabs.com/ssltest/analyze.html?d=pinkunicorn.wikimedia.org&s=208.80.154.42&latest [16:50:20] I reloaded our beloved pinkunicorn [16:50:36] no weak ciphersuites :) [16:50:47] I assume just an nginx reload does the trick, doesn't need the upgrade/restart? [16:50:57] indeed [16:51:01] reload is enough [16:51:02] if so we can push that with cumin pretty aggressively, it's downtimeless [16:51:41] but not entirely seamless! :) [16:51:53] it will take a little while for the actual stats to drop off, because I think the reloaded nginxes will continue to allow session resumption from the session cache for AES128-SHA [16:52:11] but that's ok :) [16:52:36] tomorrow I'll submit the patch to clean the VCL BTW [16:52:41] ok cool [16:55:12] XioNoX: so rewinding back and looking at the mtu thing, we have hosts without ipv6 defined in dns?
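[Editor's aside] The tiering being debated above — nested strong ⊂ mid ⊂ compat lists, where dropping AES128-SHA (the lone non-forward-secret cipher) leaves compat identical to mid — can be modeled abstractly. The cipher names below are a plausible subset chosen purely for illustration, not the real contents of ssl_ciphersuite.rb:

```ruby
# Illustration only: nested ciphersuite tiers, each tier appending a few
# weaker suites to the stronger one (names here are assumed examples).
STRONG = %w[ECDHE-ECDSA-AES256-GCM-SHA384 ECDHE-RSA-AES256-GCM-SHA384].freeze
MID    = (STRONG + %w[ECDHE-RSA-AES128-SHA DHE-RSA-AES128-SHA]).freeze
COMPAT = (MID + %w[AES128-SHA]).freeze  # the non-forward-secret straggler

# Removing AES128-SHA makes compat == mid, i.e. the "compat list is now
# empty" / "mid == compat" situation from the 12:00 exchange above.
puts (COMPAT - ['AES128-SHA']) == MID  # => true
```

Keeping the (now-empty) compat slot in the first commit, as decided above, preserves the tier structure so DHE can later be demoted into it without re-adding plumbing.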
[16:55:44] bblack: correct, (some of) the mc servers for example [16:55:58] oh [16:56:08] yeah, I wasn't really thinking about the fact that this will impact those [16:56:16] we probably should avoid touching those without some coordination at least [16:56:56] (even if the ipv6 problem didn't exist, they need to know or think about the risk with the mtu change. maybe they don't have the problems we're trying to fix anyways, and the mtu change causes a perf regression for them) [16:56:59] cp500[67] upgraded to stretch and pooled [16:57:44] XioNoX: could switch up the plan and hit eqsin and/or esams first, which puts off the mc* problem a bit [16:58:17] (but also doesn't get us testing it on both ends of a connection anytime soon, I guess) [16:58:35] bblack: codfw and eqiad are where most of the errors happen [17:01:53] 10Traffic, 10Operations, 10ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) Update from email thread: I've started to arrange for the Dell Onsite Engineer to visit next Monday, August 6th. We'll need to ensure cp5001 is still offline this Friday in... [17:05:30] vgutierrez: yeah, that was for my work, it can be re-enabled anytime if needed [17:05:56] XioNoX: whenever possible to let the nodes get the ssl_ciphersuite update [17:09:06] XioNoX: I'll run puppet then [17:12:06] XioNoX: done, thx :) [17:13:10] thx, re-disabled [17:13:55] bblack: I'm reloading nginx now.. in batches of 5 nodes.. just for my sanity [17:18:20] done [17:51:11] vgutierrez, is there any point listing tls-sni-0[12] anymore? 
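[Editor's aside] The cautious rollout at 17:13 ("reloading nginx now.. in batches of 5 nodes.. just for my sanity") is a sliced iteration over the fleet. A minimal sketch with made-up host names and the remote command stubbed out — in production this was driven via cumin, not plain Ruby:

```ruby
# Sketch of a batch-of-5 rolling reload; the remote call is stubbed so the
# slicing logic is visible. Host names below are hypothetical.
hosts = (1..12).map { |i| format('cp10%02d.eqiad.wmnet', i) }

hosts.each_slice(5) do |batch|
  # A real version would shell out per batch, roughly:
  #   cumin '<batch>' 'systemctl reload nginx'
  puts "reload batch: #{batch.join(', ')}"
end
```

Because an nginx reload is downtimeless, batching is mostly about bounding the blast radius of a bad config rather than capacity.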
[17:57:58] vgutierrez: heyas, lower priority (i dont have it scheduled until next monday) but cp5001 will need to go offline on friday for monday (singapore am) memory swap [17:59:58] just checking that is going to be ok [18:19:05] bblack: confirming that the mtu change fixed the issue: https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&panelId=20&fullscreen&from=now-3h&to=now (then select "OUT Dest Unreach codfw:cache_upload") [18:25:22] nice [18:31:36] will push to esams/eqsin [18:56:57] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @fgiunchedi @Joe @Dzahn @herron - Casting a wide net here, as I'm not s... [18:58:04] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @BBlack Somewhat random, but does varnish have the ability to translate... [19:20:31] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Dzahn) >>! In T199252#4470132, @Imarlier wrote: > My current plan is to solve thi... [20:29:50] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Krinkle) >>! In T199252#4470147, @Imarlier wrote: > @BBlack Somewhat random, but... [21:32:35] bblack: everything looks smooth. Good to do eqiad now, or better to wait for tomorrow? 
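[Editor's aside] For context on the 1450 figure confirmed above: ESP tunnel mode wraps each packet in a new outer IP header plus ESP framing, so the inner MTU must be clamped below the 1500-byte link MTU to avoid fragmentation and the ICMP unreachable spikes the graphs showed. The overhead accounting below is a rough, assumed back-of-envelope, not measured from the wmf hosts:

```ruby
# Back-of-envelope (assumed, approximate numbers) for why an inner MTU of
# 1450 fits IPsec ESP traffic inside a standard 1500-byte ethernet MTU.
LINK_MTU     = 1500
OUTER_IP_HDR = 20   # new IPv4 header added by tunnel mode
ESP_FRAMING  = 30   # SPI + sequence + IV + padding + ICV, ballpark figure

inner_mtu = LINK_MTU - OUTER_IP_HDR - ESP_FRAMING
puts inner_mtu  # => 1450
```

The exact ESP framing cost varies with cipher and integrity algorithm; 1450 simply leaves comfortable headroom under all the combinations in play.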
[21:41:28] yeah might as well do it [21:45:00] alright [21:46:57] with netflow we could also automate sending emails to ISPs if we receive a certain amount of ICMP time exceeded (cf. email I sent and CCed NOC) [22:21:35] 10Traffic, 10Operations, 10ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) Just need to check with #traffic to ensure having this offline Friday-Monday is ok? [22:22:51] 10Traffic, 10Operations, 10ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10BBlack) It's already depooled, should be fine! [22:52:50] 10Domains, 10Traffic, 10Operations, 10WMF-Communications, 10wikimediafoundation.org: Update jobs.wikimedia.org - https://phabricator.wikimedia.org/T200951 (10Varnent) [23:04:55] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 (10ayounsi) 05Open>03Resolved This is done, the static routes with mtu lock did the trick, as expected. No more ICMP spikes confirmed on https://grafana.wikimedia.org/dashb... [23:42:17] 10Traffic, 10Operations, 10ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) Excellent, I'll continue coordinating with Dell support and Equinix to file the tasks for this repair.