[06:59:21] vgutierrez: so the LVSs are now 11 jessie vs 11 stretch? :)
[06:59:49] yup :D
[06:59:57] nice
[07:02:49] re: T192555, I've been gathering some 10-minute samples to get a rough idea about our current AES128-SHA users
[07:02:50] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555
[07:03:11] after discussing the results with bblack, I need to gather 24h data
[07:04:34] I could go for the hacky/ninja/quick&dirty way but maybe it would be interesting to log that info for 24h in logstash
[07:04:48] ema: is it hard to add a new log there?
[07:06:18] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167254 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3002.esams.wmnet ``` The log can be found in `/var/lo...
[07:07:34] vgutierrez: currently we're sending logs to logstash with specific python daemons (varnishospital, varnishslowlog)
[07:08:15] so yeah it's not particularly hard, it's a matter of writing a new daemon and installing it on all cp hosts
[07:09:02] for the 10-minute thingie I've been using cumin + timeout/varnishncsa
[07:09:13] but we don't think that's reliable for a whole day
[07:10:14] did you send the info to logstash? Why do you think it's not reliable?
[07:10:35] cumin keeping 91 ssh connections open for 24 hours?
[07:10:57] yes?
[07:11:17] hmm any tcp issue would stop the gathering on some nodes...
[07:11:50] also I was logging into /tmp, and for 10m it's acceptable.. but for 24h we'd need some summarization in place
[07:11:56] right
[07:12:14] is it actually important to get precise numbers or could you get a sample on one node per dc/cluster?
[07:13:05] we're discussing user impact, so I feel more comfortable providing accurate data
[07:13:26] how about adding the info to varnishxcps instead?
[07:13:44] hmmm won't work, I need user agents
[07:13:54] that doesn't fit in prometheus :)
[07:14:54] alright then I'm out of ideas! :)
[07:16:00] I think something like varnishospital/varnishslowlog could fit here
[07:16:12] sgtm
[07:16:28] it's only going to be used for 24h and then reverted, but we are going to need similar stuff in the future
[07:16:45] i.e. when we need to discuss TLSv1.0 deprecation
[07:20:31] can just go for `ensure: stopped` instead of reverting the
[07:20:33] *then
[07:20:55] 10Traffic, 10Operations: Gather 24h data cluster wide of AES128-SHA usage - https://phabricator.wikimedia.org/T193376#4167263 (10Vgutierrez) p:05Triage>03Normal
[07:21:18] right
[07:49:34] vgutierrez: a long-running command with cumin on so many hosts is possible but not advisable... I think two better options are:
[07:49:38] - puppetizing it
[07:50:09] - running it with cumin in the background: 'your-command &> /tmp/foo & exit' (cumin will launch it and exit immediately)
[07:59:07] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167320 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3002.esams.wmnet'] ``` and were **ALL** successful.
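For reference, a minimal sketch of the varnishncsa-based collection discussed above, with the summarization the 24h run would need. It assumes varnishncsa accepts the usual Apache-style %{Header}i format items and the -n/-F flags; the header name, instance name and tallying approach are illustrative, not the actual daemon that ended up being written.

```
#!/usr/bin/env python
"""Rough sketch: tally TLS connection properties + user agents from
varnishncsa output instead of dumping raw lines to /tmp."""

import collections
import subprocess

# Hypothetical format string: TLS info forwarded by nginx, plus the UA.
FORMAT = '%{X-Connection-Properties}i\t%{User-agent}i'


def collect(varnish_instance='frontend'):
    counts = collections.Counter()
    proc = subprocess.Popen(
        ['varnishncsa', '-n', varnish_instance, '-F', FORMAT],
        stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            props, _, user_agent = line.decode(
                'utf-8', 'replace').partition('\t')
            # Summarize in memory so a 24h run stays small.
            counts[(props.strip(), user_agent.strip())] += 1
    finally:
        proc.terminate()
    return counts


if __name__ == '__main__':
    for (props, ua), n in collect().most_common(20):
        print('%8d %s %s' % (n, props, ua))
```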
[08:09:28] <_joe_> I suggest puppetizing anything that needs to run for longer than a weekly experiment
[08:09:55] <_joe_> anything shorter is ok to run from tmux
[08:18:37] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167357 (10Vgutierrez)
[09:23:23] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167455 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3001.esams.wmnet ``` The log can be found in `/var/lo...
[10:01:48] elukey: crazy question, what would be the impact of logging TLS information in webrequest?
[10:03:01] vgutierrez: in theory we would need to change the webrequest format itself and store it in HDFS tables etc., so there would be a lot of work to do :)
[10:03:22] ack
[10:03:34] plus the main issue is that we wouldn't have that info logged by varnish (unless nginx inserts a special header)
[10:03:38] but!
[10:04:03] that info is actually logged by nginx + varnish :)
[10:04:24] what we could do instead is find a way to have the TLS info at the varnish level, and then create a special varnishkafka instance only for that, which pushes whatever format we want to Kafka
[10:04:32] ah sorry, didn't know it :)
[10:05:48] maybe it could be interesting to explore that way in the future
[10:08:10] elukey: https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L111-L117
[10:08:38] very nice
[10:09:26] so adding a new vk instance is relatively cheap, and even collecting data on a regular basis to HDFS is not that hard (me and Arzhel worked for a bit on collecting netflow data, for example)
[10:10:02] so if you need something like that because regular metrics are not enough, let me know!
[10:10:07] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167518 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3001.esams.wmnet'] ``` and were **ALL** successful.
[10:10:42] elukey: right now I'm working on sending UA data + TLS data to kibana, it shouldn't be a problem because it's going to match only <0.09% of our traffic
[10:11:14] ack
[10:11:18] but basically I only need to do it because webrequest lacks TLS data :)
[10:11:42] so I was wondering about the actual cost of including that info in webrequest
[10:12:38] with some love on that info we could provide reports on MiTM victims visiting wikipedia and stuff like that
[10:16:20] I can definitely ask my team this question
[10:16:27] thx :)
[10:32:13] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167568 (10Vgutierrez)
[10:32:15] only codfw to complete T191897 \o/
[10:32:16] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[10:32:23] *only codfw missing
[10:32:26] :)
[10:37:53] nice!
[10:39:00] if bblack approves, that could be done on Wednesday
[10:50:25] all tests pass w/ the 4.1 backport of https://github.com/varnishcache/varnish-cache/pull/2555 as well as with the version in master, while backporting the patch to 5.1 makes https://github.com/wikimedia/operations-debs-varnish4/blob/debian-wmf/bin/varnishtest/tests/c00041.vtc#L85 occasionally fail
[10:50:43] it seems to me that we need https://github.com/varnishcache/varnish-cache/pull/2422 too. Backported, rebased and waiting for jenkins
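As a rough illustration of pulling TLS version and cipher out of the X-Connection-Properties header linked above (wikimedia-frontend.vcl.erb), here is a small parser sketch. The exact key names (SSL, C, etc.) and the semicolon-delimited key=value layout are assumptions based on the discussion, not a confirmed format.

```
def parse_connection_properties(value):
    """Return a dict like {'SSL': 'TLSv1.2', 'C': 'ECDHE-...'} from a
    semicolon-delimited key=value header value."""
    props = {}
    for field in value.split(';'):
        field = field.strip()
        if '=' in field:
            key, _, val = field.partition('=')
            props[key.strip()] = val.strip()
    return props


# Example (assumed header layout):
# >>> parse_connection_properties('H2=1; SSL=TLSv1.2; C=ECDHE-ECDSA-AES128-SHA')
# {'H2': '1', 'SSL': 'TLSv1.2', 'C': 'ECDHE-ECDSA-AES128-SHA'}
```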
[10:51:19] https://gerrit.wikimedia.org/r/#/c/429762/ is green which is a good start :)
[10:54:16] yay, success! https://gerrit.wikimedia.org/r/#/c/429440/
[10:54:24] :D
[10:55:03] lunch &
[13:40:49] logging TLS to webreq doesn't sound like a bad idea. but maybe in simplified form so it's just a single string or something (we have a ton of fields).
[13:41:25] bblack: maybe just the CP-Full-Cipher from VC_Log
[13:41:44] well TLS version would be useful too
[13:42:25] hmmm
[13:42:39] in that case, X-Connection-Properties from nginx, I guess it contains everything
[13:43:18] it has some other bits too though, which might be nice to filter on separately if at all
[13:43:21] hmmmm
[13:44:19] BTW, bblack / ema, what's the best way to define a systemd service that only affects the frontend instance? I'm aiming at the varnish::instance define with an if $instance_name == 'frontend' {} block
[13:45:11] I guess I don't understand the context for that q
[13:45:23] you're trying to add a new systemd service that depends on the fe instance, or?
[13:46:19] I'm adding a varnishslowlog/varnishospital-like daemon, but I only want it running and collecting data from the varnish-frontend instance
[13:48:44] I think our existing pattern for that is just to ref the class for your daemon in profile::cache::base, and give it a fixed parameter for the instance name it should connect to, like kafka::webrequest.
[13:48:58] which has this in it:
[13:48:59] $varnish_name = 'frontend'
[13:49:00] $varnish_svc_name = 'varnish-frontend'
[13:49:43] ack
[13:55:57] oh.. even better... varnish::logging :)
[13:57:49] so, in existing webrequest we already have an ";http=1" field (missing if not HTTPS)
[13:57:53] sorry ";https=1"
[13:58:22] we could talk about perhaps repurposing that to specify the protocol
[13:58:42] ";https=TLSv1.0", etc (which will cover future things like DTLS or QUIC)
[13:59:39] and then separately an https_cipher="ECDHE-ECDSA-AES128-SHA" or whatever, which is also not set if https is not set.
[14:00:23] it would preserve their existing query behavior where they're just checking NULL-ness of https, but any queries relying on the explicit value https=1 would need updating to "https IS NOT NULL"
[14:02:53] or we could leave ";https=1" alone as a legacy field to eventually remove, and make a new one for protocol, which has the crypto-layer protocol if https, and "http" otherwise
[14:03:22] ";proto=http" or ";proto=TLSv1.2", etc...
[14:03:43] it's kind of a mixed meaning though, as there are two layers of "protocol" and we could be putting e.g. H/2 info there...
[14:04:09] I donno
[14:33:42] bblack: thoughts on https://gerrit.wikimedia.org/r/#/c/429394/ and related changes? I'd merge those and prepare 5.1.3-1wm8 if you agree
[15:09:26] 10Traffic, 10Operations, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4168193 (10Vgutierrez) After running several small captures (10-minute lapses over 2 days), we've got the following results: * 56% MiTM victims * 32% deprecated human-operat...
[15:16:19] gilles: hi! I'm looking at varnishlogconsumer.py, nice one
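To make the two webrequest options from the 13:57-14:04 discussion above concrete, here is a toy serialization sketch; the https_cipher and proto field names are only the hypothetical ones floated in that exchange, not anything that exists in webrequest today.

```
def tls_fields_repurposed(tls_proto, tls_cipher):
    """Option 1: repurpose https= to carry the protocol, add a cipher field.
    Queries keying on https=1 would need to become 'https IS NOT NULL'."""
    if tls_proto is None:
        return ''  # plain HTTP: both fields stay unset
    return ';https=%s;https_cipher=%s' % (tls_proto, tls_cipher)


def tls_fields_new_key(tls_proto, tls_cipher):
    """Option 2: leave https=1 alone as a legacy field and add a proto= key
    carrying the crypto-layer protocol for TLS and "http" otherwise."""
    if tls_proto is None:
        return ';proto=http'
    return ';https=1;proto=%s;https_cipher=%s' % (tls_proto, tls_cipher)


# >>> tls_fields_repurposed('TLSv1.2', 'ECDHE-ECDSA-AES128-SHA')
# ';https=TLSv1.2;https_cipher=ECDHE-ECDSA-AES128-SHA'
# >>> tls_fields_new_key(None, None)
# ';proto=http'
```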
I'm looking at varnishlogconsumer.py, nice one [15:16:33] there are a couple pep8 errors apparently [15:16:42] puppet/modules/varnish/files/varnishlogconsumer.py:53:17: E128 continuation line under-indented for visual indent [15:16:52] puppet/modules/varnish/files/varnishlogconsumer.py:55:80: E501 line too long (84 > 79 characters) [15:16:55] puppet/modules/varnish/files/varnishlogconsumer.py:58:80: E501 line too long (82 > 79 characters) [15:18:01] hmmm jenkins pep8 sets the limit in 100 chars per line IIRC [15:18:37] so the offending one it's on line 53 [15:19:15] ema: BTW, i remember you mentioning gilles work on some meeting, maybe it could be adopted for varnishtlsinspector before merging? [15:19:42] vgutierrez: yes so his work is currently here (mediawiki/vagrant) https://gerrit.wikimedia.org/r/#/c/427641/ [15:20:39] once we're happy and it's merged there we need to add it to the puppet repo [15:21:49] ack [15:46:31] ema: +1 on the patch series [15:46:46] ema: I guess we should roll up all of this before trying to improve on the current vcl_hit situation? [15:47:45] bblack: yes, the current (not particularly elaborate!) plan is https://phabricator.wikimedia.org/T192368#4153519 [15:52:03] with s/1wm7/1wm8/ as 4.1.10 was released two days after that comment :) [16:14:39] 10Traffic, 10DNS, 10Operations, 10Release-Engineering-Team, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4168429 (10demon) a:03demon I'll handle this. Should just be a domain swap--no need to bother doing renames... [16:34:52] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4168470 (10ema) @Krinkle I've pushed https://gerrit.wikimedia.org/r/429833 to remove varnishmedia, my understanding is that there's only [[ https://grafa... [16:38:56] that reminds me, we should finish pushing numa_networking to rest of caches sometime (complicated by need for downtimes, etc) [16:43:57] bblack: we can perhaps do that together with the varnish wm8 upgrades [16:59:18] ema: vgutierrez: ok, I'll fix the pep8 issues tonight and create the same patch for puppet [17:11:40] gilles: cool :D [17:24:47] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4168747 (10cwdent) SSL certs are what allow your browser to show you a green bar and guarantee that if you see that, you are talking to the Wikimedia Fo... [17:27:05] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4168760 (10Ejegg) cwdent we formerly had silverpop-hosted urls in the email links, and lots of people thought they were phishing spam [17:28:51] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4168775 (10CCogdill_WMF) We used a Silverpop URL for a few months and got enough complaints from donors that our Donor Services team asked us to turn cl... 
[18:47:26] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4169093 (10cwdent) @Ejegg @CCogdill_WMF ok scratch that idea :)
[21:18:16] 10Traffic, 10Wikimedia-Apache-configuration, 10Operations, 10Patch-For-Review: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4169688 (10EddieGP) a:03Joe Assigning to joe - it seems you're the one most comfortable (or only one comfortable?) on apache changes. Also p...