[05:43:44] <XioNoX>	 ema: there is already a dashboard named https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?orgId=1&from=now-1h&to=now ok to overwrite it while renaming prometheus-varnish-http-requests?
[05:43:50] <XioNoX>	 they seem redundant
[08:19:44] <ema>	 XioNoX: yep
[08:19:53] <XioNoX>	 thx
[08:42:32] <wikibugs>	 10HTTPS, 10Traffic, 10DBA, 10Operations, and 3 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#4282120 (10Bawolff)
[08:54:28] <ema>	 vgutierrez: hey :)
[08:55:13] <vgutierrez>	 hey
[08:55:27] <ema>	 any specific reason for using two different resp.reasons in wikimedia-frontend.vcl.erb? "Browser Connection Security AES128-SHA intercepted request" vs "Browser Connection Security Warning"
[08:55:43] <vgutierrez>	 to avoid hitting the two if clauses?
[08:56:00] <ema>	 oh, you send different CC headers
[08:56:05] <vgutierrez>	 right
[08:58:24] <ema>	 on the other hand, we don't really use "Browser Connection Security Warning", do we?
[08:58:47] <ema>	 with the exception of /test-sec-warning 
[09:00:04] <vgutierrez>	 only for /sec-warning && /test-sec-warning
[09:00:26] <vgutierrez>	 it's nice to be able to check the page directly when you do changes on it
[09:01:18] <ema>	 right, so perhaps we might just have one single if (resp.reason == "Browser Connection Security Warning") with CC: "max-age=0, must-revalidate, no-cache, no-store"?
[09:01:39] <vgutierrez>	 hmmm
[09:02:08] <vgutierrez>	 only while it's a non 100% redirection, after that.. caching it's useful as bblack said yesterday EU afternoon
[09:03:39] <ema>	 if it makes things easier, then let's keep the "CC: $cacheable" block too but add a comment explaining why it's there (for when we reach 100%) ? 
[09:04:18] <ema>	 otherwise I'll forget the reasoning next week and bug you again about this, I'm sure :)
[09:04:50] <vgutierrez>	 I love it when you bug me
[09:04:57] <vgutierrez>	 but I'll add the comment :P
[09:05:05] <ema>	 haha
[09:06:42] <vgutierrez>	 it would be awesome to deploy this 1% change today... to be in the 6 weeks schedule and disable completely AES128-SHA on Aug 1st
[09:08:59] <ema>	 ok yeah let's do it
[09:09:19] <ema>	 interesting post on lvs-users: http://archive.linuxvirtualserver.org/html/lvs-users/2018-06/msg00004.html
[09:09:56] <ema>	 perhaps instead of avoiding iptables altogether disabling conntrack is enough for our use case? 
[09:10:06] <vgutierrez>	 the timeframe allow us to do it tomorrow... but you know... Friday before pre-summit weekend with everybody traveling isn't a good a idea
[09:11:08] <vgutierrez>	 I guess that disabling iptables completely saves us some CPU cycles as well
[09:11:40] <vgutierrez>	 but security-wise it could be better if we can afford that and have host firewall on the lvs as well :)
[09:12:33] <ema>	 right
[09:14:22] <XioNoX>	 yup, I agree
[09:15:35] <_joe_>	 vgutierrez: iirc iptables kills the lvs servers as soon as you load it
[09:15:49] <XioNoX>	 that's not convenient
[09:15:52] <_joe_>	 but yeah, we couldd make a try
[09:16:36] <XioNoX>	 maybe start with iptables on cp servers
[09:16:38] <vgutierrez>	 we could try banning iptables conntrack modules
[09:16:53] <vgutierrez>	 even a stateless firewall is better than open doors :)
[09:16:56] <_joe_>	 XioNoX: you'll have to convince your manager first
[09:17:23] <_joe_>	 he planted a sign "keep off my lawn" in front of iptables a long time ago :D
[09:17:42] <ema>	 I'll add the item to today's agenda to spice things up :)
[09:17:49] <XioNoX>	 _joe_: we briefly discussed it, which that MTU issue thing. It might come back on the table if someone has cycles to work on it
[09:17:59] <XioNoX>	 with*
[09:18:41] <XioNoX>	 ema: probably a good topic for the offsite too
[09:18:54] <vgutierrez>	 ema: hahahaha
[09:18:56] <vgutierrez>	 +1
[09:19:19] <vgutierrez>	 I'd try to attend to the meeting... probably from napuccino... that's pretty close from my physiotherapy place :(
[09:19:35] <vgutierrez>	 I bet XioNoX remembers the place
[09:27:41] <mark>	 i don't really see the point of iptables on lvs servers
[09:28:25] <mark>	 just make sure there aren't any open doors?
[09:28:31] <mark>	 it's not like they run many services
[09:29:33] <XioNoX>	 yeah, also limit the scope of a vulnerability in a service we could run
[09:31:11] <mark>	 i am all for iptables elsewhere, let me be clear about that
[09:31:25] <mark>	 just on lvs in particular, and to some degree cp servers, I think the drawbacks don't outweigh the benefits
[09:33:57] <volans>	 vgutierrez: IIRC the module is already blacklisted/banned
[09:34:04] <mark>	 11:10:06 <vgutierrez> the timeframe allow us to do it tomorrow... but you know... Friday before pre-summit weekend with everybody traveling isn't a good a idea
[09:34:30] <mark>	 no deployments allowed on fridays in general, certainly not this week ;)
[09:34:31] <vgutierrez>	 volans: currently all of them
[09:34:35] <XioNoX>	 afaik, we're not 100% sure of the drawbacks. We think it would have a performance impact but it might not
[09:36:36] <mark>	 well, it sure had a big impact in the past. but a) not using conntrack would make a big difference there indeed, and b) things could have changed in the mean time
[09:37:11] <vgutierrez>	 hmm a small stateless ruleset shouldn't affect that much
[09:37:39] <mark>	 maybe not, no
[09:37:42] <mark>	 to do what though?
[09:38:09] <vgutierrez>	 check that ssh, pybal ports and so on are only reachable by allowed peers?
[09:38:16] <vgutierrez>	 s/check/ensure/
[09:39:39] <mark>	 they're on the internal subnets in the first place
[09:39:51] <vgutierrez>	 not all of them :)
[09:39:56] <mark>	 work on that then :)
[09:40:09] <mark>	 the pybal ports shouldn't really allow you to do anything either
[09:40:28] <XioNoX>	 only the old eqiad are not on internal subnets
[09:40:34] <vgutierrez>	 right.. but you're widening the attack surface unnecessarily 
[09:41:34] <mark>	 i rather see it as not much room for narrowing it more than it already has
[09:43:13] <vgutierrez>	 also... it would be interesting to avoid lateral movement on cp* and lvs* servers
[09:43:44] <XioNoX>	 racks are pretty strong, servers shouldn't move laterally
[09:43:49] * XioNoX hides
[09:44:11] <vgutierrez>	 XioNoX: they leak sometimes.. I can jump from one cp server no another
[09:44:14] <vgutierrez>	 :P
[09:44:24] <vgutierrez>	 s/no/to/
[09:46:17] <vgutierrez>	 of course we could monitor this kind of foul play in other layers
[09:46:32] <vgutierrez>	 i.e: netflow/sflow on the switches
[09:46:50] <vgutierrez>	 we wouldn't be able to stop but at least we would be aware of it
[09:53:28] <mark>	 and logstash
[09:53:45] <vgutierrez>	 logstash?
[09:55:18] <mark>	 yes, analysis of centralized logging across hosts so you can track logins
[09:55:24] <mark>	 of course it only works for some cases
[09:55:44] <mark>	 not with remote execution vulnerabilities spawning a shell ;)
[09:55:47] <vgutierrez>	 sure, for stuff that generate log entries :)
[09:55:48] <vgutierrez>	 right
[09:55:50] <mark>	 yup
[09:56:32] <mark>	 but we don't do much of that yet and in the near future we should
[09:59:26] <vgutierrez>	 I'm feeling bad right now, i'm a security engineer and the machines of my subteam have banned iptables :(
[09:59:30] * vgutierrez goes to the corner
[10:06:02] <mark>	 i think also the newer stuff that does offloading of ACLs to hardware is interesting
[10:19:01] <_joe_>	 so regarding lvs and network isolation of functions, we had a discussion in the past and paravoid wrote a ticket with a proposal on how to use network namespaces for that
[10:19:36] <_joe_>	 https://phabricator.wikimedia.org/T114979
[10:20:53] <mark>	 yes
[10:21:12] <mark>	 ema: remote hands request sent
[10:25:26] <ema>	 mark: thanks!
[10:25:33] <vgutierrez>	 nice :D
[10:25:54] <vgutierrez>	 let's see if we are able to diagnose or even recover cp3037
[10:28:18] <vgutierrez>	 ema: I'm wondering... what's the safest way of testing https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440114/ ?
[10:28:35] <vgutierrez>	 besides deploying it on cp1008
[10:30:31] <ema>	 mmh
[10:30:59] <ema>	 we should write a vtc test case
[10:32:05] <ema>	 and yeah, cp1008
[10:33:20] <ema>	 also you could test that on a cache_misc node, the one your IP gets chashed to in esams
[10:34:28] <ema>	 but IMHO testing on cp1008 should be enough 
[10:35:11] <vgutierrez>	 https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishtest/tests/m00002.vtc --> std.random() testing
[10:36:06] <vgutierrez>	 ok
[10:38:00] <ema>	 oh cool, I didn't know of debug.srandom
[10:49:29] <vgutierrez>	 ema: is it safe to run varnishtest on a machine already running varnish?
[10:50:11] <vgutierrez>	 (mainly cp1008)
[10:51:48] <ema>	 vgutierrez: absolutely
[10:52:19] <ema>	 lunch, bbl
[11:11:59] <vgutierrez>	 lunch++
[11:34:32] <mark>	 wtf, is data in racktables that bad
[11:36:50] * mark cries
[11:36:51] <mark>	 https://phabricator.wikimedia.org/T136403
[11:42:29] <bblack>	 re: the long-standing iptables debate on cp/lvs: it's really just a matter of testing and digging.  Someone has to put in the legwork to validate how it really works out on modern kernels with RSS in general and LVS in particular.
[11:43:26] <bblack>	 my intuition is it's probably ok on the cp servers these days, but we should validate it doesn't effectively undo the RSS optimizations (meaning iptables keeps the traffic from scaling linearly across IRQs/cores)
[11:44:03] <bblack>	 for LVS it's all that, plus how it may or may not affect the efficiency given LVS itself and that whole unique thing
[11:44:17] <bblack>	 (and in either case, I should say, I assume stateless rules.  conntrack is probably a bad idea on either one)
[11:45:49] <XioNoX>	 yeah, conntrack should be kept out of the picture
[11:50:02] <bblack>	 starting with RSS: reading through current design docs and/or kernel code could give a lot of real insight on RSS-vs-iptables.  Testing could also tell us some things, but can be tricky to rely on (we'd eventually have to test on a fairly heavy-traffic scenario to see any effect, and look deeply at the distribution of work on the CPUs, cache misses, other inter-processor stuff (IPI? numa?)
[11:50:48] <paravoid>	 my take is what joe linked to above
[11:50:51] <bblack>	 there's a real risk of going about this the wrong way, where we "test" on one cp node somewhere and everything looks "fine", but it's just subtle broken efficiency in a way that won't matter until we're under a heavier load spike
[11:50:57] <paravoid>	 if you split the LVS-as-a-router to a separate namespace
[11:51:17] <paravoid>	 you can do conntrack in the main namespace, and generally treat the host as a regular host
[11:51:22] <paravoid>	 living only in the internal network etc.
[11:51:29] <bblack>	 yeah, that too, assuming namespaces split things so cleanly tnat iptables in one namespace doesn't at all impact network traffic for another.
[11:52:00] <bblack>	 for all we know there's some inefficiency there where the mere presence of iptables modules causes some inefficient path for all network traffic off the card regardless of namespace
[11:52:13] <mark>	 that's pretty likely with hardware offloading and all that yeah
[11:54:07] <bblack>	 but if we're going down this road at all: look first at RSS-vs-iptables ignoring LVS-specific issues.  because if there's an RSS-vs-iptables problem it affects both cases.
[11:54:46] <paravoid>	 maybe?
[11:54:58] <paravoid>	 the idea is that you'd assign the 10G interfaces to a separate netns entirely
[11:55:16] <paravoid>	 the whole interface, so in theory it'd be a clean cut
[11:55:29] <bblack>	 yeah but iptables rulesets don't attach to interfaces.  interfaces are something you match on inside the rulesets
[11:55:38] <bblack>	 hence the possibility loading them at all affects everything
[11:55:57] <paravoid>	 I was talking about RSS
[11:56:35] <bblack>	 yes
[11:58:22] <bblack>	 the goal with RSS (which we're only partway there on for some cases) is that it splits the traffic across some IRQs that route to separate cores, and the traffic split is fairly clean down to the applayer (in that it doesn't involve cross-cpu work to process these separate flows, and they scale well if you're using a bunch of separate REUSEPORT sockets, etc
[11:59:53] <bblack>	 but if iptables rules can hypothetically match the traffic (and I don't think putting a card wholly inside a namespace necessarily stops that), iptables may have some efficiency-destroying things going on where some table or counter or lock causes the flow of processing those traffic streams to touch each other again for some kind of accounting or whatever, that's the paranoid fear.
[12:00:33] <paravoid>	 they can't, netfilter is separate per namespace
[12:00:41] <bblack>	 ok
[12:00:42] <paravoid>	 but yes, we've never tested it
[12:01:08] <paravoid>	 so I don't disagree that there might be gotchas or stuff we haven't thought about :)
[12:01:11] <bblack>	 in theory all of this can be designed well.  You can implement stateless iptables to scale perfectly with RSS (even without using namespaces).
[12:01:27] <bblack>	 I think it's just that I'm pessimistic on whether that's present reality of the kernel's code :)
[12:06:18] <bblack>	 I doubt very many people have tested it before either (e.g. developers).  In the LVS case, this specific notion of splitting namespaces, assigning cards separately, and putting iptables on one side of the namespace split while relying on RSS efficiency in the other.
[12:06:49] <bblack>	 it should work, you'd think the design of all related things would work well for that by default, but that's being optimistic without testing :)
[12:07:23] <bblack>	 the namespace-splitting has other benefits regardless of iptables though.  it's something we should pursue regardless, at some priority level.
[12:18:31] <ema>	 uh cp3037 came back \o/
[12:19:35] <volans>	 just noticed, was about to say the same :)
[12:21:35] <bblack>	 mgmt interface came back too?
[12:21:52] <ema>	 yup
[12:21:57] <bblack>	 nice
[12:22:40] <bblack>	 so this will at least get us back to a 7+11 = 18 scenario combined with the node move yesterday, assuming it stays up
[12:22:58] <bblack>	 vs 8+9 = 17 where we started
[12:27:06] <bblack>	 don't forget that ocsp will be outdated too, needs a manual run of the update-ocsp-all that's in root's crontab
[12:30:06] <ema>	 done
[12:31:54] <volans>	 cannot be run at reboot too like puppet's cron?
[12:32:06] <volans>	 (probably stupid question ;) )
[12:32:41] <vgutierrez>	 the idea was to move the update-ocsp-all from crontab to a systemd (timer) unit and have it also as a dependency for nginx...
[12:32:58] <vgutierrez>	 and solve the issue with that
[12:34:34] <bblack>	 yeah it would be a good idea probably, as much as I recoil at anything that boils down to "use systemd even more" :)
[12:35:08] <bblack>	 I'd say ditto for zerofetcher really, but then again that's going away in the long term anyways
[12:36:28] <bblack>	 speaking of zero, latest news is tenatively July 1st for removing the zero+eqsin exceptions, pending confirmation again from the zero team on/after that date before we proceed.
[12:37:11] <bblack>	 tentatively? it's such a strange word
[12:37:45] <ema>	 :)
[12:38:13] <ema>	 tentativo=attempt in Italian, sounds fine to me!
[12:38:23] <bblack>	 yeah it comes from latin
[12:38:25] <vgutierrez>	 tentativa=attempt in Spanish
[12:38:29] <vgutierrez>	 +1
[12:38:42] <bblack>	 I looked at etymology for rambling too, apparently Dutch, so that explains its novelty to you :)
[12:39:08] <ema>	 ha
[12:39:58] <vgutierrez>	 bblack: so.. can we move forward the AES128-SHA thingie today? to avoid deploying stuff on Friday and also meet the Aug 1st date (looks good to my OCD)
[12:40:50] <bblack>	 I think so, but I haven't looked again today, let me go stare a bit
[12:40:58] <vgutierrez>	 sure :)
[12:47:10] <bblack>	 vgutierrez: yeah I think it's ok on general methodolgy, 3x minor patchups to look at.
[12:47:19] <vgutierrez>	 thx!
[12:59:16] <paravoid>	 vgutierrez: want to talk about librdkafka?
[12:59:30] <paravoid>	 you had some question yesterday I think
[12:59:33] <vgutierrez>	 right
[12:59:49] <paravoid>	 not sure if now is a good time, can certainly talk about it later/tomorrow/next week :)
[13:00:49] <vgutierrez>	 basically we need to figure out what's the best way to have in our environment (cp* instances) the features implemented in this PR (already merged) https://github.com/edenhill/librdkafka/pull/1809
[13:01:40] <paravoid>	 ah!
[13:01:49] <paravoid>	 great to see that merged :)
[13:02:17] <paravoid>	 lemme ask magnus when he's planning for the next release -- do you have a specific timeline (or deadline) for this?
[13:03:00] <vgutierrez>	 I think that elukey can answer that question better than me
[13:03:09] <vgutierrez>	 but I guess this quarter :)
[13:03:15] <bblack>	 paravoid: EOQ was our target, to have this stuff in place to resolve "we think the kafka conns are secure enough we can dump ipsec->kafka-broker"
[13:03:36] <paravoid>	 so... in the next two weeks?
[13:03:38] <bblack>	 assuming no other blockers expected to go past the deadline anyways (I don't think so)
[13:03:46] <vgutierrez>	 paravoid: right O:)
[13:04:06] <paravoid>	 ok, I doubt we'll see a new upstream in time for this
[13:04:24] <bblack>	 but we could do a one-off +wmf packaging with it until upstream refreshes
[13:04:29] <paravoid>	 so then the next step is to just patch the upstream Debian package with this, release a +wikimedia1, and put it in apt.wikimedia.org
[13:04:41] <bblack>	 ^ that
[13:04:43] <paravoid>	 yeah
[13:04:52] <wikibugs>	 10Traffic, 10Operations, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4253860 (10Verdy_p) Isn"t there a way for the wiki server to autodetect those browsers that are still using the legacy TLS implementation and add som...
[13:05:04] <mark>	 so cp3037 has been powercycled, is it back yet?
[13:05:21] <mark>	 ah i see, yes
[13:05:48] <paravoid>	 is that something that one of you guys can handle?
[13:05:58] <paravoid>	 should be fairly simple I think
[13:06:00] <bblack>	 I think so
[13:06:33] <bblack>	 vgutierrez: I'll answer the rambling interjection on https://phabricator.wikimedia.org/T196371#4282728 :)
[13:06:51] <paravoid>	 let me know if/how I can help regardless
[13:07:09] <paravoid>	 and I already pinged magnus on IRC, will let you know when I have news
[13:07:15] <vgutierrez>	 bblack: <3 thanks
[13:07:25] <vgutierrez>	 paravoid: thanks you too
[13:07:54] <vgutierrez>	 I'm headed to physiotherapy right now.. see you bblack (and folks) at 17:00 meeting
[13:08:14] <paravoid>	 feel better
[13:08:34] <ema>	 good luck, see you then
[13:09:28] <mark>	 btw, back to the earlier discussion:
[13:09:30] <mark>	 14:07:23 <bblack> the namespace-splitting has other benefits regardless of iptables though.  it's something we should pursue regardless, at some priority level.
[13:09:47] <mark>	 i hope to work on that at some point after taking care of pybal tech debt
[13:10:21] <mark>	 there have been ideas to split up pybal functions a bit for other reasons and it would help here too
[13:10:40] <mark>	 i proposed a pybal plans discussion session at the offsite, let's discuss there
[13:12:30] <bblack>	 +1
[13:13:13] <wikibugs>	 10Traffic, 10Operations, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282738 (10Verdy_p) Note that because of ULS, using "Wikipedia" instead of "Wikimedia" is still accurate: the secure logon will be made on other wiki...
[13:41:25] <wikibugs>	 10Traffic, 10Operations, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282813 (10Aklapper) @Verdy_p: IMO your question is already answered in the task description. Apart from that it seems unclear which actual problems...
[13:51:10] <wikibugs>	 10Traffic, 10Operations, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282826 (10BBlack) >>! In T196371#4282728, @Verdy_p wrote: > Isn"t there a way for the wiki server to autodetect those browsers that are still using...
[13:51:23] <wikibugs>	 10Traffic, 10Operations, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282827 (10Verdy_p) Not able to even read the wiki in an enforced incognito mode (removing all private session keys, disabling some scripts, just ren...
[14:53:32] <wikibugs>	 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4282963 (10chasemp) We met today to sync up on moving the remaining lab* servers.  Hopefully these days/times al work for @cmjohnson (I added him to the calend...
[15:00:05] <wikibugs>	 10netops, 10Operations, 10fundraising-tech-ops: adjust NAT mapping for frdata.wikimedia.org - https://phabricator.wikimedia.org/T196656#4282970 (10ayounsi) NAT change pushed.
[15:03:03] <bblack>	 I'm getting there!
[15:03:44] <wikibugs>	 10Traffic, 10Operations, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282978 (10BBlack) We've got some overlapping timelines on these long-form posts :)  I assume most of the most-recent one above is in the context of...
[15:04:37] <wikibugs>	 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4282979 (10jcrespo) To clarify- databases that are depooled do not need to stop replication- if replication goes down it tries to infinitely retry connecting w...
[16:07:24] <wikibugs>	 10netops, 10Operations, 10fundraising-tech-ops: adjust NAT mapping for frdata.wikimedia.org - https://phabricator.wikimedia.org/T196656#4283202 (10cwdent) 05Open>03Resolved Thanks @ayounsi   Site looks good, and slander is working again after kicking syslog a few places
[16:10:39] <vgutierrez>	 ema: hmm are tests actually being executed by our CI?
[16:10:40] <vgutierrez>	 I mean the vtcs
[16:11:19] <ema>	 vgutierrez: nope. It's complicated :)
[16:11:23] <vgutierrez>	 oh ok
[16:11:36] <vgutierrez>	 what am I doing wrong?
[16:11:50] <ema>	 https://phabricator.wikimedia.org/T128188
[16:13:02] <ema>	 vgutierrez: is the vtc test not working?
[16:14:49] <ema>	 vgutierrez: tested on traffic-text-varnish5.traffic.eqiad.wmflabs, green lights
[16:15:25] <vgutierrez>	 nice :)
[16:15:45] <vgutierrez>	 so... let's stop puppet on cp*, merge it and test on cp1008?
[16:15:58] * vgutierrez doesn't want to break wikipedia
[16:16:08] <ema>	 yes, sounds like a plan
[16:16:16] <vgutierrez>	 nice
[16:16:30] <ema>	 well actually
[16:16:46] <ema>	 the test does not check whether decent X-Connection-Properties work fine :)
[16:17:11] <vgutierrez>	 decent TLS parameters you mean?
[16:17:50] <vgutierrez>	 let's add one test case with something like H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES128-SHA; EC=prime256v1;
[16:17:57] <ema>	 yeah like -hdr "X-Connection-Properties: H2=1; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-CHACHA20-POLY1305; EC=X25519;" should give resp.status == 20
[16:18:00] <ema>	 *200
[16:18:10] <vgutierrez>	 right
[16:20:22] <wikibugs>	 10Traffic, 10Operations, 10ops-esams: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (10ema) The host  and its management interface are back online.  It seems like we're looking at a thermal issue, here are kernel logs at the time of the crash:  ``` Jun 12 06:26:51 cp3037...
[16:21:11] <vgutierrez>	 ema: hmmm easy way to re-seed the random generator?
[16:21:21] <vgutierrez>	 or just setup a v2 with the same seed?
[16:26:24] <ema>	 vgutierrez: mmh, probably setting up a v2 and point txreq to it would work 
[16:27:02] <ema>	 we can also keep it simple and add another vtc file though
[16:27:30] <vgutierrez>	 hmmm it's easy
[16:27:45] <vgutierrez>	 varnish v1 -cliok "debug.srandom 55" before running client c2 should do the trick
[16:28:05] <ema>	 sweet
[16:30:55] <vgutierrez>	 I also added a second request in c1 to ensure that it returns a 200 :)
[16:32:19] <ema>	 let's see
[16:34:03] <ema>	 vgutierrez: you need another rxreq; txresp in s1 
[16:34:49] <ema>	 (the test results in two requests to the backend server s1)
[16:35:13] <vgutierrez>	 oh right
[16:35:31] <ema>	 other than that, +1
[16:35:42] <vgutierrez>	 fixed
[16:36:02] <paravoid>	 vgutierrez: Magnus said he's planning a release for end of June
[16:47:25] <vgutierrez>	 good thing we decided to try in cp1008 first...
[16:47:39] <vgutierrez>	 Error: invalid byte sequence in US-ASCII
[16:47:40] <vgutierrez>	 Error: /Stage[main]/Varnish::Common::Browsersec/File[/etc/varnish/browsersec.inc.vcl]/content: change from {md5}a1502995036da18482d4ff50bb656132 to {md5}194b244641aff1742455c86b3a62feb0 failed: invalid byte sequence in US-ASCII
[16:48:21] <ema>	 interesting, I didn't get that on my self-hosted puppetmaster
[16:48:30] <vgutierrez>	 I wonder how the first one worked though
[16:48:41] <vgutierrez>	 I mean.. the current version of sec-warning has a lot of stuff outside US-ASCII
[16:53:56] <wikibugs>	 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4283368 (10Cmjohnson) The dates work for me..I accepted the calendar invites
[16:55:37] <bblack>	 heh
[16:56:01] <bblack>	 is that puppet complaining, about the contents of a deployed file being US-ASCII? shouldn't it not care about encodings at this level? :P
[16:56:53] <vgutierrez>	 /etc/varnish/browsersec.inc.vcl: HTML document, UTF-8 Unicode text, with very long lines
[16:56:58] <vgutierrez>	 current version it's already outside US-ASCII
[16:57:23] <ema>	 maybe something changed puppet-wise since the last modification to the file?
[16:57:40] <bblack>	 yeah we know it's supposed to be utf-8, it has to be to support all the languages.
[16:58:01] <bblack>	 the question is why should puppet even care about the encoding of a file it's deploying? it could be binary for it cares, I'd think.
[16:58:42] <bblack>	 I guess because of templating? (it has to parse template includes for more template directives?)
[16:59:51] <bblack>	 oh look it's fixed, so don't worry nothing to see here :) https://tickets.puppetlabs.com/browse/PUP-1031
[17:03:43] <vgutierrez>	 bblack: last time you didn't have any issues, right?
[17:03:50] <_joe_>	  
[17:06:36] <vgutierrez>	 ema: btw... maybe it didn't fail in your server cause you're running a utf-8 locale?
[17:09:49] <vgutierrez>	 https://phabricator.wikimedia.org/T93614
[17:10:48] <vgutierrez>	 hmm apparently setting show_diff => false for that file should be enough /o\
[17:14:31] <ema>	 vgutierrez: same locale on my test environment and on pinkunicorn/puppetmaster1001
[17:14:38] <ema>	 en_US.UTF-8
[17:15:12] <vgutierrez>	 hmm puppet ran by cron applied the change...
[17:15:43] <ema>	 fascinating
[17:15:51] <vgutierrez>	 cause it doesn't need to show the diff
[17:15:58] <vgutierrez>	 I guess
[17:17:09] <vgutierrez>	 the diff it's actually there on https://puppetboard.wikimedia.org/report/cp1008.wikimedia.org/0dc71f76808ff372e9a7d506beadb5153efeb40b
[17:20:59] <vgutierrez>	 small html issue detected :(
[17:24:23] <vgutierrez>	 ema: I think i'm going to disable explicitly show_diff for that file
[17:24:34] <ema>	 +1
[17:34:01] <ema>	 while true; do curl -s -v --ciphers AES128-SHA -H "Host: en.wikipedia.org" https://pinkunicorn.wikimedia.org/wiki/Blah 2>&1|grep "HTTP/2 [0-9]"; done
[17:34:06] <ema>	 I did get a 418 eventually :)
[17:34:13] <vgutierrez>	 great :)
[17:34:18] <vgutierrez>	 merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440375/
[17:34:57] <ema>	 o
[17:34:59] <ema>	 ok
[17:37:23] <vgutierrez>	 yey.. show_diff made the trick
[17:37:24] <vgutierrez>	 Info: /Stage[main]/Varnish::Common::Browsersec/File[/etc/varnish/browsersec.inc.vcl]: Filebucketed /etc/varnish/browsersec.inc.vcl to puppet with sum 194b244641aff1742455c86b3a62feb0
[17:37:29] <vgutierrez>	 Notice: /Stage[main]/Varnish::Common::Browsersec/File[/etc/varnish/browsersec.inc.vcl]/content: content changed '{md5}194b244641aff1742455c86b3a62feb0' to '{md5}d3e633e2418b25f94ac88a00c9617317'
[17:37:49] <vgutierrez>	 I guess bblack is more busy than me and just let cron run puppet
[17:38:37] <vgutierrez>	 fixed little hindi issue as well :)
[17:38:48] <vgutierrez>	 ema: let's enable puppet?
[17:39:30] <ema>	 vgutierrez: hindi issue?
[17:39:32] <bblack>	 heh
[17:39:50] <bblack>	 so puppet's application of a file change can fail solely for the reason that it can't solve encoding issues when trying to show the diff? :P
[17:39:57] <vgutierrez>	 yup
[17:40:05] <bblack>	 awesome!
[17:40:08] <vgutierrez>	 ema: missing </strong>
[17:40:19] <vgutierrez>	 html things(TM)
[17:40:30] <bblack>	 should poke amir to look at /sec-warning too (as output in a browser)
[17:40:32] <bblack>	 just in case
[17:41:05] <bblack>	 https://pinkunicorn.wikimedia.org/sec-warning
[17:42:25] <ema>	 there's a minor issue in the Italian translation: "Per favore, aggiorna il tuo dispositivo o contatti il suo amministratore informatico."
[17:42:41] <ema>	 "aggiorna" is informal, "contatti" is formal :)
[17:42:50] <bblack>	 fix it on the wiki, be part of the process :)
[17:43:07] <ema>	 sure thing!
[17:47:33] <ema>	 https://meta.wikimedia.org/wiki/User:Johan_(WMF)/AES128-SHA/it <- done
[17:48:06] <vgutierrez>	 writing the patch...
[17:48:21] <vgutierrez>	 in a taxi.. they always look at me in very funny ways when they see my coding
[17:48:27] <vgutierrez>	 s/my/me
[17:52:00] <vgutierrez>	 ema: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440380/1/modules/varnish/templates/browsersec.body.html.erb
[17:52:03] <vgutierrez>	 looking good?
[17:53:08] <ema>	 vgutierrez: +1
[18:01:23] <vgutierrez>	 bblack: did you ping aharoni?
[18:01:50] <vgutierrez>	 not on IRC at least, he's offline :)
[18:02:30] <ema>	 I've gotta go now, but after merging you can keep an eye on 418s here:
[18:02:33] <ema>	 https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?panelId=2&fullscreen&orgId=1&var-site=eqiad&var-site=eqsin&var-site=codfw&var-site=esams&var-site=ulsfo&var-cache_type=varnish-misc&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=4&from=now-1h&to=now
[18:02:52] <ema>	 see you tomorrow o/
[18:03:17] <vgutierrez>	 see you!
[18:03:36] <vgutierrez>	 it's merged.. we just need to reenable puppet across the caching nodes
[18:58:44] <vgutierrez>	 bblack: so.. seeing that cp1008 is behaving as expected... let's continue with cache::misc before hitting the others?
[20:04:14] <bblack>	 vgutierrez: sorry I wasn't following closely, I kinda figured you already had, and it's late there :)
[20:04:22] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#4283942 (10Krinkle) 05Open>03Resolved a:03Krinkle Obsolete per <https://gerrit.wikimedia.org/r/437774>. The cookie has now been removed entirely :)
[20:05:32] <bblack>	 looking at the recently-merged patches, I take it the state is that the patch is live for all clusters, just puppet-disable preventing the rollout
[20:07:49] <bblack>	 and looking at the puppet states, already re-enabled on cache_misc, just text and upload remain disabled
[20:10:02] <bblack>	 I don't see any 418 states from cache_misc, but low-traffic + better UAs there, probably unlikely anyways
[20:10:36] <bblack>	 will enable on text a site at a time until i can confirm at least a little 418 stats flowing
[20:43:18] <vgutierrez>	 right
[20:43:33] <vgutierrez>	 I was checking IRC on the restaurant.. my gf is gonna kill me xDD
[20:43:57] <bblack>	 yeah don't worry about it, I'll finish pushing it around
[20:44:06] <bblack>	 got some 418s coming in finally
[20:44:19] <vgutierrez>	 cool
[20:44:56] <vgutierrez>	 we are talking about 1% of 0.082% of our traffic... actually less right now
[22:08:55] <bblack>	 the 418 stats with it enabled globally make some kind of rough sense, so I think it's all ok
[22:09:57] <bblack>	 we're getting a rate somewhere in the ballpark of ~0.06/s out of ~100K/sec, which would be 00.00006%
[22:10:22] <bblack>	 multiply it by 100 for the 1% chances at 418s, and we get 00.006%
[22:11:06] <bblack>	 which would be the right figure if roughly 7.3% of requests in general match the other conditions (/wiki, no colon)
[22:11:40] <bblack>	 which is probably about right given how many non-/wiki reqs there are in a normal pageview for /wiki, and API hits, and cache_upload image fetches, etc...
[22:12:05] <bblack>	 close enough anyways :)