[08:25:33] <wikibugs>	 Traffic, Operations, Performance-Team, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2177093 (ema) Mostly out of curiosity, I've checked which protocols are supported by other top-10 websites by looking at NPN responses:   | google.com / youtube.com | h2, spdy/3.1, htt...
[08:56:55] <wikibugs>	 Wikimedia-Apache-configuration, Operations, Puppet: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2177147 (Joe)
[10:48:22] <wikibugs>	 Traffic, Varnish, Operations, Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177315 (elukey)
[10:48:24] <wikibugs>	 Traffic, Analytics-Kanban, Operations, Patch-For-Review: varnishkafka integration with Varnish 4  for analytics - https://phabricator.wikimedia.org/T124278#2177313 (elukey) Open>Resolved Code merged by ema, plus the varnish maps cluster has been running with vk for days without triggering any...
[10:48:53] <elukey>	 ---^ ema
[10:48:56] * elukey dances
[12:05:53] <wikibugs>	 Wikimedia-Apache-configuration, Operations, Patch-For-Review, Puppet: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2177350 (Joe) p:Triage>Normal a:Joe
[12:53:06] <wikibugs>	 Traffic, Operations, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2177434 (BBlack) Note this should get resolved via T130414 's https://gerrit.wikimedia.org/r/#/c/278353
[12:58:56] <ema>	 elukey: yes!
[13:00:08] <wikibugs>	 Traffic, Varnish, Operations, Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177437 (ema) Open>Resolved
[13:00:13] <ema>	 \o/
[13:07:43] <elukey>	 http://www.potaroo.net/ispcol/2016-03/zombies.html
[13:15:02] <bblack>	 awesome work :)
[13:15:29] <bblack>	 re: the varnishxcps/varnishrls scripts, I thought those ran everywhere, although at maps traffic levels they're not an important contributor to totals I think
[13:16:45] <bblack>	 rls in particular isn't useful on other clusters than text, but puppet seems to be set up to put it everywhere, if I'm reading it correctly.
[13:17:02] <bblack>	 xcps should be everywhere and is, I think, but not running in maps presently
[13:17:23] <ema>	 bblack: right, I've copy-pasted ori's table here https://phabricator.wikimedia.org/T131353
[13:17:41] <ema>	 so varnishxcps/varnishrls are only needed for text
[13:18:08] <bblack>	 well
[13:18:19] <ema>	 (according to ori) :)
[13:18:25] <bblack>	 rls is only functionally-useful on text, but I think puppet currently configures it everywhere pointlessly
[13:18:33] <ema>	 oh I see
[13:18:34] <bblack>	 and xcps is also configured everywhere, probably not-pointlessly
[13:18:59] <bblack>	 (but IMHO it's not urgent that maps doesn't have it running, as maps is a very small fraction of the total xcps stats)
[13:20:05] <bblack>	 right now r::c::2layer includes r::c::statsd::frontend, which includes both ::rls and ::xcps
[13:20:13] <bblack>	 and all clusters use r::c::2layer
[13:20:34] <bblack>	 we should probably move rls to text-only and stop it on the other hosts though, as only text has load.php at all
[13:21:34] <bblack>	 on maps varnishxcps service is configured, but the service is dead with:
[13:21:37] <bblack>	 Apr 04 13:20:26 cp1043 varnishxcps[20044]: Exception: Unknown Tag: RxHeader
[13:21:48] <ema>	 right, because it hasn't been ported
[13:21:52] <bblack>	 right
[13:22:02] <bblack>	 kinda highlights that we lack monitoring that those daemons are up
[13:22:16] <ema>	 indeed
[13:25:53] <ema>	 bblack: this morning I've restarted nginx on cp1046, cp1052, cp1068, cp1071, cp1099 (openssl upgrade)
[13:26:08] <ema>	 they seem to run fine, can I proceed with the other hosts?
[13:26:27] <ema>	 CC: moritzm ^
[13:26:39] <bblack>	 is this for the upgrade that's been outstanding for a while now, or a new one?
[13:26:56] <ema>	 the former I think
[13:27:10] <bblack>	 I'd hold a bit
[13:27:39] <ema>	 Update to 1.0.2g (Tue, 01 Mar)
[13:27:44] <bblack>	 basically it doesn't really affect us in practice (the bugs fixed), and I just pushed the sysd sec patch for nginx which also needs a full restart on nginx
[13:28:01] <bblack>	 so we can get both if we wait for that to get pushed to the fs on them all first
[13:28:07] <ema>	 perfect
[13:28:59] <bblack>	 usually since nginx restarts in a nice way, we don't depool for a simple nginx restart either, but I usually try to space them out a little instead of all at once
[13:29:47] <bblack>	 probably want to validate the sysd sec thing on a live host first too (whichever we restart first after this puppet run is done)
[13:30:02] <bblack>	 I'll go try cp1065
[13:32:27] <bblack>	 yeah looks fine on cp1065
[13:33:15] <bblack>	 ema: the puppet runs finished now too, so it's good to go.  maybe batch it off with -b 1 and a small sleep after each restart command, like 10-15 secs?
[13:34:27] <ema>	 bblack: I like while loops and I've prepared one with 15 secs sleeps. Would that be OK?
[13:34:36] <bblack>	 ema: I'll look at moving ::rls off the other clusters for now, seems like an easy morning task before coffee #3
[13:34:40] <bblack>	 ema: works for me
[13:34:58] <ema>	 bblack: cool. I'm finishing writing a phab task for the missing monitoring stuff
[13:39:04] <wikibugs>	 Traffic, Varnish, Operations: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177485 (ema)
[13:43:26] <ema>	 rolling restart initiated 
[13:46:13] <wikibugs>	 Traffic, Varnish, Operations: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177507 (ema) p:Triage>Normal
[13:47:00] <moritzm>	 cp1066 has a "Return code of 111 is out of bounds" Icinga alert for its HTTPS check?
[13:47:28] <ema>	 moritzm: it's gone, it probably checked while nginx was restarting I guess
[13:47:53] <moritzm>	 yeah, now it's on other hosts, so seems in fact a side effect of the rolling restart
[13:50:48] <elukey>	 mmmm really weird it shouldn't happen with nginx fork/restart behavior no? 
[13:53:44] <bblack>	 I wouldn't think
[13:54:29] <bblack>	 maybe our current "service nginx restart" behavior doesn't use the smooth restart method either for some initscript/packaging/code reason
[13:55:07] <ema>	 I've tried a service nginx restart on my workstation while issuing https requests to localhost and a bunch of them got refused
[13:55:16] <bblack>	 yeah ok
[13:55:53] <bblack>	 in any case, the effect should be minimal in practice even with that, given pybal and spacing, etc
[13:56:03] <bblack>	 but a few reqs will get retried, yeah
[13:56:11] <bblack>	 (by the client, I mean)
[13:56:23] <bblack>	 but we should fix that, nginx knows how to restart better :/
[13:56:29] <ema>	 yep
[13:56:45] <ema>	 eqiad done, in the meantime. codfw started a few seconds ago
[13:56:49] <bblack>	 I know it used to at some previous point, but it's probably been a while
[13:57:05] <bblack>	 I'm pretty sure debian package upgrade restart for nginx does it smoothly too, but maybe it doesn't use "service nginx restart"
[13:57:49] <bblack>	 this probably all comes down to systemd (systemd makes it hard for "service foo restart" to ever be smooth without a bunch of work on the daemon author's part)
[14:00:01] <bblack>	 https://phabricator.wikimedia.org/P2852
[14:00:20] <bblack>	 ^ that's the debian postinst script.  it doesn't use "restart", it sends SIGUSR2 and waits around for nginx to finish up, etc...
[14:01:07] <bblack>	 we could copy something like that (maybe tweak a bit re: timeouts and such) and puppetize it as e.g. /usr/local/bin/nginx-safe-restart
[14:01:34] <ema>	 +1
[14:04:40] <bblack>	 hmmm the shipped /etc/init.d/nginx has a better version as function upgrade_nginx
[14:04:54] <bblack>	 e.g. /etc/init.d/nginx upgrade
[14:05:37] <bblack>	 I wonder if that works even though systemd is in use?
[14:06:12] <bblack>	 yup
[14:06:56] <bblack>	 so we don't even need a new script, we just need to document/remember how to hack around this
[14:07:49] <bblack>	 systemd controls nginx, but "/etc/init.d/nginx upgrade" does a smooth restart of the systemd-controlled nginx
[14:09:35] <ema>	 alright, so next time sudo service nginx upgrade instead
[14:18:13] <bblack>	 I don't know, does that work? I would think "service" would always hit systemctl since it's a systemd service
[14:18:33] <ema>	 nope, it defaults to sysvinit for non-standard actions
[14:18:41] <bblack>	 oh, nice
[14:20:11] <ema>	 yeah although there seems to be a little ubuntu-leftover
[14:20:12] <ema>	 $ sudo service --version
[14:20:12] <ema>	 service ver. 0.91-ubuntu1
[14:20:20] <ema>	 (on debian sid)
[14:21:13] <ema>	 rolling restart completed
[14:23:03] <paravoid>	 hey folks
[14:23:05] <paravoid>	 Ruben mailed me again
[14:25:42] <bblack>	 about the EU varnish conf?
[14:31:17] <ema>	 bblack: are you coming? :)
[14:35:13] <wikibugs>	 Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177610 (BBlack)
[14:35:35] <bblack>	 I doubt it!
[14:36:15] <bblack>	 if it was a different week and/or in the US, maybe
[14:36:27] <elukey>	 ah london and copenhagen! nice!
[14:47:10] <ema>	 elukey: uh? I thought it was in amsterdam
[14:47:55] <elukey>	 ema: I saw http://info.varnish-software.com/summits-2016?gclid=CLD34fec9csCFdS7GwodzA4Fgg
[15:00:43] <wikibugs>	 Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177744 (BBlack) On point 3 above (when does varnish send TE:chunked?), my best observations/code-searching indicate:  1. Obviously a do_stream of a chunked fetch is a chun...
[15:08:48] <wikibugs>	 Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177767 (BBlack) And on point 2 above: since varnish seems to be smart about using TE:chunked only when the response length isn't easy to know, there's not much wiggle room...
[15:09:02] <wikibugs>	 Traffic, Varnish, Operations: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2168861 (ema) a:ema
[15:11:38] <bblack>	 ema: in case you're working ahead :) keep in mind the misc_fetch_large_object crap is likely to change dramatically soon, no point working on porting that at this point
[15:15:21] <wikibugs>	 netops, Operations, ops-eqiad: investigate why mr1-eqiad randomly rebooted - https://phabricator.wikimedia.org/T131379#2177774 (faidon) Open>declined I re-rebooted it from the console, as it wasn't able to read the SSH keys (!? the CF is maybe broken?) and hence sshd was unable to start. It works...
[15:17:31] <ema>	 bblack: alright! I wanted to start working on the easy stuff first (director names collisions with probes, @req_method, ...)
[15:24:11] <wikibugs>	 Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177838 (BBlack) Going into more detail on the current behaviors of misc and upload clusters:  **cache_misc** - regardless of tier/layer, it sets do_stream for objects >= 1...
[15:38:23] <wikibugs>	 Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177962 (BBlack) Also, confirmed that do_stream of a non-chunked fetch doesn't cause chunked response on cache_upload.
[15:41:53] <wikibugs>	 Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177973 (BBlack) Another input here: in the common case, it seems MediaWiki outputs content with TE:chunked, too.
[16:16:24] <wikibugs>	 netops, Continuous-Integration-Infrastructure, Operations, Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178070 (mmodell) Thanks @dzahn for setting this up so quickly. I tested that and I wa...
[16:16:55] <wikibugs>	 netops, Continuous-Integration-Infrastructure, Operations, Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178072 (mmodell)
[16:27:43] <wikibugs>	 Traffic, Varnish, Operations: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2178082 (ema)
[19:28:17] <wikibugs>	 netops, Monitoring, Operations, Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178759 (faidon)
[19:31:36] <wikibugs>	 netops, Continuous-Integration-Infrastructure, Operations, Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178776 (chasemp) Afaik the 'talk to phabricator' portion here is relevant for git-ssh...
[20:25:06] <wikibugs>	 netops, Monitoring, Operations, Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178874 (faidon)
[20:52:30] <wikibugs>	 HTTPS, Traffic, Operations, Wikimedia-Shop: shop switches HTTPS -> HTTP when showing login prompt (on clicking checkout) - https://phabricator.wikimedia.org/T63528#2179008 (GHoltman) Open>Resolved a:GHoltman  Resolved per HuiZSF