[08:25:33] Traffic, Operations, Performance-Team, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2177093 (ema) Mostly out of curiosity, I've checked which protocols are supported by other top-10 websites by looking at NPN responses: | google.com / youtube.com | h2, spdy/3.1, htt...
[08:56:55] Wikimedia-Apache-configuration, Operations, Puppet: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2177147 (Joe)
[10:48:22] Traffic, Varnish, Operations, Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177315 (elukey)
[10:48:24] Traffic, Analytics-Kanban, Operations, Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2177313 (elukey) Open>Resolved Code merged by ema, plus the varnish maps cluster has been running with vk for days without triggering any...
[10:48:53] ---^ ema
[10:48:56] * elukey dances
[12:05:53] Wikimedia-Apache-configuration, Operations, Patch-For-Review, Puppet: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2177350 (Joe) p:Triage>Normal a:Joe
[12:53:06] Traffic, Operations, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2177434 (BBlack) Note this should get resolved via T130414 's https://gerrit.wikimedia.org/r/#/c/278353
[12:58:56] elukey: yes!
[13:00:08] Traffic, Varnish, Operations, Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177437 (ema) Open>Resolved
[13:00:13] \o/
[13:07:43] http://www.potaroo.net/ispcol/2016-03/zombies.html
[13:15:02] awesome work :)
[13:15:29] re: the varnishxcps/varnishrls scripts, I thought those ran everywhere, although at maps traffic levels they're not an important contributor to totals I think
[13:16:45] rls in particular isn't useful on other clusters than text, but puppet seems to be set up to put it everywhere, if I'm reading it correctly.
[13:17:02] xcps should be everywhere and is, I think, but not running in maps presently
[13:17:23] bblack: right, I've copy-pasted ori's table here https://phabricator.wikimedia.org/T131353
[13:17:41] so varnishxcps/varnishrls are only needed for text
[13:18:08] well
[13:18:19] (according to ori) :)
[13:18:25] rls is only functionally-useful on text, but I think puppet currently configures it everywhere pointlessly
[13:18:33] oh I see
[13:18:34] and xcps is also configured everywhere, probably not-pointlessly
[13:18:59] (but IMHO it's not urgent that maps doesn't have it running, as maps is a very small fraction of the total xcps stats)
[13:20:05] right now r::c::2layer includes r::c::statsd::frontend, which includes both ::rls and ::xcps
[13:20:13] and all clusters use r::c::2layer
[13:20:34] we should probably move rls to text-only and stop it on the other hosts though, as only text has load.php at all
[13:21:34] on maps varnishxcps service is configured, but the service is dead with:
[13:21:37] Apr 04 13:20:26 cp1043 varnishxcps[20044]: Exception: Unknown Tag: RxHeader
[13:21:48] right, because it hasn't been ported
[13:21:52] right
[13:22:02] kinda highlights that we lack monitoring that those daemons are up
[13:22:16] indeed
[13:25:53] bblack: this morning I've restarted nginx on cp1046, cp1052, cp1068, cp1071, cp1099 (openssl upgrade)
[13:26:08] they seem to run fine, can I proceed with the other hosts?
[13:26:27] CC: moritzm ^
[13:26:39] is this for the upgrade that's been outstanding for a while now, or a new one?
[13:26:56] the former I think
[13:27:10] I'd hold a bit
[13:27:39] Update to 1.0.2g (Tue, 01 Mar)
[13:27:44] basically it doesn't really affect us in practice (the bugs fixed), and I just pushed the sysd sec patch for nginx which also needs a full restart on nginx
[13:28:01] so we can get both if we wait for that to get pushed to the fs on them all first
[13:28:07] perfect
[13:28:59] usually since nginx restarts in a nice way, we don't depool for a simple nginx restart either, but I usually try to space them out a little instead of all at once
[13:29:47] probably want to validate the sysd sec thing on a live host first too (whichever we restart first after this puppet run is done)
[13:30:02] I'll go try cp1065
[13:32:27] yeah looks fine on cp1065
[13:33:15] ema: the puppet runs finished now too, so it's good to go. maybe batch it off with -b 1 and a small sleep after each restart command, like 10-15 secs?
[13:34:27] bblack: I like while loops and I've prepared one with 15 secs sleeps. Would that be OK?
[13:34:36] ema: I'll look at moving ::rls off the other clusters for now, seems like an easy morning task before coffee #3
[13:34:40] ema: works for me
[13:34:58] bblack: cool. I'm finishing writing a phab task for the missing monitoring stuff
[13:39:04] Traffic, Varnish, Operations: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177485 (ema)
[13:43:26] rolling restart initiated
[13:46:13] Traffic, Varnish, Operations: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177507 (ema) p:Triage>Normal
[13:47:00] cp1066 has a "Return code of 111 is out of bounds" Icinga alert for its HTTPS check?
[13:47:28] moritzm: it's gone, it probably checked while nginx was restarting I guess
[13:47:53] yeah, now it's on other hosts, so seems in fact a side effect of the rolling restart
[13:50:48] mmmm really weird it shouldn't happen with nginx fork/restart behavior no?
[13:53:44] I wouldn't think
[13:54:29] maybe our current "service nginx restart" behavior doesn't use the smooth restart method either for some initscript/packaging/code reason
[13:55:07] I've tried a service nginx restart on my workstation while issuing https requests to localhost and a bunch of them got refused
[13:55:16] yeah ok
[13:55:53] in any case, the effect should be minimal in practice even with that, given pybal and spacing, etc
[13:56:03] but a few reqs will get retried, yeah
[13:56:11] (by the client, I mean)
[13:56:23] but we should fix that, nginx knows how to restart better :/
[13:56:29] yep
[13:56:45] eqiad done, in the meantime. codfw started a few seconds ago
[13:56:49] I know it used to at some previous point, but it's probably been a while
[13:57:05] I'm pretty sure debian package upgrade restart for nginx does it smoothly too, but maybe it doesn't use "service nginx restart"
[13:57:49] this probably all comes down to systemd (systemd makes it hard for "service foo restart" to ever be smooth without a bunch of work on the daemon author's part)
[14:00:01] https://phabricator.wikimedia.org/P2852
[14:00:20] ^ that's the debian postinst script. it doesn't use "restart", it sends SIGUSR2 and waits around for nginx to finish up, etc...
[14:01:07] we could copy something like that (maybe tweak a bit re: timeouts and such) and puppetize it as e.g. /usr/local/bin/nginx-safe-restart
[14:01:34] +1
[14:04:40] hmmm the shipped /etc/init.d/nginx has a better version as function upgrade_nginx
[14:04:54] e.g. /etc/init.d/nginx upgrade
[14:05:37] I wonder if that works even though systemd is in use?
[14:06:12] yup
[14:06:56] so we don't even need a new script, we just need to document/remember how to hack around this
[14:07:49] systemd controls nginx, but "/etc/init.d/nginx upgrade" does a smooth restart of the systemd-controlled nginx
[14:09:35] alright, so next time sudo service nginx upgrade instead
[14:18:13] I don't know, does that work? I would think "service" would always hit systemctl since it's a systemd service
[14:18:33] nope, it defaults to sysvinit for non-standard actions
[14:18:41] oh, nice
[14:20:11] yeah although there seems to be a little ubuntu-leftover
[14:20:12] $ sudo service --version
[14:20:12] service ver. 0.91-ubuntu1
[14:20:20] (on debian sid)
[14:21:13] rolling restart completed
[14:23:03] hey folks
[14:23:05] Ruben mailed me again
[14:25:42] about the EU varnish conf?
[14:31:17] bblack: are you coming? :)
[14:35:13] Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177610 (BBlack)
[14:35:35] I doubt it!
[14:36:15] if it was a different week and/or in the US, maybe
[14:36:27] ah london and copenhagen! nice!
[14:47:10] elukey: uh? I thought it was in amsterdam
[14:47:55] ema: I saw http://info.varnish-software.com/summits-2016?gclid=CLD34fec9csCFdS7GwodzA4Fgg
[15:00:43] Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177744 (BBlack) On point 3 above (when does varnish send TE:chunked?), my best observations/code-searching indicate: 1. Obviously a do_stream of a chunked fetch is a chun...
[15:08:48] Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177767 (BBlack) And on point 2 above: since varnish seems to be smart about using TE:chunked only when the response length isn't easy to know, there's not much wiggle room...
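A minimal sketch of the rolling restart discussed between 13:33 and 14:09: one host at a time, a 15-second pause after each restart command, and the smooth `service nginx upgrade` action instead of `restart`. The helper names (`ssh_run`, `rolling_restart`), the use of plain ssh, and the stop-at-first-failure behaviour are assumptions made for illustration; this is not the loop that was actually run.

```python
import subprocess
import time
from typing import Callable, Iterable, List

# From the log: "service nginx upgrade" falls through to the init script's
# smooth-restart action. Everything else in this sketch is hypothetical.
SMOOTH_RESTART = ["sudo", "service", "nginx", "upgrade"]


def ssh_run(host: str, command: List[str]) -> int:
    """Run a command on a host over ssh and return its exit status."""
    return subprocess.call(["ssh", host, *command])


def rolling_restart(
    hosts: Iterable[str],
    run: Callable[[str, List[str]], int] = ssh_run,
    pause: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart nginx on one host at a time, pausing after each restart.

    Stops at the first host whose restart command fails (an assumed
    safety choice) and returns the hosts that were restarted successfully.
    """
    done: List[str] = []
    for host in hosts:
        if run(host, SMOOTH_RESTART) != 0:
            break
        done.append(host)
        sleep(pause)
    return done
```

Passing `run` and `sleep` as parameters keeps the loop testable without touching real hosts; in real use the defaults would apply and the host list would come from wherever the cache hosts are enumerated.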
[15:09:02] Traffic, Varnish, Operations: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2168861 (ema) a:ema
[15:11:38] ema: in case you're working ahead :) keep in mind the misc_fetch_large_object crap is likely to change dramatically soon, no point working on porting that at this point
[15:15:21] netops, Operations, ops-eqiad: investigate why mr1-eqiad randomly rebooted - https://phabricator.wikimedia.org/T131379#2177774 (faidon) Open>declined I re-rebooted it from the console, as it wasn't able to read the SSH keys (!? the CF is maybe broken?) and hence sshd was unable to start. It works...
[15:17:31] bblack: alright! I wanted to start working on the easy stuff first (director names collisions with probes, @req_method, ...)
[15:24:11] Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177838 (BBlack) Going into more detail on the current behaviors of misc and upload clusters: **cache_misc** - regardless of tier/layer, it sets do_stream for objects >= 1...
[15:38:23] Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177962 (BBlack) Also, confirmed that do_stream of a non-chunked fetch doesn't cause chunked response on cache_upload.
[15:41:53] Traffic, Varnish, Operations: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177973 (BBlack) Another input here: in the common case, it seems MediaWiki outputs content with TE:chunked, too.
[16:16:24] netops, Continuous-Integration-Infrastructure, Operations, Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178070 (mmodell) Thanks @dzahn for setting this up so quickly. I tested that and I wa...
[16:16:55] netops, Continuous-Integration-Infrastructure, Operations, Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178072 (mmodell)
[16:27:43] Traffic, Varnish, Operations: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2178082 (ema)
[19:28:17] netops, Monitoring, Operations, Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178759 (faidon)
[19:31:36] netops, Continuous-Integration-Infrastructure, Operations, Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178776 (chasemp) Afaik the 'talk to phabricator' portion here is relevant for git-ssh...
[20:25:06] netops, Monitoring, Operations, Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178874 (faidon)
[20:52:30] HTTPS, Traffic, Operations, Wikimedia-Shop: shop switches HTTPS -> HTTP when showing login prompt (on clicking checkout) - https://phabricator.wikimedia.org/T63528#2179008 (GHoltman) Open>Resolved a:GHoltman Resolved per HuiZSF
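Relating to the 13:22 remark that nothing monitors whether the statistics daemons are up (T131760), here is a rough sketch of what an Icinga/Nagios-style check could look like. It assumes the daemons run as systemd units and that `systemctl is-active` is an acceptable liveness signal; the function names and the unit name `varnishrls` are hypothetical, and nothing in the log says the eventual check was built this way.

```python
import subprocess
from typing import Callable, List, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def unit_state(unit: str) -> str:
    """Return the state printed by `systemctl is-active` for a unit."""
    try:
        proc = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
        )
    except OSError:
        # systemctl not available on this machine
        return "unknown"
    return proc.stdout.strip() or "unknown"


def check_units(
    units: List[str],
    get_state: Callable[[str], str] = unit_state,
) -> Tuple[int, str]:
    """Map the states of several units to one plugin exit code and message."""
    if not units:
        return UNKNOWN, "UNKNOWN: no units given"
    states = {unit: get_state(unit) for unit in units}
    down = sorted(unit for unit, state in states.items() if state != "active")
    if down:
        details = ", ".join(f"{unit} ({states[unit]})" for unit in down)
        return CRITICAL, f"CRITICAL: not active: {details}"
    return OK, f"OK: {len(states)} unit(s) active"
```

A wrapper script would print the returned message and exit with the returned code, which is the form Icinga expects from a plugin. With a check like this, the dead varnishxcps service on cp1043 would have shown up as CRITICAL instead of going unnoticed.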