[01:49:18] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2001011 (10AlexMonk-WMF) I just got this again after making MW fatal in beta. Varnish started sending 503s instead, and varnishlog showed "...
[02:07:48] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343562 (10BBlack) When you make MW fatal in beta, does hhvm emit some kind of valid error output, or is it truncating its output due to th...
[02:31:06] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343586 (10AlexMonk-WMF) My change seems to make MW fatal when Varnish contacts it but not when I try to curl. So it's difficult to tell.
[02:40:21] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343599 (10BBlack) Can you give me a way to reproduce it?
[02:42:53] heh I just noticed wget does HSTS now :)
[03:04:55] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343671 (10AlexMonk-WMF) I just started typing instructions, went to test them (in my separate chrome session - which exists for this extra...
[03:09:14] bblack, ... I should probably find an easier bug to reproduce this with huh?
[03:16:04] Unfortunately there's something MW is doing that makes gzip break somehow
[03:18:16] maybe something in HHVM
[03:18:21] Varnish's behaviour seems reasonable
[04:50:35] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2343755 (10brion)
[08:55:33] ema, bblack: there's a new nginx vulnerability: http://mailman.nginx.org/pipermail/nginx-announce/2016/000179.html I looked into it and it only seems to affect the client_body_temp_path option (which we don't use). so I'd say let's skip this one (when moving to 1.11 at some point, 1.11.1 will also include the fix anyway)
[09:00:05] moritzm: alright
[10:51:57] done rebooting cp3* hosts with 4.4, no issues
[10:54:29] are you also planning to reboot the to-decom cp3 hosts? it will likely take quite some time until these are actually decommissioned since esams has no dc-ops present
[10:57:31] moritzm: oh right, I'll reboot those too
[11:08:02] ema: awesome. no reboot issues on the rest of the cp30[34]x?
[11:13:29] bblack: nope
[11:16:15] moritzm: I think we do have that temp path implicitly. I've already tested 1.11.0, just hadn't done the "official" packaging of it. Will work on that now.
[11:17:44] bblack: ok!
[11:18:15] (which is slightly-complicated by experimental still being on 1.10.0, but we can do the rest as local patches)
[11:21:06] err 1.10.1, even better heh
[11:26:17] lunch, then I'll carry on rebooting cp2*
[12:17:00] ok all committed and uploaded to carbon and re-tested on cp1008
[12:17:11] doing the rest...
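As context for the CVE triage above, here is a minimal sketch (the salt target and paths are assumptions, not the actual WMF setup) of how one might confirm whether client_body_temp_path appears explicitly in the deployed config, and what the build's compiled-in default is; bblack's point is that nginx has an implicit temp path even when the directive is never set.

```
# sketch only: look for an explicit client_body_temp_path in deployed config,
# then inspect the compiled-in default path for the build
sudo salt 'cp*' cmd.run 'grep -rn client_body_temp_path /etc/nginx/ || true'
nginx -V 2>&1 | tr ' ' '\n' | grep client-body-temp-path
```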
[12:22:36] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2343755 (10faidon) Since you get bad //download// speeds, the opposite traceroute (from eqiad to you) is the more interesting one. I didn't have your IP,...
[12:31:13] I was waiting for the nginx maintainer to join us for drinks last night
[12:31:26] he texted me with "sorry, there's an nginx CVE I have to deal with..."
[12:31:37] such commitment :P
[12:34:38] :)
[12:36:07] 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Firefox: Secure connection failed when attempting to send POST request using HTTP/2 (if connection has been idle for a certain time) - https://phabricator.wikimedia.org/T134869#2344610 (10BBlack) We've just upgraded our nginx package to 1.11.1, which incl...
[12:36:42] hah :-)
[12:37:28] paravoid: I saw your name in his nginx commits too, for the dh-strip workaround suggestion. that's in our package now too :)
[12:38:06] ;)
[12:39:55] that saves us another patch, right?
[12:40:17] we still need to rebuild because libssl1.0.2, but hopefully no source changes other than the changelog entry?
[12:45:44] 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Firefox: Secure connection failed when attempting to send POST request using HTTP/2 (if connection has been idle for a certain time) - https://phabricator.wikimedia.org/T134869#2344627 (10BBlack) Also note, whether or not the nginx update helps the situat...
[12:47:02] paravoid: right. I did include 2x local patches in this build, but they're just imports of the nginx.org 1.11.0 + 1.11.1 changes, since debian's git hasn't moved forward to 1.11.x yet.
[12:48:27] (actually debian's "upstream" branch did have a 1.11.0 commit a couple days ago, but they pulled that commit back and put 1.10.1 in its place. I happened to still have that one in my git reflog, which saved some time on making this package)
[12:52:09] also, I bumped our per-node ssl sessionid shm size from 100M to 200M the other day as an experiment and re-used session count increased by ~5% or so.
[12:52:42] I'm still struggling to make up some kind of story in my head that explains the correlation between the 5% bump and doubling of size.
[12:53:57] you'd think if the shm size was the limiter, doubling it would have a more-dramatic effect
[12:55:43] but then you have to toss out all the connections we get from UAs that never effectively re-use sessionids anyways (probably mostly non-browsers), and also there's going to be some natural curve where the most-frequent-reusers reuse (reconnect) at a much higher rate than the rest of the clients
[12:56:05] between those, maybe that explains it, but it doesn't really help me (without additional data) decide if further increases are really worth it
[12:59:29] I guess we could try a really crazy experimental value for a day and get a better idea. e.g. bump it to 2GB (and I guess keep that off of labs nginxes since they may have less memory to waste and don't need it anyways).
[13:35:12] bblack: puppet is still disabled on cp hosts BTW, can we re-enable it?
[13:35:30] oh yeah, oops. I'll re-enable
[13:36:11] I didn't want puppet to try to race with the nginx zero-downtime-package-upgrade process somehow
[13:36:31] oh the service reload thing?
[13:36:44] s/reload/upgrade/
[13:38:09] yeah
[13:38:42] bblack: can I continue with the reboots or should I wait a little?
[13:38:58] I think we're ok
[13:39:08] alright!
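For reference, the sessionid shm knob discussed above is nginx's ssl_session_cache. A minimal sketch of the 100M to 200M bump in config terms follows; only the size change is taken from the log, the zone name and timeout are assumed illustrative values, not the deployed ones.

```nginx
# shared session cache across all workers of one nginx instance
ssl_session_cache   shared:SSL:200m;   # was shared:SSL:100m before the experiment
ssl_session_timeout 1d;                # assumed value, not from the log
```

Per the nginx documentation, one megabyte of this cache holds roughly 4000 sessions, so doubling it mainly helps if evictions were actually cutting reuse short for the long tail of reconnecting clients.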
[13:40:29] ok wait
[13:40:32] ema: ^
[13:40:44] we're going to retry the codfw outage I think
[13:40:53] oh that sounds fun
[13:49:06] 10Traffic, 10Monitoring, 06Operations: Add LVS public endpoint checks that bypass caches - https://phabricator.wikimedia.org/T136703#2344731 (10BBlack)
[13:50:20] 10Traffic, 10Monitoring, 06Operations: Add LVS public endpoint checks that bypass caches - https://phabricator.wikimedia.org/T136703#2344744 (10BBlack) p:05Triage>03High
[14:00:41] ema: could carry on with cp1*, maybe a little slower than before JIC.
[14:02:24] great
[14:05:26] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2344776 (10brion) Thanks, I'll keep an eye out tonight and see if it gets congested again (currently seeing a cool 80 Mbits download rate at 7:04am pacifi...
[14:57:51] ema: cp3048 was still on an older 4.4, rebooting it now
[14:58:43] thanks!
[14:59:19] looks like you already did ulsfo too?
[14:59:30] ulsfo was done already!
[14:59:42] oh wait, I'm wrong
[14:59:56] ulsfo still needs doing, it's not on the latest one
[15:00:03] as in, running 4.4, it's not wmf2
[15:00:11] well yeah
[15:00:37] I want to get them all on wmf2 and keep things consistent, so we know the fleet was all rebooted around this week and all running identical kernels, basically
[15:00:46] right
[15:00:47] unless wmf2 was extremely trivial
[15:01:38] I was using grains.item kernelrelease to check the kernel version but that doesn't include the package revision
[15:02:13] yeah it's confusing in many ways. I ended up md5summing the kernel image in /boot to be sure the package was installed right everywhere heh
[15:03:12] cp3038 is also running wmf1, upgrading
[15:05:48] oh yeah, missed that in my output heh
[15:05:59] visual-grep fail :)
[15:06:03] hehe
[15:12:03] heh like I said, confusing
[15:13:18] the diff from 4.4.2-3+wmf1 to 4.4.2-3+wmf2 includes all of upstream linux's changes for 4.4.[56789] + a little bit of cherrypick
[15:13:26] the wmf1 one was based on 4.4.4
[15:15:03] I get why, it's kind of the same reason our nginx-1.11.1+wmf1 is based on debian's 1.10.0-1
[15:15:27] but I changed the version in our resulting package name, too. I imagine tracking like that for kernels doesn't work out the same heh.
[15:17:15] yeah but the whole thing really is confusing: 4.4.0-1 in the package name, 4.4.2-3 is the version, 4.4.[56789] the contents :)
[15:17:16] I wonder how viable and ridiculously complicated it is at this point to use kpatch for some of these smaller bugfixes between less-frequent kernel upgrade reboots.
[15:18:20] moritzm: kpatch?
[15:18:55] I have no idea myself. last time I really read about kpatch was ages ago, and at the time it didn't look like a realistic option for situations like ours
[15:19:03] it was way too complicated and limited
[15:24:09] ema: cp2 network risks gone
[15:24:20] perfect
[15:33:01] 10netops, 06Operations: Upgrade cr1-codfw/cr2-codfw FPC 0 firmware - https://phabricator.wikimedia.org/T136707#2345005 (10faidon)
[15:34:03] 10netops, 06Operations: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2282892 (10faidon) 05Open>03Resolved The errors disappeared, so case 2016-0510-0764 was closed in the meantime. Before I could restore the VRRP priorities, a firmware upgrade needed to happen (cf...
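Since the kernelrelease grain and uname stop at "4.4.0-1" and never show the +wmfN package revision, here is a small sketch of checking the package revision directly; the package and image names are assumptions for a jessie-era 4.4 kernel, not verified against the real fleet.

```
# sketch: the package version string includes the +wmfN revision
sudo salt --out=txt 'cp*' cmd.run 'dpkg-query -W linux-image-4.4.0-1-amd64'
# cross-check that the image on disk is the same build everywhere
sudo salt --out=txt 'cp*' cmd.run 'md5sum /boot/vmlinuz-4.4.0-1-amd64'
```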
[15:35:16] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2345025 (10BBlack) Ok so, yeah, the repro is a little complicated, but this presumably isn't a varnish fault since hhvm's output is obvious...
[16:13:44] https://gerrit.wikimedia.org/r/#/c/292166/ could use some real review (most of the thinking in the long commitmsg rather than the actual diff)
[16:13:53] ema: paravoid: ^
[16:20:54] bblack: the reasoning about S:BP being uncacheable doesn't apply to others, such as maps, right?
[16:21:33] https://maps.wikimedia.org/_info seems to be cacheable
[16:22:02] true, text is the case the commitmsg goes into detail on
[16:22:29] but even for a cacheable response from the applayer, the same problem exists, it's just rarer because it's only possible when the item gets a miss from expiry.
[16:23:13] still, if a random maps backend appserver fails to respond to _info, that's a problem that LVS/pybal solves at the kartotherian.svc.codfw.wmnet level, and it's not appropriate to depool a random cache frontend over it.
[16:23:38] makes sense
[16:25:46] arguably there's a more-complicated middle ground where we deploy some VCL infrastructure for partial-depth checks, but until/if we do that I still think this is an improvement on the current situation
[16:26:39] e.g. when pybal is checking a varnish frontend, it uses a URL which passes through the frontend and validates that a local varnish-backend responded with 200 OK (internally)
[16:27:05] and similar for backends which don't directly contact the applayer, perhaps varnish<->varnish healthchecks should go 1 layer deeper as well
[16:29:15] (it would catch the case where e.g. varnish-frontend on cp3030 has a routing problem where it can talk to LVS, but not to other cp30[34]x, at the cost of random inappropriate frontend depool when a random local backend actually does fail without being depooled)
[16:29:53] it's a lesser cost in terms of odds of inappropriate depooling, and gives us slightly better coverage of odd failure scenarios.
[16:32:53] we could do that with some URL regex magic where if the request hostname is varnishcheck and the URL path part is /one-layer-deeper, ...
[16:32:57] yes so this could get arbitrarily complex
[16:33:27] but on the whole, without stepping into all that complexity, I think the local-only check is better than the full-stack check
[16:33:38] I agree
[16:35:19] also the choice of routing the request to backends hashing on the URL gives us a great effective cache size at the expense of HA (if the backend 'responsible' for that page is down, you get an error even if we have effectively many other varnish backends)
[16:37:12] but anyways this doesn't have too much to do with LVS checks, the only thing we should really care about is nginx+frontend
[16:47:56] kpatch isn't usable yet, it's missing various core work, maybe in time for the official stretch kernel
[16:51:32] bblack: out of curiosity, what happens without ProxyFetch (eg. misc before your commit)?
[16:52:03] ema: it still has IdleConn, which just opens an empty TCP connection to the HTTP(S) port, and reconnects if it gets dropped. if it can't reconnect in 3s, it depools.
[16:52:49] ema: also re hashing: we do have varnish healthchecks for varnish<->varnish on the chashed stuff, so the bad backend should get depooled and rehashed automatically and fairly quickly.
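A hedged sketch of the hypothetical "one layer deeper" frontend check discussed above: the varnishcheck hostname and /one-layer-deeper path are only the names floated in the conversation, and this is not taken from any deployed VCL. The idea is that this health URL is never answered from the frontend's own cache, so a 200 implies both the frontend and one local varnish backend are working.

```vcl
sub vcl_recv {
    # hypothetical health endpoint: force the request past the frontend cache
    # so a local varnish backend has to answer it
    if (req.http.Host == "varnishcheck" && req.url == "/one-layer-deeper") {
        return (pass);
    }
}
```

The pass keeps the check from being satisfied by a cached frontend object, which is what would otherwise hide a broken frontend-to-backend path.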
[16:53:11] (but sure, not before a few failures slip through, as is the case with LVS/pybal too)
[16:53:57] IIRC IdleConnection gained tcp keepalive stuff relatively-recently too, so it will notice if you e.g. pull the network plug on the target host
[16:54:12] at least, we talked about that and I think it got implemented
[16:57:19] nice
[16:57:24] bblack: https://gerrit.wikimedia.org/r/#/c/291752/ thoughts?
[16:57:58] godog suggested using logger instead, might be a good idea
[17:01:07] do we need to log to a separate file or would /var/log/syslog also work?
[17:02:44] well that gets into the whole debate about separability and easy debugging
[17:03:02] IMHO, if we use logger, we should also do an rsyslog snippet to send just these to a separate syslog file, too
[17:03:28] either way works for me (logger -> rsyslog -> also goes to its own separate file) or direct to a rotated separate file, up to you
[17:03:54] the advantage of the separate file is it makes it easy to check things via salt when you need to sometimes
[17:04:02] true
[17:04:42] I think you can do that with logger+rsyslog and a custom facility, but not 100% sure
[17:05:07] yeah it should be possible
[17:05:08] I would say "or a match on the daemon name", but I think that only works right via syslog(3), not logger?
[17:07:21] well we can use logger -t
[17:08:04] ah yeah, just figured that out the manual way :)
[17:08:09] -t does set the app-name
[17:08:55] so then the advantage over writing straight to a file would be remote logging I guess?
[17:09:17] we don't do remote logging though do we?
[17:09:28] I have no clue! :)
[17:09:55] godog said:
[17:09:57] > so we get reliable file rotation, remote syslogging, etc.
[17:10:40] well if we're setting up a separate output file either way, we get reliable rotation the same way for the same effort
[17:11:06] maybe we have future plans for remote syslog?
[17:11:18] I don't think we do it presently, anyways
[17:12:30] we could go all out and replace the cronjob with a systemd service that executes every N seconds and uses the systemd journalling of stdout, plus rsyslog splitting to a separate file :)
[17:12:46] getting kinda crazy though
[17:13:34] rebooting the last eqiad machine in the meantime, codfw also in progress
[17:13:56] anyways it would be hard to get the auto-splay we have with a cronjob, in a systemd service with watchdog periodic restarts
[17:14:16] lunch, bbl
[17:17:34] 10netops, 06Operations: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2345483 (10faidon)
[17:37:22] bblack: Hi! If you have time today would you mind checking https://gerrit.wikimedia.org/r/#/c/292172 to see if there are any very obvious and stupid mistakes?
[17:39:59] I have been doing tests today on my vagrant vms and it looks good, but I am a bit unsure about the strstr calls performance-wise
[17:40:48] will try to think if there is a smarter way to arrange the if checks to minimize them
[17:41:29] because theoretically there are multiple strstr calls for each Timestamp, which itself occurs multiple times for a single request handling
[17:42:14] But I tested it with %{end:%FT%T@dt}t in varnish config and it works nicely
[17:46:59] 10netops, 06Operations: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2345574 (10faidon) Spoke with the Level3 representative just now. Estimated delivery of the circuit is July 8th.
[17:51:21] elukey: I left it open in a browser tab, will look later today
[17:52:14] super thanks!
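Returning to the logger/rsyslog thread above, a minimal sketch of the "logger -t plus an rsyslog snippet" option; the check-foo tag and file path are made-up placeholders, not the name of the actual cronjob in the gerrit change.

```
# in the cron script: tag the message so rsyslog can match on programname
logger -t check-foo "something worth logging"

# /etc/rsyslog.d/30-check-foo.conf (assumed path): copy these messages to
# their own file while still letting them flow to /var/log/syslog as usual
if $programname == 'check-foo' then /var/log/check-foo.log
```

Adding a stop action to that rule would instead keep the messages out of /var/log/syslog entirely, but the discussion above wants both the shared stream and the separate, easy-to-salt file.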
[18:00:22] so a couple of machines in codfw didn't come back up
[18:00:44] hanging at boot time, only output I could get from mgmt is: [OK
[18:01:18] they do come back after a powercycle though
[18:05:23] ok esams, eqiad and codfw are running 4.4.2-3+wmf2
[18:05:28] ulsfo now
[18:27:22] ema: for previous reboots Papaul made idrac updates which fixed subsequent reboots for me, best to collect the hostnames and open up a task once you
[18:27:24] are done
[18:27:59] moritzm: will do, thanks!
[18:49:41] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2345885 (10brion) As of 11:44 am pacific time I'm seeing 24Mbps on the new route through Chicago, down from 80Mbps earlier this morning.
[19:24:24] it may just be intermittent on the older idracs
[19:24:47] we didn't have issues in esams, but we had way-bad issues on them months ago, and their firmware versions are way behind even codfw
[19:25:18] we should probably eventually get to a point where we audit and keep them up to date proactively
[19:25:24] can probably audit via IPMI or whatever
[19:28:47] ema: was double-checking in salt, looks like just 4004 to go?
[19:29:56] bblack: correct, plus some leftovers in esams (eg: cp3014)
[19:30:21] a while loop is taking care of those too :)
[19:30:45] oh you mean the ones not in service
[19:30:52] I've only been checking the ones in service
[19:31:31] yeah: sudo salt --out=txt 'cp*' cmd.run 'uname -a'|grep -v wmf2
[19:32:02] uh cp1044 didn't come back up
[19:33:00] console: failed to read Serial Over LAN Configuration.
[19:33:55] powercycling
[19:35:59] oh it's one of the decom ones
[19:36:59] right, racadm finally did its thing
[19:40:36] oh, but of course the kernel package was not upgraded on those
[19:42:33] heh
[19:43:41] bblack: ok so, all machines in service are now running 4.4 wmf2
[19:44:11] I'll take care of the spares tomorrow
[19:45:29] ok sounds great
[20:00:54] 10Traffic, 06Operations: Fix lvs1001-6 storage - https://phabricator.wikimedia.org/T136737#2346153 (10BBlack)
[20:20:32] 10Traffic, 06Operations, 06Community-Liaisons (Apr-Jun-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2346235 (10BBlack)
[22:25:08] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2347396 (10ori) Breakdown of 10,718,138 PURGEs, captured on 2016-06-01 between 18:00 and 22:00 UTC: ## Top P...
[22:59:45] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2347542 (10ori) @EBernhardson made it so when a job fragments into a number of child jobs, each child job has...
[23:00:49] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2347550 (10AlexMonk-WMF) Yeah, those are definitely the right questions but I don't have any useful answers, and am not likely to have time...
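On the "audit the iDRAC firmware proactively" idea above, one hedged possibility (assuming ipmitool is installed on the hosts; this is not an existing WMF check) is to pull the BMC firmware revision fleet-wide and compare the spread against whatever the current vendor release is:

```
# sketch: "Firmware Revision" is the field name in `ipmitool mc info` output
sudo salt --out=txt 'cp*' cmd.run 'ipmitool mc info | grep "Firmware Revision"'
```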
[23:22:17] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2347595 (10brion) Currently seeing my baseline 80 Mbps; floating IP 208.80.155.243 has been assigned for now to test without the proxy, just to double-con...