[08:26:48] <elukey>	 reporting something seen in #operations
[08:26:49] <elukey>	 <icinga-wm> PROBLEM - Varnishkafka log producer on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[08:27:43] <elukey>	 and on the host Apr  5 07:08:40 cp1044 varnishkafka[1518]: VSLQ_Dispatch: Varnish Log abandoned or overrun.
[08:28:25] <elukey>	 in syslog there are also a lot of "cp1044 /usr/sbin/gmond[6424]: [PYTHON] Can't call the metric handler function for.."
[08:29:24] <elukey>	 also cp1044 systemd[1]: Unit varnishxcps.service entered failed state.
[08:37:34] <elukey>	 ah there we are
[08:37:35] <elukey>	 Apr  5 07:08:40 cp1044 varnishd[911]: Child (916) Last panic at: Tue, 05 Apr 2016 07:08:40
[08:37:50] <elukey>	 ---^ there is a big panic log right after a puppet run on cp1044
[08:39:41] <elukey>	 varnish and vk are working now, but this definitely needs a follow up
[08:46:01] <elukey>	 not sure if gmond needs to be restarted too
[08:46:55] <wikibugs>	 10Traffic, 6Operations: Varnish 4 panic log registered on cp1004 - https://phabricator.wikimedia.org/T131830#2179913 (10elukey)
[08:47:20] <wikibugs>	 10Traffic, 6Operations: Varnish 4 panic log registered on cp1004 - https://phabricator.wikimedia.org/T131830#2179925 (10elukey)
[08:47:50] <elukey>	 phab task created --^
[08:48:08] <elukey>	 ah snap I forgot a 4
[08:48:31] <elukey>	 amended
[08:48:34] <wikibugs>	 10Traffic, 6Operations: Varnish 4 panic log registered on cp1044  - https://phabricator.wikimedia.org/T131830#2179913 (10elukey)
[08:48:43] <elukey>	 (technically I am out today, will double check later!)
[09:31:03] <wikibugs>	 10Traffic, 6Operations: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2179963 (10ema) Relevant logs as displayed by journalctl: https://phabricator.wikimedia.org/P2855
[09:31:21] <wikibugs>	 10Traffic, 6Operations: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2179964 (10ema) p:5Triage>3High
[10:25:02] <wikibugs>	 10Traffic, 6Operations: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2180044 (10ema) Backtrace including symbols missing from the logs:    0x433ea5: varnishd() [0x433ea5] <- pan_ic + 357   0x4311e4: varnishd() [0x4311e4] <- obj_getmethods + 84   0x43283e: varnishd(ObjGet...
[11:41:07] <wikibugs>	 10Traffic, 6Operations, 13Patch-For-Review: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2180142 (10BBlack) ^ This should fix the panic, needs review -> package -> deploy
[11:41:17] <elukey>	 \o/
[12:50:33] <wikibugs>	 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2180261 (10BBlack) So I've spent a few days pondering all of this.  There are definitely some improvements we could make to cache_upload's way of doing...
[12:51:31] <wikibugs>	 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: Upgrade to Varnish 4: things  to remember - https://phabricator.wikimedia.org/T126206#2007434 (10BBlack)
[12:51:44] <wikibugs>	 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2180266 (10BBlack)
[12:51:47] <wikibugs>	 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2180267 (10BBlack)
[12:51:50] <wikibugs>	 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2180265 (10BBlack)
[12:52:56] <ema>	 bblack: varnish builds fine with your patch
[12:53:26] <ema>	 it would be good to be able to reproduce the bug now
[12:57:48] <bblack>	 well the code bug is pretty clear
[12:59:47] <bblack>	 skipping over irrelevant unconditional statements and whatnot, the very top of HSH_DerefObjCore() is:
[12:59:51] <bblack>	 int HSH_DerefObjCore(struct worker *wrk, struct objcore **ocp) { struct objcore *oc = *ocp; *ocp = NULL; ... }
[12:59:59] <ema>	 right
[13:00:41] <bblack>	 and then we try to deref oc right after that in the getxid thing for printf
[13:00:55] <bblack>	 it lines up with exactly what the panic indicates
[13:15:21] <ema>	 bblack: if you agree we can merge this https://gerrit.wikimedia.org/r/#/c/281645/. Then I'll build new packages and upload them to carbon
[13:17:38] <bblack>	 yup!
[13:19:34] <elukey>	 bblack: out of curiosity, what is the original patch doing exactly? I mean, was it a bug not resolved in varnish that we had to fix by hand, etc..
[13:20:28] <bblack>	 the patch doesn't do anything functional, so it's not really a bugfix
[13:20:42] <bblack>	 it's just doing verbose logging of anytime NukeOne fails to nuke an object to make room for another
[13:21:22] <bblack>	 I looked at the code and it looks like (at least under varnish4) it makes a good attempt to avoid ever hitting that situation, but it's probably still possible with various concurrency/race issues
[13:21:49] <elukey>	 ahh okok got it, thanks :)
[13:22:23] <bblack>	 NukeOne takes a lock, checks for a refcnt that's exactly one, then sets some flags on the object to indicate it's about to be killed, then releases the lock, then calls HSH_DerefObjCore to kill it
[13:22:44] <bblack>	 DerefObjCore asserts the refcnt is nonzero, decrements it, and then fails to kill it if it's not now zero, which is the condition under which we end up in that printf
[13:23:16] <bblack>	 so somewhere inbetween while no locks are held, the refcnt on this object that is flagged for to-be-deleted is getting incremented, preventing the nuke from working.
[13:23:50] <elukey>	 so we failed while notifying a failure?
[13:24:12] <bblack>	 well, I wouldn't say a real failure
[13:24:32] <bblack>	 we deref'd a NULL object and panic'd, while trying to notify about a not-great condition that's probably acceptably-normal in most cases
[13:25:00] <elukey>	 all right, makes more sense :)
[13:25:06] <bblack>	 but for some hysterical raisins (probably past varnish bugs that may be long gone), we log that condition
[13:25:45] <ema>	 bblack: another thing we should fix is the permission issue on _.vsm files. Upon varnish restart they're not world-readable anymore which breaks varnish{stat,ncsa} as non-root, which breaks ganglia and possibly something else. Perhaps we could add a chmod in the systemd unit after start?
[13:26:14] <bblack>	 I'm tempted to just remove the patch unless we really know of a solid reason to still log those that definitely still applies to varnish4
[13:26:24] <bblack>	 but for now since the fixup is simple, may as well stick with it
[13:27:28] <bblack>	 ema: there must be a better solution somehow... we don't want to have some dependency race on undo-ing varnish's chmod() before other tools can work, etc?
[13:28:08] <bblack>	 this seems like a good use for group-read perms, if we have any control over all of this
[13:28:23] <ema>	 the files are readable by members of the varnish group
[13:28:48] <ema>	 so perhaps ganglia could join such group?
[13:29:50] <bblack>	 yeah
[13:30:12] <bblack>	 if the privdrop for any of the readers (gmon, vstat, vncsa) is built into the daemon itself, though, it may or may not work
[13:30:40] <bblack>	 some naive privdrop code does setuid/setgid to drop from root to whatever, but fails to do initgroups() to pick up ancillary group perms
[13:31:27] <bblack>	 probably the varnish tools do it right, but who knows about ganglia
[13:32:51] <ema>	 varnishstat works fine, I've tried it on my test instance after adding myself to the varnish group
[13:32:57] <bblack>	 latest ganglia on github does it right: https://github.com/ganglia/monitor-core/blob/8cadd2592676dec05024ecc3e7b37b4d2c29651b/lib/become_a_nobody.c#L39
[13:33:50] <bblack>	 so does 3.6.0 branch, so probably fine
[13:35:43] <ema>	 4.1.2-1wm2 uploaded
[13:36:30] <bblack>	 yay :)
[13:37:23] <ema>	 mmmh there are lots of packages that want to be upgraded on cp1044. I'd upgrade only *varnish* though
[13:37:58] <ema>	 stars to be read as wildcards rather than bold ^ :)
[13:38:26] <ema>	 depool -> upgrade -> repool ?
[13:39:05] <bblack>	 yeah
[13:39:09] <ema>	 alright
[13:39:25] <bblack>	 and yeah I saw the upgrade list yesterday, but haven't done anything about it yet since other things were up in the air
[13:40:15] <bblack>	 there's no formal process for that.  usually in the absence of a big CVE announcement where we really care, I just notice there's pending upgrades once in a while, and if they look safe and sane I "apt-get ugprade" the fleet over some appropriate window of time
[13:40:40] <bblack>	 where I guess "the fleet" for this channel is mostly cp*, lvs*, and the authdns boxes
[13:40:59] <bblack>	 they often fall behind for a while, but I try not to let them get too far behind
[13:46:37] <ema>	 interesting, after the upgrade varnish backend does not listen on 3128
[13:47:01] <ema>	 running puppet agent
[13:47:24] <bblack>	 probably the installer doesn't know how to restart our instances...
[13:47:45] <bblack>	 the old (3.x) packages, I think I removed the postinst stuff to restart varnishes, to decouple package upgrade from depooled restart and such
[13:48:31] <ema>	 mmh no, I think the package overwrote /lib/systemd/system/varnish.service
[13:48:35] <bblack>	 we probably need to take a peek at all of that.  we're also shipping our own systemd unit file on 3.x, and the 4.x package may have its own, etc...
[13:49:25] <bblack>	 in the long run, it would be saner if we'd use custom service names for both instances and never use the default "varnish" service that lacks a -n instancename
[13:49:54] <ema>	 or we could drop the units under /etc rather than /lib so that package upgrades don't overwrite them (because they're conffiles)
[13:50:06] <bblack>	 it would clean up a bunch of puppet conditionals on whether to use "-n foo" at all or not (we'd always have an instance name), it would get rid of the issue with the systemd unit file, and we could unconditionally stop/disable service "varnish" in puppet depending only on package install...
[13:50:35] <bblack>	 we made a policy decision not to put systemd units under /etc/, because systemctl's behavior is broken if you do that...
[13:50:42] <ema>	 oh yeah?
[13:50:52] * ema didn't know
[13:51:27] <bblack>	 yeah it's a design bug on the systemd end, IMHO.  I don't know what the latest status is upstream, but it's still there in debian.
[13:51:44] <ema>	 cp1044 looks sane now. Repool?
[13:51:47] <bblack>	 yup!
[13:52:29] <bblack>	 so the systemctl thing is this:
[13:52:34] <bblack>	 systemctl has a "mask" command:
[13:52:35] <bblack>	 mask NAME...
[13:52:35] <bblack>	 Mask one or more unit files, as specified on the command line. This will link these units to /dev/null, making it impossible to start them. This is a stronger version of disable, since it prohibits all kinds of activation of the unit, including enablement and manual activation. Use this option with care. This honors the --runtime option to only mask temporarily until the next reboot of the system
[13:52:42] <bblack>	 . The --now option can be used to ensure that the units are also stopped.
[13:53:04] <bblack>	 "link these units to /dev/null" means "link /etc/systemd/system/foo.service to /dev/null, wiping out any unit file you may have put there"
[13:53:49] <bblack>	 which would make sense if all real unit files are always under /lib/, and /etc/ is just used for fragment overrides in /etc/systemd/system/foo.service.d/ I guess
[13:55:12] <bblack>	 however https://www.freedesktop.org/software/systemd/man/systemd.unit.html is pretty clear that sysadmins are supposed to be able to drop their local replacement for a unit file in /etc/ ...
[13:55:39] <bblack>	 those two things are not in agreement in practice - "mask" (which is undoable with "unmask") should not wipe out the local modification with no way to put it back :P
[13:56:44] <ema>	 yep
[13:56:59] <ema>	 upgrading cp1043 now, cp1044 seems happy
[13:57:52] * elukey is happy too
[13:58:20] <elukey>	 good work folks, time to fix for a bug close to nothing :)
[14:06:53] <ema>	 upgrade done
[14:07:17] <ema>	 for the record, this is the procedure I followed:
[14:07:19] <ema>	 apt-get update ; apt-get install varnish libvarnishapi1 varnish-dbg ; puppet agent -tv ; service varnish-frontend restart ; service varnish restart ; chmod o+r /var/lib/varnish/*/_.vsm
[14:16:06] <ema>	 elukey: thanks for your help! Enjoy the day off :)
[14:17:50] <wikibugs>	 10Traffic, 6Operations, 6Performance-Team, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2180445 (10BBlack) re: nginx upstream+debian: debian's "master" branch is still at 1.9.10-1, but their "dyn" branch has work beyond that up through 1.9.13 and not yet released, which als...
[14:30:36] <wikibugs>	 10Traffic, 7Varnish, 6Operations: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2180463 (10BBlack) Noted while investigating other things: https://www.varnish-cache.org/trac/ticket/1643 - the ticket is still open, but it's not clear to me whether it's only an open issue in 4...
[14:37:22] <ema>	 bblack: yeah they say "Look up how much effort it is porting this to 4.0" which might mean it's fixed in 4.1.x?
[14:38:12] <bblack>	 yeah I donno, we'll have to test it when we get around to it I guess
[14:39:10] <ema>	 maybe we can test it now as well by serving a file > 65536 bytes and seeing what happens
[14:46:18] <wikibugs>	 10Traffic, 7Varnish, 6Operations: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2180493 (10ema) >>! In T131502#2180463, @BBlack wrote: > Noted while investigating other things: https://www.varnish-cache.org/trac/ticket/1643 - the ticket is still open, but it's not clear to m...
[14:52:48] <justinl>	 Hi all. Last week Hashar directed me here from a posting of mine to the mediawiki mailing list regarding my wikis' varnish hit ratios plummeting after upgrading from MW 1.24 to 1.26. I spent a huge amount of time and effort trying to diagnose why but with no luck. I'm hoping to perhaps get some advice here on ways I might troubleshoot this issue that I may not have thought of already.
[14:56:45] <hashar>	 ema: bblack ^^ :-}
[14:57:21] <hashar>	 that follow up a post from justinl on mediawiki-l  where his varnishes cache hit plummeted from 70% down to 30 %  iirc
[14:57:49] <justinl>	 hashar: yep, it was actually about 85% down to about 37%
[14:58:37] <hashar>	 and since I myself already forgot about varnish vcl  functions ...    -:D  I advertised this channel
[14:58:51] <hashar>	 the traffic folks must be busy
[15:01:37] <bblack>	 well our starting points on VCL are probably markedly different
[15:02:10] <bblack>	 do you have any idea what kinds of URLs you're suffering the excess misses on?
[15:02:17] <justinl>	 for what it's worth, here's a graph showing the hit ratio over the last 12 months: http://imgur.com/Uenool7 - the boost last year was when i stripped cookies from thumbnail images due to some thumbnail-heavy pages causing serious performance issues
[15:03:11] <bblack>	 I really don't mentally map things well to our public release versions, since it's a continuous stream of updates here
[15:03:30] <bblack>	 but I know in the recent past, there have been a number of changes around resourceloader (load.php) and the images/css/js it loads, etc
[15:04:13] <justinl>	 from watching the varnish logs, it was pretty spread out but lots of images, obviously, and then load.php, index.php
[15:04:35] <justinl>	 we have about 12 GB of thumbnails alone in our main wiki and about 20 GB overall across all 5
[15:04:44] <bblack>	 probably the drop is primarily in one category of things
[15:05:10] <justinl>	 however, looking at that graph, varnish was only configured with 4 GB RAM per server up until the recent MW upgrade
[15:05:17] <justinl>	 i boosted it to 8 but that didn't help
[15:06:10] <bblack>	 is it varnish 3 or 4?
[15:06:23] <justinl>	 varnish 3
[15:07:32] <bblack>	 so "varnishtop -b -i TxURL" (assuming default varnish instance, as opposed to -n foo for a specific of multiple instances)
[15:07:39] <justinl>	 the drop was immediate after the MW upgrade, but the only other change made at the time was a dist-upgrade of the web servers which are running Ubuntu 12.04.5
[15:07:46] <bblack>	 will give you a running "top"-like output of what your varnish is fetching from MW
[15:08:02] <bblack>	 the big misses should be somewhere near the top of that output
[15:08:18] <bblack>	 (well misses and passes and anything else that generated backend traffic)
[15:08:40] <justinl>	 the top entries for that varnishtop command end up being mostly load.php URLs
[15:08:56] <bblack>	 even with -b ?
[15:09:01] <bblack>	 ok
[15:09:08] <justinl>	 e.g. top now is /load.php?debug=false&lang=en&modules=ext.smw.style%7Cext.smw.tooltip.styles&only=styles&skin=monobook
[15:09:10] <bblack>	 so probably the resourceloader changes are at fault
[15:11:58] <bblack>	 I see similar URLs in our request stream, but they tend to be cacheable
[15:12:49] <justinl>	 I'm not a MW expert so not really familiar with the resource loader, but I do see the references in the 1.26 release notes about it becoming fully asynchronous. I assume that's what you're referencing?
[15:12:57] <bblack>	 probably, yeah
[15:13:18] <bblack>	 does the output of MW for that load.php URL at the top of your list have sane Cache-Control headers and such?
[15:13:33] <bblack>	 I'm not an MW expert either :)
[15:13:58] <wikibugs>	 10Traffic, 7Varnish, 6Operations: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2180630 (10ema) I've tried to reproduce the corrupt range response issue on 4.0.3 without success. Instead, I've encountered a different problem:    $ curl -v -H 'Range:bytes=0-' http://localhost...
[15:13:58] <bblack>	 pretty much all load.php traffic for us is cache hits
[15:14:16] <justinl>	 hmm, let me do some varnishlog commands to see if i can get the cache-control headers for the load.php URLs
[15:16:22] <justinl>	 they mostly seem to be "Cache-Control: public, max-age=300, s-maxage=300" and "Cache-Control: public, max-age=2592000, s-maxage=2592000"
[15:17:08] <bblack>	 yet varnish isn't caching them?
[15:17:19] <bblack>	 something else wrong with them maybe, set-cookie headers and such?
[15:17:46] <bblack>	 a lot of little things can make a URL uncacheable.  depends a lot on your VCL too.
[15:18:07] <justinl>	 I don't have anything specific in Varnish to set cookies. One sec, let me get a link to my VCL code...
[15:18:50] <justinl>	 here it is with my IPs altered for security: http://pastebin.com/nfs0yUKd
[15:19:54] <justinl>	 that's pretty much directly from https://www.mediawiki.org/wiki/Manual:Varnish_caching
[15:20:06] <justinl>	 with a couple of tweaks
[15:20:09] <bblack>	 heh
[15:20:43] <justinl>	 and exactly what I'd been using for a long time, since at least last summer when i added the thumbnail cookie stripping
[15:20:53] <bblack>	 I've never seen that page, but there's not much in common between that stuff and what we do here for WMF caches
[15:21:07] * ema sees it for the first time as well
[15:21:14] <bblack>	 so you still have an explicit "return (pass)" if any request sends a Cookie: header
[15:21:36] <bblack>	 is it possible with the new version, some new unexpected cookies are being sent by clients and tripping that, even though they're not logged-in users?
[15:22:46] <bblack>	 also, our VCL doesn't have your If-None-Match pass-block either, and I'm not sure why that would be there
[15:22:53] <bblack>	 RL could be using more If-None-Match too
[15:23:04] <ema>	 RL?
[15:23:30] <bblack>	 ResourceLoader aka load.php
[15:23:48] <bblack>	 which I think sends JS that the browser uses to load more stuff.  there's a lot of magic in there somewhere
[15:24:21] <bblack>	 or at least, sends dynamic content for a bunch of cacheable image/css/js links for the browser to load
[15:25:05] <bblack>	 justinl: you might try looking at the full client-side (-c) varnishlog for a request on that top URL in your varnishtop (the load.php) one, and seeing what a miss looks like
[15:25:21] <ema>	 but yeah looking at the VCL it seems that cookies are the most likely cause for misses
[15:25:28] <bblack>	 if it has a Cookie: which isn't from your logged-in sessions you wanted to prevent caching on, or has an If-None-Match, that would point the way
[15:25:49] <bblack>	 we do have a lot of code features in MW that use Cookie: for things other than user sessioning
[15:26:03] <bblack>	 (which is why our VCL is very explicitly about only looking for Cookie names that we know are session cookies)
[15:26:17] <justinl>	 I assumed it would be something to do with cookies but I was wary of trying anything. Tried stripping cookies from all images and that caused problems for users.
[15:27:16] <bblack>	 what we match on, in place of "pass all requests that have a Cookie at all" is:
[15:27:19] <bblack>	 if (req.http.Cookie ~ "([sS]ession|Token)=") {
[15:27:29] <bblack>	 because the actual user login stuff uses cookie names that match that regex
[15:27:43] <bblack>	 there's a bunch of other cookies that don't, and aren't session/user-specific and don't need to prevent caching
[15:29:02] <bblack>	 I would point you at our VCL for examples, but our VCL is vastly more-complicated than most sites would need, and is spread over several templates and then generated by puppet heh
[15:29:10] <justinl>	 ok, that's all interesting. I'll have to tinker with that in our test environment to ensure the proper behavior
[15:29:24] <bblack>	 but still, you can peek for some ideas at the text-* files in https://github.com/wikimedia/operations-puppet/tree/production/templates/varnish/
[15:29:31] <justinl>	 That's understandable, I definitely have simple, out-of-the-box needs.
[15:29:43] <justinl>	 cool, thanks for the link
[15:30:28] <justinl>	 well, mostly out-of-the-box
[15:31:20] <bblack>	 there's also our common VCL (common across our 4x distinct varnish clusters, but only the "text" cluster is MediaWiki) at: https://github.com/wikimedia/operations-puppet/tree/production/modules/varnish/templates/vcl
[15:34:48] <justinl>	 bblack: I really appreciate the help. :)
[15:44:58] <justinl>	 bblack: fwiw, it looks like a lot (most?) of the cookies, at least on load.php requests, are google analytics cookies
[16:00:02] <bblack>	 interesting :)
[16:02:01] <justinl>	 i would assume it's safe to still ignore them like you do with session and token cookies as far as caching is concerned
[16:02:24] <bblack>	 yeah I think so
[16:03:28] <justinl>	 does your varnish code do anything specific with ga, utma, etc. cookies?
[16:04:06] <bblack>	 nope, we just ignore them
[16:04:26] <bblack>	 we basically ignore all cookies for cache-pass purposes, except those matching the earlier regex req.http.Cookie ~ "([sS]ession|Token)="
[16:05:00] <justinl>	 ok, cool, thanks!
[16:15:49] <ema>	 bblack: oh the do_stream stuff in v4 sounds nice!
[16:16:05] <bblack>	 yeah
[16:30:47] <bblack>	 ema: so basically, where misc and upload are doing size-based tradeoffs for conditional do_stream, we want to just conditionally drop that for varnish4
[16:31:25] <bblack>	 for that matter the hacks around do_stream-on-pass are unnecesary on varnish4
[16:31:34] <ema>	 right, because the default is to stream now
[16:32:04] <ema>	 the streaming improvements are really not advertised at all
[16:32:13] <ema>	 or at least I couldn't find much
[16:32:24] <bblack>	 yeah they're not :)
[16:32:42] <bblack>	 but basically they're at the core of the whole client/backend split in the new varnish code
[16:33:14] <bblack>	 v3: one thread would handle the request and the backend request for that fetch, which has a lot of structural effect on shared do_stream, etc...
[16:33:56] <bblack>	 v4: client request and backend fetch are de-couple, and streaming goes into temporary stream buffers several clients can pull from asynchronously, which gives us tradeoff-free do_stream and thus it was turned on by default
[16:34:23] <bblack>	 (tradeoff-free meaning a slow initial client doesn't make do_stream slow down all concurrent requests for the same resource)
[16:35:28] <ema>	 for some reason h2 comes to mind :)
[16:35:37] <bblack>	 :)
[16:36:02] <bblack>	 there's a lot we can eventually do with h2, which will be tricky on the nginx/varnish boundary
[16:36:31] <bblack>	 since nginx controls the h2 streams/priorities/push, and varnish knows nothing about h2, and nginx->varnish is basically one req-per-conn on http/1.0 ....
[16:47:54] <ema>	 bblack: did we forget to merge https://gerrit.wikimedia.org/r/#/c/281439/ or are you waiting for some reason?
[16:48:16] <ema>	 oh maybe waiting on ori?
[16:50:07] <wikibugs>	 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2180904 (10mmodell) New problem: Apparently jenkins can't access phabricator over ssh.
[17:31:04] <bblack>	 ema: yeah kinda waiting on ori, or I guess I could dig into the stats myself
[17:32:05] <bblack>	 mostly my question-mark there is I don't understand exactly what varnishrls is gathering and whether they care that it counts non-load.php and even non-text requests
[19:10:18] <ema>	 bblack: I'm currently running varnishrls on cp1048 (upload) with output to stdout
[19:10:31] <ema>	 and it produces no output really
[19:11:33] <bblack>	 oh really?
[19:11:37] <ema>	 yup
[19:11:45] <bblack>	 hmmmmmm
[19:11:46] <ema>	 $ varnishrls 
[19:11:46] <ema>	 ResourceLoader
[19:11:46] <ema>	 ResourceLoader
[19:11:52] <ema>	 and it stays like that
[19:12:06] <ema>	 whereas on cp1052 (text) every now and then it flushes the stats to stdout
[19:12:21] <bblack>	 I was looking at the code, and it has those varnishlog params that look an awful lot like the cmdline ones
[19:12:25] <bblack>	 with several -i
[19:12:33] <bblack>	 and the OR'd regexen
[19:12:34] <ema>	 varnishlog -n frontend -I '^(/w/load\.php|if-none-match:|cache-control:|([\d.]+ ?){6}$|[1-5]\d{2}$)'
[19:12:50] <ema>	 yeah, and the various -i
[19:12:51] <bblack>	 yeah but the -i TxStatus -i RxHeader etc...
[19:13:04] <bblack>	 on text that ends up emitting a bunch of fields from non-load.php requests
[19:13:30] <ema>	 probably the cache-control/if-none-match stuff?
[19:13:43] <bblack>	 yeah
[19:14:00] <bblack>	 maybe varnishlog.varnishlog mashes the -i's together differently than commandline varnishlog
[19:14:29] <ema>	 mmh I suspect the magic happens in ResourceLoaderVarnishLogProcessor.handle_log_record
[19:14:41] <bblack>	 maybe
[19:15:17] <ema>	 yeah it's using > 7% cpu on upload nodes
[19:15:38] <ema>	 so I think all those log entries are actually processed, without any real reason
[19:15:56] <bblack>	 if it definitely produces no output on non-text clusters, then there's no reason to keep it on the other
[19:16:05] <bblack>	 I thought it would just on reading it
[19:20:04] <ema>	 yeah I
[19:20:19] <ema>	 I'm pretty convinced it's useless, at least on upload hosts
[19:20:54] <ema>	 ResourceLoaderVarnishLogProcessor.process_transaction never gets called
[19:21:29] <ema>	 and it should get called by varnishprocessor in case there is an interesting transaction
[19:21:32] <ema>	 https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/varnishprocessor/varnishprocessor.py#L76
[19:21:53] <bblack>	 ok
[19:22:52] <ema>	 the thing is, varnishlog -I blabla produces a lot of output but no RxURLs AFAICT
[19:23:19] <ema>	 and that's what gets used by varnishprocessor to populate the transactions dictionary
[19:23:30] <ema>	 https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/varnishprocessor/varnishprocessor.py#L66
[19:24:27] <ema>	 compare the output of the following on a text and non-text node:
[19:24:30] <ema>	 varnishlog -n frontend -I '^(/w/load\.php|if-none-match:|cache-control:|([\d.]+ ?){6}$|[1-5]\d{2}$)' -c -i TxStatus -i RxURL|grep RxURL
[19:25:27] <ema>	 but still on non-text nodes VarnishLogProcessor.handle_log_record gets called for each VSL record (also non-RxURL), hence the CPU usage
[19:28:13] <ema>	 each VSL record matching -i TxStatus, -i RxHeader and so forth, that is
[19:32:39] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181515 (10Gehel)
[19:33:29] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181533 (10Gehel)
[19:34:48] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181542 (10RobH) a:5BBlack>3RobH I'll create and link in #procurement tasks for pricing shortly.
[20:11:58] <wikibugs>	 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2181736 (10hashar) Jenkins execute the jobs on labs instances, so it is not surprising t...
[20:25:25] <wikibugs>	 10Wikimedia-Apache-configuration, 6Operations, 13Patch-For-Review, 7Puppet, 7Technical-Debt: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2181794 (10Aklapper)
[20:51:28] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181940 (10Gehel) According to @BBlack on IRC:  * Specs is "standard, SSD-based varnish cluster machine configurations". * we should not need to buy any new hardware,...
[20:54:16] <wikibugs>	 10Traffic, 6Operations, 10Phabricator, 7Blocked-on-Operations: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2181955 (10MBinder_WMF) @greg thanks for waking this thread up. :) My teams are still asking for live updated boards.  @chasemp I'm happy to...
[21:02:08] <wikibugs>	 10Traffic, 6Operations, 10Phabricator, 7Blocked-on-Operations: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2182000 (10greg) So, as for next steps:  >>! In T112765#2168889, @chasemp wrote: > Someone from #releng could put up changes for both the mi...
[21:02:30] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2182002 (10BBlack) Well to be completely clear: should not need to buy any new hardware **this quarter** - all of them need replacing on standard lifetimes, with the...
[21:17:23] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2182020 (10Gehel) As we already have the hardware, this just needs @mark's approval.
[21:19:00] <legoktm>	 bblack: I cc'd you on an email about the zero thing
[21:48:53] <wikibugs>	 7Varnish, 6Performance-Team: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2182123 (10Krinkle)
[21:50:44] <wikibugs>	 10Traffic, 6Operations, 10Parsoid, 10RESTBase, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2182138 (10GWicke)
[22:12:14] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2182167 (10RobH) a:5RobH>3mark Excellent, I was too quick to claim for processing!  So the request is to allocate the 4 machines in codfw/eqiad/esams/ulsfo each t...
[22:13:09] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2181515 (10RobH)
[22:13:52] <wikibugs>	 10Traffic, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2181515 (10RobH)
[23:04:26] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182311 (10Wrh2) Pinging Jdlrobson as I believe that this issue is the one being referenced in the English Wikivoyage dis...
[23:06:41] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182312 (10Jdlrobson) @Whr2 have there been any reports of this happening on pages that have been edited since the 4th Ma...
[23:06:56] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182313 (10Jdlrobson) p:5Normal>3High
[23:10:32] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182320 (10Wrh2) >>! In T121135#2182312, @Jdlrobson wrote: > @Whr2 have there been any reports of this happening on pages...
[23:14:48] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182325 (10Jdlrobson) a:3Jdlrobson @Wrh2 this article says: "This travel guide page was last edited at 02:45, on 14 Sep...
[23:20:53] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182349 (10Wrh2) Cache is cleared fairly regularly even if articles aren't edited - I've made minor updates to Template:P...
[23:30:33] <wikibugs>	 10Traffic, 10MediaWiki-Parser, 6Operations, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182391 (10Jdlrobson) Thanks. Then it's definitely not fixed. Looking at source it looks like the exact same problem as w...