[08:22:59] <wikibugs>	 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2556345 (10Joe)
[08:36:55] <wikibugs>	 10Traffic, 06Operations, 10Pybal: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#2556355 (10ema)
[09:39:03] <ema>	 yeah so the plan of not spamming #-operations with puppetfails after merging https://gerrit.wikimedia.org/r/#/c/276529/ kinda failed
[09:39:38] <ema>	 I've disabled puppet on all cache hosts and I did a salted 2x puppet run 
[09:40:31] <ema>	 however, if puppet is disabled we get a 'puppet last run' warning, which turns into a 'puppet last run' critical as soon as the first puppet run fails
[09:40:59] <ema>	 which then gets solved pretty quickly but still we get icinga spam
[10:30:58] <wikibugs>	 10netops, 06Operations: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2556618 (10akosiaris) Any updates on this one ?
[11:18:31] <ema>	 bblack: the tl;dr of what happened this morning is that I've disabled puppet on all cache nodes, merged https://gerrit.wikimedia.org/r/#/c/276529/ and did a salted 2x puppet run. On v4 hosts this failed because of the vmod_header problem
[11:19:36] <ema>	 so puppet triggered a vcl reload which failed on all v4 services which haven't been restarted before 
[11:19:47] <ema>	 that means misc frontend/backend and maps frontend
[11:20:58] <ema>	 leaving them in the very broken state of systemd thinking that the service was fine, while in reality varnishd was not listening on :80 and :3128
[11:22:04] <ema>	 when we've noticed that I stopped the salted puppet runs and fixed the broken varnishes manually (systemctl stop varnish ; systemctl start varnish ; systemctl stop varnish-frontend ; systemctl start varnish-frontend)
[11:23:56] <ema>	 then I've done a rolling restart of all v4 varnishes to make sure we can do a proper vcl-reload without breaking varnishd 
[11:26:24] <ema>	 re-enabled puppet on cp4019 (maps), run puppet twice and everything went fine, besides the spam on #-operations triggered by the first puppet fail which seems unavoidable I'm afraid
[12:13:55] <ema>	 I'm now re-enabling puppet on the remaining cache hosts one after the other scheduling a brief icinga downtime
[12:30:17] <bblack>	 ok, so the main problem is the vmod_header thing was still outstanding, basically?
[12:30:26] <ema>	 yes
[12:31:10] <bblack>	 that's a really awful failure mode
[12:31:17] <ema>	 so first of all I forgot that the vmod_header thing was still outstanding, meaning that we didn't restart all affected varnishes
[12:31:30] <ema>	 and also, I didn't think that it would have broken varnishd upon a vcl reload
[12:31:36] <ema>	 which is crazy
[12:32:00] <ema>	 I thought only a crash of the child process would have left varnishd horribly broken
[12:32:01] <bblack>	 should file a bug upstream later if you haven't already.
[12:32:24] <bblack>	 there's two design issues here:
[12:32:37] <ema>	 I've filed https://github.com/varnishcache/varnish-cache/issues/2041, which only covers the child-related issue
[12:33:06] <bblack>	 1) If VCL cannot reload (due to vmod shlib changed underneath), it should leave the existing threads running with the existing VCL, like all other failed VCL reloads...
[12:33:43] <bblack>	 2) But if for some reason that's not possible... in general, if the parent cannot spawn a child process it should exit.
[12:46:08] <bblack>	 back on the icinga spam issue, usually we only get the last-run warning when they're disabled for significant time
[12:46:29] <bblack>	 I'm guessing maybe they were still disabled from back when the vmod_header thing was first at issue?
[12:48:07] <ema>	 nope, disabled this morning
[12:48:34] <bblack>	 hmmm ok
[12:48:56] <bblack>	 beats me on that then.  maybe I should file a phab task about the puppet checks instead of just complaining about them for the 10th time :)
[12:49:20] <ema>	 my understanding is this: after we disable puppet we get a 'puppet last run' warning 3/3 because puppet is disabled 
[12:49:40] <ema>	 then, as soon as we re-enable it and puppet fails the first time, that makes 'puppet last run' switch to critical
[12:49:56] <ema>	 (immediately 3/3, as it turns out)
[12:50:10] <ema>	 and that spams #-operations
[12:50:26] <ema>	 then the second puppet run fixes everything, and that also spams #-operations :)
[12:50:28] <bblack>	 well, even if it succeeds the first time we often get spammed
[12:50:49] <bblack>	 the logic of how the "last run is long ago' check works is borked
[12:53:38] <ema>	 only way I found to avoid spamming is a for loop calling icinga-downtime (3 minutes) and then re-enabling puppet
[13:09:51] <_joe_>	 bblack: could you take a look at https://phabricator.wikimedia.org/T142944?
[13:10:08] <_joe_>	 it's about the wikidata articleplaceholders and caching
[13:12:27] <bblack>	 all my good+cheap options for the offsite travel show things like "4 seats left" in orbitz :P
[13:12:34] <bblack>	 the sooner we can get approval the better heh
[13:13:43] <mark>	 yes
[13:14:20] <mark>	 it's in in process ;)
[13:15:20] <paravoid>	 sometimes they lie about that, btw :)
[13:16:24] <wikibugs>	 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2551996 (10BBlack) 30 minutes isn't really reasonable, and neither is spamming more purge traffic.  If there's a constant risk of t...
[13:16:34] <bblack>	 paravoid: yeah but still
[13:16:59] <mark>	 i'd like to start booking as soon as we have approval
[13:17:12] <mark>	 and in some cases we may be able to do self booking (even though there's still a process for that)
[13:20:04] <_joe_>	 paravoid: they don't /lie/, they just make up numbers so that they have 3 fare options at the same price and "only 4 left" refers to one of the 3
[13:20:14] <_joe_>	 so technically they don't lie :P
[13:20:39] <bblack>	 in this case, it's all of the seats in that price band that have such warnings
[13:20:54] <bblack>	 if I knock out the "only 3/4 left" options, the price doubles up quickly
[13:23:58] <mark>	 just take an extra leg via some other airport and it becomes cheaper!
[13:24:19] <mark>	 perhaps greenpiece could look into that
[13:24:22] <mark>	 peace
[13:24:29] <bblack>	 those are the cheapest flights with no other filter :)
[13:24:55] <bblack>	 (and happen to only be one stop, but the 2+ stop options are more-expensive.  airline logic)
[13:28:01] <mark>	 well that would be logical
[13:28:16] <mark>	 what isn't logical is that my preferred direct flights to SFO are $2000-ish
[13:28:36] <mark>	 and those same flights, if hopping 1 stop via another airport nearby first, are $700
[13:30:16] <bblack>	 it's all about their lack of ability to adapt their flight schedules to realtime-ish conditions
[13:30:51] <bblack>	 so they have relatively-fixed flight schedules with relatively-fixed seat counts, and it's all about optimizing what money they can make from the seats that will be flying with or without you.
[13:31:49] <bblack>	 when a flight gets too costly because nobody's taking the seats at reasonable prices, I guess they must occasionally re-balance the fixed set of flights.
[13:38:42] <bblack>	 ema: did something happen with cache_upload this morning? ~10:00 UTC?
[13:38:52] <bblack>	 well maybe starting at 9
[13:39:13] <ema>	 bblack: not AFAIK
[13:39:30] <bblack>	 it could be traffic-induced, or it could be late fallout from the VCL changes yesterday
[13:39:57] <bblack>	 there was a big ramp-up of miss-rate and "hit-remote" rate, which is now trailing back off towards a new normal that's probably still higher than before
[13:40:37] <bblack>	 the VCL changes yesterday went out at ~14:xx though, so we're not even at the 24h mark for those yet (where we'd expect some big change, if any)
[13:40:37] <ema>	 sounds like the VCL change then, perhaps due to hfp expiration?
[13:41:08] <bblack>	 any chance they lacked the VCL update until just this morning?
[13:41:42] <bblack>	 the thing is, all those fears about hfp expiration seem "wrong", because hfp get marked (even on initial creation) as "pass" in X-Cache, and we had 0% pass-traffic.
[13:41:54] <ema>	 mmh I wouldn't think so, puppet was enabled
[13:42:10] <bblack>	 hmmm
[13:42:49] <bblack>	 there are all kinds of possible causes here, it may have nothing to do with the VCL change
[13:44:02] <bblack>	 but usually remote-hit is <1%, and applayer is ~2.5%.  they spiked out to 3.3% and 6.3% respectively, and now are slowly coming back down
[13:44:16] <bblack>	 but the slope of their dropping back towards normal looks like it's seeking a new normal that's higher than the old, at this point
[13:45:17] <bblack>	 and we're still seeing ~0.01% pass-traffic in the graph.  it used to be low enough that it would read zero to that precision
[13:45:28] <bblack>	 so at least that part is probably VCL-related
[13:45:50] <bblack>	 burt that tiny bump in pass has been there since VCL deploy time, so nothing really novel this morning
[13:46:40] <bblack>	 possible things causing the graph bump....
[13:46:50] <bblack>	 1. Something in traffic from the outside world
[13:47:01] <bblack>	 2. A mass invalidation of a bunch of images via PURGE
[13:47:23] <bblack>	 3. A delayed effect from our VCL after some objects expired off
[13:47:36] <bblack>	 4-99. lots of other possibilities heh
[13:47:41] <ema>	 :)
[13:48:12] <ema>	 so, I did restart both frontends and backends, but only on v4 machines
[13:49:23] <ema>	 unless I made a mistake while firefighting of course
[13:49:48] <ema>	 that happened roughly at 10ish UTC 
[13:50:56] <bblack>	 yeah we could check varnishd uptimes on cache_upload
[13:51:06] <bblack>	 restarts of some various layers/tiers in cache_upload would explain that graph, too
[13:51:32] <paravoid>	 https://librenms.wikimedia.org/graphs/to=1471355400/id=11521/type=port_bits/from=1471269000/
[13:51:46] <ema>	 nope, backend last restarts on 2016-06-01
[13:52:11] <ema>	 frontends on 2016-07-15
[13:52:26] <bblack>	 the hit-remote part would indicate a remote DC
[13:52:35] <bblack>	 but it doesn't necessarily mean it would be the only effect
[13:52:43] <bblack>	 hmmm
[13:53:17] <bblack>	 oh!
[13:53:35] <bblack>	 ema: we failed to account for an effect in the backend renaming patch, I think...
[13:53:44] <ema>	 uh
[13:53:49] <bblack>	 renaming the labels of the backends probably changes the chashing of objects to them?
[13:53:56] <ema>	 shit
[13:54:01] <bblack>	 that must be it
[13:54:42] <bblack>	 so it probably effectively invalidated a bunch of our backend storage at all sites when it rolled
[13:54:58] <bblack>	 something on the order of N-1/N invalidated on average, where N is count of machines in the cluster
[13:55:39] <bblack>	 that all it did was spike our stats around a little in a totally handle-able way is further argument that we're ok without persistence :P
[13:56:25] <ema>	 :)
[13:56:47] <bblack>	 you don't see it much on text, because text is oversized for the hot set both in the FE mem caches and in the backends.  so the FE statistically covered the problem up.
[13:57:07] <bblack>	 in upload, the FEs actually offload significant traffic to the local backend layer, hence the stats anomaly from re-chashing.
[13:57:45] <ema>	 and this morning I thought "I'll just rename the backends, easy day today. What can go wrong?"
[13:57:51] <bblack>	 lol
[13:58:45] <bblack>	 also, good thing we got our new esams wave online just in time, by like 36 hours :)
[13:58:53] <paravoid>	 :P
[13:58:57] <paravoid>	 the only one was burstable to 10G too
[13:59:02] <paravoid>	 we'd just pay for it a little more
[13:59:04] <mark>	 whether it could actually sustain that
[13:59:04] <bblack>	 for that 3.3Gbps traffic spike that would've been on mpls
[13:59:08] <mark>	 is a question
[13:59:15] <paravoid>	 yeah
[13:59:15] <mark>	 and if we do more than 3 Gbps on that one, they might cap us
[14:01:24] <bblack>	 I'll link this again for the second time in the past short while: https://twitter.com/codinghorror/status/506010907021828096
[14:01:37] <bblack>	 we managed to hit 2/3 in one go: we had a naming problem that caused cache invalidation :)
[14:03:06] <ema>	 we also had suicidal varnishes, but that's not a hard problem
[14:04:09] <mark>	 that might be another off-by-one error
[14:10:29] <bblack>	 :)
[14:29:54] <bblack>	 ema: so now that we have the fqdn in the backend label, we can undo a hack on cache_upload
[14:30:20] <bblack>	 in hieradata/common/cache/upload.yaml, there's separate backends for thumbs and originals (which is necc regardless)
[14:30:30] <bblack>	 so that we can route them separate to codfw/eqiad during transition testing
[14:30:46] <ema>	 nice
[14:30:49] <bblack>	 but they use different hostnames, as in eqiad: 'ms-fe.svc.eqiad.wmnet' and eqiad: 'ms-fe-thumbs.svc.eqiad.wmnet'
[14:31:11] <bblack>	 the -thumbs hostname is just an alias to the main hostname, because without the FQDN in the label we could route them separately
[14:31:28] <bblack>	 since they both became be_ms_fe
[14:31:44] <bblack>	 s/could/couldn't/ two lines up heh
[14:32:53] <bblack>	 we still need the port in the labels somehow.  maybe best if we only include it when it's not 80 or 443 in the logic
[14:33:03] <bblack>	 so we don't rename everything pointlessly, again heh
[14:33:27] <bblack>	 but we have at least one upcoming use-case where we want separate logical backends in varnish for foo.eqiad.wmnet:80 and foo.eqiad.wmnet:1234
[14:33:42] <bblack>	 (phabricator, for the notification websocket port)
[14:34:58] <ema>	 oh so that's one of the hostnames with a dash
[14:35:15] <ema>	 ms-fe
[14:37:43] <ema>	 and we don't s/-/_/ in golang, but that's not a problem because it's a static backend
[14:39:18] <bblack>	 right
[14:39:24] <bblack>	 it's just a static naming problem
[14:39:36] <bblack>	 really anything we have a port differential it's a static backend, so we can ignore go
[14:40:04] <bblack>	 just put some logic in the puppet/ruby side that says "if port != 80 && port != 443, append _port"
[14:40:26] <bblack>	 then we're covered for the common case today, and the common case in the future when we switch to https for applayer
[14:41:41] <bblack>	 hmmm restbase uses non-default port on cache_text
[14:41:46] <bblack>	 so that will change
[14:41:55] <bblack>	 on the other hand, there's no chashing to the applayer
[14:42:03] <bblack>	 so I guess we don't face this problem there
[14:42:11] <bblack>	 we just want to avoid impacting the varnish<->varnish chashing
[14:42:23] <bblack>	 which is of course on :3128
[14:43:00] <ema>	 until we start using TLS on the cache backends :)
[14:43:05] <bblack>	 right :)
[14:43:27] <bblack>	 that will probably be on some alternate port too (one that's not LVS, but still handled by the same nginx instance)
[14:47:33] <bblack>	 hmmmm
[14:48:00] <bblack>	 or, we could first do the split I was talking about yesterday, categorically separating varnish<->varnish backends from applayer backends in all the code/logic.
[14:48:09] <bblack>	 and then only deal with the port case for the applayer
[14:48:39] <bblack>	 I think it might add lines of template code in the net
[14:49:04] <bblack>	 but it will reduce their complexity, and get rid of all the "dynamic => no" in all the applayer definitions, and make the assumptions about the cases go has to deal with saner.
[14:49:46] <bblack>	 I'm not really sure how to factor all of that correctly yet
[15:11:59] <wikibugs>	 10Traffic, 06Operations: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2557297 (10BBlack) This traffic is still going strong, and we still don't have a solid explanation.  To recap some further investigation since:  The common pattern is these UAs are doing a...
[15:26:17] <ema>	 bblack: I've added a comment to the github vmod issue mentioning that a reload is enough to kill the child: https://github.com/varnishcache/varnish-cache/issues/2041#issuecomment-240137422
[15:27:56] <ema>	 bblack: looks like cp4006 might have crashed
[15:28:42] <bblack>	 yup
[15:28:48] <bblack>	 trying console
[15:30:49] <bblack>	 ema: for once (hardly ever happens), there was console buffer output on connect heh
[15:31:03] <bblack>	 it's still outputting
[15:31:09] <bblack>	 repeatedly oom-killing varnishd heh
[15:31:15] <ema>	 wonderful
[15:32:11] <bblack>	 I'll see if my ssh can get in somewhere between oomkills for a minute or two, if not I'll just hard-reset it
[15:32:48] <ema>	 sounds good
[15:33:09] <bblack>	 should depool it though
[15:33:20] <bblack>	 can you go conftool depool it for all layers?
[15:33:25] <ema>	 sure
[15:34:29] <ema>	 bblack: done
[15:36:26] <bblack>	 ok apparently systemd eventually downed varnishd for good or something, I'm back on the host
[15:36:55] <ema>	 we're getting icinga recoveries as well
[15:37:35] <bblack>	 yeah
[15:37:48] <ema>	 I'll leave it depooled for the time being
[15:37:50] <bblack>	 it could be the depool brought it back, which means traffic-induced, which means it could hit another cp4xxx now
[15:37:53] <bblack>	 so watch out
[15:38:13] <bblack>	 it was apparently the frontend that was getting oomkilled
[15:38:52] <ema>	 cp4005 also got varnishd killed by oomkiller
[15:39:04] <ema>	 [Tue Aug 16 14:34:44 2016] Out of memory: Kill process 3207 (varnishd) score 919 or sacrifice child
[15:39:37] <bblack>	 the really magic question is, how did it keep repeat-oomkilling so fast on restart
[15:39:46] <bblack>	 (for the stateless frontend)
[15:40:01] <bblack>	 unless something else is causing the oomkill and the frontend is really the wrong target, but oomkill logic keeps picking it
[15:40:35] <ema>	 on cp4005 the frontend is using 53% of memory, the backend 33%
[15:41:13] <bblack>	 yeah...
[15:41:33] <bblack>	 but technically the backend's use is resident but not required, it's cache for disk files
[15:43:04] <bblack>	 is 4012 ok?
[15:44:16] <ema>	 it looks good
[15:45:26] <ema>	 cp4012 puppetfailed on varnishkafka.py --generate-pyconf
[15:45:36] <ema>	 except for that it looks fine
[15:45:47] <bblack>	 I think the trigger is coming from all the rehashing today, but the underlying problem is our FE mem size is too aggressively big to handle all possible scenarios with varnishd overheads and jemalloc inefficiencies, etc...
[15:46:00] <bblack>	 at least on cache_upload
[15:46:21] <bblack>	 I had been slowly raising it over a period of months, and stopped when we got to ~50%
[15:46:37] <bblack>	 it's been stable at that value for a while, but we probably haven't had a huge rehash/invalidation event since then
[15:47:12] <bblack>	 T135384
[15:47:12] <stashbot>	 T135384: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384
[15:48:09] <ema>	 mmh the oom killer has been active in ulsfo for a while though
[15:48:09] <bblack>	 so it's been about a month since the raise to 50% on upload/text, was 37% for a while before that
[15:48:26] <ema>	 eg: cp4005.ulsfo.wmnet: [Sat Jul 23 20:13:16 2016] varnishd invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[15:48:30] <bblack>	 the ulsfo caches are different than most for cache_upload.  they have a bit less memory all total
[15:48:57] <bblack>	 the FE is sized by percentage, but maybe some of the overhead is relatively-fixed (e.g. based on the disk storage size) and thus the percentages don't scale down well
[15:49:07] <ema>	 yeah only ulsfo is affected by this, confirmed
[15:49:20] <bblack>	 (and again, probably chash churn triggered things to be worse today than usual)
[15:50:27] <bblack>	 I'm gonna say I should back it off to ~40%, and reboot cp4006 just in case of whatever unknown fallout there
[15:50:39] <bblack>	 and then we can do some frontend restarts on the others to bring their memory use back down
[15:50:49] <ema>	 OK
[15:52:19] <ema>	 bblack: I'm gonna set some downtime for cp4006 then
[15:52:24] <bblack>	 there's lots of subtleties in this, but the bottom line is we need a reasonable size that works on ulsfo-sized boxes too
[15:52:51] <bblack>	 probably we could've noticed this earlier if oomkill and such had any impact we seen icinga when it's rare
[15:57:00] <bblack>	 going to salt puppet on all the caches to deploy the systemd-level change to the malloc param, but obviously it won't have effect until frontend services are restarts
[15:58:53] <ema>	 alright, I've downtimed cp4006 so we can reboot it anytime 
[15:58:58] <bblack>	 cp4006 rebooting :)
[15:59:10] <bblack>	 I applied puppet to it manually before the reboot instead of waiting on salt for it
[15:59:26] <ema>	 great
[16:00:07] <bblack>	 probably should step through the cp4 upload frontends first for varnish-frontend restarts post-puppet, since they're the ones apparently at-risk
[16:00:15] <bblack>	 the others are probably lower priority
[16:00:41] <ema>	 yeah, no oom-killer on other DCs
[16:00:52] <bblack>	 did it hit caches other than upload in ulsfo?
[16:01:12] <bblack>	 well text anyways, misc/maps are unlikely to be so active or have such a large hot dataset at this time
[16:01:36] <ema>	       4 cp4005.ulsfo.wmnet
[16:01:36] <ema>	       9 cp4006.ulsfo.wmnet
[16:01:36] <ema>	       6 cp4007.ulsfo.wmnet
[16:01:36] <ema>	       6 cp4013.ulsfo.wmnet
[16:01:36] <ema>	       5 cp4014.ulsfo.wmnet
[16:01:38] <ema>	       3 cp4015.ulsfo.wmnet
[16:01:54] <ema>	 first number is the number of oom assassinations
[16:02:29] <ema>	 so no, it looks like it's only upload
[16:02:48] <bblack>	 upload's patterns are very different, and it does a lot more chashed traffic to its local backends (at all DCs)
[16:03:04] <bblack>	 so makes sense it would be more-impacted by both things (chashing, and memory pressure)
[16:07:47] <bblack>	 cleared cp4006 downtime
[16:08:34] <ema>	 bblack: I guess we can repool it now
[16:09:03] <bblack>	 yeah doing so now
[16:09:48] <ema>	 it didn't take long to see that in dstat :)
[16:12:59] <bblack>	 ema: going to step through the ulsfo uplaod frontends with depooled restarts now
[16:13:02] <bblack>	 with this locally on each:
[16:13:13] <bblack>	 confctl select name=`hostname -f`,service='nginx|varnish-fe' set/pooled=no; service varnish-frontend restart; sleep 15; confctl select name=`hostname -f`,service='nginx|varnish-fe' set/pooled=yes
[16:13:53] <ema>	 which could become a script BTW
[16:14:04] <bblack>	 yeah :0
[16:14:21] <bblack>	 since there's only 5 to hit, just doing it manually
[16:15:11] <ema>	 ok, the cp2012 puppetfails is unrelated (though mysterious)
[16:15:40] <bblack>	 I put another sleep before the restart too heh
[16:16:09] <ema>	 uh yeah good idea
[16:16:46] <bblack>	 also --quiet so it doesn't complain about its inability to fire off the icinga alerts when running directly on the hosts (confctl that is)
[16:17:25] <ema>	 how about --action?
[16:17:44] <bblack>	 select does use --action
[16:17:50] <bblack>	 or doesn't need it, anyways
[16:17:54] <ema>	 ah great
[16:19:08] <bblack>	 I did 4005 and 7 so far, and 6 was rebooting of course
[16:19:39] <bblack>	 I haven't touched 13/14/15 yet, figure I should let these frontends refill a bit first, and worst-case we'll only lose 3/6 to this bug now, probably, if it recurred
[16:20:02] <bblack>	 in theory it's safe to do it faster, but no point pushing our luck when it's already been an unluck day heh
[16:20:46] <ema>	 yeah it hasn't been boring
[16:21:46] <ema>	 I don't get cp2012's 'puppet last run' alert, everything looks good in /var/log/puppet.log
[16:22:57] <bblack>	 Aug 16 16:07:48 cp2012 salt-minion[985]: [ERROR   ] Command 'puppet agent -t' failed with return code: 6
[16:23:10] <bblack>	 it's from my salted puppet runs, those don't hit puppet.log
[16:23:29] <bblack>	 probably a conflict with a cronjob puppet run
[16:23:32] <bblack>	 or something
[16:24:08] <bblack>	 oh no it's the varnishkafka-python thing, which failed on the salted run
[16:24:11] <bblack>	 either way
[16:24:22] <bblack>	 that varnishkafka pyconf thing is a mess I've never been able to decipher
[16:24:38] <bblack>	 it sometimes takes many puppet runs for the state of that to settle to where it stops (re-)doing things on every run
[16:25:10] <bblack>	 there's some very complex interactions between multiple softwares with interdependencies there, to generate a varnishkafka config bit or something
[16:25:43] <ema>	 oh another thing I've seen today and forgot about: varnishxcache and friends do not necessarily always like varnish restarts
[16:25:50] <bblack>	 ema: if you're around and want to, watch ulsfo+upload front/hit rates in https://grafana.wikimedia.org/dashboard/db/varnish-caching
[16:25:59] <bblack>	 and do 13/14/15 and note it here
[16:26:09] <bblack>	 otherwise I'll be back from lunch in an hour or so and pick it up then :)
[16:26:18] <bblack>	 but I think we're pretty stable/safe at this point, statisticall
[16:26:19] <bblack>	 y
[16:26:32] <ema>	 ok, enjoy your lunch!
[16:27:47] <bblack>	 probably later today, I'll do some depooled frontend restarts on the other DCs/clusters too to get them all down to 40% for the long haul
[16:59:57] <ema>	 cp4013 done, waiting a bit before going on with 14
[17:08:46] <ema>	 14 done
[17:26:33] <ema>	 15 done
[18:04:04] <gehel>	 ema: I'm seeing an icinga warning about the number of HTTP 5xx for Maps and Misc
[18:04:59] <gehel>	 ema: I just started re-generating tiles on the eqiad maps cluster (which does not have traffic, so probably completely unrelated).
[18:07:12] <gehel>	 I'm just surprised as I don't see anything in the kartotherian logs.
[18:07:24] <gehel>	 disregard, warning disapeared...
[18:13:45] <bblack>	 ema: awesome, thanks :)
[18:17:24] <bblack>	 ema: I'm gonna play with https://gerrit.wikimedia.org/r/#/c/305029/1 before messing with more frontend caches, because otherwise too much stats noise to tell what's going on maybe
[18:17:32] <bblack>	 just FYI
[18:23:40] <wikibugs>	 10Domains, 10Traffic, 06Operations: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (10Mjohnson_WMF)
[18:26:14] <wikibugs>	 10Domains, 10Traffic, 06Operations: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558082 (10Mjohnson_WMF)
[18:31:58] <bblack>	 and... apparently v3 can't call error from vcl_deliver :P
[18:39:38] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558153 (10Danny_B)
[18:41:14] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558158 (10Danny_B)
[19:44:19] <wikibugs>	 10Traffic, 06Operations, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2558383 (10BBlack) Update from the spam of VCL experiments above (and several that weren't through the repo as one-off tests on one host):  In attempting to interfere...
[20:20:05] <wikibugs>	 10Traffic, 06Operations, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2558501 (10BBlack) The interesting bit is that while these were prominent requests, apparently Chrome/41-on-Windows isn't the whole story of the mysterious rise in `EC...
[20:26:45] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (10Dzahn) Hi, this will be almost completely like creating a public wiki, except for some config options later on, afaict.   All the steps for a new w...
[20:28:38] <wikibugs>	 10Traffic, 06Operations: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323201 (10BBlack) While looking into T141786 I stumbled on this again... lots of probably-illegitimate traffic to geoiplookup.wm.o with no referer header and no user-agent, spamming from all over.  So it's bugg...
[20:34:00] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558621 (10Dzahn) We should be sure "projectcom" is the name to stick too, as renaming wikis later is not really a viable option.  I'll start with that "step...
[21:13:23] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (10Platonides) Why "projectcom"? I would have expected something like "projectgrants" or "grantscom" / "grantscommittee".  Also,...
[22:21:30] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2559236 (10Mjohnson_WMF) Program Officers are already using projectgrants and projectcom as handles to distinguish between general progr...
[23:44:18] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2559538 (10Dereckson) [ Moving in discussion on the site requests workboard pending definitive name choice. ]