[00:23:34] 10Domains, 10Traffic, 06Operations, 06WMF-Legal: register .wiki gTLD domains - https://phabricator.wikimedia.org/T88873#2651081 (10BBlack) I see some refs to this ticket flying around, and recheck on some old DNS commits to add language/project domains under .wiki to our DNS, which were abandoned long ago....
[00:30:40] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651084 (10Neil_P._Quinn_WMF) >>! In T135762#2506770, @ellery wrote: > In order to do statistical testing, you would need to compare the fraction of users who clicked the button...
[07:04:33] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651429 (10ellery) @Neil_P._Quinn_WMF I'm saying that for any online AB test you to be able to group the experimental data by user. The proposed framework does not provide a mec...
[07:16:51] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651433 (10ellery) @Nuria I certainly don't disagree that segmentation must be done at the user level. I'm saying that the test statistics (or metrics as you are calling them)...
[07:54:34] _joe_: back to the etcd issue, I remember seeing something similar on Sunday while firefighting, let me look at IRC logs
[07:54:48] 12:24 < ema> depooling and restarting varnish on cp4005
[07:54:48] 12:24 < ema> ERROR:conftool:Error when trying to set/pooled=no on cp4005.ulsfo.wmnet
[07:54:51] 12:24 < ema> ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out
[07:54:54] 12:24 < ema> worked fine on 2nd try
[08:44:04] <_joe_> on sunday?
[08:44:14] <_joe_> jeez, on sunday I have no bad logs from the servers
[08:48:24] _joe_: yes that was on Sunday, 12:24 GMT+2
[08:50:16] however IIRC that was one of the machines with insane load averages. Could the timeout be due to that, or does that error imply an issue with the etcd server?
[08:50:42] <_joe_> no
[08:50:55] <_joe_> it means the etcd server had an ongoing election
[08:51:17] <_joe_> so, I really have to remove auth there
[10:39:54] <_joe_> ema, bblack https://gerrit.wikimedia.org/r/311678
[10:40:18] <_joe_> still needs some work though
[11:07:53] bblack,ema: varnishkafka code review updated - https://gerrit.wikimedia.org/r/#/c/311415/6
[11:08:35] bblack: this patch should resolve what we have discussed yesterday in a very simple way
[11:09:51] the other thing that I wanted to ask is if somebody could help me debug why https://yarn.wikimedia.org/proxy/application_1472219073448_117202/ returns 503
[11:10:00] I checked the apache backend and it returns 200
[11:10:17] I suspect that there is something weird between varnish and apache
[11:29:28] it returns 401 Unauthorized for me :)
[11:32:55] only half a day in on the cp1099 storage stuff, but looks pretty decent on varnishstat so far. cp1074 is a good comparison host, as their last varnishd start times are similar and same-DC
[11:33:47] bblack: o/ via curl or browser?
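For the curl route mentioned above, a one-liner like the following shows the status line and response headers without a browser; the URL is the short-lived one from the discussion, and without LDAP credentials it is expected to return the 401 rather than the 503 being debugged.

```bash
# Dump only the status line and response headers (-D -), discard the body.
curl -sD- -o /dev/null 'https://yarn.wikimedia.org/proxy/application_1472219073448_117202/'
```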
[11:33:54] cp1099 sees a bit higher overall cpu utilization and iowait, but it's also keeping more total objects in memory than 1074, with fewer total nukes and fewer mails to the expiry thread
[11:34:09] elukey: either way, I don't know that I have a login
[11:35:01] ah sorry it is like the other tools that we have, it checks ldap
[11:35:08] ok
[11:36:27] now that I logged in, that URL returns some normal-looking output?
[11:39:21] ok sorry the spark job finished, https://yarn.wikimedia.org/proxy/application_1472219073448_117303/ should be a repro
[11:39:51] I got a "200 OK" with a content-length of zero
[11:40:04] ah the first time. second time I got a 503
[11:41:46] heh that one's done now too
[11:43:12] and no more spark jobs available :D
[11:43:34] is "200 ok" + "CL: 0" the expected (apache) output?
[11:43:59] nope, I saw CL > 0 in the logs plus it should return a lot of html
[11:44:12] so I'll start digging a bit more if apache is doing weird things
[11:44:14] thanks :)
[11:44:26] so, possibly same bug as the CL:0 responses we see on cache_upload, too
[11:44:42] hope not :D
[11:45:07] just for curiosity (and for next time) - how did you find the CL: 0 ?
[11:57:52] sorry I lost my DSL for a bit there :/
[11:58:04] from my browser
[11:58:56] I use this chrom(e|ium) plugin all the time: https://chrome.google.com/webstore/detail/http-headers/mhbpoeinkhpajikalhfpjjafpfgjnmgk?utm_source=chrome-app-launcher-info-dialog
[11:59:17] gives a nice little button to pop up a quick view of req and resp header details on whatever you're looking at in-browser
[11:59:35] (you can get the same thing from developer tools too, but this is a little simpler and more-concise without dragging that out)
[12:13:21] ah ok I thought you took a look at varnish logs, easier than I thought :) thanks!
[12:20:47] well I was going to, but the buggy URL kept fixing itself too fast :)
[12:21:24] usually use the above HTTP Headers extension to take a quick peek at which varnish servers are in the path, then ssh to the relevant frontend or backend and start up varnishlog with the right args to catch a repro
[12:21:51] "take a quick peek at which varnish servers are in the path" == "look at X-Cache response header in browser"
[13:08:35] bblack: the answer to your question "even possible on upload?" in https://gerrit.wikimedia.org/r/#/c/311600/6/modules/varnish/templates/upload-backend.inc.vcl.erb seems to be yes
[13:09:03] at least looking at SMF.bin1 in varnishstat
[13:13:49] ema: how can you tell from SMF.bin1?
[13:15:39] (also, while I assume there might be some corner cases, I think we confirmed before that normal swift objects always have CL?)
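As an aside, one way to check how much backend traffic really arrives without a Content-Length header (rather than inferring it from SMF.bin1) would be a VSL query along these lines; this is only a sketch using varnish 4.x syntax, untested here, and on hosts running separate frontend/backend daemons the right varnishd instance may need to be picked with -n.

```bash
# Watch backend fetches whose response carries no Content-Length header
# for a minute and count them.
timeout 60 varnishlog -b -q 'not BerespHeader:Content-Length' \
    -i BereqURL -i BerespStatus | grep -c BereqURL
```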
[13:17:22] oh sorry I missed that 256K objects also end up in bin1
[13:17:48] yeah I was just picking one of the existing bins to re-use in case of no-CL
[13:18:01] I would really like to know if significant traffic is falling through that loophole though
[13:18:13] I guess I could've added a std.log() or something
[13:19:01] in any case, I think the stats still look encouraging, but only time will tell
[13:19:30] it's a little busier on cpu/iowait, but I guess that's the cost of caching more objects and tracking more separately-busy LRUs, etc
[13:19:53] right
[13:19:57] that it seems to cache more objects while also nuking/mailing less seems on the right track though
[13:24:07] the http-headers extension is cool
[13:24:35] thanks to it, TIL reddit uses varnish
[13:25:26] or Fastly, actually
[13:25:52] :)
[13:27:06] so I was looking at the two-hit-wonder patch and my included commentary about it being scary for cold frontends
[13:27:23] but really the scary bit isn't for a cold frontend, it's for cold backends, since the info comes from there anyways
[13:27:41] which isn't nearly as scary, and should resolve itself pretty quick for hot URLs given N frontends hitting a given backend anyways
[13:28:34] so in other words: on with higher N-hit-wonder experimentation :)
[13:28:49] :)
[13:29:09] re: uptime VCL patch, perhaps it would be cleaner to write a vmod for that?
[13:29:23] to expose whatever varnishstat has access to in VCL basically
[13:29:38] do vmods even get access to varnishstat?
[13:29:54] the varnish patch is pretty trivial and seems upstreamable
[13:29:59] yeah I found a vmod that does, let me look for it
[13:30:12] (and doesn't have to go through varnishstat, can look directly)
[13:30:12] https://github.com/varnish/libvmod-rtstatus
[13:30:40] well it doesn't access varnishstat but it exposes similar metrics
[13:30:43] (also, apparently so far vmod abstractions don't buy us as much as we thought over varnish patches heh)
[13:30:54] that's sadly a very good point
[13:31:25] but given that it's the backend coldness that matters and it doesn't matter as much, meh on the whole patch really
[13:31:43] ok
[13:31:59] I guess we could expose that, e.g. in backend VCL do something like "if (uptime < 1h) { set resp.http.BE-IS-COLD }" and act on that info in the FE N-hit-wonder code
[13:32:20] but it seems like pain for little gain. it's not going to be a big inrush/load problem really.
[14:27:31] ema: so another thing, m.ark and I were chatting in here yesterday a bit about varnish5 again and looking at its commits a bit
[14:28:02] ema: it looks like some of the improvements there are highly-relevant to the storage issues, and possibly there's some other improvements re HTTP-compat, 304/ims/imn issues, etc...
[14:28:20] ema: and it's likely we don't face many real issues from v4->v5
[14:28:59] I'm not saying we should just throw it into prod now or disrupt our ongoing stuff much
[14:29:02] bblack: time to start packaging then?
[14:29:41] but we should probably start thinking about getting there sooner rather than later. yeah at least packaging it if we need to? I don't think we need any local patches. the last few left were all for spersist.
[14:30:15] then we can do some depooled trials on maps and/or misc and/or upload and see how compatible and easy the upgrade is, and try to plan when/how we do it
[14:30:30] if it's trivial, it's possible text goes straight to v5 without stopping at v4 even.
[14:30:41] right yeah there's no patch left to forward-port
[14:31:02] so we just need to rebuild for jessie really
[14:31:47] all in all, given the state of v4.1, I don't think 5.0 is all that scary. it may actually have fewer bugs and issues.
[14:31:47] and the vmods might need some work
[14:32:00] the API changes don't look huge for vmods
[14:32:20] the biggest PITA about it will be managing v3/v4/v5 conflicting packages in our repo/hosts somehow
[14:34:25] well it's gonna be a bit ugly with stuff like if (hiera('varnish_version5')) and then the right APT pin, but it should be trivial
[14:35:40] mmh the repos perhaps could be problematic, we'll need experimental and very-experimental :P
[14:36:12] I guess yeah if we need it for the package part
[14:36:24] I suspect/think we'll not need many (hopefully any) VCL compat changes, etc
[14:36:41] and can define both varnish_version4 and varnish_version5 for v5 hosts to make it easier
[15:09:11] <_joe_> ema,bblack: https://gerrit.wikimedia.org/r/#/c/311678/ does this sound good to you?
[15:09:26] <_joe_> basically confctl will exit with status 1 if any of its operations fail
[15:10:53] _joe_: +1
[15:15:07] lgtm
[15:19:13] 10Traffic, 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2652358 (10RobH) a:05MoritzMuehlenhoff>03RobH I can try to get our money back, but it is doubtful. I'll pull this to me for now.
[15:38:02] bblack: https://gerrit.wikimedia.org/r/#/c/311671/, the best name I could come up with was run-no-puppet
[15:38:08] imagine the worst
[15:40:17] /usr/local/sbin/run-with-puppet-disabled,-unless-puppet-is-running,-in-which-case-wait-a-while.-also,-do-not-undo-other-disables
[15:41:21] pudo would have been my favorite name but I can't find a word starting with 'u' that makes sense in this context
[15:48:58] ema: maybe the disable/enable message should have some unique id in it? otherwise it will race itself only matching on $0 for some unrelated or overlapping things or whatever
[15:49:16] maybe the pid of the run-no-puppet script?
[15:49:37] $1 I meant above, but still
[15:50:19] also +1 to volans about propagating exit status
[15:50:40] thx
[15:51:28] if you really want to get crazy, should trap the common signals like HUP/INT/TERM and run the re-enable cleanup on those too :)
[15:51:50] but that might put you over the shellscript lines limit where someone says this should really be a python script heh
[15:52:02] eheheh
[15:54:51] also +1 for adding the PID (or any other ID) in the message
[15:57:54] alright I'll pretend the signal thing has never been said and go ahead with the other suggestions :)
[16:00:52] :)
[16:02:14] as an optimization to avoid many insane conflicts, it would be kinda cool to add some logic to our standard puppet-run script (which is what cron invokes for the agent)...
[16:02:36] to skip the current run if a root shell is open on a terminal that's been idle for I get caught by that more often than I would've statistically-expected. thinking I can do something quick and what are the odds puppet agent starts up right in the middle of this....
[16:03:55] bblack: on some hosts with this puppet would never run ;)
[16:04:21] yeah :)
[16:04:33] but it happens to me too
[16:04:41] maybe I could make some kind of bashrc hack to have a nopuppet shell mode?
[16:05:26] simpler... run-no-puppet bash :-P
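The shape of the wrapper being discussed is roughly the following; this is only a sketch incorporating the review suggestions (PID in the disable message, waiting out a running agent, signal traps, exit-status propagation), not the actual script from the gerrit change, and the lock-file paths are assumptions.

```bash
#!/bin/bash
# Sketch of a run-no-puppet-style wrapper; paths below are assumptions.

reason="run-no-puppet[$$]: $*"   # unique disable reason, includes our PID

# Wait for any in-flight agent run to finish before disabling.
while [ -e /var/lib/puppet/state/agent_catalog_run.lock ]; do
    sleep 5
done

puppet agent --disable "$reason"

cleanup() {
    # Re-enable only if the disable is still ours, so we never undo
    # somebody else's manual disable.
    if grep -q "run-no-puppet\[$$\]" /var/lib/puppet/state/agent_disabled.lock 2>/dev/null; then
        puppet agent --enable
    fi
}
trap cleanup EXIT
trap 'exit 130' HUP INT TERM   # route signals through the EXIT trap too

"$@"        # run the wrapped command
exit $?     # propagate its exit status
```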
[16:05:30] which does all the same protections and dis/re-enable wrapping as run-no-puppet and executes an identical subshell in the middle with a big flag in the PS1 prompt
[16:06:10] yeah basically that, but could be integrated snazzier. and for interactive use, you probably want feedback that it's waiting on an already-running puppet (maybe even tail the output till it finishes and gives you a shell)
[16:06:26] and then it re-enables automatically when you drop back out of course
[16:07:27] name: npsh
[18:43:53] it's really starting to look like the storage split thing has other wins, too
[18:44:10] it's storing more total objects at any one moment in time, with fewer allocator requests and such
[18:44:35] the implication is that fragmentation in the unsplit storage was so bad that it had a big effect on the total set we could store
[18:45:05] like, ~25% of space that was wasted on fragmentation and related issues is being reclaimed now
[18:45:18] (25% of the total space was probably being wasted, I mean)
[18:54:34] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2653227 (10Nuria) >@ellery, I talked about this with @mpopov Friday, and he told me that Discovery uses unique tokens as standard practice in their experiments. (They set an >ex...
[18:58:30] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2653269 (10Nuria) @Neil_P._Quinn_WMF , @ellery Please have in mind that in any of discovery's test there is no knowledge as to whether the user is part of another test (ex: hov...
[20:19:20] bblack: awesome
[20:32:37] great
[21:51:18] I'm gonna skip the daily cron restart on cp1074, so I can keep using it as a comparison point against cp1099 for an extra day or two
[21:51:54] they're only like 30 minutes apart in uptime and in the same DC
[21:54:26] also since it seems safe to keep a split-storage node up for at least a day, I might go ahead and start working through transitioning some more eqiad nodes to it slowly (slow enough that if we do have a failure at, say, 3 days out or whatever, the impact will be staggered and we can deal with it easy)
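A rough example of the kind of side-by-side counter check behind that comparison; the hostnames and the exact counter selection are assumptions, counter names as in varnish 4.x, and on hosts running separate frontend/backend varnishd instances -n would pick which one to read.

```bash
# Compare object counts, nukes and expiry mailbox activity between the
# split-storage host and the control host.
for h in cp1099.eqiad.wmnet cp1074.eqiad.wmnet; do
    echo "== $h"
    ssh "$h" varnishstat -1 -f MAIN.n_object -f MAIN.n_lru_nuked \
        -f MAIN.exp_mailed -f MAIN.exp_received
done
```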