[00:23:34] 10Domains, 10Traffic, 06Operations, 06WMF-Legal: register .wiki gTLD domains - https://phabricator.wikimedia.org/T88873#2651081 (10BBlack) I see some refs to this ticket flying around, and recheck on some old DNS commits to add language/project domains under .wiki to our DNS, which were abandoned long ago....
[00:30:40] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651084 (10Neil_P._Quinn_WMF) >>! In T135762#2506770, @ellery wrote: > In order to do statistical testing, you would need to compare the fraction of users who clicked the button...
[07:04:33] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651429 (10ellery) @Neil_P._Quinn_WMF I'm saying that for any online AB test you to be able to group the experimental data by user. The proposed framework does not provide a mec...
[07:16:51] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651433 (10ellery) @Nuria I certainly don't disagree that segmentation must be done at the user level. I'm saying that the test statistics (or metrics as you are calling them)...
[07:54:34] _joe_: back to the etcd issue, I remember seeing something similar on Sunday while firefighting, let me look at IRC logs
[07:54:48] 12:24 < ema> depooling and restarting varnish on cp4005
[07:54:48] 12:24 < ema> ERROR:conftool:Error when trying to set/pooled=no on cp4005.ulsfo.wmnet
[07:54:51] 12:24 < ema> ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out
[07:54:54] 12:24 < ema> worked fine on 2nd try
[08:44:04] <_joe_> on sunday?
[08:44:14] <_joe_> jeez, on sunday I have no bad logs from the servers
[08:48:24] _joe_: yes that was on Sunday, 12:24 GMT+2
[08:50:16] however IIRC that was one of the machines with insane load averages. Could the timeout be due to that, or does that error imply an issue with the etcd server?
[08:50:42] <_joe_> no
[08:50:55] <_joe_> it means the etcd server had an ongoing election
[08:51:17] <_joe_> so, I really have to remove auth there
[10:39:54] <_joe_> ema, bblack https://gerrit.wikimedia.org/r/311678
[10:40:18] <_joe_> still needs some work though
[11:07:53] bblack,ema: varnishkafka code review updated - https://gerrit.wikimedia.org/r/#/c/311415/6
[11:08:35] bblack: this patch should resolve what we have discussed yesterday in a very simple way
[11:09:51] the other thing that I wanted to ask is if somebody could help me debug why https://yarn.wikimedia.org/proxy/application_1472219073448_117202/ returns 503
[11:10:00] I checked the apache backend and it returns 200
[11:10:17] I suspect that there is something weird between varnish and apache
[11:29:28] it returns 401 Unauthorized for me :)
[11:32:55] only half a day in on the cp1099 storage stuff, but looks pretty decent on varnishstat so far. cp1074 is a good comparison host, as their last varnishd start times are similar and same-DC
[11:33:47] bblack: o/ via curl or browser?
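For the curl route mentioned above, a one-liner like the following shows the status line and response headers without a browser; the URL is the short-lived one from the discussion, and without LDAP credentials it is expected to return the 401 rather than the 503 being debugged.

```bash
# Dump only the status line and response headers (-D -), discard the body.
curl -sD- -o /dev/null 'https://yarn.wikimedia.org/proxy/application_1472219073448_117202/'
```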
[11:33:54] cp1099 sees a bit higher overall cpu utilization and iowait, but it's also keeping more total objects in memory than 1074, with fewer total nukes and fewer mails to the expiry thread
[11:34:09] elukey: either way, I don't know that I have a login
[11:35:01] ah sorry it is like the other tools that we have, it checks ldap
[11:35:08] ok
[11:36:27] now that I logged in, that URL returns some normal-looking output?
[11:39:21] ok sorry the spark job finished, https://yarn.wikimedia.org/proxy/application_1472219073448_117303/ should be a repro
[11:39:51] I got a "200 OK" with a content-length of zero
[11:40:04] ah the first time. second time I got a 503
[11:41:46] heh that one's done now too
[11:43:12] and no more spark jobs available :D
[11:43:34] is "200 ok" + "CL: 0" the expected (apache) output?
[11:43:59] nope, I saw CL > 0 in the logs plus it should return a lot of html
[11:44:12] so I'll start digging a bit more if apache is doing weird things
[11:44:14] thanks :)
[11:44:26] so, possibly same bug as the CL:0 responses we see on cache_upload, too
[11:44:42] hope not :D
[11:45:07] just for curiosity (and for next time) - how did you find the CL: 0 ?
[11:57:52] sorry I lost my DSL for a bit there :/
[11:58:04] from my browser
[11:58:56] I use this chrom(e|ium) plugin all the time: https://chrome.google.com/webstore/detail/http-headers/mhbpoeinkhpajikalhfpjjafpfgjnmgk?utm_source=chrome-app-launcher-info-dialog
[11:59:17] gives a nice little button to pop up a quick view of req and resp header details on whatever you're looking at in-browser
[11:59:35] (you can get the same thing from developer tools too, but this is a little simpler and more-concise without dragging that out)
[12:13:21] ah ok I thought you took a look at varnish logs, easier than I thought :) thanks!
[12:20:47] well I was going to, but the buggy URL kept fixing itself too fast :)
[12:21:24] usually use the above HTTP Headers extension to take a quick peek at which varnish servers are in the path, then ssh to the relevant frontend or backend and start up varnishlog with the right args to catch a repro
[12:21:51] "take a quick peek at which varnish servers are in the path" == "look at X-Cache response header in browser"
[13:08:35] bblack: the answer to your question "even possible on upload?" in https://gerrit.wikimedia.org/r/#/c/311600/6/modules/varnish/templates/upload-backend.inc.vcl.erb seems to be yes
[13:09:03] at least looking at SMF.bin1 in varnishstat
[13:13:49] ema: how can you tell from SMF.bin1?
[13:15:39] (also, while I assume there might be some corner cases, I think we confirmed before that normal swift objects always have CL?)
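As an aside, one way to check how much backend traffic really arrives without a Content-Length header (rather than inferring it from SMF.bin1) would be a VSL query along these lines; this is only a sketch using varnish 4.x syntax, untested here, and on hosts running separate frontend/backend daemons the right varnishd instance may need to be picked with -n.

```bash
# Watch backend fetches whose response carries no Content-Length header
# for a minute and count them.
timeout 60 varnishlog -b -q 'not BerespHeader:Content-Length' \
    -i BereqURL -i BerespStatus | grep -c BereqURL
```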
[13:17:22] oh sorry I missed that 256K objects also end up in bin1
[13:17:48] yeah I was just picking one of the existing bins to re-use in case of no-CL
[13:18:01] I would really like to know if significant traffic is falling through that loophole though
[13:18:13] I guess I could've added a std.log() or something
[13:19:01] in any case, I think the stats still look encouraging, but only time will tell
[13:19:30] it's a little busier on cpu/iowait, but I guess that's the cost of caching more objects and tracking more separately-busy LRUs, etc
[13:19:53] right
[13:19:57] that it seems to cache more objects while also nuking/mailing less seems on the right track though
[13:24:07] the http-headers extension is cool
[13:24:35] thanks to it, TIL reddit uses varnish
[13:25:26] or Fastly, actually
[13:25:52] :)
[13:27:06] so I was looking at the two-hit-wonder patch and my included commentary about it being scary for cold frontends
[13:27:23] but really the scary bit isn't for a cold frontend, it's for cold backends, since the info comes from there anyways
[13:27:41] which isn't nearly as scary, and should resolve itself pretty quick for hot URLs given N frontends hitting a given backend anyways
[13:28:34] so in other words: on with higher N-hit-wonder experimentation :)
[13:28:49] :)
[13:29:09] re: uptime VCL patch, perhaps it would be cleaner to write a vmod for that?
[13:29:23] to expose whatever varnishstat has access to in VCL basically
[13:29:38] do vmods even get access to varnishstat?
[13:29:54] the varnish patch is pretty trivial and seems upstreamable
[13:29:59] yeah I found a vmod that does, let me look for it
[13:30:12] (and doesn't have to go through varnishstat, can look directly)
[13:30:12] https://github.com/varnish/libvmod-rtstatus
[13:30:40] well it doesn't access varnishstat but it exposes similar metrics
[13:30:43] (also, apparently so far vmod abstractions don't buy us as much as we thought over varnish patches heh)
[13:30:54] that's sadly a very good point
[13:31:25] but given that it's the backend coldness that matters and it doesn't matter as much, meh on the whole patch really
[13:31:43] ok
[13:31:59] I guess we could expose that, e.g. in backend VCL do something like "if (uptime < 1h) { set resp.http.BE-IS-COLD }" and act on that info in the FE N-hit-wonder code
[13:32:20] but it seems like pain for little gain. it's not going to be a big inrush/load problem really.
[14:27:31] ema: so another thing, m.ark and I were chatting in here yesterday a bit about varnish5 again and looking at its commits a bit
[14:28:02] ema: it looks like some of the improvements there are highly-relevant to the storage issues, and possibly there's some other improvements re HTTP-compat, 304/ims/imn issues, etc...
[14:28:20] ema: and it's likely we don't face many real issues from v4->v5
[14:28:59] I'm not saying we should just throw it into prod now or disrupt our ongoing stuff much
[14:29:02] bblack: time to start packaging then?
[14:29:41] but we should probably start thinking about getting there sooner rather than later. yeah at least packaging it if we need to? I don't think we need any local patches. the last few left were all for spersist.
[14:30:15] then we can do some depooled trials on maps and/or misc and/or upload and see how compatible and easy the upgrade is, and try to plan when/how we do it
[14:30:30] if it's trivial, it's possible text goes straight to v5 without stopping at v4 even.
[14:30:41] right yeah there's no patch left to forward-port
[14:31:02] so we just need to rebuild for jessie really
[14:31:47] all in all, given the state of v4.1, I don't think 5.0 is all that scary. it may actually have fewer bugs and issues.
[14:31:47] and the vmods might need some work
[14:32:00] the API changes don't look huge for vmods
[14:32:20] the biggest PITA about it will be managing v3/v4/v5 conflicting packages in our repo/hosts somehow
[14:34:25] well it's gonna be a bit ugly with stuff like if (hiera('varnish_version5')) and then the right APT pin, but it should be trivial
[14:35:40] mmh the repos perhaps could be problematic, we'll need experimental and very-experimental :P
[14:36:12] I guess yeah if we need it for the package part
[14:36:24] I suspect/think we'll not need many (hopefully any) VCL compat changes, etc
[14:36:41] and can define both varnish_version4 and varnish_version5 for v5 hosts to make it easier
[15:09:11] <_joe_> ema,bblack: https://gerrit.wikimedia.org/r/#/c/311678/ does this sound good to you?
[15:09:26] <_joe_> basically confctl will exit with status 1 if any of its operations fail
[15:10:53] _joe_: +1
[15:15:07] lgtm
[15:19:13] 10Traffic, 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2652358 (10RobH) a:05MoritzMuehlenhoff>03RobH I can try to get our money back, but it is doubtful. I'll pull this to me for now.
[15:38:02] bblack: https://gerrit.wikimedia.org/r/#/c/311671/, the best name I could come up with was run-no-puppet
[15:38:08] imagine the worst
[15:40:17] /usr/local/sbin/run-with-puppet-disabled,-unless-puppet-is-running,-in-which-case-wait-a-while.-also,-do-not-undo-other-disables
[15:41:21] pudo would have been my favorite name but I can't find a word starting with 'u' that makes sense in this context
[15:48:58] ema: maybe the disable/enable message should have some unique id in it? otherwise it will race itself only matching on $0 for some unrelated or overlapping things or whatever
[15:49:16] maybe the pid of the run-no-puppet script?
[15:49:37] $1 I meant above, but still
[15:50:19] also +1 to volans about propagating exit status
[15:50:40] thx
[15:51:28] if you really want to get crazy, should trap the common signals like HUP/INT/TERM and run the re-enable cleanup on those too :)
[15:51:50] but that might put you over the shellscript lines limit where someone says this should really be a python script heh
[15:52:02] eheheh
[15:54:51] also +1 for adding the PID (or any other ID) in the message
[15:57:54] alright I'll pretend the signal thing has never been said and go ahead with the other suggestions :)
[16:00:52] :)
[16:02:14] as an optimization to avoid many insane conflicts, it would be kinda cool to add some logic to our standard puppet-run script (which is what cron invokes for the agent)...
[16:02:36] to skip the current run if a root shell is open on a terminal that's been idle for I get caught by that more often than I would've statistically-expected. thinking I can do something quick and what are the odds puppet agent starts up right in the middle of this....
[16:03:55] bblack: on some hosts with this puppet would never run ;)
[16:04:21] yeah :)
[16:04:33] but it happens to me too
[16:04:41] maybe I could make some kind of bashrc hack to have a nopuppet shell mode?
[16:05:26] simpler... run-no-puppet bash :-P
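The shape of the wrapper being discussed is roughly the following; this is only a sketch incorporating the review suggestions (PID in the disable message, waiting out a running agent, signal traps, exit-status propagation), not the actual script from the gerrit change, and the lock-file paths are assumptions.

```bash
#!/bin/bash
# Sketch of a run-no-puppet-style wrapper; paths below are assumptions.

reason="run-no-puppet[$$]: $*"   # unique disable reason, includes our PID

# Wait for any in-flight agent run to finish before disabling.
while [ -e /var/lib/puppet/state/agent_catalog_run.lock ]; do
    sleep 5
done

puppet agent --disable "$reason"

cleanup() {
    # Re-enable only if the disable is still ours, so we never undo
    # somebody else's manual disable.
    if grep -q "run-no-puppet\[$$\]" /var/lib/puppet/state/agent_disabled.lock 2>/dev/null; then
        puppet agent --enable
    fi
}
trap cleanup EXIT
trap 'exit 130' HUP INT TERM   # route signals through the EXIT trap too

"$@"        # run the wrapped command
exit $?     # propagate its exit status
```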
[16:05:30] which does all the same protections and dis/re-enable wrapping as run-no-puppet and executes an identical subshell in the middle with a big flag in the PS1 prompt
[16:06:10] yeah basically that, but could be integrated snazzier. and for interactive use, you probably want feedback that it's waiting on an already-running puppet (maybe even tail the output till it finishes and gives you a shell)
[16:06:26] and then it re-enables automatically when you drop back out of course
[16:07:27] name: npsh
[18:43:53] it's really starting to look like the storage split thing has other wins, too
[18:44:10] it's storing more total objects at any one moment in time, with fewer allocator requests and such
[18:44:35] the implication is that fragmentation in the unsplit storage was so bad that it had a big effect on the total set we could store
[18:45:05] like, ~25% of space that was wasted on fragmentation and related issues is being reclaimed now
[18:45:18] (25% of the total space was probably being wasted, I mean)
[18:54:34] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2653227 (10Nuria) >@ellery, I talked about this with @mpopov Friday, and he told me that Discovery uses unique tokens as standard practice in their experiments. (They set an >ex...
[18:58:30] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2653269 (10Nuria) @Neil_P._Quinn_WMF , @ellery Please have in mind that in any of discovery's test there is no knowledge as to whether the user is part of another test (ex: hov...
[20:19:20] bblack: awesome
[20:32:37] great
[21:51:18] I'm gonna skip the daily cron restart on cp1074, so I can keep using it as a comparison point against cp1099 for an extra day or two
[21:51:54] they're only like 30 minutes apart in uptime and in the same DC
[21:54:26] also since it seems safe to keep a split-storage node up for at least a day, I might go ahead and start working through transitioning some more eqiad nodes to it slowly (slow enough that if we do have a failure at, say, 3 days out or whatever, the impact will be staggered and we can deal with it easy)
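A rough example of the kind of side-by-side counter check behind that comparison; the hostnames and the exact counter selection are assumptions, counter names as in varnish 4.x, and on hosts running separate frontend/backend varnishd instances -n would pick which one to read.

```bash
# Compare object counts, nukes and expiry mailbox activity between the
# split-storage host and the control host.
for h in cp1099.eqiad.wmnet cp1074.eqiad.wmnet; do
    echo "== $h"
    ssh "$h" varnishstat -1 -f MAIN.n_object -f MAIN.n_lru_nuked \
        -f MAIN.exp_mailed -f MAIN.exp_received
done
```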