[00:51:54] 7HTTPS, 10Traffic, 6Operations, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2163941 (10Dzahn) done. removed Apache and config from argon. removed puppet role, class, template... [00:52:15] 7HTTPS, 10Traffic, 6Operations: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2163942 (10Dzahn) [00:53:44] 7HTTPS, 10Traffic, 6Operations: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2152741 (10Dzahn) 5Open>3Resolved [01:08:59] 7HTTPS, 10Traffic, 6Operations, 10Wiki-Loves-Monuments-General: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2163951 (10Dzahn) p:5Triage>3High [02:25:05] elukey / ema: I've manually tested the "Value already assigned" hack: it 'works' for variables that are re-set multiple times, but if the final disposition is an un-set, the kafka output still ends up showing the last value that was set before the final unset [02:37:26] heh, what's fascinating is that stock varnishncsa-4 does the same thing [02:38:06] if you log a resp header as e.g. %{X-DELIVER-TESTME}o , and have vcl_deliver do set resp.http.X-DELIVER-TESTME = "123"; unset resp.http.X-DELIVER-TESTME; then varnishncsa still logs it as 123 [02:38:40] varnishncsa is doing what vk does: it hooks on RespHeader, but does not hook into RespUnset [02:38:49] (ditto for ReqHeader + ReqUnset) [02:39:45] also fascinatingly related: the whole reason we log so many varnish-generated resp headers is because in the v3 world, logging request-side headers would only capture what the client set, not any vcl_recv modifications to req.http.* [02:40:08] under varnish4, both vk (with the value-already-assigned hack) and varnishncsa can see modified req.http.* values [02:40:37] so in the long run, we could maybe move a lot of things like X-Analytics onto the request/recv side of things and not have it in deliver, and not be forced to send them to the client either [02:40:52] but that'll have to come after all the clusters are on v4 [02:41:40] (and in some cases, we might want to put them in the output for other reasons anyways. but still, it's nice to know we *can* log modified request-side headers) [02:44:52] it's probably possible to make some not-horrible changes to vk.c to support ReqUnset/RespUnset stuff, but if varnishncsa doesn't do it right either, I question the utility of it. [02:45:44] we'd have to hack in to treat the RespUnset like ReqHeader and hook both for header names, and if it's an unset some flag says to clear the existing lp->match[x] for it. [02:57:57] I was going to file an upstream bug with varnish about varnishncsa not honoring unset resp.http.*, but their stupid trac system is borked, at least for my account [02:58:21] when I log in, it just throws some stupid trac preference-panel related error on every page fetch and doesn't let me do anything :P [03:13:26] elukey / ema: https://gerrit.wikimedia.org/r/280612 has what seems to be a correct/working fix for vk.c, to mirror the current varnishncsa-4.1 behavior with regard to all of these things, although tbh the current behavior of vncsa is questionable too (as noted in commitmsg). [03:18:36] also note I'm technically off thurs/fri, but will still be semi-around and check in on things [03:19:44] IMHO at this point, assuming you don't run into new problems, I'd just package up https://gerrit.wikimedia.org/r/280612 and it should work for e.g. X-Analytics and such for now, and fix the ganglia+varnishstat thing, and deploy cp1044 and mark a bunch of things Done in phab :) [04:16:59] re: ganglia+varnishstat, the big changes seem to be: [04:17:47] 1) the flags "i" and "a" are now basically replaced with "g" and "c" respectively, which is the basic counter/gauge split in both cases [04:19:38] well, mostly that may be enough to fix this, I'm not sure [04:20:07] for metric, properties in stats_cache[instance].iteritems(): [04:20:07] if metric == "timestamp" or metric.startswith(("VBE.", "LCK.")): [04:20:10] continue [04:20:12] slope = (properties['flag'] == 'i' and 'both' or 'positive') [04:20:40] mostly the above in /usr/lib/ganglia/python_modules/varnish.py might need updating, for new metric.startwith stuff to ignore, and definitely changing flag "i" to "g" [05:48:54] bblack: thanks a lot! I believe the vk compromise is super good for the moment, we'll fix the unset inconsistency later on (we can't be better than upstream :P). I am still convinced that your patches are valuable and we should think about packaging them and testing them in maps.. [05:52:11] also great news, the memleak seems completely gone according to cp1043 top's output [05:54:31] (brb) [08:27:41] 10Traffic, 6Analytics-Kanban, 6Operations, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2164270 (10elukey) Summary of last updates: 1) varnish-kafka code changes in https://gerrit.wikimedia.org/r/#/c/276439/ and https://gerrit.wikim... [09:10:02] ema: bblack: some MediaWiki end user complained its Varnish hit rate went down after upgrading to MW 1.27. So I took the liberty to point him at this channel ;-D [09:10:10] he might show up eventually or fill a task against #traffic [09:10:33] I think he was looking for appropriate metrics and logging tip to debug the issue [11:50:31] alright now puppet is finally enabled on cp1043 running varnish4 [11:51:01] * elukey dances [11:51:04] it uses varnishkafka 1.0.8-2 which would be https://gerrit.wikimedia.org/r/280612 [11:51:23] everything seem to be working fine to me, even ganglia stopped exploding [11:51:46] so in the afternoon we can proceed with cp1044 IMHO [12:19:11] oh, did you look at the varnish+ganglia "flag" issue? I think all the things that used to be 'i' are now 'g', so it needs that slope change mentioned above [12:19:36] (it would "work" without it, but I think it would turn some gauges into counters or vice-versa) [12:19:44] bblack: nope, I've only tried to make sure it does not kill our aggregator first [12:19:50] but we should do that too, you're right [12:20:04] ok [12:20:15] so in the end the DoS was due to parentheses and quotes in metric descriptions [12:21:23] nice [12:21:42] mmh [12:21:51] 12:21:16 ema@cp1043.eqiad.wmnet:~ [12:21:51] $ varnishstat -1 -j | jq '.'|grep flag|sort |uniq -c [12:21:51] 1 "flag": "b", [12:21:51] 216 "flag": "c", [12:21:51] 49 "flag": "g", [12:21:59] 12:21:26 ema@cp1044.eqiad.wmnet:~ [12:21:59] $ varnishstat -1 -j | jq '.'|grep flag|sort |uniq -c [12:21:59] 171 "flag": "a", [12:21:59] 1 "flag": "b", [12:21:59] 13 "flag": "c", [12:22:01] 12 "flag": "g", [12:22:04] 24 "flag": "i", [12:22:40] in the varnish docs, they say all a were deprecated in favor of c, and all i deprecated in favor of g [12:22:58] c/b remain, but b is bitmap [12:23:30] so I guess we need a conditional and check for == "i" on v3 and == "g" on v4 [12:23:53] I think we're just doing it wrong on v3 too, possibly [12:24:17] you mean it should be: `in ("i", "g")` on v3? [12:24:26] possibly, not sure though [12:25:03] vk works really well BTW! [12:25:07] from the varnish source (only real docs on this) from the commit that killed the old flags: [12:25:10] * f - Format: Semantics of the value in this field [12:25:13] - * 'a' - Accumulator (deprecated, use 'c') [12:25:16] * 'b' - Bitmap [12:25:18] * 'c' - Counter, never decreases. [12:25:21] * 'g' - Gauge, goes up and down [12:25:23] - * 'i' - Integer (deprecated, use 'g') [12:25:49] ema: \o/ [12:26:22] I'm guessing they already switched to 'c' for some things even back in v3, and if those happen to be any of the ones we actually log in ganglia, they're probably wrong today in v3 with out `in ("i", "g")` [12:31:59] bblack vk contains all you patches plus mines, we went all in :P [12:32:05] yeah so currently in v3 we're using a positive slope for stuff like this: [12:32:08] "SMP.main1.g_free_pending": { [12:32:10] "type": "SMP", [12:32:13] "ident": "main1", [12:32:15] "value": 0, [12:32:18] "flag": "g", [12:32:20] "description": "Free space pending" [12:32:23] }, [12:33:55] we should use 'both' I guess [12:34:42] yeah I think so [12:34:57] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Text%20caches%20eqiad&h=cp1065.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1459427618&event=hide&ts=0&v=377852256&m=varnish.SMP.main1.g_free_pending&vl=N%2Fs&ti=Free%20space%20pending&z=large [12:35:15] that's g_free_pending on cp1065's main1 [12:35:20] right [12:35:54] also if you're going to patch that... [12:36:39] when I had to look at varnish.py last night, there were a couple lines that really bugged me when reading it... [12:37:03] slope = (properties['flag'] == 'i' and 'both' or 'positive') [12:37:11] 'units': (slope == 'positive' and 'N/s' or 'N'), [12:37:37] if that's really idiomatic python, I guess leave it, but surely there's an idiomatic way to right that which doesn't sound wrong? [12:37:47] using an if I guess [12:37:48] s/right/write/ heh [12:38:07] in most languages you'd at least have two difference preferences or types of operator there or something [12:38:39] with 'and' 'or' being from the same set of operators/precedence, it's not immediately obvious wtf is happening there [12:39:23] it sounds more like perl5's crazy set operators, as in "check if slop equals 'positive' and 'N/s' at the same time, or just the value 'N'" [12:39:30] err perl6 I mean [12:40:08] my mind binds more tightly on >>'positive' and 'N/s' or 'N'<< as a grouped thing on the left of the == [12:40:20] s/left/right/ [12:40:25] ok clearly I need coffee [12:42:17] the internet seems to suggest something like: [12:42:53] slope = 'N/s' if properties['flag'] == 'i' else 'N' [12:43:00] which also seems kind of ugly [12:43:05] yeah no [12:43:10] I'll suggest being verbose [12:43:32] let me prepare a CR and let's see if it looks better or worse :) [12:43:44] not that it matters much, this is just early morning griping [12:44:11] they should just implement a proper ternary operator :P [12:44:33] slope = properties['flag'] == 'i' ? 'N/s' : 'N' [12:45:51] I checked the kafka maps topic and the Analytics field looks good [12:46:00] also double checked with my team [12:46:25] the only thing different is "cache_status":"HIT", for cp1043 rather than "cache_status":"hit", (cp1044) [12:47:14] I've mentioned before in side notes on some random ticket, we should kill the cache_status field completely for analytics logging (e.g. webrequest) [12:47:23] probably should make a real ticket about that [12:47:55] it's not helpful - it's only telling us what was a hit in the frontend, not what the frontend got a hit for on the backend 0ms away behind it [12:48:23] but sometimes analytics (or other consumers) sees there's a cache_status field, and assumes it's meaningful about our cache hitrates, which it isn't [12:48:24] all right, so we are good from vk's perspective then :) [12:49:27] we should replace it with something distilled from the full value of the X-Cache response header, too [12:49:41] but that's neither here nor there, could still kill the misinformational one first and work on that later [12:50:05] bblack: https://gerrit.wikimedia.org/r/280658 [12:51:09] hey it lifts a conditional out of a loop anyways, always a good optimization :) [12:51:26] right! +1 for speed! [12:54:21] bblack: if nothing breaks in the next minutes I'd propose to go ahead with cp1044 https://gerrit.wikimedia.org/r/#/c/280660/ [13:07:53] for the record the "new" upgrade procedure is: [13:07:56] service varnish-frontend stop ; service varnish stop ; apt-get -y remove libvarnishapi1 ; rm /srv/sd*/varnish* ; puppet agent -tv [13:08:23] after bblack's point yesterday about persistent storage [13:09:02] OK, depooling cp1044 [13:11:53] varnish 4 on cp1043 currently serving all maps requests [13:13:51] looking good! [13:15:31] upgrading cp1044 [13:18:43] running puppet [13:20:04] 0 puppet-errors [13:20:37] \o/ [13:21:04] awesome, varnishkafka also restarted fine and all [13:21:15] another random notable change: in v3, both varnishd processes ran as user 'varnish', now the master runs as 'varnish' and the slave that does the real work runs as the new user 'vcache [13:21:19] ' [13:22:07] vk config files looks good! [13:22:28] nice for privsep, but I wouldn't be surprised if somewhere down the line we eventually find something broken that assumed some file ownership or process uid and gets nicked by the change [13:22:37] bblack: heh [13:22:50] another notable change is that _.vsm files are not world readable anymore [13:22:57] :) [13:23:13] I chmoded that by hand on both cp1043 and cp1044 otherwise everything using varnishstat as non-root would fail [13:23:25] making ganglia explode and who knows what else [13:23:35] re: phabricator for all of this: I'd close up the really-done ones like https://phabricator.wikimedia.org/T122880 and the ones for vmod packages, VCL port, etc... [13:24:00] but really we need some new broader-scope tasks going forward to manage the remaining open blocked/blockers [13:24:24] maybe an overall task for "convert all clusters to varnish4", and subtasks to do misc/upload/text, and move various blockers/blocking around that [13:24:38] and then we can later pick out which tasks within all that apply to various upcoming goals [13:24:57] if you guy need help I'd be glad to help during the next months [13:25:07] (too many help but you got my point) [13:25:10] e.g.: [13:25:11] T114662: RFC: Per-language URLs for multilingual wiki pages [13:25:11] T122867: Evaluate the feasibility of cache invalidation for the action API [13:25:17] ^ those would block on text-cluster to v4 [13:26:11] just packaging->installing libvmod_xkey as it exists today is probably legitimately unblocked now (for v4 hosts), but we probably need follow-on blockers to actually dig into whether it needs code changes for our setup, and how we integrate it and lay out key schema, etc... [13:26:40] basic sanity checks on cp1044 green [13:26:42] repool? [13:26:48] yup [13:27:29] done [13:27:59] cluster upgraded to varnish 4! [13:28:19] congrats guys, really big achievement [13:30:00] all seems good with kafkacat too [13:30:42] yep! [13:31:41] keeping an eye on vk's mem footprint but it should be good [13:33:08] yes 0.3 on cp1043, 0.1 on cp1044 [13:33:37] on the former it's been running for 1h 43min though [13:33:49] "x_analytics": "https=1;nocookies=1", [13:33:50] yeehaw! [13:33:51] :) [13:34:26] :) [13:34:41] ottomata: does the rest good look too? [13:34:47] ema: I am checking RES and on cp1043 it is stable at around ~80000 [13:34:57] yeah well s/good look/look good/ [13:35:12] elukey: awesome [13:35:32] and ganglia hasn't exploded! [13:35:44] https://ganglia.wikimedia.org/latest/?c=Maps%20caches%20eqiad&m=network_report&r=hour&s=by%20name&hc=4&mc=2 [13:35:50] yeah i think so, i compared cp1043 vs 1044 yesterday, and the x_analytics was the only thing I noticed [13:36:09] ottomata: I'd double-check again, a lot of vk.c diffs went in since then [13:36:56] (most of which were late-night patches I made, with the only "testing" being "it looks right and it compiles" :P) [13:37:12] man it worked on first try [13:37:15] are those two running the same thing? [13:37:24] cp1043 and 1044 [13:37:24] ? [13:37:29] ottomata: yep [13:37:31] ok [13:39:35] I tested every change with tons of requests multiple times, should be more than "it looks right and it compiles" "_ [13:39:38] :) [13:39:58] 10Traffic, 6Operations, 13Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#2164742 (10ema) 5Open>3Resolved [13:39:59] ok fair enough :P [13:40:00] 10Traffic, 6Operations, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164744 (10ema) [13:40:19] I'm just saying, without a proper testsuite and good coverage, C bugs can be very subtle [13:40:19] hmm i think i just noticed one difference, not sure it matters [13:40:36] cache_status is lowercase in misc [13:40:38] and uppercase on maps [13:40:42] misc: [13:40:42] "cache_status": "pass", [13:40:45] yeah that one's been noted already [13:40:49] maps: "cache_status": "PASS", [13:40:50] ok [13:40:54] do we know why? [13:41:12] it's a change to the field in varnish, and we're passing that through [13:41:16] but also from earlier here: [13:41:17] 12:46 < elukey> the only thing different is "cache_status":"HIT", for cp1043 rather than "cache_status":"hit", (cp1044) [13:41:20] 12:47 < bblack> I've mentioned before in side notes on some random ticket, we should kill the cache_status field completely for analytics logging (e.g. webrequest) [13:41:23] 12:47 < bblack> probably should make a real ticket about that [13:41:26] 12:47 < bblack> it's not helpful - it's only telling us what was a hit in the frontend, not what the frontend got a hit for on the backend 0ms away behind it [13:41:29] 12:48 < bblack> but sometimes analytics (or other consumers) sees there's a cache_status field, and assumes it's meaningful about our cache hitrates, which it isn't [13:41:32] 12:48 < elukey> all right, so we are good from vk's perspective then :) [13:41:35] 12:49 < bblack> we should replace it with something distilled from the full value of the X-Cache response header, too [13:41:38] 12:49 < bblack> but that's neither here nor there, could still kill the misinformational one first and work on that later [13:42:20] cool [13:42:21] ok [13:42:47] also, i gotta sync with joal about removing xff! gah! bblack you should keep poking us about that, it slips our miiindds [13:42:54] i'm pretty sure we can just remove it [13:43:06] it's on my list of 400 small-ish tasks that always slip my mind too :) [13:43:10] heh [13:44:43] interperting X-Cache meaningfully is probably non-trivial too, we should probably have more possible states than just simple hit/miss/pass [13:45:42] 10Traffic, 6Operations, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164749 (10ema) [13:45:50] we should probably differentiate a recv-based pass (pass based on attributes of the request itself, always applies) and pass from hit-for-pass (where the request generally looks like a miss, until we notice something about the *response* that makes us cache hit-for-pass for those reqs for 10 minutes at a time) [13:46:25] and differentiate what layer we got a cache hit at too, but I'm not sure the best way to represent that [13:46:56] but it would be nice to differentiate frontend, local-backend, and remote-backend hits at least, and maybe which DC the hit came from. remote-dc hits are better than misses, but still have a latency penalty [13:50:06] maybe something like: "miss" for a full miss through the stack, "pass" for a recv-based pass, "hfp" for a hit-for-pass, and "hit/L/D" for hits where L is layer (fe, be, remote-be), and D is the datacenter name (only present for remote-be). [13:50:24] oh and we need a status "int" for internally-generated on the frontend (e.g. varnish-level redirects and errors) [13:50:33] 10Traffic, 7Varnish, 6Operations: Port remaining scripts depending on varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T131353#2164751 (10ema) [13:50:42] 10Traffic, 7Varnish, 6Operations: Port remaining scripts depending on varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T131353#2164765 (10ema) p:5Triage>3Normal [13:50:51] anyways [13:51:07] vacation-mode: mostly on [13:51:14] 10Traffic, 6Operations, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2164767 (10ema) [13:51:17] 10Traffic, 6Operations, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164768 (10ema) [13:51:19] (I'm off today and tomorrow, I guess starting about now heh) [13:51:19] 10Traffic, 6Operations, 13Patch-For-Review: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2164766 (10ema) 5Open>3Resolved [13:51:25] bblack: enjoy! :) [13:51:32] yep enjoy! [13:53:20] 10Traffic, 6Operations, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164772 (10ema) [13:53:23] 10Traffic, 10Continuous-Integration-Infrastructure, 6Operations, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2164773 (10ema) [13:56:10] hahah [13:56:43] bblack has documented his grand plan via IRC :) [14:04:40] ema: RES on cp1044 looks stable :) [14:04:52] elukey: fantastic [15:10:02] ema: https://gerrit.wikimedia.org/r/#/c/279308/ - upgraded the code review, basically the idea is to ship and create the vk rsyslog/logrotate files in the module [15:10:23] removing any trace of it (at least for the standard vk, not its instances) from role and files/varnish [15:12:46] elukey: looks good [16:59:21] 10netops, 6Operations, 10ops-eqiad: investigate why mr1-eqiad randomly rebooted - https://phabricator.wikimedia.org/T131379#2165365 (10Cmjohnson) [18:52:03] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165704 (10mmodell) [18:53:37] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165728 (10mmodell) So I'm guessing that iridium -> gallium:4730 is probably fixable wit... [19:02:31] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165259 (10Dzahn) >>! In T131375#2165682, @mmodell wrote: > @chasemp: connecting to gall... [22:55:55] 10Traffic, 6Operations, 10Phabricator, 7Blocked-on-Operations: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2166588 (10greg) >>! In T112765#1822062, @chasemp wrote: > We need to make a plan to get connectivity through to the end host for this. Thi... [23:30:44] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2166726 (10Dzahn) on gallium: Notice: /Stage[main]/Contint::Firewall/Ferm::Service[gear...