[00:40:32] i didn't think of test.wikidata [00:40:34] i'll ask them [00:42:36] one other reason for wanting to do this is redundancy [00:43:09] it's ok for us to lose X-Wikimedia-Debug functionality if eqiad is down, but testwiki and testwikidatawiki should remain up [01:04:22] bblack: hoo said it's ok [01:55:47] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1964746 (10JanZerebecki) I added it so I can look at the things related to this ticket in one graph (queue siz... [02:03:32] ori: I'm confused. So we're keeping testwiki + testwikidata as wikis and hostnames, but just getting rid of automatic X-Wikimedia-Debug on them? [02:06:20] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1964774 (10BBlack) Yeah but the rate increase we're looking at is actually in the htmlCacheUpdate job insertio... [02:15:29] 7Varnish, 10MediaWiki-Vagrant: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1964776 (10Mattflaschen) a:3Mattflaschen [02:15:57] 7Varnish, 10MediaWiki-Vagrant, 3Collaboration-Team-Current: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1963478 (10Mattflaschen) [02:53:42] 7Varnish, 10MediaWiki-Vagrant, 3Collaboration-Team-Current: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1964822 (10Mattflaschen) Questions I asked on IRC: ``` [01/25/16 21:38] gilles, in https://gerrit.wikimedia.org/r/#/c/265370/ is there a reason the varnish user... [02:54:00] 7Varnish, 10MediaWiki-Vagrant, 3Collaboration-Team-Current: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1964823 (10Mattflaschen) 5Open>3declined For now at least. See above. 
[02:55:11] 7Varnish, 10MediaWiki-Vagrant: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1964825 (10Mattflaschen) [03:22:33] ema: FYI for the morning, I didn't touch the esams mobile->text thing yet, but I did switch the pybals to etcd. If you want to pick up with mobile->text before I get online feel free. [05:11:58] bblack: yes. testwiki is used when people want to try out something which requires editing pages, but which they don't want to appear in their contribution history (or recent changes) on their home wiki, or which runs the risk of causing breakage, like making a change to a popular template or lua module [05:13:58] so it can actually be disruptive to editors when MediaWiki developers use mw1017 to debug in production [05:16:04] and the problems people try to debug on mw1017 tend to be the ones which are hard to reproduce locally -- typically a report from a user about a problem encountered on a real wiki [05:16:48] so the use-cases for testwiki and for mw1017 are almost mutually exclusive, and they are frequently in conflict [05:17:57] examples from the logs: hello, anyone know what's up with testwiki? none of the JavaScript is loading and it says "Array" at the top-left of the page [08:37:41] <_joe_> ori: I think we could direct testwiki elsewhere without harm :) [09:02:03] _joe_: I understand esams' pybals have been switched to etcd [09:02:17] if so, can you take a look at https://gerrit.wikimedia.org/r/266475? [09:03:49] <_joe_> ema: seems ok-ish, but I'm involved in other activities atm [09:04:14] _joe_: alright, there is no rush :) [09:24:13] 7Varnish, 10MediaWiki-Vagrant: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1965119 (10Gilles) I'm not even sure that it needs a user directory at all, I was just following a pattern of user creation in Vagrant I had seen elsewhere. [10:44:12] hi :) [10:44:25] bblack: hey! 
[10:44:46] I was just about to start the mobile->text switch in esams [10:44:48] I have some personal stuff to do in my late-morning, so getting an early start instead :) [10:45:12] cool [10:54:04] bumping weight of existing mobile nodes to 10 [10:58:52] cp3003.esams.wmnet: pooled changed no => yes [10:58:58] (weight=1) [11:01:23] I'm going to bump cp3003's weight to 5 in 4 minutes [11:05:16] cp3003.esams.wmnet: weight changed 1 => 5 [11:10:12] ema: before you go much further, have you looked at the weight differential issue? [11:10:36] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1965334 (10Addshore) As far as I can tell in Wikibase.... - WikiPageUpdater::scheduleRefreshLinks creates... [11:11:16] ema: so the esams text cluster has the abnormal property that most of the machines have "weight:3" and then four of them have "weight:10", intentionally, because the hardware differs [11:12:02] bblack: in the mobile cluster all weights were 1 [11:12:19] yeah but we need to preserve the relative weighting from text in the end [11:12:45] aha! [11:13:05] thanks for pointing that out [11:13:14] so cp3003 for example has weight 3 in text [11:13:41] right, 3-14 have weight:3 and 3[01],4[01] have weight:10 [11:14:01] should we set its weight to 3 in mobile as well? [11:14:15] and the existing mobile machines are more or less like 3-14 in terms of hardware, it just wasn't a mixed-hardware cluster before [11:14:36] yeah I'm not sure, let me think a moment :) [11:14:49] sure! 
[11:15:10] yeah ok most of the things that crossed my mind don't matter heh [11:16:00] so from where we are now, I'd do this to bring us towards where we want to end up (because if we start with a baseline weight of 10 and scale text nodes accordingly, we get dangerously high on sum(weight)) [11:17:13] start with re-weighting the currently-pooled nodes all to 3: drop cp3003 5 => 3, then step through dropping the 4x existing mobile from 10 => 3 to finish getting them evenly weighted [11:18:03] maybe spread those weight shifts out by a few minutes, and once we arrive at 3/3/3/3/3 for cp3015-18 (old mobile) + cp3003, then do the long pause there to let backend caches fill [11:18:33] then just use the text weights as you go (add the other cp3004 and on as weight:3, and cp30[34][01] as weight:10) [11:19:00] do you mean s/currently-pooled/currently-depooled/? [11:19:14] no [11:19:42] I mean, with the 5x pooled machines we have (old mobile + 1x text), fix up the weightings first before pooling more nodes [11:20:00] cp3003 5 => 3, short pause, cp3015 10 => 3, short pause, etc... [11:20:23] aha, understood [11:20:40] so that they're all weight:3 for the 5x currently pooled [11:20:45] <_joe_> bblack: we have no parsoidcache in codfw? [11:20:55] <_joe_> shit, this is a blocker for the migration I guess [11:21:09] _joe_: we have machines for it, but they've never been configured-up [11:21:09] <_joe_> err, switchover [11:21:17] because parsoid didn't exit there back then [11:21:23] s/exit/exist/ [11:21:26] <_joe_> bblack: heh, we need that now [11:22:04] yeah well [11:22:08] parsoid is an oddball case... 
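[Editor's note: the re-weighting steps bblack outlined above (cp3003 5 => 3, then each old mobile node 10 => 3 with pauses, then pooling the rest at text weights) would map onto confctl roughly as below. This is a hypothetical sketch, not a transcript: the select/set syntax is assumed from the conftool in use at the time and may differ by version.]
```
# step the already-pooled nodes down to text's baseline weight of 3,
# pausing a few minutes between each change
confctl select name=cp3003.esams.wmnet set/weight=3
confctl select name=cp3015.esams.wmnet set/weight=3
confctl select name=cp3016.esams.wmnet set/weight=3
confctl select name=cp3017.esams.wmnet set/weight=3
confctl select name=cp3018.esams.wmnet set/weight=3
# long pause here to let backend caches fill, then pool the remaining
# text-weighted nodes one at a time
confctl select name=cp3004.esams.wmnet set/pooled=yes:weight=3
# ...
# the big-hardware nodes get weight:10, possibly ramped via 5 first
confctl select name=cp3030.esams.wmnet set/pooled=yes:weight=5
confctl select name=cp3030.esams.wmnet set/weight=10
```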
[11:22:30] we also don't have caches you can hook up directly to mediawiki in codfw either in config terms right now [11:22:40] because they're tier-2 backending to eqiad [11:22:48] cp3003.esams.wmnet: weight changed 5 => 3 [11:23:01] parsoidcache is the only one of our cache clusters that's eqiad-only functionally [11:23:10] it has never had a tier2 [11:23:50] https://phabricator.wikimedia.org/T110472 [11:24:47] (^ phab task to decom parsoidcache and all related things, since they should all be running through restbase by now anyways, but apparently there are still lots of live dependencies blocking!) [11:25:32] cp3015.esams.wmnet: weight changed 10 => 3 [11:25:49] <_joe_> bblack: it should be also "codfw-only" maybe [11:25:57] <_joe_> and we use only the one in the master DC [11:26:08] <_joe_> just to not need to rethink it [11:26:24] <_joe_> we just install the same effing thing in codfw pointing to codfw parsoids [11:26:58] ema: so after 3003+3015->18 are all weight:3, then continue with pausing->adding nodes, but add 300[456789]+301[0234] with weight:3, and then last 30[34][01] as weight:10 (they might need some ramp-in, maybe 5 then 10) [11:27:27] but I guess the first pause is the biggest, which would be the pause after these 10=>3 weight adjustments at this point [11:27:53] _joe_: yeah and/or we can see about just decomming cache_parsoid :) [11:28:55] <_joe_> bblack: within this quarter? [11:29:06] <_joe_> I doubt it [11:29:09] :) [11:29:17] in my mind it should've been a quarter ago! [11:29:34] but anyways, probably first order of business is s/$::mw_primary/$::site/ in modules/role/manifests/cache/parsoid.pp [11:29:42] so that that role always uses its own local DC [11:30:19] <_joe_> yup [11:30:23] cp3016.esams.wmnet: weight changed 10 => 3 [11:30:48] <_joe_> how is confctl going btw? Too clunky/verbose to use? 
[11:31:06] nah, it's nice [11:31:38] _joe_: right now it looks like codfw cache_parsoid is actually configured, it's just using eqiad backends heh [11:31:57] <_joe_> bblack: oh... didn't know :P [11:32:22] the front edge IPs probably aren't right [11:32:27] <_joe_> ok so we just need to use the local backends maybe [11:32:55] yeah which my s/// earlier should do [11:33:08] and then make sure dns/lvs/cache config or whatever is good on what the front service IPs are [11:33:26] <_joe_> so the point is - either we point parsoidcache to the local cluster and make mediawiki connect to the master parsoidcache, or we do connect to the local parsoidcache from mediawiki, and let the cache decide where to connect [11:33:27] cp3017.esams.wmnet: weight changed 10 => 3 [11:33:58] _joe_: unfortunately cache_parsoid is decidedly not normal. we can't/shouldn't be using it for x-dc routing or tiering [11:34:15] so it really needs to be MW in DCX has to use parsoid.svc.DCX.wmnet [11:34:44] (consider it dc-local only) [11:35:22] <_joe_> bblack: ok I was already doing that :P [11:36:15] cp3018.esams.wmnet: weight changed 10 => 3 [11:36:43] all pooled mobile nodes in esams now have weight = 3 [11:37:37] ema: yeah so this is the "pause a bit and let caches refill" point, which they've already been doing since cp3003 went in I guess [11:38:06] but still, it pays to be cautious with esams, I'd give it at least 30 minutes from now before pooling another [11:38:22] "then continue with pausing->adding nodes, but add 300[456789]+301[0234] with weight:3, and then last 30[34][01] as weight:10 (they might need some ramp-in, maybe 5 then 10) [11:38:26] which is perfectly fine given that I'm hungry :) [11:38:26] " [11:38:31] ok :) [11:39:19] _joe_: I need a refresher on the mess of which IPs/hostnames map to what in parsoid-land, let me look around a bit [11:40:19] ok so LVS defines: [11:40:21] parsoid: &ip_block011 [11:40:21] eqiad: 10.2.2.28 [11:40:21] parsoidcache: &ip_block012 [11:40:21] 
eqiad: [11:40:23] parsoidcachelb: 208.80.154.248 [11:40:26] parsoidcachelb6: 2620:0:861:ed1a::3:14 [11:40:28] parsoidsvc: 10.2.2.29 [11:40:46] and those are the same as the LVS service names that consume those IP blocks [11:40:52] so ... [11:41:39] parsoid.svc.eqiad.wmnet is 10.2.2.28, which is LVS -> directly to parsoid machines (not caches) [11:41:59] parsoidcache.svc.eqiad.wmnet is 10.2.2.29, which is LVS -> cache_parsoid in eqiad [11:42:25] and then parsoid-lb.eqiad.wikimedia.org is those public IPv4+6, also to cache_parsoid in eqiad [11:43:44] and then restbase, cxserver, citoid, and graphoid all have various .eqiad.wikimedia.org and/or -lb.eqiad.wikimedia.org hostnames mapping to those same public v4/v6 for cache_parsoid [11:44:07] hmmm wrong, not -lb [11:44:30] it's restbase.wikimedia.org + restbase.eqiad.wikimedia.org => [public IPs of cache_parsoid], for all of restbase, cxserver, citoid, graphoid [11:45:25] oh and also the special-case rest.wikimedia.org [11:46:25] all of those services have their own internal LVS behind varnish too, for foo.svc.eqiad.wmnet [11:47:07] but supposedly they should all be flowing the bulk of their traffic through the restbase entrypoints on the text cluster, and not cache_parsoid [11:47:23] but there's still traffic on cache_parsoid, because nobody spends time on decomming the long tail of legacy things [12:05:12] 10Traffic, 10ContentTranslation-cxserver, 6Services, 6operations: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1965406 (10BBlack) Does that imply that **nothing** should be using the hostnames `cxserver.wikimedia.org` and/or `cxserver.eqiad.wikimedia.org`, which map... 
[12:05:42] 10Traffic, 10Graphoid, 6Services, 6operations: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1965407 (10BBlack) Are things still using the hostnames `graphoid.wikimedia.org` and/or `graphoid.eqiad.wikimedia.org`, which map to the cache_parsoid cluster rather than through res... [12:06:21] 10Traffic, 10Citoid, 6Services, 6operations: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1965408 (10BBlack) Are things still using the hostnames `citoid.wikimedia.org` and/or `citoid.eqiad.wikimedia.org`, which map to the cache_parsoid cluster rather than through restbase? [12:06:59] 10Traffic, 10RESTBase, 6Services, 6operations: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1965410 (10BBlack) Are things still using the hostnames `rest.wikimedia.org` and/or `restbase.wikimedia.org` and/or `restbase.eqiad.wikimedia.org`, which map to the cache_parsoid clu... [12:34:21] bblack: ready to resume [12:34:46] I'll start adding 300[456789]+301[0234] with weight:3 [12:35:44] ema: ok :) [12:37:40] cp3004.esams.wmnet: pooled changed no => yes [12:37:41] cp3004.esams.wmnet: weight changed 1 => 3 [12:42:56] cp3005.esams.wmnet: pooled changed no => yes [12:42:56] cp3005.esams.wmnet: weight changed 1 => 3 [12:47:34] cp3006.esams.wmnet: pooled changed no => yes [12:47:34] cp3006.esams.wmnet: weight changed 1 => 3 [12:48:20] 10Traffic, 10Graphoid, 6Services, 6operations: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1965474 (10Yurik) Not to my knowledge. 
I sometimes use it for debugging, eg when restbase has a bad day, but I can ssh directly [12:52:31] cp3007.esams.wmnet: pooled changed no => yes [12:52:31] cp3007.esams.wmnet: weight changed 1 => 3 [12:57:16] cp3008.esams.wmnet: pooled changed no => yes [12:57:16] cp3008.esams.wmnet: weight changed 1 => 3 [13:02:10] cp3009.esams.wmnet: pooled changed no => yes [13:02:10] cp3009.esams.wmnet: weight changed 1 => 3 [13:04:09] <_joe_> bblack: I have another "philosophical" question for you... say I have to upload an image, and my request gets served by codfw, but the active "master" is eqiad. I think it makes sense to have the appservers make the request locally, and let varnish forward it along its tiers [13:04:20] <_joe_> instead of pointing mediawiki directly to the source [13:04:37] <_joe_> 1st of all, because I guess that traffic's unencrypted [13:04:54] <_joe_> (mediawiki to upload.svc.eqiad.wmnet atm) [13:05:05] well... [13:05:15] <_joe_> heh! [13:05:31] (a) there is no upload.svc.eqiad.wmnet [13:05:37] <_joe_> no? [13:05:41] <_joe_> ahah [13:05:47] <_joe_> it's in mediawiki-config [13:06:03] <_joe_> wtf? [13:06:05] heh [13:06:27] <_joe_> WAT? [13:06:32] I was going to say (b) honestly I have no idea how to answer these questions, the plan for app servers in 2x primary DCs still seems like a clusterfuck in my head [13:07:08] cp3010.esams.wmnet: pooled changed no => yes [13:07:08] cp3010.esams.wmnet: weight changed 1 => 3 [13:07:24] I mean we have X completely independent app services, and then a bunch of them are configured to talk to other ones "internally" [13:07:40] I don't see how we ever get even close to a synchronous switch of all those... [13:08:14] and in cases where there *is* e.g. 
a foo.svc.eqiad.wmnet, no, the traffic layer will not magically re-route that hostname to some other DC [13:08:16] <_joe_> the idea is still to go read-only, and switch, in case of need [13:08:17] that's where it lives, period [13:08:29] <_joe_> yup, I know [13:08:42] <_joe_> anyways, this is friggin absurd... [13:08:54] <_joe_> there is a nonexistent host in mediawiki-config [13:09:05] it's probably a dead/unused config key then [13:09:16] (in practice, anyways!) [13:09:25] <_joe_> I guess so... [13:09:31] <_joe_> I hope so :P [13:09:59] but my point is, I think within the applayer(s), probably everything should assume it has to use DC-local hostnames, as in *.svc.$::site.wmnet [13:10:05] not *.svc.$::active_dc.wmnet [13:10:26] <_joe_> bblack: that's my preferred option too [13:10:45] in cases where a service is smart enough to be active-active x-dc, it will take care of itself anyways in that case [13:11:15] otherwise we probably want "if varnish sends an outside-world request into some app service in codfw, logically all inter-applayer requests that happen as a result should stick to that dc" [13:12:11] cp3012.esams.wmnet: pooled changed no => yes [13:12:11] cp3012.esams.wmnet: weight changed 1 => 3 [13:12:38] the uglier case is that I'm quite sure we have stuff at the bottom of the applayer stack that then makes requests right back into the front, public, non-geography-specific hostnames [13:12:47] e.g. mediawiki making a request to "upload.wikimedia.org" [13:13:53] <_joe_> bblack: I'm trying to weed out those too if possible [13:14:33] <_joe_> bblack: is it possible upload.svc existed in the past? [13:14:52] _joe_: yes, it is possible [13:15:16] the upload cache service has been through some design revision over time, it used to work quite differently, including how requests route into and out of it [13:15:48] at one point I believe it was handling rendering requests itself too [13:16:13] <_joe_> it doesn't now? 
[13:16:16] (as in, requests come into varnish, try swift. if swift doesn't find it, then try rendering.svc directly for the user fetch, and assume that gets stored into swift for future requests somehow) [13:16:29] <_joe_> it does that now, right? [13:16:37] no, now it just fetches from swift. it's up to swift to sort everything else out for itself [13:16:58] <_joe_> ok so now swift does the rendering request to the backend, right? [13:17:18] _joe_: somehow. I don't know if that's directly in realtime, or through some other convoluted mechanism [13:17:24] cp3013.esams.wmnet: pooled changed no => yes [13:17:24] cp3013.esams.wmnet: weight changed 1 => 3 [13:17:29] but the bottom line is, varnish only fetches from the swift service [13:17:54] <_joe_> ok [13:18:04] of course all of this is under some active planned upcoming changes with Thumbor, etc [13:18:11] <_joe_> so it might well be possible we did something like this for prerendering images [13:18:18] _joe_: right [13:18:34] <_joe_> bblack: what is the svc name for the cache_upload cluster in eqiad? [13:18:48] _joe_: what do you mean by "svc name"? [13:18:51] <_joe_> (I'd still go through varnish, if possible) [13:19:08] <_joe_> bblack: the A record for the ip associated with the service [13:19:24] <_joe_> it's a public ip, too, mh [13:19:30] there is no internal service at all, there's just the public hostname "upload.wikimedia.org", which is geographically-routed [13:19:34] <_joe_> this doesn't make sense. [13:19:56] DNS also contains e.g. upload-lb.eqiad.wikimedia.org, but that's more for internal/debug purposes [13:20:04] no app should be using that hostname [13:20:14] really, maybe we should kill those "debug" hostnames so that nobody ever does :P [13:20:19] <_joe_> I need filippo to sort this out probably, he did the initial work with gilles [13:21:09] for hostnames like "upload.wikimedia.org" (or for that matter, text-lb, etc...) 
[13:21:35] the real inbound DNS for the public name uses geographic routing to pick one of 4x datacenter IPs [13:21:57] we put revdns on those 4x IPs as e.g. XXX IN PTR upload-lb.esams.wikimedia.org just as an FYI [13:22:06] and then we define matching forward DNS again just as FYI/debugging [13:22:22] but we really don't want users or apps actually *using* upload-lb.esams.wikimedia.org [13:22:24] cp3014.esams.wmnet: pooled changed no => yes [13:22:24] cp3014.esams.wmnet: weight changed 1 => 3 [13:22:35] so we're done with the machines with weight 3 [13:22:41] ema: \o/ [13:22:56] I'll take a small break and then proceed with those with weight 10, bumping 5 => 10 [13:23:13] \o/ [13:23:17] I'm inclined to remove the forward DNS, but it seems kinda ugly to have the dangling PTR too [13:23:43] we could simply have them all do their revdns directly as "upload.wikimedia.org" I guess [13:24:25] or just change those names to something like upload-lb-omg-dont-ever-use-this.esams.wikimedia.org [13:27:14] _joe_: in any case (a) we do have explicit config-geo mappings to ensure our own eqiad servers always resolve to eqiad, etc..., but... [13:27:47] (b) our resolv.conf commonly falls back x-dc, which means if due to a random hiccup a DNS resolution in codfw fails to get a response from the codfw recdns machines, it will try the eqiad ones and get eqiad geoip routing :P [13:28:35] there's two ways to fix that really: [13:30:02] 1) Fix our recdns to always use the local DC's servers, at least for eqiad+codfw if not for the others yet. Probably involves fixing https://phabricator.wikimedia.org/T104442 first... 
[13:30:26] or 2) Find a way to make our recdns boxes implement edns-client-subnet [13:35:38] the argument in favor of the current -lb hostnames is it's convenient when debugging things to be able to e.g.: [13:36:13] curl https://en.wikipedia.org/wiki/Foo --resolve en.wikipedia.org:443:XXXXX (where XXXXX is the IP for text-lb.esams.wikimedia.org) [13:36:31] without the ability to do "host text-lb.esams.wikimedia.org" it's hard to remember or go find in the dns repo those IPs [13:37:34] (and many similar debugging/testing things) [13:50:36] cp3030.esams.wmnet: pooled changed no => yes [13:50:36] cp3030.esams.wmnet: weight changed 1 => 5 [13:53:25] bblack: I wrote a small varnish dstat plugin, to try it out become "ema" on cp3030 and try dstat -cdn --varnish 10 [13:54:39] ema: is that tracking both varnishd's combined? [13:55:02] nope that's just the default (backend) [13:55:21] ah ok [13:56:02] bblack: I've changed it to track the frontend [13:56:11] yeah I guess you could do both as separate columns too [13:56:24] right [13:56:41] too many hits on the frontend though, colors change all the time [13:57:51] cp3030.esams.wmnet: weight changed 5 => 10 [14:00:08] so I just had to deal with this the other day sorting out some stats for total hitrate through the layers [14:00:21] we really don't have an easy way to know that, it's always more-complicated than you'd think [14:01:00] if we're looking at the traffic/varnish layers as one whole black box and want to know the hitrate (for all reqs, or a cluster, or a hostname, or whatever) [14:01:21] we basically have to look at the webrequest logs and filter on the contents of the x_cache field (which is the X-Cache response header to the user) [14:01:29] cp3031.esams.wmnet: pooled changed no => yes [14:01:29] cp3031.esams.wmnet: weight changed 1 => 5 [14:01:53] but even then, it's tricky [14:02:08] if x_cache ~ /hit/, it's definitely a cache-hit from the black box perspective [14:02:47] if x_cache !~ /hit/ && x_cache ~ 
/miss|pass/, it's definitely a real miss or pass, meaning it hit the backend appservers [14:03:08] but there's also an inbetween case, where x_cache doesn't match /hit|pass|miss/, because it was a response generated by varnish itself [14:03:19] (which happens for e.g. HTTPS redirects, and also beacon 204 responses) [14:03:24] right [14:03:40] couldn't we consider that a hit then? [14:03:51] it depends on why you're looking [14:03:58] as in, we didn't hit the backend appservers [14:04:04] :) [14:05:04] if the goal is to determine hitrate as in "percentage of requests that didn't hit the applayer", then the right thing for that is to count non-hits as "x_cache !~ hit && x_cache ~ /miss|pass/", and hits as the opposite. [14:06:10] (hits being x_cache ~ /hit/ || x_cache !~ /miss|pass/) [14:06:22] cp3031.esams.wmnet: weight changed 5 => 10 [14:06:43] that's all from the webrequest log perspective, which is based on the output headers at the frontend varnish [14:07:18] (which varnishlog doesn't have access to) [14:07:38] or varnishstat for that matter [14:08:09] hmmm, doesn't have access to as input filters anyways [14:08:37] X-Cache is cleaner and more-meaningful than it used to be on a layer-by-layer basis, but needs more refinement [14:09:13] webrequest also logs a field like "cache_status":"hit", which it presumably infers from x-cache [14:09:22] but I know from looking at the data that it's not inferring it correctly [14:09:28] :) [14:10:22] "x_cache":"cp1074 hit(28), cp4015 hit(9), cp4014 frontend hit(1205)" [14:10:44] hit can also be "pass" or "miss", which actually indicate the disposition as it passed through that one machine [14:11:07] and then pass|miss can also have a trailing "+chfp" to indicate we went through the code that Creates Hit For Pass objects [14:11:17] (which is common, especially for "pass") [14:11:44] cp3040.esams.wmnet: pooled changed no => yes [14:11:44] cp3040.esams.wmnet: weight changed 1 => 5 [14:12:05] the additional effect to think about 
is that on a frontend cache hit, the headers from beneath are cached from that fetch (so they don't increment, and might say "miss" even though the object is now cached and hitting down at that layer now) [14:12:31] so: [14:12:42] /hit/ -> definitely a hit [14:13:02] !/hit/ && /miss|pass/ -> definitely an applayer fetch [14:13:12] none of the above -> varnish-internal response [14:14:29] the varnish-internal ones look like: [14:14:36] wait, "definitely a hit" == "definitely a frontend hit"? [14:14:40] "uri_host":"ja.wikipedia.org","uri_path":"/beacon/impression",....,"x_cache":"cp4017 frontend (0)" [14:15:06] ema: no, everything I'm saying above is from the "all of traffic infra as one black box" perspective [14:15:22] OK, we don't care [14:15:37] don't care which layer hit, right [14:15:41] right [14:16:10] we're assuming we'd never see a more-frontend layer "pass" while a deeper layer does "hit" - if so we probably have a VCL bug [14:16:19] cp3040.esams.wmnet: weight changed 5 => 10 [14:16:44] so usually if any layer has "hit", it will be in combination with "hit" or "miss" at others recorded in the same x-cache [14:16:56] (and will be a future-hit in all recorded machines for other requests) [14:18:13] miss,miss,miss -> full miss. miss,miss,hit -> you just hit the frontend cache object created in the earlier miss,miss,miss, and it's actually stored in all 3 now, but the miss,miss, part is cached with the fe cache object [14:20:17] or hit,miss,miss -> missed the front 2 layers, found a hit at the bottom. then you'd probably see hit,miss,hit for a repeat of the same request right after. 
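[Editor's note: the black-box classification bblack sketched above (/hit/ -> hit; else /miss|pass/ -> applayer fetch; else varnish-internal), plus the matching "didn't hit the applayer" hitrate definition, can be expressed as a small Python helper. The first two sample x_cache strings are quoted from the log; the miss/pass samples use made-up hostname combinations in the same format.]
```python
import re

def classify_x_cache(x_cache):
    """Classify a webrequest x_cache field, treating the whole varnish
    stack as one black box, per the rules in the discussion above."""
    if re.search(r'\bhit\b', x_cache):
        return 'hit'        # some cache layer served it
    if re.search(r'\b(miss|pass)\b', x_cache):
        return 'applayer'   # real miss or pass: the backend appservers were hit
    return 'internal'       # generated by varnish itself (HTTPS redirects, beacon 204s)

def hitrate(x_cache_values):
    """Fraction of requests that did NOT reach the applayer; by this
    definition, varnish-internal responses count on the 'hit' side."""
    labels = [classify_x_cache(x) for x in x_cache_values]
    return sum(1 for l in labels if l != 'applayer') / len(labels)

# first two samples quoted from the log, third is illustrative
print(classify_x_cache("cp1074 hit(28), cp4015 hit(9), cp4014 frontend hit(1205)"))
print(classify_x_cache("cp4017 frontend (0)"))
print(classify_x_cache("cp1066 miss(0), cp1067 miss(0), cp3040 frontend miss(0)"))
```
Note the order of the checks matters: a "hit, miss, miss" string must classify as a hit, and the `+chfp` suffix on pass/miss entries still matches on the word boundary.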
(which is actually in all 3 now, but again the hit,miss, prefix is cached in that frontend object) [14:21:13] cp3041.esams.wmnet: pooled changed no => yes [14:21:13] cp3041.esams.wmnet: weight changed 1 => 5 [14:21:29] if it weren't for the special-case of the varnish-internal ones that don't log anything really in x_cache, you could just split them on whether x-cache matched /hit/ or not. [14:22:58] let's start adding to x_cache on varnish-internal ones then :) [14:23:03] yeah [14:23:10] tricky though! [14:24:08] well, maybe not [14:26:22] cp3041.esams.wmnet: weight changed 5 => 10 [14:26:26] done adding! [14:28:07] awesome [14:32:33] bblack: I'll soon start removing mobile nodes if you agree [14:33:40] yup! [14:39:01] cp3015.esams.wmnet: pooled changed yes => no [14:40:29] bblack: are you upgrading packages with salt? [14:43:40] basically yeah [14:43:58] I did a few machines manually already [14:44:01] now all ulsfo [14:44:29] *usually* not many OS updates have a chance to impact caches negatively. and our varnish/nginx packages in particular are our own. 
[14:44:51] so usually once in a while I test the latest apt-get upgrade on a machine or two, then apply it everywhere, just to keep up with sec fixes, etc [14:44:58] and in this case, kernel updates too [14:45:13] also there are sometimes perf improvements in new kernels [14:45:40] cp3016.esams.wmnet: pooled changed yes => no [14:46:11] well our kernel package updates would be our 3.19's [14:46:15] they're just bugfix stuff [14:47:07] we should be following bleeding edge kernels to have *more* stuff to do :p [14:50:09] cp3017.esams.wmnet: pooled changed yes => no [14:55:06] cp3018.esams.wmnet: pooled changed yes => no [14:55:13] done with esams \o/ [14:55:21] https://gerrit.wikimedia.org/r/266499 [15:00:49] bblack: ^ [15:04:06] :) [15:58:20] 10Traffic, 10RESTBase, 6Services, 6operations: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1965843 (10GWicke) @bblack, there are still users for rest.wikimedia.org. I sent a reminder and announced a shut-down date for March. If we set up a redirect (or rewrite) for the dom... [16:30:47] 10Traffic, 6Performance-Team, 6operations: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1965960 (10BBlack) Quick update, I did a small re-check on just a single text node in esams (mobile + desktop text traffic, random subsample of IPs, mostly in Europe) for 5 minutes: | Protocol | Percentage... [16:40:27] hiyaaaa just checkin in. how goes the mobile -> text thang? [16:41:36] ottomata: done in esams [16:42:06] we only have eqiad left [16:42:34] aye k cool [17:58:24] 10Traffic, 10Citoid, 6Services, 6operations: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1966263 (10mobrovac) >>! In T110476#1965408, @BBlack wrote: > Are things still using the hostnames `citoid.wikimedia.org` and/or `citoid.eqiad.wikimedia.org`, which map to the cache_pars... 
[18:12:56] 10Traffic, 10Graphoid, 6Services, 6operations: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1966309 (10mobrovac) AFAIK, `graphoid.(eqiad.)wikimedia.org` can be safely removed. [18:40:20] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966496 (10Vituzzu) [18:42:22] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966502 (10Aklapper) @Vituzzu: Thanks for reporting this. https://gerrit.wikimedia.org/r/#/c/266551/ got reverted so things should be back to normal. Can you confirm (by bypas... [18:43:32] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966507 (10I_JethroBT) Agreed, meta.wikimedia.org has been completely replaced with a broken-ish landing page for Wikimedia projects: {F3283563} [18:44:53] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966511 (10Aklapper) and https://commons.wikimedia.org/wiki/Commons:Village_pump redirects me to wmf: [18:45:39] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: Meta and Commons seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966518 (10Aklapper) [18:45:51] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: Meta and Commons seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966520 (10I_JethroBT) @Aklapper after bypassing my cache, meta.wikimedia.org is still gone. [18:46:51] <_joe_> bblack: around? [18:52:53] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: Meta and Commons seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966553 (10Vituzzu) @Aklapper still doesn't work for me. 
I'm currently served by Amsterdam's cluster btw.
[18:53:06] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: Meta, Commons, Wikispecies seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966554 (10OhanaUnited)
[18:53:33] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: Meta, Commons, Wikispecies seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966462 (10OhanaUnited) Wikispecies also has the same issue
[18:54:47] 10netops, 10MediaWiki-extensions-CentralAuth, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966572 (10matmarex)
[18:55:42] 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966580 (10matmarex)
[18:55:57] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966462 (10matmarex)
[18:56:00] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966592 (10Dzahn) The remaining issues are because a tagged puppet run is now executed on all appservers, which...
[18:56:03] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966593 (10Tbayer) Office.wikimedia.org is affected too, just for the record.
[18:56:39] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966597 (10Dzahn) >>! In T124804#1966593, @Tbayer wrote: > Office.wikimedia.org is affected too, just for the r...
[18:57:04] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966603 (10MZMcBride) This issue is definitely going to require incident documentation (
10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966615 (10Izno) >>! In T124804#1966597, @Dzahn wrote: > > everything under .wikimedia.org is affected but not...
[18:58:19] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966618 (10Mike_Peel) >>! In T124804#1966597, @Dzahn wrote: > > everything under .wikimedia.org is affected bu...
[19:00:08] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966649 (10Dzahn) >>! In T124804#1966618, @Mike_Peel wrote: >>>! In T124804#1966597, @Dzahn wrote: >> >> every...
[19:04:35] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966668 (10Mike_Peel) This kind of outage should probably appear on http://status.wikimedia.org/ ... (unless th...
[19:06:45] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966674 (10Pine) @Mike_Peel agreed.
[19:08:57] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966694 (10Pine) Update: Commons is working now, but not Meta.
[19:11:11] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966709 (10matmarex) >>! In T124804#1966668, @Mike_Peel wrote: > This kind of outage should probably appear on...
[19:12:25] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966713 (10Pine) Commons is down again for me.
[19:13:44] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966722 (10Vituzzu) "There is no user by the name "Vituzzu". Check your spelling." again at meta.
[19:13:59] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966723 (10Aklapper) >>! In T124804#1966713, @Pine wrote: > Commons is down again for me. Please see T124804#1...
[19:15:06] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966736 (10matmarex)
[19:15:16] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966738 (10Dzahn) >>! In T124804#1966668, @Mike_Peel wrote: > This kind of outage should probably ap...
[19:16:18] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966744 (10RobH) Operations is still working on this issue. At this time the underlying issue has b...
[19:17:38] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966746 (10Pine) @Dzahn @RobH thank you.
[19:19:31] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966757 (10jcrespo) At 7:36 PM, for reasons operations team has not yet investigated, a wrong config...
[19:23:35] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966774 (10Mike_Peel) >>! In T124804#1966757, @jcrespo wrote: > @Mike_Peel That panel is not handled...
[19:23:52] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966776 (10jcrespo) Correction, it was 19:23 UTC.
[19:24:52] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966779 (10jeblad) ...and from Oslo, 10 points for well-done cleanup! :)
[19:25:29] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966782 (10jcrespo) > The webpage showing the status of operations isn't handled by the operations t...
[19:38:24] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966833 (10MZMcBride) >>! In T124804#1966722, @Vituzzu wrote: > "There is no user by the name "Vituz...
[19:38:29] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966835 (10Vituzzu) >>! In T124804#1966833, @MZMcBride wrote: >>>! In T124804#1966722, @Vituzzu wrot...
[19:38:34] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966836 (10jcrespo) Update: while we believe most issues have been solved now, the caching purge has...
[19:39:41] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966845 (10Harej)
[19:41:04] 7Varnish, 10MediaWiki-Vagrant: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1966857 (10Mattflaschen) 5declined>3Open >>! In T124711#1965119, @Gilles wrote: > I'm not even sure that it needs a home directory at all, I was just following a pattern of user creation in Vagrant I had...
[19:45:38] bblack: another reason for un-special-casing testwiki is multi-dc -- testwiki should not go down in case of a DC failover, but it's OK for x-wikimedia-debug functionality to be disabled
[19:46:39] so testwiki should still work through non-test mw* hosts, even if the test hosts are dead
[19:48:48] yeah
[20:12:34] 7Varnish, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Wikimedia-Fundraising, and 3 others: [EPIC] Special:RecordImpression should be used at a very low sample rate - https://phabricator.wikimedia.org/T45250#1966952 (10DStrine)
[20:14:13] 7Varnish, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Wikimedia-Fundraising, and 3 others: [EPIC] Special:RecordImpression should be used at a very low sample rate - https://phabricator.wikimedia.org/T45250#1045534 (10DStrine) fr-tech has discussed this a few times. We have a few steps l...
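The T45250 epic above asks that Special:RecordImpression be hit at a very low sample rate. The usual client-side approach can be sketched as follows; the 1% default is a placeholder, since the task fixes no number:

```python
# Hedged sketch of client-side impression sampling: report only a small
# random fraction of impressions so the endpoint sees ~sample_rate of traffic.
import random

def should_record(sample_rate: float = 0.01, rng=random.random) -> bool:
    """Return True for roughly sample_rate of calls.
    `rng` is injectable so the decision is testable deterministically."""
    return rng() < sample_rate
```

Because each client decides independently, server-side counts must be scaled up by 1/sample_rate to estimate true impression totals.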
[21:48:26] 10Traffic, 6operations, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1967490 (10RobH) 3NEW a:3BBlack
[21:49:23] 10Traffic, 6operations, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1967498 (10RobH) I initially assigned this to @bblack, but it can be ac...
[22:07:20] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1967560 (10MZMcBride) >>! In T124804#1966836, @jcrespo wrote: > Followup will be on this ticket and...
[22:10:48] 10Wikimedia-Apache-configuration, 10netops, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1967564 (10RobH) 5Open>3Resolved a:3RobH resolving as I've sent the outage notification to the...
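T124835 above is about documenting the varnish ban procedure for outages like this one. A minimal sketch of assembling the `varnishadm ban <field> <operator> <argument>` invocation (Varnish 3-era syntax); fanning it out to every cache host, and the example pattern itself, are assumptions left to the operator:

```python
# Hedged sketch: build the argv for one varnishadm ban. A ban marks all
# matching cached objects invalid on the host it is run on, so during an
# incident it must be issued on every relevant cache node.

def varnish_ban(field: str, oper: str, arg: str) -> list:
    """e.g. varnish_ban('req.http.host', '~', 'meta\\.wikimedia\\.org')"""
    return ["varnishadm", "ban", field, oper, arg]
```

Keeping the expression as three separate argv entries avoids a layer of shell quoting, which is exactly the kind of detail the task wants written down so it is not improvised mid-outage.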
[22:37:15] 10Wikimedia-Apache-configuration, 10netops, 10incident-20160126-WikimediaDomainRedirection, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1967786 (10greg)
[22:38:50] 10Traffic, 10incident-20160126-WikimediaDomainRedirection, 6operations, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1967813 (10greg)
[23:11:27] 10netops, 6operations: Peer with SFMIX at ULSFO with 200 Paul - https://phabricator.wikimedia.org/T124843#1967980 (10Reedy) 3NEW
[23:16:41] 10netops, 6operations: Peer with SFMIX at ULSFO with 200 Paul - https://phabricator.wikimedia.org/T124843#1968012 (10Dzahn) {meme, src=votecat} let me know if you need smart hands at ulsfo for this
[23:25:53] 10Wikimedia-Apache-configuration, 10netops, 10Incident-20160126-WikimediaDomainRedirection, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1968040 (10TheDJ) >>! In T124804#1966782, @jcrespo...
[23:45:44] 10netops, 6operations: Peer with SFMIX at ULSFO in 200 Paul - https://phabricator.wikimedia.org/T124843#1968096 (10Reedy)
[23:51:36] 10Traffic, 6operations: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1968101 (10BBlack) 5Open>3Resolved Fixed up https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging
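The multicast HTCP purging documented in T82096 above works by MediaWiki broadcasting one small UDP packet per URL to invalidate. A hedged sketch of building an HTCP CLR packet, modeled loosely on MediaWiki's HTCP purge code and RFC 2756; the exact field layout here is an assumption, and the multicast group/port in the comment are placeholders:

```python
# Hedged sketch: construct an HTCP CLR ("forget this URL") packet.
# Layout follows the MediaWiki-style packing as I understand it; verify
# against RFC 2756 before relying on any field offset.
import random
import struct

HTCP_OP_CLR = 4  # opcode: clear/invalidate

def htcp_clr_packet(url: str) -> bytes:
    u = url.encode()
    # DATA specifier: length-prefixed method, URI, version, empty headers
    spec = struct.pack("!H4sH%dsH8sH" % len(u),
                       4, b"HEAD", len(u), u, 8, b"HTTP/1.0", 0)
    data_len = 10 + len(spec)   # DATA header (opcode, trans-id, etc.) + specifier
    pkt_len = 6 + data_len      # + 4-byte packet header and 2-byte trailing AUTH
    return struct.pack("!HxxHBxIxx%dsH" % len(spec),
                       pkt_len, data_len, HTCP_OP_CLR,
                       random.getrandbits(32), spec, 2)

# Sending would then be a plain multicast UDP sendto; group/port are placeholders:
#   sock.sendto(htcp_clr_packet(url), ("239.128.0.112", 4827))
```

Every cache host joins the multicast group and a listener (vhtcpd in WMF's setup) translates each packet into a local cache purge, so one send invalidates the URL cluster-wide.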