[02:27:31] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985525 (10BBlack) Well, we have 3 different stages of rate-increase in the insert graph, so it could well be...
[02:29:58] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985526 (10BBlack) Continuing with some stuff I was saying in IRC the other day. At the "new normal", we're s...
[02:34:20] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985528 (10Legoktm) >>! In T124418#1985525, @BBlack wrote: > it's not like we gained a 5x increase in human ar...
[14:00:16] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986077 (10BBlack) Another data point from the weekend: In one sample I took Saturday morning, when I sampled...
[15:10:58] hiyaaa
[15:11:23] just checking in, i'm looking at cp1060
[15:11:34] looks like it has just as many requests flowing through it as cp1052?
[15:11:35] am I wrong?
[15:12:06] OH
[15:12:11] because cp1052 is now a text?
[15:12:21] ahhhh
[15:12:22] got it
[15:12:23] :)
[15:13:47] joal: fyi^^
[15:13:59] i see about 60/sec in webrequest_mobile topic still
[15:14:46] huh?
[15:14:55] cp1052 is not now a text
[15:15:20] oh sorry, cp1052 is a text and always was. I thought you were referring to an ex-mobile machine
[15:15:54] cp1060 should be getting about 10% of the mobile request flow that cp1052 gets, and 0% of the text request flow
[15:16:27] probably (unless it's filtered earlier?) the bulk of all cache_mobile traffic at this point is internal monitoring
[15:16:40] and then we have that very small slice of real user mobile traffic still hitting cp1060
[15:18:42] sorry, i just looked here
[15:18:42] http://config-master.wikimedia.org/pybal/eqiad/mobile
[15:18:49] i usually look there for finding mobiles
[15:19:03] :)
[15:19:24] well that list is accurate, but the way the transition is being handled, that file is mostly full of cache_text machines now :)
[15:19:42] (compare to eqiad/text URL)
[15:20:43] oook, cool
[15:20:43] hm
[15:21:14] bblack, cp1060 is the only remaining 'mobile' cache, right?
[15:21:21] so it should be the only one sending to the webrequest_mobile topic?
[15:22:46] well
[15:22:58] not exactly, no
[15:23:51] cp1060 is the only machine from the previous cache_mobile set of machines that still gets mobile user traffic coming into the front of it, at something like a 1/81 share of eqiad's mobile users (which is in turn probably something in the ballpark of 30-40% of all mobile users? something like that)
[15:24:19] but the rest of cache_mobile are still up and running and configured to send to webrequest_mobile, too
[15:24:30] and there's still various sorts of monitoring and healthcheck and purge traffic flowing into them
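
[editor's sketch] The "compare to eqiad/text URL" check above can be done mechanically against the two config-master pool files. This is a minimal sketch, not an existing tool; it assumes only that both files are plain text containing cp*.eqiad.wmnet hostnames, and extracts them with a regex rather than relying on the exact per-line layout of the pybal pool format:

    import re
    import urllib.request

    BASE = "http://config-master.wikimedia.org/pybal/eqiad/"

    def pool_hosts(pool):
        # Fetch a pybal pool file and pull out anything that looks like a cp host.
        text = urllib.request.urlopen(BASE + pool, timeout=10).read().decode("utf-8")
        return set(re.findall(r"cp\d+\.eqiad\.wmnet", text))

    mobile_pool = pool_hosts("mobile")
    text_pool = pool_hosts("text")

    # Hosts listed only in the mobile file vs. hosts now shared with cache_text.
    print("mobile-only  :", sorted(mobile_pool - text_pool))
    print("also in text :", sorted(mobile_pool & text_pool))

Because hostnames are pulled out by pattern rather than by parsing the file format, the comparison keeps working even if the pool file layout changes during the transition.
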
[15:24:47] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986305 (10Lydia_Pintscher) Very strange. Wikidata use on templates on talk pages isn't impossible but I'd con...
[15:24:51] I don't know at which stage (or any) those get filtered out before they hit what you're looking at on the analytics side
[15:27:11] 10Wikimedia-DNS, 6operations, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1986311 (10elukey) @yuvipanda: What are the next steps for this? Do you need more reviewers to get added?
[15:27:12] wait, what?
[15:27:34] are you running 2 varnishkafka instances on the text machines that are now serving mobile traffic?
[15:27:43] no
[15:27:59] are you running 2 varnish instances?
[15:28:17] a single instance of varnishkafka can only send to one topic
[15:28:38] the mobile traffic (as in, user traffic) moved from being pointed at the cache_mobile machines which log to webrequest_mobile, to being pointed at the cache_text machines which log to webrequest_text
[15:28:39] oh
[15:28:55] yes right
[15:29:05] ok, so the other mobiles are still up, but just serving internal traffic
[15:29:12] and they are logging to webrequest_mobile
[15:29:13] right
[15:29:13] ok
[15:29:20] but I left a small fraction still pointing at cp1060 so that your flow of webrequest_mobile (after whatever filtering may be in place for internal healthchecks/purges and such) didn't drop to zero.
[15:29:29] right, ok cool
[15:29:49] so cp1060 is the only instance that serves user traffic that is logging to webrequest_mobile
[15:29:50] correct?
[15:29:54] yes
[15:30:07] and I'd like to pull that, but waiting on your ok
[15:30:12] right
[15:30:19] ok, am hoping to work with joal on that today
[15:30:21] ok
[15:31:20] hmm, actually,
[15:31:29] as long as any data flows in on webrequest_mobile
[15:31:33] even if it is internal
[15:31:38] you can pull cp1060 now
[15:31:52] our stuff will only block when there is 0 data in webrequest_mobile
[15:32:14] i see plenty of non-cp1060 reqs now, most to /wiki/Special:BlankPage
[15:32:22] internal monitoring traffic, i assume
[15:33:13] yeah that's pybal healthchecking them
[15:33:15] ok
[15:33:18] thanks :)
[15:33:25] will coordinate before they lose all their monitoring flow too
[15:34:40] ok cool, well, we don't use that traffic for anything analyticsy (pretty sure)
[15:34:42] so, this is good bblack
[15:34:45] you can move cp1060
[15:34:49] and then we can fix our analytics jobs
[15:34:54] without a race condition
[15:35:10] will comment on ticket
[15:35:16] ok thanks!
[15:37:22] _joe_: we have issues with the pybal->etcd stuff I think :/
[15:37:40] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1986404 (10Ottomata) I just talked to @bblack, and also looked at requests in the `webrequest_mobile` topic in Kafka. There are still real user requests from cp1060, but most of...
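
[editor's sketch] The "plenty of non-cp1060 reqs, most to /wiki/Special:BlankPage" observation above comes from sampling the webrequest_mobile topic. A hypothetical way to do that with kafka-python is sketched here; the broker address is a placeholder, and the JSON field names (hostname, uri_path) are assumptions about the varnishkafka webrequest records, not confirmed in this log:

    import json
    from collections import Counter

    from kafka import KafkaConsumer  # kafka-python

    consumer = KafkaConsumer(
        "webrequest_mobile",
        bootstrap_servers=["BROKER_PLACEHOLDER:9092"],  # placeholder broker
        auto_offset_reset="latest",
        consumer_timeout_ms=30000,  # give up after 30s with no new messages
    )

    by_host = Counter()
    by_path = Counter()
    for i, msg in enumerate(consumer):
        rec = json.loads(msg.value)
        by_host[rec.get("hostname")] += 1   # assumed field: producing cache host
        by_path[rec.get("uri_path")] += 1   # assumed field: requested path
        if i >= 10000:  # a 10k-message sample is enough for a sanity check
            break

    print("top producing caches:", by_host.most_common(10))
    print("top request paths   :", by_path.most_common(10))

If nearly all remaining paths are healthcheck targets like /wiki/Special:BlankPage, the leftover volume is the pybal/monitoring traffic discussed above rather than real users.
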
[15:38:19] _joe_: I just did a depool on cache_mobile cp1060.eqiad.wmnet service=nginx. went to confirm on lvs1001, and noticed it wasn't dropped from the live ipvsadm backends
[15:38:41] journal logs show some kind of related exception happened back on the 29th, seems likely it hasn't polled etcd since
[15:40:23] https://phabricator.wikimedia.org/P2549
[15:42:43] _joe_: I restarted pybal on lvs1003 and it recovered and correctly removed cp1060 from today's depool
[15:42:50] will wait on lvs1001 if you want to look at it
[15:46:17] oh doh
[15:46:23] 15:37 <@bblack> _joe_: we have issues with the pybal->etcd stuff I think :/
[15:46:26] 15:38 <@bblack> _joe_: I just did a depool on cache_mobile cp1060.eqiad.wmnet service=nginx. went to confirm on lvs1001, and noticed it wasn't dropped from the live ipvsadm backends
[15:46:30] 15:38 <@bblack> journal logs show some kind of related exception happened back on the 29th, seems likely it hasn't polled etcd since
[15:46:33] 15:40 <@bblack> https://phabricator.wikimedia.org/P2549
[15:46:36] 15:42 <@bblack> _joe_: I restarted pybal on lvs1003 and it recovered and correctly removed cp1060 from today's depool
[15:46:39] 15:42 <@bblack> will wait on lvs1001 if you want to look at it
[15:47:52] kinda looks like (from the exception) that the TCP connection to etcd was dropped at an unexpected time or something?
[15:48:17] <_joe_> bblack: not sure, I am looking
[15:48:26] <_joe_> that's bad :/
[15:48:42] <_joe_> so that happened on lvs1004 and 1001?
[15:48:52] I meant to say lvs1001 + lvs1004 above, not 1003
[15:49:00] <_joe_> and not on others in any other pool?
[15:49:10] well they poll different data
[15:49:11] <_joe_> I guess there is some race condition somewhere
[15:49:22] <_joe_> 1001 and 1004?
[15:49:32] <_joe_> actually they're in the same class AFAIR
[15:49:43] the last transaction I did on the 29th was the setting of cp1060 weight=>1, which is the successful txn at 10:25:19 in the long
[15:49:46] *log
[15:49:53] yes, 1001 and 1004 poll the same data
[15:50:01] other classes didn't seem to get hit
[15:50:02] <_joe_> anyways, upon such an error, it should just re-read the config from etcd again
[15:50:20] well today (just now) I depooled cp1060 from confctl, and pybal/ipvs didn't react
[15:50:32] restarting pybal on lvs1004 made it check etcd and pull that machine out
[15:50:40] <_joe_> bblack: yeah that's because the etcd coroutine was dead
[15:50:47] <_joe_> from reading the code
[15:51:00] <_joe_> so ok this needs to be UBN! I guess
[15:51:15] <_joe_> but first let me work on tin today if it's possible
[15:51:52] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986445 (10daniel) @BBlack can you give a few examples of such pages on srwiki? Were these non-existant talk p...
[15:58:12] _joe_: ok :)
[16:00:31] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986463 (10BBlack) @daniel - Sorry I should have linked this earlier, I made a paste at the time: P2547 . Not...
[16:03:08] 10Wikimedia-DNS, 6operations, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1986468 (10elukey) p:5Triage>3Normal
[16:04:02] _joe_: want a real task?
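
[editor's sketch] For context on the failure mode _joe_ describes above (the etcd watch coroutine dying, so pybal never re-reads config until restarted), here is an illustrative reconnect loop against the etcd v2 HTTP API. It is not pybal's actual code; the endpoint and key path are placeholders. The point is that a dropped connection, timeout, or bad index falls through to a fresh full read instead of silently stopping the watcher:

    import time

    import requests

    ETCD = "http://ETCD_PLACEHOLDER:2379"   # placeholder endpoint
    KEY = "/v2/keys/SOME/POOL/PATH"         # placeholder key path

    def watch_forever(on_change):
        wait_index = None
        while True:
            try:
                params = {"recursive": "true"}
                if wait_index is not None:
                    params.update({"wait": "true", "waitIndex": str(wait_index)})
                resp = requests.get(ETCD + KEY, params=params, timeout=(5, 90))
                resp.raise_for_status()
                data = resp.json()
                # Real code would track X-Etcd-Index more carefully to avoid
                # missing events; this simply resumes after the last change seen.
                wait_index = data["node"]["modifiedIndex"] + 1
                on_change(data)
            except (requests.RequestException, KeyError, ValueError) as exc:
                # A dropped TCP connection, long-poll timeout, or malformed
                # response lands here: back off and fall back to a full read,
                # rather than dying the way the crashed coroutine did.
                print("etcd watch error, re-syncing:", exc)
                wait_index = None
                time.sleep(5)

Having the watcher exit loudly (so a supervisor restarts the daemon) is the other common way to avoid the "ipvsadm never updates" state seen on lvs1001/lvs1004.
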
[16:06:12] 10Traffic, 6operations, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1986480 (10BBlack) 3NEW a:3Joe
[16:06:16] T125397
[16:06:17] <_joe_> bblack: thanks
[16:24:09] hey folks
[16:24:35] we still have no #codfw-rollout-Jan-Mar-2016 #traffic tasks
[16:24:37] 10Traffic, 6operations, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1986530 (10Joe) So, I guess this has been some sort of weird race condition. Or that etcd responded with stale/spurious data in some way (the way recursive watch works might have caused that).
[16:24:51] paravoid: yeah, I know
[16:25:30] really, we only talked about it before informally, as in some conversational line like "Yeah, it should be possible to flip X and Y and then do Z and totally do a manual tier-1 switch without solving the encryption problem first"
[16:25:52] but actually sorting that into a real actionable plan is... complicated.
[16:26:13] I'm trying to come up with a task that seems reasonable, though!
[16:32:40] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986560 (10Lydia_Pintscher) Thanks. I looked at one of them and the only thing in the page is the template for...
[16:34:23] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986566 (10daniel) Yea, looks like the srwiki talk pages wasn't us, but an edit to a much-used template.
[16:42:55] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986594 (10BBlack) Regardless, the average rate of HTCP these days is normally-flat-ish (a few scary spikes as...
[16:50:21] 10Traffic, 6operations: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986611 (10BBlack) 3NEW
[16:52:55] 10Traffic, 6operations: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986628 (10BBlack) See also: T125265 + T124418
[16:53:29] 10Traffic, 6operations: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986633 (10BBlack) And maybe-related: T122455
[16:55:43] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1986637 (10BBlack) cp1060 is depooled for users now. Once Analytics is done with their oozie thing, we can proceed on the next steps for actually stopping the cache_mobile clust...
[19:25:33] 10Wikimedia-Apache-configuration, 10netops, 10Incident-20160126-WikimediaDomainRedirection, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987246 (1...
[19:27:12] 10Wikimedia-Apache-configuration, 10netops, 10Incident-20160126-WikimediaDomainRedirection, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987254 (1...
[19:27:44] 10Wikimedia-Apache-configuration, 10netops, 10Incident-20160126-WikimediaDomainRedirection, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987257 (1...
[19:32:07] 10Wikimedia-Apache-configuration, 10netops, 10Incident-20160126-WikimediaDomainRedirection, 6operations: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987284 (1...
[22:17:34] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987865 (10ori) >>! In T124418#1986594, @BBlack wrote: > Regardless, the average rate of HTCP these days is no...
[22:29:41] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987905 (10BBlack) @ori - yeah that makes sense for the initial bump, and I think there may have even been a f...
[22:33:41] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987938 (10JanZerebecki) Good find. That commit was first deployed in wmf8 which was branched on Dec 8 (rMWb24...
[22:59:32] 10Wikimedia-DNS, 6operations, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1988007 (10yuvipanda) I think the patch needs to get merged and babysat :)
[23:28:39] 10netops, 6operations: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916306 (10RobH) All procurement and onsite patch tasks have been completed. (They don't show resolved since they sit in the pending invoice column for a month pending invoice.) @Faidon sho...
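
[editor's sketch] As a footnote to the T125170 thread above, the reported behaviour (NXDOMAIN for a localhost AAAA query) can be reproduced by querying the resolver directly. A small sketch with dnspython; the resolve() call assumes dnspython >= 2.0, and by default it uses whatever /etc/resolv.conf points at, so set resolver.nameservers to the internal recursor's IP to test that recursor specifically:

    import dns.resolver  # dnspython

    resolver = dns.resolver.Resolver()
    # resolver.nameservers = ["<internal recursor IP>"]  # uncomment to target it

    for rdtype in ("A", "AAAA"):
        try:
            answer = resolver.resolve("localhost.", rdtype)
            print(rdtype, [r.to_text() for r in answer])
        except dns.resolver.NXDOMAIN:
            print(rdtype, "NXDOMAIN")  # the behaviour reported in the task
        except dns.resolver.NoAnswer:
            print(rdtype, "NOERROR but empty answer section")

A correct result would show 127.0.0.1 for A and ::1 for AAAA (or at worst an empty NOERROR answer), never NXDOMAIN.
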