[02:01:24] it's interesting that the 4-hit-wonder code seems to have done basically-nothing to the text frontend hitrate
[02:01:44] I'm quite certain our text requests have plenty of one-hit-wonders
[02:03:21] I'm starting to wonder if maybe we already are close to what's reasonably theoretically possible there in terms of hitrate (as in FE cache is huge), and the remaining frontend misses are almost all 1-4-ish hit-wonders
[02:03:33] in which case obviously the code to miss them intentionally does nothing but is mostly-harmless
[02:04:10] but if that were the case, then it's hard to explain how the larger backend caches get ~4% more hits
[02:05:12] in any case, not terribly important
[03:53:01] ema: esams done, just ulsfo left now
[05:58:52] Traffic, netops, DNS, Operations, ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2659577 (grin) (testing lurking on phabricator made me see this ;-)) my 2'cents: since defgw was not pingable I'd check (apart from arp) irqs on the machine, I suspect you've checked...
[06:01:33] Domains, Traffic, DNS, Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2661382 (Naveenpf) @Aklapper I know this phabricator ticket was opened for simple change from url forward to giving proper ip address to the web...
[08:13:46] Domains, Traffic, DNS, Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2661528 (Aklapper) The `Tags` above mentioned #Operations and #WMF-Legal.
[08:57:30] Traffic, Varnish, Analytics, Operations: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#2661614 (elukey) T138747 upgraded Varnishkafka to a new version able to start at any time and poll periodically the Varnish shm logs to see if they...
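A minimal sketch of the N-hit-wonder idea from the 02:01-02:04 messages, assuming the filter simply counts requests per key and only admits an object to the cache once it has been requested more than N times. The class, method, and parameter names are hypothetical, the cache is unbounded for simplicity, and this is not the code actually deployed on the cache hosts.

```python
from collections import Counter


class NHitWonderFilter:
    """Toy admission filter: cache a key only after more than n requests.

    Hypothetical illustration; names and behaviour are assumptions.
    """

    def __init__(self, n=4):
        self.n = n                # objects with <= n requests are never cached
        self.seen = Counter()     # request count per not-yet-cached key
        self.cache = set()        # keys currently admitted (no eviction here)
        self.hits = 0
        self.requests = 0

    def request(self, key):
        """Record one request; return True on a cache hit, False on a miss."""
        self.requests += 1
        if key in self.cache:
            self.hits += 1
            return True
        self.seen[key] += 1
        if self.seen[key] > self.n:
            # The key has outgrown "n-hit-wonder" status, so admit it.
            self.cache.add(key)
        return False

    def hitrate(self):
        """Fraction of all requests served from the cache."""
        return self.hits / self.requests if self.requests else 0.0
```

If nearly all remaining misses are for keys requested four times or fewer, a filter like this changes the hitrate very little, which would be consistent with the observation at 02:03:21.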
[09:12:16] vk upgraded in misc/maps, rollout completed :)
[10:34:03] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661922 (jcrespo) Adding Traffic so they can give it a quick look.
[11:19:47] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661961 (doctaxon) p:High>Unbreak! changed Priority because there have to run a lot of bot scripts Wikipedia users needs to work with it. The unbreak is open....
[11:26:34] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661965 (jcrespo) @doctaxon can you indicate the full url you are trying?
[11:37:54] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661968 (doctaxon) from chat about the topic: ``` 11:26 < wikibugs> Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now on...
[11:39:55] <_joe_> ema: around?
[11:40:07] _joe_: yep
[11:40:18] <_joe_> are you looking into this ^^
[11:40:48] now I am :)
[11:41:39] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661971 (Steinsplitter) Getting the problem when accessing the Wikimedia Commons api via labs or labs grid engine. For example when attempting to getting image info...
[11:43:22] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661551 (Joe) @Steinsplitter do you get the data correctly if you try from your computer?
[11:43:33] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661974 (doctaxon) is the error related to the cache proxies, if there are reports of all the cp1065, cp 1053, cp 1055 ...?
[11:44:42] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661975 (Joe) >>! In T146451#2661974, @doctaxon wrote: > is the error related to the cache proxies, if there are reports of all the cp1065, cp 1053, cp 1055 ...? It...
[11:47:33] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661977 (doctaxon) Who is responsible for that?
[11:51:39] I don't seem to be able to repro with https://de.wikipedia.org/w/index.php?title=Kurt_Couto&action=info
[11:51:55] I get a 200 with X-Cache: cp1055 pass, cp1053 pass
[11:52:40] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661984 (Steinsplitter) >>! In T146451#2661972, @Joe wrote: > @Steinsplitter do you get the data correctly if you try from your computer? Yes, ~ 40 successful attem...
[12:06:45] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662029 (doctaxon) next error trying this: https://de.wikipedia.org/w/index.php?title=Offshore-Windpark_Borssele&action=info ``` / format json / maxlag 5 / action q...
[12:08:16] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662035 (Joe) @doctaxon do you get an error consistently for that url? if so, trying from where? I still can't reproduce your problem, that seems not to be limited t...
[12:09:18] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662036 (doctaxon) no, it's not consistently but random, it's always API info up to now
[12:26:42] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662047 (doctaxon) runs good for 8 minutes now
[12:28:27] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662048 (ema) I've tried reproducing the issue for a while without success. @Joe restarted mw1280-90 due to memory leaks, perhaps that helped?
[12:31:12] Traffic, netops, DNS, Operations, ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2662061 (faidon) >>! In T146391#2661380, @grin wrote: > sorry for chiming in. :-) No reason to be sorry - thanks for the input!
[12:51:18] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662098 (doctaxon) Okay, I suppose, the problem has been solved. What have you done to solve it?
[12:53:46] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662101 (ema) @doctaxon: nothing, except for @Joe's restart of the HHVMs mentioned above.
[12:54:15] upload storage conversion finished
[12:56:05] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662109 (Joe) @doctaxon I tracked down `mw1203` and `mw1280-1290` as potential source of problems because of how much cpu/RAM they were consuming, and issued a rollin...
[12:58:20] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662110 (doctaxon) Top! Thank you very much!
[13:05:34] Traffic, Labs, Operations, Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662114 (Joe) Open>Resolved a:Joe
[13:13:10] nice
[13:46:45] netops, Discovery, Operations, WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2662197 (Addshore)
[13:48:44] netops, Discovery, Operations, WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2662197 (Addshore)
[13:53:44] HTTPS, Traffic, Operations, Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662242 (Aklapper) >>! In T105905#2525620, @BBlack wrote: > The blog is still sending the response header: `strict-transport-security: max-age=86400`. It should be `strict-transport-...
[14:08:11] daily restarts are back on
[14:08:51] cp1099 will hit its next one at 12:58 tomorrow (so, nearly 23h from now)
[14:09:10] it's the only one that will be pushing new boundaries up until then. the rest will restart before they reach the times we've already seen on cp1099
[15:05:00] Traffic, Operations, Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2662491 (BBlack) nginx has added `max_conns` to the open source master branch in http://hg.nginx.org/nginx/rev/29bf0dbc0a77 , which should appear in 1.11.5. That wasn...
[15:35:19] FWIW, cp1099 is lately reaching a sort of crossover point in storage allocations
[15:36:38] where "bytes freed" is approaching the same value as "bytes allocated" in all the bins (meaning we've been full and nuking for space so long that the earlier easy times are statistically insignificant), and c_fail (allocator failure) is also a large fraction of all requests (c_req) in most of the bins
[15:36:54] "allocator failure" in this case meaning it had to nuke to find space
[15:37:21] so we're definitely into the heavy sustained pattern for a while now, nuking old objects to make room for new
[15:37:38] the only thing that changes from here is probably the ever-increasing fragmentation level
[15:38:15] (but even then, the 16x ranges should effectively cap that, the question is whether the cap is sufficiently performant)
[15:52:41] HTTPS, Traffic, Operations, Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662606 (Tbayer) >>! In T105905#2463813, @Tbayer wrote: >>>! In T105905#2463476, @faidon wrote: >> After my mail yesterday, Jeff Elder contacted me for clarifications (which I gave)....
[15:53:18] HTTPS, Traffic, Operations, Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662607 (Tbayer) a:Tbayer>None
[16:17:31] HTTPS, Traffic, Operations, Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662678 (BBlack)
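A rough sketch of the two ratios described in the 15:36 messages about cp1099, assuming per-bin counters are available as a plain dict. `c_req` and `c_fail` are named in the log; `c_bytes` and `c_freed` are assumed here to be the "bytes allocated" and "bytes freed" counters, and the function name is hypothetical.

```python
def bin_pressure(counters):
    """Summarise allocator pressure for one storage bin.

    counters: dict with c_req, c_fail, c_bytes, c_freed (assumed field names).
    Returns the share of allocation requests that failed (that is, had to
    nuke to find space) and the ratio of bytes freed to bytes allocated.
    """
    req = counters["c_req"]
    allocated = counters["c_bytes"]
    return {
        # Near 1.0 means almost every allocation required nuking.
        "fail_ratio": counters["c_fail"] / req if req else 0.0,
        # Approaching 1.0 means the bin has been full and recycling space
        # long enough that the initial fill no longer matters.
        "freed_ratio": counters["c_freed"] / allocated if allocated else 0.0,
    }
```

With made-up numbers such as `{"c_req": 1000, "c_fail": 400, "c_bytes": 2000, "c_freed": 1900}`, the bin would show 40% of requests needing a nuke and freed bytes at 95% of allocated bytes, the kind of sustained-pressure pattern the messages describe.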