[08:26:44] I just logged in to yarn.w.o, all works fine :)
[08:30:01] nice!
[08:36:07] joal also mentioned that we could expose via basic auth + ldap also hdfs.w.o, that should be basically a proxy for the HDFS namenode webserver
[08:36:23] but it will need a new domain of course
[12:38:00] elukey: make a patch :) it's usually best to do the varnish side first, then the DNS, so there's no chance to cache a 404
[12:47:49] bblack: want to give it a try? https://gerrit.wikimedia.org/r/308967
[12:49:08] ema: so, no smoking gun yet I assume on CL:0+200 @ codfw?
[12:49:30] bblack: nope, just 416s
[12:49:35] ok
[12:52:09] done, let's stare at the graphs!
[12:52:55] and ~ema/varnishlog-{frontend,backend}-cl0-2.log on cp2* upload hosts
[13:02:48] image found with CL:0 https://upload.wikimedia.org/wikipedia/id/8/8f/Tingkatan.jpg
[13:02:57] but it seems to be like this on swift
[13:03:49] I've tried appending ?x=come_on and CL is indeed 0
[13:09:26] we can query swift directly
[13:09:32] is that the only case so far?
[13:09:46] yes
[13:11:09] bblack@neodymium:~$ curl -I http://ms-fe.svc.eqiad.wmnet/wikipedia/id/8/8f/Tingkatan.jpg
[13:11:12] HTTP/1.1 200 OK
[13:11:15] Content-Length: 0
[13:11:27] Last-Modified: Tue, 08 Oct 2013 00:01:46 GMT
[13:11:36] godog: ?
[13:13:23] bblack: ah! never seen that before
[13:13:45] at least we know our logging is working right to catch them :)
[13:14:39] same data in codfw swift too, FWIW, so probably not e.g. low-level storage corruption.
[13:14:55] well unless it happened before sync to codfw, so maybe that tells us nothing :)
[13:15:36] hehe indeed, easy to imagine having stored all kind of weird things over the years in swift
[13:15:55] and/or blips from mw and swift and never cleaned up
[13:15:59] yeah
[13:17:34] https://commons.wikimedia.org/wiki/File:Tingkatan.jpg -> does not exist
[13:17:44] am I looking at the wrong path?
[13:18:42] maybe it was deleted from commons, and that sometimes (or always?) results in a CL:0 file in swift?
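For readers following the CL:0 hunt above: the "empty 200" condition being chased (status 200 together with Content-Length: 0) can be checked mechanically against the text that `curl -I` prints. The following is only a minimal sketch; the function names `parse_head_response` and `is_empty_200` are hypothetical and not part of any tooling mentioned in the log, and the sample text is the header output pasted at 13:11.

```python
from typing import Dict, Tuple


def parse_head_response(raw: str) -> Tuple[int, Dict[str, str]]:
    """Parse the text printed by `curl -I` into (status_code, headers)."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    # The status line looks like "HTTP/1.1 200 OK"; the code is the second field.
    status = int(lines[0].split()[1])
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only, so values containing colons survive.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def is_empty_200(raw: str) -> bool:
    """True when the response is a 200 that advertises a zero-length body."""
    status, headers = parse_head_response(raw)
    return status == 200 and headers.get("content-length") == "0"


# Header text as pasted in the log at 13:11.
SAMPLE = """\
HTTP/1.1 200 OK
Content-Length: 0
Last-Modified: Tue, 08 Oct 2013 00:01:46 GMT
"""

if __name__ == "__main__":
    print(is_empty_200(SAMPLE))
```

A 416 response would not be flagged by this check even if it carried Content-Length: 0, which matches the distinction drawn in the log between the 416s and the CL:0+200 case.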
[13:19:09] yeah that's idwiki, https://id.wikipedia.org/wiki/Berkas:Tingkatan.jpg
[13:19:20] TIL: File: namespace is localized too
[13:19:35] oh right....
[13:20:30] the commons ones are /commons/ of course :)
[13:21:27] we went from 1k req/s to 11k in codfw
[13:21:38] yeah, incidentally just yesterday I looked at thumb requests for non-commons and it is just a bunch compared to commons
[13:22:35] you mean non-commons is bigger than commons? I would've thought the opposite.
[13:23:52] yeah the opposite, non-commons is tiny compared to commons
[13:24:55] oh ok
[13:26:30] https://id.wikipedia.org/w/index.php?title=Berkas:Tingkatan.jpg&action=history
[13:26:44] (hey at least the API parameters like "action=history" aren't localized)
[13:26:56] that would be fun
[13:27:08] what's curious to me is it was uploaded in 2008, modified 3x in 2015, and the LM on the CL:0 file is 2013
[13:28:45] ema: yeah what puzzles me is that we're sane enough not to localize those, but we localize things like "File:" and "Special:"
[13:29:06] makes for fun times when you want a simple regex to match 'Special:Autologout' URLs or whatever on all wikis :P
[13:29:16] or even just Special:.*
[13:29:46] bblack@alaxel:~/repos/puppet$ git grep Special: templates/varnish
[13:29:46] templates/varnish/text-backend.inc.vcl.erb: set bereq.url = "/wiki/Special:UrlRedirector" + req.url;
[13:29:49] templates/varnish/text-backend.inc.vcl.erb: req.url !~ "^/wiki/Special:HideBanners") {
[13:29:52] templates/varnish/text-frontend.inc.vcl.erb: if (req.url !~ "^/(wiki/|(w/index\.php)?\?title=)Special:Banner") {
[13:29:57] ^ all of those are probably borked for some wikis
[13:30:39] if Special: is localized. maybe it's only the right-hand-side of Special: that's localized? I don't remember now.
[13:31:24] ah no, the left is localized too
[13:31:31] dewiki's login link is: https://de.wikipedia.org/w/index.php?title=Spezial:Anmelden&returnto=Wikipedia%3AHauptseite
[13:33:44] just write a nice regex! /Spe[cz]ial/ would match English, German and Italian :P
[13:34:51] * bblack creates a phab task for someone to make regexes that match all the variations in our 800 wikis for all Special:Foo :P
[14:13:29] btw I was curious to know what's one-hit-wonder in varnish context (re: last code review)
[14:13:58] aha https://phabricator.wikimedia.org/T144187
[14:16:50] godog: my definition would be: stuff that gets requested only once
[14:18:12] right, it turns out a lot of things are only ever accessed once, or twice, or maybe 5 times
[14:18:37] in our larger disk caches it matters less, but the memory-constrained frontends can benefit from not evicting more-useful objects for a rarely-accessed one
[14:23:57] filtering one-hit-wonder seems like it's probably always beneficial. daniel indicated sometimes filtering up through ~4-16-hit-wonders is useful. I figured start at two-hit-wonders and see how it goes from there.
[14:24:13] nice
[14:24:53] ema: probably want to put it off for later, it just happened to be on my mind. we could wait for post-v4, or wait for codfw+ulsfo on v4 with stable cache stats and then apply only to v4, etc...
[14:25:52] yeah it makes sense not to mix multiple experiments :)
[14:26:06] our definition of #hit-wonder is slightly different than what applied to his 4-16 data anyways, as that was local to one FE, and this is shared in a site's FEs effectively
[14:26:20] I guess we don't really care if this approach doesn't work on backends<->applayer, the important thing is getting it to work on the frontends right?
[14:26:29] yeah I think so
[14:26:54] I don't know that it would ever make sense to apply this kind of filtering on our backends in general
[14:27:17] unless we're confident we lack sufficient storage size there, to the point that it inflicts a notable miss-rate penalty
[14:28:24] looping back to the other topic: I do wonder why our miss% seems higher under v4 so far, but it's hard to compare without stable data over a longer period.
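The N-hit-wonder idea discussed above (do not let an object into a small cache until it has been requested more than N times) can be illustrated with a simple admission filter. This is a generic sketch only, not the approach taken in T144187 or in the Varnish VCL: the class name `HitWonderFilter` is hypothetical, and it counts requests in a local, unbounded counter, whereas the log describes a count that is effectively shared across a site's frontends.

```python
from collections import Counter


class HitWonderFilter:
    """Admit an object to the cache only after more than `threshold` requests.

    threshold=1 filters one-hit-wonders (cacheable from the 2nd request on),
    threshold=2 filters two-hit-wonders (cacheable from the 3rd request on).
    """

    def __init__(self, threshold: int = 1) -> None:
        self.threshold = threshold
        # Request counts per key. Unbounded here for simplicity; a real
        # filter would need to cap or age out these counters.
        self.seen: Counter = Counter()

    def should_cache(self, key: str) -> bool:
        """Record one request for `key` and say whether it may be cached."""
        self.seen[key] += 1
        return self.seen[key] > self.threshold


if __name__ == "__main__":
    two_hit = HitWonderFilter(threshold=2)
    for attempt in range(1, 4):
        print(attempt, two_hit.should_cache("/wikipedia/commons/example.jpg"))
```

The trade-off raised in the log applies directly: a higher threshold protects a memory-constrained cache from rarely requested objects, at the cost of extra misses before a genuinely popular object is admitted.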
[14:30:03] well now we've pointed codfw straight to the applayer
[14:30:52] so hit-remote is gone
[14:31:47] yeah I'm sure that's part of it
[14:32:14] but even when it was only ulsfo on v4 over that weekend, with codfw->eqiad v3 data behind it, it wasn't getting the same miss rate in ulsfo it had before
[14:32:26] but again, it could take a long time to get down that far, and we had other issues in play
[14:33:52] oh upload codfw machines are more powerful than ulsfo ones
[14:34:01] that might explain why cpu usage seems better
[14:34:06] yeah :)
[14:34:31] codfw has our best cache configuration in general, for all the clusters, and yet the least load :)
[14:34:38] (all newest-gen machines)
[14:35:11] so if everything works fine in codfw we just buy new machines everywhere and we're done!
[14:35:17] :P
[14:35:19] :)
[14:39:22] this helped a lot removing false positives from my varnishncsa logs:
[14:39:23] | grep -v " 416 " | grep -v http://upload.wikimedia.org/wikipedia/id/8/8f/Tingkatan.jpg
[14:39:57] and no CL:0 so far \o/
[14:40:39] backends still haven't started doing nukelru though
[14:42:45] right
[14:43:22] so if it's not a v3-compat issue, it's also not a pure "race on fast purge traffic" either, at least
[14:43:52] but it could still be a nukelru problem, potentially in conjunction with high request or purge rates
[14:44:11] but we had the 503 anomalies fairly early on before, too, and presumably the CL:0 ones
[14:44:21] pretty early, yes
[14:45:56] self-reflection: I think the only reason I rebased my dns -geoiplookup patch yesterday and today was as a scare tactic. Because I knew I wasn't going to merge it, but it kinda looks like I might be about to heh.
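The two `grep -v` filters quoted at 14:39 (drop 416 responses, drop the one URL already known to be empty in swift) could be sketched in Python as below. The function name `drop_false_positives` is hypothetical, and the sample lines in the demo are invented for illustration since the log does not show the actual varnishncsa format in use. One small difference from the shell version: `grep` treats the URL as a regular expression, so its dots match any character, while this sketch matches the URL literally.

```python
from typing import Iterable, Iterator

# The one object already known to be stored with Content-Length: 0 in swift.
KNOWN_EMPTY_URL = "http://upload.wikimedia.org/wikipedia/id/8/8f/Tingkatan.jpg"


def drop_false_positives(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that neither grep -v filter would remove."""
    for line in lines:
        if " 416 " in line:
            # Range-not-satisfiable responses: expected, not the bug.
            continue
        if KNOWN_EMPTY_URL in line:
            # Already explained: the object is empty at the source.
            continue
        yield line


if __name__ == "__main__":
    # Invented sample lines; the field layout is an assumption.
    sample = [
        "GET http://upload.wikimedia.org/wikipedia/commons/a/ab/Foo.jpg 416 0",
        "GET " + KNOWN_EMPTY_URL + " 200 0",
        "GET http://upload.wikimedia.org/wikipedia/commons/c/cd/Bar.jpg 200 0",
    ]
    for kept in drop_false_positives(sample):
        print(kept)
```

Anything that survives both filters is a candidate for a genuine CL:0 response worth inspecting by hand.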
[14:46:17] I should be less passive-aggressive and just ping them harder about the CN patch :)
[15:13:26] Traffic, Operations, Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2614930 (BBlack) Looking at 24H of data from oxygen webrequest archive's `sampled-1000.json-20160907`, if I filter just for bits requests, `cut -d/ -f1-3` to coalesce long-path noise a...
[15:24:11] netops, Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2614974 (hashar) Looping #netops . We would need contint1001 to be moved to the public network with...
[16:34:44] still no 503s, no empty 200s and no backend nukelru
[16:34:51] * ema calls it a day
[16:34:54] o/
[16:34:55] \o/
[17:25:01] netops, Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2615667 (RobH) So I can handle the vlan move and reimage. Just to confirm there is no data that is c...
[17:31:04] netops, Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2615745 (RobH) a:RobH checked in release engineering, its cool for me to reimage this now (after...
[17:58:49] netops, Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2615947 (RobH)
[18:01:30] HTTPS, Traffic, Monitoring, Operations: adjust ssl certificate montioring to differentiate between standard and LE certificates. - https://phabricator.wikimedia.org/T144293#2615965 (AlexMonk-WMF) causes the labtestwikitech alert that the labs team noticed
[19:00:03] netops, Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2616279 (hashar) I confirm the server content on contint1001.eqiad.wmnet can be...
[20:56:21] netops, Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2616802 (RobH)
[20:57:13] netops, Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2458291 (RobH) a:RobH>hashar contint1001.wikimedia.org is online with p...