[05:37:09] 10Traffic, 10MediaWiki-extensions-CentralNotice, 06Operations, 13Patch-For-Review: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2563345 (10Nemo_bis)
[05:37:13] 10Traffic, 10MediaWiki-extensions-UniversalLanguageSelector, 06Operations, 13Patch-For-Review: ULS GeoIP should use the Cookie - https://phabricator.wikimedia.org/T143270#2562516 (10Nikerabbit) The cookie is already always used on WMF sites, because `$wgULSGeoService = false;` in CommonSettings.php. All ot...
[05:44:44] 10Traffic, 10Varnish, 06Operations: Varnish GeoIP is broken for HTTPS+IPv6 traffic - https://phabricator.wikimedia.org/T89688#2563362 (10Nemo_bis)
[05:44:46] 10Traffic, 06Operations, 13Patch-For-Review: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2563361 (10Nemo_bis)
[05:46:12] 07HTTPS, 10Traffic, 10Varnish, 06Operations: Varnish GeoIP is broken for HTTPS+IPv6 traffic - https://phabricator.wikimedia.org/T89688#1042675 (10Nemo_bis)
[06:40:41] 10Traffic, 06Operations, 13Patch-For-Review: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2563420 (10Nemo_bis)
[06:46:36] 10Traffic, 06MediaWiki-Stakeholders-Group, 06Operations, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2563422 (10Nemo_bis) > I do see lots of legit referer headers. ULS uses it, for instance, and [hundreds of standalone wikis](https://wikiapi...
[09:15:08] ema: cp1008 has / 100% full
[09:15:38] /var/cache/varnishkafka is 2.8G alone
[09:15:50] but / is 9GB in total, that's not much
[09:45:25] paravoid: thanks!
[09:49:21] I've removed linux-image-4.4.0-1-amd64-dbg which is about 2.7G
[09:52:48] and the stuff in /var/cache/varnishkafka
[10:04:11] it's possible this is LVM and we have more space in the VG
[10:04:17] I didn't actually check
[10:05:52] nope, / is on a raid device
[11:24:06] <_joe_> well cp1008 is the "test host", right?
[11:34:02] _joe_: correct
[12:19:40] usually vk is the culprit on cp1008
[12:19:57] it tends to spam some logs or outputs when we're testing things or restarting varnishes, etc
[12:26:34] bblack: anything we can do except for looking at HTTP 5xx?
[14:40:18] ema: not usually :)
[14:41:10] I mean in theory yes, but I don't know juniper routers well enough to help tbh. Trying to interject and help would likely slow things down.
[14:45:10] bblack: I was looking at the 5xx spikes on https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes and https://grafana.wikimedia.org/dashboard/db/varnish-http-errors, the one at 12:10ish is shown on both graphs, while the later one at 12:50 is only on varnish-aggregate-client-status-codes
[14:45:34] is it because the later spike was varnish<->varnish only?
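As background for the "always use the Cookie" tasks at the top of this log: a minimal sketch of what client-side code might do with the GeoIP cookie that Varnish sets, instead of calling the geoiplookup service. The colon-separated country:region:city:lat:lon:version layout and the underscore-for-space convention are assumptions for illustration, not taken from the CentralNotice/ULS source.

```typescript
// Hypothetical parser for a Varnish-set GeoIP cookie. Assumed value format:
//   GeoIP=US:CA:San_Francisco:37.78:-122.42:v4
// i.e. country:region:city:lat:lon:ip-version (layout is an assumption).
interface GeoInfo {
  country: string;
  region: string;
  city: string;
  lat: number | null;
  lon: number | null;
  ipVersion: string;
}

function parseGeoIPCookie(cookies: string): GeoInfo | null {
  const match = /(?:^|;\s*)GeoIP=([^;]*)/.exec(cookies);
  if (match === null) {
    // No cookie: a caller would have to fall back to a lookup service,
    // which is exactly what T143271/T143270 want to avoid.
    return null;
  }
  const [country = '', region = '', city = '', lat = '', lon = '', ipVersion = ''] =
    match[1].split(':');
  return {
    country,
    region,
    city: city.replace(/_/g, ' '), // assuming underscores stand in for spaces
    lat: lat === '' ? null : parseFloat(lat),
    lon: lon === '' ? null : parseFloat(lon),
    ipVersion,
  };
}

// Usage in a browser context: parseGeoIPCookie(document.cookie)
```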
[14:49:51] well the metrics used seem to be the same (when filtering for status_type 5 on varnish-aggregate-client-status-codes)
[14:51:28] ema: I think that's just a graphing issue, maybe varnish-http-errors is averaging things out too finely to see the latter spike
[14:51:46] whereas varnish-aggregate-client-status-codes is basically raw data
[14:51:58] (I didn't set up varnish-http-errors, I'm not terribly familiar with it)
[14:52:25] yeah there's a movingMedian involved in -http-errors
[14:53:36] so varnish-aggregate-client-status-codes is once again a more reliable source of insights :)
[14:56:23] during the earlier/bigger event, the stats spiked in the int-front direction on varnish-caching, too
[14:56:46] which to me implies 503 on reachability from fe->be (chashing to different machines in different rows)
[14:58:07] int-front spike, hit-front drop
[14:58:20] well the other dropouts are hard to reason about
[14:58:27] because the request rate also got smaller
[14:58:41] and they're percentages so any one thing increasing implies others decreasing
[14:58:48] but +int-front is what stands out
[14:58:54] right
[16:02:23] just saw this
[16:02:43] I was looking at varnish-http-errors at the time too
[16:02:44] and was confused
[16:03:07] if it's not maintained by you guys, perhaps it should be dropped or something?
[16:03:15] although the widgets at the top seem interesting
[16:03:41] I think it might've been an ori creation, and it does have some more-useful views
[16:03:55] oh this is probably the gdash equivalent -- we had a task to set up grafana dashboards for all the gdash dashboards we had, as a blocker to killing gdash
[16:04:03] yeah
[16:05:41] unrelated:
[16:06:00] I was digging into what UAs are causing all the ChaPoly-Draft traffic to see how soon we can expect it to die
[16:06:27] most of it is outdated Chrome or something Chrome-based on android (e.g. UCBrowser, YoloBrowser, etc... barf), so those may update from the app store eventually
[16:06:35] as newer Chrome on all platforms should do RFC mode
[16:07:02] but ~20% of the draft-mode hits are also coming from the Google Search App on iPhone/iPad
[16:07:11] and they're coming from the latest release of it, which was August 8
[16:07:26] so I guess Google is behind on switching that app's connections from draft to RFC, hopefully they will eventually.
[16:09:16] it's amazing how many outdated Chrome installs there are in the wild in general
[16:09:24] so much for auto-update-magic :P
[16:09:38] some of it's controlled installs that only update infrequently (e.g. corporate desktops)
[16:10:25] some of it's recent enough and looks like it should auto-update, but I'm guessing the device+browser hasn't restarted in forever to take the update, or in the android cases maybe they have auto-update for it turned off in the store and don't routinely upgrade their apps in general.
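To illustrate the graphing point above: a moving median ignores short-lived outliers entirely, so a one-sample 5xx spike that is obvious in the raw varnish-aggregate-client-status-codes data can simply vanish from the movingMedian()-smoothed series. This is only a toy sketch with invented numbers, not the actual graphite/Grafana query behind varnish-http-errors.

```typescript
// Toy demonstration: a short spike never becomes the median of its window,
// so it disappears after smoothing. All numbers are made up.
function movingMedian(series: number[], windowSize: number): number[] {
  const smoothed: number[] = [];
  for (let i = 0; i < series.length; i++) {
    const start = Math.max(0, i - windowSize + 1);
    // slice() copies, so sorting here does not disturb the input series
    const windowVals = series.slice(start, i + 1).sort((a, b) => a - b);
    smoothed.push(windowVals[Math.floor(windowVals.length / 2)]);
  }
  return smoothed;
}

// Raw per-minute 5xx counts with a single elevated sample:
const raw5xx = [3, 2, 4, 3, 250, 3, 2, 4, 3, 2];
console.log(movingMedian(raw5xx, 5));
// The 250 never appears in the smoothed output, which is roughly why the
// 12:50 spike showed on the raw dashboard but not on varnish-http-errors.
```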
[16:11:40] when/if Chrome average version rises, we should see both +chapoly in general and less %draft (vs rfc)
[16:28:54] 10Traffic, 06Operations, 10Pybal: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#2564802 (10ema) p:05Triage>03Normal
[16:43:35] 10Traffic, 06Operations, 13Patch-For-Review, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2564863 (10BBlack)
[16:43:39] 10Traffic, 06Operations, 06Wikipedia-iOS-App-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2564860 (10BBlack) 05Open>03Resolved a:03BBlack With no movement for a couple of weeks here and the various above comments (only outdated app versions, ok...
[16:45:25] bblack: re: MMDB/UTF-8, yeah, I guess it has always been an issue
[16:45:35] I wonder if "city" is used a lot in our javascript, perhaps we could drop it?
[16:46:13] I mean, if it outputs S_o_Paulo for a city of 20 million..
[16:46:47] maybe they're aware and if they need a campaign there they match it to S_o_Paulo :)
[16:47:15] heh
[16:47:20] in a very cursory look at the CN JS code, it seemed like they only use Country-level data? But I'm not entirely sure. It may be that individual campaigns have custom JS snippets that can access more
[16:47:38] I know lat/lon have been used before
[16:47:51] for targeting photo contents and such
[16:47:54] so yeah, probably the per-campaign JS
[16:48:12] yeah
[16:48:45] 10netops, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations: Open hole in analytics vlan firewall to allow MirrorMaker to talk to main Kafka clusters - https://phabricator.wikimedia.org/T143335#2564917 (10Ottomata)
[16:49:02] hm
[16:49:05] https://meta.wikimedia.org/wiki/Help:CentralNotice doesn't mention it though
[16:49:35] ah, there's https://en.wikipedia.org/wiki/Wikipedia:Geonotice
[16:49:46] well I suspect per-campaign JS has access to raw Geo={...}, it's a global variable
[16:49:56] (window-global in JS)
[16:50:10] that's a gadget
[16:50:19] do you know of mwgrep?
[16:50:29] nope
[16:50:38] it searches all gadget code
[16:50:46] indexed in elasticsearch
[16:50:54] faidon@tin:~$ mwgrep Geo
[16:51:02] fun :P
[16:51:13] faidon@tin:~$ mwgrep geoiplookup
[16:51:13] ## Public wiki results
[16:51:14] svwiki MediaWiki:Common.js/watchlist.js
[16:51:14] svwiki MediaWiki:Gadget-Geonotice.js
[16:51:15] fwiw
[16:53:05] there's the almost trivial https://packages.debian.org/sid/libunac1 / http://manpages.ubuntu.com/manpages/wily/man3/unac.3.html btw
[16:53:08] but it sounds like overkill
[17:02:08] these are awesome:
[17:02:10] 90 RxURL c /favicon/wikipedia.ico
[17:02:10] 90 RxHeader c Host: bits.wikimedia.org
[17:02:10] 90 RxHeader c Referer: https://bits.wikimedia.org/favicon/wikipedia.ico
[17:04:23] those are the bulk of the remaining hits to bits that have a referer from our own domains
[17:04:27] it refers from its own URL :P
[17:10:52] 10Traffic, 06Operations, 13Patch-For-Review, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2565036 (10BBlack) I'd like to start the decom here with the DNS removal of the `bits.wikimedia.org` hostname itself, so that the traffic dies...
[17:15:28] bblack: So weird question. SSH access to the cluster for me has been *absurdly* slow for the last couple of days. HTTPS has been just fine. Once I'm connected I'm ok (no major lag), but it's the initial connection that has been absurdly slow to the point of being almost useless. Thoughts?
[17:16:17] do you have any idea what step in the ssh connection is causing the delay? connection to the bastion? dns lookup of the bastion? have you looked at / updated your .ssh/config recently?
[17:17:41] It's not the DNS part afaict.
[17:17:48] And yeah, some minor tweaks, but that was last week
[17:17:55] paravoid: yeah there's clearly lots of things accessing the global Geo object from window context, once CN creates it for them heh
[17:17:58] bblack@tin:~$ mwgrep Geo.city
[17:17:59] (mainly cuz I rotated a bunch of public/private keys)
[17:18:01] ## Public wiki results
[17:18:03] commonswiki MediaWiki:Gadget-EnhancedPOTY.js
[17:18:06] commonswiki MediaWiki:Gadget-WatchlistNotice.core.js
[17:18:09] commonswiki MediaWiki:GeoEdit.js
[17:18:11] commonswiki MediaWiki:JSONListUploads.js
[17:18:13] commonswiki MediaWiki:WatchlistMessageCreator.js
[17:18:16] zhwiki MediaWiki:Gadget-AdvancedSiteNotices.js
[17:18:41] bblack: I can pastebin, one second.
[17:19:32] FWIW, ignoring Cipher/Kex stuff, mine is down to just this:
[17:19:41] Host *.wmflabs *.wmflabs.org !restricted.bastion.wmflabs.org ProxyCommand ssh -4 -W %h:%p restricted.bastion.wmflabs.org
[17:19:44] Host *.wikimedia.org *.wmnet !bast* !gerrit.wikimedia.org !wikitech-static.wikimedia.org 10.*.*.* ProxyCommand ssh -4 -W %h:%p bast2001.wikimedia.org
[17:20:28] bblack: https://phabricator.wikimedia.org/P3850 - comment at the bottom to point out where the slowdown is.
[17:22:12] ostriches: does the IP you connect from (the outside world's view) have working revdns? just a random thought, maybe something's timing out trying to resolve it
[17:22:47] Ummm, I dunno! How could I check easily?
[17:23:20] ostriches: host $(curl http://icanhazip.com)
[17:23:56] 122.232.234.98.in-addr.arpa domain name pointer c-98-234-232-122.hsd1.ca.comcast.net.
[17:24:06] also, maybe try putting "-4" on your ProxyCommand to see if it's a v6-related problem
[17:24:46] (I've had mine in there forever because I often experiment with turning on the v6 that AT&T offers me, and then find out yet again that it's kinda borken)
[17:25:03] Aha!
[17:25:05] Much faster!
[17:25:17] Comcast fucked up their ipv6 again
[17:25:22] I don't think they (AT&T) configure v6 on customer modems/routers by default, but if you log into the modem you can turn it on in the settings and feel the fail.
[17:25:58] it mostly-works, but random loss/breakage here and there. I don't think it's internet-level v6 routing problems, I think it's problems with the modem they give me and how it does v6 ACLs and such.
[17:26:00] my 6in4 tunnel from he.net has been a bit flakey for the last 2 weeks
[17:26:12] Comcast's IPv6 has been *ok* for me recently, but when it first rolled out it was beyond useless.
[17:26:16] Must've fucked themselves again
[17:27:17] I used to have he.net tunnels for home and my linode boxes. then I undid all that setup when linode got native v6 and AT&T started offering it. then later I figured out AT&T was causing problems, but never set he.net back up.
[17:28:02] Anyway forcing v4 works. Thx for the insight guys
[17:28:32] np
[17:36:29] 10Traffic, 06Operations, 13Patch-For-Review, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2565144 (10BBlack) (edited above to note it's just favicon, not others, that's these bulk). Also notable, many of these self-referred favicon...
[18:18:39] 10netops, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations: Open hole in analytics vlan firewall to allow MirrorMaker to talk to main Kafka clusters - https://phabricator.wikimedia.org/T143335#2565307 (10faidon) 05Open>03Resolved a:03faidon Should be done!
[19:15:42] 10Traffic, 06MediaWiki-Stakeholders-Group, 06Operations, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2565503 (10BBlack) >>! In T100902#2563422, @Nemo_bis wrote: >> I do see lots of legit referer headers. > > ULS uses it, for instance, and [h...
[21:00:36] 10netops, 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2565934 (10Dzahn)
[21:48:55] bblack: that comment only makes it more intriguing
[21:49:39] 10Traffic, 06MediaWiki-Stakeholders-Group, 06Operations, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323201 (10Platonides) I think a new ULS version not relying on that should be released before the shutdown, then.
[22:24:12] bblack: around?
[23:59:05] 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2551996 (10DaBPunkt) >>! In T142944#2560827, @hoo wrote: > For this, we also desire to get the placeholders into search engines, to...
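Tying together the Geo thread from earlier in the log: the mwgrep Geo.city results show gadget code reading the window-global Geo object directly, and the "S_o_Paulo" output means city-level matching has to cope with mangled non-ASCII names. Below is a hedged sketch of what such a consumer might look like; the field names beyond country/region/city and the campaign-matching logic are illustrative assumptions, not taken from any of the listed gadgets or from CentralNotice itself.

```typescript
// Illustrative only: how a gadget or per-campaign snippet might read the
// window-global Geo object that CentralNotice exposes. Everything beyond the
// existence of Geo.country / Geo.city is an assumption for this sketch.
declare global {
  interface Window {
    Geo?: { country?: string; region?: string; city?: string };
  }
}

function noticeAppliesHere(target: { country: string; city?: string }): boolean {
  const geo = window.Geo;
  if (!geo || !geo.country) {
    return false; // no geolocation data: skip the geotargeted notice
  }
  if (geo.country !== target.country) {
    return false;
  }
  if (target.city !== undefined) {
    // Non-ASCII city names can arrive mangled (the "S_o_Paulo" problem above),
    // so a campaign matching on city either also matches the mangled form or
    // gives up on city-level targeting entirely.
    return geo.city === target.city;
  }
  return true;
}

export {};
```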