[00:00:28] Coren: even for a simple 304 event
[00:01:09] Coren: take a look to compare resources loaded from commons
[00:02:03] hedonil: You may be hitting connection limits.
[00:02:25] Coren: raise 'em ;)
[00:03:04] Ah, hm. No, that's looks more like it's an issue with DNS. Again. Eff.
[00:03:08] Coren: thers *must* be a knot somewhere
[00:15:45] I hate having to do this, but I'm going to add all the tool hosts to the hosts file.
[00:17:07] Coren: what does that do?
[00:17:30] Betacommand: Bypasses DNS locally.
[00:18:08] andrewbogott: FYI, the symptom is becoming increasingly frequent and I still have no idea what can possibly be causing it
[00:18:29] :(
[00:18:52] Now I'm getting random delays, timeouts and (that's new) ECONNREFUSED
[00:18:58] Host tools-exec-03.pmtpa.wmflabs not found: 5(REFUSED)
[00:19:20] It looks like it's overloaded(?) but it's strictly internal.
[00:22:55] andrewbogott: Holey sheets! Have you seen how much DNS traffic reaches virt2?
[00:23:24] I don't think I would know where to look...
[00:23:28] is that on ganglia?
[00:23:57] andrewbogott: I looked with tcpdump.
[00:25:01] Coren: whats on vrit2 ?
[00:25:04] so just 'tcpdump port 53'?
[00:25:04] I mean, it's surprisingly high since that should be labs-only.
[00:25:20] andrewbogott: Yeah, on br103 (the virtual interface for the openstack network)
[00:25:34] Betacommand: That runs the openstack network.
[00:26:17] andrewbogott: 500 requests/secs it looks like, with some bursts. I have no idea whether that silly dnsmasq thing can handle that.
[00:26:44] (I should say that clearly it cannot)
[00:27:03] 828 packets captured
[00:27:03] 1037 packets received by filter
[00:27:03] 171 packets dropped by kernel
[00:27:19] That's not good.
[00:27:46] OK, so, for example...
[00:27:47] 00:26:50.202861 IP 10.4.1.91.48655 > 10.4.0.1.domain: 28971+ AAAA? www.archidiocesedelome.org. (44)
[00:28:01] is 'archidiocesedelome.org' the originator of the request?
[00:28:04] Or the content?
[00:28:13] * andrewbogott has barely used this tool
[00:28:44] andrewbogott: It's the request. 10.4.1.91 requested AAAA records for www.archi...
[00:30:01] so half the traffic is coming from 10.4.1.91
[00:31:26] which is… what? dig doesn't know :(
[00:32:05] It's not part of tools; lemme dig around.
[00:32:26] dwl.pmtpa.wmflabs.
[00:33:37] Wikipedia Dead Weblinks Checker Bot; contact: gifti@toolserver.org
[00:33:58] giftpflanze: Looks like it might be your "fault" after all.
[00:34:20] Although that dnsmasq thing is really shitty if it starts having trouble at <1k requests/s
[00:34:41] yeah.
[00:35:03] Nonetheless, it would be nice to check if that fixes the problem...
[00:35:05] Thankfully, we're changing the networking in eqiad
[00:35:36] It's running in a detached screen; I should be able to suspend it
[00:39:17] dns is suddenly much worse!
[00:39:30] I've just suspended it.
[00:39:36] Let's see if that help.
[00:39:52] Most of its requests are for recursive "outside" queries, that might be much worse.
[00:40:48] * hedonil rejoices: resource loading time (304 not modified): 0ms :-D 10.17s before. that's what you call a difference!
[00:41:34] hedonil: I worked around the problem on tools with hosts file.
[00:41:36] I'm not getting timeouts anymore...
[00:41:45] Still lots of traffic from tools-webproxy, but that's to be expected...
[00:41:51] andrewbogott: I'm still getting pretty random resolution delay.
[00:41:52] Coren: yeah. thx
[00:41:59] But almost no timeouts anymore.
[00:42:24] So yeah. Limit reached on that thing. Lemme see if it has a tunable we could tweak.
[00:48:39] Coren: hell yeah! this is how pages should perform.
[00:49:52] andrewbogott: I'm seeing absolutely no way to tune that thing to increase its connection pool or listen queue size.
[00:50:44] So… maybe we just need to adopt a "Don't do that, then" solution until the migration.
[00:50:56] It hasn't been much of a problem, historically...
[00:50:59] We can alleviate the pressure somewhat, I'm guessing, by installing bind on the worst offenders.
[00:51:27] We're still getting a couple timeouts even with the giftpflanze bot asleep. (Though not nearly as much)
[00:54:13] Also, no instance DNS caching whatsoever.
[00:55:02] * Coren ponders.
[00:57:14] * Coren enables nscd host cache on all tools instance
[01:02:52] That reduced the pressure a bit too.
[01:45:25] Wa da fu is enwp10 doing?
[02:00:51] !somethingtorelax is http://www.flickr.com/photos/110698835@N04/
[02:00:52] Key was added
[02:00:57] lol
[02:13:31] Coren: priority: jfi: recurrent hiccups remain: https://tools.wmflabs.org/wikiviewstats/2013-12-07-030900-loading_page.png
[02:14:09] Coren: moved all img resources to commons now, better cache.
[06:59:48] Coren: ddos attack on tools-webserver-01?
[07:10:29] petan: ^^
[09:34:28] Coren: there?
[11:16:31] \join pywikipediabot
[15:09:34] zhuyifei1999: Not a willful one. It's enwp10 but I'll have to disable it for the moment. I was hoping they were just running a job yesterday that woudl end relatively quickly, but they'll need some serious improvement.
[15:22:55] Coren: did i overload dns resolving in labs?
[15:23:38] giftpflanze: You did, but it's not your fault so much as it is the extraordinarily crappy networking layer of openstack.
[15:24:01] so, what can we do?
[15:24:05] Easiest way to fix: install a bind on your instance.
[15:24:15] what's that?
[15:24:18] (Or any other DNS server that'll resolve and cache)
[15:24:31] mhm
[15:25:47] so i didn't access tools (too much)?
[15:50:13] Coren: do i have to do any configuration for bind?
[15:52:10] https://help.ubuntu.com/community/BIND9ServerHowto#Caching_Server_configuration
[15:52:38] ok, what are the ip addresses?
[15:56:00] Coren: ^
[15:57:30] giftpflanze: I was about to say you'll find them in /etc/resolv.conf; but avoiding that one is the whole point. :-) Use 208.80.152.131 and 208.80.152.132
[16:01:58] ok, thank you
[16:10:35] Coren: when labs has migrated to eqiad that won't be necessary anymore?
[16:11:23] giftpflanze: Presumably not. If the new version of openstack still relies on dnsmasq I'll put a real DNS box alongside it.
[16:11:48] ok
[16:17:42] is ipv6 supposed to work?
[16:17:44] johang@tools-dev:~$ ping6 en.wikipedia.org
[16:17:44] connect: Network is unreachable
[16:17:56] Coren: http://ganglia.wmflabs.org/latest/?c=tools&h=tools-login&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[16:21:46] something seems very wrong with s5 according to that page
[16:22:01] johang: Not from labs, no. The current version of the networking stack doesn't speak V6. The new version we are deploying in eqiad should.
[17:18:47] Coren, thanks
[17:39:54] Coren: i think dns caching made the bot a lot faster
[17:41:18] giftpflanze: That wouldn't be very surprising; DNS roundtrips add up.
[17:55:51] python speaking user around?
[17:59:37] print "Hello, Steinsplitter!"
[18:01:20] Steinsplitter: do you mean the programming language or the snake?
[18:01:27] ^^
[18:01:41] lol
[18:01:53] !somethingtorelax
[18:01:54] http://www.flickr.com/photos/110698835@N04/
[18:03:57] anyone online happen to know how i can see if runJobs.php is run as a cron job on http://commons.wikimedia.beta.wmflabs.org/ ?
[18:04:43] Python Programming Language :P
[18:05:38] i am working since a hour on a error handler... but .... HMPF
[18:36:32] Steinsplitter: maybe pasting some lines will improve feedback rate https://tools.wmflabs.org/paste/ maybe.
[19:06:05] Steinsplitter: you need some python help?
[19:10:31] Steinsplitter: have you fixed your error handling issue? For python questions, you can also try #python, which, in my experience, is very friendly and to the point.
[19:13:41] re, oh thx :) not resolved
[19:14:39] pm? ther ar som bugs, so i dos not like to post it public? :)
[19:14:46] Steinsplitter: sure.
[21:02:52] !log wikistats (wx) - chapter wikis - fixed URLs due to site redesigns: wikimedia.de, wikimedia.org.ph, pa.us.wm -> pa-us.wm
[21:02:54] Logged the message, Master
[21:03:51] !log wikistats (wx) - chapter wikis - removed because they all don't use mediawiki anymore (sigh!?) - wikimedia.fr, wikimedia.ch, wikimedia.org.ve, wikimedia.org.il, wikimedia.org.ar ..(less and less chapters actually use mw)
[21:03:53] Logged the message, Master
[21:05:37] !log wikistats - various fixes with some broken updates in wiktionary and mediawiki table, remaining ones with error code "991" means the remote site has database issues, even if it looks like it works in browser, the script gets internal_api_error_DBConnectionError or similar
[21:05:38] Logged the message, Master
[21:07:49] johang: Python: the language of choice for parseltongues! :-)
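
[editor's note] The log above suggests that a bot re-resolving the same external hostnames was a large share of the DNS load, and that adding a cache (bind, nscd) helped. As a rough illustration only, here is a minimal in-process sketch of the same idea in Python. It is not what anyone in the channel deployed: the class name `CachingResolver`, the fixed `ttl_seconds`, and the injectable `lookup` and `clock` parameters are all hypothetical, and real resolvers honour per-record TTLs rather than one fixed lifetime.

```python
import socket
import time


class CachingResolver:
    """Hypothetical in-process hostname cache with one fixed time-to-live."""

    def __init__(self, lookup=None, ttl_seconds=300.0, clock=time.monotonic):
        # lookup: callable taking a hostname and returning a list of addresses.
        # Defaults to the system resolver via socket.getaddrinfo.
        self._lookup = lookup if lookup is not None else self._system_lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # hostname -> (expiry time, list of addresses)

    @staticmethod
    def _system_lookup(host):
        # getaddrinfo returns 5-tuples; element 4 is the socket address,
        # whose first item is the IP address as a string.
        infos = socket.getaddrinfo(host, None)
        return sorted({info[4][0] for info in infos})

    def resolve(self, host):
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh: no DNS query is sent
        addresses = self._lookup(host)
        self._cache[host] = (now + self._ttl, addresses)
        return addresses
```

Calling `CachingResolver().resolve("en.wikipedia.org")` twice within the TTL would send at most one query to the upstream resolver. A link checker that revisits the same domains repeatedly would save most of its lookups this way, which is the effect a local caching server gives to every process on the instance.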