[00:19:55] anyone with cloudadmin in here?
[00:19:59] I appear to have an instance stuck
[00:20:08] and not responsive to the typical reboot command
[00:20:32] and actually just as I said that it completed
[00:24:11] Coren|Dinner: I'm seeing intermittent errors about not being able to resolve tools-master.pmtpa.wmflabs from qstat
[01:03:04] anomie|away: are those errors ongoing? We had that issue earlier in the day, I'm hoping it's fixed...
[01:12:27] andrewbogott: Yeah, still happening occasionally. At the moment, I'm seeing more of qstat -xml -j $job telling me "error: commlib error: access denied (server host resolves rdata host "tools-login" as "tools-login.pmtpa.wmflabs")" then giving me an XML "communication_error" with "unable to contact qmaster using port 6444 on host "tools-master.pmtpa.wmflabs""
[01:14:23] hm… it's not obvious to me that that's a resolution problem. Am I missing a bit?
[01:30:36] andrewbogott: could you check if there are any remnants /var/run/lighttpd/newwebtest* on tools-webgrid-01? and delete them?
[01:30:56] * andrewbogott looks
[01:31:43] root@tools-webgrid-01:/var/run/lighttpd# ls newweb*
[01:31:44] newwebtest.conf
[01:31:53] ^ like that one?
[01:32:30] hedonil: ?
[01:32:33] andrewbogott: hmm, was looking for some orphan'ed pid's
[01:32:49] Ah, nope.
[01:33:03] andrewbogott: what's the content of this conf?
[01:33:35] um… context?
[01:34:04] I don't understand the question.
[01:34:15] andrewbogott: I had 4 webs running http://tools.wmflabs.org/paste/view/5b16fecb killed all, now I can get noc connection
[01:35:20] andrewbogott: I don't know how this happened.
[01:36:11] andrewbogott: http://tools.wmflabs.org/newwebtest/ 503 w/apache & w/ newWeb
[01:36:40] o/~ Here I come to save the day! o/~
[01:36:43] Why did you ask about webgrid-01 in particular?
[01:37:03] andrewbogott: it's running there
[01:37:13] Oh, I know what's going on.
[01:37:28] * andrewbogott doesn't
[01:37:43] It's my fault: I disabled dynamic mapping while the webservers were crumbling, and didn't reenable them yet.
[01:37:54] So now, the proxy doesn't know where the lighttpds are
[01:37:57] * Coren fixies.
[01:39:14] Try it now, hedonil?
[01:39:21] andrewbogott: I'm not sure whether that error is a dns resolution problem. The other one I was seeing is though, but I haven't seen that one in testing right now.
[01:39:28] * hedonil tries
[01:39:58] local-newwebtest@tools-login:~$ qstat
[01:39:58] error: unable to send message to qmaster using port 6444 on host "tools-master.pmtpa.wmflabs": can't resolve host name
[01:40:14] Hey, now that sounds like what anomie|away was seeing!
[01:40:15] almost
[01:40:17] andrewbogott: ^ that's the resolution error I was seeing
[01:40:19] Ah. We still have the resolution error.
[01:40:36] Coren: andrewbogott: yeah, but no issue, disappears
[01:40:42] Coren: so, is that the dns-timeout thing that we talked about earlier?
[01:40:44] I'm also still seeing it on and off
[01:40:49] * andrewbogott curses
[01:40:50] Coren: Ahhhhh
[01:40:59] andrewbogott: Yeah, that's the same issue.
[01:41:18] Coren: Works. thx
[01:41:25] So now we know it's not related to the overload from the js thing; that has completely gone away since.
[01:42:23] Coren: but I know now what the real issue is
[01:42:40] :~$ webservice restart
[01:42:40] Retarting webservice....
[01:42:43] hedonil: I'm talking about the DNS resolution thing.
[01:43:00] Oh! Restarts don't actually get rid of the previous one?
[01:43:06] Coren: Retarding webservice by default :-D
[01:43:45] Coren: there's a typo in your script
[01:44:14] Oh, no, that's by design. Lighttpd runs on lemon tarts; this supplies a new one. Hence, "retarting" :-)
[01:44:26] Yeah, that's the ticket. :-P
[01:44:39] Coren: buhh..
[01:49:50] Coren: So, presuming 'dig' returns non-zero on a timeout...
[01:50:04] I should be able to test this by just running dig until it returns nonzero
[01:50:09] which I'm doing now and it never does :(
[01:50:29] ... does it, actually? (Tries that)
[01:51:31] If there's a bug with openstack networking somehow, it'd only show from an instance.
[01:51:37] Ah, it returns 9 on timeout.
[01:52:07] Aha. It's not a timeout; it returns NXDOMAIN!
[01:52:32] andrewbogott: http://pastebin.com/kru1GFCC
[01:52:41] andrewbogott: About 5 seconds between attempts
[01:52:49] ok, that's not right...
[01:52:56] why would it...
[01:53:35] Isn't NXDOMAIN a positive response? Like, 'I know for sure that this hostname doesn't exist'? As opposed to 'I can't find this hostname'?
[01:53:37] Or could it be either?
[01:55:42] andrewbogott: It's a positive response, but from within the openstack there is a DNS proxy that might confuse things.
[01:56:14] Note: http://pastebin.com/piu4FRDS
[01:56:27] It takes 5 seconds sometimes rather than the usual couple microseconds.
[01:56:36] (from virt2)
[01:57:36] So it might be the dnsmasq thing.
[01:58:57] Nothing in the logs.
[02:00:40] * Coren bleeping hates heisenbugs.
[02:03:01] well… my test script must not test the right thing because I never see a failure :(
[02:03:14] tools-webproxy:~andrew/foo.py
[02:08:11] Ah, I managed to find the root cause behind the less clear message:
[02:08:37] error: commlib error: access denied (server host resolves rdata host "tools-login" as "tools-login.pmtpa.wmflabs")
[02:08:37] is also a DNS error, it's the master failing to resolve reverse of the client and refusing access because of it
[02:15:19] I see both failures -- sometimes timeout, sometimes NXDOMAIN
[02:15:35] A guide I'm reading suggests explicitly killing dnsmasq before restarting nova-network...
[02:15:44] I guess we haven't tried that yet, although the effects could be dramatic :(
[02:17:39] Ah, sure enough, restarting nova-network did not restart dnsmasq -- I just tried it and the dnsmasq pids remained the same.
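(Editor's note: the test described above, running dig repeatedly and telling a timeout apart from an NXDOMAIN answer, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the foo.py script mentioned in the log. The names classify and probe are hypothetical; the sketch assumes dig exits with code 9 when no server replies, as observed in the log, and that the response status appears in dig's header line as "status: ...".)

```python
import re
import subprocess
import time

# Exit code dig uses when no server replied (the "returns 9 on timeout" case).
DIG_NO_REPLY = 9


def classify(returncode, stdout):
    """Classify one dig run as 'ok', 'nxdomain', 'timeout', or 'other'."""
    if returncode == DIG_NO_REPLY:
        return "timeout"
    # dig prints the response code in its header, e.g. "status: NXDOMAIN".
    match = re.search(r"status: ([A-Z]+)", stdout)
    if returncode == 0 and match:
        status = match.group(1)
        if status == "NOERROR":
            return "ok"
        if status == "NXDOMAIN":
            return "nxdomain"
    return "other"


def probe(hostname, attempts=100, delay=5.0):
    """Run dig repeatedly and count each outcome (hypothetical helper)."""
    counts = {}
    for _ in range(attempts):
        result = subprocess.run(
            ["dig", hostname], capture_output=True, text=True
        )
        outcome = classify(result.returncode, result.stdout)
        counts[outcome] = counts.get(outcome, 0) + 1
        time.sleep(delay)
    return counts
```

(Checking only the exit code would miss the NXDOMAIN case, since dig treats any received response as success; that is why the sketch also inspects the status field. A call such as probe("tools-master.pmtpa.wmflabs") would need to run from inside an instance to exercise the same resolver path.)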
[02:17:45] so...
[02:17:50] * andrewbogott raises the hatchet
[02:21:12] still getting timeouts :(
[02:26:25] Coren: ok, either virt0 doesn't have a ganglia page or it's known by some secret alias.
[02:26:52] * Coren grumbles.
[02:26:55] But, any chance that virt0 is just CPU bound?
[02:27:00] I don't know of a magic alias.
[02:27:10] 'top' looks pretty busy with puppet compilations.
[02:27:20] although I don't know how many cores it has
[02:27:28] 90.8% idle.
[02:28:04] Well, /proc/cpuinfo reports 16 cores
[02:28:08] Yeah, you're right, I was watching the wrong numbers.
[02:28:17] A few at 100% shouldn't make it fail.
[02:29:14] virt2 is even less busy.
[02:35:20] Right.
[02:35:42] Gaah. I'm still getting the errors, but they're completely intermittent and don't seem to have any sort of pattern.
[02:36:57] Have you tried..
[02:37:31] Reedy: we have, several times :(
[02:37:39] Although it's always possible we're not turning the right bits off and on
[02:39:01] I always like to see this in a config file:
[02:39:03] # We are not using wildcards
[02:39:04] wildcards=yes
[02:39:48] we're using the other wildcards variable
[02:45:17] Ryan_Lane, interested in troubleshooting?
[02:46:41] Sure, let's dump that on him. Teach 'im right for abandoning us! :-P
[02:46:49] It's his fault anyways, clearly. :-)
[02:47:26] Well, when he shows up here late at night I assume that he's bored.
[02:48:57] Can someone fix permissions for my .ssh/known_hosts on fenari.wikimedia.org?
[02:48:59] $ touch /home/krinkle/.ssh/known_hosts
[02:48:59] touch: cannot touch `/home/krinkle/.ssh/known_hosts': Permission denied
[02:49:13] As a result I get an ssh fingerprint yes/no every time I connect to anything from fenari
[02:49:29] (or, if I can do it myself, how?)
[02:50:03] Sorry, meant to post this in -operations
[02:56:30] Krinkle: oops, I logged out. But, I think you /own/ it and just don't have +w
[02:57:05] yeah, chown is me
[02:57:08] just not chgrp
[02:57:25] Hm..
[02:57:35] I'm looking at the wrong thing
[02:58:33] Coren, I am about out of ideas and ready to punch out unless you have any other ideas about things to look for...
[02:58:51] Are you pretty sure that this problem only started today?
[02:59:12] I'm tapped out myself
[09:26:08] is the job queue backlogged
[09:26:09] ?
[09:26:23] i scheduled a job and it has been waiting for 30 min
[09:45:33] there, it went
[13:35:46] rschen7754: Yes, needz moar powar!!1!
[13:36:11] (aka grid could use some extra nodes)
[14:00:16] Coren: ..and don't forget about a bigger web-proxy box. moar RAM, moar MaxClient
[14:00:28] Coren: blockin IP's is for girls
[14:00:47] Coren: *serve ALL the requests*
[14:01:03] Coren, HTTP is working again. Thanks.
[14:01:10] hedonil: Heh. Tools is never going to scale to project-level; we don't have that much hardware to throw at the thing. :-)
[14:01:25] Cyberpower678: Yeah, now that we stopped Wikipedia users from DDOSing us. :-)
[14:01:29] Coren: tsss ;-)
[14:04:38] Coren: my impression: web-proxy was the only SPOF, other boxes fell to throe :P
[14:05:04] !cyberpower
[14:05:04] and I say what I need, get no response and realize I just wasted the effort of typing what I need.
[14:05:19] hedonil: Not quite, but it was a big part of the picture.
[14:05:35] Coren: ;-)
[14:05:38] hedonil, pong
[14:05:49] !cyberpowersresponse
[14:05:57] Uhhh
[14:06:06] !coren
[14:06:06] !cyber
[14:06:06] The toolmeister: http://www.mediawiki.org/wiki/User:MPelletier_(WMF)
[14:06:07] and I say what I need, get no response and realize I just wasted the effort of typing what I need.
[14:06:13] !Coren
[14:06:14] Coren is dead. petan killed him. He now roams about as a zombie.
[14:06:46] hedonil: There was a cascade effect (as is often the case when things like that happen) that slowed everything down, which made the problem worse, and so on.
[14:07:16] -proxy does cope with some 2000 live connections without issue.
[14:07:17] !Cyberpower is This user is made out of awesome plasma.
[14:07:18] Key was added
[14:07:33] !cyberpower
[14:07:33] and I say what I need, get no response and realize I just wasted the effort of typing what I need.
[14:07:37] !Cyberpower
[14:07:37] This user is made out of awesome plasma.
[14:07:52] Cyberpower678: =-O
[14:08:12] So the only real performance issue left is tools-db, and this is moving to real hardware in Ashburn.
[14:08:36] (which is going to help a /lot/ -- databases are really sucky in VMs)
[14:08:51] Coren: sure
[14:09:33] Also, we could use an extra grid node or two, but I'm hoping we can cope with the current ones for another month.
[14:10:02] Coren, I told you something was off with HTTP before the DDOS was even discovered. ;-)
[14:11:17] Cyberpower678: Well yeah, everyone could see things had gotten sluggish; it's figuring out /why/ that's the issue. :-)
[14:34:40] Coren, can you paste bin available tables?
[14:34:49] in the replication DBs
[14:35:03] I currently don't have access to view it myself.
[14:36:25] Cyberpower678: Connecting to the DB and using the command "show tables;" doesn't do it for you?
[14:36:51] anomie, I don't have the means to access it right now.
[15:49:29] someone has the link to the dewiki_p bug?
[15:50:33] Steinsplitter: https://bugzilla.wikimedia.org/show_bug.cgi?id=57642
[15:50:44] thx
[16:13:14] !newweb
[16:13:14] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help/NewWeb
[16:41:16] petan, where is the magic script that fills in all the instance-specific nagios configs? I have a puppetized icinga server but… no hosts to monitor :(
[16:45:29] !log wikimania-support Updated scholarship-alpha to a3cda60
[16:45:32] Logged the message, Master
[16:48:13] !log wikimania-support Ran migrations/20131225-01-iso-countries.mysql on scholarships_alpha
[16:48:15] Logged the message, Master
[16:52:41] Ping: coren
[17:22:31] does someone know details about the federated databases?
[17:37:09] Could some helpful labs user (as opposed to toollabs user) click on this and verify that it reports sensible, personalized info? https://wikitech.wikimedia.org/wiki/Special:NovaResources
[17:37:20] note, that page takes forever to load :(
[17:39:06] andrewbogott: The info seems personalized.
[17:39:27] andrewbogott: I'm a helpful labs user - with no rights. For me it reports nothing, despite 2 headers
[17:39:34] anomie: Cool. And did the page take so long to load that you thought the internet was broken?
[17:39:40] No
[17:39:51] hedonil: oh yeah, maybe I should customize those headers if the lists are empty :)
[17:40:10] thanks y'all
[17:46:03] coren or other: i cannot find the supressionlog table?
[17:50:34] Steinsplitter: Is there a suppressionlog table at all?
[17:51:31] anomie: on toolserver is a supressionlog table (i think :/)
[17:51:39] * anomie greps MediaWiki and finds a "suppressionlog" user right, but no reference to a "suppressionlog" table
[17:51:57] --> delete --> supress
[17:55:43] Steinsplitter: I'd expect that to be in the normal logging table with log_type 'suppress'. But those entries aren't available on the tool labs replicas because they aren't publicly visible on the wikis.
[17:56:14] :/ :(:(
[17:56:48] I don't know if toolserver might have done some sort of special setup to expose those as a "suppressionlog" view for certain authorized users, you'd have to ask someone familiar with toolserver about that.
[17:59:53] ping: coren^^
[18:22:44] * Coren catches up on scrollback.
[18:23:56] Steinsplitter: anomie is correct, suppression log entries are in the logging table but are... suppressed on the replicas.
(They are not public information and even the fact that suppression has taken place is deemed private given the nature of the material)
[18:27:29] thx
[18:37:12] Steinsplitter: Read:
[18:37:16] !newweb
[18:37:17] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help/NewWeb
[18:39:28] Steinsplitter: ^^ there
[18:54:38] Hi
[18:54:58] Hello, Myst```
[18:55:19] I want to know if it's possible to request install of a package on tools-login
[18:56:28] Hi Coren
[18:56:33] it is Myst```
[18:57:14] matanya: ?
[18:58:29] yes Myst```
[18:58:54] And where ? :p
[18:59:10] bz Myst```
[18:59:39] ???
[18:59:59] Myst```: bugzilla.wikimedia.org
[19:00:14] Ok, thanks
[19:06:06] What's the difference between -dev and -login, what should i do on -login ? on -dev ?
[19:10:00] Coren: Ping?
[19:10:06] Pong.
[19:10:25] Coren: DDoS resolved I presume? :p
[19:10:49] Myst```: They are functionally identical, but we request that if you need to do heavy stuff (compiles, large data crunching) you do so on -dev so as to not impact interactive performance on -login.
[19:10:53] a930913: Indeed it is.
[19:11:34] Coren: So can you help me with the lighttpd config? :)
[19:11:48] a930913: Probably. What, exactly was your issue?
[19:12:01] Coren: Need to send a header on a page.
[19:12:30] Coren: thanks
[19:12:34] Most often, the easiest thing you can do is simply output it from your script. What's it written in?
[19:13:02] Coren: Javascript? :p
[19:14:52] How unusual. :-) Likely, there's a simple method to output headers. You're using node.js?
[19:15:32] Coren: No, I'm using javascript for its intended use.
[19:15:59] Coren: you mentioned adding another exec node or 2. Could that maybe be done soon-ish? http://pastebin.com/raw.php?i=6TUxRt6u
[19:16:08] Wait, what do you run on the /server/ I meant.
[19:16:31] Coren: Nothing, just serving some js script with an access control header.
[19:16:53] MrZ-man: I was hoping I could delay until the move, but clearly that won't do.
:-(
[19:17:47] a930913: Aaah! Then you really need to do it with the server itself. In your .lighttpd.conf add:
[19:18:01] setenv.add-response-header = ( "The-Header" => "Your-Value" )
[19:18:15] Coren: Yes, "19:11 < a930913> Coren: So can you help me with the lighttpd config?" ;)
[19:18:36] Coren: mod_setenv is already imported?
[19:18:53] Coren: +=
[19:19:06] a930913: Yes, hedonil is right, you probably want +=
[19:19:26] Coren: However mod_setenv needs to be imported before some other modules.
[19:20:21] a930913: And yes, setenv should already be there.
[19:20:25] Hm.
[19:20:30] Actually, it doesn't seem to be.
[19:20:33] * Coren adds it.
[19:20:51] That should sort it :)
[19:22:26] a930913: Stop and restart your webservice and it should have setenv.
[19:24:34] Coren: error.log has a bunch of errors that don't look related to my end.
[19:25:06] What tool?
[19:25:40] Hmm, restart!=stop+start
[19:26:16] Coren: Oh joy, from chrome: "Access-Control-Allow-Origin:en.wikipedia.org" ^_^
[19:26:47] Yeay!
[19:26:57] Coren: Thanks for sorting it :)
[19:27:30] Now the question is, do I now test this, or do my other coursework? :p
[20:34:37] what's up SGE?
[20:37:47] Nothing is up, insofar as it's doing its job. We're just out of resources.
[20:37:52] I'm adding a node now.
[20:46:40] New eqiad NFS server fun:
[20:47:06] /dev/md123 7.3T 51M 7.3T 1% /srv/scratch
[20:47:06] /dev/md122 9.1T 40M 9.1T 1% /srv/dumps
[20:47:07] /dev/mapper/store-backups 20T 20K 20T 1% /srv/backups
[20:47:07] /dev/mapper/store-project 30T 20K 30T 1% /srv/project
[20:47:22] Not counting the ~22T spare space I keep around for expansion.
[20:48:49] More space to fill with multi-gigabyte log files!
[20:48:53] * valhallasw rejoices
[20:53:57] Coren: yeah, always think big. http://www.flickr.com/photos/110698835@N04/11221971226/lightbox/ ;-)
[21:20:22] Damianz: Could you take a look at cluebot3? It seems to be running multiple times atm.
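(Editor's note: the response-header change discussed at 19:17 to 19:26 was confirmed in the log by looking at the headers in Chrome. The same check could be scripted; the following is a minimal sketch under stated assumptions, not something used in the conversation. The helper names header_matches and check_url are hypothetical, and the expected value is simply whatever was configured in .lighttpd.conf.)

```python
from urllib.request import urlopen

# Header name taken from the log; the expected value is supplied by the caller.
HEADER = "Access-Control-Allow-Origin"


def header_matches(headers, expected, name=HEADER):
    """Return True if `headers` carries `name` with exactly `expected`.

    Header names are compared case-insensitively, as HTTP requires.
    """
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value.strip() == expected
    return False


def check_url(url, expected):
    """Fetch `url` and check the response header (hypothetical helper)."""
    with urlopen(url) as response:
        return header_matches(response.headers, expected)
```

(Keeping the comparison in header_matches separate from the network call means it can be exercised with a plain dict, without a running webservice.)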
[22:24:22] grid nodes++
[22:41:10] Is there an SMTP relay in labs for apps that want to send mail? I tried using smtp.pmtpa.wmnet but it seems to want authentication
[22:41:47] This is for my alpha testing of the Wikimania Scholarships application
[22:50:43] !log wikimania-support Updated scholarship-alpha to b738f51
[22:50:45] Logged the message, Master
[23:15:58] anomie, hedonil: Does https://wikitech.wikimedia.org/wiki/Special:NovaResources still look reasonable? (hedonil, hopefully the headers are more accurate now)
[23:16:32] * hedonil logs in to check
[23:16:37] andrewbogott: Yes. And still not incredibly slow, BTW.
[23:17:07] andrewbogott: yeah same, just headers.
[23:18:07] andrewbogott: perfect privacy
[23:18:16] * hedonil congratulates
[23:20:07] 'k
[23:20:09] thanks