[07:24:28] Hi.. I am new to the wikiLabs.. can anyone help me to get started with it?
[07:43:31] anyone know how NFS is structured for beta cluster's deployment-apacheNN servers? Trying to figure out from where they serve the latest php
[09:43:14] !log gwtoolset Created instance i-00000962 with image "ubuntu-12.04-precise" and hostname gwtoolset.pmtpa.wmflabs.
[09:43:15] gwtoolset is not a valid project.
[09:43:25] !log glam Created instance i-00000962 with image "ubuntu-12.04-precise" and hostname gwtoolset.pmtpa.wmflabs.
[09:43:28] Logged the message, Master
[09:48:37] !log glam Successfully associated 208.80.153.148 with instance ID 6e183cfc-2088-42ad-b478-73bdb2221c95.
[09:48:39] Logged the message, Master
[09:50:38] !log glam Successfully added gwtoolset entry for IP address 208.80.153.148.
[09:50:39] Logged the message, Master
[10:06:43] anyone know why i can't edit the .bash_profile in the instance i just created? i get a system write error ...
[10:08:06] in fact, i can't create anything in my home dir ...
[10:08:50] anyone know what might cause that issue?
[10:21:50] !log glam installed apache2
[10:21:52] Logged the message, Master
[10:33:36] !log glam installed mysql5.5
[10:33:39] Logged the message, Master
[10:36:12] !log glam installed php5, php5-dev
[10:36:14] Logged the message, Master
[10:37:42] !log glam installed liblua5.1-dev
[10:37:44] Logged the message, Master
[10:41:24] hashar: or Coren, do either of you know why my home dir in an instance i just created will not let me add anything to it?
[10:42:34] dan-nl: maybe the instance is not fully booted up?
[10:42:38] iirc it needs puppet to have run on it
[10:42:50] though that is usually done on start
[10:42:53] or
[10:42:58] that is a bug :-]
[10:43:33] the instance is up and running as far as i can tell, and "The last Puppet run was at Mon Oct 28 10:38:10 UTC 2013 (1 minute ago)."
[10:45:52] can you log on it at least?
[10:45:57] double check your id
[10:46:01] and the file ownerships
[10:46:10] maybe the filesystem has been mounted read only
[10:46:45] i can log in, i have been installing software on it
[10:47:58] when you say file ownership … the permissions are dan-nl wikidev
[10:48:02] on my home dir
[10:48:51] i can run as root to install software, but not even root can affect my home dir. how do i mount it so that i can write to it?
[10:52:12] no clue honestly
[10:52:19] I would try running puppet: sudo puppetd -tv
[10:52:22] and restart the instance
[10:52:38] k, will try that
[10:54:17] restart from the command line or from the wikitech site?
[10:55:30] strange, still says touch: cannot touch `test': Read-only file system
[10:55:43] from command line
[10:55:47] that is the same anyway
[10:55:48] sudo reboot
[10:55:49] :D
[10:56:00] ah, yes, did it from the cli
[10:56:00] I guess your /home got mounted readonly
[10:56:15] ja, how can i make it mount as read/write?
[10:56:38] would Ryan_Lane know?
[11:00:46] hashar, in any case, thanks for thinking about it with me
[11:05:31] dan-nl: coren should be there in a couple hours or so
[11:05:39] he is on the east coast of Canada
[11:05:50] cool, thanks, i'll ask him
[13:00:23] Coren: when you have a moment: when i log into the instance gwtoolset.pmtpa.wmflabs, i cannot write to my home directory. i get the message: touch: cannot touch `test': Read-only file system. any ideas on what might be wrong? i just created this instance today.
[13:16:42] !log deployment-prep restarted elasticsearch nodes to pick up new config
[13:16:49] Logged the message, Master
[13:18:36] dan-nl: It's our friend Gluster having a fit.
[13:19:10] It'll require being kicked and yelled at. I'll be able to do it for you in ~30m after I've had breakfast and coffee. :-)
[13:20:12] Coren: okay, no need to kick or yell, just a kind request to fix itself is fine for me :) enjoy your breakfast
[13:20:44] No, no, gluster really deserves some violence. :-)
[13:45:43] Coren: looks like the web servers have gone belly up again
[13:46:44] WFM in HTTPS; if it's stuck in HTTP let's hope it lasts long enough for me to see where the issue is this time.
[13:48:16] yep http down https up
[13:49:17] I see it in the logs.
[13:49:21] * Coren investigates.
[14:00:42] Oh, ffs. I *hate* heisenbugs.
[14:01:02] I know a couple of things it /isn't/, at least.
[14:01:30] Betacommand: Apparently, even under http, tools which use the lighttpd setup don't seem affected.
[14:03:58] What I don't get is that the proxy folds everything under http internally. The actual webservers shouldn't even be able to tell, let alone behave differently.
[14:06:04] Betacommand: Unless you need some apache-specific stuff, you also get FCGI for free with that setup.
[14:07:49] Coren: gotta love those types of bugs, it's what makes debugging *fun*
[14:08:38] I prefer my bugs deterministic tyvm. :-)
[14:09:15] Coren: I prefer not to have bugs :P
[14:20:22] Coren: AnomieBOT's (supposedly) lighttpd-using stuff is currently not loading via http. My local proxy reports "the connection to tools.wmflabs.org (208.80.153.201) could not be established". curl -v (not through my local proxy) connects and sends headers, but then doesn't seem to get any response. telnet to 208.80.153.201 port 80 similarly seems to connect, but eventually timed out with no data printed after I sent "HEAD /anomiebot/ HTTP/1.0".
[14:21:41] anomie: Aha! That helps; it's clearly the proxy that's ill then.
[14:22:02] Coren: Hmm. Suddenly that curl connection received a response. Took forever though.
[14:22:56] That looks suspiciously like the proxy is trying something, timing out, falling back to something else and /that/ works.
[14:23:13] HTTP date header indicates 14:21:33 GMT, despite the headers being sent several minutes before that. Which supports your theory.
[14:23:20] hmmm. Could it be trying IPv6 first? I don't think there are supposed to be AAAA records.
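As an aside on the read-only /home that dan-nl hit earlier (touch: cannot touch `test': Read-only file system): before blaming ownership or permissions, a script can check whether the filesystem holding a path is actually mounted read-only via statvfs. A minimal sketch, assuming nothing beyond the Python standard library; the paths used are only examples:

```python
import os

def is_readonly_mount(path: str) -> bool:
    """Return True if the filesystem holding `path` is mounted read-only.

    os.statvfs() exposes the mount flags of the filesystem; the
    ST_RDONLY bit is set when the mount is read-only, which is exactly
    the condition that makes even root unable to write (permissions are
    irrelevant at that point).
    """
    st = os.statvfs(path)
    return bool(st.f_flag & os.ST_RDONLY)

# On a healthy instance the home directory should report False here.
print(is_readonly_mount(os.path.expanduser("~")))
```

This distinguishes the "Gluster had a fit and remounted /home read-only" failure from a plain ownership problem, which `ls -l` would show instead.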
[14:23:34] curl indicates 208.80.153.201
[14:23:50] Also, I don't think IPv6 works from here (which, btw, ugh)
[14:24:03] Yeah, but that's the proxy itself; I'm wondering if the proxy is trying to talk to the webservers through v6 first.
[14:24:29] * Coren tcpdumps.
[14:25:28] Coren: Fired off a new curl connection. And one from my linode in Texas. Both are currently sitting at "Trying 208.80.153.201...".
[14:26:14] * Coren needs to read packet dumps. Joy.
[14:26:55] Connection for the first curl timed out. Re-executed, this time it connected immediately, and is now sitting after sending headers.
[14:28:30] Same for the linode connection.
[14:32:32] Coren: Hmm. Both connections finally received responses within 6 seconds of each other, according to the HTTP Date headers returned. Even though they were started over a minute apart. Makes me wonder if the proxy's connection pool is filled up or something.
[14:56:22] anomie: That was my first guess, but it never got over 7/150
[14:57:22] Coren: does a reboot resolve the issue?
[14:58:01] Betacommand: Heh. Rebooting is a last resort.
[14:58:22] ... wait, I get timeouts on the original SYN now. WTF?
[15:00:50] I've been seeing slow or timed-out connects to the public IP too, besides the long waits for responses when it does connect.
[15:02:03] I wonder if it's getting better, now I see only about 2.5 minutes between headers and response instead of 4.5...
[15:02:15] I think the reason why I can't see the problem is that it's not at the server level at all.
[15:06:37] No such connection issues on port 22, nor 443.
[15:06:42] YuviPanda, how are the hands?
[15:07:36] that is... wtf?
[15:07:52] The sequence when I tcpdump a telnet to port 80:
[15:08:10] syn... syn... syn... syn... syn with the normal exponential backoff.
[15:08:47] the response? After very many secs of delay: syn ack... syn ack... syn ack... syn ack /with the same timing, only a lot of lag!/
[15:09:27] From my end here, tcpdump showed no SYN+ACK when the connect was failing. But once the connection was established, packets from my end were ACKed without delay.
[15:09:41] It looks like something is delaying synacks
[15:10:23] anomie: Unless I'm wrong, if you try it and look at a packet dump, you should see exactly as many syn acks as you sent syns, only with a long delay.
[15:11:58] said delay being variable and possibly long enough that your connect() times out before the answers get there.
[15:16:16] Coren: Still no delayed syn+acks seen here.
[15:16:40] I'm seeing them server-side.
[15:17:46] Something is behaving /really/ oddly at the network layer.
[15:19:12] Huh. I think I'm beginning to see what's going on. I think the kernel is in SYN flood protection mode.
[15:20:43] * anomie notes that the one syn+ack seen here has a TS ecr matching the TS val in the last SYN sent
[15:27:25] Yeah, it's not syn cookies as I thought.
[15:27:37] * Coren grumbles.
[15:34:33] I'm going to presume that the network stack is borked.
[15:35:00] Thankfully, a reboot takes (literally) seconds.
[15:36:15] And, indeed, that fixed things.
[15:36:26] That's... bad.
[15:50:41] does labs have a downtime competition with ts? :P
[15:55:19] Base: As far as I can tell, everything is up and full of joy. We had annoying slowdowns on HTTP earlier, but they are fixed now. What's up?
[15:56:20] some pages on ts are working faster than on labs…
[16:08:58] hey Coren did Gluster react to the kicking and yelling?
[16:09:50] dan-nl: It might have, if I hadn't been distracted by the http problem. :-( Sorry about that -- lemme go do this now.
[16:10:21] np, i'm sure Gluster was happy not to be kicked and yelled at ;)
[16:10:23] dan-nl: What's the project name?
[16:10:39] glam
[16:11:09] instance is gwtoolset.pmtpa.wmflabs
[16:11:32] Can you write to home now?
[16:12:36] yes, what was it?
[16:12:48] Gluster needed to be kicked.
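Stepping back to the SYN-ACK delays debugged above: the client-side symptom (connect() stalling or timing out while established connections flow normally) can be probed by simply timing the TCP handshake, which is roughly what the repeated curl/telnet attempts were doing by hand. A sketch using only the standard library; the local listener stands in for tools.wmflabs.org:80:

```python
import socket
import time

def connect_latency(host: str, port: int, timeout: float = 5.0):
    """Time a TCP three-way handshake against host:port.

    Returns the handshake time in seconds, or None if connect() timed
    out. A healthy service completes the handshake in milliseconds; a
    server whose kernel is delaying or dropping SYN-ACKs (full accept()
    backlog, SYN-flood protection) shows long or failed connects here
    even though data flows fine once a connection is up.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

# Demo against a throwaway local listener (port 0 picks a free port).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(5)
print(connect_latency("127.0.0.1", listener.getsockname()[1], timeout=2.0))
listener.close()
```

Run in a loop, this separates "SYN-ACK is late" from "response is slow after connect", which was exactly the distinction anomie's curl experiments teased apart.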
[16:12:57] poor Gluster
[16:13:01] thanks Coren
[16:13:34] Heh. "Poor gluster" my backside. We're all anxiously waiting for the moment when we'll be able to take it out the back door and shoot it. :-)
[16:15:07] oh boy … :)
[16:15:19] guess Gluster is a bit malicious
[16:52:07] Base: That depends greatly on how things were ported between the two. There are some things for which database queries need to be tweaked to be faster than on TS, and some older ways of doing things which will be suboptimal here. The environment has been made so that things are mostly functionally the same, but things written /for/ labs will generally be faster than those moved without tweaks.
[16:53:16] Also, the database performance profile is rather different; anything that does a few complicated/large queries will be significantly faster, but things that do a lot of small queries will be much slower unless parallelized.
[17:00:48] Coren: On that subject, have you looked at https://bugzilla.wikimedia.org/show_bug.cgi?id=56029 ?
[17:09:29] Coren, MrZ-man: Looks like we need a recentchanges_userindex view?
[17:10:29] anomie: Huh, indeed. I wonder why I didn't think of it first.
[17:15:25] anomie: Deploying now. That'll take some time because DDL needs table locks.
[17:15:38] MrZ-man: ^
[17:15:49] yay
[17:22:42] MrZ-man: If you were doing this on enwiki, that was the first in the list, so the view should already be there.
[17:25:09] Ah, right, it's only a new view so it's not going to take as long as view changes.
[17:25:25] It should actually be completely done in a few minutes, too.
[17:33:28] Coren: http is borked again
[17:34:06] thanks
[17:34:40] * YuviPanda waves at Coren
[17:36:33] Coren: both http and https are down
[17:37:40] Huh, https works for me but I got the same issue.
[17:38:12] ... and the connection table is filled with SYN_RECV again.
[17:40:22] Looks like a low-bandwidth syn flood. How... wtf.
[17:40:50] an external attack?
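Coren's point above, that workloads issuing many small queries are much slower on labs unless parallelized, comes down to per-query round-trip latency: serial queries pay it once each, while concurrent ones overlap it. A sketch with a thread pool; `run_query` is a stand-in for a real replica query, and its ~20 ms sleep is a made-up latency figure, not a measured one:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(q: str) -> str:
    """Stand-in for one small database query; the sleep simulates the
    per-query network round trip that dominates small-query workloads."""
    time.sleep(0.02)
    return f"result of {q}"

queries = [f"SELECT ... WHERE id = {i}" for i in range(50)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map preserves input order, so results line up with queries.
    results = list(pool.map(run_query, queries))
elapsed = time.monotonic() - start

# 50 queries at ~20 ms each: roughly 1 s run serially, but roughly
# 0.1 s with 10 workers, since the latency overlaps.
print(f"{len(results)} queries in {elapsed:.2f}s")
```

Threads are enough here because the work is I/O-bound waiting on the server; the same pattern applies whether the driver is MySQLdb or anything else.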
[17:41:05] Yeah, it kinda looks like a half-hearted DDOS.
[17:41:37] Just a few hundred bad SYNs per second, not enough to bring anything down but enough to really mess up the accept() queues.
[17:42:15] should it be done on enwiki? I'm getting (other) queries stuck with "Waiting for table metadata lock"
[17:42:44] MrZ-man: It should have been done long ago. Lemme go see what's up.
[17:43:40] Ah, there's a pending view update left indeed.
[17:44:16] Which itself is waiting on a really long query that's about to die.
[17:44:38] ---TRANSACTION 7D29FDCB, ACTIVE 522015 sec
[17:44:51] * Coren needs to have words with that user.
[17:46:08] Coren: woah, 6 days!
[17:46:40] YuviPanda: It'd rarely be an issue since it only had a transaction lock; but that will stall DDL.
[17:46:43] Coren: what kind of query was that?
[17:47:00] right
[17:47:09] Betacommand: Some sort of cross join between categories and pages.
[17:47:31] * Coren goes back to the SYN thing.
[17:47:54] ... which seems to be gone.
[17:48:02] * Coren stares hard at the 'net.
[17:52:21] I don't think it's a willful DDOS. It looks like something is spidering one of the projects and following links to some tools like geohack.
[17:52:33] So we get bursts of 1000s of requests.
[17:55:25] Time for IP blocking at the firewall?
[17:56:34] The traffic seems to be sufficiently legitimate that I'd rather not. I'm going to crank up the number of clients at the proxy; this should allow the faster requests to be answered without being stalled by the slower ones taking all the slots.
[17:57:51] So the server will be able to process those bursts without running out of slots.
[17:58:36] At first glance, this did the trick; the number of unanswered SYNs just plummeted.
[20:23:13] * Damianz pokes Coren
[20:23:23] * Coren pokes back! Poink!
[20:23:39] Seem to be getting a 403 when querying tools-webproxy randomly :(
[20:24:08] Weirdly, seems to only happen from my bots server, not from my browser =\
[20:24:43] 403?
[20:25:42] Yeah
[20:25:46] That's... bizarre. There's nothing in the webproxy config that should possibly be able to give 403s.
[20:25:55] And I, like, don't see the request in the access.log either it seems
[20:26:13] (Bit hard to tell as some requests are working, like every third isn't)
[20:26:15] o_O
[20:27:27] Wait, I lied - some UAs are blocked and can give 403s; but I don't see how anything else should.
[20:28:24] I just tried with the UA 'Fairy cakes' to test that theory - still get 403s back, sometimes
[20:28:54] I saw them. I was just wondering wth UA "Fairy cakes" was.
[20:29:02] lol
[20:29:15] :D
[20:34:57] Damianz: It's definitely being denied by the proxy.
[20:35:04] :(
[20:35:43] Ah, wait, I see why. You're hitting flood controls!
[20:36:33] lol
[20:36:43] This bot makes like dozens of requests a second at peak
[20:36:59] Really I need to get time to move it to tools, then this can directly query the db and skip out https
[20:37:02] s/s$//
[20:37:05] err
[20:37:08] s/$/s/
[20:37:31] Yeah, that's what's going on. You're hitting mod_evasive's triggers because you're hitting the same page over and over.
[20:38:11] Yep :)
[20:38:21] The table 'vandalism' is full < explains my other problem :(
[20:38:37] or you could access the db from bots ;)
[20:38:37] * Damianz wonders
[20:39:02] giftpflanze: Didn't think that worked quite well, yet
[20:39:16] giftpflanze: That's like putting a fresh coat of paint on a car before you have it towed to the junkyard.
[20:39:21] The bots sql server has caused me no end of pain since someone deleted one of them after breaking the backups
[20:39:34] 10271043 requests since may, not bad
[20:39:41] i was not serious
[20:39:46] Only 2.5G of logs - still need logrotate
[20:40:18] Heh. Nowhere near as bad as some labs miscreants who ended up with terabyte logs. :-)
[20:40:29] Damianz: it works for me
[20:41:09] Main bot logs get to like 50GB but they rotate, spamming wikipedia api
[20:49:47] YuviPanda: up?
[21:17:32] Hi! Is there an equivalent on Labs of the toolserver's centralauth/globalusers table?
[21:17:45] I need to access the centralauth user accounts
[21:19:08] what is the right way to iterate over a revision table? select rev_id from revision limit 10 offset 100; with incrementing offset?
[21:19:51] lbenedix: why do you cut it?
[21:20:18] what do you mean?
[21:20:29] I should do it without limit and offset?
[21:20:34] why not omit the limit?
[21:20:42] then it is one query and faster
[21:20:58] but it needs a lot of ram
[21:21:05] no?
[21:21:14] depends on the implementation
[21:21:37] like for e.g. mysqltcl there are 3 to choose from
[21:21:37] I want to run a python script on tool labs
[21:24:07] https://ganglia.wmflabs.org is down SSL_CONNECTION_ERROR
[21:24:26] hm.. works fine without https.
[21:24:30] ireas: afair centralauth is not yet replicated; Coren?
[21:24:57] It has been for some time, giftpflanze
[21:25:05] oh
[21:26:02] giftpflanze, Coren: i just found it by looking at the replica script: `sql centralauth`. thanks anyway! :)
[21:28:13] How hard is it to get stuff installed on tool nodes?
[21:30:39] apropos: Coren: what about my tdbc packages?
[21:33:17] that's almost 2 months now :p
[21:33:31] giftpflanze: I get out of memory if I remove the limit and offset, even if I only fetch one row from the cursor
[21:33:47] no, wait, 4
[21:34:06] lbenedix: that surprises me
[21:34:48] lbenedix: do you have means that are not so memory-consuming?
[21:35:36] the query is: select rev_id, rev_page, rev_user, rev_user_text, rev_timestamp, rev_deleted from revision
[21:36:14] * Damianz waits for node to compile
[21:43:20] giftpflanze: Out of memory (Needed 1173888 bytes) is the error message
[21:44:30] well, then i guess you have to slice it as you mentioned
[21:46:12] hm… am I the only one who'd like to join data from different projects => hosts and therefore could need the FEDERATED engine?
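On the mod_evasive flood-control 403s Damianz hit above: a bot making dozens of requests per second to the same page will trip server-side rate triggers no matter what its user agent is. The usual fix is a client-side throttle. A minimal sketch; the 5 requests/second ceiling is an assumed safe rate, not a documented proxy limit:

```python
import time

class Throttle:
    """Enforce a minimum interval between requests so a client stays
    under a server-side flood trigger (mod_evasive counts hits to the
    same page within a short interval)."""

    def __init__(self, max_per_second: float = 5.0):
        self.min_interval = 1.0 / max_per_second
        self._last = 0.0

    def wait(self) -> None:
        """Sleep just long enough to respect the configured rate."""
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

throttle = Throttle(max_per_second=5)
for _ in range(3):
    throttle.wait()
    # fetch_page(...) would go here; it is a placeholder, not a real API.
```

This also smooths out the request bursts that were filling the proxy's accept slots earlier in the day.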
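On lbenedix's question about iterating over the revision table with an incrementing OFFSET: that gets slower as the offset grows, because the server must scan and discard every skipped row on each query (the "select with limit and offset seems to take more time for bigger offsets" observation later in the log). Seeking past the last primary key seen keeps every batch equally cheap. A sketch using sqlite3 as a stand-in for the labs replica; the minimal `revision` table here is illustrative only:

```python
import sqlite3

# Toy stand-in for the replica: a thousand-row `revision` table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
db.executemany("INSERT INTO revision VALUES (?, ?)",
               [(i, i % 10) for i in range(1, 1001)])

def iter_revisions(db, batch_size=100):
    """Yield all rows in primary-key order without OFFSET: each batch
    seeks past the last rev_id seen, so cost per batch stays constant
    no matter how deep into the table we are."""
    last_id = 0
    while True:
        rows = db.execute(
            "SELECT rev_id, rev_page FROM revision"
            " WHERE rev_id > ? ORDER BY rev_id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        yield from rows
        last_id = rows[-1][0]

total = sum(1 for _ in iter_revisions(db))
print(total)  # 1000
```

The same WHERE-on-primary-key pattern works verbatim against MySQL, and only `batch_size` rows are ever held by the server per query.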
[22:03:37] ireas: There really aren't any plans for federating anything other than commons and wikidata.
[22:04:18] Coren: okay, thanks for the information
[22:08:48] can someone help me with using the grid engine?
[22:13:06] lbenedix: maybe, what do you need to do?
[22:13:44] I have written a python script that queries the database
[22:13:59] I think this is a non-trivial task and should be done with the grid thingy
[22:14:21] right now I print logging messages about the status to stdout
[22:14:41] and the result is written to a file
[22:14:48] Well… you can use the grid to launch and maintain a script. That's (the only thing) I use it for.
[22:15:15] I'd like to know the status while it's running
[22:15:33] Yes, the grid engine will collect stdout in a logfile by default.
[22:15:59] where is this logfile?
[22:16:04] jstart -N jobname /path/to/script
[22:16:21] I believe it's written to the homedir of the tool.
[22:16:28] So you would 'become' the tool name before running the above.
[22:18:02] what is -N ?
[22:18:40] -N is optional, it gives a name to the job in the grid.
[22:19:28] okay
[22:19:35] I can see this with qstat
[22:20:10] yep!
[22:20:17] Are you getting stdout like you want?
[22:20:41] and .err telling me it can't find the output file
[22:21:13] seems reasonable :)
[22:21:23] I need absolute paths?
[22:21:29] Hm… not sure.
[22:21:49] I'd think so, since the grid may kill/restart jobs as needed
[22:21:54] e.g. if one of the hosts goes down
[22:22:25] I hope it won't kill my job...
[22:23:47] Well… if you want a persistent service it will need to be able to restart gracefully.
[22:24:19] It has to run only once
[22:24:36] I don't understand...
[22:24:44] Does it persist, or is it just a single batch job?
[22:24:56] one job
[22:25:12] Oh… there may not be much advantage to using the grid, in that case.
[22:25:24] grid is mostly for persistent interactive things
[22:25:33] Does it take hours to run?
[22:25:40] I'm not sure
[22:25:55] It iterates over the wikidata revision table
[22:29:40] writing the files seems not to work...
[22:31:45] how so?
[22:32:30] I have no idea
[22:32:41] I mean, what is not happening that you expect to happen?
[22:33:19] when I run it with python foobar.py the file is created; when starting it with jsub python foobar.py it's not
[22:33:40] what file?
[22:33:57] Is the file opened by the script, or are you talking about a >> redirect?
[22:34:39] If the file is opened by the script, I'd advise using an absolute path for the file and making sure it's on a shared volume, someplace under /data/project or /home
[22:36:54] okay... i get a permission denied now, should be a solvable problem now
[22:37:11] cool :)
[22:37:56] thanks a lot
[22:38:33] sure!
[22:38:43] I'll try the script without limit now and if it doesn't crash after 30min I'll go to sleep
[22:42:12] I don't get any output in ~/foobar.out but the results file is written
[22:43:37] * lbenedix crosses fingers
[22:46:19] Coren: Is tool-db backed up etc?
[22:49:48] Damianz: No.
[22:50:18] * Damianz finds his scripts to stick in crontab 15 times
[22:52:10] Test
[22:52:59] select with limit and offset seems to take more time for bigger offsets...
[23:02:30] hey, we have a 'wikidev' group on one of our instances at ee, but I can't add myself to it with usermod -a -G wikidev werdna
[23:02:44] I get no output, and then when I log out/in again, it doesn't appear in $ groups
[23:05:17] This appears in /var/log/syslog: Oct 28 23:04:35 ee-flow nslcd[1117]: [a293c5] error writing to client: Broken pipe
[23:25:58] lbenedix: The joys of mysql.
[23:26:17] lbenedix: I hear some people have had success with cursors, provided they fetch several dozen rows at a time.
[23:56:48] Coren: I got an out of memory when I tried running the script without limit and offset, just calling cursor.fetchmany(10) for testing
[23:57:12] What language is this; python?
[23:57:17] yes
[23:58:02] I don't know enough about python memory allocation to do more than make a guess, but it's possible that your program heap grows beyond its allocated limit before it tries to collect garbage?
[23:58:29] If you have an explicit method of freeing rows you've fetched, that might do the trick.
[23:58:54] I think I might actually have sorta got the bots working on tools (finally), even if in a slightly hackish way (yay)
[23:59:02] * Damianz waits for it to all implode at the worst possible moment
[23:59:06] I'm not sure... I'll see how far it gets tonight... have to sleep now ;)
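A likely explanation for the out-of-memory lbenedix kept hitting, even with `fetchmany(10)`: MySQLdb's default cursor buffers the entire result set client-side the moment the query executes, so fetching one row at a time changes nothing. The driver's server-side cursor (`MySQLdb.cursors.SSCursor`) streams rows instead, and then batched fetching actually bounds memory. A sketch of the batching pattern, using sqlite3 (which streams natively) so the example is self-contained:

```python
import sqlite3

def stream_rows(conn, sql, params=(), batch_size=50):
    """Yield rows a batch at a time via fetchmany(), holding at most
    `batch_size` rows in memory. With MySQLdb this pattern only bounds
    memory if the connection uses MySQLdb.cursors.SSCursor; the default
    cursor has already buffered the whole result set client-side before
    the first fetchmany() call runs."""
    cur = conn.execute(sql, params)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield from batch

# Self-contained demo table standing in for the wikidata revision table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO revision VALUES (?)",
                 [(i,) for i in range(500)])
count = sum(1 for _ in stream_rows(conn, "SELECT rev_id FROM revision"))
print(count)  # 500
```

This matches Coren's closing advice: cursors work, provided you fetch a few dozen rows at a time, and (for MySQLdb specifically) provided the cursor is a server-side one.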