[00:16:22] physikerwelt: Was the instance name i-00000906.pmtpa.wmflabs?
[00:16:50] as far as I remember yes
[00:17:08] but it was deleted and I created a new one
[00:17:30] now using labs-vagrant... seems to work really nicely so far
[00:18:25] physikerwelt: No, I mean the "another project is printed out"; was the output similar to http://permalink.gmane.org/gmane.org.wikimedia.labs/1976?
[00:19:16] yes
[00:20:07] but after the issue was manually fixed a wrong project and a wrong hostname were listed
[00:23:49] Hmmm.
[00:25:33] scfc_de: Do you have a similar problem?
[00:27:17] No, I had the issue in the mail, but waiting for some Puppet runs in a row fixed it. I've never seen a wrong project though; only the wrong instance name (906).
[02:20:09] Hi.
[02:20:15] Can someone help me debug a bot?
[02:20:18] * huh is stuck
[02:58:08] wikitech down?
[02:58:25] I can't ssh into production right now, so there's nothing I can do
[02:58:32] andrewbogott_afk, Coren: ?
[02:58:42] paravoid: ?
[02:58:46] * Coren checks.
[02:59:04] Times out, even? Huh.
[02:59:58] There seems to be a script that's eating all the apache workers.
[03:01:16] Ryan_Lane: Seems to be back up after a graceful.
[03:01:38] a script?
[03:01:46] which script?
[03:02:37] Usr /usr/bin/nova
[03:02:52] what in the world is running that?
[03:02:56] apache
[03:03:02] I don't see how
[03:03:14] * Coren looks at the logs.
[03:05:08] hm. weird. I can't ssh into sartoris-deploy with my new key
[03:05:16] Ah, no, correlation != causality. The same thing was starting both; salt was doing something that caused a LOT of gets and puts to puppetmaster; /those/ were the ones who ate all the workers
[03:05:17] but I can log into bastion-restricted
[03:05:36] salt was?
[03:05:48] any idea what salt was doing?
[03:05:58] that also doesn't really make any sense :)
[03:06:09] ~200 salt-masters working. I'm guessing they were forcing puppet runs.
[03:06:15] * Coren tries to find salt logs.
[03:06:45] should be about 50 salt masters
[03:06:47] or maybe 25
[03:06:55] and they listen to minions
[03:07:04] they don't actually run anything
[03:07:21] Well no, but they'll cause the minions to run stuff.
[03:07:32] the minions are all in labs
[03:07:37] Same difference. Apache logs show pretty much every puppetd calling it at once.
[03:07:38] and it just publishes something
[03:07:42] oohhhhhhhhh
[03:07:44] I see
[03:07:52] calling in*
[03:07:58] I thought you meant a bunch of things running on the master itself :D
[03:08:10] someone must have forced a puppet run in labs
[03:08:14] I wonder who
[03:08:22] yeah, that definitely needs to run as a batch
[03:08:39] * Coren can't find it in the logs. Mostly overrun by an error.
[03:09:43] Coren: can you see why I can't log into any of the sartoris-* nodes?
[03:09:53] * Coren goes check.
[03:10:15] I can to bastion-restricted and to nova-precise2
[03:10:32] I have a new key, but that should be via nfs anyway
[03:10:41] Jan 9 03:04:27 sartoris-deploy sshd[9312]: Connection closed by 10.4.0.85 [preauth]
[03:10:51] Wrong line
[03:10:57] Jan 9 03:04:27 sartoris-deploy sshd[9312]: input_userauth_request: invalid user laner [preauth]
[03:11:05] ah
[03:11:11] can you restart nslcd and nscd?
[03:11:37] You exist again. Just -deploy or are there others?
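The [03:08:22] remark about running forced puppet runs "as a batch" maps onto salt's standard batching option. A minimal sketch, assuming the runs were triggered from the salt master; the batch size of 10 and the puppetd invocation are illustrative only, not what was actually run that night:

    # force a puppet run on all labs minions, but only a handful at a time,
    # so the puppetmaster's apache workers aren't all consumed at once
    salt -b 10 '*' cmd.run 'puppetd -tv'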
[03:12:14] from virt0: salt -G 'fqdn:sartoris*' service.restart nslcd
[03:12:18] from virt0: salt -G 'fqdn:sartoris*' service.restart nscd
[03:12:19] ;)
[03:12:38] there's sartoris-server and sartoris-deploy
[03:12:46] and sartoris-target
[03:12:51] Bleh; globs in fqdns scare me.
[03:12:53] ah. crap
[03:13:05] I have a custom salt master on those
[03:13:28] * Coren does it by hand.
[03:13:32] which is why they are still broken
[03:13:38] I should really set those up as syndics
[03:14:11] You should be okay now on all three.
[03:14:17] cool. thanks!
[03:15:28] we need to get all my OpenStackManager changes merged in
[03:15:54] anyway, I'll discuss that when I get home :)
[03:16:07] I needed access to my trebuchet dev environment for now
[03:16:27] I should really have sartoris deleted and a trebuchet project created
[03:16:46] * Ryan_Lane creates a trebuchet project
[03:18:00] I really need to set up vagrant for this
[04:20:44] Late to the party… why did wikitech go down?
[04:20:55] Someone made a salt puppet update on every labs node at once?
[04:35:04] server reached MaxClients setting
[04:35:04] springle: Nothing. Just RECOVERY - HTTP being on the host where it's located
[04:35:04] is maxclients really small?
[04:35:04] 150
[04:35:05] Sounds pretty busy
[04:35:06] puppet is busiest on top
[04:35:08] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 1.948 second response time
[04:35:12] andrewbogott: ^
[04:36:20] But there's no theory for why MaxClients was reached?
[04:37:28] here we go
[04:37:29] --> morebots (local-more@208.80.153.165) has joined #wikimedia-operations
[04:37:29] phusion_passenger exception then apache came back
[04:37:29] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.179 second response time
[04:37:29] !log wikitech down. restarted apache on virt0. phusion_passenger exception + MaxClients hit
[04:38:50] phusion_passenger was running on virt0?
[04:39:35] (For the record, I already read that log in -operations, but didn't understand it.)
[04:39:48] Or, at least, didn't understand the 'why did this happen, how can we keep it from happening again' part :)
[04:40:24] Ah
[04:40:38] I don't think Sean spent that much time on it
[04:40:55] Ryan and Coren discussed it in this channel as well, but I don't follow...
[04:41:05] Anyway, clearly everyone is asleep now, so probably I should be too :)
[04:41:21] I should really be too...
[04:41:25] 04:41 here
[04:41:45] yikes!
[09:14:53] pfffft
[09:15:03] wrong window...
[09:58:32] .seen kartik
[09:58:46] (03PS1) 10Nikerabbit: Add ContentTranslation to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/106501
[12:58:02] Hello, when I do a "python welcome.py" from tools-login for testing it works, but when I do it with jsub, it does not, and gives an error: http://pastebin.com/raw.php?i=xmhAc0xm I am not sure if it's a problem with jsub. Can anybody help?
[13:01:26] I did "python welcome.py -edit:0 -time:0 -limit:5 -sul" for trying and then tried "jsub -once welcome.py -edit:0 -time:0 -limit:5 -sul" with jsub. First one works, other one does not. And I don't know what the problem is.
[14:02:57] Tanvir: it might be that your first job hasn't run yet
[14:27:09] matanya, you saw the error log? I think it's running.
[14:27:31] yes, i don't really know. just an idea though
[15:06:50] Coren: you around?
[15:08:08] Betacommand: Yep. What can I do for you?
[15:09:25] is there a limit to the number of forks a job can have?
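One guess at the jsub failure described at [12:58:02] and [13:01:26], offered as a sketch only (the actual error is in the pastebin and isn't quoted here): the grid runs the job in a much barer environment than an interactive shell on tools-login, so a script started as "python welcome.py" may end up running without the intended interpreter or from a different working directory. Being explicit about the interpreter and an absolute path sometimes helps; the pywikipedia directory below is a hypothetical location:

    # be explicit about the interpreter and the full path to the script
    jsub -once python $HOME/pywikipedia/welcome.py -edit:0 -time:0 -limit:5 -sul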
[15:14:09] Betacommand: Not as such, but the vmem limit is cumulative.
[15:21:21] Ok, I'll just need to set it higher then
[15:21:52] I've got a parsing script that can have as many as 25 children
[15:22:59] Gotta love python, one of the few languages with true multi-processing support
[15:31:54] Coren, did y'all figure out what caused that outage last night?
[15:37:58] andrewbogott: puppetmaster crumbled under every single instance starting puppet at once. Someone apparently did a salt puppetd -tv :-)
[15:38:16] Ah, yeah, that's what I got from the backscroll. No idea who the 'someone' was?
[15:38:28] 'root'
[15:38:29] :-)
[15:38:32] bah
[15:38:43] Not worth digging deeper unless it happens again, for a good trouting.
[15:39:12] Yeah, I guess it's not surprising that virt0 didn't handle that all that well.
[15:40:05] I'd have been stunned if it did; especially since the apache on it is configured very conservatively. It doesn't take all that much to clog up all the workers.
[15:51:33] Coren: I'm traveling today, will be back on in a few hours depending on whether or not airport wifi is working :)
[15:51:48] andrewbogott: Architecture thing?
[15:52:07] Nah, just escaping this terrible weather.
[16:53:25] !log wikimania-support Updated scholarships-alpha to 8383429
[16:53:27] Logged the message, Master
[17:48:51] coren: i'm wondering what's happening with bug #58949?
[17:50:12] I'm sure you do. The box is up; what remains is to finish its configuration and tie it into the grid. Probably done this afternoon.
[17:52:38] i could do a timezone joke now but i won't :)
[17:54:09] giftpflanze: Wow! You will get promoted soon. http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=tools&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[17:55:53] giftpflanze: .. and tools-exec-gift-01 sounds pretty dangerous too ;)
[17:57:06] oh, an abbreviation ...
[18:21:08] (03CR) 10Legoktm: [C: 032 V: 032] Add ContentTranslation to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/106501 (owner: 10Nikerabbit)
[18:21:39] !log tools restarting grrrit-wm https://gerrit.wikimedia.org/r/#/c/106501/
[18:21:41] Logged the message, Master
[18:22:30] Fetching gerrit
[18:22:31] error: server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none while accessing https://git.wikimedia.org/git/labs/tools/grrrit.git/info/refs
[18:22:31] fatal: HTTP request failed
[18:22:31] error: Could not fetch gerrit
[18:22:36] Coren: ^
[18:23:23] legoktm: I'm not seeing the error.
[18:23:43] I can reproduce it pretty easily
[18:23:47] become lolrrit-wm
[18:23:50] cd lolrrit-wm
[18:23:54] local-lolrrit-wm@tools-login:~/lolrrit-wm$ git fetch --all && git rebase gerrit/master
[18:25:16] I'm just going to use origin since that seems to work
[18:26:22] !log tools rebased grrrit-wm on origin/master since fetching gerrit was failing
[18:26:23] Logged the message, Master
[18:26:29] <3 logmsgs
[18:30:33] hedonil: is wikiviewstats reasonably stable, or is it still under development? People on enwiki are asking for something more reliable than stats.grok.se
[18:35:03] MrZ-man: although the job of a developer is never done, it has now reached a stable phase with this week's release of v3.
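For the fork question answered at the top of this exchange: because the grid's virtual-memory limit is counted over the parent and all of its children together, a job that spawns ~25 workers needs a correspondingly larger request. A rough sketch, assuming jsub's -mem option; the 4g figure and the parse.py script path are placeholders, not measured requirements:

    # request enough vmem to cover the parent plus ~25 worker processes
    jsub -once -mem 4g python $HOME/parse.py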
[18:35:58] MrZ-man: OAuth integration is in progress and will follow ~ next week
[18:37:41] giftpflanze: You should now be able to send jobs to -q gift
[18:41:39] coren: ah, that's conveniently short :)
[18:52:19] (03PS1) 10Nemo bis: Setup/restore #mediawiki-feed [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/106555
[18:54:05] (03CR) 10Legoktm: [C: 04-1] Setup/restore #mediawiki-feed (031 comment) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/106555 (owner: 10Nemo bis)
[18:57:04] (03PS2) 10Nemo bis: Setup/restore #mediawiki-feed [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/106555
[18:58:17] (03PS3) 10Nemo bis: Setup/restore #mediawiki-feed [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/106555
[18:59:16] (03CR) 10Nemo bis: Setup/restore #mediawiki-feed (031 comment) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/106555 (owner: 10Nemo bis)
[20:16:11] oh heh, I accidentally attempted to create a second instance with the same name... and it worked!
[20:19:17] Which instance does the name resolve to?
[20:21:02] MaxSem: Is that turning out to work ok? Or do I need to sort things out?
[20:21:18] I thought I added logic to prevent name collision, but it may be imperfect.
[20:28:44] andrewbogott, deleted both without problems, weren't needed anymore
[20:28:52] fair enough :)
[21:44:10] giftpflanze: so? How does it feel to hold your own exec node? (btw. its color is still pretty light blue..)
[21:45:38] blabla
[21:45:50] giftpflanze: :P
[21:48:31] hedonil: i'm glad it's finally there
[21:49:13] giftpflanze: sure. but i'd have expected a deep red node color ;)
[21:49:19] but i still have lots of time to start using it
[21:49:40] giftpflanze: Pfft I could easily max out any exec node
[21:50:57] i could too but i'm too tired after work (and even while at work)
[21:51:51] giftpflanze: It's a matter of changing one number in one script
[21:52:24] not for me
[21:52:56] Right now I've got it set to 25 worker processes, and it uses ~2GB of ram; set it to 50+ and watch the server melt
[21:53:22] hehe
[21:54:25] giftpflanze: I not only utilize multi-threading, but also multi-processing :P
[21:57:42] me too
[21:58:36] * hedonil just read about the upcoming infinite ban of Alkim Y and CherryX - popcorn cinema!
[21:58:58] hedonil: ??
[21:59:02] link?
[21:59:23] https://de.wikipedia.org/wiki/Wikipedia:Checkuser/Anfragen/CherryX,_Alkim_Y
[22:02:39] No. 1 talk page on German Wikipedia - by far. https://tools.wmflabs.org/wikiviewstats/dev/?locale=de&lang=de&project=&page=&datefrom=0000-00-00&type=discussion
[23:38:45] andrewbogott_afk: ping
[23:49:01] /facepalm
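Submitting to the dedicated queue mentioned at [18:37:41] only takes naming it on the command line. A sketch: the -q gift queue is the one Coren refers to, while the bot script path is a hypothetical placeholder:

    # send a job to the dedicated gift queue instead of the default task queue
    jsub -q gift -once python $HOME/mybot.py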