[[Resource Type::instance]]

[10:49:51] !ping
[10:49:51] pong
[13:28:06] petan: wb!
[13:28:17] :)
[13:28:30] petan: wm-bot was acting weird... something's missing from the log and idk why
[13:28:37] oh
[13:28:41] (vs. my memory and my own personal log)
[13:28:41] when?
[13:28:53] it has internal IO cache
[13:28:58] lets see
[13:29:07] it's possible it crashed before it was stored to file
[13:29:41] good question... would that be a bot quit and rejoin?
[13:29:41] it stores the text to ram and every minute it's flushed to files, so that it's not depending on IO
[13:29:46] no
[13:29:49] this bot never quit
[13:29:54] k
[13:30:03] so how do you detect a crash?
[13:30:09] from logs
[13:30:13] system.log
[13:30:15] or something
[13:30:17] I don't remember
[13:30:25] it's only log file in there
[13:30:36] one sec
[13:30:51] you would see an error there
[13:31:22] are you sure there wasn't a netsplit when it happened
[13:31:32] yes
[13:32:00] ok
[13:32:20] there is no log for yesterday, weird
[13:32:23] my log (which would have shown a netsplit):
[13:32:25] 17 14:38:43 <@mark> is probably rack C1, I think
[13:32:25] 17 14:40:40 < cmjohnson1> !!log swapping disk3 db44
[13:32:25] 17 14:41:01 < cmjohnson1> mark: lemme look an LC11
[13:32:25] 17 14:42:21 < cmjohnson1> mark so the 1st half of lc11 is mrjp-b1
[13:32:27] 17 14:42:43 <@mark> ok
[13:32:36] bot's log:
[13:32:38] [14:38:43] is probably rack C1, I think
[13:32:40] [14:42:43] ok
[13:34:07] and morebots ignored him of course because !!log isn't valid. so i did http://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=50452&oldid=50416
[13:34:15] hm, that would need to be either crash on every input from cmjohnson1 or there is a bug inside of bot or there was something preventing it from reading
[13:35:01] look elsewhere in http://bots.wmflabs.org/~petrb/logs/%23wikimedia-operations/20120817.txt ; you see cmjohnson1 before and after
[13:35:06] ok
[13:35:27] you can also check systemdata file
[13:35:35] http://bots.wmflabs.org/~petrb/db/systemdata.htm
[13:35:46] uptime is a time when core is loaded
[13:35:50] so on crash it reset
[13:36:59] for reference:
[13:36:59] 17 12:51:41 -!- cmjohnson1 [~chrisj_@pool-173-65-204-152.tampfl.fios.verizon.net] has joined #wikimedia-operations
[13:37:03] 17 13:42:06 -!- cmjohnson1 [~chrisj_@pool-173-65-204-152.tampfl.fios.verizon.net] has quit [Quit: Chris has quit]
[13:37:06] 17 14:20:07 -!- cmjohnson1 [~chrisj_@2620:0:860:2:51f5:f241:66f9:87c4] has joined #wikimedia-operations
[13:37:09] 17 21:11:28 -!- cmjohnson1 [~chrisj_@2620:0:860:2:51f5:f241:66f9:87c4] has quit [Remote host closed the connection]
[13:37:19] these ipv6 cause troubles to bot
[13:37:38] likely was an exception on every input
[13:37:55] irc packets are separated by colons
[13:38:01] heh
[13:38:06] let's see
[13:38:40] huh
[13:38:50] let's test a different case then
[13:39:14] this is what happen:
[13:39:16] :jeremyb!~jeremyb@wikimedia/jeremyb PRIVMSG #wikimedia-labs :heh
[13:39:28] when you have ipv6 you can't split by colons
[13:39:34] that's what bot does now
[13:39:45] I created a better parser but I didn't implement it yet
[13:41:15] I could do it now
[13:45:39] hah, i always thought ori was ori-1 but it wasn't matching. it's ori-l!
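The colon problem described above can be shown in a few lines. This is a minimal sketch, not wm-bot's actual parser: the names `IrcMessage` and `parse_irc_line` are hypothetical. It takes the prefix up to the first space, then treats everything after the first " :" as free text, so colons inside an IPv6 host are never used as separators.

```python
from typing import List, NamedTuple, Optional


class IrcMessage(NamedTuple):
    """Hypothetical container for one parsed IRC line."""
    source: Optional[str]  # prefix such as "nick!user@host", leading colon removed
    command: str           # e.g. "PRIVMSG"
    params: List[str]      # middle parameters, e.g. ["#wikimedia-labs"]
    data: Optional[str]    # trailing text after " :", if present


def parse_irc_line(line: str) -> IrcMessage:
    """Parse a raw IRC line by splitting on spaces, not on colons."""
    line = line.rstrip("\r\n")
    source = None
    if line.startswith(":"):
        # The prefix ends at the first space; colons inside it (IPv6) are kept.
        source, _, line = line[1:].partition(" ")
    # The first " :" starts the trailing parameter, which may hold spaces and colons.
    head, sep, trailing = line.partition(" :")
    data = trailing if sep else None
    parts = head.split()
    if not parts:
        raise ValueError("no command in IRC line")
    return IrcMessage(source, parts[0], parts[1:], data)
```

For example, a line reconstructed for illustration from the join record above, `:cmjohnson1!~chrisj_@2620:0:860:2:51f5:f241:66f9:87c4 PRIVMSG #wikimedia-operations :mark: lemme look`, keeps the whole IPv6 host in `source` and the whole message, including its own colon, in `data`.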
[13:48:35] petan, you split by spaces
[13:48:53] then the one which begins with : follows until the end of line
[13:49:00] but remove an initial colon if present
[13:49:13] in my latest parser I do it kind of better way
[13:49:24] the last colon is always prefixed
[13:50:15] so I just remove the leading colon and parse it to source, command parameters and data (string, string, string - array, string)
[13:51:00] ok, found another example: [15:47:50] hi zil @ http://bots.wmflabs.org/~petrb/logs/%23wikimedia-tech/20120817.txt
[13:51:14] zil never responds but multichil keeps talking to him
[13:52:24] petan: any estimate on a fix date? /me is about to tell other people about the bug
[13:52:40] hm... within a day I hope
[13:53:32] cool
[18:27:11] * Ryan_Lane yawns
[18:27:18] good morning Ryan
[18:27:25] hi Ryan_Lane! morning.
[18:27:27] good morning
[18:28:15] is there something wrong with nickserv?
[18:28:23] I can't msg it, so I can't get into private channels
[18:32:31] Ryan_Lane: there were problems yesterday.
[18:32:43] specifically?
[18:32:59] Services gone completely.
[18:33:17] Ah, it appears to be rebooted. Try again.
[18:33:24] ?
[18:33:33] Ryan_Lane: There have been global notices re: problems within the last hour.
[18:33:39] ahhh
[18:33:40] ok
[18:33:41] right
[18:33:46] it's working now
[18:33:59] Yeah. They said hopefully that should have been the last crash. xD
[18:34:11]
[18:35:07] [19:12:57] [groupcat] [Global Notice] Sorry folk, services have run off again. We expect to catch them again shortly, after a brief period of network breakage.
[18:35:13] 20 mins ago
[18:35:20] There we go.
[18:35:23] I hit /clear too often. ^^
[18:36:58] heh
[18:39:09] There go services again.
[18:39:11] ^^
[18:50:55] [Global Notice] Hi all, autumn tends to bring out the most aggressive of bugs... <-
[19:30:48] Ryan_Lane: good luck btw.
[19:31:02] on the board vote?
[19:31:03] thanks
[19:31:09] make sure to vote :)
[19:31:27] you're assuming I'll vote for you!
[19:31:30] :D
[19:31:31] :(
[19:31:32] heh
[19:31:40] well, not all of my votes went to me
[19:32:03] I voted for a few other people too
[19:39:45] Change on mediawiki a page Developer access was modified, changed by Hiker link https://www.mediawiki.org/w/index.php?diff=574048 edit summary:
[19:50:19] Ryan's pic is far too happy, other than that gl :P
[19:51:24] the gl?
[19:51:56] s/gl/good luck/
[19:52:24] ah
[19:52:24] heh
[19:52:25] thanks
[19:52:28] make sure to vote :)
[19:53:32] When does voting actually open? I know last week was the deadline for being eligible IIRC.
[19:53:46] it opened today
[19:53:49] did you not get an email yet?
[19:54:12] Havn't seen one, though I do have a few to go though still.
[19:54:26] I'm on the list on the site so should have one I guess :P
[19:54:36] heh
[19:54:37] yeah
[19:54:42] mine came in at like midnight
[20:03:54] I definitely didn't get one at midnight but considering it's highly likely based on us time I don't think timezones would make much difference. I'll just make a note to check later, got a mental todo list this week heh
[20:06:14] heh
[20:06:27] I removed region and dns domain from the instance creation page :)
[20:06:42] dns domain is automatic, based on region
[20:06:50] and no more zones, so no worries there
[20:29:46] Ryan_Lane: Any reason image types have empty parentheses?
[20:29:57] yeah, because that data isn't returned by the api
[20:30:04] though the spec says it should be
[20:30:09] that's fixed in the upgrade
[20:30:21] Lol
[20:30:31] Have to love when the docs and reality don't match.
[20:31:04] well, it's the ec2 api's spec
[20:31:21] in the new version I'm using nova's api
[20:33:02] I hope nova becomes more standard in the long run.
[20:33:22] agreed
[22:38:47] Totally didn't have to have the secretary re-send my voting invitation =/
[22:39:08] Weird considering my mail servers /havn't/ been broken this week, probably should get a better mx setup eitherway tbf
[22:44:12] heh
[22:45:22] I wish gmail did sieve support, would make my world so much better. Refuse to use a web interface to manage dozens of rules and been too lazy to write a convertor to then import.
[22:50:16] Damianz: can you log into here for me? https://virt1000.wikimedia.org/wiki
[22:51:08] Yep
[22:52:03] Page title's are broken and manage instances page is broken but login works.
[22:52:13] page titles are brokebn?
[22:52:24] well, there's no content
[22:52:30] so, red links are expected
[22:52:31] https://labsconsole.wikimedia.org/wiki/Special:Ask/-5B-5BResource-20Type::instance-5D-5D/-3FInstance-20Name/-3FInstance-20Type/-3FProject/-3FImage-20Id/-3FFQDN/-3FLaunch-20Time/-3FPuppet-20Class/-3FModification-20date/-3FInstance-20Host/-3FNumber-20of-20CPUs/-3FRAM-20Size/-3FAmount-20of-20Storage/searchlabel%3Dinstances/offset%3D0 = [[Resource Type::instance]]
[22:52:42] yeah. that's expected
[22:52:45] Instance list returned a stack trace, 2 refreshes and I have a list
[22:52:55] oh
[22:52:57] that
[22:52:57] Manage rather
[22:53:00] Error getting instance list from Nova: An unknown error has occurred. Please try your request again.
[22:53:03] yeah. I imported the sidebar directly
[22:53:07] that's incorrect :)
[22:53:11] Oh I see
[22:53:14] I got redirected back
[22:53:16] yeah
[22:53:17] * Damianz facepalm
[22:53:19] sorry
[22:53:29] let me fix the sidebar
[22:53:48] https://virt1000.wikimedia.org/wiki/Special:NovaInstance works fine just with no content, I did think it was weird the list looked like the current cluster >.>
[22:54:11] heh
[22:54:12] ok
[22:54:12] good
[22:54:15] give me a sec
[22:54:54] now try
[22:55:09] now you should see an instance
[22:55:12] but can't do any actions
[22:55:29] Yep, test123.
[22:55:32] great
[22:55:40] 08/20/2012 - 22:55:40 - User damian may have been modified in LDAP or locally, updating key in project(s): bots,bastion,deployment-prep,mailman
[22:55:48] 08/20/2012 - 22:55:48 - Updating keys for damian at /export/keys/damian
[22:55:54] oreally labs-home-wm.
[22:56:06] heh
[22:56:24] I wonder why it doesn't mention essextest
[22:56:32] ok. added you as sysadmin and netadmin there
[22:56:59] Totally have links now
[22:57:02] \o/
[22:57:04] Configure instance page looks a little... empty
[22:57:13] yeah
[22:57:19] no content, remember? :)
[22:57:23] that stuff comes from the mediawiki db
[22:57:31] Ah.
[22:57:36] Explains why puppet is there as it's from ldap.
[22:57:42] yeah
[22:57:58] I'd like to eventually move it to openstack
[22:58:06] but, we didn't get a very warm reception on that
[22:58:21] I can imagine you'd have to argue, why not Juja, Chef etc.
[22:58:21] create an instance :)
[22:58:41] we'll likely do the puppet stuff as a plugin
[22:58:57] I was totally happy to have the puppet stuff rejected, because it made it easier to justify a proper plugin system
[22:59:08] Failed to create instance.
[22:59:12] really?
[22:59:19] what instance name did you use?
[22:59:24] rainbowponies
[22:59:26] heh
[22:59:38] It didn't like RainbowPonies :( Damn lowercase.
[22:59:47] abc123 fails too.
[23:00:13] I do indeed have a stacktrace
[23:00:45] I wonder why I can create one
[23:00:48] and you can't
[23:00:53] ah
[23:00:54] I know why
[23:01:18] * Damianz does ldapearch uid=ryan to check if god: True is set
[23:01:35] nah. you need to get a project token
[23:01:41] which it should do for you
[23:01:48] but likely it cached something poorly in memcache
[23:01:55] or incorrectly
[23:02:06] yeah. it did
[23:02:16] I need to invalidate project tokens when a user is added to a project
[23:02:30] well, when they are added to a role, anyway
[23:02:31] Logout and in didn't clear it it seems.
[23:02:39] yeah. it won't
[23:02:48] it caches your token until it expires
[23:02:52] Ah
[23:03:04] it needs to do that
[23:03:10] otherwise you couldn't use it on the cli
[23:03:23] ok. there's one thing to fix...
[23:04:28] One day labs-home-wm might just write the keys out to my homedir and I can ask nova directly ;)
[23:04:44] what do you mean?
[23:04:55] oh, nova would write it to the filesystem?
[23:05:05] I prefer that it keeps it in LDAP
[23:05:13] and have the bot sync it
[23:05:28] It would actually be better in ldap, as would avoiding the bot writing out ssh keys.
[23:05:28] of course, now the bot really only needs to sync it to the shared key location
[23:05:41] the other crap it does is kind of unnecessary
[23:05:44] But yeah being able to like kinit then use $bunchofscripts to do random admin tasks ftw
[23:05:58] openssh won't read keys from ldapo
[23:05:59] *ldap
[23:06:06] we use the schema for that patch that allows that
[23:06:09] Isn't there a pam module for that?
[23:06:12] Ah
[23:06:13] but upstream won't accept the patch
[23:06:16] That sucks
[23:06:31] * Ryan_Lane shrugs
[23:06:38] it isn't that hard to write to a single shared location
[23:06:41] openssh is a little bit meh for custom stuff, suppose that's why loads of people use dropbear or w/e it's called for custom auth rather than hacking pam.
[23:06:58] dropbear is really limited
[23:07:07] Yeah
[23:07:30] But it's sooo much easier for custom ssh things, unless you want to go gerrit style and write your own (which having done once isn't pretty, damn tty handling).
[23:09:09] well, there's ssh implementations in a bunch of languages
[23:09:14] not so sure I trust them 100%
[23:09:30] j^: project groups can't have that
[23:09:30] That defines a class
[23:09:40] a class underneath a group can
[23:09:53] a group is just a grouping of classes and variables
[23:10:24] I'm allways a bit un-easy about ssh stuff because of the security side of things, sometimes it's just required though (like for serial console servers with end user restricted access).
[23:10:40] Damianz: well, I don't mind using the client portion of those libraries
[23:12:42] The main gripe I have with using ssh programatically is it sucks, sometimes it just doesn't return data, blocks forever, you have to deal with multiple streams, ttys etc. Even the frameworks suck at abstracting it out so you end up doing crazy stuff like expect+ssh just to hack around it =/
[23:13:22] Really I should be able to run on a few thousand machines and expect it to work like it does locally, adding a few seconds of latency.
[23:21:17] Damianz: ok. try to create an instance now
[23:22:22] Damianz: I never use expect +ssh
[23:22:25] Failed again.
[23:22:29] paramiko for the win
[23:22:34] hm
[23:23:26] I keep meaning to 'port' RANCID to paramiko, as hacking up tcl for newer firmware with /slightly/ different output is a huge PITA.
[23:23:56] hm
[23:24:43] try again for me
[23:24:50] I think this is a keystone failure
[23:25:29] Failed again.
[23:25:40] yeah
[23:25:44] was expecting a fail
[23:26:23] wait
[23:26:29] which project are you trying in?
[23:26:34] essextest, right?
[23:26:41] yeah
[23:26:52] I don't have the option on any of the others.
[23:26:58] * Ryan_Lane nods
[23:26:58] (though it does show me the zone)
[23:27:03] well, region anyway
[23:27:12] yeah
[23:27:47] Special:NovaInstance&action=create&project=essextest&region=eqiad specifically, base settings for everything.
[23:28:08] yep
[23:28:18] let's try something
[23:28:52] ok. try now
[23:28:56] you'll need to log in again
[23:28:59] I cleared memcache
[23:29:01] logged me out
[23:29:13] I want to make sure it's not a cache invalidation issue
[23:29:20] Created instance i-0000000d with image 795c75c0-4168-497c-816b-8ff6f8f33b69 and hostname i-0000000d.eqiad.wmflabs.
[23:29:27] seems it is
[23:30:14] I'm semi-surprised wmf hasn't switched to redis over memcache as standard yet considering the issues memcache has with filling the tcp stack up and causes issues. Different topic though :P
[23:34:41] Ryan_Lane: Question, is this suppose to have the bug fixed where by every projected is forced to have a default sudo policy? (essextest doesn't have one).
[23:37:14] Logged the message, Master
[23:44:28] Damianz: no
[23:44:53] ah. I think I see the problem
[23:45:13] s/I think//
[23:46:40] Damianz: can you visit the list instances page?
[23:46:50] I removed you from the groups
[23:46:57] and I think I deleted your token properly
[23:47:24] so, you'll get a token when you visit the page
[23:47:27] List or manage? List is blank as the wiki is blank, manage looks like I have no rights in the project.
[23:47:30] err
[23:47:31] sorry
[23:47:31] manage
[23:47:34] great
[23:47:36] gimme a sec
[23:47:49] refresh the page
[23:47:53] then add an instance
[23:48:26] Created instance i-0000000e with image 795c75c0-4168-497c-816b-8ff6f8f33b69 and hostname i-0000000e.eqiad.wmflabs.
[23:48:29] \o/
[23:48:30] Works fine.
[23:48:43] ok. let's try one more thing
[23:49:04] I just cleared memcache
[23:49:23] can you re-log in
[23:49:30] I removed you from the sysadmin group again
[23:49:39] I wanted to make sure it didn't have a cached valid token
[23:49:43] Just checking it cleared the first time?
[23:49:47] mhm
[23:49:50] Logged in again
[23:50:03] ok. now you have no rights, right?
[23:50:07] yep
[23:50:15] refresh
[23:50:21] and create an instance
[23:50:28] (I'm deleting your old ones, btw)
[23:51:08] Failed to create instance.
[23:51:11] damn it
[23:51:37] crap. i restarted memcache again
[23:51:38] If you're deleting them I can re-use the name rather than trolling google for random names :P
[23:51:39] didn't mean to do that
[23:51:44] heh
[23:51:49] yeah. you can use the same name
[23:54:37] ah
[23:54:39] I see another bug
[23:54:40] damn
[23:55:02] Damianz: ok. log in and go to the manage instances page
[23:55:16] I guess I could make a second user for this. heh
[23:55:29] last try, I promise :)
[23:55:48] no permissions again
[23:55:56] This is why I have 4 logins to work's ldap servers :P
[23:56:36] ok. refresh and create instance
[23:56:58] Created instance i-0000000f with image 795c75c0-4168-497c-816b-8ff6f8f33b69 and hostname i-0000000f.eqiad.wmflabs. works
[23:57:02] perfect
[23:57:12] There is a really annoying bug though
[23:57:22] what's that?
[23:57:23] If the table cell doesn't have content the border dips in by like 4px
[23:57:25] Horrid to look at
[23:57:39] which cell doesn't have content?
[23:57:52] actions?
[23:57:55] Instance floating ip address, security group, instace ip address
[23:58:04] which browser are you using?
[23:58:08] chrome
[23:58:17] cause I don't see this
[23:58:53] still not seeing this
[23:59:05] http://stuff.damianzaremba.co.uk/Screen%20shot%202012-08-21%20at%2000.58.27.png
[23:59:30] yep. I don't get that
[23:59:33] Weird
[23:59:40] I do need to update chrome
[23:59:41] * Damianz tries
[23:59:52] interesting. your instance is in error state
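The token problem debugged in this session (a cached project token outliving a role change) might be sketched as follows. This is only an illustration under stated assumptions: `ProjectTokenCache` and `on_role_membership_changed` are hypothetical names, a plain dictionary stands in for memcache, and nothing here is taken from the actual labsconsole code. The cached token is kept until it expires, as the log says it must be for CLI use, but is dropped explicitly whenever a user's role membership in a project changes.

```python
import time
from typing import Dict, Optional, Tuple


class ProjectTokenCache:
    """Toy stand-in for a memcache-backed cache of per-project tokens."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._ttl = ttl_seconds
        # (user, project) -> (token, expiry timestamp)
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def put(self, user: str, project: str, token: str,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries[(user, project)] = (token, now + self._ttl)

    def get(self, user: str, project: str,
            now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._entries.get((user, project))
        if entry is None:
            return None
        token, expires_at = entry
        if now >= expires_at:
            # Expired tokens are dropped so that a fresh one gets requested.
            del self._entries[(user, project)]
            return None
        return token

    def invalidate(self, user: str, project: str) -> None:
        # Removing a missing key is not an error.
        self._entries.pop((user, project), None)


def on_role_membership_changed(cache: ProjectTokenCache,
                               user: str, project: str) -> None:
    """Hypothetical hook: call when a user is added to or removed from a role."""
    cache.invalidate(user, project)
```

With a hook like this, the next page visit after a role change would find no cached token and request a new one carrying the updated rights, instead of requiring a full memcache flush.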