[08:00:33] !log bots petrb: patching bot [08:00:34] Logged the message, Master [08:04:33] !ping [08:04:33] pong [08:57:02] Change abandoned: Hashar; "nobody care :-D" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7304 [08:58:31] Change abandoned: Hashar; "Moved to production https://gerrit.wikimedia.org/r/12151" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7075 [10:40:43] paravoid: so, want to try to build some ciscos today? :) [11:15:24] Ryan_Lane: did we fix nagios? [11:27:58] !lesl [11:27:59] git reset --hard origin/test [11:46:45] PROBLEM SSH is now: CRITICAL on mobile-testing i-00000271 output: CRITICAL - Socket timeout after 10 seconds [11:51:35] RECOVERY SSH is now: OK on mobile-testing i-00000271 output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [12:24:29] !git-puppet is git clone ssh://gerrit.wikimedia.org:29418/operations/puppet.git [12:24:30] Key was added [12:25:58] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:48] New patchset: Petrb; "inserted back ram check to nagios" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12164 [12:34:20] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/12164 [12:40:58] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK [13:10:48] PROBLEM Current Users is now: WARNING on bastion-restricted1 i-0000019b output: USERS WARNING - 10 users currently logged in [13:15:41] !log nagios root: disabled user check for bastion [13:15:42] Logged the message, Master [13:29:51] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:40] so, labs is now running off of production [13:34:50] excluding the private repo, of course [13:36:40] \O/ [13:36:41] congrats [13:38:48] !log deployment-prep updating package on -dbdump [13:38:50] Logged the message, Master [13:40:40] Ryan_Lane: including ram check? [13:40:53] no [13:40:53] !log deployment-prep upgrading packages on -squid [13:40:54] not just yet. gimme a little bit [13:40:55] Logged the message, Master [13:40:57] ok [13:41:23] I get an error on one of our Precise instance : err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/etc/sudoers] is already defined in file /etc/puppet/manifests/sudo.pp at line 50; cannot redefine at /etc/puppet/manifests/apaches.pp:133 on node i-000002d3.pmtpa.wmflabs [13:41:36] that is apache-30 , the other clone, apache-31 does not report that :/ [13:42:05] petan: it'll be updating in a minute [13:42:51] RECOVERY Free ram is now: OK on labs-nfs1 i-0000005d output: OK: 85% free memory [13:44:12] hashar: ah. 
I see why [13:44:28] I am not sure why it happens one instance and not on the other one :/ [13:44:42] the issue is that they have /etc/sudoers provisioned by some apache puppet class [13:44:47] appsudoers something [13:45:11] hashar: because it's using apaches::service [13:45:15] hashar: may be class is used only on one of them [13:45:23] let me double check [13:45:31] and stupidly that sets /etc/sudoers [13:45:41] when it should use /etc/sudoers.d/ [13:46:23] oh apache-31 lost all its puppet classes hehe [13:46:32] RECOVERY Free ram is now: OK on bots-dev i-00000190 output: OK: 90% free memory [13:46:32] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [13:46:32] RECOVERY Free ram is now: OK on en-wiki-db-precise i-0000023c output: OK: 81% free memory [13:46:32] RECOVERY Free ram is now: OK on feeds i-000000fa output: OK: 89% free memory [13:46:42] RECOVERY Free ram is now: OK on ipv6test1 i-00000282 output: OK: 48% free memory [13:46:54] !log deployment-prep apache-31 : readding applicationserver::labs and imagescaler::labs [13:46:55] Logged the message, Master [13:47:42] RECOVERY Free ram is now: OK on bob i-0000012d output: OK: 86% free memory [13:47:42] RECOVERY Free ram is now: OK on bots-nfs i-000000b1 output: OK: 89% free memory [13:47:42] RECOVERY Free ram is now: OK on demo-mysql1 i-00000256 output: OK: 94% free memory [13:47:42] RECOVERY Free ram is now: OK on deployment-cache-upload i-00000263 output: OK: 88% free memory [13:47:42] RECOVERY Free ram is now: OK on deployment-transcoding i-00000105 output: OK: 66% free memory [13:47:43] RECOVERY Free ram is now: OK on incubator-common i-00000254 output: OK: 86% free memory [13:47:43] RECOVERY Free ram is now: OK on labs-realserver i-00000104 output: OK: 71% free memory [13:47:52] RECOVERY Free ram is now: OK on wmde-test i-000002ad output: OK: 90% free memory [13:47:52] RECOVERY Free ram is now: OK on vumi i-000001e5 output: OK: 85% free memory [13:47:52] RECOVERY Free ram is now: OK on otrs-jgreen i-0000015a output: OK: 90% free memory [13:47:52] RECOVERY Free ram is now: OK on p-b i-000000ae output: OK: 85% free memory [13:47:52] RECOVERY Free ram is now: OK on swift-be2 i-000001c8 output: OK: 82% free memory [13:47:53] RECOVERY Free ram is now: OK on swift-be3 i-000001c9 output: OK: 86% free memory [13:47:53] RECOVERY Free ram is now: OK on webserver-lcarr i-00000134 output: OK: 83% free memory [13:47:54] RECOVERY Free ram is now: OK on testing-groupchange i-00000205 output: OK: 71% free memory [13:47:54] RECOVERY Free ram is now: OK on varnish i-000001ac output: OK: 90% free memory [13:50:45] PROBLEM Current Load is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [13:50:45] PROBLEM Disk Space is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [13:50:55] PROBLEM dpkg-check is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [13:51:06] o_O [13:51:35] RECOVERY Free ram is now: OK on analytics i-000000e2 output: OK: 76% free memory [13:52:45] RECOVERY Free ram is now: OK on bastion-restricted1 i-0000019b output: OK: 84% free memory [13:52:45] RECOVERY Free ram is now: OK on building i-0000014d output: OK: 90% free memory [13:52:45] RECOVERY Free ram is now: OK on embed-sandbox i-000000d1 output: OK: 90% free memory [13:52:45] RECOVERY Free ram is now: OK on fr-wiki-db-precise i-0000023e output: OK: 72% free memory [13:52:45] RECOVERY Free ram is now: OK on pediapress-ocg2 i-00000234 output: OK: 90% free memory [13:52:46] 
RECOVERY Free ram is now: OK on php5builds i-00000192 output: OK: 94% free memory [13:52:46] RECOVERY Free ram is now: OK on wikidata-dev-1 i-0000020c output: OK: 89% free memory [13:53:15] PROBLEM dpkg-check is now: CRITICAL on mingledbtest i-00000283 output: Connection refused by host [13:53:45] PROBLEM Current Users is now: WARNING on bastion-restricted1 i-0000019b output: USERS WARNING - 10 users currently logged in [13:54:25] PROBLEM Current Users is now: CRITICAL on demo-web1 i-00000255 output: Connection refused by host [13:54:35] PROBLEM Current Users is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [13:54:35] PROBLEM Total Processes is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [13:55:05] PROBLEM Disk Space is now: CRITICAL on demo-web1 i-00000255 output: Connection refused by host [13:55:05] PROBLEM Total Processes is now: CRITICAL on mingledbtest i-00000283 output: Connection refused by host [13:55:34] meh [13:55:55] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: Connection refused by host [13:55:55] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: Connection refused by host [13:55:55] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: Connection refused by host [13:55:55] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: Connection refused by host [13:56:15] PROBLEM Current Load is now: CRITICAL on demo-web1 i-00000255 output: Connection refused by host [13:56:15] PROBLEM Current Load is now: CRITICAL on mingledbtest i-00000283 output: Connection refused by host [13:56:15] PROBLEM Current Users is now: CRITICAL on mingledbtest i-00000283 output: Connection refused by host [13:56:35] PROBLEM Disk Space is now: CRITICAL on mingledbtest i-00000283 output: Connection refused by host [13:56:35] RECOVERY Free ram is now: OK on bots-4 i-000000e8 output: OK: 88% free memory [13:57:06] weird [13:57:18] ? [13:57:19] something is broken but not all [13:57:20] PROBLEM Current Load is now: CRITICAL on build1 i-000002b3 output: Connection refused by host [13:57:20] PROBLEM Total Processes is now: CRITICAL on demo-web1 i-00000255 output: Connection refused by host [13:57:29] this is likely the exact same issue [13:57:45] I just sshed to bots instances nrpe is fine [13:57:50] but on some other instances it's not [13:57:54] yeah [13:58:02] it seems when the nrpe config changes, it breaks nrpe [13:58:08] then when puppet runs, it fixes it [13:58:09] but not everywhere [13:58:12] aha [13:58:15] maybe [13:58:18] this is the same problem we see on instance creation, btw [13:58:27] true [13:58:39] maybe we need to wait for puppet to fix it [13:59:21] yeah [14:00:12] lets see dumps-1... 
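The exchange above, where NRPE starts answering "Connection refused by host" after its config changes and only recovers on the next puppet run, is the kind of thing a quick port probe can narrow down. A minimal sketch, assuming NRPE's default port 5666; the instance names are placeholders:

```python
#!/usr/bin/env python
"""Probe the default NRPE port on a few instances to see which ones are
actually refusing connections. Instance names are placeholders."""

import socket

INSTANCES = ["dumps-1", "mingledbtest", "worker1", "demo-web1"]  # hypothetical
NRPE_PORT = 5666  # NRPE's default port


def nrpe_reachable(host, port=NRPE_PORT, timeout=5):
    """Return True if something accepts a TCP connection on the NRPE port."""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except (socket.error, socket.timeout):
        return False


if __name__ == "__main__":
    for host in INSTANCES:
        state = "ok" if nrpe_reachable(host) else "connection refused / timeout"
        print("%-15s %s" % (host, state))
```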
[14:02:21] ahh [14:11:23] PROBLEM Current Users is now: CRITICAL on deployment-mc i-0000021b output: Connection refused by host [14:11:33] PROBLEM Total Processes is now: CRITICAL on maps-test3 i-0000028f output: Connection refused by host [14:13:52] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/etc/sudoers] is already defined in file /etc/puppet/manifests/sudo.pp at line 50; cannot redefine at /etc/puppet/manifests/apaches.pp:133 on node i-000002d3.pmtpa.wmflabs [14:14:02] hashar: puppet is broken on apache30 [14:14:58] yeah [14:15:02] cause of /etc/sudoers [14:15:09] I am submitting a change as we speak [14:15:20] not only apache's meh [14:16:22] Ryan_Lane: /etc/sudoers might be fixed with https://gerrit.wikimedia.org/r/12178 [14:16:47] I have created a class for file { "/etc/sudoers": source => sudoers.appserver } and install the file in /etc/sudoers.d/ as you said [14:16:53] though it is no more overriding sudo::default [14:17:40] petan: that is change 12178 [14:17:58] k [14:18:42] expect some disruption over the next days [14:18:51] since production and test branches have been merged [14:18:54] ;-] [14:20:40] hashar: that won't fix your issue [14:21:19] just when I was telling some local coworker that puppet was fixing all our issues [14:21:19] ahah [14:22:12] what is wrong with the patch? [14:22:19] the problem with the sudoers file is that it's a sudoers file [14:22:21] puppet complains about duplicate definitions for /etc/sudoers [14:22:25] it should be a sudoers.d file [14:22:34] and the sudoers file should be left alone [14:23:39] I am totally confused sorry :-( [14:24:05] I put puppet:///files/sudo/sudoers.appserver as /etc/sudoers.d/appserver [14:24:13] and /etc/sudoers is provided by sudo::default [14:24:17] that's not going to work by itself [14:24:41] you'll need to modify the sudo file [14:25:03] for instance: root ALL=(ALL) ALL [14:25:05] that shouldn't be there [14:25:21] or this: nagios ALL=NOPASSWD: /usr/bin/check-raid.py "" [14:26:06] Ryan_Lane: that's for nagios [14:26:14] so that it can run it as root [14:26:52] yes, but it should be already defined elsewhere [14:26:56] ok [14:27:18] yep [14:27:23] /etc/sudoers.d/nrpe [14:28:40] is root ALL=(ALL) ALL to allow anyone to run command as root ? [14:29:17] or is that root allowed to run anything as anything with any parameter? [14:36:12] root is allowed to run anything [14:36:25] that's in the default sudoers anyway [14:36:28] unless I removed it [14:36:33] it's silly for root to run sudo [14:37:10] Ryan_Lane: Any chance of restoring files after a rm command? [14:37:34] nope [14:37:43] unless they are held open by something [14:37:53] then you can restore them by their file descriptors [14:38:10] god damn... [14:39:27] ? [14:39:38] I deleted files accidentally [14:39:59] and didn't backup, damn... [14:41:30] nevermind, it was restored somehow [14:42:17] Proving once again, Labs is a place where magic happens :) [14:43:16] yeah, /data/project isn't being mounted, magic... [14:59:01] @help [14:59:01] Type @commands for list of commands. This bot is running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 1.8.2.4 source code licensed under GPL and located at https://github.com/benapetr/wikimedia-bot [14:59:13] Hydriz: there is a chance, but it suck [14:59:40] Hydriz: I've seen undelete for ext which works terribly slow [14:59:50] it work on ext3 and ext4 too [14:59:50] blame wmib :P [14:59:54] well, if it's in /mnt... 
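Ryan's remark about restoring deleted files "by their file descriptors" refers to the fact that Linux keeps a deleted file's data around while any process still holds it open, and exposes it under /proc/<pid>/fd. A rough sketch of that recovery, with a made-up pid and destination directory:

```python
#!/usr/bin/env python
"""Copy still-open-but-deleted files back out of /proc/<pid>/fd. The pid and
the rescue directory are illustrative, not taken from the incident above."""

import os
import shutil

DELETED_SUFFIX = " (deleted)"


def find_deleted_fds(pid):
    """Yield (fd_path, original_name) for deleted files the process holds open."""
    fd_dir = "/proc/%d/fd" % pid
    for fd in os.listdir(fd_dir):
        fd_path = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(fd_path)
        except OSError:
            continue  # fd closed between listdir() and readlink()
        if target.endswith(DELETED_SUFFIX):
            yield fd_path, target[:-len(DELETED_SUFFIX)]


if __name__ == "__main__":
    bot_pid = 12345              # hypothetical: pid of the still-running bot
    rescue_dir = "/tmp/rescued"  # hypothetical destination
    if not os.path.isdir(rescue_dir):
        os.makedirs(rescue_dir)
    for fd_path, original in find_deleted_fds(bot_pid):
        dest = os.path.join(rescue_dir, os.path.basename(original))
        shutil.copyfile(fd_path, dest)  # reads the data via the still-open inode
        print("recovered %s -> %s" % (original, dest))
```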
[14:59:58] Hydriz: what's wrong [14:59:59] it'll work [15:00:06] if it's in / or /data/project you'd SOL [15:00:11] *you're [15:00:15] I accidentally deleted the dump dir (my own wmib) [15:00:24] but restarting the bot works [15:00:25] somehow [15:00:29] what dump dir? [15:00:32] Hydriz: ? [15:00:37] :( [15:00:37] are you still downloading to the filesystem? [15:00:45] the WM-Bot has a dump directory [15:00:46] Hydriz: first of all, dump doesn't contain anything important [15:00:51] yeah [15:00:54] Hydriz: second, what bot you talk about [15:00:54] I just realised [15:01:01] my own bot [15:01:08] Hydriz: where it run [15:01:14] IRC? [15:01:20] I mean instance [15:01:25] incubator-bot0 :P [15:01:28] aha [15:01:35] if you restart bot, it recreate all dumps [15:01:39] yep [15:01:43] I did that [15:01:46] k [15:01:47] and it worked :) [15:02:16] Ryan_Lane: Since you are here, can you give me a heads up on why crontab runs twice? [15:02:29] Hydriz: many reasons [15:02:32] which crons run twice? [15:02:41] custom crons [15:02:43] crontab -e [15:02:49] that makes no sense [15:03:01] the same command, but it runs twice when it is supposed to run [15:03:07] hm, check cron daemon config, also are you sure you don't have some fucked crontab? [15:03:16] lol no [15:03:23] then a command is fucked [15:03:33] paste it somewhere [15:03:36] to pastebin [15:03:40] doing... [15:03:49] I removed it a few days back btw [15:03:54] and using some hack of sleeping the script [15:04:06] that's likely what does it [15:04:12] :P [15:05:54] pastebin still loading... [15:06:04] Hydriz: try mozilla's [15:06:07] I like it [15:06:41] Ryan_Lane: can we have a wikimedia pastebin :D [15:06:50] * Ryan_Lane shrugs [15:06:59] not that there isn't million of others [15:07:04] simple version: http://bots.wmflabs.org/~hydriz/crontab.txt [15:07:11] its just one line [15:07:17] but wikimedia would look cool [15:07:45] the old cron.py is already modified [15:07:58] so its hard for me to give a paste for that [15:08:28] but its just a simple script that outputs the full date of yesterday [15:08:52] and feeds it into incrdumps.py (code: https://github.com/Hydriz/incrdumps) [15:16:33] Hydriz: I wanted to see that crontab [15:16:39] :P [15:16:44] ah [15:16:45] you sent it [15:16:51] lol [15:17:38] Hydriz: I recommend you to create a shell script where you echo "Program started at `date`" >> somelog;python cron.py [15:17:55] Hydriz: there you would see if it's because of CRON or script [15:18:12] in case there would be 2 rows it would mean cron is broken [15:18:21] hmm [15:18:22] but I believe that your script is [15:19:01] let me try this on a separate crontab, since I already deleted this crontab [15:19:06] k [15:23:22] hmm... [15:23:37] this crontab runs one command only [15:23:54] nevermind then, I think its some human error somewhere [15:24:05] anyway I am using another way of doing this :P [15:24:39] but do you have any documentation about properly adding people to the bots project? [15:44:34] Ryan_Lane: Looking around, what is /data/home supposed to do? [15:44:59] eventually we'll move the per-project home directories there [15:45:27] paravoid: hey, seems they released 3.3 stable [15:45:29] for gluster [15:45:47] home is 50GB, nice... [15:46:30] paravoid: I was also thinking of just enabling pam_homedir for the projects and getting rid of the script that creates homedirs [15:46:37] it'll just create keys [15:46:52] what does the script do? 
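Going back to the crontab-runs-twice exchange earlier in the hour: petan's suggestion was a wrapper that logs a timestamped line before running the real job, so you can tell whether cron or the script is doing the duplication. A Python take on that idea; both paths are placeholders, not Hydriz's actual layout:

```python
#!/usr/bin/env python
"""Log a timestamped line before handing off to the real cron job, so a
double-firing crontab shows up as two "started" lines per scheduled run.
Both paths are placeholders."""

import datetime
import subprocess

LOG_FILE = "/home/hydriz/cron-debug.log"       # hypothetical
REAL_JOB = ["python", "/home/hydriz/cron.py"]  # hypothetical


def log(message):
    with open(LOG_FILE, "a") as handle:
        handle.write("%s %s\n" % (datetime.datetime.now().isoformat(), message))


if __name__ == "__main__":
    log("Program started")
    exit_code = subprocess.call(REAL_JOB)
    log("Job exited with status %d" % exit_code)
```

The crontab entry would then point at the wrapper instead of cron.py; two "Program started" lines per scheduled run would implicate cron itself, while one line with duplicated work would implicate the script.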
[15:47:04] it creates home directories in each project for users that are members [15:47:20] it was needed before, when it also added the keys [15:47:25] now pam_homedir can do the same thin [15:47:27] *thing [15:47:45] pam_mkhomedir I guess you mean [15:47:48] yes. that one [15:47:49] yes please :-) [15:47:55] the script *can* do other things, too [15:48:06] like rename home directories when users are renamed in LDAP [15:48:08] but we don't do that [15:48:16] same with changing uidnumber and gidnumber [15:48:26] but I thought you said this isn't supported? [15:48:41] it isn't, so screw the script :) [15:48:48] hahaha [15:49:15] in the future if we need to, we can use salt to rename the user's home dirs [15:50:08] I should test out the new version of gluster [15:50:26] so we can upgrade the project storage [15:50:53] * paravoid shivers [15:50:57] heh [15:51:04] well, the project storage is barely useable right now [15:51:23] +1 [15:51:35] the thing broken last time was share acls [15:51:37] * Damianz blames Hydriz [15:51:46] +1 [15:51:47] :D [15:51:53] lol [15:51:59] well, we *do* need to do network node per compute node [15:52:04] then it'll be much faster [15:52:08] or we just rid of Hydriz [15:52:10] heh [15:52:22] the version of gluster we are using has bugs [15:52:43] Watching spacewalk install is dull hmm [15:52:52] new version does as well, just no one discovered them [15:53:11] Zarro Boogs! [15:53:54] :( [15:54:32] oh speaking of spacewalk [15:54:55] Ryan_Lane et al: https://lists.debian.org/debian-devel/2012/06/msg00645.html [15:55:06] Together with Miroslav Suchý from Red Hat Bernd Zeimetz (bzed) worked on the [15:55:10] packaging of the necessary libraries and daemons to add (basic) Spacewalk [15:55:13] client support to Debian. [15:55:19] [...] managing Debian clients with Spacewalk should [15:55:20] work out of the box. [15:55:28] yep [15:55:29] I saw that [15:55:36] you read debian-devel!?!?! [15:55:36] I looked just the other day :) [15:56:11] I checked spacewalk's website [15:56:20] I still think cobbler is bloated and could be done really nicely in django using zeromq. [15:56:23] it also now works completely with postgres [15:56:43] Well 'works' and postgre only go together if you add 'initial setup pain'. [15:56:44] the unfortunate thing is spacewalk server requires fedora [15:57:36] paravoid: thanks to the stuff you added to sockpuppet, I think we can live without spacewalk [15:57:51] once we have a remote execution, tool, anyway [15:57:52] heh [15:58:05] I don't want to deal with package versioning in puppet [15:58:08] pdsh <3, mccollective <3 [15:58:19] Ryan_Lane: the servermon tree has an alternative implementation btw [15:58:27] that uses paramiko to SSH into the hosts [15:58:27] I'd rather run a command that says "upgrade blah on this set of servers, and do it in batches of 10" [15:58:29] and run apt2xml [15:58:37] not for upgrades, just for collection [15:58:38] no thanks :) [15:58:44] I like our way better [15:58:52] Parsing xml? I'd rather stab myself <3 json [15:59:03] Damianz: that's what libraries are for [15:59:07] Damianz: I actually converted the XML thing into JSON [15:59:19] then did some tests and they were actually BIGGER [15:59:27] Ryan_Lane: You ever tried using element tree for serious shizzle? 
:( [15:59:28] yes, json was bigger than XML for this thing [15:59:35] I have, yes [15:59:40] in this case I'm using minidom though [15:59:42] and it's trivial [15:59:49] simple xml schema though [16:00:19] I really want to write something in BSON just because it exists =/ [16:00:40] I don't think you can pass binary over puppet facts [16:00:48] if you could, I would just pickle the dict :P [16:01:16] Shame lol [16:01:25] Facter is seriously freaking awesome though [16:01:46] Writing a report processor was easier, writing custom facts way *shudder ruby* [16:01:56] I'll have to look at Facter [16:02:35] Damianz: you should see what I did with the facter database for ops [16:05:04] Sounds interesting, I've not yet extended it to the point of shoving all the host data back into our management app thing, which will make things more awesome because I can kill 'ice' my little custom written go grab stuff of the servers thing that runs on cron. [16:09:09] hey petan, got time to do a favor for me? [16:25:24] Ryan_Lane: when using self hosted puppet for doing swift stuff, will I have to make the same change on every one of my swift hosts? (rather than changing the central puppet and having all of them update.) [16:26:09] for example, puppet manages the ring files. If I understand it right, I'll have to manually copy the ring files to each host (in /var/lib/...) and then trigger puppet to update them? [16:26:38] yes, this sucks at the moment [16:26:44] I couldn't think of a better way to do it [16:26:54] feels like a pretty huge step backwards for working with anything more than a single machine. [16:26:58] it is [16:27:10] Change on 12mediawiki a page Wikimedia Labs/Per-project saltstack remote execution was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=551956 edit summary: [16:27:11] I wasn't intending to ditch test completely at the time though [16:27:37] paravoid: can have a central repo in /data/project, and have all the other repos pull from it [16:27:37] what would happen if the puppet repo was on the (project) shared storage area? [16:27:40] Ryan_Lane: saltstack? oh dear [16:27:49] paravoid: eh? [16:28:11] you're persistent :) [16:28:12] Change on 12mediawiki a page Wikimedia Labs/Per-project saltstack remote execution was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=551957 edit summary: /* Using peer relationships */ [16:28:14] for remote execution [16:28:29] yeah, you've been saying "salt" for many weeks [16:28:47] well, it should have been obvious that I planned on adding it, then, right? :) [16:29:01] I've been testing it out for weeks too [16:29:11] and putting in bug reports and such [16:29:12] yet another thing in the labs stack though [16:29:18] it'll also be in production [16:29:30] the new deployment system is based on it as well [16:29:48] I have mixed feelings tbh [16:29:51] paravoid: otherwise it would be what? func? mcollective? fabric? [16:30:03] because I think it's definitely a nice thing to have [16:30:09] I don't want to write ruby, so I think mcollective can suck it [16:30:23] but otoh, we didn't compare the various tools at all [16:30:36] that leaves func, which is ok, but isn't in debian/ubuntu [16:30:46] and we've added yet another brick to the big labs stack that noone but you knows yet :) [16:30:50] and doesn't have much push behind it [16:31:16] otoh, I've really haven't been sleeping well, so I might just be grumpy. 
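The XML-versus-JSON aside above (converting the apt facts payload to JSON and finding it bigger) is easy to reproduce for any given payload. A throwaway comparison using an invented package-list schema; the real apt2xml format may well differ:

```python
#!/usr/bin/env python
"""Serialize the same (made-up) package list as compact XML and as JSON and
compare the sizes. The schema is invented; the real apt2xml output may differ."""

import json
from xml.dom import minidom

PACKAGES = [  # hypothetical data
    {"name": "openssh-server", "version": "1:5.9p1-5ubuntu1"},
    {"name": "puppet", "version": "2.7.11-1ubuntu2"},
    {"name": "nagios-nrpe-server", "version": "2.12-5ubuntu1"},
]


def as_xml(packages):
    doc = minidom.Document()
    root = doc.createElement("packages")
    doc.appendChild(root)
    for pkg in packages:
        node = doc.createElement("p")
        node.setAttribute("n", pkg["name"])
        node.setAttribute("v", pkg["version"])
        root.appendChild(node)
    return doc.toxml()


if __name__ == "__main__":
    xml_blob = as_xml(PACKAGES)
    json_blob = json.dumps(PACKAGES)
    print("xml : %d bytes" % len(xml_blob))
    print("json: %d bytes" % len(json_blob))
```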
[16:31:20] heh [16:31:27] I've looked at the alternatives :) [16:31:32] everyone else is welcome to do so [16:32:20] fabric is a beautiful babe [16:32:21] we'll likely end up using salt in production before we have per-project salt in labs [16:33:01] we can however, run salt as root from virt0, which would be nice [16:33:08] sshing into all the nodes kind of sucks [16:33:16] Not so sure about fabric for run x on y hosts, but for code deployment apart from the lack of support for rolling blcks (which should be easy enough to add) it's freaking awesome. [16:34:03] there's a bunch of things I like about salt, so far [16:34:15] I've been using it for a consulting project, too. it works well [16:34:29] there's a few things that annoy me, but that's what bug reports are for :) [16:35:15] Does it play nicely with switches? (Also it's far too hard to find python ssh stuff that plays nicely with switches, freaking bash expecting blehs) [16:35:50] Change on 12mediawiki a page Wikimedia Labs/Per-project saltstack remote execution was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=551961 edit summary: /* Using peer relationships */ [16:36:04] Damianz: plays nicely with switches? [16:36:06] what do you mean? [16:36:20] do configuration management for the switches? [16:37:46] I have scripts written in expect that do horrid pipe stuff to login to switches and pull certain bits (like show version that you can't get over snmp), really want to move to python but I keep hitting horrid issues where it just locks up. [16:37:51] one of the things I like about salt is that it can do both configuration management and remote execution, and the configuration management can call the remote execution [16:37:57] and you can do full iteration in both [16:38:09] Though ideally yes, I'm looking to manage switches programatically so I can let noobs make certain changes in our management app and snmp makes me cry. [16:38:47] Trying to remember the name of some paramiko based module I was playing with that was soo close but sucked for expect style usage. [16:39:01] yeah, I'd recommend using paramiko for that [16:39:14] rather than expect [16:39:28] expect is fucking horrible [16:40:37] You don't want to see the expect stuff I wrote to scrape mac addresses off of ports on 4 different types of switches.... that was a long week. [16:41:13] The most annoying thing was, we did the os upgrades and bam they supported grabing the mac address tables via snmp *grr* [16:41:47] heh [16:42:36] crap [16:42:36] My best friend right now is 100% graphite, eventhough it caused me to write loads of stuff using pysnmp. [16:42:41] andrewbogott: you make a good point [16:42:45] I didn't think about initial runs [16:42:59] hm [16:43:15] could likely make a remote branch [16:43:51] Any reason why we can edit the puppet client install stuff and make an option in ldap to do it? [16:43:54] then have a variable for puppetmaster::self, like "puppetmaster_initial_branch" [16:44:17] hm. no [16:44:18] Maybe we should just go ahead and make a project with a local puppetmaster... it doesn't seem like that would be especially hard. [16:44:19] I was thinking we where going to have multiple puppetmasters with a branch for each project which then got cherry picked into test. [16:44:20] that won't work [16:44:32] local puppetmaster & git repo [16:45:08] Damianz: Yeah, in the long run that'd be ideal. I haven't thought much about what's involved in doing that. [16:45:08] paravoid: thoughts? 
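For the expect-versus-paramiko tangent above (scraping "show version" and MAC tables off switches), a minimal paramiko sketch. The host, credentials and the interactive-shell dance are assumptions; many switch firmwares only answer on an interactive channel, which is why this uses invoke_shell() rather than exec_command():

```python
#!/usr/bin/env python
"""Pull 'show version' off a switch over SSH with paramiko instead of expect.
Host and credentials are placeholders; the interactive-shell approach is an
assumption, since many switch firmwares reject exec_command()."""

import time

import paramiko

HOST = "switch1.example.net"  # hypothetical
USER = "netadmin"             # hypothetical
PASSWORD = "secret"           # hypothetical


def run_on_switch(command, timeout=5):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USER, password=PASSWORD,
                   look_for_keys=False, allow_agent=False)
    try:
        chan = client.invoke_shell()
        chan.settimeout(timeout)
        chan.send(command + "\n")
        time.sleep(2)  # crude: give the switch a moment to answer
        output = ""
        while chan.recv_ready():
            output += chan.recv(4096).decode("utf-8", "replace")
        return output
    finally:
        client.close()


if __name__ == "__main__":
    print(run_on_switch("show version"))
```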
[16:45:32] andrewbogott: The main thing would be merging test back into the branches so they don't get un-mergeable... and who would be responsible for that. [16:45:41] that's the biggest problem... [16:45:46] the test branch diverged horribly [16:46:18] Yeah, but if divergence is project-specific, then... it becomes the project's problem rather than everyone's problem. [16:46:35] andrewbogott: realistically, this is still doable... [16:46:46] Ideally all boxes should be running test+anything we're testing for the projects, then pick into test, merge into prod... but it's stilly messy until we have the production cluster up with projects and we actually use the test cluser for well.. testing/package so we don't take down semi-prod stuff. [16:46:47] Ryan_Lane: Is there some way we could hack the puppet master so that a single variable causes it to pull from a specified repo? [16:46:52] Like, just a path prefix or something? [16:46:59] well, I mean just using the production branch this is still doable [16:47:21] Well, it depends what the bar to get into production is... [16:47:24] hm. it does kind of suck, though [16:47:27] If it's 6months for a merge then we're screwed. [16:47:39] Damianz: for almost all changes, this isn't a problem [16:47:55] it's just for testing installs from scratch [16:48:02] which, in production, we don't do [16:48:10] The best thing would be to just have our current puppet master, but somehow have it be branch-aware. [16:48:24] andrewbogott: that's doable, if we switch everything to modules [16:48:32] Shame you can't give it an ip in each project range and map the repos that way hmm [16:48:36] Oooh <3 modules [16:48:38] that was my original idea [16:48:39] Ryan_Lane: Well, puppetmaster::self also doesn't help if you're developing for a cluster rather than a single machine. [16:48:43] not modules [16:48:50] but an environment per branch [16:48:50] How do modules matter? [16:48:58] but you can't use environments without modules [16:49:28] andrewbogott: actually, with salt, we could say "switch all instances in the project to remote branch x" [16:49:48] then you can test clusters [16:50:15] * andrewbogott reads about environments [16:50:23] andrewbogott: puppet environments have bugs [16:50:32] they work properly with modules, but not without [16:50:56] I think getting rid of test is a good move, for now [16:50:56] Anyone used puppetdb yet btw? It looks kinda cool but not sure about the usage right now. [16:51:06] we break one use case, but make the normal use case way easier [16:51:20] Damianz: nope [16:52:12] Yeah, eliminating 'test' is probably fine, since no matter what we do next we'll be better off with everything merged. [16:53:48] yeah, and puppetmaster::self should work 99% of the time [16:54:24] remote branches + a decent salt module + salt-per-project should solve the reconfigure an entire cluster problem [16:54:48] andrewbogott: https://www.mediawiki.org/wiki/Wikimedia_Labs/Per-project_saltstack_remote_execution [16:55:02] salt needs to add a feature for it to be possible, though [16:55:15] they marked my feature request approved, at least [16:55:57] I think in this case remote execution is a workaround though [16:56:03] I could make a labsconsole interface for this [16:56:11] Is salt a replacement for puppet, or somehow complementary? [16:56:16] paravoid: how so? [16:56:24] paravoid: either way, all instances would need to switch environments [16:56:29] didn't we agree to add features sparingly and focus on stability? 
:) [16:56:32] andrewbogott: complementary [16:56:39] paravoid: yes. I'm just writing specs right now [16:56:45] adding remote execution to labsconsole surely seems like a feature :) [16:56:53] I was waiting on you for the ciscos ;) [16:57:01] yeah, I'm about to start [16:57:03] what else am I supposed to do? heh [16:57:09] I was finishing up the other stuff, fiiiinally [16:57:19] fix the "no nova credentials"? [16:57:20] :P [16:57:22] now that I see gluster 3.3 is released, I'll probably upgrade that [16:57:43] yeah [16:57:51] I have an idea on how I'mgoing to solvet hat [16:57:53] *that [16:57:54] but... [16:58:03] doing so may affect thousands of users [16:58:07] so, I have to be careful there [16:59:11] the safe way of fixing it is something I don't want to spend my time on [16:59:25] it would be a fairly large change in mediawiki core [17:00:29] Ryan_Lane: You can break en.wiki if you fix the session issue ;) I won't mind [17:00:49] Damianz: it would affect thousands of third-party mediawiki users [17:00:54] that's what I mean ;) [17:01:18] * Damianz changes to 'as long as you don't break the ldapauth plugin' then it's fine :D [17:01:33] that's what I would break [17:01:42] Actually I should update my mw install *sigh*, I really hate having to copy all the externsion stuff over every time. [18:10:36] 06/20/2012 - 18:10:36 - Creating a home directory for jmorgan at /export/keys/jmorgan [18:11:36] 06/20/2012 - 18:11:35 - Updating keys for jmorgan at /export/keys/jmorgan [18:33:18] paravoid: I just watched a video about saltstack, and I'm still not clear what it will do for us that puppet doesn't already do... any ideas? (Or is that strictly Ryan's vision?) [18:34:08] it's Ryan's vision, although I think the point is some kind of doing targetted remote execution, instead of e.g. ssh'ing into hosts [18:34:36] Ah, as an ssh replacement it makes perfect sense. [20:01:36] paravoid: are you going to tell us we are switching to something else [20:02:05] ? [20:02:25] that paravoid: I just watched a video about saltstack, and I'm still not clear what it will do for us that puppet doesn't already do... any ideas? (Or is that strictly Ryan's vision?) [20:02:44] we're not switching away from puppet [20:02:47] period :) [20:02:49] I finally start to understand puppet a bit [20:02:51] petan: It's not instead of puppet, it's in addition to. [20:02:56] ah [20:02:56] ok [20:03:42] there should be new mailing list [20:03:49] ryans-visions-l [21:15:01] hmm yesterday i could load Special:NovaAddress and today i can't [21:15:17] any admin could take a look, what's going on, pls? [21:18:02] do you get no nova credentials found etc.? [21:18:11] if so, logout and login again [21:18:34] again? [21:18:39] oki, mmt [21:19:13] good [21:19:15] thx [21:19:21] should be fixed anyway [22:26:09] does anyone know if something changed last week that would have made ganglia.wmflabs.org inaccessible? i can get to it by its internal IP, but not its external "floating" IP. [22:35:47] ssmollett: You're talking about 208.80.153.234, right? [22:41:40] andrewbogott: yes [22:44:49] I am poking uselesly but not learning anything that you don't already know :( [22:45:23] ssmollett: I take it you have confirmed that it can serve web pages? (I can't even get that far, with a proxy) [22:46:28] i can get to http://aggregator1.pmtpa.wmflabs/latest from other labs instances. [22:48:57] Oh, so can I, it just redirects me immediately. [22:50:06] I am going to click on this link that says 'reassociate IP' if you have not already done so. 
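On "targeted remote execution instead of ssh'ing into hosts": this is roughly what Salt's Python client looks like from the master's side. A sketch only; the grain used for targeting is an assumption about how labs instances might be labelled, not how they actually are:

```python
#!/usr/bin/env python
"""Targeted remote execution from the salt master via the Python client.
The grain-based target is an assumption about how labs instances might be
labelled, not how they actually are."""

import salt.client


def main():
    local = salt.client.LocalClient()

    # Ping every minion the master knows about.
    print(local.cmd("*", "test.ping"))

    # Run a command on a subset only, matched by a (hypothetical) grain.
    result = local.cmd("project:deployment-prep", "cmd.run",
                       ["uptime"], expr_form="grain")
    for minion, output in sorted(result.items()):
        print("%s: %s" % (minion, output))


if __name__ == "__main__":
    main()
```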
[22:50:21] i haven't; go for it. [22:51:04] Oh, I'm not netadmin. I bet you are though... [22:53:11] and that did it! [22:53:25] Cool. No doubt it broke something else as well... [22:53:40] At least, the IP was responding to ping, so it was pointed at /something/ [22:54:10] very strange. [22:54:42] I am temporarily unable to log into virt0. If you have access you could ask nova what it thinks is using that ip. [22:57:30] it looks like it has the right instance. [22:58:22] Well. I guess we can wait and see if the problem happens again. [22:58:27] And then ask nova /while/ it is happening :/ [22:59:08] indeed. [23:01:39] It could be that the iptables rules got wiped which screwed the security groups, not sure what the ICMP policy is though. [23:02:08] A broken security group would explain what I was seeing. [23:02:32] Um... except maybe that would break internal browsing too
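For the ganglia symptom that closes the log (the floating IP answers ping but not HTTP, while the internal address serves fine), a small probe comparing ICMP and TCP/80 on both addresses helps separate a wiped security group or NAT rule from a dead web server: ping OK but port 80 closed on the floating IP only points at filtering or NAT, not at apache. The internal address below is a placeholder:

```python
#!/usr/bin/env python
"""Compare ICMP and TCP/80 reachability of the floating and internal
addresses. The internal address below is a placeholder."""

import os
import socket
import subprocess

TARGETS = {
    "floating": "208.80.153.234",  # the IP discussed above
    "internal": "10.4.0.79",       # hypothetical address of aggregator1
}


def pings(addr):
    """One ICMP echo via the system ping binary."""
    with open(os.devnull, "w") as devnull:
        return subprocess.call(["ping", "-c", "1", "-W", "2", addr],
                               stdout=devnull, stderr=devnull) == 0


def port_open(addr, port=80, timeout=3):
    try:
        socket.create_connection((addr, port), timeout).close()
        return True
    except (socket.error, socket.timeout):
        return False


if __name__ == "__main__":
    for label, addr in sorted(TARGETS.items()):
        print("%-8s %-15s ping=%-5s http=%s"
              % (label, addr, pings(addr), port_open(addr)))
```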