[00:00:06] RECOVERY Current Users is now: OK on mwreview-proto i-00000286 output: USERS OK - 1 users currently logged in
[00:00:36] RECOVERY Disk Space is now: OK on mwreview-proto i-00000286 output: DISK OK
[00:01:16] RECOVERY Free ram is now: OK on mwreview-proto i-00000286 output: OK: 88% free memory
[00:02:16] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 20.82, 11.87, 7.35
[00:15:14] PROBLEM HTTP is now: CRITICAL on mwreview-proto i-00000286 output: CRITICAL - Socket timeout after 10 seconds
[00:27:16] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 2.22, 3.09, 4.46
[00:37:47] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[00:54:16] Does anyone recall what's needed to open up web services on a lab instance? I know how to do the proxying, but right now I can't access port 80 even from bastion.
[01:09:37] hm... I guess something about security groups has changed in the gui.
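The port-80 question above is typically a security-group rule rather than anything on the instance itself: Labs instances only accept inbound traffic that their project's security group allows. A hedged sketch of opening the port (the group name and source CIDR are assumptions, and which client was in use isn't stated in the log; both the EC2-style euca2ools and the OpenStack nova client offered such a command in this era):

```shell
# euca2ools (EC2-compatible API): allow inbound TCP 80 on the
# project's "default" security group
euca-authorize -P tcp -p 80 -s 10.4.0.0/16 default

# equivalent with the nova client
nova secgroup-add-rule default tcp 80 80 10.4.0.0/16
```

Rules apply to every instance in the group; scoping the source to the internal network (the 10.4.0.0/16 range is inferred from the 10.4.0.140 address appearing later in this log) keeps port 80 reachable from bastion without exposing it publicly.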
[02:39:20] 05/25/2012 - 02:39:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:42:36] RECOVERY dpkg-check is now: OK on demo-web2 i-00000285 output: All packages OK
[02:43:27] RECOVERY Current Load is now: OK on demo-web2 i-00000285 output: OK - load average: 0.10, 0.09, 0.02
[02:43:58] RECOVERY Current Users is now: OK on demo-web2 i-00000285 output: USERS OK - 0 users currently logged in
[02:44:37] RECOVERY Disk Space is now: OK on demo-web2 i-00000285 output: DISK OK
[02:45:20] 05/25/2012 - 02:45:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:45:56] RECOVERY Free ram is now: OK on demo-web2 i-00000285 output: OK: 88% free memory
[02:46:20] 05/25/2012 - 02:46:19 - Updating keys for laner at /export/home/deployment-prep/laner
[02:46:36] RECOVERY Total Processes is now: OK on demo-web2 i-00000285 output: PROCS OK: 81 processes
[02:48:20] 05/25/2012 - 02:48:19 - Updating keys for laner at /export/home/deployment-prep/laner
[02:51:20] 05/25/2012 - 02:51:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:52:20] 05/25/2012 - 02:52:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:21:55] RECOVERY Current Load is now: OK on mingledbtest i-00000283 output: OK - load average: 0.05, 0.06, 0.01
[03:22:35] RECOVERY Current Users is now: OK on mingledbtest i-00000283 output: USERS OK - 0 users currently logged in
[03:23:27] RECOVERY Disk Space is now: OK on mingledbtest i-00000283 output: DISK OK
[03:23:27] RECOVERY Free ram is now: OK on mingledbtest i-00000283 output: OK: 93% free memory
[03:23:27] RECOVERY dpkg-check is now: OK on mingledbtest i-00000283 output: All packages OK
[03:24:58] RECOVERY Total Processes is now: OK on mingledbtest i-00000283 output: PROCS OK: 95 processes
[03:47:56] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[03:51:34] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 16% free memory
[03:53:34] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 16% free memory
[03:57:14] PROBLEM host: mwreview-proto is DOWN address: i-00000287 check_ping: Invalid hostname/address - i-00000287
[04:03:51] PROBLEM Current Load is now: CRITICAL on mw-proto i-00000288 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:04:31] PROBLEM Current Users is now: CRITICAL on mw-proto i-00000288 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:05:01] PROBLEM Disk Space is now: CRITICAL on mw-proto i-00000288 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:05:41] PROBLEM Free ram is now: CRITICAL on mw-proto i-00000288 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:05:54] Oren_Ishi: howdy
[04:05:59] !account-questions | Oren_Ishi
[04:05:59] Oren_Ishi: I need the following info from you: 1. Your preferred wiki user name. This will also be your git username, so if you'd prefer this to be your real name, then provide your real name. 2. Your preferred email address. 3. Your SVN account name, or your preferred shell account name, if you do not have SVN access.
[04:06:30] (Wiki username can't be changed. If you want your real name to be associated with your work, use that.)
[04:06:51] PROBLEM Total Processes is now: CRITICAL on mw-proto i-00000288 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:29] Ryan_Lane: "Large penis has merged " :D
[04:07:31] PROBLEM dpkg-check is now: CRITICAL on mw-proto i-00000288 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:38] o.O
[04:07:41] Damianz: eh?
[04:08:01] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 12% free memory
[04:08:16] Username free rein! The possibilities
[04:08:21] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory
[04:08:41] ah. heh
[04:09:03] we really need a username blacklist when we have open registration :D
[04:10:02] it's odd that load is now basically constant
[04:10:05] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[04:10:21] constantly high too
[04:10:44] I wonder what happens if I trip over the power cord to dumps
[04:10:44] seems to correlate with swap, though
[04:10:51] heh
[04:11:05] lemme see if hydriz deleted those instances...
[04:11:26] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 3% free memory
[04:11:28] Waiting for code to compile is boring... hurry up nagios so I can go back to yelling at you for refusing to admit stuff exists in *that* struct.
[04:11:45] I *really* need to fix this session issue
[04:11:55] it's driving me mad now that I have OATHAuth enabled
[04:13:11] yep. only 4 instances in dumps
[04:13:12] Find a bored core dev and get them to make a sane session handling interface?
[04:13:16] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory
[04:13:23] well, part of the problem is LdapAuth
[04:13:46] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory
[04:13:56] RECOVERY Current Load is now: OK on mw-proto i-00000288 output: OK - load average: 1.00, 0.44, 0.38
[04:13:57] Really? Never had a problem with LdapAuth, though I do use it behind HTTP basic auth so there is always state to pick up sessions off
[04:14:26] RECOVERY Current Users is now: OK on mw-proto i-00000288 output: USERS OK - 1 users currently logged in
[04:15:06] RECOVERY Disk Space is now: OK on mw-proto i-00000288 output: DISK OK
[04:15:36] RECOVERY Free ram is now: OK on mw-proto i-00000288 output: OK: 92% free memory
[04:15:45] there's an issue with long-term cookies
[04:16:04] it creates a session from scratch, so it doesn't know your LDAP domain
[04:16:17] which isn't much of an issue, usually
[04:16:34] but, it's an issue with OpenStackManager, since it needs to know which domain to pull stuff from
[04:16:36] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:16:56] RECOVERY Total Processes is now: OK on mw-proto i-00000288 output: PROCS OK: 90 processes
[04:17:36] RECOVERY dpkg-check is now: OK on mw-proto i-00000288 output: All packages OK
[04:23:38] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory
[04:27:58] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:38:14] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[05:03:37] PROBLEM HTTP is now: CRITICAL on deployment-apache20 i-0000026c output: CRITICAL - Socket timeout after 10 seconds
[05:08:27] PROBLEM HTTP is now: WARNING on deployment-apache20 i-0000026c output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.005 second response time
[05:46:34] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[08:02:07] PROBLEM HTTP is now: CRITICAL on deployment-apache21 i-0000026d output: CRITICAL - Socket timeout after 10 seconds
[08:02:07] PROBLEM HTTP is now: CRITICAL on deployment-apache22 i-0000026f output: CRITICAL - Socket timeout after 10 seconds
[08:06:57] PROBLEM HTTP is now: WARNING on deployment-apache21 i-0000026d output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.015 second response time
[08:06:57] PROBLEM HTTP is now: WARNING on deployment-apache22 i-0000026f output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.014 second response time
[08:07:06] 05/25/2012 - 08:07:06 - Creating a home directory for py at /export/home/mobile-sms/py
[08:08:06] 05/25/2012 - 08:08:06 - Updating keys for py at /export/home/mobile-sms/py
[08:10:27] yeah, load rising again ;-D
[08:10:59] hhhmmmm... now when I try to log into that box I get Connection closed by 10.4.0.140
[08:11:11] an improvement over Permission denied!
[08:11:12] woo!
[08:11:24] is that a new box?
[08:11:26] * notpeter doesn't know his way around labs....
[08:11:33] nope
[08:11:37] oh
[08:11:44] I just need to get in and grab some built packages
[08:11:50] does it need to be rebooted?
[08:11:57] my points of contact at this time are mutante and paravoid
[08:12:15] yeah, try rebooting, maybe that will trigger a new puppet run
[08:12:16] yeah, they are rockstars :)
[08:12:19] ;)
[08:12:30] well, no, my key is there
[08:12:40] but it seems like something is messed up aside from that
[08:14:50] hey, do you guys know how to go about applying for a new project?
[08:15:33] notpeter: maybe a wrong username?
[08:16:03] I spent a good amount of time figuring out that "antoine" != "hashar" ;-D
[08:20:51] xD
[08:20:55] !labs | ori-livneh
[08:20:56] ori-livneh: https://labsconsole.wikimedia.org/wiki/
[08:21:02] !account
[08:21:02] in order to get access to labs, please type !account-questions and ask Ryan, or someone who is in charge of creating accounts on labs
[08:21:25] if you already have access to labs:
[08:21:32] a) for a new project, ask ryan
[08:21:38] b) for an existing project, ask a member
[08:23:52] !log deployment-prep hashar: killed stuck jobs on jobrunner 02 and 03. Restarted loop.
[08:23:55] Logged the message, Master
[08:24:02] \O/
[08:27:28] * ori-livneh pokes ryan
[08:27:50] have an account, need project.
[08:35:31] !log deployment-prep root: gzipped /home/wikipedia/logs/archive/*20120525 see {{bug|37012}} :-(
[08:35:33] Logged the message, Master
[08:38:26] !log deployment-prep root: on dbdump, deleted /etc/logrotate.d/mw-udp2log . Most probably in conflict with the one from deployment-feed which hosts the udp2log process
[08:38:28] Logged the message, Master
[08:41:53] hashar: looks like another puppet run took care of it. woo!
[08:42:00] great
[08:42:03] (the issues I was having earlier, I mean)
[08:43:08] off for some time
[08:43:13] listening to a conf about wikidata
[08:43:16] in my city !!!!
[08:43:16] http://en.wikipedia.org/wiki/User:Daniel_Mietchen/Talks/Open_Data_Week_2012/Wikidata
[08:43:18] \O/
[08:43:22] see ya later
[09:11:07] PROBLEM Puppet freshness is now: CRITICAL on ganglia-test3 i-0000025b output: Puppet has not run in last 20 hours
[09:48:17] PROBLEM Free ram is now: CRITICAL on bots-apache1 i-000000b0 output: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:02] RECOVERY Free ram is now: OK on bots-apache1 i-000000b0 output: OK: 84% free memory
[10:29:10] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[11:23:48] 05/25/2012 - 11:23:47 - Creating a project directory for pybal
[11:32:48] 05/25/2012 - 11:32:48 - Creating a project directory for pybal-is-awesome
[11:32:49] 05/25/2012 - 11:32:48 - Creating a home directory for faidon at /export/home/pybal-is-awesome/faidon
[11:32:49] 05/25/2012 - 11:32:48 - Creating a home directory for mark at /export/home/pybal-is-awesome/mark
[11:33:47] 05/25/2012 - 11:33:46 - Updating keys for faidon at /export/home/pybal-is-awesome/faidon
[11:33:47] 05/25/2012 - 11:33:46 - Updating keys for mark at /export/home/pybal-is-awesome/mark
[11:43:58] PROBLEM Disk Space is now: CRITICAL on pybal-precise i-00000289 output: Connection refused by host
[11:52:57] PROBLEM Free ram is now: CRITICAL on pybal-precise i-00000289 output: Connection refused by host
[11:53:18] PROBLEM Current Users is now: CRITICAL on pybal-precise i-00000289 output: Connection refused by host
[11:53:18] PROBLEM dpkg-check is now: CRITICAL on pybal-precise i-00000289 output: Connection refused by host
[11:53:37] PROBLEM Current Load is now: CRITICAL on pybal-precise i-00000289 output: Connection refused by host
[11:53:37] PROBLEM Total Processes is now: CRITICAL on pybal-precise i-00000289 output: Connection refused by host
[12:22:30] 05/25/2012 - 12:22:30 - Updating keys for mark at /export/home/lvs-labs/mark
[12:22:37] 05/25/2012 - 12:22:37 - Updating keys for mark at /export/home/varnish/mark
[12:22:45] 05/25/2012 - 12:22:45 - Updating keys for mark at /export/home/preflights/mark
[12:22:49] 05/25/2012 - 12:22:49 - Updating keys for mark at /export/home/pybal-is-awesome/mark
[12:23:03] 05/25/2012 - 12:23:02 - Updating keys for mark at /export/home/testlabs/mark
[12:23:08] 05/25/2012 - 12:23:08 - Updating keys for mark at /export/home/puppet/mark
[12:23:10] 05/25/2012 - 12:23:10 - Updating keys for mark at /export/home/testswarm/mark
[12:23:15] 05/25/2012 - 12:23:15 - Updating keys for mark at /export/home/bastion/mark
[12:23:26] 05/25/2012 - 12:23:26 - Updating keys for mark at /export/home/mail/mark
[13:24:54] PROBLEM Free ram is now: CRITICAL on deployment-squid i-000000dc output: Critical: 5% free memory
[13:53:54] PROBLEM dpkg-check is now: CRITICAL on tutorial-mysql i-0000028b output: Connection refused by host
[13:55:14] PROBLEM Current Load is now: CRITICAL on tutorial-mysql i-0000028b output: Connection refused by host
[13:55:54] PROBLEM Current Users is now: CRITICAL on tutorial-mysql i-0000028b output: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:56:29] PROBLEM Disk Space is now: CRITICAL on tutorial-mysql i-0000028b output: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:57:04] PROBLEM Free ram is now: CRITICAL on tutorial-mysql i-0000028b output: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:58:24] PROBLEM Total Processes is now: CRITICAL on tutorial-mysql i-0000028b output: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:59:23] !log tutorial Installing mysql-server on tutorial-mysql
[13:59:25] Logged the message, Mr. Obvious
[13:59:34] lol
[14:01:55] !log bots I want to be Mr. Obvious
[14:01:56] Logged the message, Master
[14:03:31] haha
[14:03:36] Even the labs morebots calls me that?
[14:05:28] Same bot, pretty much
[14:18:47] New review: Faidon; "Do you mean $::realm instead of $::cluster? " [operations/puppet] (test); V: 0 C: -1; - https://gerrit.wikimedia.org/r/8575
[14:27:00] !log deployment-prep hashar: installed jobrunner05 and 06 using Ubuntu precise. Should let us get a 0.27 ffmpeg installation for {{bug|37043}}
[14:27:02] Logged the message, Master
[14:33:45] PROBLEM dpkg-check is now: CRITICAL on deployment-jobrunner05 i-0000028c output: Connection refused by host
[14:35:07] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner05 i-0000028c output: Connection refused by host
[14:35:07] PROBLEM Current Users is now: CRITICAL on deployment-jobrunner06 i-0000028d output: Connection refused by host
[14:35:57] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner06 i-0000028d output: Connection refused by host
[14:36:27] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner06 i-0000028d output: Connection refused by host
[14:36:58] PROBLEM SSH is now: CRITICAL on deployment-jobrunner06 i-0000028d output: Connection refused
[14:37:47] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner06 i-0000028d output: Connection refused by host
[14:38:29] PROBLEM dpkg-check is now: CRITICAL on deployment-jobrunner06 i-0000028d output: Connection refused by host
[14:39:38] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner06 i-0000028d output: CHECK_NRPE: Socket timeout after 10 seconds.
[14:42:08] RECOVERY SSH is now: OK on deployment-jobrunner06 i-0000028d output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[14:42:57] !log deployment-prep hashar: installing applicationserver::homeless and applicationserver::jobrunner on jobrunner05
[14:42:58] Logged the message, Master
[14:43:35] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:43:58] PROBLEM Current Users is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:43:58] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:44:18] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:45:47] !log deployment-prep hashar: installing applicationserver::homeless and applicationserver::jobrunner on jobrunner06
[14:45:49] Logged the message, Master
[15:14:20] 05/25/2012 - 15:14:20 - Updating keys for laner at /export/home/deployment-prep/laner
[15:16:18] 05/25/2012 - 15:16:18 - Updating keys for laner at /export/home/deployment-prep/laner
[15:18:20] 05/25/2012 - 15:18:20 - Updating keys for laner at /export/home/deployment-prep/laner
[15:18:33] RECOVERY Disk Space is now: OK on deployment-jobrunner05 i-0000028c output: DISK OK
[15:18:48] !log deployment-prep hashar: deleted jobrunner06 (precise), we just need one precise instance which will be jobrunner05 for now
[15:18:50] Logged the message, Master
[15:18:52] RECOVERY dpkg-check is now: OK on deployment-jobrunner05 i-0000028c output: All packages OK
[15:19:05] RECOVERY Current Users is now: OK on deployment-jobrunner05 i-0000028c output: USERS OK - 1 users currently logged in
[15:19:05] RECOVERY Free ram is now: OK on deployment-jobrunner05 i-0000028c output: OK: 94% free memory
[15:19:44] RECOVERY Total Processes is now: OK on deployment-jobrunner05 i-0000028c output: PROCS OK: 96 processes
[15:20:03] RECOVERY Current Load is now: OK on deployment-jobrunner05 i-0000028c output: OK - load average: 0.29, 0.53, 0.41
[15:26:01] paravoid: so I looked at installing a job runner using Ubuntu precise. It complains about wikimedia-task-appserver and wikimedia-job-runner being unavailable
[15:26:09] I guess 'cause they are only in the lucid repo for now
[15:26:16] so the choice is either:
[15:26:26] 1) port ffmpeg 0.27 to Lucid
[15:26:37] 2) forward-port the wikimedia* packages to precise
[15:27:21] will add that on bug 37043 whenever bugzilla is up :)
[15:27:36] we have to do (2) at some point anyway, don't we?
[15:27:47] go hashar go
[15:27:59] yup, 2 is going to be needed
[15:28:26] I am just afraid of that taking a long time
[15:28:30] I have no idea
[15:28:48] more time than rebuilding the crap that ffmpeg is? :)
[15:28:53] and risking production?
[15:29:21] :)
[15:29:29] what's the hostname of the precise job runner?
[15:31:48] jobrunner05
[15:32:01] hm
[15:32:22] paravoid: indeed, ffmpeg on Lucid might cause issues
[15:32:27] so let's do (2)
[15:35:22] ffmpeg also breaks things, incl. ABI, all the time
[15:35:39] most of the time silently
[15:35:45] it's like one of the worst upstreams out there
[15:38:29] I have posted my comment on the bug report: https://bugzilla.wikimedia.org/show_bug.cgi?id=37043#c9
[15:38:45] is upgrading the wikimedia* packages to Precise a huge task?
[15:39:00] I mean, would it be possible to get them installed before the Berlin hackathon, aka next week?
[15:39:13] not sure
[15:39:15] I'll try
[15:41:10] chrismcmahon: I have added a few tracking bugs for the beta project: https://bugzilla.wikimedia.org/showdependencytree.cgi?id=37079&hide_resolved=1
[15:41:24] chrismcmahon: that could help track the project's progress
[15:42:01] hashar: thanks, I'll probably send that to wikivideo-l today
[15:42:03] they're very very useful
[15:42:16] feel free to add blocking bugs there
[15:42:21] or open new tracking bugs
[15:42:35] thanks a lot for doing that, it helped me quite a bit
[15:43:00] chrismcmahon, hashar: probably not your fault, but it still worries me how we're trying to achieve two different goals at the same time
[15:43:11] that is, replicating production *and* deploying TMH
[15:43:54] indeed
[15:44:18] paravoid: the same issue will come up again. what we're after is the ability to deploy new features to a prod-like environment. TMH just turned out to be a difficult instance of something we'll want to do over and over
[15:46:02] I have spent like 2 days trying to replicate the thumb system we have in production
[15:46:04] that is, we would have the same issues deploying to prod as we have now in labs. (I think) wrt versions, packages etc.
[15:46:09] was probably not my smartest move ;-D
[15:46:57] enwiki-fe97c599: 3.6607 21.5M ForeignAPIRepo::getThumbUrlFromCache could not write to thumb path 'mwstore://wikimediacommons-backend/wikimediacommons-thumb/4/4a/Commons-logo.svg/22px-Commons-logo.svg.png'
[15:46:59] yeahhhh
[15:47:10] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[15:51:22] will see that later
[15:51:26] I am off for this week
[15:51:49] RECOVERY Current Users is now: OK on pybal-precise i-00000289 output: USERS OK - 1 users currently logged in
[15:51:49] RECOVERY dpkg-check is now: OK on pybal-precise i-00000289 output: All packages OK
[15:51:59] RECOVERY Current Load is now: OK on pybal-precise i-00000289 output: OK - load average: 0.63, 0.57, 0.62
[15:51:59] RECOVERY Total Processes is now: OK on pybal-precise i-00000289 output: PROCS OK: 82 processes
[15:52:32] RECOVERY Disk Space is now: OK on pybal-precise i-00000289 output: DISK OK
[15:58:31] RECOVERY Free ram is now: OK on pybal-precise i-00000289 output: OK: 88% free memory
[16:41:10] is anyone here a mailman admin? I've got a bunch of out-of-office spam coming in on wikivideo-l
[16:49:34] New patchset: Andrew Bogott; "Creating generic::mysql::server class that installs packages and sets up my.cnf and starts mysqld." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8926
[16:49:49] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (test); V: -1 - https://gerrit.wikimedia.org/r/8926
[16:53:27] New patchset: Andrew Bogott; "Creating generic::mysql::server class that installs packages and sets up my.cnf and starts mysqld." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8926
[16:53:42] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8926
[16:55:22] New review: Andrew Bogott; "Fellow Andrew -- I'd appreciate if you could glance at this and ensure that I'm not missing your int..." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8926
[17:04:01] PROBLEM dpkg-check is now: CRITICAL on deployment-imagescaler01 i-0000025a output: CHECK_NRPE: Socket timeout after 10 seconds.
[17:08:50] RECOVERY dpkg-check is now: OK on deployment-imagescaler01 i-0000025a output: All packages OK
[17:36:45] Ryan_Lane: btw, http://lonesysadmin.net/vExcuse/
[17:36:54] one of my ex-colleagues sent me that :)
[17:37:20] hahaha
[17:38:23] seems appropriate :)
[17:39:14] what's going to be our excuse after gluster?
[17:39:15] ceph?
[17:39:24] Ceph looks pretty cool
[17:39:32] Planning my test cluster at work to see if it actually functions
[17:51:45] 05/25/2012 - 17:51:45 - Creating a home directory for aaron at /export/home/ipv6/aaron
[17:52:46] 05/25/2012 - 17:52:46 - Updating keys for aaron at /export/home/ipv6/aaron
[17:54:47] 05/25/2012 - 17:54:46 - Creating a home directory for robla at /export/home/ipv6/robla
[17:54:47] 05/25/2012 - 17:54:46 - Creating a home directory for erik at /export/home/ipv6/erik
[17:55:44] 05/25/2012 - 17:55:44 - Updating keys for robla at /export/home/ipv6/robla
[17:55:45] 05/25/2012 - 17:55:44 - Updating keys for erik at /export/home/ipv6/erik
[18:11:29] 05/25/2012 - 18:11:28 - Creating a home directory for ori at /export/home/editor-engagement/ori
[18:11:40] !log editor-engagement adding Ori.livneh for experimentation
[18:11:40] Ryan_Lane: thanks!
[18:11:42] Logged the message, Master
[18:12:00] yw
[18:12:18] when the bot says you have a home directory and your keys are updated you'll be able to log into instances and such
[18:12:24] you can create instances inside of the project now
[18:12:27] 05/25/2012 - 18:12:27 - Updating keys for ori at /export/home/editor-engagement/ori
[18:13:26] great, thanks again
[18:37:39] hi Ryan_Lane
[18:39:27] aude: howdy
[18:39:45] Ryan_Lane: i'd like to add an instance to the maps stuff (and get some stuff done before the hackathon)
[18:39:55] cool
[18:39:57] wondering about the choices of instance types
[18:40:04] I'd recommend using the m1 types
[18:40:11] ok
[18:40:12] probably shouldn't use larger than a large
[18:40:15] what's the difference?
[18:40:23] we're having IO issues right now, that are causing performance problems
[18:40:30] oh
[18:40:31] right
[18:40:32] s1 instances have more local storage, which you don't want to use
[18:40:38] right
[18:40:44] you want to use the project storage at /data/project
[18:41:05] medium is probably ok and even small might work for just testing stuff
[18:41:17] is it possible to resize them later?
[18:41:50] not currently
[18:41:56] and we'll be using puppet
[18:41:58] ok
[18:42:24] but we want to configure things where it's easy to replicate the setup in a new server
[18:42:38] yep. that's the way to go about it
[18:42:44] like amazon does
[18:43:04] well, you replicate the setup via puppet
[18:43:12] right
[18:43:20] then when you create a new instance, you just add the puppet classes/variables
[18:43:37] that works
[18:43:49] i've been working with the wikidata labs site so have some idea
[18:44:13] not sure we're doing everything the right way there but it works for now
[18:44:48] * Ryan_Lane nods
[18:45:19] well, things aren't the easiest to work with right now, everything is still in the early phases
[18:45:20] we'll obviously need a deployment test setup when ready, with all the same settings as wikipedia
[18:45:35] that's for wikidata
[18:47:30] ok, it works i think :)
[18:53:45] PROBLEM Current Load is now: CRITICAL on maps-test3 i-0000028f output: Connection refused by host
[18:54:18] Ryan_Lane: assume that's normal?
[18:54:25] yep
[18:54:25] PROBLEM Current Users is now: CRITICAL on maps-test3 i-0000028f output: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:54:31] ok
[18:54:32] once puppet runs again it'll be fine
[18:54:38] * aude will check back in a bit
[18:54:49] have a 3-day weekend to work on stuff :)
[18:55:05] PROBLEM Disk Space is now: CRITICAL on maps-test3 i-0000028f output: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:55:45] PROBLEM Free ram is now: CRITICAL on maps-test3 i-0000028f output: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:56:55] PROBLEM Total Processes is now: CRITICAL on maps-test3 i-0000028f output: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:57:35] PROBLEM dpkg-check is now: CRITICAL on maps-test3 i-0000028f output: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:02:28] ssmollett: are you around?
[19:11:10] PROBLEM Puppet freshness is now: CRITICAL on ganglia-test3 i-0000025b output: Puppet has not run in last 20 hours
[19:12:50] maplebed: surprisingly, yes
[19:12:58] wow!
[19:13:07] I thought for sure that was going to go unanswered.
[19:13:22] I just wanted to let you know I found what might be an artifact of the ganglia upgrade
[19:13:42] spence (the nagios server) had two gmond restart scripts and would throw the Misc cluster back and forth between two cluster names, blowing away all history each time.
[19:14:04] I think I've managed to catch the history each time (with only the loss of an hour or two)
[19:14:24] but thought you might like to know.
[19:14:43] I think I've squished it (by putting a warning and exit 0 in the gmond restart script I don't like)
[19:14:47] but the same thing might exist elsewhere.
[19:15:47] I have a few thoughts on what to do to push the change cluster-wide (ensure => absent for /etc/init.d/gmond, for example) but I'm not going to do anything else just yet.
[19:16:52] interesting. are the two init scripts gmond and ganglia-monitor?
[19:16:56] yes.
[19:17:20] and now my lunch date is here - gotta run.
[19:25:21] !log tutorial Started mysql-server on tutorial-mysql
[19:25:23] Logged the message, Mr. Obvious
[19:25:27] !log tutorial Made mysql listen on all IPs rather than localhost on tutorial-mysql
[19:25:28] Logged the message, Mr. Obvious
[19:27:45] it also looks like the ganglia memcached plugin isn't working. but i'm not sure if it was working before the upgrade either.
[19:33:45] Hey Ryan_Lane: do you know why apache would keep getting killed off on a pretty much default lab instance?
[19:34:55] killed off?
[19:35:09] [167129.334496] 119768 pages non-shared
[19:35:09] [167129.334498] Out of memory: kill process 15598 (apache2) score 93645 or a child
[19:35:09] [167129.344936] Killed process 15598 (apache2)
[19:35:13] (from dmesg)
[19:35:50] Are there any memory limits set?
[19:36:29] Ryan_Lane, have time for a quick review? https://gerrit.wikimedia.org/r/#/c/8926/ (It's a cherry pick, but a messy one.)
[19:37:40] csteipp: no, but maybe your instance is running out of memory?
[19:37:48] csteipp: what size is it?
[19:37:59] Yeah, it could. It was a tiny / smallest instance
[19:38:06] andrewbogott_: :D quick?
[19:38:12] Is there a way to live migrate?
[19:38:39] csteipp: generally, you should likely always use small or higher
[19:38:51] csteipp: resize isn't currently available
[19:38:57] Lesson learned :)
[19:39:09] 512MB of ram isn't enough for basically anything. heh
[19:39:19] Is it possible to tweak that on an instance? Or do I have to rebuild the whole thing?
[19:39:21] Ryan_Lane: Well, it'll be quick if you don't look very hard.
[19:39:40] heh
[19:43:54] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: -1; - https://gerrit.wikimedia.org/r/8926
[19:44:16] andrewbogott_: just a few small things
[19:44:21] ok, thanks.
[19:44:38] also, you'll need a default labs role class for this
[19:44:52] since labs can't add parameterized classes
[19:45:08] we'll likely want /mnt to be the default data directory
[19:52:59] New review: Andrew Bogott; "Removed some whitespace messes as well." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8926
[19:53:19] New patchset: Andrew Bogott; "Creating generic::mysql::server class that installs packages and sets up my.cnf and starts mysqld." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8926
[19:53:34] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8926
[20:11:30] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8926
[20:11:32] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8926
[20:11:43] andrewbogott_: ^^
[20:11:54] ok
[20:11:57] you'll still need a role class to use it in labs
[20:12:07] to make it use /mnt
[20:22:09] PROBLEM Free ram is now: WARNING on ipv6test1 i-00000282 output: Warning: 14% free memory
[20:27:03] RECOVERY Free ram is now: OK on ipv6test1 i-00000282 output: OK: 21% free memory
[20:27:03] PROBLEM Total Processes is now: CRITICAL on e3 i-00000290 output: Connection refused by host
[20:27:43] PROBLEM dpkg-check is now: CRITICAL on e3 i-00000290 output: Connection refused by host
[20:29:12] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[20:36:43] PROBLEM Free ram is now: CRITICAL on e3 i-00000290 output: Connection refused by host
[20:36:53] PROBLEM Current Load is now: CRITICAL on e3 i-00000290 output: Connection refused by host
[20:37:06] PROBLEM Disk Space is now: CRITICAL on e3 i-00000290 output: Connection refused by host
[20:37:23] PROBLEM Current Users is now: CRITICAL on e3 i-00000290 output: Connection refused by host
[20:47:05] PROBLEM Disk Space is now: UNKNOWN on e3 i-00000290 output: Invalid host name i-00000290
[20:48:25] Ryan_Lane: Can you give me a 1-sentence explanation of what a 'role class' is and is for? I'm looking at class defs now, almost getting it.
[20:50:49] my understanding: the normal class is how to configure the service. the role class is how we configure a service for a specific duty.
[20:51:06] using swift as an example; swift.pp contains all the stuff on how to configure the proxy server etc.
[20:51:18] role/swift.pp contains all the values for the eqiad test cluster as a role class.
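The role-class pattern just described can be sketched in a few lines of Puppet. This is a hypothetical illustration, not the actual contents of swift.pp or the mysql change under review; the class and parameter names are made up for the example:

```puppet
# The generic class encodes *how* to configure the service,
# with the tunables exposed as parameters.
class generic::mysql::server( $datadir = '/var/lib/mysql' ) {
    package { 'mysql-server': ensure => present }
    # ... my.cnf template and service management would go here ...
}

# The role class pins the values for one specific duty. Since labs
# can't add parameterized classes directly, a parameterless wrapper
# like this is what gets assigned to instances; using /mnt as the
# data directory follows Ryan's suggestion above.
class role::labs::mysql {
    class { 'generic::mysql::server':
        datadir => '/mnt/mysql',
    }
}
```

An instance then just gets `include role::labs::mysql`, and the role class supplies all the site-specific values.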
[20:52:12] PROBLEM Disk Space is now: CRITICAL on e3 i-00000291 output: Connection refused by host
[20:52:25] PROBLEM Current Users is now: CRITICAL on e3 i-00000291 output: Connection refused by host
[20:52:42] PROBLEM dpkg-check is now: CRITICAL on e3 i-00000291 output: Connection refused by host
[20:52:48] maplebed: Ok, makes sense. Although you did use more than one sentence.
[20:53:20] clarity supersedes brevity.
[21:26:23] New patchset: Sara; "Ensure that the init script exists for only one of gmond and ganglia-monitor." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8977
[21:26:39] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8977
[21:27:37] maplebed: I'm looking at your swift role definition now. I pretty much don't understand what nesting does. Like, you have classes in classes in classes...
[21:27:55] I think that's just shorthand.
[21:28:06] but honestly I'm not actually sure either.
[21:29:10] shorthand like 'include class foo and, oh yeah, while I'm at it here's a definition for foo
[21:29:12] '?
[21:29:28] like class foo {
[21:29:32] class bar {}
[21:29:33] }
[21:29:36] is the same as
[21:29:42] class foo{}
[21:29:45] class foo::bar{}
[21:30:58] Ok, but in either case, what does it mean for bar to be 'in' foo? Other than namespace-wise?
[21:31:16] I think it only affects the namespace.
[21:31:39] in site.pp i'm still including role::swift::pmtpa-prod::proxy and things like that.
[21:31:50] RECOVERY Free ram is now: OK on e3 i-00000291 output: OK: 94% free memory
[21:31:51] So if a site.pp includes foo in a node... nothing in particular happens with bar unless it explicitly includes foo::bar?
[21:31:57] RECOVERY Total Processes is now: OK on e3 i-00000291 output: PROCS OK: 89 processes
[21:32:02] RECOVERY Current Load is now: OK on e3 i-00000291 output: OK - load average: 0.09, 0.12, 0.17
[21:32:03] hm, ok.
[21:32:06] curious.
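The conclusion the exchange above is groping toward is correct for Puppet: nesting one class inside another mainly establishes a namespace, and including the outer class does not evaluate the inner one. A small sketch:

```puppet
# Nesting bar inside foo is (namespace-wise) the same as declaring
# class foo { } and class foo::bar { } separately:
class foo {
    class bar {
        notify { 'foo::bar was evaluated': }
    }
}

# Including foo does NOT evaluate foo::bar; a node has to pull it in
# explicitly (or foo itself would need an "include bar"):
node 'example' {
    include foo        # foo::bar's resources are not applied
    include foo::bar   # now they are
}
```

This matches what maplebed observes: site.pp still has to include the fully qualified role::swift::pmtpa-prod::proxy even though it is lexically nested inside outer classes.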
[21:32:11] unless foo has a require or include bar, I think that's true.
[21:32:21] this is the blind leading the blind though
[21:32:26] so treat everything I say as suspect.
[21:40:58] maplebed: i think https://gerrit.wikimedia.org/r/8977 should take care of the conflicting init scripts. i'll test it out in labs now.
[21:45:45] New review: Sara; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8977
[21:45:47] Change merged: Sara; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8977
[21:46:02] ssmollett: maybe. this makes me think maybe not:
[21:46:02] root@spence:/var/log/nagios# dpkg -S /etc/init.d/gmond
[21:46:03] dpkg: /etc/init.d/gmond not found.
[21:46:27] i ran dpkg --purge gmond manually on spence.
[21:46:36] ah!
[21:47:09] in that case, rock on!
[21:51:27] RECOVERY Puppet freshness is now: OK on ganglia-test3 i-0000025b output: puppet ran at Fri May 25 21:51:19 UTC 2012
[21:52:00] maplebed: do you know if the ganglia memcached plugin used to work?
[21:52:18] depends on which plugin it is.
[21:52:42] I wrote a shitty one long ago that only works with a specific memcached version, but there are others around that are more flexible.
[21:52:49] RECOVERY dpkg-check is now: OK on ganglia-test3 i-0000025b output: All packages OK
[21:55:03] puppet/files/ganglia/plugins/memcached.py says 'Created by Ryan Lane on 2010-09-07.' But if I'm understanding the puppet manifest correctly, the .pyconf file for it wouldn't have been getting loaded.
[21:55:25] it's not
[21:55:36] I wrote that from scratch too
[21:55:40] and it's not really complete
[21:56:30] okay. so i should ignore it, then?
[21:56:33] yep
[21:56:37] maybe even delete it
[21:56:50] https://github.com/ganglia/gmond_python_modules/tree/master/memcached
[21:56:59] maybe just use that one instead?
[21:57:13] (I haven't looked at it, just see that it exists)
[21:57:35] amusingly, it was submitted to github within 2 weeks of the timestamp on Ryan_Lane's version.
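The body of change 8977 ("ensure that the init script exists for only one of gmond and ganglia-monitor") isn't quoted in the log, so this is only a hedged guess at its shape, keyed off the lsbdistrelease split discussed below. The 9.10 cutoff comes from the conversation; the variable names and logic are assumptions:

```puppet
# Hypothetical sketch: older Ubuntu releases ship the "gmond" package and
# init script, newer ones ship "ganglia-monitor".  Keep whichever applies
# and remove the other variant's init script.  The real change may differ.
if versioncmp($lsbdistrelease, '9.10') >= 0 {
    $ganglia_pkg = 'ganglia-monitor'
    $stale_init  = '/etc/init.d/gmond'
} else {
    $ganglia_pkg = 'gmond'
    $stale_init  = '/etc/init.d/ganglia-monitor'
}

package { $ganglia_pkg:
    ensure => present,
}

file { $stale_init:
    ensure => absent,
}
```

As the dpkg -S exchange above shows, an orphaned /etc/init.d/gmond that no package owns (e.g. after a manual `dpkg --purge`) is exactly the case Puppet has to clean up by path rather than by package.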
[21:57:42] hah
[21:57:48] 2010-09-21.
[21:58:03] New patchset: Andrew Bogott; "Stab in the dark, rolewise" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8979
[21:58:18] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8979
[21:59:04] Ryan_Lane: ^ is that roughly what you meant?
[21:59:48] yep
[21:59:53] your formatting is a little off :)
[22:00:17] with that, though, you can include role::labsdb in the puppet classes, and people can just include it
[22:00:23] then it'll use the proper datadir
[22:00:58] it would be really awesome if you could edit code directly in gerrit's interface, then resubmit a patch
[22:01:07] ^demon: ^^ make it happen! :D
[22:01:41] <^demon> Maybe people should write good code to begin with, then we can skip the entire issue ;-)
[22:02:05] heh
[22:02:36] Yeah, then 'git review' can just echo "I have total faith in you as a coder" and then do a push and bypass gerrit entirely.
[22:06:13] is anything likely to still be intentionally using the gmond package, instead of the ganglia-monitor package? that is, are there servers with lsbdistrelease older than 9.10?
[22:06:38] New patchset: Andrew Bogott; "Added a role for a simple labs mysql db." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8979
[22:06:40] Ryan_Lane: Better formatting?
[22:06:52] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8979
[22:08:08] PROBLEM Puppet freshness is now: CRITICAL on bots-4 i-000000e8 output: Puppet has not run in last 20 hours
[22:09:35] ssmollett: yes; ns0 and mchenry are two examples of hosts running hardy.
[22:13:14] andrewbogott_: yep!
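The role class from change 8979 isn't shown either, but given Ryan's constraints above (a parameterless class that labs can simply include, which routes data onto /mnt) it plausibly looked something like this sketch. The exact datadir path is an assumption:

```puppet
# Hypothetical reconstruction of the labs mysql role: a parameterless
# wrapper that labs instances can "include" directly, which passes the
# labs-appropriate data directory to the generic class.
class role::labsdb {
    class { 'generic::mysql::server':
        datadir => '/mnt/mysql',  # path assumed; discussion only says /mnt
    }
}
```

This is the whole point of the role layer here: labs can't set class parameters, but it can include role::labsdb, and the role supplies the parameters.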
[22:13:23] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8979
[22:13:25] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8979
[22:33:55] PROBLEM dpkg-check is now: CRITICAL on mwr-proto i-00000292 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:35:05] PROBLEM Current Load is now: CRITICAL on mwr-proto i-00000292 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:35:46] PROBLEM Current Users is now: CRITICAL on mwr-proto i-00000292 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:38:51] RECOVERY dpkg-check is now: OK on mwr-proto i-00000292 output: All packages OK
[22:40:21] RECOVERY Current Load is now: OK on mwr-proto i-00000292 output: OK - load average: 0.04, 0.50, 0.51
[22:40:45] RECOVERY Current Users is now: OK on mwr-proto i-00000292 output: USERS OK - 1 users currently logged in
[22:57:37] maplebed: i should also remove the gmetad package from spence, right?
[22:58:07] ssmollett: I haven't thought about it. probably?
[22:58:14] is it running?
[23:01:39] nope.
[23:03:15] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 3.85, 4.26, 4.78
[23:16:19] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 6.47, 6.17, 5.54
[23:41:14] PROBLEM Free ram is now: WARNING on ipv6test1 i-00000282 output: Warning: 18% free memory
[23:46:34] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 3.84, 4.25, 4.81
[23:53:45] PROBLEM Current Load is now: CRITICAL on aggregator-test3 i-00000293 output: Connection refused by host
[23:54:25] PROBLEM Current Users is now: CRITICAL on aggregator-test3 i-00000293 output: Connection refused by host
[23:55:05] PROBLEM Disk Space is now: CRITICAL on aggregator-test3 i-00000293 output: Connection refused by host
[23:55:45] PROBLEM Free ram is now: CRITICAL on aggregator-test3 i-00000293 output: Connection refused by host
[23:56:55] PROBLEM Total Processes is now: CRITICAL on aggregator-test3 i-00000293 output: Connection refused by host
[23:57:35] PROBLEM dpkg-check is now: CRITICAL on aggregator-test3 i-00000293 output: Connection refused by host