[00:07:19] join #wikimedia-operations
[00:07:22] oops
[02:45:20] 05/24/2012 - 02:45:19 - Updating keys for laner at /export/home/deployment-prep/laner
[02:48:32] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory
[02:49:52] RECOVERY Free ram is now: OK on ipv6test1 i-00000282 output: OK: 31% free memory
[02:49:52] RECOVERY Current Load is now: OK on ipv6test1 i-00000282 output: OK - load average: 1.39, 0.81, 0.34
[02:50:22] RECOVERY Current Users is now: OK on ipv6test1 i-00000282 output: USERS OK - 0 users currently logged in
[02:52:14] !log
[02:52:17] !help log
[02:52:17] !documentation for labs !wm-bot for bot
[02:52:26] !documentation
[02:52:26] https://labsconsole.wikimedia.org/wiki/Help:Contents
[02:53:22] RECOVERY Total Processes is now: OK on ipv6test1 i-00000282 output: PROCS OK: 92 processes
[02:53:59] RECOVERY dpkg-check is now: OK on ipv6test1 i-00000282 output: All packages OK
[02:54:02] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK
[02:55:11] !wm-bot
[02:55:11] http://meta.wikimedia.org/wiki/WM-Bot
[02:55:21] 05/24/2012 - 02:55:21 - Updating keys for laner at /export/home/deployment-prep/laner
[02:55:26] you want help on loh?
[02:55:39] I really need to see what's up with that stupid deployment-prep project and my home directory
[02:56:05] !log deployment-prep jeremyb: foo
[02:56:06] Logged the message, Master
[02:56:33] oooh
[02:56:56] Ryan_Lane: loh?
[02:57:01] *log
[02:57:08] oh. yeah. but now i figured it out ;)
[02:57:14] see labs-logs-bottie above ;)
[02:58:37] * jeremyb collapses...
[02:58:41] see you tomorrow
[02:58:54] * Ryan_Lane waves
[03:00:21] 05/24/2012 - 03:00:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:00:33] hehe, keys, keys, keys
[03:00:40] -_-
[03:01:20] 05/24/2012 - 03:01:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:05:21] 05/24/2012 - 03:05:21 - Updating keys for laner at /export/home/deployment-prep/laner
[03:40:47] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 15% free memory
[03:46:57] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 15% free memory
[03:52:40] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[04:00:47] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 3% free memory
[04:02:37] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory
[04:05:47] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory
[04:06:57] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory
[04:08:37] PROBLEM HTTP is now: CRITICAL on deployment-apache21 i-0000026d output: CRITICAL - Socket timeout after 10 seconds
[04:11:57] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory
[04:12:37] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory
[04:13:27] PROBLEM HTTP is now: WARNING on deployment-apache21 i-0000026d output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.005 second response time
[04:17:38] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory
[04:22:33] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:26:13] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[04:32:33] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[05:04:23] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory
[05:41:17] New review: Hashar; "That upload.conf does not exist anymore ;-D" [operations/puppet] (test); V: -1 C: -2; - https://gerrit.wikimedia.org/r/8710
[06:37:25] hello
[06:37:34] I am not going to be connected that much today
[06:37:56] doing my duties as a dad
[06:43:58] PROBLEM Current Load is now: CRITICAL on mingledbtest i-00000283 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:44:49] PROBLEM Current Users is now: CRITICAL on mingledbtest i-00000283 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:45:48] PROBLEM Disk Space is now: CRITICAL on mingledbtest i-00000283 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:45:48] PROBLEM Free ram is now: CRITICAL on mingledbtest i-00000283 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:50:25] PROBLEM Total Processes is now: CRITICAL on mingledbtest i-00000283 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:51:48] PROBLEM dpkg-check is now: CRITICAL on mingledbtest i-00000283 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:29] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 5.61, 6.58, 4.21
[06:57:04] looks like I got a ghost instance. i-00000219 got deleted but some process is still alive there | https://bugzilla.wikimedia.org/37070
[06:58:28] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.06, 3.76, 3.70
[07:01:51] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 6.70, 10.88, 7.60
[07:05:10] !log deployment-prep killed some stalled jobs on jobrunner02
[07:05:12] Logged the message, Master
[07:08:28] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.74, 1.44, 2.50
[07:15:50] out of ideas for now
[07:15:58] see you later
[07:16:51] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 2.48, 2.54, 4.25
[09:36:46] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory
[09:42:55] !log deployment-prep hashar: installing php5-dev and running `pecl install parsekit`
[09:42:57] Logged the message, Master
[09:43:33] !log deployment-prep hashar: installing php5-dev and running `pecl install parsekit` on deployment-dbdump
[09:43:35] Logged the message, Master
[09:44:57] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[11:24:48] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory
[11:30:23] Change abandoned: Faidon; "per hashar's request" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8710
[11:30:30] thanks paravoid !
[11:30:59] we now have some tree of our bugs https://bugzilla.wikimedia.org/showdependencytree.cgi?id=37079&hide_resolved=1
[11:31:21] I have marked the ops one with +ops keyword and [OPS] in the summary
[11:31:28] I might have assigned some to you meanwhile
[11:33:28] that's nice!
[11:44:17] testing new bot
[11:44:20] yeahhhh
[11:44:56] !log $USER testing new bot
[11:44:56] $USER is not a valid project.
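(Editor's note: the `!log` exchanges above show wm-bot's convention — the first token after `!log` must be a registered project, otherwise the bot answers "<token> is not a valid project." A minimal re-creation of that check in Python; `handle_log` and the project set are illustrative stand-ins, not the actual bot code, which lives at https://github.com/benapetr/wikimedia-bot:)

```python
def handle_log(line, projects):
    """Sketch of wm-bot's !log handling as observed in this channel.

    The word right after "!log " must name a known project; otherwise
    the bot rejects the whole message.
    """
    body = line[len("!log "):]
    project, _, message = body.partition(" ")
    if project not in projects:
        return "%s is not a valid project." % project
    # the real bot relays the message to the server admin log;
    # here we only reproduce the acknowledgement seen in the log
    return "Logged the message, Master"
```

For example, `!log $USER testing new bot` fails because `$USER` was sent literally (the shell variable was not expanded by the IRC client), while `!log deployment-prep ...` succeeds.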
[11:45:05] log 'installing php5-dev and running pecl/parsekit is already installed and is the same as the released version 1.3.0
[11:45:17] which is not really what I wanted to log
[11:45:33] !log deployment-prep hashar: installing php5-dev and running `pecl install parsekit` on deployment-dbdump
[11:45:35] Logged the message, Master
[11:47:55] !log deployment-prep Moving /bin/log to /usr/local/bin/log
[11:47:57] Logged the message, Master
[11:51:15] !log hashar: yeah I do log
[11:51:15] hashar: is not a valid project.
[11:51:39] !log deployment-prep hashar: yeah I do log
[11:51:40] Logged the message, Master
[11:52:01] !log deployment-prep hashar: Rewrote log command to use dologmsg and the new beta-logmsgbot
[11:52:02] Logged the message, Master
[11:56:39] ....
[12:06:04] grr
[12:06:12] disk dead again
[12:07:10] paravoid: can you help please ?
[12:07:16] the dumps project is killing the cluster again
[12:07:17] http://ganglia.wmflabs.org/latest/?c=dumps&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[12:07:54] machine dumps-1 is doing some nasty stuff again http://ganglia.wmflabs.org/latest/?c=dumps&h=dumps-1&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[12:25:26] !log hashar synchronizing Wikimedia installation... :
[12:25:26] hashar is not a valid project.
[12:25:34] :^)
[12:25:43] sync done.
[12:32:46] daughter awake, I am out of there
[12:32:48] see you tomorrow
[13:04:15] PROBLEM HTTP is now: CRITICAL on deployment-apache20 i-0000026c output: CRITICAL - Socket timeout after 10 seconds
[13:09:04] PROBLEM HTTP is now: WARNING on deployment-apache20 i-0000026c output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.005 second response time
[13:09:15] PROBLEM Puppet freshness is now: CRITICAL on ganglia-test3 i-0000025b output: Puppet has not run in last 20 hours
[13:27:23] !log bots petrb: upgrading wm-bot
[13:27:25] Logged the message, Master
[13:30:39] !ping
[13:30:39] pong
[13:30:42] @help
[13:30:42] Type @commands for list of commands. This bot is running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 1.3.0 source code licensed under GPL and located at https://github.com/benapetr/wikimedia-bot
[14:03:33] Change on 12mediawiki a page OAuth/User stories was modified, changed by Amgine link https://www.mediawiki.org/w/index.php?diff=541623 edit summary: /* User story: add yours here */ Privacy concern
[14:27:09] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[14:54:38] hashar is not around today as I recall
[15:03:09] he was earlier
[15:03:11] but left
[15:05:13] beta commons is 2 steps forward, one step back. :)
[15:06:50] what's the issue?
[15:11:55] paravoid: [api-error-internal_api_error_MWException] upon attempting to upload a file via UploadWizard
[15:12:22] ah, mw issues, don't think I'm able to help there
[15:12:56] paravoid: that's what I thought. beta commons is just going to be unstable for a while
[15:19:58] New review: Faidon; "Again, not a big fan of doing if/then elses all over the place. I'd propose having a separate class ..." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8575
[16:09:20] 05/24/2012 - 16:09:20 - Creating a home directory for andrew at /export/home/globaleducation/andrew
[16:15:47] 05/24/2012 - 16:15:47 - Creating a project directory for wm-review
[16:15:47] 05/24/2012 - 16:15:47 - Creating a home directory for andrew at /export/home/wm-review/andrew
[16:16:48] 05/24/2012 - 16:16:48 - Updating keys for andrew at /export/home/wm-review/andrew
[16:16:49] 05/24/2012 - 16:16:48 - Creating a project directory for wmreview
[16:18:42] 05/24/2012 - 16:18:42 - Creating a project directory for mwreview
[16:18:43] 05/24/2012 - 16:18:42 - Creating a home directory for andrew at /export/home/mwreview/andrew
[16:19:47] 05/24/2012 - 16:19:47 - Updating keys for andrew at /export/home/mwreview/andrew
[16:46:58] New review: Platonides; "I saw it in the instance." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8710
[16:48:21] 05/24/2012 - 16:48:20 - Updating keys for laner at /export/home/deployment-prep/laner
[16:54:26] New patchset: Faidon; "Ensure that /a exists on imagescalers" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8794
[16:54:41] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8794
[16:54:44] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8794
[16:54:47] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8794
[16:59:19] 05/24/2012 - 16:59:19 - Updating keys for laner at /export/home/deployment-prep/laner
[17:07:41] New patchset: Hashar; "(bug 37046) fix apache monitoring on deployment-prep" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8575
[17:07:56] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8575
[17:09:13] New review: Hashar; "Patchset 2 is a proposal with a second monitor class. I am not sure that is what Faidon wants though..." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8575
[17:39:40] ping hashar
[17:41:28] GChriss: what's your labs username?
[17:41:59] gchriss
[17:42:28] Platonides: pong
[17:43:31] when logged into labsconsole, I don't see controls under 'Special:NovaProject' other than an empty 'Project filter' box and a 'Submit' botton
[17:44:01] GChriss: look at the project filter box
[17:44:14] and tick or untick the checkboxes there
[17:44:23] you aren't in the project
[17:44:36] oh
[17:44:43] therein lies the issue
[17:44:48] now you are
[17:44:54] what other project will you be working on?
[17:45:16] 05/24/2012 - 17:45:16 - Creating a home directory for gchriss at /export/home/bastion/gchriss
[17:45:21] the openmeetings project mentioned earlier
[17:46:15] 05/24/2012 - 17:46:15 - Updating keys for gchriss at /export/home/bastion/gchriss
[17:51:16] 05/24/2012 - 17:51:15 - Updating keys for gchriss at /export/home/bastion/gchriss
[17:54:15] 05/24/2012 - 17:54:15 - Updating keys for gchriss at /export/home/bastion/gchriss
[17:57:01] ok ssh'd into bastion. Using OpenSSH_5.9p1 Debian-5 + RSA key, bastion closes the connection just after input_userauth_pk_ok. A DSA key works as expected.
[18:25:29] paravoid: so, since it's unlikely we'll have the new hardware in pmtpa before the hackathon, do you want to play a shell game?
[18:26:33] we can live-migrate all the instances off of virt1, then unmount /var/lib/nova/instances
[18:26:40] and block-migrate them back
[18:26:50] then do the next, then the next, etc
[18:26:54] then they'll all be on local storage
[18:35:59] Ryan_Lane: 'Shall we play a game?'
[18:36:13] can it be thermo-nuclear labs warfare?
[18:38:49] :D
[18:39:33] 'The only way to win is not to play'
[18:39:51] hmm no, it's 'The only winning move is not to play' :D
[18:40:03] How about a nice game of chess?
[18:41:30] meh
[18:41:37] that's a thinkin man's game
[18:45:55] It involves little IO though.
[18:51:13] Ryan_Lane: I'm about to go out
[18:51:18] but don't we have a failed disk in virt1?
[18:51:24] we do
[18:51:31] moving things to local storage on a degraded array sounds scary
[18:51:36] * Ryan_Lane shrugs
[18:51:43] we can lose 3 disks in that array
[18:51:54] 3? how come?
[18:51:56] well 2
[18:52:03] so, another one :)
[18:52:08] what's up with that ticket?
[18:52:12] dunno
[18:52:14] did you put it in?
[18:52:18] I did
[18:52:26] of course, chris isn't in pmtpa
[18:52:44] ah right
[18:52:59] which rt is it?
[18:53:13] ah. see it
[18:54:07] seems the disk is already on way
[18:54:39] we can leave virt1 till last
[18:54:48] but we should try to get as many off of gluster as possible
[18:55:20] anyway, no need to handle it right now. we can start on that monday
[18:55:51] PROBLEM Current Users is now: CRITICAL on mwtest-prototype i-00000284 output: Connection refused by host
[18:56:26] what about the problems with the live migration that you mentioned?
[18:56:26] PROBLEM Disk Space is now: CRITICAL on mwtest-prototype i-00000284 output: Connection refused by host
[18:56:37] as long as we do one at a time, it works fine
[18:56:57] we can have a script that migrates one, watches for completion, then does the next
[18:57:01] PROBLEM Free ram is now: CRITICAL on mwtest-prototype i-00000284 output: Connection refused or timed out
[18:57:39] my major problem with live migration is that it doesn't queue them up, and just tries to do them all at the same time
[18:57:56] also, that after it completes, sometimes it doesn't put it back into virsh's definitions
[18:58:21] PROBLEM Total Processes is now: CRITICAL on mwtest-prototype i-00000284 output: Connection refused by host
[18:58:31] the former can be solved via a script, the second can be solved by checking all of the instances after they are done
[18:59:51] PROBLEM dpkg-check is now: CRITICAL on mwtest-prototype i-00000284 output: Connection refused by host
[19:00:11] PROBLEM Current Load is now: CRITICAL on mwtest-prototype i-00000284 output: Connection refused by host
[19:45:20] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[20:31:27] hashar, why do I view /usr/local/apache/conf/upload.conf, then?
[20:32:49] Platonides: ?
[20:33:01] Platonides: caused you clicked the file ?
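(Editor's note: the serialized-migration plan Ryan_Lane describes above — migrate one instance, watch for completion, then start the next, and re-check each instance afterwards — can be sketched as below. This is purely illustrative: `migrate` and `status` are hypothetical stand-ins for `nova live-migration` / `nova show` style calls, and this is not the script that was actually used.)

```python
import time

def drain_host(instances, migrate, status, poll=1, timeout=600):
    """Migrate instances off a host strictly one at a time.

    migrate(inst) starts a live migration; status(inst) returns the
    instance state (e.g. "MIGRATING" or "ACTIVE"). Returns the list
    of instances confirmed back in ACTIVE state.
    """
    done = []
    for inst in instances:
        migrate(inst)
        waited = 0
        # watch for completion before touching the next instance,
        # since nova would otherwise fire all migrations at once
        while status(inst) == "MIGRATING":
            if waited >= timeout:
                raise RuntimeError("migration of %s timed out" % inst)
            time.sleep(poll)
            waited += poll
        # re-check after completion: this addresses the second failure
        # mode mentioned (instances missing from virsh's definitions)
        if status(inst) == "ACTIVE":
            done.append(inst)
    return done
```

The key design point from the conversation is the serialization itself: the loop never issues a second migration until the previous one has left the MIGRATING state.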
[20:33:04] I don't know honestly
[20:34:07] Platonides: oh labs
[20:34:11] and the change you sent
[20:34:13] ahmm
[20:34:37] the apaches conf on labs are not the in git/gerrit
[20:34:44] so your change is not going to do anything
[20:35:05] basically, the apache conf files have been copied on labs
[20:35:30] and then wikipedia.org occurrences were replaced by wikipedia.beta.wmflabs.org
[20:35:59] Platonides: ping ? ;-
[20:36:03] pong :)
[20:36:20] I saw the bug, then the commented upload.conf include
[20:36:29] went to edit the apache file
[20:36:37] oh, crap it is managed by puppet
[20:36:41] hehe
[20:36:42] asked Ryan
[20:36:47] figured out how to do it
[20:36:50] submitted it
[20:37:09] so at least, you discovered puppet!
[20:37:11] and now you say it's bad because you say the file I'm reading it's not there
[20:37:38] i was fearing to submit it to the wrong branch :D
[20:38:47] I can confirm upload.conf is no more around
[20:38:50] (in production)
[21:03:54] PROBLEM HTTP is now: CRITICAL on mwtest-prototype i-00000284 output: CRITICAL - Socket timeout after 10 seconds
[21:05:55] Ryan_Lane: Which of these mysql/db classes do I want for MW? generic::mysql::server seems to not work in labs. (Or, at least, does not work for me.)
[21:06:10] I think someone was working on one....
[21:06:18] I don't think we have one right now
[21:06:34] generic::mysql::server should likely be removed. I think it points at nothing
[21:07:36] I think it was hashar
[21:09:21] So all of the listed db classes are red herrings? That's good to know :)
[21:10:28] andrewbogott_: yeah, it's all screwed up right now
[21:10:37] our puppet repo needs some serious work
[21:11:07] Sure -- best to save that work for when it's less annoying.
[21:12:18] * andrewbogott_ looks optimistically at paravoid
[21:16:24] RECOVERY Disk Space is now: OK on mwtest-prototype i-00000284 output: DISK OK
[21:18:35] RECOVERY Current Load is now: OK on mwtest-prototype i-00000284 output: OK - load average: 4.96, 2.49, 1.18
[21:21:15] RECOVERY Current Users is now: OK on mwtest-prototype i-00000284 output: USERS OK - 1 users currently logged in
[21:21:15] RECOVERY Total Processes is now: OK on mwtest-prototype i-00000284 output: PROCS OK: 91 processes
[21:22:10] RECOVERY Free ram is now: OK on mwtest-prototype i-00000284 output: OK: 92% free memory
[21:23:05] RECOVERY dpkg-check is now: OK on mwtest-prototype i-00000284 output: All packages OK
[21:25:41] PROBLEM Free ram is now: WARNING on ipv6test1 i-00000282 output: Warning: 17% free memory
[21:33:38] Hey Ryan_Lane, is it possible to get a public ip for my labs server? I keep getting "Failed to allocate new public IP address."
[21:33:50] which project?
[21:33:53] I'll need to up its quota
[21:33:56] ipv6
[21:34:05] ok. gimme a sec
[21:34:32] Slightly ironic having a project called ipv6 on a platform that only does ipv4
[21:34:49] oh, it is...
[21:34:52] csteipp: ok
[21:34:58] done
[21:35:15] Thank you!!
[21:35:31] yw
[21:54:11] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 19.94, 11.84, 7.64
[22:00:48] RECOVERY Free ram is now: OK on ipv6test1 i-00000282 output: OK: 45% free memory
[22:07:19] uh oh
[22:07:21] ls /usr/local/apache/common-local/php-trunk/
[22:07:21] ls: cannot access /usr/local/apache/common-local/php-trunk/: Input/output error
[22:07:30] more gluster problems?
[22:08:22] or problems with the nfs server...
[22:08:29] deployment-nfs-memc:/mnt/export/apache on /usr/local/apache type nfs (rw,bg,soft,tcp,timeo=14,intr,nfsvers=3,addr=10.4.0.58)
[22:09:05] Platonides: might be IO problem
[22:09:07] the nfs server, likely
[22:09:12] or nfs
[22:09:23] well, I see it reports an IO problem :)
[22:09:38] but nfs is maybe having problem because its storage can't be read
[22:09:56] Platonides: just ssh to -nfs-memc
[22:10:00] investigate
[22:10:09] I'm there
[22:10:12] it reads locally
[22:10:24] restart nfs service
[22:10:26] now it works in the apache, too
[22:10:29] ok
[22:10:43] maybe it was overloaded?
[22:11:23] no
[22:11:28] I don't thunk so
[22:11:37] it would be gluster overloaded maybe
[22:11:49] nfs is just a relay between instance and gluster storage
[22:11:57] because its own storage is on gluster
[22:14:32] gluster should "just work"
[22:43:45] PROBLEM Current Users is now: CRITICAL on demo-web2 i-00000285 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:44:25] PROBLEM Disk Space is now: CRITICAL on demo-web2 i-00000285 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:45:05] PROBLEM Free ram is now: CRITICAL on demo-web2 i-00000285 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:15] PROBLEM Total Processes is now: CRITICAL on demo-web2 i-00000285 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:47:00] PROBLEM dpkg-check is now: CRITICAL on demo-web2 i-00000285 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:48:15] PROBLEM Current Load is now: CRITICAL on demo-web2 i-00000285 output: CHECK_NRPE: Error - Could not complete SSL handshake.
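(Editor's note: the mount line quoted above explains the "Input/output error" the apaches saw. Per nfs(5), `soft` means the client gives up and returns an I/O error after its retries expire, and `timeo` is in tenths of a second, so `timeo=14` is a 1.4 s initial timeout. A small illustrative parser for that options string — the `nfs_opts` helper is hypothetical, written just to pull the log's own mount line apart:)

```python
def nfs_opts(opt_string):
    """Parse a comma-separated mount-option string into a dict.

    Flag options (like "soft") map to True; key=value options
    (like "timeo=14") keep their string value.
    """
    opts = {}
    for item in opt_string.split(","):
        key, _, val = item.partition("=")
        opts[key] = val if val else True
    return opts

# The exact options from the log's mount line:
opts = nfs_opts("rw,bg,soft,tcp,timeo=14,intr,nfsvers=3,addr=10.4.0.58")
# "soft" plus a short timeo means a stalled server (here: the NFS relay
# whose backing store is gluster) surfaces as an I/O error on clients,
# matching the ls error seen on the apaches.
```

A hard mount would instead have made the apaches hang retrying until the relay came back, which is the usual trade-off when choosing `soft` vs `hard`.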
[23:10:10] PROBLEM Puppet freshness is now: CRITICAL on ganglia-test3 i-0000025b output: Puppet has not run in last 20 hours
[23:44:20] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 3.09, 3.40, 4.97
[23:54:29] PROBLEM Current Load is now: CRITICAL on mwreview-proto i-00000286 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:55:06] PROBLEM Current Users is now: CRITICAL on mwreview-proto i-00000286 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:55:42] PROBLEM Disk Space is now: CRITICAL on mwreview-proto i-00000286 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:56:17] PROBLEM Free ram is now: CRITICAL on mwreview-proto i-00000286 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:59:26] RECOVERY Current Load is now: OK on mwreview-proto i-00000286 output: OK - load average: 0.60, 0.95, 0.65