[00:24:14] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 10.10, 15.89, 7.85 [00:34:14] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.22, 2.45, 4.31 [00:41:27] Ryan_Lane: Do we have database replication of any kind in labs yet? [01:23:54] PROBLEM Disk Space is now: CRITICAL on deployment-transcoding i-00000105 output: DISK CRITICAL - free space: / 37 MB (2% inode=53%): [01:28:54] PROBLEM Disk Space is now: WARNING on deployment-transcoding i-00000105 output: DISK WARNING - free space: / 48 MB (3% inode=53%): [01:31:14] PROBLEM Puppet freshness is now: CRITICAL on wikidata-dev-2 i-0000020a output: Puppet has not run in last 20 hours [02:47:27] RECOVERY Total Processes is now: OK on wikidata-dev-3 i-00000222 output: PROCS OK: 110 processes [02:47:57] RECOVERY dpkg-check is now: OK on wikidata-dev-3 i-00000222 output: All packages OK [02:49:07] RECOVERY Current Load is now: OK on wikidata-dev-3 i-00000222 output: OK - load average: 0.44, 0.89, 0.42 [02:49:27] RECOVERY Current Users is now: OK on wikidata-dev-3 i-00000222 output: USERS OK - 0 users currently logged in [02:50:47] RECOVERY Disk Space is now: OK on wikidata-dev-3 i-00000222 output: DISK OK [02:50:47] RECOVERY Free ram is now: OK on wikidata-dev-3 i-00000222 output: OK: 90% free memory [03:27:09] PROBLEM Puppet freshness is now: CRITICAL on nginx-dev1 i-000000f0 output: Puppet has not run in last 20 hours [03:35:39] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 13% free memory [03:55:39] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [03:57:29] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 14% free memory [03:57:49] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 17% free memory [04:04:09] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 17% free memory [04:05:39] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 93% free memory [04:12:29] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [04:21:06] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 2% free memory [04:21:06] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory [04:24:18] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.24, 5.29, 3.42 [04:26:31] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory [04:29:02] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 5.18, 6.82, 5.68 [04:31:42] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory [04:34:02] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 1.14, 4.26, 4.97 [04:34:22] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.31, 1.82, 2.60 [04:36:42] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [05:04:02] PROBLEM Free ram is now: WARNING on test3 i-00000093 output: Warning: 6% free memory [05:09:02] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: Critical: 5% free memory [05:59:02] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory [06:06:02] PROBLEM Disk Space is now: CRITICAL on bz-dev i-000001db output: DISK CRITICAL - free space: / 21 MB (1% 
inode=43%): [06:11:02] PROBLEM Disk Space is now: WARNING on bz-dev i-000001db output: DISK WARNING - free space: / 54 MB (4% inode=43%): [07:07:00] PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours [07:11:00] PROBLEM Puppet freshness is now: CRITICAL on nova-gsoc1 i-000001de output: Puppet has not run in last 20 hours [08:51:45] hellllooo [08:52:49] goodbyyyyyyyyyyyyyyyye [09:12:05] hashar: hey [09:12:20] what's the status of your update :-) [09:16:34] which update ? [09:18:32] you changed configuration of apaches or not? [09:19:41] we needed to install some cache or something like that like Reedy said, I don't remember what is that [09:31:11] I have only changed deployment-web instance to use the production puppet class [09:31:30] why is there 7 apaches instances anyway? Is there any specific differences between them ? [09:31:46] I am going to switch all the webXX boxes this morning [09:34:17] there is 5 of them and 1 is being set up [09:34:28] and there is so many of them because they are overloaded most of time [09:34:51] Ryan said I should create some more, we had 2 in past [09:35:08] Ryan_Lane hi :O [09:35:20] hashar: I can do that if you tell me what to change [09:36:05] so which one is being used right ? :D [09:36:44] all we have to do is probably to use the apaches::service class instead of web server::php5 [09:37:18] hashar: all of them are being used [09:37:29] traffic is split to apaches [09:37:40] btw reloging to irc brb [09:37:42] by a LVS box or is that done at squid login? [09:37:47] squid [09:46:39] bak [09:46:41] :O [09:57:29] petan|wk: back. Where is the squid conf / doc ? :-] [09:58:00] ./etc/squid [09:58:05] there is no doc [09:58:11] \o/ [09:58:32] I don't understand squid I was hoping that Ryan give us the prod config few months ago [09:58:44] this is a workaround only what we use now [09:58:47] don't spend anytime on squid anyway [09:58:55] why [09:58:57] I think it is going to be dropped in favor of Varnish anyway [09:59:02] hm [09:59:12] but right now we use it [09:59:15] on prod [09:59:18] yes [09:59:36] but by the time we figure out how to configure it on labs, prod will probably have dropped it :-] [09:59:40] anyway it is not a big issue [09:59:48] ok [09:59:59] we just need it to split the load between the apaches and cache some statics files [10:00:06] I guess everything else should hit the apaches [10:01:06] ok [10:02:30] !log deployment-req adding generic::packages::git-core on deployment-squid so we can track changes being made in /etc/ [10:02:32] deployment-req is not a valid project. [10:03:13] AHHHH [10:03:25] !log deployment-prep adding generic::packages::git-core on deployment-squid so we can track [10:03:28] Logged the message, Master [10:03:47] !depprep [10:03:52] :-P [10:04:44] today is deadmau5 day , one hour long mix http://www.youtube.com/watch?v=_9IBbMW2o_o [10:04:45] \o/ [10:07:46] ahh [10:07:58] we can't rename an instance through labsconsole :D [10:11:08] !log deployment-prep made /etc/ a git repository on deployment-squid and committed existing /etc/squid/ [10:11:10] Logged the message, Master [10:12:26] o.0 [10:12:30] That's rather insane [10:12:53] what is insane? 
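For reference, the step hashar just logged (turning /etc on deployment-squid into a git repository) boils down to something like the sketch below, with the sensitive files that the next few messages argue about kept out of the history. The ignore list, the identity lines and the commit message are illustrative rather than what was actually run.

    cd /etc
    sudo git init
    sudo git config user.name 'root on deployment-squid'   # commits need an identity
    sudo git config user.email 'root@deployment-squid'
    # keep credentials and volatile files out of the history, as debated below
    sudo tee .gitignore >/dev/null <<'EOF'
    shadow
    shadow-
    gshadow
    gshadow-
    ssl/private/
    mtab
    EOF
    sudo git add squid/ .gitignore
    sudo git commit -m 'track existing /etc/squid configuration'

From then on a plain "sudo git status" in /etc shows any configuration drift, and "sudo git diff" shows exactly what changed.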
[10:13:02] Making /etc/ a git repo [10:13:18] that is the first thing I do on all my machine [10:13:29] Then you're insane [10:13:36] I even know someone who has / under git [10:13:51] that let you easily track any modification made to your system [10:13:58] and revert to a previous know state :] [10:14:03] I am not sure that is all insane [10:14:11] As git doesn't track changes unless you make commits then you have to specifically commit on changes so you should know what changes have been made and you should then use a config management system [10:14:50] XD [10:14:53] it's insane [10:14:53] I do review change in /etc regular and do the commit manually [10:15:02] but I don't care unless it's not my server [10:15:33] like commiting /etc/shadow is a bad idea [10:15:58] you would need to commit as root also [10:16:05] that's another bad thing you should avoid [10:16:42] making a repo of rootfs is a most insane thing I can imagine :D [10:16:53] like commiting stuff in /tmp [10:17:05] Lets just make /etc a ramfs :D [10:17:08] hehe [10:17:17] live cd does it [10:17:32] is beta.wmflabs.org deprecated in favor of deployment.wikimedia.beta.wmflabs.org ? [10:17:38] squid still references the first [10:17:47] hashar: I don't understand squid [10:17:52] ok :-) [10:17:54] will update it so [10:17:54] I did what I thought it's correct and it works [10:18:27] I thought that beta.wmflabs.org would tell to squid use cache for *beta.wmflabs.org [10:18:35] because I didn't want to define all domains we have [10:18:52] Nah [10:18:55] It's special like that [10:18:56] <3 varnish [10:43:19] petan|wk: is deployment-webs1 an Apache too ? [10:47:15] !log deployment-prep Cleaning out squid peers list [10:47:17] Logged the message, Master [10:53:58] IIRC yes [10:54:04] Dunno why it's called webs not web though [10:54:27] ahhso I will have to add it too [10:54:36] cause there is only 4 apaches in squid right now :-) [10:54:38] but [10:54:43] I need a coffee first [10:55:58] Squid is horrible to configure imo [11:02:46] !log deployment-prep migrate web{3,4,5} from webserver::php5 to apaches::service [11:02:47] Logged the message, Master [11:07:45] puppet takes sooo long :( [11:09:53] It does? [11:12:35] PROBLEM dpkg-check is now: CRITICAL on deployment-web3 i-00000219 output: DPKG CRITICAL dpkg reports broken packages [11:14:25] PROBLEM dpkg-check is now: CRITICAL on deployment-web4 i-00000214 output: DPKG CRITICAL dpkg reports broken packages [11:14:45] PROBLEM dpkg-check is now: CRITICAL on deployment-web5 i-00000213 output: DPKG CRITICAL dpkg reports broken packages [11:19:21] !log deployment-prep changed squid visible name to squid001.beta.wmflabs.org [11:19:23] Logged the message, Master [11:19:25] RECOVERY dpkg-check is now: OK on deployment-web4 i-00000214 output: All packages OK [11:19:36] RECOVERY dpkg-check is now: OK on deployment-web5 i-00000213 output: All packages OK [11:20:03] hashar: that's https version [11:20:38] beta ? [11:20:55] so should I name it squid001.deployment.wmflabs.org instead? 
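On the squid side, what is being described (one advertised hostname plus a round-robin parent entry per Apache, with everything else falling through to them) would look roughly like the fragment below. None of the real /etc/squid/squid.conf is quoted in the log, so the directives and the instance hostnames are only an illustration of the peer list that gets cleaned up and repopulated over the next few messages.

    # append an illustrative peer section (hostnames and domain are assumptions)
    sudo tee -a /etc/squid/squid.conf >/dev/null <<'EOF'
    visible_hostname squid001.beta.wmflabs.org

    # send cache misses round-robin to the Apache instances currently in rotation
    cache_peer deployment-web.pmtpa.wmflabs  parent 80 0 no-query originserver round-robin name=web
    cache_peer deployment-web3.pmtpa.wmflabs parent 80 0 no-query originserver round-robin name=web3
    cache_peer deployment-web4.pmtpa.wmflabs parent 80 0 no-query originserver round-robin name=web4
    cache_peer deployment-web5.pmtpa.wmflabs parent 80 0 no-query originserver round-robin name=web5
    EOF
    sudo squid -k parse && sudo squid -k reconfigure   # syntax-check, then reload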
[11:21:04] no [11:21:06] webs1 [11:21:11] ohh [11:21:14] don't insert it to squid [11:21:17] it doesn't work [11:21:35] squid only has web web3 web4 and web5 for now [11:21:39] web - web5 are apaches [11:21:42] ok [11:21:48] web2 should be up soon [11:22:17] !log deployment-prep puppet finished migration of web{3,4,5} to apache::service [11:22:19] Logged the message, Master [11:22:29] can you install that cache [11:22:34] we talked about it [11:22:35] RECOVERY dpkg-check is now: OK on deployment-web3 i-00000219 output: All packages OK [11:23:11] APC [11:23:12] that one [11:23:26] what are 'beta' and 'deployment' sub domains for ? [11:23:33] it used to be deployment in past [11:23:34] yeah APC is being installed right now. Need to check it works [11:23:37] then we renamed it [11:23:56] great [11:24:19] could you bless me with `netadmin` role on deployment-prep project? [11:24:25] would need it to maintain dns entries :-] [11:26:35] yes [11:27:41] done [11:27:46] danke [11:27:54] just don't remove the deployment dns it should still work [11:28:49] just wanted to set an entry for squid001.beta.wmflabs.org [11:28:56] ok [11:29:30] so APC should be enabled now [11:29:41] ok [11:29:44] that magically happens when installing apache::service [11:30:00] which somehow makes puppet install the php-apc package [11:30:05] which is, AFAIK, enabled by default [11:31:15] PROBLEM Puppet freshness is now: CRITICAL on wikidata-dev-2 i-0000020a output: Puppet has not run in last 20 hours [11:31:34] there is problem with log [11:32:41] when I open any page on wiki there is no debug log [11:32:43] $wgDebugLogFile [11:35:52] where is that set ? [11:36:41] in CommonSettings.php [11:36:59] it used to work today [11:37:07] last entry is-rw-r--r-- 1 www-data www-data 289522903 2012-04-25 11:19 [11:37:14] just like when someone die, he used to live yesterday :-] [11:37:17] few minutes ago it worked [11:37:17] lets see [11:39:13] !deployment-prep manually purged debian package `ack`, installed `ack-grep` [11:39:13] deployment-prep is a project to test mediawiki at beta.wmflabs.org before putting it to prod [11:39:22] !log deployment-prep manually purged debian package `ack`, installed `ack-grep` [11:39:23] Logged the message, Master [11:56:01] hashar: did you find why logging stopped working? [11:56:04] is there a way to debug it [11:56:34] there must be one [11:56:39] sorry was distracted by something else [11:56:41] resuming now [12:16:22] New patchset: Hashar; "classes for deployment preparation project (beta)" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5790 [12:16:36] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/5790 [12:16:40] New review: Hashar; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5790 [12:22:18] ahhhh [12:23:02] I should have applied to an op position :-] [12:23:09] to get root access on the machine that actually matter [12:23:59] !log deployment-prep added a basic puppet skeleton in manifests/labs/beta/ with https://gerrit.wikimedia.org/r/5790 (test branch) [12:24:01] Logged the message, Master [12:27:48] mutante: do you know how we could use the various passwords::* puppets classes on labs? 
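For anyone retracing the broken-logging thread: the debug log under discussion is driven by a single setting in the shared configuration, along the lines of this sketch. The CommonSettings.php path is the one quoted later in the log; the destination file is a placeholder, not necessarily where beta really writes it.

    # append a debug-log target to the shared config (destination path is an assumption)
    sudo tee -a /usr/local/apache/common-local/wmf-config/CommonSettings.php >/dev/null <<'PHP'
    $wgDebugLogFile = '/data/project/debug.log';   // every web request appends a trace here
    PHP
    tail -f /data/project/debug.log                # entries should appear on each page view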
[12:27:56] I thought we had a dummy repository [12:29:04] hashar: there is labs/private [12:29:12] ohh forgot about that one [12:29:18] going to add some dummy classes there so [12:30:18] yeah, guess so, and then replace passwords in public classes with variables reading from private [12:31:14] eh, public puppet templates of config files i should say [12:31:37] I try to enable misc::scripts on labs but it requires the misc::passwordsScripts class which indeed uses assignments such as : [12:31:40] $cachemgr_pass = $passwords::misc::scripts::cachemgr_pass [12:32:35] hashar: that logging is really fucked [12:32:44] now it does work for commons but nothing else [12:33:15] hashar: sry, not really time to look at that closer right now, but how about we merge your mysql change, because installing these servers will eat more time [12:33:20] is it needed to enable any other variable than wgDebugLogFile [12:34:44] mutante: follow up on #wikimedia-operations [12:37:06] petan|wk: so is the variable set in CommonSettings ? [12:37:11] what does it log ? Everything ? [12:37:13] or just errors? [12:41:48] New patchset: Hashar; "dummy misc::passwordScript" [labs/private] (master) - https://gerrit.wikimedia.org/r/5792 [12:42:31] o.0 [12:43:10] hashar: it used to log some stuff, not really sure what exactly [12:43:27] how do you run maintenance script on labs? [12:43:28] but it was logging surely some lines for each execution of page [12:43:32] I guess from deployment-dbdump ? [12:43:33] hashar: there is a folder bin [12:43:37] hashar: yes [12:43:37] I would like to run the eval.php script [12:43:42] ok do it on dbdump [12:43:45] or webs1 [12:43:49] depends on load [12:43:53] both of them can do that [12:44:04] webs1 is not being used for apache so far so it's good for that as well [12:44:19] I will probably request to have dbdump renamed as bastion [12:44:20] :-D [12:44:40] so we get everything installed there and use that machine as the main working server [12:45:53] oh [12:46:22] got a prompt form: /usr/local/apache/common/live/maintenance/eval.php --wiki enwiki [12:59:33] hashar: works? [12:59:58] been fixing some syntax errors in a ruby template :-D [13:00:04] trying to log stuff right now [13:00:13] if you really need to rename it you could just create a new server [13:00:27] don't forget you can type log in shell [13:00:44] !log deployment-prep petrb: test [13:00:46] Logged the message, Master [13:00:51] see? [13:01:02] ohhh there is a web service !!! [13:01:04] \o/ [13:01:05] yes [13:01:13] it works on some projects where people want it :-) [13:01:24] !log deployment-prep hashar: I am a hero [13:01:25] Logged the message, Master [13:02:14] I use only that [13:02:17] too lazy [13:02:36] I have a similar command on production [13:02:46] thanks! [13:04:37] still logging doesn't work [13:04:57] only for commons [13:06:06] hmm $wgDebug is inexistent somehow [13:07:13] weird [13:07:19] is it needed for anything? [13:07:35] for some reason it's logging stuff on commons wiki, but not on any other [13:07:43] :/ [13:07:48] don't understand why [13:08:07] it's using both same config [13:15:47] well I can not write to files in /usr/local/apache/common/wmf-config -( [13:23:54] hashar: that's weird [13:24:00] sec [13:25:20] hashar: try now [13:25:26] logout before [13:25:31] then log back [13:25:37] o.0 why is apache in /usr/local/ anyway? 
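The maintenance-script workflow petan and hashar settle on above amounts to: log in to an instance that is not serving web traffic and point the script at a wiki with --wiki. A sketch, with the instance name, path and wiki id taken straight from the conversation:

    ssh deployment-dbdump                      # or deployment-webs1, whichever is idle
    cd /usr/local/apache/common/live
    php maintenance/eval.php --wiki=enwiki     # interactive PHP shell in that wiki's context
    # any other maintenance script runs the same way, for example:
    php maintenance/showJobs.php --wiki=enwiki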
[13:25:50] Damianz: because that's the way they have it on prod [13:25:56] we just mirror it [13:26:05] mw does have some bizzare choices [13:26:10] indeed [13:26:13] mw not [13:26:17] wmf maybe :P [13:26:28] Same diff :P [13:26:34] Damianz: cause production has/had its own apache compilation installed using the /usr/local/ prefix [13:26:36] Some of the mw choices are as bizzare as wmf's [13:26:46] heh [13:26:46] since that path is hardcoded everywhere it is safer to keep using it :-]]] [13:26:48] hashar: Can not haz packages? [13:27:11] back in 2002 none knew about how to write a .rpm :-D [13:27:14] Hmm really need to test my gil theory on why py2.7 sucks and try it with 3 [13:27:23] so no l33t p4ck4g3s !!! [13:28:14] PROBLEM Puppet freshness is now: CRITICAL on nginx-dev1 i-000000f0 output: Puppet has not run in last 20 hours [13:28:16] hashar: you said you want to kill nfs [13:28:17] why [13:28:25] nfs sucks [13:28:28] cause I can't write to the common path ? :D [13:28:31] and overall NFS sucks [13:28:32] scap sucks [13:28:48] but it makes stuff so easy :D [13:28:56] ok what is your alternative to that [13:29:10] scap :-D [13:29:16] before you decide to use something experimental like gluster, keep in mind it's buggy a lot [13:29:17] seriously I have no idea right now [13:29:19] NFS is fine for now [13:29:49] in the end the safe way is to have files modified on a bastion then deployed on the other hosts using rsync / git / whatever file copy [13:29:57] for example right now I moved all logs to gluster and since then I have a lot of troubles with that [13:29:59] you could even tar | nc ;-] [13:30:05] also the whole auto mount is broken [13:30:20] maybe it's not gluster what sucks but actually the way it's mounted [13:30:25] might be [13:30:34] like when I do ls /data/project [13:30:40] shell stuck [13:30:43] for 20 seconds [13:30:53] that's what I hate [13:31:01] touch: cannot touch `/usr/local/apache/foo': Permission denied [13:31:01] [13:31:03] ;-D [13:31:14] althgouth the dir is drwxrwxr-x 5 mah depops [13:31:25] and I am a member of depops [13:31:30] yes so am I [13:31:36] no idea why it doesn't work for you [13:31:40] everyone else is fine [13:31:49] try sudo su mah [13:31:51] :P [13:31:56] that will fix it hehe [13:32:30] maybe we could just create a new user "op" :D [13:32:35] and switch to it [13:32:58] we wouldn't need group at all [13:33:15] <^demon> If you're trying to be more like the actual deployment, you should make the user called mwdeploy and have all deployment scripts sudo to it. [13:33:26] aha [13:33:39] right [13:34:21] ^demon: where is $HOME [13:34:35] <^demon> /home/wikipedia? [13:34:58] ^demon: going to have misc::scripts installed on the bastion [13:35:58] ok done [13:36:08] hashar: try sudo su mwdeploy [13:36:10] <^demon> Heh, so it turns out httpd.maxThreads is set to 25, but database.poolLimit is set to 8. Doesn't this mean anyone beyond the first 8 users is likely to hit a database connection limit error? [13:36:20] <^demon> (these are defaults) [13:36:57] hmm at least I can write to deployment-nfs-memc:/mnt/export/apache/ [13:37:58] it magically solved!! [13:38:01] * hashar hates NFS [13:39:08] what solved [13:39:16] just use mwdeploy since now [13:39:17] :D [13:39:21] it works [14:13:45] PROBLEM Current Load is now: CRITICAL on twmysqlconf i-00000223 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:14:25] PROBLEM Current Users is now: CRITICAL on twmysqlconf i-00000223 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
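Put together, the "edit on a bastion, push with rsync, do it as mwdeploy" approach hashar floats above would look something like the loop below. The host list and document-root path come from names that appear in the log, and it assumes mwdeploy can rsync to each Apache over ssh; beta kept using NFS at this point, so treat it purely as a sketch of the suggestion.

    # push the shared MediaWiki tree from the bastion to each Apache as mwdeploy
    for host in deployment-web deployment-web3 deployment-web4 deployment-web5; do
        sudo -u mwdeploy rsync -a --delete \
            /usr/local/apache/common-local/ \
            "$host":/usr/local/apache/common-local/
    done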
[14:15:05] PROBLEM Disk Space is now: CRITICAL on twmysqlconf i-00000223 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:15:45] PROBLEM Free ram is now: CRITICAL on twmysqlconf i-00000223 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:16:15] ^^^ that twmysqlconf is a new instance [14:16:20] yeh [14:16:29] that's why so many problems hehe [14:16:40] puppet is slower than my c++ parser XD [14:16:54] which creates the nagios confs [14:16:55] PROBLEM Total Processes is now: CRITICAL on twmysqlconf i-00000223 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:17:35] PROBLEM dpkg-check is now: CRITICAL on twmysqlconf i-00000223 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:33:38] and I can't log on that new instance [14:33:48] the "testwarm" project is f*** beyond repair [14:35:25] Change abandoned: Demon; "Was part of a test that's already merged, so abandoning. Feel free to restore if it's necessary." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/4144 [14:41:00] hashar: what is on log? [14:41:05] !console [14:41:05] in case you want to see what is happening on terminal of your vm, check console output [14:42:59] !log deployment-prep Creating temporary instance to test a MySQL puppet snippet [14:43:02] Logged the message, Master [14:43:19] hashar: can you prefix all instances with deployment [14:43:22] petan|wk: problem solved. I am deleting the project and already deleted all instances. [14:43:29] petan|wk: that one is a temp one [14:43:30] which project? [14:43:34] I will delete it in an hour or so [14:43:57] the delete project is 'testswarm' . I have opened a bug request to have it deleted. : https://bugzilla.wikimedia.org/36241 [14:44:00] ah ok [14:44:04] never worked :D [14:44:07] well it did [14:44:11] right [14:44:17] but then stopped working once labs started requesting ssh keys [14:44:20] and we never figured out why [14:44:23] ok [14:44:24] It did work, I worked on it a lot. but no longer needed [14:44:36] ohh [14:44:37] I can login still [14:44:45] Krinkle: I never managed to log in it :-D [14:45:12] petan|wk: testswarmmysqlconf will be deleted as soon as I finish my mess with it [14:46:45] PROBLEM host: swarm-specialpage is DOWN address: i-0000017b check_ping: Invalid hostname/address - i-0000017b [14:46:53] hashar: I logged my progress on https://labsconsole.wikimedia.org/w/index.php?title=Nova_Resource:I-0000017b#Config (which is now deleted) [14:48:05] PROBLEM host: twmysqlconf is DOWN address: i-00000223 check_ping: Invalid hostname/address - i-00000223 [14:49:35] PROBLEM host: miniswarm is DOWN address: i-0000010a check_ping: Invalid hostname/address - i-0000010a [14:49:54] Krinkle: there should be some people who can undelete pages or just have generic sysop rights [14:50:09] I don't understand why Ryan didn't make nova admin a user rights [14:50:11] right [14:50:14] I'd like to have sysop on labsconsolewiki [14:50:18] instead of giving it to sysop by default [14:50:46] current permissions on console are strange [14:53:44] PROBLEM Current Load is now: CRITICAL on testswarmmysqlconf i-00000224 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:54:17] https://bugzilla.wikimedia.org/show_bug.cgi?id=36243 [14:54:24] PROBLEM Current Users is now: CRITICAL on testswarmmysqlconf i-00000224 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[14:55:04] PROBLEM Disk Space is now: CRITICAL on testswarmmysqlconf i-00000224 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:55:44] PROBLEM Free ram is now: CRITICAL on testswarmmysqlconf i-00000224 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:56:54] PROBLEM Total Processes is now: CRITICAL on testswarmmysqlconf i-00000224 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:57:34] PROBLEM dpkg-check is now: CRITICAL on testswarmmysqlconf i-00000224 output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:27] New patchset: Hashar; "update contint to 'production' branch" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5799 [15:03:41] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/5799 [15:03:41] git checkout production 'some file' [15:03:43] for the win [15:03:52] New review: Hashar; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5799 [15:04:02] petan|wk: Hello [15:04:15] hey [15:04:29] member:petan%7Cwk:Is http://commons.wikimedia.beta.wmflabs.org/wiki/Special:UploadWizard broken again? [15:04:46] I am unable to upload video since morning [15:04:56] IMG_0006.theora.ogv [15:04:56] Internal error: Server failed to store temporary file. [15:05:03] this is he error I get [15:05:08] *the [15:05:10] ah, let me check that [15:05:36] in fact it never really worked to me, are you a developer of that tool [15:05:47] no, tester [15:05:51] aha [15:06:02] problem is that the extension produces almost no debug information [15:06:12] so it's very hard to find out what's wrong [15:06:30] + logging is broken atm, hashar any progress on that? [15:06:49] not at all [15:06:54] too bad [15:06:59] at least I managed to write to the configuration files :D [15:08:30] chrismcmahon: I have no idea where the files are being stored [15:08:38] how could I check if it can write to that location? [15:08:41] petan|wk: I was uploading files last week before the localization issue happened [15:08:57] ok, even if it was related to that issue, is there a way to debug it [15:09:12] because nternal error: Server failed to store temporary file. isn't clear to me [15:09:16] which temporary file? [15:09:18] where [15:09:39] there is /tmp and /mnt/upload [15:09:43] both writable [15:10:00] keep in mind hashar has done a lot of updates on web servers [15:10:07] maybe it's related [15:10:18] chrismcmahon: can you give me list of sw we need to install on apaches [15:10:36] petan|wk I cannot, and I'm not sure who can. [15:10:36] you said it's in your /home but it's not [15:10:49] maybe it was j^ [15:10:54] I don't remember [15:10:58] what is his name on labs? 
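A direct way to answer the "how could I check if it can write to that location" question is to attempt the write as the user Apache actually runs under, which, with hindsight, is exactly where the fault turns out to lie later in the log. The two paths are the ones named above.

    # try to create a file in each candidate location as the apache user
    for d in /tmp /mnt/upload; do
        if sudo -u apache touch "$d/.write-test" 2>/dev/null; then
            echo "$d: writable"
            sudo rm -f "$d/.write-test"
        else
            echo "$d: NOT writable as the apache user"
        fi
    done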
[15:11:56] I don't know that either :( [15:12:12] ok yes [15:12:15] that was him [15:12:52] my labs name is j [15:13:03] ok [15:13:05] I found it [15:13:36] is there anyone who knows how that upload wizard works :-) [15:13:52] I am wondering where it saves the file [15:14:00] that one it can't save [15:14:34] for transcoding we need a larger tmp space since it has to hold the video + the transcoded one [15:14:56] thank you j^ :) [15:15:05] j^: that's no problem [15:15:13] we have few hundreds of GB if that is enough [15:15:19] there is a lot of space in /data/project [15:16:28] so the transcoding instance checks the db for updates, checks out the video(this will put it into TMP), encode the derivative in temp and move it to the storage backend [15:17:20] the error above would be happening before though [15:17:30] thats sounds like an error with the web instances [15:18:51] New review: Hashar; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5799 [15:18:58] arhghg [15:19:11] I give up [15:19:22] grr [15:19:39] I give down [15:20:19] !log deployment-prep deleting testswarmmysqlconf , it is of no use :-/ [15:20:21] Logged the message, Master [15:21:11] commons has a lot of moving parts [15:21:46] Change abandoned: Hashar; "I can not merge in test branch and hence can not use that to test on labs. So this change is basica..." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5799 [15:23:26] j^: that problem happens when i upload picture [15:23:36] so I don't think it's problem with transcoding [15:25:54] PROBLEM host: testswarmmysqlconf is DOWN address: i-00000224 CRITICAL - Host Unreachable (i-00000224) [15:26:21] We did try nailing petan|wk to the floor but he struggles a lot [15:26:26] hashar: can you prefix instances with deployment pls [15:26:36] deployment-testswarmmysqlconf [15:26:37] etc [15:26:44] which instance? [15:26:44] so that we don't have hostname mess [15:26:52] hashar: all instances you create in deployment project [15:27:03] there are more projects on labs [15:27:15] as I told you the testswarmmysqlconf was temporary and I deleted it [15:27:19] ok [15:27:25] will surely prefix the next ones though :-D [15:27:47] * Damianz prefixes hashar with v [15:27:59] Now you and ^demon can make pretty patterns [15:28:15] <^demon> ...what? [15:28:20] :o [15:28:28] Damianz is on drugs, ignore that :P [15:28:35] Drugs are fun [15:28:36] :D [15:28:47] caffeine <3 [15:30:14] j^: I am still getting the same error [15:30:38] I need to know where does it try to save the file and why it doesn't work [15:30:44] is it possible to enable some debug mode [15:31:37] coffee again [15:31:47] then I get a look at the debuglogfile again [15:31:57] actually commons work [15:32:02] that's only wiki we have logs for [15:35:10] petan|wk: there must be something wrong somewhere [15:35:16] cause I have added $wgDebugTimestamps [15:35:19] heh [15:35:22] and the common logs are not prefixed with a time [15:35:28] I know there is something wrong [15:35:34] :) [15:35:37] that's pretty obvious [15:35:42] the runJobs.php script still have the old context, that is why it still write in the log file [15:35:53] ah [15:35:55] everything else has the new context which is … wrong somehow :-( [15:35:56] right [15:36:06] hm... [15:36:24] so restarting the job runner will stop the log spam [15:36:30] ok [15:36:33] can you do that? [15:36:45] where is it running [15:36:49] transcoding? 
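Finding and bouncing that job runner is a one-minute job once you know which instance it lives on, which the log never quite settles. A sketch, assuming whatever launched runJobs.php (cron or a wrapper) will simply spawn a fresh one that reads the current configuration:

    ps aux | grep '[r]unJobs.php'     # locate the long-running job runner
    sudo pkill -f 'runJobs.php'       # stop it; the stale config context dies with it
    # the replacement picks up the current CommonSettings.php when it is next started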
[15:36:50] I'll take a look at it [15:36:50] no idea :D [15:36:55] let me restart this session [15:36:58] hi Platonides [15:37:03] :D [15:37:07] :-( [15:37:09] I think he restarted session [15:37:13] of his own [15:37:14] :D [15:38:22] maybe he killed his irc instead of that job [15:38:44] well "this" was surely referring to his operateng system session :-) [15:38:49] .. operating .. [15:38:53] ok [15:39:05] why do people restart their OS? [15:39:08] I never do that [15:39:19] unless I change hardware of course [15:39:29] I sometimes shut down my pc when I do that [15:39:37] sometimes not [15:39:59] I don't like rebooting :/ [15:40:27] The only time I reboot is when I move house [15:40:53] petan|wk: are you sure MW actually include the files in wmf-config ? [15:41:12] I hope so [15:41:22] is there any other config :P [15:41:25] to include [15:41:32] cause adding die("oh no"); at top of CommonSettings.php does not do anything :-] [15:41:48] That could be squid? [15:41:55] wow [15:41:58] let me check it [15:42:01] on deployment-dbdump:/usr/local/apache/common-local/CommonSettings.php [15:42:04] let met try again [15:42:12] Does php index.php die? [15:42:21] petrb@deployment-web5:/home$ head /usr/local/apache/common-local/CommonSettings.php [15:42:21] head: cannot open `/usr/local/apache/common-local/CommonSettings.php' for reading: No such file or directory [15:42:31] sorry /usr/local/apache/common-local/wmf-config/CommonSettings.php [15:42:48] there is no die [15:43:02] ahh [15:43:06] http://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Random [15:43:08] did you save it? [15:43:10] so squid is doing its job [15:43:13] http://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Random is uncached [15:43:14] not really [15:43:20] IIRC [15:43:24] hashar: I still don't see any die() [15:43:27] in that file [15:43:30] hashar killed the internets [15:43:36] Damianz: twice [15:43:36] ok now I do [15:43:46] is this how you test stuff on prod? [15:43:51] Totally :D [15:44:04] petan|wk: in prod you do not have the nice message [15:44:08] just a nice blank page :-] [15:44:09] I swear when production goes down it's the ruddy devs testing in production [15:44:11] aha [15:44:19] hi hashar :) [15:44:26] Platonides: :-]] [15:44:29] Platonides: did you enjoy the reboot? [15:44:50] :O [15:44:54] is it just me or does the toolserver have db maintaince every week these days :( [15:44:54] heh [15:45:03] Damianz: it's normal [15:45:09] petan, did you recreate deployment-web ? [15:45:19] in fact I don't remember I ever saw toolserver operational [15:45:24] Platonides: yes [15:45:36] It works... for about 10seconds [15:45:53] ok, so I accept the new 1b:37:6d:82:54:e9:36:d6:dd:29:5f:47:dd:91:e6:6d. RSA key :P [15:45:56] yes [15:46:39] Totally need host keys in labsconsole [15:48:09] Damianz, you can view them in the console output [15:48:25] it's not the most convenient output, however [15:49:09] That doesn't count :P [15:49:29] where are uploads expected to be stored? [15:51:36] the problems where using uploadwizard in commons? [15:51:58] permissions on /mnt/upload/wikipedia/commons/temp/ look right... [15:54:07] * Platonides is looking at the logs... [15:54:17] I thought we already figured out the squid problems [15:56:49] I never remember my wmflabs password :/ [15:57:45] argh... 
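Whether a given URL is answered from the squid cache or passed through to an Apache can be checked from any shell with curl; squid stamps responses with X-Cache and Via headers as standard behaviour. The URL is the one pasted above, nothing else here is specific to this setup.

    curl -sI http://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Random \
        | grep -Ei '^(x-cache|via|server):'
    # "MISS from squid001..." means the request was forwarded to an Apache,
    # "HIT" means squid answered it from cache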
[15:57:50] "The username Platonides is not registered on this wiki, but it does exist in the unified login system" [15:57:55] on all wikis I try [15:58:43] Platonides is un-unified [15:58:45] ok, meta accepted it [15:59:10] now, do wmflabs emails work? [16:02:03] can anyone try that upload? [16:03:30] uploading [16:03:39] localization is gone again :( [16:03:42] Ooh production has funny diff colours now [16:03:59] Internal error: Server failed to store temporary file. [16:04:14] I didn't see that on the server :S [16:04:24] do we have several apaches now? [16:04:35] can you retry? [16:05:02] sure, one moment... [16:05:06] it's trying to write errors at /data/project/errors.log [16:05:09] but it isn't allowe [16:05:22] I wonder why [16:06:00] Internal error: Server failed to store temporary file. [16:06:11] it's moving really fast though [16:06:20] doesn't seem to try to store into the filesystem [16:06:28] perhaps is trying to save that into swift [16:06:48] Platonides: I believe it should be saving that into swift [16:07:02] maybe swift is in trouble? [16:11:10] chrismcmahon, can you retry? [16:11:22] sure [16:12:23] I may have chosen too big a file, we'll see [16:12:43] big files are more likely to produce issues [16:12:50] such as an error on upload [16:12:56] but they still should work [16:13:51] now uploading two files at once, but looking more promising, as UW is not speeding through the progress bar like it was [16:14:33] no I got the same error... [16:14:42] Internal error: Server failed to store temporary file. [16:14:55] Internal error: Server failed to store temporary file. [16:15:22] my file is only 5MB [16:16:30] BTW, I also tried uploadig via http://commons.wikimedia.beta.wmflabs.org/wiki/Special:Upload instead of the upload wizard and this time I got the error: Could not create directory "mwstore://local-backend/local-public/d/d1". [16:16:41] that's more helpful [16:17:18] it also took a very long time to give me that error..I thought it was going to time out...but gave me the error [16:18:10] that folder already exists [16:18:14] and moreover, it's writable [16:18:31] although it could be that folder in swift [16:19:00] hmm, Special:Upload was completely disabled last time I tried to use it. [16:19:07] but that was a while back [16:19:47] I *think* they use different mechanisms [16:20:10] wait a minute, there are both apacha and www-data users? [16:20:40] they are different users! [16:21:15] I think we should change apache to run as www-data [16:21:54] robla, under which uid do our apaches run? [16:21:55] while Upload uses another api, when it comes to storing the files, if Upload does not work, UW will also not work [16:21:58] apache or www-data? [16:22:10] <^demon> Platonides: Should be documented on noc. [16:22:20] on noc? [16:22:29] * robla doesn't know off the top of his head [16:22:29] they are php configurations [16:22:52] <^demon> http://noc.wikimedia.org/conf/httpd.conf says apache:apache [16:23:48] Platonides: looks like apache2 run as apache user [16:24:13] why are many files at wmflabs owned by www-data? [16:24:34] <^demon> Ubuntu defaults perhaps? [16:24:55] could be [16:25:21] ubuntu default is www-data and they used to run as www-data [16:25:46] <^demon> Lots of this apache config has been around for aggeeessss, from before we were 100% Ubuntu. 
So it's likely a holdover from older times :) [16:26:29] ah ok, so labs was not using that part of the config before but now is [16:26:43] * j^ updates the transcoding node to also use apache user [16:26:51] <^demon> No, I'm talking about the cluster. [16:27:24] !log deployment-prep Changing uid and group of apache user from 48 to 33 to match www-data [16:27:26] Logged the message, Master [16:27:39] <^demon> Having apache:apache is probably really super old. Ubuntu defaults to www-data:www-data which is probably the reason for the discrepancy between production & labs. [16:27:42] try the upload now [16:27:53] wmflabs was using apche user now [16:28:19] it may have been only since the instance recreation, though [16:28:44] runJobs.php is running as www-data, for instance [16:29:21] yes thats from the transcoding node, it was not changed recently [16:29:43] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 19% free memory [16:31:05] trying upload...hang on... [16:31:14] does anyone know how I can add an entry in the pmtpa.wmflabs domain ? [16:31:28] on special:upload still getting - Could not create directory "mwstore://local-backend/local-public/d/d1". [16:31:34] trying wizard next... [16:32:33] Platonides: where did you change the uid? web instances have www-data/33 and apache/48 in /etc/passwd [16:33:19] same error on wizard: Internal error: Server failed to store temporary file. [16:35:44] j^ I chenged it on deployment-web /etc/passwd [16:36:58] Platonides: would have to be in sync on web and web2-6 [16:39:57] it shouldn't be needed, to begin with [16:42:58] !log deployment-prep Apaches no more log anything. This is because rsyslog sends logs to a blackhole :-D [16:43:00] Logged the message, Master [16:44:10] hashar? [16:44:43] New patchset: Hashar; "rsyslog: ability to send logs to a custom server" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5813 [16:44:46] !log deployment-prep With the uid change to deployment-web, it is now writing into /data/project/errors.log [16:44:48] Logged the message, Master [16:44:56] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/5813 [16:46:21] New review: Hashar; "There must be a better way to do that but I am out of idea. The aim is to be able to send Apache log..." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/5813 [16:47:07] Platonides: what was wrong with the log files? [16:48:07] it was owned by www-data [16:48:27] aren't Apaches running under www-data ? [16:48:28] it's just a file, not syslog [16:48:37] they run as apache [16:48:41] OHH [16:48:42] which is a different user [16:48:48] * hashar feels tired [16:48:50] so simple [16:48:55] it escaped us for the whole afternoon [16:48:56] :-( [16:49:03] Platonides: what [16:49:11] previous log files were www-data [16:49:13] it worked [16:49:22] petan, apache2 was running as user apache [16:49:23] uid 48 [16:49:27] www-data is uid 33 [16:49:33] why it worked [16:49:34] maybe cause I deployed apache::service class this morning [16:49:35] before [16:49:42] could be that [16:49:44] the logfiles were actually created by apache itself [16:49:45] and it changed from running apache from www-data user to apache user [16:49:49] some package configuration [16:49:55] kudos Platonides ! [16:50:04] hashar: yes that [16:50:05] 's it [16:50:13] you reinstalled it [16:51:05] we will eventually want to switch from writing to files to udp:// logging [16:51:17] depends [16:51:20] can you grep it? 
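What that !log entry describes, making the apache account share uid/gid 33 with www-data so either user can keep writing the same files, could be done roughly as below. Platonides edited /etc/passwd by hand on deployment-web; the usermod/groupmod form and the chown sweep are an equivalent sketch, and the restart matches the "remember to restart apache" reminder a little further down.

    # give the apache account www-data's ids; -o permits the duplicate uid/gid
    sudo usermod  -o -u 33 apache
    sudo groupmod -o -g 33 apache
    # anything still owned by the old uid 48 needs to follow suit
    sudo find /usr/local/apache /mnt/upload -uid 48 -exec chown 33:33 {} +
    sudo service apache2 restart      # pick up the new identity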
[16:51:52] well [16:52:02] the idea is to send every logs over UDP to a central lost [16:52:06] host [16:52:13] then have udp2log daemon to run there and write to files [16:52:20] ok [16:52:28] that we could do [16:52:34] which is basically reinventing the syslog wheel [16:52:42] except it is better :-D [16:53:34] ok [16:53:40] I will do that tommorow I guess [16:54:22] should be misc::mediawiki-logger [16:54:40] it could be installed on a dedicated instance [16:54:43] a small / diskless one [16:54:57] wich will just run the udp2log daemon and write over NFS to some big disk [16:55:24] or just install it on deployment-dbdump [16:56:41] misc::mediawiki-logger [16:56:43] should be enough [16:57:40] so web2-6 still need to be fixed to run apache as 33, uploads now fail sometimes [16:57:56] I need to get out of coworking place [17:01:42] !log deployment-prep j: Changing uid and group of apache user from 48 to 33 to match www-data on web3,web4,web5 [17:01:46] Logged the message, Master [17:03:00] remember to restart apache [17:07:13] PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours [17:10:34] tparveen: now upload should work again [17:10:57] trying... [17:11:13] PROBLEM Puppet freshness is now: CRITICAL on nova-gsoc1 i-000001de output: Puppet has not run in last 20 hours [17:11:14] i still get those translation placeholders thouh [17:17:19] j^: yes I get the placeholders as well...makes it difficult to understand what the fields are for...but the upload worked for my first file :) [17:17:35] Thanks...will try a few more and keep posted... [17:17:43] this is using uploadWizard [17:17:55] did not try special:upload ... [17:18:21] anyone got an idea why the translations are still not working? some cache in the db possibly? [17:20:01] RoanKattouw: no. no replication [17:20:32] petan|wk: I'm not sure what you mean by nova admin as a right [17:20:50] so that people can have admin on the wiki? [17:20:58] Ryan_Lane: In that case can I get a labs project where I can set up MySQL with a clone of some small wiki's DB? [17:21:10] um. no [17:21:18] you mean replication? [17:21:22] from production? [17:21:23] No, not even that [17:21:26] ah [17:21:38] Just a DB with a healthy amount of data so EXPLAIN returns useful results [17:21:44] why not in deployment-prep? [17:22:25] Well let me tell you what I really want :) [17:23:08] I want an environment where people can run EXPLAIN queries against a somewhat realistic DB, and where I can give people shell like it's candy [17:23:20] heh [17:23:26] well. that would be real database replication [17:23:29] on real hardware ;) [17:23:40] It doesn't have to be replicated [17:23:42] so, I plan on bringing up user databases at some point soon [17:23:44] I said "somewhat realistic" [17:23:48] So a user DB would be fine [17:23:49] you need real hardware, though [17:24:02] I could just import data from an old dump of some small-ass wiki [17:24:03] otherwise what's the point? [17:24:42] if you are hitting virtualized disks your explains are going to be shitty [17:24:46] But basically the things I care about is 1) be able to import a couple hundred thousand rows of data and 2) give out shell access very very liberally [17:24:51] yeah [17:25:02] I plan on implementing user databases as LXC containers [17:25:04] How will EXPLAIN suck if the disks are virtual? 
[17:25:11] because the performance will be terrible [17:25:22] If it takes a second before it comes back that's fine [17:25:22] or does that not matter for this? [17:25:32] But will MySQL make different query plans based on how slow its disk is? [17:25:34] I would think not [17:25:43] ah. I guess not [17:26:00] ok [17:26:03] let's make a project for this [17:26:12] Also, I am aware that EXPLAIN works (almost) on the toolserver, but that doesn't have the liberal access policy that I want [17:26:17] when we get container support, we'll move the databases to hardware [17:26:26] OK [17:26:34] If you make me a project I'll install MySQL and everything [17:26:44] also, you don't necessarily need to give everyone shell [17:26:58] Right, they can just access the DB from bastion [17:27:03] exactly [17:27:13] I'll need to install the client libraries there [17:27:24] this could honestly be the start of the replicated database project [17:27:29] Yeah that works better, then they don't need key forwarding [17:27:32] lol [17:27:38] Maybe as far as installing mysql-client goes [17:27:58] well, I'm thinking this project could hold the databases, and we can give out access from there [17:28:03] But I am just going to install mysql-server, ask you on what FS you would like the DB to live, and then just import some random garbage into it and use it for a tutorial [17:28:08] Right [17:28:24] well, petan has issues on the gluster storage [17:28:35] which sucks, because that's where it should live [17:28:39] maybe I'll upgrade gluster today [17:28:42] that should solve that issue [17:28:46] so, what to call this project? [17:29:55] If it were up to me I would call it sqltutorial [17:30:06] But if you want to hijack it for something more generic, go for it [17:30:25] <^demon> RoanKattouw: Just pick a medium-ish sized dump and import that for a relatively realastic dataset. [17:30:31] Yeah [17:30:33] <^demon> Then make some bogus data for user rows. [17:30:42] I don't even necessarily need a user table [17:30:53] All my demo cases concern page and image [17:31:10] And I'm pretty sure every data type and index type you need is covered without user [17:31:14] or other private tables [17:35:36] heh. something more generic please :) [17:35:38] hm [17:35:53] I hate naming things [17:36:03] toolserver2 :D [17:36:10] * Ryan_Lane is a terrible, terrible person [17:36:33] database? [17:37:16] ugh. [17:37:27] that's likely the best, but is so boring [17:42:57] oh Ryan_Lane hello :-] [17:43:12] howdy [17:43:23] well mixed feeling [17:43:29] oh? [17:43:33] I have started looking at the deployment-prep project [17:43:48] that is an awesome one but there is still a lot of work to make it on par with production :-] [17:44:56] yes [17:44:58] quite a bit [17:45:04] now for the real question, who is allowed to merge changes in operations/puppet, branch 'test' [17:45:08] is that only ops ? 
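Concretely, the "import a medium-ish dump and let people EXPLAIN against it" plan could start out as below. The choice of simplewiki, the sqltutorial database name, the sample title and the blank root credentials are all illustrative; the per-table SQL dumps do exist under dumps.wikimedia.org in this layout.

    sudo apt-get install -y mysql-server mysql-client
    wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-page.sql.gz
    mysql -u root -e 'CREATE DATABASE sqltutorial'
    zcat simplewiki-latest-page.sql.gz | mysql -u root sqltutorial
    # a realistic plan against real-ish data; this should use the name_title index
    mysql -u root sqltutorial -e \
        'EXPLAIN SELECT page_id FROM page WHERE page_namespace = 0 AND page_title = "Earth"\G'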
[17:48:19] New review: Hashar; "Bug is https://bugzilla.wikimedia.org/36246" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/5813 [17:52:38] We really need seperate puppet branches to make labs work [17:52:43] * Damianz eyes Ryan_Lane [17:52:55] paravoid: ^^ [17:52:55] heh [17:53:07] 04/25/2012 - 17:53:06 - Creating a home directory for faidon at /export/home/pediapress/faidon [17:53:09] hashar: yep [17:53:17] * paravoid coughs [17:53:41] would it be possible to set up a branch specific to the deployment-prep project so we could merge in it on sight [17:53:56] by we, I mean ops + me :-] [17:54:07] 04/25/2012 - 17:54:07 - Updating keys for faidon [17:54:14] Oh yeah sorry, I have a new person to moan at :D [17:54:17] cause I am not sure how I could actually tests my puppet changes on labs [17:55:05] Well really for labs we should test 'in production' as the idea is it's testing but we lack a production cluster - well for bots at least [17:55:09] yeah, right now we review your changes [17:55:17] ok [17:55:49] it's annoying, for sure [17:56:02] well I like being reviewed [17:56:13] my main issue is that our timezone makes it a bit hard :-] [17:56:28] Ryan_Lane: Ignoring the certain ops that merge their own (sometimes broken) changes into production without peer review :D [17:56:32] well, it would be nice to be able to test changes before review [17:56:41] though I think paravoid will be spending sometime on the 'beta' stuff too [17:56:43] Damianz: you mean all of us [17:56:44] ? [17:56:50] yes, he will [17:57:04] though I personally feel that per-project puppet branches are more important right now [17:57:26] Yes, well I was using 'certain' in a very broad sense.... [17:57:45] well will be back later [17:57:55] * Ryan_Lane nods [17:57:59] need to prepare dinner, get my daughter to sleep and prepare my luggage for the weekend trip [17:58:02] ++ [17:58:07] have fun [17:58:08] The main problem with pushing labs forward is the lack of what the general community can get merged into puppet as test is a slightly high bar as we can break everything :D [17:58:25] yes [17:58:31] I completely agree [18:01:28] I think the biggest hurdle in labs right now is puppet [18:01:31] Can you justify making GSOC people make people food if they are too tired to move? [18:01:39] heh [18:01:44] * Damianz looks at paravoid :D [18:02:19] the second biggest hurdle in labs is also puppet, but in a different way [18:02:22] And yes - I agree in 2 senses, a) making labs and production less split as we've clunged stuff together and b) actually doing what we started out with of controbuting back. [18:02:28] there's no documentation on puppet classes and variables [18:02:45] so, people have no clue how to use them [18:02:56] It doesn't help that some bits of production AFAIK are not puppetised in any sensible way, like squid. [18:02:59] the third biggest hurdle is labs is puppet, because we aren't using modules :) [18:03:04] yes [18:03:09] Oh btw <3 modules [18:03:16] Totally more awesome than how labs does it from my playing [18:05:39] well, we're doing it wrong [18:06:41] Really should spend some more time on puppet =/ At the moment I have some insane things that could be done in arrays instead, just have a hate of ruby to skip over first. 
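Per-project branches do not exist yet, but a change can at least be sanity-checked locally before it goes through the gerrit review round-trip being complained about here: roughly what the "Lint check passed" bot comment covers, plus a dry run. Paths are illustrative for a checkout of operations/puppet on a labs instance.

    # syntax check: the same class of error gerrit's lint bot catches
    puppet parser validate manifests/site.pp
    # dry-run the catalog on this instance without changing anything
    sudo puppet apply --noop manifests/site.pp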
[18:19:35] paravoid: http://etherpad.wikimedia.org/LabsUpgradePlans [18:24:52] * Damianz thinks Ryan_Lane is adverse to cats :( [18:24:56] hahaha [18:24:59] I knew it was you [18:25:19] Just because I'm an ass :D [18:26:51] Btw, have you looked over the puppet module for openstack they where talking about releasing at the conference? I keep meaning to as the openstack stuff was a little 'lacking' in the production sense last time I looked and was more tilted towards being replaced with devstack. [18:27:13] devstack isn't meant to be used in production [18:27:34] also, canonical announced support for essex and the next three releases of openstack in precise [18:27:44] so, it would be silly to use devstack [18:27:54] the puppet people want to work with us on the modules [18:28:03] so, hopefully they'll get better over time [18:28:51] precise is 12.04? [18:29:01] * Damianz never understood ubuntu's naming [18:29:50] Oh cool, that's released today [18:30:01] Time to upgrade my desktop at work then :D [18:32:07] I think release date is tomorrow [18:32:39] Possibly, I don't understand their coloured release thing either... fedora's table makes far more sense. [18:38:19] no, ubuntu's makes more sense [18:43:15] <^demon> Ryan_Lane: Hehe, https://bugzilla.wikimedia.org/show_bug.cgi?id=36248 [18:43:34] heh [18:44:03] Does Chad have a transaction fetish? [18:44:21] <^demon> No, I just like my data to not disappear :) [18:52:29] You take all the fun out of mysql :D [20:26:23] PROBLEM Disk Space is now: CRITICAL on deployment-transcoding i-00000105 output: DISK CRITICAL - free space: / 34 MB (2% inode=53%): [20:31:23] PROBLEM Disk Space is now: WARNING on deployment-transcoding i-00000105 output: DISK WARNING - free space: / 45 MB (3% inode=53%): [20:46:25] hi Ryan_Lane sorry I didn't get you response yesterday re:mailman on labs [20:48:48] well, as I said, the consensus was that we didn't really want to modify it [20:49:47] so you don't want to touch list info pages at all? [20:51:47] I was going to do it for wikimedia-l Ryan_Lane, but Erik suggested that it be done for all of them. [20:52:04] lemme find the email thread [20:52:52] sure [20:54:43] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 12% free memory [20:55:05] * Ryan_Lane sighs [20:55:08] I can't find it, of course [20:57:26] Thehelpfulone: you should talk to mark about it [20:57:31] he's the one that raised an objection [20:57:34] and I can't find the email [20:57:46] hm. maybe it was in our meeting and not in email [21:04:03] Ryan_Lane: was it an opposition to changing the list info pages altogether or just putting them on labs? [21:04:12] changing the list info pages [21:04:42] was Erik at the meeting? [21:04:43] RECOVERY Free ram is now: OK on mobile-enwp i-000000ce output: OK: 42% free memory [21:04:48] no [21:05:18] ok, I'll follow it up later then, thanks [21:05:21] yw [21:05:24] sorry [21:05:49] heh np it's not your fault! 
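For the record, the recovery Ryan walks through above is the usual unmount, fsck, remount sequence on the instance's secondary volume, with files the repair cannot re-home landing in lost+found. The device and the /mnt mount point are the ones named in the conversation.

    sudo umount /dev/vdb      # make sure nothing has the volume mounted
    sudo fsck -y /dev/vdb     # repair, answering yes to every proposed fix
    sudo mount /dev/vdb /mnt
    ls /mnt/lost+found        # orphaned files recovered without their names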
[21:32:13] PROBLEM Puppet freshness is now: CRITICAL on wikidata-dev-2 i-0000020a output: Puppet has not run in last 20 hours [21:37:43] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 11% free memory [21:44:23] PROBLEM Disk Space is now: CRITICAL on labs-nfs1 i-0000005d output: DISK CRITICAL - free space: /export 0 MB (0% inode=82%): /home/SAVE 0 MB (0% inode=82%): [21:44:43] PROBLEM Free ram is now: CRITICAL on bots-3 i-000000e5 output: Critical: 5% free memory [21:49:46] hey ryan [21:50:11] i was running pageviews.wmflabs.org on a labs instance, and the files were located in /srv [21:50:17] but everything is gone.... :( [21:50:31] do you know what happened? [22:01:06] drdee: what do you mean gone? [22:01:23] there's nothing in the /srv directory? [22:01:27] yes [22:01:32] which instance is this? [22:01:35] pageviews [22:01:43] part of statsgrokse project [22:02:43] PROBLEM Total Processes is now: WARNING on incubator-bots2 i-00000119 output: PROCS WARNING: 723 processes [22:05:06] looking [22:06:56] seems /dev/vdb is pretty broken [22:06:59] I'm assuming it was there [22:07:29] :( [22:07:43] PROBLEM Total Processes is now: CRITICAL on incubator-bots2 i-00000119 output: PROCS CRITICAL: 1055 processes [22:07:44] I'm running an fsck [22:07:50] ok [22:18:54] oh great [22:18:58] I think I killed your instance [22:19:12] sad puppy face [22:20:07] any particular reason? [22:20:26] hm [22:20:27] maybe not [22:21:16] drdee: check /mnt now [22:21:30] unfortunately.... [22:21:35] lost+found is full of stuff :( [22:22:03] can't connect to pageviews [22:22:10] right, but there are now files in /mnt [22:22:16] maybe you guys had it mounted at /srv? [22:22:26] 'connection closed by 10.4.0.29 [22:22:28] try now [22:22:41] nope [22:22:42] still not there.... [22:23:05] back in [22:23:44] we didn't mount it at srv, we just put the files there [22:23:54] okay stats.grok.se is back [22:24:11] but git repo is corrupted [22:24:25] lots of stuff is [22:24:32] the filesystem was *really* messed up [22:24:46] I'm guessing this is a casualty from when gluster went haywire [22:25:38] how much faith do you have in gluster? [22:26:17] well, it's stable now [22:26:24] we're running a newer version [22:26:37] I'd recommend using /data/project, now, though [22:26:43] can it still corrupt data? [22:26:51] ok [22:27:06] /data/project is a separate gluster cluster [22:27:14] it writes directly to a filesystem [22:27:21] /mnt goes through a ton of layers of indirection [22:27:35] which makes it more likely it'll get corrupted [22:27:37] well [22:27:42] if gluster goes crazy, anyway [22:28:03] that said, we need to upgrade gluster before you can store mysql on it [22:28:32] what version are we running? [22:31:21] stable [22:31:22] 3.2 [22:31:29] we need to upgrade to the beta [22:31:30] 3.3 [22:35:16] thx [22:44:23] RECOVERY Disk Space is now: OK on labs-nfs1 i-0000005d output: DISK OK [22:48:13] PROBLEM Puppet freshness is now: CRITICAL on wikidata-dev-3 i-00000222 output: Puppet has not run in last 20 hours [23:15:23] PROBLEM Current Load is now: WARNING on mobile-enwp i-000000ce output: WARNING - load average: 7.27, 6.98, 5.36 [23:15:53] RECOVERY Puppet freshness is now: OK on nginx-dev1 i-000000f0 output: puppet ran at Wed Apr 25 23:15:37 UTC 2012 [23:20:23] RECOVERY Current Load is now: OK on mobile-enwp i-000000ce output: OK - load average: 2.23, 3.90, 4.49