[00:13:54] PROBLEM Current Load is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:14:36] PROBLEM Current Users is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:15:12] PROBLEM Disk Space is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:16:02] PROBLEM Free ram is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:17:22] PROBLEM Total processes is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:18:14] PROBLEM dpkg-check is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:18:55] RECOVERY Current Load is now: OK on userstats i-000004ca.pmtpa.wmflabs output: OK - load average: 1.13, 0.97, 0.50
[00:19:12] PROBLEM host: i-000003cc.pmtpa.wmflabs is DOWN address: i-000003cc.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003cc.pmtpa.wmflabs)
[00:19:36] RECOVERY Current Users is now: OK on userstats i-000004ca.pmtpa.wmflabs output: USERS OK - 1 users currently logged in
[00:19:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[00:20:16] RECOVERY Disk Space is now: OK on userstats i-000004ca.pmtpa.wmflabs output: DISK OK
[00:21:03] RECOVERY Free ram is now: OK on userstats i-000004ca.pmtpa.wmflabs output: OK: 859% free memory
[00:22:23] PROBLEM Disk Space is now: WARNING on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK WARNING - free space: /export 956 MB (5% inode=59%): /home/SAVE 956 MB (5% inode=59%):
[00:22:23] RECOVERY Total processes is now: OK on userstats i-000004ca.pmtpa.wmflabs output: PROCS OK: 90 processes
[00:23:13] RECOVERY dpkg-check is now: OK on userstats i-000004ca.pmtpa.wmflabs output: All packages OK
[00:49:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[01:04:52] PROBLEM Total processes is now: WARNING on bots-salebot i-00000457.pmtpa.wmflabs output: PROCS WARNING: 173 processes
[01:09:53] RECOVERY Total processes is now: OK on bots-salebot i-00000457.pmtpa.wmflabs output: PROCS OK: 94 processes
[01:19:49] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[01:50:55] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[02:22:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[02:41:43] RECOVERY Free ram is now: OK on bots-sql2 i-000000af.pmtpa.wmflabs output: OK: 20% free memory
[02:52:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[02:59:42] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af.pmtpa.wmflabs output: Warning: 14% free memory
[03:22:05] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[03:53:55] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f.pmtpa.wmflabs output: Warning: 14% free memory
[03:54:13] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[04:13:52] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f.pmtpa.wmflabs output: Critical: 4% free memory
[04:23:52] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f.pmtpa.wmflabs output: OK: 95% free memory
[04:24:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[04:34:52] PROBLEM Disk Space is now: WARNING on echo-xmpp i-00000351.pmtpa.wmflabs output: DISK WARNING - free space: / 546 MB (5% inode=91%):
[04:54:23] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[05:24:25] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[05:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[05:59:45] PROBLEM Free ram is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: NRPE: Unable to read output
[06:03:32] PROBLEM Current Users is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:02] PROBLEM Disk Space is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:22] PROBLEM Current Load is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:02] PROBLEM dpkg-check is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:53] PROBLEM SSH is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: Server answer:
[06:06:33] PROBLEM Total processes is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
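The aggregator2 alerts just above ("NRPE: Unable to read output", "CHECK_NRPE: Error - Could not complete SSL handshake.") point at the NRPE agent on the instance itself rather than at the services being checked. A minimal sketch of how one might confirm that by hand, assuming the standard Nagios plugin path and the stock Ubuntu service name:

    # From the monitoring host: with no -c argument, a healthy NRPE
    # daemon simply answers with its version string.
    /usr/lib/nagios/plugins/check_nrpe -H i-000002c0.pmtpa.wmflabs

    # Run one of the configured checks remotely, e.g. the disk check.
    /usr/lib/nagios/plugins/check_nrpe -H i-000002c0.pmtpa.wmflabs -c check_disk

    # On the instance: restart the agent if the handshake errors persist.
    sudo service nagios-nrpe-server restart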
[06:24:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[06:40:23] PROBLEM Total processes is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: PROCS CRITICAL: 348 processes
[06:41:54] PROBLEM Current Load is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: CRITICAL - load average: 97.84, 50.40, 29.13
[06:44:53] RECOVERY Disk Space is now: OK on echo-xmpp i-00000351.pmtpa.wmflabs output: DISK OK
[06:50:22] RECOVERY Total processes is now: OK on userstats i-000004ca.pmtpa.wmflabs output: PROCS OK: 94 processes
[06:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[07:01:52] PROBLEM Current Load is now: WARNING on userstats i-000004ca.pmtpa.wmflabs output: WARNING - load average: 0.90, 4.09, 16.53
[07:21:52] RECOVERY Current Load is now: OK on userstats i-000004ca.pmtpa.wmflabs output: OK - load average: 0.01, 0.05, 0.03
[07:23:32] RECOVERY Current Users is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: USERS OK - 0 users currently logged in
[07:24:04] RECOVERY Disk Space is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: DISK OK
[07:24:22] RECOVERY Current Load is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: OK - load average: 0.27, 0.97, 1.15
[07:24:45] PROBLEM Free ram is now: WARNING on aggregator2 i-000002c0.pmtpa.wmflabs output: Warning: 8% free memory
[07:24:45] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[07:25:54] RECOVERY dpkg-check is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: All packages OK
[07:25:54] RECOVERY SSH is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[07:26:36] RECOVERY Total processes is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: PROCS OK: 230 processes
[07:31:23] Change on mediawiki a page Wikimedia Labs/Toolserver features wanted in Tool Labs was modified, changed by Nemo bis link https://www.mediawiki.org/w/index.php?diff=594553 edit summary: JIRA public permalink
[07:35:02] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 6.56, 7.42, 5.99
[07:43:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 151 processes
[07:45:02] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 7.11, 8.02, 6.13
[07:53:52] RECOVERY Total processes is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS OK: 147 processes
[07:54:42] PROBLEM Free ram is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: NRPE: Unable to read output
[07:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[07:55:21] seems the labsconsole no longer lets us connect :-]
[07:56:04] telling me "Labs uses cookie to log in users. You have cookies disabled"
[07:56:10] ;-D
[07:56:19] Ryan_Lane: are you awake and doing any modification to labsconsole ?
[07:57:05] PROBLEM Disk Space is now: UNKNOWN on aggregator2 i-000002c0.pmtpa.wmflabs output: NRPE: Call to fork() failed
[07:57:23] PROBLEM Current Load is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:58:54] PROBLEM dpkg-check is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:58:58] hashar: memcached died for some reason
[07:59:10] :-(
[07:59:15] I'm awake, but wasn't making changes
[07:59:26] I'm wanting to stab hp cloud in its face
[07:59:35] PROBLEM Total processes is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:01:14] Ryan_Lane: what did they do wrong ?
[08:01:32] PROBLEM Current Users is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:01:55] they scheduled maintenance from 1am till 3am my time
[08:02:05] PROBLEM Disk Space is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:02:09] for a service that should never go down
[08:02:10] ever
[08:02:38] maybe they need to change their power circuits
[08:02:48] it's for block storage
[08:03:04] there's never a reason to take it down for 2 houts
[08:03:06] hours
[08:03:15] their original maintenance window was 8 hours
[08:03:32] I had to convince them that I couldn't take downtime for 8 hours
[08:04:02] PROBLEM SSH is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: Server answer:
[08:04:56] !log deployment-prep attempting to deploy role::db::core class on deployment-sql02
[08:04:59] Logged the message, Master
[08:05:38] Ryan_Lane: labs working again. Thanks for restarting the memc instance :-]
[08:05:47] yw
[08:06:55] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 157 processes
[08:17:12] !log deployment-prep removed role::db::core from -sql02 : class is not meant for labs :-]
[08:17:12] Logged the message, Master
[08:20:54] PROBLEM Current Load is now: CRITICAL on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: CRITICAL - load average: 24.05, 23.56, 20.91
[08:24:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[08:31:50] !log deployment-prep -sql02 : manually installed mysql server using /mnt/mysql as datadir.
[08:31:52] Logged the message, Master
[08:40:52] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 15.13, 17.87, 19.76
[08:51:18] [bz] (NEW - created by: Nemo_bis, priority: Unprioritized - normal) [Bug 41095] Missing perl dependencies? - https://bugzilla.wikimedia.org/show_bug.cgi?id=41095
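The labsconsole failure discussed above is what MediaWiki reports when its session backend is unreachable: sessions lived in memcached, and the memcached instance had died. A quick liveness check of a memcached server, as a sketch (the hostname is a placeholder; 11211 is the default port, and the service name is an assumption):

    # A healthy memcached answers the stats command; a dead or hung one
    # times out instead.
    echo stats | nc -w 2 memcached-host.pmtpa.wmflabs 11211 | head

    # Restart it if it is gone.
    sudo service memcached restart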
[08:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[09:20:52] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 2.37, 2.80, 4.63
[09:24:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[09:25:53] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 1.94, 2.24, 4.49
[09:35:00] Ryan_Lane: could you have a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=41083
[09:35:32] deployment-video04(10.4.1.19) is unable to connect to mysql at deployment-sql(10.4.0.53)
[09:39:24] hey j^
[09:39:38] hi
[09:39:46] j^: I raised the issue on the labs mailing list
[09:39:59] ryan answered it is a bug in openstack pending packaging in Ubuntu :(
[09:40:06] https://bugzilla.wikimedia.org/show_bug.cgi?id=40526
[09:40:10] will update that bug with his answer
[09:41:15] [bz] (NEW - created by: Antoine "hashar" Musso, priority: High - normal) [Bug 40526] new security rule not applied - https://bugzilla.wikimedia.org/show_bug.cgi?id=40526
[09:41:43] [bz] (NEW - created by: Jan Gerber, priority: Unprioritized - normal) [Bug 41083] unable to connect to deployment-sql from deployment-video04 - https://bugzilla.wikimedia.org/show_bug.cgi?id=41083
[09:42:24] ah ok, any known workaround?
[09:42:37] i could create another instance and hope that it ends up in the 10.4.0.x range
[09:46:05] hmm
[09:46:08] also the sql security rule is no longer around :/
[09:46:33] I have added it back
[09:46:53] so you could try creating a new instance using the default and sql security group
[09:47:14] might be able to apply it
[09:50:32] PROBLEM Current Users is now: UNKNOWN on ganglia-test2 i-00000250.pmtpa.wmflabs output: Invalid host name i-00000250.pmtpa.wmflabs
[09:50:45] ok, while it's building i have another question:
[09:50:47] is commons.wikimedia.beta.wmflabs.org currently using any custom javascript or extensions that are not used in production? since right now TMH does not load (i.e. http://commons.wikimedia.beta.wmflabs.org/wiki/File:Folgers.ogv)
[09:51:37] I am not sure
[09:51:45] it is definitely missing some javascript from the commons production site
[09:52:18] http://commons.wikipedia.beta.wmflabs.org/w/thumb.php?f=Folgers.ogv&width=352 <--- gives an error 500 (internal server error)
[09:52:52] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:02] PROBLEM dpkg-check is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:02] PROBLEM Current Load is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:06] bah it's gone
[09:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[09:55:32] PROBLEM Current Users is now: CRITICAL on ganglia-test2 i-00000250.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
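Bug 41083 above comes down to a security-group rule not being applied (bug 40526), so TCP to the database port stays filtered between the two instances. A sketch of how one could verify whether a new instance actually reaches the database, using the addresses quoted above (the credentials are placeholders):

    # Raw TCP reachability of deployment-sql's MySQL port from the client instance.
    nc -zv -w 5 10.4.0.53 3306

    # If the port answers, try a real MySQL round trip.
    mysql -h 10.4.0.53 -u some_user -p -e 'SELECT 1;'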
[09:57:42] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf.pmtpa.wmflabs output: Warning: 9% free memory
[09:57:52] RECOVERY dpkg-check is now: OK on aggregator-test1 i-000002bf.pmtpa.wmflabs output: All packages OK
[09:58:52] PROBLEM Total processes is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: Connection refused by host
[09:59:32] PROBLEM dpkg-check is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: Connection refused by host
[10:00:12] PROBLEM Current Load is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: Connection refused by host
[10:03:52] RECOVERY Total processes is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS OK: 103 processes
[10:04:32] RECOVERY dpkg-check is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: All packages OK
[10:05:12] RECOVERY Current Load is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: OK - load average: 0.15, 0.45, 0.41
[10:15:56] Is there somewhere a docu of the Puppet class "git" (for example git::clone)?
[10:17:33] Jan_Luca: not really :/
[10:17:38] Jan_Luca: but I can help :)
[10:17:52] PROBLEM Current Load is now: WARNING on aggregator-test1 i-000002bf.pmtpa.wmflabs output: WARNING - load average: 0.23, 2.10, 17.59
[10:18:17] there is some inline documentation in operations/puppet.git repository. That definition is in manifests/generic-definition.pp
[10:18:32] PROBLEM dpkg-check is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages
[10:18:57] Jan_Luca: I have pasted the inline doc at http://pastebin.com/mGjiyJ8i
[10:23:08] thank you
[10:24:15] hashar: looks like TMH git was not pulled, are extensions pulled automatically in deployment-prep or does it have to be done as needed?
[10:24:33] j^: should be automatically pulled. Let me check
[10:24:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[10:24:46] the autoupdating script might have an error :-]
[10:25:44] j^: TMH is at : * 2ffc1f8 - (HEAD, origin/master, origin/HEAD, master) Localisation updates from http://translatewiki.net. (15 hours ago)
[10:25:56] aka it is up to date
[10:26:11] hashar: yes, but that was only after i pulled manually
[10:26:26] so the updater might have an issue :/
[10:26:47] will keep an eye and ping you instead of pulling next time it happens
[10:27:45] new instance is able to telnet sql server, will install videoscaler class and see if it works from php too
[10:28:13] !log deployment-prep j: create new videoscaler instance deployment-video05 this time with sql access
[10:28:15] Logged the message, Master
[10:31:00] grmblblbl
[10:31:01] !log wikidata-dev wikidata-dev-2: Disabled E3Experiments extension in the test client because it contained a new signup page that did not display captchas and thus prevented account creation.
[10:31:03] Logged the message, Master
[10:31:30] j^: deployment-jobrunner06 is out of disk space, that is where the beta auto updater runs
[10:34:22] !log deployment-prep deployment-jobrunner06 / was filled up by Gluster logs: /var/log/glusterfs/data-project.log filled it all :(
[10:34:22] Logged the message, Master
[10:34:52] that's not cleaned up by logrotate?
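The disk-full problem above (gluster client logs filling / on deployment-jobrunner06) is exactly what bug 41104, filed a little later, tracks: nothing rotates /var/log/glusterfs/*. A minimal logrotate drop-in of the kind that would cap the growth, as a sketch (the file name and retention values are assumptions, not the fix that was eventually deployed):

    # Install a rotation policy for the gluster client logs.
    sudo tee /etc/logrotate.d/glusterfs-client <<'EOF'
    /var/log/glusterfs/*.log {
        weekly
        rotate 4
        compress
        missingok
        notifempty
        copytruncate
    }
    EOF

copytruncate avoids having to signal the glusterfs processes to reopen their log files.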
[10:37:12] RECOVERY Disk Space is now: OK on deployment-jobrunner06 i-0000031d.pmtpa.wmflabs output: DISK OK
[10:39:32] PROBLEM host: i-000004c6.pmtpa.wmflabs is DOWN address: i-000004c6.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000004c6.pmtpa.wmflabs)
[10:41:50] j^: apparently it is not :(
[10:42:53] RECOVERY Current Load is now: OK on aggregator-test1 i-000002bf.pmtpa.wmflabs output: OK - load average: 0.72, 0.43, 3.80
[10:52:59] on the video instance i also get lots of gluster errors in the logs: E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-deployment-prep-project-replicate-0: background entry self-heal failed on /apache/common-local
[10:54:08] yeah got a similar issue in jobrunner06
[10:54:15] I have no idea what it could be
[10:54:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[10:54:50] gluster volume degraded
[10:55:25] what do you mean ?
[10:56:48] to me the errors suggest that the gluster volume is degraded and needs to be repaired
[11:01:36] https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=794699
[11:02:41] fine
[11:02:54] so we want to file a bug and get ops to look at the gluster volume
[11:09:28] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41104] glusterfs log files are not rotated - https://bugzilla.wikimedia.org/show_bug.cgi?id=41104
[11:12:33] j^: I don't have the error on the deployment-integration instance :-]
[11:12:41] might be related to the symbolic links
[11:13:56] or running php-cli (jobrunner, videoscaler)
[11:14:30] ah
[11:18:39] !log deployment-prep emptied /var/log/glusterfs/data-deployment.log huge files on several instances
[11:18:41] Logged the message, Master
[11:19:12] RECOVERY Disk Space is now: OK on deployment-apache33 i-0000031b.pmtpa.wmflabs output: DISK OK
[11:21:13] RECOVERY Disk Space is now: OK on deployment-apache32 i-0000031a.pmtpa.wmflabs output: DISK OK
[11:22:00] !log deployment-prep updated mediawiki-config : 6bbf8f2..7caabad
[11:22:02] Logged the message, Master
[11:24:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[11:32:08] j^: I hopefully properly removed the live hack on beta that was doing the same as https://gerrit.wikimedia.org/r/#/c/28007/
[11:33:38] is /data/project/apache/common/wmf-config tracking git? git log gives an error here
[11:36:44] hashar: PrivateSettings.php~ and InitialiseSettings-wmflabs.php~ look like files that should be removed no?
[11:37:14] they are vim backup files
[11:37:22] so should probably be deleted
[11:37:38] wmf-config had a .git directory I deleted that
[11:38:04] the working copy is in /home/wikipedia/common/ (which is an alias to /data/project/apache/common IIRC)
[11:38:16] j^: which error do you have ?
[11:38:40] git error is now gone
[11:38:43] great :-]
[11:38:54] somehow wmf-config was configured as a git repo
[11:39:00] that might have been the issue
[11:41:27] are all web instances using the same config?
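For the "background entry self-heal failed" errors discussed above, the usual next step is to ask gluster itself about the replica state. A sketch, assuming the volume name implied by the log prefix (deployment-prep-project), that the commands run on one of the gluster servers, and a GlusterFS recent enough to have the heal subcommand (3.3+):

    # Brick status and the list of entries still pending self-heal.
    sudo gluster volume status deployment-prep-project
    sudo gluster volume heal deployment-prep-project info

    # Kick off a full self-heal if entries look stuck.
    sudo gluster volume heal deployment-prep-project full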
[11:43:08] since the page sometimes looks as if TMH was not enabled at all
[11:45:30] there is a squid cache in front
[11:45:44] that might be caching the page
[11:54:03] now i get a 404 for resources like http://bits.beta.wmflabs.org/static-master/resources/jquery/jquery.cookie.js
[11:54:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[11:56:28] dohh
[11:56:31] poor bits
[11:57:09] j^: need to fix it :-)
[11:58:08] the links got removed :(
[11:58:21] !g I89c58c9a085dce4ef83ad5ea907135a269de359a
[11:58:21] https://gerrit.wikimedia.org/r/#q,I89c58c9a085dce4ef83ad5ea907135a269de359a,n,z
[11:58:44] grmblblb
[12:02:03] https://gerrit.wikimedia.org/r/28337
[12:03:38] !log deployment-prep Fixed assets on bits. The static-master symbolic links got removed at some point. See {{gerrit|28337}}
[12:03:40] Logged the message, Master
[12:03:49] j^: that is fixed :-] http://bits.beta.wmflabs.org/static-master/resources/jquery/jquery.cookie.js
[12:03:59] that most probably caused a ton of issues
[12:05:08] j^: got an error : "Error loading EmbedPlayer dependency set: Module mw.EmbedPlayer failed."
[12:05:13] on http://commons.wikimedia.beta.wmflabs.org/wiki/File:Folgers.ogv
[12:05:36] though that is thrown by ext.gadget.Stockphoto
[12:05:43] so I am not sure it is related to TMH
[12:07:06] hashar: loading http://commons.wikimedia.beta.wmflabs.org/wiki/Special:TimedMediaHandler i got a does not exist error (Served by deployment-apache32)
[12:08:00] j^: I get a special page content there
[12:08:08] with a banner about commons being available in french
[12:08:13] then blank content
[12:08:13] yes i get it most of the time
[12:08:28] via
[12:08:43] last time i did not get it it was served by 32
[12:08:48] so might be something about 32
[12:10:52] PROBLEM Disk Space is now: WARNING on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK WARNING - free space: / 574 MB (5% inode=89%):
[12:11:46] X-Cache MISS from squid001.beta.wmflabs.org
[12:11:57] so it seems it is hitting the apaches
[12:25:53] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[12:31:10] !log centralauth Update all instances and reboot them
[12:31:16] !log centralauth Update all instances and reboot them
[12:31:17] Logged the message, Master
[12:31:22] Logged the message, Master
[12:55:53] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[13:05:33] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[13:10:33] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 97 processes
[13:27:04] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[13:30:58] Hey hashar, thanks for setting up that db server!
[13:31:15] Any chance I could get sudo access in the project?
[13:33:03] Ryan_Lane: When you see this later on, please help me boot up dumps-bot3, it is down and I can't reboot it on the web interface. ID: i-000003ef.pmtpa.wmflabs. Thanks!
[13:34:37] csteipp: you should have sudo access already
[13:34:41] csteipp: at least I made you a sysadmin
[13:34:50] csteipp: and good morning :-]
[13:34:56] Good Morning :)
[13:35:07] I don't appear to have sudo...
[13:35:32] csteipp: while you are around, I will be connected tonight from roughly noon to 2pm PST. Not the ideal time but might let us chat a bit :)
[13:35:32] At least, I get a password prompt for it when I try to sudo anything
[13:35:34] ahh
[13:35:40] csteipp: you need to enter your labs password
[13:35:48] Oh!
[13:35:50] sudo uses pam_ldap to authenticate
[13:36:02] hashar: Are you guys going to create the beta wikis for Wikivoyage anytime soon?
[13:36:06] <^demon> Morning everyone :)
[13:36:22] morning ^demon!
[13:36:40] Hm... "csteipp is not allowed to run sudo on deployment-sql02. "
[13:37:14] :-(
[13:37:15] <^demon> hashar: So, the fact that we run puppet linting from patchset-created is annoying me, so I'm fixing the puppet job in jenkins.
[13:37:41] <^demon> Your rake doesn't seem to work completely yet, so I just made a new build step with the command line version.
[13:37:49] ^demon: is that reporting anything back to Gerrit ?
[13:38:10] I know I experimented a bit about it during the SF tech days
[13:38:39] <^demon> Not yet. But I got it to build on one I retriggered :)
[13:38:42] csteipp: I am looking at the security groups
[13:38:48] doh
[13:40:11] csteipp: ahh I did not add you to the sudo policy :-D https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaSudoer&action=modify&sudoername=admin&project=deployment-prep
[13:40:17] csteipp: you might be able to add yourself there
[13:41:07] !log deployment-prep Added CSteipp and Reedy as sudoers
[13:41:09] Logged the message, Master
[13:42:15] meh
[13:42:16] back
[13:42:27] csteipp what is your name?
[13:42:31] <^demon> hashar: What do I have to do to get jenkins-bot to report the failure to gerrit with a V-1?
[13:42:36] <^demon> Or +1
[13:42:40] Thanks hashar!
[13:42:49] petan: As in irl?
[13:42:58] as in labs :)
[13:43:08] csteipp: we usually get everything puppetized. Though for sql02 I just installed mysql manually
[13:43:08] CSteipp
[13:43:14] the production puppet classes do not work on labs :/
[13:43:17] there is no such user I can see
[13:43:29] got it
[13:43:36] it's not sorted meh
[13:43:44] ^demon: it should do it by default. A verified -1 is triggered whenever the build script fails, most of the time when some command exits with a non-zero status.
[13:43:52] ^demon: example: /bin/false ;-]
[13:43:53] ah, you should have sudo...
[13:44:03] I do now :)
[13:44:13] <^demon> hashar: Ah, jenkinsbot probably doesn't have permissions on ops/puppet. Lemme check
[13:44:22] !log deployment-prep -sql02 removed ganglia from host and reran puppet.
[13:44:23] Logged the message, Master
[13:44:34] ^demon: which job are you working on ?
[13:44:45] <^demon> operations-puppet
[13:45:06] https://integration.mediawiki.org/ci/job/operations-puppet/configure so it indeed got built. not sure why since it is supposed to be disabled :-]
[13:45:20] Retriggered by user Demon for Gerrit: https://gerrit.wikimedia.org/r/23603 in silent mode.
[13:45:27] so silent mode means nothing is reported back to Gerrit
[13:45:35] to avoid having ops getting angry :-]
[13:45:49] <^demon> In silent mode?
[13:45:51] <^demon> Whoops
[13:47:18] csteipp: your idea to use the MediaWiki database load balancer is smart. I just have 0 knowledge about how to set it up properly. I guess Asher could help
[13:47:23] csteipp: or Aaron :-]
[13:47:31] <^demon> hashar: I also just gave JenkinsBot Verified permissions on All-Projects.
[13:47:59] Yeah, I think I'll be able to do it... but I'll see how far I get :)
[13:50:21] <^demon> hashar: I disabled quiet mode, but it's still not reporting :\
[13:51:36] let me look up in /var/log/jenkins/jenkins.log
[13:52:53] #gerrit approve 23603,2 --message 'Build Started https://integration.mediawiki.org/ci/job/operations-puppet/97/ ' --verified 0 --code-review 0
[13:52:53] hehe
[13:52:53] so the startup is commented out
[13:56:35] ^demon: anyway, the build #97 has been triggered on change 23603 which is merged. Jenkins tried the command gerrit approve 23603,2 --message 'Build Successful….' --verified 1 --code-review 0 which is not going to work since the change is closed
[13:56:35] <^demon> Ah, gonna have to try it on a fresh change :)
[13:56:35] and make sure are aware about it
[13:56:35] apparently Ryan did not see a point in using Jenkins and preferred the Gerrit hooks
[13:56:35] !ping
[13:56:35] ^demon: you can play with https://gerrit.wikimedia.org/r/#/c/28216/ if you want. I am most probably going to abandon that one.
[13:56:35] wm-bot: hey
[13:56:35] o.o
[13:56:38] Hi petan, there is some error, I am a stupid bot and I am not intelligent enough to hold a conversation with you :-)
[13:56:39] pong
[13:56:43] lol
[13:56:49] that was a lag
[13:56:52] PROBLEM Total processes is now: WARNING on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS WARNING: 156 processes
[13:57:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[13:59:48] <^demon> hashar: Whoo, it worked :)
[13:59:49] <^demon> https://gerrit.wikimedia.org/r/#/c/28216/
[14:00:30] great
[14:01:02] it could use the rake script now :-)
[14:01:54] annhhh
[14:02:00] na it needs the ruby gems
[14:05:22] so we have puppet client 2.7.6 in production which does not have the puppet/util/colors class :/
[14:06:01] will self fix when we upgrade Gallium to Precise I guess
[14:10:50] <^demon> hashar: I disabled the jenkins job again until ops merges the change. Don't want to spam people with double notifications on pass/fail :)
[14:20:23] ^demon: and the new jenkins is at http://integration.wmflabs.org/ci/ :-D
[14:20:32] still need to write you down an email about it
[14:21:00] will hopefully be a safer system for us
[14:21:01] http://integration.wmflabs.org/ci/job/mediawiki-core-lint/45/
[14:21:14] though tests are failing cause there is only 1GB of memory on the box *grin*
[14:21:58] <^demon> So integration's going to run out of labs now instead of gallium?
[14:22:25] nop
[14:22:31] I have setup Zuul on labs
[14:22:32] * chrismcmahon follows along...
[14:22:36] to get the puppet class setup
[14:22:38] and update the jenkins job
[14:22:52] ended up rewriting the jenkins job from scratch (not that long anyway)
[14:23:04] <^demon> Ah ok, then we'll move it to prod?
[14:23:04] Zuul is now almost working fine :-]
[14:23:13] so to move to prod I have to get gallium upgraded
[14:23:26] spent the day finding out all the steps that need to be done
[14:23:33] since some stuff is not puppetized
[14:23:46] we will have to shut down jenkins for a few hours while the box is being rebuilt
[14:24:46] hashar: and then we can deploy extensions and config to beta labs from Jenkins?
[14:26:49] I guess
[14:27:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[14:27:04] will need to write the job and get jenkins to communicate with a beta instance
[14:27:29] chrismcmahon: we should probably write a small doc as to what we expect that job to do
[14:27:49] hashar: that's about the most important thing I want to do right now
[14:28:36] make deploying extensions to beta labs builds in Jenkins so a) we can see where they fail and b) hang some browser tests off those builds
[14:29:27] well all extensions are already git pulled on beta
[14:29:38] though most are not configured
[14:30:05] hashar: but that has been failing since 7 Oct. if it were in Jenkins we could see it red and see the actual commands
[14:30:55] yeah the script does not work that well
[14:31:47] though I am pretty sure I fixed it last friday
[14:32:05] not working :-((((
[14:32:12] will fix that tonight
[14:32:41] hashar: I went looking for the script yesterday on deployment, couldn't find it. can you point me to where it runs?
[14:32:54] it is on deployment-jobrunner06
[14:33:04] /usr/local/bin/wmf-beta-autoupdate
[14:33:13] thanks!
[14:33:17] that is started as a service (configured in upstart)
[14:33:21] ok
[14:33:21] and deployed by puppet
[14:33:50] it must be a permission error
[14:34:05] anyway, I am off to go to bank + accountant + daughter + dinner
[14:34:23] will be back around after dinner (roughly noon PST)
[14:34:47] don't waste too much time with the beta auto updater, it is bad quality software and poorly integrated :-(
[14:35:29] off
[14:41:52] PROBLEM Total processes is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS CRITICAL: 202 processes
[14:46:52] RECOVERY Total processes is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS OK: 117 processes
[14:57:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[14:59:12] PROBLEM Current Load is now: WARNING on deployment-video05 i-000004cc.pmtpa.wmflabs output: WARNING - load average: 10.69, 10.54, 7.26
[15:04:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[15:04:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 154 processes
[15:08:53] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 17.72, 15.95, 9.22
[15:18:53] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 10.14, 9.95, 7.05
[15:19:53] RECOVERY Total processes is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS OK: 150 processes
[15:27:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[15:29:33] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 97 processes
[15:40:34] Damianz or petan are you available?
[15:40:44] for 10 mins
[15:40:47] great
[15:40:59] I've been told to talk to you guys, with respect to being added to the Bots project on Labs
[15:41:15] sure
[15:42:08] !log bots +huji
[15:42:09] Logged the message, Master
[15:42:10] done
[15:42:20] !project bots
[15:42:20] There are multiple keys, refine your input: project-access, project-discuss, projects,
[15:42:32] hmm
[15:42:47] @search Projec
[15:42:49] Results (Found 4): instancelist, instance-json, info, manage-projects,
[15:43:02] !info test
[15:43:02] https://www.mediawiki.org/wiki/WMF_Projects/Wikimedia_Labs
[15:43:02] hm..
[15:43:12] Huji you are a member of that project
[15:43:16] * Huji feels like he opened a can of worms!
[15:43:23] that's weird
[15:43:23] I just can't remember that key to link you to docs
[15:43:36] because .. see: https://www.mediawiki.org/wiki/WMF_Projects/User_talk:Huji
[15:43:39] !resource bots
[15:43:43] https://labsconsole.wikimedia.org/wiki/Nova_Resource:bots
[15:43:48] maybe this?
[15:44:06] yes,
[15:44:12] that is what I'm talking about
[15:44:28] huji I don't see anything there
[15:44:42] sorry, bad link
[15:44:42] 1 sec
[15:44:43] basically you just ssh to bastion, then you ssh to bots-nr1
[15:44:44] :)
[15:44:56] https://labsconsole.wikimedia.org/wiki/User_talk:Huji
[15:45:01] there you can do your stuff, for any question, you speak to me or Damianz
[15:45:16] ok, anyway you are a member now
[15:45:45] !botsdocs
[15:46:00] 10/17/2012 - 15:46:00 - Created a home directory for huji in project(s): bots
[15:46:05] !botsdocs is https://labsconsole.wikimedia.org/wiki/Nova_Resource:Bots/Documentation
[15:46:05] Key was added
[15:46:12] !botsdocs | Huji
[15:46:12] Huji: https://labsconsole.wikimedia.org/wiki/Nova_Resource:Bots/Documentation
[15:46:13] :P
[15:46:19] this is documentation
[15:46:33] great
[15:47:06] thanks petan
[15:50:46] petan: are you and Damian the only sysadmins in that project?
[15:51:06] 10/17/2012 - 15:51:05 - User huji may have been modified in LDAP or locally, updating key in project(s): bots
[15:51:13] Huji nope, let me get a list...
[15:51:18] should be somewhere on the web
[15:52:08] DamianZaremba Dzahn Jeremyb Ryan Lane Petrb Rich Smith Thehelpfulone
[15:52:14] that's a list of all sysadmins there
[15:52:36] one last question, if you still have time
[15:52:39] sure
[15:52:46] !mail
[15:52:51] !list
[15:52:53] @search list
[15:52:53] Results (Found 1): keys,
[15:53:05] @search mail
[15:53:05] Results (Found 3): osm-bug, new-labsuser, account-questions,
[15:53:10] o.o
[15:53:34] a bunch of bot masters in Fa WP are running bots off of toolserver now, and the idea is for me to pave the way for them so that they can eventually switch to Labs. Their bots are mainly focused on adding missing categories, and creating pages from data
[15:53:50] !mail is we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe
[15:53:51] Key was added
[15:54:05] do you have any recommendations about which instance to start using, until we get big enough (probably) to request a separate instance?
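The two-hop login petan describes above (ssh to bastion, then ssh to bots-nr1) can be collapsed into a single command with an ssh ProxyCommand. A sketch, assuming the usual labs bastion hostname and an OpenSSH new enough for -W (5.4+); the username is a placeholder:

    # ~/.ssh/config on your own machine
    Host bastion.wmflabs.org
        User huji

    Host *.pmtpa.wmflabs
        User huji
        ProxyCommand ssh -W %h:%p bastion.wmflabs.org

    # then a direct-looking login just works:
    ssh bots-nr1.pmtpa.wmflabs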
[15:54:15] RECOVERY Current Load is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: OK - load average: 0.27, 0.77, 4.78
[15:54:30] yes, bots-nr1 or bots-4 in case you need root
[15:54:57] bots-4 is perfect to test stuff, then move it to puppet and then enable it on production version
[15:54:57] :)
[15:54:58] !bots
[15:54:58] http://www.mediawiki.org/wiki/Wikimedia_Labs/Create_a_bot_running_infrastructure proposal for bots
[15:55:04] root is not needed, IMHO. bots-1 is listed FIXME in the Doc, which is why I asked
[15:55:05] that's how the proposed idea should be
[15:55:20] bots-1 is the worst instance you could use :D
[15:55:26] it's the first one we made and most damaged
[15:55:31] use bots-4 or bots-nr1
[15:55:35] notice "nr"
[15:55:40] bots-nr1, roger
[15:55:54] I am also not familiar with puppet, any reading suggested?
[15:56:00] !puppet
[15:56:00] learn: http://docs.puppetlabs.com/learning/ troubleshoot: http://docs.puppetlabs.com/guides/troubleshooting.html
[15:56:10] !labswiki puppet
[15:56:25] !labswiki is https://labsconsole.wikimedia.org/wiki/$*
[15:56:25] Key was added
[15:56:31] !labswiki Puppet
[15:56:31] https://labsconsole.wikimedia.org/wiki/Puppet
[15:57:03] aha
[15:57:04] oops,
[15:57:05] that one doesn't exist
[15:57:07] :D
[15:57:20] funny
[15:57:24] there is no help for puppet so far
[15:57:32] ok, in that case
[15:57:32] haha
[15:57:33] !puppet
[15:57:33] learn: http://docs.puppetlabs.com/learning/ troubleshoot: http://docs.puppetlabs.com/guides/troubleshooting.html
[15:57:37] these 2 work
[15:57:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 153 processes
[15:58:00] nice
[15:58:01] thank you so much
[15:58:05] no problem :)
[15:58:09] !list | Huji
[15:58:15] !mail | Huji
[15:58:15] Huji: we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe
[15:58:16] !list alias mail
[15:58:16] Created new alias for this key
[15:58:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[15:59:01] !list | test
[15:59:01] test: we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe
[15:59:12] ok :D
[16:24:52] PROBLEM host: i-000004cf.pmtpa.wmflabs is DOWN address: i-000004cf.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000004cf.pmtpa.wmflabs)
[16:28:53] PROBLEM Current Load is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:29:14] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[16:29:33] PROBLEM Current Users is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:30:12] PROBLEM Disk Space is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:31:05] PROBLEM Free ram is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:32:22] PROBLEM Total processes is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:33:15] PROBLEM dpkg-check is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:34:32] RECOVERY Current Users is now: OK on praxis i-000004d0.pmtpa.wmflabs output: USERS OK - 0 users currently logged in
[16:35:13] RECOVERY Disk Space is now: OK on praxis i-000004d0.pmtpa.wmflabs output: DISK OK
[16:36:03] RECOVERY Free ram is now: OK on praxis i-000004d0.pmtpa.wmflabs output: OK: 1007% free memory
[16:37:23] RECOVERY Total processes is now: OK on praxis i-000004d0.pmtpa.wmflabs output: PROCS OK: 90 processes
[16:37:43] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:38:13] RECOVERY dpkg-check is now: OK on praxis i-000004d0.pmtpa.wmflabs output: All packages OK
[16:38:53] RECOVERY Current Load is now: OK on praxis i-000004d0.pmtpa.wmflabs output: OK - load average: 0.55, 0.79, 0.59
[16:42:42] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf.pmtpa.wmflabs output: Warning: 8% free memory
[16:59:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[17:07:34] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[17:08:52] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 2.66, 3.57, 4.74
[17:12:34] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 97 processes
[17:26:53] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 12.26, 10.78, 7.63
[17:29:24] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[17:53:48] [bz] (NEW - created by: Chris McMahon, priority: Unprioritized - major) [Bug 41121] out of captcha images; cannot create accounts - https://bugzilla.wikimedia.org/show_bug.cgi?id=41121
[17:59:23] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[18:13:38] sumanah: Is it possible to get a Gerrit user for using gerrit query
[18:13:52] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 4.05, 4.28, 4.95
[18:13:53] in a script
[18:13:59] hi Jan_Luca -- do you mean you want a Gerrit user account?
[18:14:21] I have a user account for me (jan) but I need one for a labs project
[18:14:53] Jan_Luca: Ah, ok, I see what you mean.
[18:15:06] I don't know what the current policy is on that sort of thing. Maybe ask in labs-l ?
[18:16:06] There is a user "labs-puppet" for downloading puppet config from Gerrit so it seems to be possible...
[18:16:12] <^demon> I imagine it'd be fine to set up a user for that. But yeah, just toss it on labs-l to give someone a chance to say "That's a bad idea"
[18:16:45] ^demon: Ok, I will do this
[18:16:54] <^demon> sumanah: Also, gerrit 2.5rc1 came out yesterday. Full release notes: http://gerrit-documentation.googlecode.com/svn/ReleaseNotes/ReleaseNotes-2.5.html
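The scripted `gerrit query` use Jan_Luca asks about above runs over Gerrit's ssh interface, so a dedicated account only needs its own ssh key. A sketch of the call such a script would make (the account name and project filter are placeholders; 29418 is Gerrit's ssh port, as seen elsewhere in this log):

    # One JSON object per matching change, plus a trailing stats record.
    ssh -p 29418 some-bot-account@gerrit.wikimedia.org \
        gerrit query --format=JSON status:open project:some/project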
[18:26:15] [bz] (NEW - created by: Jan Gerber, priority: Unprioritized - normal) [Bug 41123] glusterfs /data/projects 0-deployment-prep-project-replicate-0: background entry self-heal failed - https://bugzilla.wikimedia.org/show_bug.cgi?id=41123
[18:26:16] [bz] (NEW - created by: Chris McMahon, priority: Unprioritized - major) [Bug 41121] out of captcha images; cannot create accounts - https://bugzilla.wikimedia.org/show_bug.cgi?id=41121
[18:29:23] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[18:30:05] <^demon> Jan_Luca: You might also be interested to know--when we upgrade to gerrit 2.5, it introduces a nice public REST api.
[18:31:53] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 3.88, 4.20, 4.96
[18:34:09] * Damianz appears
[18:35:24] <^demon> Howdy.
[18:35:38] Hi :D
[18:36:12] So, I just picked up my motorbike. Why did I not do this sooner!?
[18:47:23] PROBLEM Disk Space is now: CRITICAL on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK CRITICAL - free space: /export 505 MB (2% inode=58%): /home/SAVE 505 MB (2% inode=58%):
[18:53:11] ^demon: When will gerrit 2.5 be installed on gerrit.wikimedia.org? I think this will take some more time...
[18:53:32] when they fix ldap
[18:53:52] PROBLEM Current Load is now: CRITICAL on mobile-solr i-000004d1.pmtpa.wmflabs output: Connection refused by host
[18:53:55] <^demon> Jan_Luca: Real Soon Now (tm). Waiting for an upstream fix for an LDAP regression.
[18:54:32] PROBLEM Current Users is now: CRITICAL on mobile-solr i-000004d1.pmtpa.wmflabs output: Connection refused by host
[18:58:45] ^demon: You know really you're going to end up forking releases, cherry picking patches and building custom versions :D
[18:58:47] RECOVERY Current Load is now: OK on mobile-solr i-000004d1.pmtpa.wmflabs output: OK - load average: 0.05, 0.53, 0.46
[18:59:16] <^demon> Already am. I'm afraid the ldap fix won't make stable-2.5, and I'll have to do that.
[18:59:23] <^demon> (Our 2.4.2 is custom)
[18:59:37] RECOVERY Current Users is now: OK on mobile-solr i-000004d1.pmtpa.wmflabs output: USERS OK - 1 users currently logged in
[18:59:50] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[19:00:40] surely they can't break ldap in a -stable release
[19:01:46] <^demon> Well it took a day to convince them it was even broken, sooooo ;-)
[19:02:04] lol
[19:02:41] <^demon> It's been a week now :(
[19:05:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[19:06:54] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 10.28, 9.59, 6.85
[19:09:52] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 8.96, 8.44, 6.08
[19:11:44] http://i.imgur.com/rIuRI.jpg <
[19:30:24] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[19:58:49] Hey hashar, have time for a few questions about beta?
[20:00:17] sure
[20:00:24] csteipp: let me fix it first :-)
[20:00:30] Oh, no problem!
[20:00:32] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 283 processes
[20:00:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[20:01:38] csteipp: listening to you :)
[20:02:22] PROBLEM Disk Space is now: WARNING on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK WARNING - free space: /export 763 MB (4% inode=58%): /home/SAVE 763 MB (4% inode=58%):
[20:02:54] I was reading through addWiki.php, and see it updates all.dblist, and wikiversions.dat... should I push those updates into gerrit after the run? Or does another script handle that?
[20:03:14] hmm well
[20:03:18] that is part of the current madness
[20:03:28] all.dblist and wikiversions.dat are shared by both production and beta
[20:03:36] but actually have different content :-(((
[20:03:42] Ah...
[20:03:56] which is a huuuugugggggeee trouble
[20:03:56] I can see how that would cause some issues.
[20:04:03] I want to replace all.dblist by something like all-wmflabs.dblist
[20:04:05] <^demon> csteipp: Updates to?
[20:04:24] ^demon: yep... wikivoyage
[20:04:27] <^demon> (I was poking at addWiki the other day, which is why I asked)
[20:04:29] plus we have no more captcha on beta :(
[20:04:30] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&returnto=Main+Page&type=signup
[20:04:39] I saw that this morning
[20:04:47] But couldn't figure out why
[20:04:54] 1489: $wgCaptchaDirectory = '/mnt/upload7/private/captcha';
[20:04:56] YEAHHHH
[20:07:17] Hm... so ^demon, since you're also working on addWiki-- if we're adding in a whole new suffix, do we need to update refresh-dblist too?
[20:07:44] <^demon> Yes. You'll need to tweak refresh-dblist to recognize the new suffix.
[20:07:59] cool
[20:08:09] <^demon> addWiki *should* be fine as-is. It just updates all.dblist and then calls refresh-dblist (which splits it into the wikifoo.dblist files)
[20:08:57] hashar, I made a patch for that
[20:09:13] ah, you commented
[20:09:20] yeah
[20:09:26] currently creating the upload7 directory
[20:09:29] and moving the captchas
[20:09:40] <^demon> csteipp: I started this etherpad for Wikidata, but it mostly applies to wikivoyage too. You might find it useful (http://etherpad.wikimedia.org/WikidataWiki).
[20:09:48] there isn't anything pointing to upload6 ?
[20:09:55] <^demon> At some point, this all needs real docs on wikitech. But it only happens every ~6 years or so.
[20:10:07] I've noticed :)
[20:11:02] Platonides: I have created the /data/upload7 directory and moved private/captcha from upload6 to that.
[20:11:14] Platonides: want me to amend the puppet change ?
[20:11:22] I just sent another one
[20:11:41] I think we should have upload6 -> upload7 then
[20:12:08] just one name
[20:12:08] https://gerrit.wikimedia.org/r/#/c/28391/ still has only one patchset
[20:12:22] we need beta to match production as closely as possible
[20:12:22] it was waiting on 'Is this really what you meant to do?'
[20:12:27] so if prod uses upload7, beta should too :-]
[20:12:39] ^demon: are your changes for "All of mediawiki-config will need work" in gerrit?
[20:12:55] <^demon> No, none have been submitted yet.
[20:12:56] That's the part I've been trying to figure out today :)
[20:13:23] <^demon> At the very least, all the CentralAuth stuff needs tweaking. And per Roan, the siteFromDB() code may need looking at to make sure it'll work with the new TLD.
[20:13:41] which new TLD?
[20:13:57] <^demon> wikivoyage and wikidata are both being set up. [20:14:10] oh, right [20:14:29] it should be straightforward [20:14:57] Platonides: commented on https://gerrit.wikimedia.org/r/#/c/28391/ . Just replace the target to point to the newly created upload7 [20:14:58] <^demon> Platonides: Yeah hopefully. It's just been a long time so I'm afraid of breaking stuff :) [20:15:12] <^demon> We haven't added a new project TLD in years. [20:16:14] hashar, I had done it [20:16:24] but missed the -a in the --amend [20:16:24] ;-} [20:16:26] new one sent [20:16:39] I have a git alias named amend [20:16:41] so much for such a tiny change :P [20:16:47] `git amend' is aliased to `commit --amend' [20:17:01] though that is not going to save you if you forget -a hehe [20:17:06] the problem was forgetting the -a :) [20:17:37] <^demon> Well, `git commit --amend` is just like git commit, it's only going to commit what's staged :) [20:17:55] <^demon> Alternatively, you can git add and then git commit --amend, in case you only want to amend say 1 or 2 files. [20:18:03] yes, I know... yet I still fail sometimes [20:18:27] Platonides: why do you change the upload6 link from /data/project/upload6 to upload7 ? [20:19:05] for them moving everything to upload7 [20:19:46] seems production replaced upload6 with upload7 [20:20:17] <^demon> Yes, so upload7 is more like prod now. [20:20:32] ok [20:20:33] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes [20:20:38] so we have to apply your change [20:20:47] and then rename upload6 to upload7 [20:20:52] yes [20:20:54] moving back the captcha dir ;-) [20:21:33] just merge it :P [20:21:40] I wish I could [20:21:48] I thought you could [20:22:18] ssh: connect to host gerrit.wikimedia.org port 29418: Connection refused [20:22:19] :( [20:22:24] me too [20:22:25] wtf [20:22:43] Gerrit web interface is serving me 503s [20:22:48] Gerrit hates you [20:23:18] works now [20:23:45] hashar, I also sent https://gerrit.wikimedia.org/r/#/c/28442/ to change to upload7 the rest of the references [20:24:28] it was ^demon's fault [20:24:56] Platonides: so I fixed your change : https://gerrit.wikimedia.org/r/28391 ;D [20:26:16] will get someone to merge it [20:26:27] though SF Ops are probably all eating ;-] [20:26:41] Ryan_Lane just joined here [20:26:41] Ryan_Lane, can you merge 28391 and 28442 ? [20:26:51] !gerrit floating_ips [20:26:51] https://gerrit.wikimedia.org/ [20:26:55] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 3.85, 4.05, 4.98 [20:26:55] bah [20:26:59] !change 28391 [20:26:59] https://gerrit.wikimedia.org/r/#q,28391,n,z [20:27:09] !change 28442 [20:27:09] https://gerrit.wikimedia.org/r/#q,28442,n,z [20:27:55] <^demon> Ryan_Lane: I just restarted gerrit btw. Our custom build was slightly borked. [20:27:56] ummm [20:28:00] whatever floats your ..ips.. [20:28:15] Platonides: 28442 affects production [20:28:23] I'm not terribly comfortable with this [20:28:34] oh [20:28:34] waut [20:28:34] wait [20:28:35] no it doesn't [20:28:39] see the filename suffix [20:29:04] actually, the move of server in production broke labs [20:29:37] we are trying to move the labs names up to date with production ones [20:29:44] Platonides: Ryan merged the changes :]] [20:30:13] * Ryan_Lane nods [20:30:13] yeah. I didn't catch the suffix [20:30:27] !log deployment-prep moving /data/project/upload6 to /data/project/upload7 to match production. 
[20:30:27] see in an ideal world we'd just load wmflabs.php in from the settings then allow hashar to merge changes to that file without ops review
[20:30:27] Logged the message, Master
[20:30:32] * Damianz dreams about his perfect world
[20:30:55] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[20:31:28] and hashar moved it
[20:31:34] hashar, did you rerun puppet?
[20:31:43] no need
[20:31:46] but I need to update the conf
[20:32:15] no need?
[20:32:23] the webservers need /mnt/upload7
[20:32:35] or did you create them manually?
[20:33:17] I moved the file
[20:33:19] will run puppet to apply the new manifest
[20:33:19] well, night
[20:33:45] error: insufficient permission for adding an object to repository database .git/objects
[20:33:47] boabhahboabhab
[20:34:02] Platonides: thank you for pointing out the root cause
[20:34:27] it was quite easy
[20:34:44] I was seeing "out of captchas" = there are plenty of captchas there! xD
[20:34:55] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 2.89, 3.27, 4.46
[20:35:05] Platonides: yeah the error message is really bad
[20:35:09] so I obviously went to look where it was looking for them
[20:35:09] and spotted it
[20:35:23] well, the error message was right
[20:35:27] there were no captchas in the new path :)
[20:35:55] * Damianz captchas Platonides
[20:36:19] * Platonides sends a captcha flood to Damianz
[20:37:15] hmm
[20:37:51] what's the point of test wiki vs beta? can we kill test (after fixing beta)?
[20:38:29] Damianz: no
[20:38:53] test is meant to test a deploy
[20:38:53] how I hate duplicated stuff
[20:39:07] beta is meant to test changes before they are to be deployed
[20:39:07] surely you should stage a deploy not on prod?
[20:39:27] it's good to know your deploy isn't going to break things
[20:39:49] because that /never/ happens :)
[20:40:09] it happens less when people use test
[20:40:51] Damianz: test/test2 should arguably be called "staging" and/or "pre-production", environments directly connected to prod. beta is (becoming) a place to take more risks than are possible on test/test2
[20:40:59] 10/17/2012 - 20:40:58 - User sugar has been renamed, moving home directory in project(s): sugarcrm
[20:41:00] dunno, we use dev, uat (staging) and prod vms. Everything goes on staging before it hits prod and the staging servers get cloned from prod so the data is identical.
[20:41:13] also </3 vmware, I miss kvm
[20:42:22] Imo for a deploy we should 'package' what is being pushed, clone out a setup identical to production, run through the upgrade steps, qa test that then repeat in prod
[20:42:44] Since as I understand it you can break prod from test doing some things
[20:42:54] Damianz: beta is also a place where can point high-level automated tests. Again, we're not quite in a position where gerrit/Jenkins/beta are all lined up right, but it's coming.
[20:43:03] where we
[20:43:53] !log deployment-prep applying nfs::upload::labs to apache32 and 33. It is no longer applied by the role::applicationserver class (prod applies nfs::upload directly on nodes)
[20:43:55] Logged the message, Master
[20:44:22] what is this "identical to production" you speak of? ;)
[20:44:44] See imo water testing shouldn't be on beta - it should be part of CI and spun up on a clean setup with a known config.
[20:46:34] Damianz: beta isn't yet close enough to production for that [20:46:37] agreed that it should be part of CI. agreed that it should be a known config. [20:46:48] also, we run beta from trunk [20:46:50] not from wmf branches [20:46:57] yeah [20:47:05] where testing should be done against every branch supported. [20:47:09] and also master [20:47:15] but that depends on workflow [20:47:15] thus beta config piggybacks on operations/mediawiki-config in gerrit [20:47:16] agreed. that's not easy, either, though [20:47:19] and if master is stable or tags are [20:48:14] the prod mediawiki config is a freaking mess, it needs turning into roles with generic sections and cluster-specific overrides. Hacks on top of hacks make for a mess. [20:48:18] for that matter, prod itself is probably less of a "known config" than beta is. [20:48:54] The other problem is, large parts of prod aren't in puppet (think squid) and pushing the beta stuff back into prod won't be super easy due to the risk involved. [20:48:57] I keep stumbling over ad hoc stuff in prod to be automated for beta [20:49:15] Damianz: squid is going to be phased out I think [20:49:22] varnish? [20:49:26] * Damianz yays if so [20:49:30] Damianz: thus I said in an email a while back "the best way to improve beta is to use it" [20:49:31] plus that is honestly not the most sensible part for beta [20:49:55] or complain about what you don't like :P [20:50:04] chrismcmahon, we should have started by copying everything from production to beta [20:50:08] then improving there [20:50:21] Platonides: agreed. that's not what happened. [20:50:24] Well really we should have cloned prod boxes rather than reverse engineering blind [20:50:32] making things new (hopefully better) then makes merges difficult [20:50:48] But beta serves as an ok base to at least get everything configured and hope ops will push that back into prod so we can *easily* replicate the cluster [20:50:49] yes, but we weren't given that to clone :( [20:51:00] It doesn't help that puppet is a fucking mess and impossible to use outside prod/labs [20:51:04] as Mr. Weinberg said "things got this way because things got this way" [20:51:14] :) [20:51:28] good night [20:51:55] thanks Platonides [20:53:20] Damianz: that is more or less what I did [20:53:32] Damianz: when I landed on beta, I started using the production puppet classes [20:53:42] so beta is mostly running from the production puppet classes :-] [20:54:00] of course production keeps evolving / refactoring so it is hard to follow up [20:54:09] it's the mostly part that I hate :) Anything changed in production should be possible to test on beta first [20:54:21] ideally ops should work on beta [20:54:24] And there's no active feedback ops-wise for changes to make that really work [20:54:28] and they should also test beta... [20:54:37] Also we suck for things like lvs/bgp etc that's in prod [20:55:04] but then ops are overwhelmed with production issues and lots of problems are really only going to be faced on the production cluster [20:55:37] Firefighting doesn't improve stuff long term though, it just makes people grumpy and the issues reappear [20:55:58] this isn't firefighting [20:56:24] Imo as part of a change management process, unless it's service affecting, *nothing* should change on prod without going through beta. Anything changed for service-affecting issues should be retroactively put through beta [20:56:50] Damianz: there is a bit of firefighting I guess. But they are also doing a toooon of stuff behind the scenes.
So I am never going to blame ops. [20:57:08] and yeah, we definitely need industrial processes ;-] [20:57:14] we are still young and small hehe [20:57:35] I have been in a company where changing a comma in a shell script would need like a month to get deployed [20:57:50] Same issues at the day job heh. Trying to move from a weird setup to a totally automated setup. [20:58:13] before puppet, we had to poke ops online to change files that belong to the root user [20:58:15] that was a pity [20:58:28] nowadays I push to gerrit, wait a few days and eventually poke someone to have a look at it [20:58:31] much better for them [20:58:50] cause all the trivial stuff can be handled in batches during a code review sprint [20:59:04] Gerrit is ok, usually involves grabbing someone specific though. Not really actively reviewed imo (even if linked back from bz in some places). [20:59:06] but yeah, I definitely agree our processes need to improve :-] But I believe we are on the right track [20:59:09] Same with patches in bz, they get ignored. [20:59:18] (heh we deploy mediawiki every two weeks straight from master !!! ) [20:59:32] yeah bz patches have been a recurring issue :-( [20:59:50] I believe the pre-commit review workflow with gerrit helped a bit on that front [21:00:01] and Andree (the new bug wrangler) is probably looking at it [21:00:07] Sumanah helped a ton too by getting developers to look at the patches [21:00:17] hexmode did a lot of copying from bugzilla to gerrit [21:00:24] so we improved this year at least :) [21:00:29] (I'm trying to be optimistic hehe) [21:00:32] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 283 processes [21:00:45] The gerrit way encourages collab more (though I prefer the github PR format). The "hey, you need x, y, z, 2 hands, 1 foot and a nice person to get an svn account, *then* you can add code" thing just sucked. [21:00:51] (and by we I mean staff + contractor + community + volunteers … i.e. everyone) [21:00:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [21:00:57] Though the problem there is lots of tools are not in git yet [21:01:45] if you are aware of such a tool, please shout about it on wikitech-l or possibly bugzilla under the git/gerrit component [21:02:00] Chad and I do some svn to git conversion from time to time [21:02:10] or poke whoever wrote the tool to add it in a git repo [21:02:36] I don't think we have any list yet of what is missing [21:03:18] fwiw, the next major piece for beta is to deploy automatically from gerrit in Jenkins builds, so that a) we can see if a deploy build failed and b) hang automated tests off that build using Jenkins [21:03:34] chrismcmahon: so the captchas are still problematic and I have no real idea what is wrong there :/ Apparently it can't find the files [21:03:49] chrismcmahon: giving up for tonight. Too late and I've got a bunch of bugs to open before heading to bed [21:03:55] along the way we discover things like we don't actually manage the contents of the prod db very well. [21:04:02] thanks hashar [21:04:23] that's probably going to take a chunk of jenkins work but it would be nice to have proper (integration/water) testing in place along with unit tests... also btw do we fail builds on stuff like code coverage? [21:04:59] chrismcmahon: adding a nice error message would probably help. "Ran out of captcha images" does not help at all.
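(On the coverage question at 21:04:23: a hedged sketch of how a coverage report could be generated for MediaWiki's PHPUnit suite. It assumes a MediaWiki checkout with PHPUnit and Xdebug available; the output directory name is arbitrary.)

    # run the test suite through MediaWiki's PHPUnit wrapper and emit an
    # HTML coverage report; a fully green run that touches only a sliver
    # of the code shows up immediately in the report
    php tests/phpunit/phpunit.php --coverage-html coverage/

A CI job could then fail the build when reported coverage drops below an agreed threshold, which is the guard against the "100% successful, 1% covered" scenario Damianz raises at 21:05:51.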
[21:05:10] Damianz: any serious code coverage analysis is a ways away, but I know we've been discussing it for some time. [21:05:29] Damianz: I am still reimplementing the Jenkins jobs ;-( [21:05:46] Damianz: and yeah code coverage would be great. Chris is pushing for it ;D [21:05:51] This is the problem with unit tests, they can be 100% successful and cover 1% of the code but as long as it's 100% everything's great... [21:05:51] I just keep slowing him down hehe [21:06:24] actually it was Erik who first mentioned code coverage to me in the course of my hiring interviews, but some other things take precedence [21:07:43] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - major) [Bug 41132] live hack in beta mediawiki-config (tracking) - https://bugzilla.wikimedia.org/show_bug.cgi?id=41132 [21:07:56] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 10.48, 9.38, 6.18 [21:08:53] Btw, now that gerrit supports replication in a kinda sane way, were you going to move beta to that so it's updated near live, or stick to the git pull loop? [21:08:53] RECOVERY SSH is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:09:44] <^demon> gerrit can replicate to anywhere with git + ssh installed. [21:09:53] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 12.34, 11.01, 7.65 [21:09:53] * Damianz would really love to be able to deploy a tag, branch or ref in a nice ui for proper 'hey it's broken in version x' and you can step through to see what changed (code diffs) against what's happening (error logs/output). [21:10:05] <^demon> All it's doing is pushing refs. But since you'd want a non-bare copy, you'd have to do some magic. [21:10:33] <^demon> Either `git reset --hard HEAD` all the time. Or pull from the bare copy to the live copy. [21:10:33] ^demon: Probably the issue would be submodules, dunno. I just dislike the while true; do git pull; sleep; done thing [21:10:33] let's get master reliable first, eh? [21:10:49] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41133] beta all.dblist is a live hack - https://bugzilla.wikimedia.org/show_bug.cgi?id=41133 [21:10:58] master is boring and when you say that you ignore the broken unit tests in the previous versions :D [21:11:04] 10/17/2012 - 21:10:59 - User sugar has been renamed, moving home directory in project(s): sugarcrm [21:11:20] <^demon> The unit tests didn't even work until like a year and a half ago ;-) [21:12:14] ^demon, you could make gerrit run something different when pushing the replication [21:12:23] then that script could do the git pull [21:12:36] <^demon> There's no post-replicate hook :\ [21:12:47] ^demon, just the replicate action [21:13:15] and remember that even if gerrit sends a push command, we can override that locally in authorized_keys :) [21:13:37] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41134] beta wikiversions.dat is a live hack - https://bugzilla.wikimedia.org/show_bug.cgi?id=41134 [21:13:44] just call a script that accepts the pull [21:13:50] and then checks out [21:14:05] preferably pulling in another copy [21:14:10] it may not like getting pushed to a non-bare [21:14:20] <^demon> If the destination isn't a bare repo, you could just have it replicate to that copy.
Then the only thing you'd need to do is `git reset --hard HEAD` after replication. [21:14:23] although I think it can be forced [21:14:24] <^demon> Rather than having 2 clones. [21:14:34] <^demon> Yeah, you can force it but you have to reset after push [21:14:51] either way, it can be done easily [21:16:09] <^demon> Indeed. I was saying it's not hard. [21:16:16] <^demon> The replication is fairly robust. [21:16:30] except when we lose refs :P [21:16:52] The alternative is to replicate everything to an r/o share and rsync/git clone off that, which doesn't solve the auto-updating problem but might be useful for stuff like puppet. [21:16:53] PROBLEM SSH is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: Server answer: [21:17:00] <^demon> Platonides: Yeah, well it doesn't know any better :p [21:17:06] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41135] on beta wmf-config/extension-list is a live hack - https://bugzilla.wikimedia.org/show_bug.cgi?id=41135 [21:18:09] <^demon> Anyway, to shut git up about pushing to a non-bare repo you just have to `git config receive.denycurrentbranch ignore` on the destination end. [21:18:16] Damianz, git should have a mode where fetching refs doesn't overwrite existing ones [21:18:30] merge ? :-) [21:18:50] hashar? [21:19:26] [bz] (RESOLVED - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 39759] phase out deployment-backup - https://bugzilla.wikimedia.org/show_bug.cgi?id=39759 [21:19:48] bahhh [21:19:55] https://bugzilla.wikimedia.org/show_bug.cgi?id=40991 - squid001 gives "Zero Sized Reply" error on POST request [21:19:58] merge sucks for what we really want on beta [21:20:23] git reset --hard; git clean -f -d; git pull # more like it [21:20:52] I don't do merges in git [21:21:09] I do 'git up' [21:21:19] I have it as an alias: up = pull --ff-only [21:21:46] [bz] (ASSIGNED - created by: Sumana Harihareswara, priority: Normal - critical) [Bug 40991] squid001 gives "Zero Sized Reply" error on POST request - https://bugzilla.wikimedia.org/show_bug.cgi?id=40991 [21:21:46] so if I'm not on origin/master it doesn't automatically merge where it shouldn't [21:23:10] [bz] (RESOLVED - created by: Antoine "hashar" Musso, priority: Low - minor) [Bug 37061] rebuild localisation cache whenever needed - https://bugzilla.wikimedia.org/show_bug.cgi?id=37061 [21:23:27] so I am out for bed now :-] [21:23:28] The most annoying thing to do is forget you're on a detached branch and pull in the master branch, then end up in 'hey you have uncommitted stuff' hell.
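(Putting ^demon's replication pieces together: a minimal sketch of configuring a non-bare replication target, using the two commands quoted in the discussion above. The path is illustrative, and the last line only gestures at Platonides' authorized_keys idea with a hypothetical script name.)

    # on the destination, inside the live (non-bare) checkout
    cd /srv/beta/mediawiki         # illustrative path

    # stop git complaining when gerrit pushes to the checked-out branch
    git config receive.denycurrentbranch ignore

    # after each replication push, sync the working tree to the new ref
    git reset --hard HEAD

    # Platonides' variant: force a command in ~/.ssh/authorized_keys so
    # the incoming push triggers the reset automatically, e.g.
    # command="/usr/local/bin/receive-and-reset" ssh-rsa AAAA... (hypothetical script)

Without the reset, the working tree silently lags behind the pushed ref, which is why ^demon says you "have to reset after push".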
[21:23:39] * Damianz tucks hashar in and wishes a good night [21:24:07] me too [21:25:17] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Normal - normal) [Bug 36748] [OPS] syslog::server (in test) unusuable - https://bugzilla.wikimedia.org/show_bug.cgi?id=36748 [21:27:24] PROBLEM Current Users is now: WARNING on ve-nodejs i-00000245.pmtpa.wmflabs output: USERS WARNING - 7 users currently logged in [21:27:53] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 1.17, 2.59, 4.92 [21:29:53] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 1.17, 2.30, 4.75 [21:30:41] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes [21:30:55] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [21:32:23] RECOVERY Current Users is now: OK on ve-nodejs i-00000245.pmtpa.wmflabs output: USERS OK - 4 users currently logged in [21:35:01] baba [21:35:01] I am really out now [21:35:01] have a good night! [21:35:52] PROBLEM Disk Space is now: CRITICAL on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK CRITICAL - free space: / 286 MB (2% inode=89%): [21:40:52] PROBLEM Disk Space is now: WARNING on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK WARNING - free space: / 288 MB (3% inode=89%): [21:50:56] PROBLEM Disk Space is now: CRITICAL on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK CRITICAL - free space: / 278 MB (2% inode=89%): [22:00:32] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 283 processes [22:00:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [22:22:52] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 8.55, 9.22, 6.04 [22:25:16] Yay to annoying random project people unintentionally, some people need to understand personality moar. [22:30:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 189 processes [22:30:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [22:35:52] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 8.61, 8.36, 6.25 [22:44:29] hmm, just tried to create an instance: There were no Nova credentials found for your user account. Please ask a Nova administrator to create credentials for you. [22:46:50] any nova administrators around? [22:47:07] Ryan_Lane: ping ;) [22:47:30] logout [22:47:31] login [22:47:35] known issue [22:50:20] ok, trying.. [22:50:33] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 96 processes [22:51:07] gwicke: did you change your preferences? [22:51:19] worked after logging in again [22:51:32] Ryan_Lane: not recently I think [22:52:28] I want to kill that message and just force a logout tbh [22:52:42] or actually fix mediawiki, but screw that [22:52:44] Failed to create instance. [22:52:55] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 154 processes [22:52:58] is there any way to debug why it failed to create an instance?
[22:53:09] yeah, ask Ryan_Lane to look at the logs [22:53:14] what settings/name did you use [22:53:18] gwicke: which project? [22:53:25] visualeditor [22:53:27] sec [22:53:47] tried to get an eight-core vm [22:53:51] for round-trip testing [22:54:39] we don't really need that much RAM btw, but definitely CPU time [22:55:04] you hit a quota [22:55:30] gimme a min [22:55:43] we really need less crap error handling and actually like report quota use [22:56:01] Ryan_Lane: we'll probably need some more instances if possible [22:56:14] the more CPU time we have, the more articles we can actually test [22:57:09] are you doing performance testing? [22:57:09] no, round-trip testing [22:57:09] ah [22:57:09] on a dump of wikipedia [22:57:09] * Ryan_Lane nods [22:57:31] how many are you creating? [22:57:41] gwicke: How did you select those 100k articles, BTW? [22:57:52] just did a random sample for now [22:58:06] Right [22:58:28] if we had the resources, we'd test all of them to be sure to catch all serious breakage [22:58:33] If you followed the article criteria (for the purposes of Special:Statistics), that's already like 5% of all articles [22:58:48] Ryan_Lane: how many could we use at most? [22:58:52] over 9000 [22:58:54] a lot, theoretically [22:59:20] Ryan_Lane: we don't need them forever, just a few big runs I guess [22:59:23] until December [22:59:33] could tear them down in the meantime if resources are tight [23:00:25] only need to apt-get install node and run something from nfs, so setup is minimal [23:00:46] use puppet [23:00:54] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [23:01:57] Damianz: yeah, should look into that [23:06:09] 10/17/2012 - 23:06:09 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:07:43] * Damianz wanders off to debate the point of life [23:11:18] 10/17/2012 - 23:11:18 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:12:53] RECOVERY Total processes is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS OK: 149 processes [23:15:59] Ryan_Lane: should instance creation work now? I still get the same error. [23:16:07] gwicke: sorry, I was fixing another problem [23:16:13] 10/17/2012 - 23:16:11 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:16:37] Ryan_Lane: ok, just pong when something changes [23:21:13] 10/17/2012 - 23:21:10 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:25:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 156 processes [23:26:19] 10/17/2012 - 23:26:18 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:28:31] gwicke: try now [23:28:57] what. the. fuck is SugarCE-Full-6.5.5? [23:29:03] is that *really* a user? [23:29:25] and if it is, who did this?
:) [23:29:28] !project sugarcrm [23:29:28] There are multiple keys, refine your input: project-access, project-discuss, projects, [23:29:40] !projects sugarcrm [23:29:40] https://labsconsole.wikimedia.org/w/index.php?title=Special:Ask&q=[[Resource+Type%3A%3Aproject]]&p=format%3Dbroadtable%2Fheaders%3Dshow%2Flink%3Dall%2Fsearchlabel%3D%E2%80%A6-20further-20results%2Fclass%3Dsortable-20wikitable-20smwtable&po=%3FMember%0A%3FDescription%0A&limit=500&eq=no [23:29:47] !resource sugarcrm [23:29:47] https://labsconsole.wikimedia.org/wiki/Nova_Resource:sugarcrm [23:30:07] Ryan_Lane: success, thanks! [23:30:11] cool. yw [23:30:21] I increased your number of cores by 20 [23:30:29] maybe I should have done 24 [23:30:55] I may need to increase memory at some point too [23:31:04] 10/17/2012 - 23:31:04 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:32:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [23:34:01] Ryan_Lane: the instance still has a status 'error', is it just a matter of waiting until it is set up? [23:34:20] no [23:34:21] if it has error, there's a problem [23:34:27] gimme a sec [23:35:19] DetachedInstanceError.... [23:35:30] weird [23:36:13] 10/17/2012 - 23:36:12 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:36:59] gwicke: delete/create [23:37:08] I restarted nova-compute on a couple nodes [23:37:27] there's some open bugs that'll be fixed whenever ubuntu releases the update [23:38:53] ok, trying.. [23:39:22] RECOVERY dpkg-check is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: All packages OK [23:39:43] RECOVERY Current Load is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: OK - load average: 0.00, 0.06, 0.07 [23:39:58] failed to create instance [23:40:02] PROBLEM host: i-000004d2.pmtpa.wmflabs is DOWN address: i-000004d2.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000004d2.pmtpa.wmflabs) [23:40:13] it must not have deleted the old one yet [23:40:33] ugh, all the compute services are showing as down [23:40:43] RECOVERY Current Users is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: USERS OK - 0 users currently logged in [23:40:43] RECOVERY Free ram is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: OK: 628% free memory [23:40:53] Successfully deleted instance, but failed to remove parsoid-roundtrip3 DNS entry. [23:40:57] yep [23:41:02] 10/17/2012 - 23:41:02 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:41:11] the compute services are still restarting [23:41:26] so until they are up, nova won't delete or create things [23:41:35] I need to prod canonical [23:41:39] they need to release the updates [23:41:48] ok ;) [23:41:54] so, give it a little bit [23:42:02] maybe 10-15 minutes [23:42:11] I got to run, will try again tomorrow [23:42:17] ok [23:42:17] delete & recreate [23:42:20] sorry it's not working for you [23:42:26] thanks for your efforts! [23:44:12] yw [23:46:01] 10/17/2012 - 23:46:01 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:47:25] PROBLEM dpkg-check is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [23:47:54] PROBLEM Current Load is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:48:52] PROBLEM Current Users is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:52] PROBLEM Free ram is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [23:51:13] 10/17/2012 - 23:51:09 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:56:17] 10/17/2012 - 23:56:16 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm