[00:13:54] PROBLEM Current Load is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:14:36] PROBLEM Current Users is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:15:12] PROBLEM Disk Space is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:16:02] PROBLEM Free ram is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:17:22] PROBLEM Total processes is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:18:14] PROBLEM dpkg-check is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: Connection refused by host
[00:18:55] RECOVERY Current Load is now: OK on userstats i-000004ca.pmtpa.wmflabs output: OK - load average: 1.13, 0.97, 0.50
[00:19:12] PROBLEM host: i-000003cc.pmtpa.wmflabs is DOWN address: i-000003cc.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003cc.pmtpa.wmflabs)
[00:19:36] RECOVERY Current Users is now: OK on userstats i-000004ca.pmtpa.wmflabs output: USERS OK - 1 users currently logged in
[00:19:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[00:20:16] RECOVERY Disk Space is now: OK on userstats i-000004ca.pmtpa.wmflabs output: DISK OK
[00:21:03] RECOVERY Free ram is now: OK on userstats i-000004ca.pmtpa.wmflabs output: OK: 859% free memory
[00:22:23] PROBLEM Disk Space is now: WARNING on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK WARNING - free space: /export 956 MB (5% inode=59%): /home/SAVE 956 MB (5% inode=59%):
[00:22:23] RECOVERY Total processes is now: OK on userstats i-000004ca.pmtpa.wmflabs output: PROCS OK: 90 processes
[00:23:13] RECOVERY dpkg-check is now: OK on userstats i-000004ca.pmtpa.wmflabs output: All packages OK
[00:49:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[01:04:52] PROBLEM Total processes is now: WARNING on bots-salebot i-00000457.pmtpa.wmflabs output: PROCS WARNING: 173 processes
[01:09:53] RECOVERY Total processes is now: OK on bots-salebot i-00000457.pmtpa.wmflabs output: PROCS OK: 94 processes
[01:19:49] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[01:50:55] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[02:22:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[02:41:43] RECOVERY Free ram is now: OK on bots-sql2 i-000000af.pmtpa.wmflabs output: OK: 20% free memory
[02:52:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[02:59:42] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af.pmtpa.wmflabs output: Warning: 14% free memory
[03:22:05] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[03:53:55] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f.pmtpa.wmflabs output: Warning: 14% free memory
[03:54:13] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[04:13:52] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f.pmtpa.wmflabs output: Critical: 4% free memory
[04:23:52] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f.pmtpa.wmflabs output: OK: 95% free memory
[04:24:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[04:34:52] PROBLEM Disk Space is now: WARNING on echo-xmpp i-00000351.pmtpa.wmflabs output: DISK WARNING - free space: / 546 MB (5% inode=91%):
[04:54:23] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[05:24:25] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[05:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[05:59:45] PROBLEM Free ram is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: NRPE: Unable to read output
[06:03:32] PROBLEM Current Users is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:02] PROBLEM Disk Space is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:22] PROBLEM Current Load is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:02] PROBLEM dpkg-check is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:53] PROBLEM SSH is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: Server answer:
[06:06:33] PROBLEM Total processes is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
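The aggregator2 alerts just above ("NRPE: Unable to read output", "CHECK_NRPE: Error - Could not complete SSL handshake.") point at the NRPE agent on the instance itself rather than at the services being checked. A minimal sketch of how one might confirm that by hand, assuming the standard Nagios plugin path and the stock Ubuntu service name:

    # From the monitoring host: with no -c argument, a healthy NRPE
    # daemon simply answers with its version string.
    /usr/lib/nagios/plugins/check_nrpe -H i-000002c0.pmtpa.wmflabs

    # Run one of the configured checks remotely, e.g. the disk check.
    /usr/lib/nagios/plugins/check_nrpe -H i-000002c0.pmtpa.wmflabs -c check_disk

    # On the instance: restart the agent if the handshake errors persist.
    sudo service nagios-nrpe-server restart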
[06:24:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[06:40:23] PROBLEM Total processes is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: PROCS CRITICAL: 348 processes
[06:41:54] PROBLEM Current Load is now: CRITICAL on userstats i-000004ca.pmtpa.wmflabs output: CRITICAL - load average: 97.84, 50.40, 29.13
[06:44:53] RECOVERY Disk Space is now: OK on echo-xmpp i-00000351.pmtpa.wmflabs output: DISK OK
[06:50:22] RECOVERY Total processes is now: OK on userstats i-000004ca.pmtpa.wmflabs output: PROCS OK: 94 processes
[06:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[07:01:52] PROBLEM Current Load is now: WARNING on userstats i-000004ca.pmtpa.wmflabs output: WARNING - load average: 0.90, 4.09, 16.53
[07:21:52] RECOVERY Current Load is now: OK on userstats i-000004ca.pmtpa.wmflabs output: OK - load average: 0.01, 0.05, 0.03
[07:23:32] RECOVERY Current Users is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: USERS OK - 0 users currently logged in
[07:24:04] RECOVERY Disk Space is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: DISK OK
[07:24:22] RECOVERY Current Load is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: OK - load average: 0.27, 0.97, 1.15
[07:24:45] PROBLEM Free ram is now: WARNING on aggregator2 i-000002c0.pmtpa.wmflabs output: Warning: 8% free memory
[07:24:45] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[07:25:54] RECOVERY dpkg-check is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: All packages OK
[07:25:54] RECOVERY SSH is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[07:26:36] RECOVERY Total processes is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: PROCS OK: 230 processes
[07:31:23] Change on mediawiki a page Wikimedia Labs/Toolserver features wanted in Tool Labs was modified, changed by Nemo bis link https://www.mediawiki.org/w/index.php?diff=594553 edit summary: JIRA public permalink
[07:35:02] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 6.56, 7.42, 5.99
[07:43:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 151 processes
[07:45:02] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 7.11, 8.02, 6.13
[07:53:52] RECOVERY Total processes is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS OK: 147 processes
[07:54:42] PROBLEM Free ram is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: NRPE: Unable to read output
[07:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[07:55:21] seems the labsconsole no longer lets us connect :-]
[07:56:04] telling me "Labs uses cookie to log in users. You have cookies disabled"
[07:56:10] ;-D
[07:56:19] Ryan_Lane: are you awake and doing any modification to labsconsole ?
[07:57:05] PROBLEM Disk Space is now: UNKNOWN on aggregator2 i-000002c0.pmtpa.wmflabs output: NRPE: Call to fork() failed
[07:57:23] PROBLEM Current Load is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:58:54] PROBLEM dpkg-check is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:58:58] hashar: memcached died for some reason
[07:59:10] :-(
[07:59:15] I'm awake, but wasn't making changes
[07:59:26] I'm wanting to stab hp cloud in its face
[07:59:35] PROBLEM Total processes is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:01:14] Ryan_Lane: what did they do wrong ?
[08:01:32] PROBLEM Current Users is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:01:55] they scheduled maintenance from 1am till 3am my time
[08:02:05] PROBLEM Disk Space is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:02:09] for a service that should never go down
[08:02:10] ever
[08:02:38] maybe they need to change their power circuits
[08:02:48] it's for block storage
[08:03:04] there's never a reason to take it down for 2 houts
[08:03:06] hours
[08:03:15] their original maintenance window was 8 hours
[08:03:32] I had to convince them that I couldn't take downtime for 8 hours
[08:04:02] PROBLEM SSH is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: Server answer:
[08:04:56] !log deployment-prep attempting to deploy role::db::core class on deployment-sql02
[08:04:59] Logged the message, Master
[08:05:38] Ryan_Lane: labs working again. Thanks for restarting the memc instance :-]
[08:05:47] yw
[08:06:55] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 157 processes
[08:17:12] !log deployment-prep removed role::db::core from -sql02 : class is not meant for labs :-]
[08:17:12] Logged the message, Master
[08:20:54] PROBLEM Current Load is now: CRITICAL on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: CRITICAL - load average: 24.05, 23.56, 20.91
[08:24:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[08:31:50] !log deployment-prep -sql02 : manually installed mysql server using /mnt/mysql as datadir.
[08:31:52] Logged the message, Master
[08:40:52] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 15.13, 17.87, 19.76
[08:51:18] [bz] (NEW - created by: Nemo_bis, priority: Unprioritized - normal) [Bug 41095] Missing perl dependencies? - https://bugzilla.wikimedia.org/show_bug.cgi?id=41095
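The labsconsole failure discussed above is what MediaWiki reports when its session backend is unreachable: sessions lived in memcached, and the memcached instance had died. A quick liveness check of a memcached server, as a sketch (the hostname is a placeholder; 11211 is the default port, and the service name is an assumption):

    # A healthy memcached answers the stats command; a dead or hung one
    # times out instead.
    echo stats | nc -w 2 memcached-host.pmtpa.wmflabs 11211 | head

    # Restart it if it is gone.
    sudo service memcached restart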
[08:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[09:20:52] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 2.37, 2.80, 4.63
[09:24:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[09:25:53] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 1.94, 2.24, 4.49
[09:35:00] Ryan_Lane: could you have a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=41083
[09:35:32] deployment-video04(10.4.1.19) is unable to connect to mysql at deployment-sql(10.4.0.53)
[09:39:24] hey j^
[09:39:38] hi
[09:39:46] j^: I raised the issue on the labs mailing list
[09:39:59] ryan answered it is a bug in openstack pending packaging in Ubuntu :(
[09:40:06] https://bugzilla.wikimedia.org/show_bug.cgi?id=40526
[09:40:10] will update that bug with his answer
[09:41:15] [bz] (NEW - created by: Antoine "hashar" Musso, priority: High - normal) [Bug 40526] new security rule not applied - https://bugzilla.wikimedia.org/show_bug.cgi?id=40526
[09:41:43] [bz] (NEW - created by: Jan Gerber, priority: Unprioritized - normal) [Bug 41083] unable to connect to deployment-sql from deployment-video04 - https://bugzilla.wikimedia.org/show_bug.cgi?id=41083
[09:42:24] ah ok, any known workaround?
[09:42:37] i could create another instance and hope that it ends up in the 10.4.0.x range
[09:46:05] hmm
[09:46:08] also the sql security rule is no longer around :/
[09:46:33] I have added it back
[09:46:53] so you could try creating a new instance using the default and sql security group
[09:47:14] might be able to apply it
[09:50:32] PROBLEM Current Users is now: UNKNOWN on ganglia-test2 i-00000250.pmtpa.wmflabs output: Invalid host name i-00000250.pmtpa.wmflabs
[09:50:45] ok, while it's building i have another question:
[09:50:47] is commons.wikimedia.beta.wmflabs.org currently using any custom javascript or extensions that are not used in production? since right now TMH does not load (i.e. http://commons.wikimedia.beta.wmflabs.org/wiki/File:Folgers.ogv)
[09:51:37] I am not sure
[09:51:45] it is definitely missing some javascript from the commons production site
[09:52:18] http://commons.wikipedia.beta.wmflabs.org/w/thumb.php?f=Folgers.ogv&width=352 <--- gives an error 500 (internal server error)
[09:52:52] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:02] PROBLEM dpkg-check is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:02] PROBLEM Current Load is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:06] bah it's gone
[09:54:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[09:55:32] PROBLEM Current Users is now: CRITICAL on ganglia-test2 i-00000250.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds.
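Bug 41083 above comes down to a security-group rule not being applied (bug 40526), so TCP to the database port stays filtered between the two instances. A sketch of how one could verify whether a new instance actually reaches the database, using the addresses quoted above (the credentials are placeholders):

    # Raw TCP reachability of deployment-sql's MySQL port from the client instance.
    nc -zv -w 5 10.4.0.53 3306

    # If the port answers, try a real MySQL round trip.
    mysql -h 10.4.0.53 -u some_user -p -e 'SELECT 1;'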
[09:57:42] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf.pmtpa.wmflabs output: Warning: 9% free memory
[09:57:52] RECOVERY dpkg-check is now: OK on aggregator-test1 i-000002bf.pmtpa.wmflabs output: All packages OK
[09:58:52] PROBLEM Total processes is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: Connection refused by host
[09:59:32] PROBLEM dpkg-check is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: Connection refused by host
[10:00:12] PROBLEM Current Load is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: Connection refused by host
[10:03:52] RECOVERY Total processes is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS OK: 103 processes
[10:04:32] RECOVERY dpkg-check is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: All packages OK
[10:05:12] RECOVERY Current Load is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: OK - load average: 0.15, 0.45, 0.41
[10:15:56] Is there somewhere a docu of the Puppet class "git" (for example git::clone)?
[10:17:33] Jan_Luca: not really :/
[10:17:38] Jan_Luca: but I can help :)
[10:17:52] PROBLEM Current Load is now: WARNING on aggregator-test1 i-000002bf.pmtpa.wmflabs output: WARNING - load average: 0.23, 2.10, 17.59
[10:18:17] there is some inline documentation in operations/puppet.git repository. That definition is in manifests/generic-definition.pp
[10:18:32] PROBLEM dpkg-check is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages
[10:18:57] Jan_Luca: I have pasted the inline doc at http://pastebin.com/mGjiyJ8i
[10:23:08] thank you
[10:24:15] hashar: looks like TMH git was not pulled, are extensions pulled automatically in deployment-prep or does it have to be done as needed?
[10:24:33] j^: should be automatically pulled. Let me check
[10:24:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[10:24:46] the autoupdating script might have an error :-]
[10:25:44] j^: TMH is at : * 2ffc1f8 - (HEAD, origin/master, origin/HEAD, master) Localisation updates from http://translatewiki.net. (15 hours ago)
[10:25:56] aka it is up to date
[10:26:11] hashar: yes, but that was only after i pulled manually
[10:26:26] so the updater might have an issue :/
[10:26:47] will keep an eye and ping you instead of pulling next time it happens
[10:27:45] new instance is able to telnet sql server, will install videoscaler class and see if it works from php too
[10:28:13] !log deployment-prep j: create new videoscaler instance deployment-video05 this time with sql access
[10:28:15] Logged the message, Master
[10:31:00] grmblblbl
[10:31:01] !log wikidata-dev wikidata-dev-2: Disabled E3Experiments extension in the test client because it contained a new signup page that did not display captchas and thus prevented account creation.
[10:31:03] Logged the message, Master
[10:31:30] j^: deployment-jobrunner06 is out of disk space, that is where the beta auto updater runs
[10:34:22] !log deployment-prep deployment-jobrunner06 / was filled up by Gluster logs: /var/log/glusterfs/data-project.log filled it all :(
[10:34:22] Logged the message, Master
[10:34:52] that's not cleaned up by logrotate?
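The disk-full problem above (gluster client logs filling / on deployment-jobrunner06) is exactly what bug 41104, filed a little later, tracks: nothing rotates /var/log/glusterfs/*. A minimal logrotate drop-in of the kind that would cap the growth, as a sketch (the file name and retention values are assumptions, not the fix that was eventually deployed):

    # Install a rotation policy for the gluster client logs.
    sudo tee /etc/logrotate.d/glusterfs-client <<'EOF'
    /var/log/glusterfs/*.log {
        weekly
        rotate 4
        compress
        missingok
        notifempty
        copytruncate
    }
    EOF

copytruncate avoids having to signal the glusterfs processes to reopen their log files.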
[10:37:12] RECOVERY Disk Space is now: OK on deployment-jobrunner06 i-0000031d.pmtpa.wmflabs output: DISK OK
[10:39:32] PROBLEM host: i-000004c6.pmtpa.wmflabs is DOWN address: i-000004c6.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000004c6.pmtpa.wmflabs)
[10:41:50] j^: apparently it is not :(
[10:42:53] RECOVERY Current Load is now: OK on aggregator-test1 i-000002bf.pmtpa.wmflabs output: OK - load average: 0.72, 0.43, 3.80
[10:52:59] on the video instance i also get lots of gluster errors in the logs: E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-deployment-prep-project-replicate-0: background entry self-heal failed on /apache/common-local
[10:54:08] yeah got a similar issue in jobrunner06
[10:54:15] I have no idea what it could be
[10:54:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[10:54:50] gluster volume degraded
[10:55:25] what do you mean ?
[10:56:48] to me the errors suggest that the gluster volume is degraded and needs to be repaired
[11:01:36] https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=794699
[11:02:41] fine
[11:02:54] so we want to file a bug and get ops to look at the gluster volume
[11:09:28] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41104] glusterfs log files are not rotated - https://bugzilla.wikimedia.org/show_bug.cgi?id=41104
[11:12:33] j^: I don't have the error on the deployment-integration instance :-]
[11:12:41] might be related to the symbolic links
[11:13:56] or running php-cli (jobrunner, videoscaler)
[11:14:30] ah
[11:18:39] !log deployment-prep emptied /var/log/glusterfs/data-deployment.log huge files on several instances
[11:18:41] Logged the message, Master
[11:19:12] RECOVERY Disk Space is now: OK on deployment-apache33 i-0000031b.pmtpa.wmflabs output: DISK OK
[11:21:13] RECOVERY Disk Space is now: OK on deployment-apache32 i-0000031a.pmtpa.wmflabs output: DISK OK
[11:22:00] !log deployment-prep updated mediawiki-config : 6bbf8f2..7caabad
[11:22:02] Logged the message, Master
[11:24:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[11:32:08] j^: I hopefully properly removed the live hack on beta that was doing the same as https://gerrit.wikimedia.org/r/#/c/28007/
[11:33:38] is /data/project/apache/common/wmf-config tracking git? git log gives an error here
[11:36:44] hashar: PrivateSettings.php~ and InitialiseSettings-wmflabs.php~ look like files that should be removed no?
[11:37:14] they are vim backup files
[11:37:22] so should probably be deleted
[11:37:38] wmf-config had a .git directory I deleted that
[11:38:04] the working copy is in /home/wikipedia/common/ (which is an alias to /data/project/apache/common IIRC)
[11:38:16] j^: which error do you have ?
[11:38:40] git error is now gone
[11:38:43] great :-]
[11:38:54] somehow wmf-config was configured as a git repo
[11:39:00] that might have been the issue
[11:41:27] are all web instances using the same config?
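For the "background entry self-heal failed" errors discussed above, the usual next step is to ask gluster itself about the replica state. A sketch, assuming the volume name implied by the log prefix (deployment-prep-project), that the commands run on one of the gluster servers, and a GlusterFS recent enough to have the heal subcommand (3.3+):

    # Brick status and the list of entries still pending self-heal.
    sudo gluster volume status deployment-prep-project
    sudo gluster volume heal deployment-prep-project info

    # Kick off a full self-heal if entries look stuck.
    sudo gluster volume heal deployment-prep-project full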
[11:43:08] since the page sometimes looks as if TMH was not enabled at all
[11:45:30] there is a squid cache in front
[11:45:44] that might be caching the page
[11:54:03] now i get a 404 for resources like http://bits.beta.wmflabs.org/static-master/resources/jquery/jquery.cookie.js
[11:54:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[11:56:28] dohh
[11:56:31] poor bits
[11:57:09] j^: need to fix it :-)
[11:58:08] the links got removed :(
[11:58:21] !g I89c58c9a085dce4ef83ad5ea907135a269de359a
[11:58:21] https://gerrit.wikimedia.org/r/#q,I89c58c9a085dce4ef83ad5ea907135a269de359a,n,z
[11:58:44] grmblblb
[12:02:03] https://gerrit.wikimedia.org/r/28337
[12:03:38] !log deployment-prep Fixed assets on bits. The static-master symbolic links got removed at some point. See {{gerrit|28337}}
[12:03:40] Logged the message, Master
[12:03:49] j^: that is fixed :-] http://bits.beta.wmflabs.org/static-master/resources/jquery/jquery.cookie.js
[12:03:59] that most probably caused a ton of issues
[12:05:08] j^: got an error : "Error loading EmbedPlayer dependency set: Module mw.EmbedPlayer failed."
[12:05:13] on http://commons.wikimedia.beta.wmflabs.org/wiki/File:Folgers.ogv
[12:05:36] though that is thrown by ext.gadget.Stockphoto
[12:05:43] so I am not sure it is related to TMH
[12:07:06] hashar: loading http://commons.wikimedia.beta.wmflabs.org/wiki/Special:TimedMediaHandler i got a does not exist error (Served by deployment-apache32)
[12:08:00] j^: I get a special page content there
[12:08:08] with a banner about commons being available in french
[12:08:13] then blank content
[12:08:13] yes i get it most of the time
[12:08:28] via
[12:08:43] last time i did not get it it was served by 32
[12:08:48] so might be something about 32
[12:10:52] PROBLEM Disk Space is now: WARNING on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK WARNING - free space: / 574 MB (5% inode=89%):
[12:11:46] X-Cache MISS from squid001.beta.wmflabs.org
[12:11:57] so it seems it is hitting the apaches
[12:25:53] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[12:31:10] !log centralauth Update all instances and reboot them
[12:31:16] !log centralauth Update all instances and reboot them
[12:31:17] Logged the message, Master
[12:31:22] Logged the message, Master
[12:55:53] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[13:05:33] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[13:10:33] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 97 processes
[13:27:04] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[13:30:58] Hey hashar, thanks for setting up that db server!
[13:31:15] Any chance I could get sudo access in the project?
[13:33:03] Ryan_Lane: When you see this later on, please help me boot up dumps-bot3, it is down and I can't reboot it on the web interface. ID: i-000003ef.pmtpa.wmflabs. Thanks!
[13:34:37] csteipp: you should have sudo access already
[13:34:41] csteipp: at least I made you a sysadmin
[13:34:50] csteipp: and good morning :-]
[13:34:56] Good Morning :)
[13:35:07] I don't appear to have sudo...
[13:35:32] csteipp: while you are around, I will be connected tonight from roughly noon to 2pm PST. Not the ideal time but might let us chat a bit :)
[13:35:32] At least, I get a password prompt for it when I try to sudo anything
[13:35:34] ahh
[13:35:40] csteipp: you need to enter your labs password
[13:35:48] Oh!
[13:35:50] sudo uses pam_ldap to authenticate
[13:36:02] hashar: Are you guys going to create the beta wikis for Wikivoyage anytime soon?
[13:36:06] <^demon> Morning everyone :)
[13:36:22] morning ^demon!
[13:36:40] Hm... "csteipp is not allowed to run sudo on deployment-sql02. "
[13:37:14] :-(
[13:37:15] <^demon> hashar: So, the fact that we run puppet linting from patchset-created is annoying me, so I'm fixing the puppet job in jenkins.
[13:37:41] <^demon> Your rake doesn't seem to work completely yet, so I just made a new build step with the command line version.
[13:37:49] ^demon: is that reporting anything back to Gerrit ?
[13:38:10] I know I experimented a bit about it during the SF tech days
[13:38:39] <^demon> Not yet. But I got it to build on one I retriggered :)
[13:38:42] csteipp: I am looking at the security groups
[13:38:48] doh
[13:40:11] csteipp: ahh I did not add you to the sudo policy :-D https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaSudoer&action=modify&sudoername=admin&project=deployment-prep
[13:40:17] csteipp: you might be able to add yourself there
[13:41:07] !log deployment-prep Added CSteipp and Reedy as sudoers
[13:41:09] Logged the message, Master
[13:42:15] meh
[13:42:16] back
[13:42:27] csteipp what is your name?
[13:42:31] <^demon> hashar: What do I have to do to get jenkins-bot to report the failure to gerrit with a V-1?
[13:42:36] <^demon> Or +1
[13:42:40] Thanks hashar!
[13:42:49] petan: As in irl?
[13:42:58] as in labs :)
[13:43:08] csteipp: we usually get everything puppetized. Though for sql02 I just installed mysql manually
[13:43:08] CSteipp
[13:43:14] the production puppet classes do not work on labs :/
[13:43:17] there is no such user I can see
[13:43:29] got it
[13:43:36] it's not sorted meh
[13:43:44] ^demon: it should do it by default. A verified -1 is triggered whenever the build script fails, most of the time when some command exits with a non-zero status.
[13:43:52] ^demon: example: /bin/false ;-]
[13:43:53] ah, you should have sudo...
[13:44:03] I do now :)
[13:44:13] <^demon> hashar: Ah, jenkinsbot probably doesn't have permissions on ops/puppet. Lemme check
[13:44:22] !log deployment-prep -sql02 removed ganglia from host and reran puppet.
[13:44:23] Logged the message, Master
[13:44:34] ^demon: which job are you working on ?
[13:44:45] <^demon> operations-puppet
[13:45:06] https://integration.mediawiki.org/ci/job/operations-puppet/configure so it indeed got built. not sure why since it is supposed to be disabled :-]
[13:45:20] Retriggered by user Demon for Gerrit: https://gerrit.wikimedia.org/r/23603 in silent mode.
[13:45:27] so silent mode means nothing is reported back to Gerrit
[13:45:35] to avoid having ops getting angry :-]
[13:45:49] <^demon> In silent mode?
[13:45:51] <^demon> Whoops
[13:47:18] csteipp: your idea to use the MediaWiki database load balancer is smart. I just have 0 knowledge about how to set it up properly. I guess Asher could help
[13:47:23] csteipp: or Aaron :-]
[13:47:31] <^demon> hashar: I also just gave JenkinsBot Verified permissions on All-Projects.
[13:47:59] Yeah, I think I'll be able to do it... but I'll see how far I get :)
[13:50:21] <^demon> hashar: I disabled quiet mode, but it's still not reporting :\
[13:51:36] let me look up in /var/log/jenkins/jenkins.log
[13:52:53] #gerrit approve 23603,2 --message 'Build Started https://integration.mediawiki.org/ci/job/operations-puppet/97/ ' --verified 0 --code-review 0
[13:52:53] hehe
[13:52:53] so the startup is commented out
[13:56:35] ^demon: anyway, the build #97 has been triggered on change 23603 which is merged. Jenkins tried the command gerrit approve 23603,2 --message 'Build Successful….' --verified 1 --code-review 0 which is not going to work since the change is closed
[13:56:35] <^demon> Ah, gonna have to try it on a fresh change :)
[13:56:35] and make sure are aware about it
[13:56:35] apparently Ryan did not see a point in using Jenkins and preferred the Gerrit hooks
[13:56:35] !ping
[13:56:35] ^demon: you can play with https://gerrit.wikimedia.org/r/#/c/28216/ if you want. I am most probably going to abandon that one.
[13:56:35] wm-bot: hey
[13:56:35] o.o
[13:56:38] Hi petan, there is some error, I am a stupid bot and I am not intelligent enough to hold a conversation with you :-)
[13:56:39] pong
[13:56:43] lol
[13:56:49] that was a lag
[13:56:52] PROBLEM Total processes is now: WARNING on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS WARNING: 156 processes
[13:57:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[13:59:48] <^demon> hashar: Whoo, it worked :)
[13:59:49] <^demon> https://gerrit.wikimedia.org/r/#/c/28216/
[14:00:30] great
[14:01:02] it could use the rake script now :-)
[14:01:54] annhhh
[14:02:00] na it needs the ruby gems
[14:05:22] so we have puppet client 2.7.6 in production which does not have the puppet/util/colors class :/
[14:06:01] will self fix when we upgrade Gallium to Precise I guess
[14:10:50] <^demon> hashar: I disabled the jenkins job again until ops merges the change. Don't want to spam people with double notifications on pass/fail :)
[14:20:23] ^demon: and the new jenkins is at http://integration.wmflabs.org/ci/ :-D
[14:20:32] still need to write you down an email about it
[14:21:00] will hopefully be a safer system for us
[14:21:01] http://integration.wmflabs.org/ci/job/mediawiki-core-lint/45/
[14:21:14] though tests are failing cause there is only 1GB of memory on the box *grin*
[14:21:58] <^demon> So integration's going to run out of labs now instead of gallium?
[14:22:25] nop
[14:22:31] I have setup Zuul on labs
[14:22:32] * chrismcmahon follows along...
[14:22:36] to get the puppet class setup
[14:22:38] and update the jenkins job
[14:22:52] ended up rewriting the jenkins job from scratch (not that long anyway)
[14:23:04] <^demon> Ah ok, then we'll move it to prod?
[14:23:04] Zuul is now almost working fine :-]
[14:23:13] so to move to prod I have to get gallium upgraded
[14:23:26] spent the day finding out all the steps that need to be done
[14:23:33] since some stuff is not puppetized
[14:23:46] we will have to shut down jenkins for a few hours while the box is being rebuilt
[14:24:46] hashar: and then we can deploy extensions and config to beta labs from Jenkins?
[14:26:49] I guess
[14:27:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[14:27:04] will need to write the job and get jenkins to communicate with a beta instance
[14:27:29] chrismcmahon: we should probably write a small doc as to what we expect that job to do
[14:27:49] hashar: that's about the most important thing I want to do right now
[14:28:36] make deploying extensions to beta labs builds in Jenkins so a) we can see where they fail and b) hang some browser tests off those builds
[14:29:27] well all extensions are already git pulled on beta
[14:29:38] though most are not configured
[14:30:05] hashar: but that has been failing since 7 Oct. if it were in Jenkins we could see it red and see the actual commands
[14:30:55] yeah the script does not work that well
[14:31:47] though I am pretty sure I fixed it last friday
[14:32:05] not working :-((((
[14:32:12] will fix that tonight
[14:32:41] hashar: I went looking for the script yesterday on deployment, couldn't find it. can you point me to where it runs?
[14:32:54] it is on deployment-jobrunner06
[14:33:04] /usr/local/bin/wmf-beta-autoupdate
[14:33:13] thanks!
[14:33:17] that is started as a service (configured in upstart)
[14:33:21] ok
[14:33:21] and deployed by puppet
[14:33:50] it must be a permission error
[14:34:05] anyway, I am off to go to bank + accountant + daughter + dinner
[14:34:23] will be back around after dinner (roughly noon PST)
[14:34:47] don't waste too much time with the beta auto updater, it is bad quality software and poorly integrated :-(
[14:35:29] off
[14:41:52] PROBLEM Total processes is now: CRITICAL on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS CRITICAL: 202 processes
[14:46:52] RECOVERY Total processes is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: PROCS OK: 117 processes
[14:57:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[14:59:12] PROBLEM Current Load is now: WARNING on deployment-video05 i-000004cc.pmtpa.wmflabs output: WARNING - load average: 10.69, 10.54, 7.26
[15:04:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[15:04:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 154 processes
[15:08:53] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 17.72, 15.95, 9.22
[15:18:53] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 10.14, 9.95, 7.05
[15:19:53] RECOVERY Total processes is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS OK: 150 processes
[15:27:03] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[15:29:33] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 97 processes
[15:40:34] Damianz or petan are you available?
[15:40:44] for 10 mins
[15:40:47] great
[15:40:59] I've been told to talk to you guys, with respect to being added to the Bots project on Labs
[15:41:15] sure
[15:42:08] !log bots +huji
[15:42:09] Logged the message, Master
[15:42:10] done
[15:42:20] !project bots
[15:42:20] There are multiple keys, refine your input: project-access, project-discuss, projects,
[15:42:32] hmm
[15:42:47] @search Projec
[15:42:49] Results (Found 4): instancelist, instance-json, info, manage-projects,
[15:43:02] !info test
[15:43:02] https://www.mediawiki.org/wiki/WMF_Projects/Wikimedia_Labs
[15:43:02] hm..
[15:43:12] Huji you are a member of that project
[15:43:16] * Huji feels like he opened a can of worms!
[15:43:23] that's weird
[15:43:23] I just can't remember that key to link you to docs
[15:43:36] because .. see: https://www.mediawiki.org/wiki/WMF_Projects/User_talk:Huji
[15:43:39] !resource bots
[15:43:43] https://labsconsole.wikimedia.org/wiki/Nova_Resource:bots
[15:43:48] maybe this?
[15:44:06] yes,
[15:44:12] that is what I'm talking about
[15:44:28] huji I don't see anything there
[15:44:42] sorry, bad link
[15:44:42] 1 sec
[15:44:43] basically you just ssh to bastion, then you ssh to bots-nr1
[15:44:44] :)
[15:44:56] https://labsconsole.wikimedia.org/wiki/User_talk:Huji
[15:45:01] there you can do your stuff, for any question, you speak to me or Damianz
[15:45:16] ok, anyway you are a member now
[15:45:45] !botsdocs
[15:46:00] 10/17/2012 - 15:46:00 - Created a home directory for huji in project(s): bots
[15:46:05] !botsdocs is https://labsconsole.wikimedia.org/wiki/Nova_Resource:Bots/Documentation
[15:46:05] Key was added
[15:46:12] !botsdocs | Huji
[15:46:12] Huji: https://labsconsole.wikimedia.org/wiki/Nova_Resource:Bots/Documentation
[15:46:13] :P
[15:46:19] this is documentation
[15:46:33] great
[15:47:06] thanks petan
[15:50:46] petan: are you and Damian the only sysadmins in that project?
[15:51:06] 10/17/2012 - 15:51:05 - User huji may have been modified in LDAP or locally, updating key in project(s): bots
[15:51:13] Huji nope, let me get a list...
[15:51:18] should be somewhere on the web
[15:52:08] DamianZaremba Dzahn Jeremyb Ryan Lane Petrb Rich Smith Thehelpfulone
[15:52:14] that's a list of all sysadmins there
[15:52:36] one last question, if you still have time
[15:52:39] sure
[15:52:46] !mail
[15:52:51] !list
[15:52:53] @search list
[15:52:53] Results (Found 1): keys,
[15:53:05] @search mail
[15:53:05] Results (Found 3): osm-bug, new-labsuser, account-questions,
[15:53:10] o.o
[15:53:34] a bunch of bot masters in Fa WP are running bots off of toolserver now, and the idea is for me to pave the way for them so that they can eventually switch to Labs. Their bots are mainly focused on adding missing categories, and creating pages from data
[15:53:50] !mail is we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe
[15:53:51] Key was added
[15:54:05] do you have any recommendations about which instance to start using, until we get big enough (probably) to request a separate instance?
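The two-hop login petan describes above (ssh to bastion, then ssh to bots-nr1) can be collapsed into a single command with an ssh ProxyCommand. A sketch, assuming the usual labs bastion hostname and an OpenSSH new enough for -W (5.4+); the username is a placeholder:

    # ~/.ssh/config on your own machine
    Host bastion.wmflabs.org
        User huji

    Host *.pmtpa.wmflabs
        User huji
        ProxyCommand ssh -W %h:%p bastion.wmflabs.org

    # then a direct-looking login just works:
    ssh bots-nr1.pmtpa.wmflabs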
[15:54:15] RECOVERY Current Load is now: OK on deployment-video05 i-000004cc.pmtpa.wmflabs output: OK - load average: 0.27, 0.77, 4.78
[15:54:30] yes, bots-nr1 or bots-4 in case you need root
[15:54:57] bots-4 is perfect to test stuff, then move it to puppet and then enable it on production version
[15:54:57] :)
[15:54:58] !bots
[15:54:58] http://www.mediawiki.org/wiki/Wikimedia_Labs/Create_a_bot_running_infrastructure proposal for bots
[15:55:04] root is not needed, IMHO. bots-1 is listed FIXME in the Doc, which is why I asked
[15:55:05] that's how the proposed idea should be
[15:55:20] bots-1 is the worst instance you could use :D
[15:55:26] it's the first one we made and most damaged
[15:55:31] use bots-4 or bots-nr1
[15:55:35] notice "nr"
[15:55:40] bots-nr1, roger
[15:55:54] I am also not familiar with puppet, any reading suggested?
[15:56:00] !puppet
[15:56:00] learn: http://docs.puppetlabs.com/learning/ troubleshoot: http://docs.puppetlabs.com/guides/troubleshooting.html
[15:56:10] !labswiki puppet
[15:56:25] !labswiki is https://labsconsole.wikimedia.org/wiki/$*
[15:56:25] Key was added
[15:56:31] !labswiki Puppet
[15:56:31] https://labsconsole.wikimedia.org/wiki/Puppet
[15:57:03] aha
[15:57:04] oops,
[15:57:05] that one doesn't exist
[15:57:07] :D
[15:57:20] funny
[15:57:24] there is no help for puppet so far
[15:57:32] ok, in that case
[15:57:32] haha
[15:57:33] !puppet
[15:57:33] learn: http://docs.puppetlabs.com/learning/ troubleshoot: http://docs.puppetlabs.com/guides/troubleshooting.html
[15:57:37] these 2 work
[15:57:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 153 processes
[15:58:00] nice
[15:58:01] thank you so much
[15:58:05] no problem :)
[15:58:09] !list | Huji
[15:58:15] !mail | Huji
[15:58:15] Huji: we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe
[15:58:16] !list alias mail
[15:58:16] Created new alias for this key
[15:58:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[15:59:01] !list | test
[15:59:01] test: we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe
[15:59:12] ok :D
[16:24:52] PROBLEM host: i-000004cf.pmtpa.wmflabs is DOWN address: i-000004cf.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000004cf.pmtpa.wmflabs)
[16:28:53] PROBLEM Current Load is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:29:14] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[16:29:33] PROBLEM Current Users is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:30:12] PROBLEM Disk Space is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:31:05] PROBLEM Free ram is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:32:22] PROBLEM Total processes is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:33:15] PROBLEM dpkg-check is now: CRITICAL on praxis i-000004d0.pmtpa.wmflabs output: Connection refused by host
[16:34:32] RECOVERY Current Users is now: OK on praxis i-000004d0.pmtpa.wmflabs output: USERS OK - 0 users currently logged in
[16:35:13] RECOVERY Disk Space is now: OK on praxis i-000004d0.pmtpa.wmflabs output: DISK OK
[16:36:03] RECOVERY Free ram is now: OK on praxis i-000004d0.pmtpa.wmflabs output: OK: 1007% free memory
[16:37:23] RECOVERY Total processes is now: OK on praxis i-000004d0.pmtpa.wmflabs output: PROCS OK: 90 processes
[16:37:43] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:38:13] RECOVERY dpkg-check is now: OK on praxis i-000004d0.pmtpa.wmflabs output: All packages OK
[16:38:53] RECOVERY Current Load is now: OK on praxis i-000004d0.pmtpa.wmflabs output: OK - load average: 0.55, 0.79, 0.59
[16:42:42] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf.pmtpa.wmflabs output: Warning: 8% free memory
[16:59:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[17:07:34] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[17:08:52] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 2.66, 3.57, 4.74
[17:12:34] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 97 processes
[17:26:53] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 12.26, 10.78, 7.63
[17:29:24] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[17:53:48] [bz] (NEW - created by: Chris McMahon, priority: Unprioritized - major) [Bug 41121] out of captcha images; cannot create accounts - https://bugzilla.wikimedia.org/show_bug.cgi?id=41121
[17:59:23] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[18:13:38] sumanah: Is it possible to get a Gerrit user for using gerrit query
[18:13:52] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 4.05, 4.28, 4.95
[18:13:53] in a script
[18:13:59] hi Jan_Luca -- do you mean you want a Gerrit user account?
[18:14:21] I have a user account for me (jan) but I need one for a labs project
[18:14:53] Jan_Luca: Ah, ok, I see what you mean.
[18:15:06] I don't know what the current policy is on that sort of thing. Maybe ask in labs-l ?
[18:16:06] There is a user "labs-puppet" for downloading puppet config from Gerrit so it seems to be possible...
[18:16:12] <^demon> I imagine it'd be fine to set up a user for that. But yeah, just toss it on labs-l to give someone a chance to say "That's a bad idea"
[18:16:45] ^demon: Ok, I will do this
[18:16:54] <^demon> sumanah: Also, gerrit 2.5rc1 came out yesterday. Full release notes: http://gerrit-documentation.googlecode.com/svn/ReleaseNotes/ReleaseNotes-2.5.html
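The scripted `gerrit query` use Jan_Luca asks about above runs over Gerrit's ssh interface, so a dedicated account only needs its own ssh key. A sketch of the call such a script would make (the account name and project filter are placeholders; 29418 is Gerrit's ssh port, as seen elsewhere in this log):

    # One JSON object per matching change, plus a trailing stats record.
    ssh -p 29418 some-bot-account@gerrit.wikimedia.org \
        gerrit query --format=JSON status:open project:some/project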
[18:26:15] [bz] (NEW - created by: Jan Gerber, priority: Unprioritized - normal) [Bug 41123] glusterfs /data/projects 0-deployment-prep-project-replicate-0: background entry self-heal failed - https://bugzilla.wikimedia.org/show_bug.cgi?id=41123
[18:26:16] [bz] (NEW - created by: Chris McMahon, priority: Unprioritized - major) [Bug 41121] out of captcha images; cannot create accounts - https://bugzilla.wikimedia.org/show_bug.cgi?id=41121
[18:29:23] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[18:30:05] <^demon> Jan_Luca: You might also be interested to know--when we upgrade to gerrit 2.5, it introduces a nice public REST api.
[18:31:53] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 3.88, 4.20, 4.96
[18:34:09] * Damianz appears
[18:35:24] <^demon> Howdy.
[18:35:38] Hi :D
[18:36:12] So, I just picked up my motorbike. Why did I not do this sooner!?
[18:47:23] PROBLEM Disk Space is now: CRITICAL on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK CRITICAL - free space: /export 505 MB (2% inode=58%): /home/SAVE 505 MB (2% inode=58%):
[18:53:11] ^demon: When will gerrit 2.5 be installed on gerrit.wikimedia.org? I think this will take some more time...
[18:53:32] when they fix ldap
[18:53:52] PROBLEM Current Load is now: CRITICAL on mobile-solr i-000004d1.pmtpa.wmflabs output: Connection refused by host
[18:53:55] <^demon> Jan_Luca: Real Soon Now (tm). Waiting for an upstream fix for an LDAP regression.
[18:54:32] PROBLEM Current Users is now: CRITICAL on mobile-solr i-000004d1.pmtpa.wmflabs output: Connection refused by host
[18:58:45] ^demon: You know really you're going to end up forking releases, cherry picking patches and building custom versions :D
[18:58:47] RECOVERY Current Load is now: OK on mobile-solr i-000004d1.pmtpa.wmflabs output: OK - load average: 0.05, 0.53, 0.46
[18:59:16] <^demon> Already am. I'm afraid the ldap fix won't make stable-2.5, and I'll have to do that.
[18:59:23] <^demon> (Our 2.4.2 is custom)
[18:59:37] RECOVERY Current Users is now: OK on mobile-solr i-000004d1.pmtpa.wmflabs output: USERS OK - 1 users currently logged in
[18:59:50] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[19:00:40] surely they can't break ldap in a -stable release
[19:01:46] <^demon> Well it took a day to convince them it was even broken, sooooo ;-)
[19:02:04] lol
[19:02:41] <^demon> It's been a week now :(
[19:05:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes
[19:06:54] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 10.28, 9.59, 6.85
[19:09:52] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 8.96, 8.44, 6.08
[19:11:44] http://i.imgur.com/rIuRI.jpg <
[19:30:24] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[19:58:49] Hey hashar, have time for a few questions about beta?
[20:00:17] sure
[20:00:24] csteipp: let me fix it first :-)
[20:00:30] Oh, no problem!
[20:00:32] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 283 processes
[20:00:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[20:01:38] csteipp: listening to you :)
[20:02:22] PROBLEM Disk Space is now: WARNING on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK WARNING - free space: /export 763 MB (4% inode=58%): /home/SAVE 763 MB (4% inode=58%):
[20:02:54] I was reading through addWiki.php, and see it updates all.dblist, and wikiversions.dat... should I push those updates into gerrit after the run? Or does another script handle that?
[20:03:14] hmm well
[20:03:18] that is part of the current madness
[20:03:28] all.dblist and wikiversions.dat are shared by both production and beta
[20:03:36] but actually have different content :-(((
[20:03:42] Ah...
[20:03:56] which is a huuuugugggggeee trouble
[20:03:56] I can see how that would cause some issues.
[20:04:03] I want to replace all.dblist by something like all-wmflabs.dblist
[20:04:05] <^demon> csteipp: Updates to?
[20:04:24] ^demon: yep... wikivoyage
[20:04:27] <^demon> (I was poking at addWiki the other day, which is why I asked)
[20:04:29] plus we have no more captcha on beta :(
[20:04:30] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&returnto=Main+Page&type=signup
[20:04:39] I saw that this morning
[20:04:47] But couldn't figure out why
[20:04:54] 1489: $wgCaptchaDirectory = '/mnt/upload7/private/captcha';
[20:04:56] YEAHHHH
[20:07:17] Hm... so ^demon, since you're also working on addWiki-- if we're adding in a whole new suffix, do we need to update refresh-dblist too?
[20:07:44] <^demon> Yes. You'll need to tweak refresh-dblist to recognize the new suffix.
[20:07:59] cool
[20:08:09] <^demon> addWiki *should* be fine as-is. It just updates all.dblist and then calls refresh-dblist (which splits it into the wikifoo.dblist files)
[20:08:57] hashar, I made a patch for that
[20:09:13] ah, you commented
[20:09:20] yeah
[20:09:26] currently creating the upload7 directory
[20:09:29] and moving the captchas
[20:09:40] <^demon> csteipp: I started this etherpad for Wikidata, but it mostly applies to wikivoyage too. You might find it useful (http://etherpad.wikimedia.org/WikidataWiki).
[20:09:48] there isn't anything pointing to upload6 ?
[20:09:55] <^demon> At some point, this all needs real docs on wikitech. But it only happens every ~6 years or so.
[20:10:07] I've noticed :)
[20:11:02] Platonides: I have created the /data/upload7 directory and moved private/captcha from upload6 to that.
[20:11:14] Platonides: want me to amend the puppet change ?
[20:11:22] I just sent another one
[20:11:41] I think we should have upload6 -> upload7 then
[20:12:08] just one name
[20:12:08] https://gerrit.wikimedia.org/r/#/c/28391/ still has only one patchset
[20:12:22] we need beta to match production as closely as possible
[20:12:22] it was waiting on 'Is this really what you meant to do?'
[20:12:27] so if prod uses upload7, beta should too :-]
[20:12:39] ^demon: are your changes for "All of mediawiki-config will need work" in gerrit?
[20:12:55] <^demon> No, none have been submitted yet.
[20:12:56] That's the part I've been trying to figure out today :)
[20:13:23] <^demon> At the very least, all the CentralAuth stuff needs tweaking. And per Roan, the siteFromDB() code may need looking at to make sure it'll work with the new TLD.
[20:13:41] which new TLD?
[20:13:57] <^demon> wikivoyage and wikidata are both being set up. [20:14:10] oh, right [20:14:29] it should be straightforward [20:14:57] Platonides: commented on https://gerrit.wikimedia.org/r/#/c/28391/ . Just replace the target to point to the newly created upload7 [20:14:58] <^demon> Platonides: Yeah hopefully. It's just been a long time so I'm afraid of breaking stuff :) [20:15:12] <^demon> We haven't added a new project TLD in years. [20:16:14] hashar, I had done it [20:16:24] but missed the -a in the --amend [20:16:24] ;-} [20:16:26] new one sent [20:16:39] I have a git alias named amend [20:16:41] so much for such a tiny change :P [20:16:47] `git amend' is aliased to `commit --amend' [20:17:01] though that is not going to save you if you forget -a hehe [20:17:06] the problem was forgetting the -a :) [20:17:37] <^demon> Well, `git commit --amend` is just like git commit, it's only going to commit what's staged :) [20:17:55] <^demon> Alternatively, you can git add and then git commit --amend, in case you only want to amend say 1 or 2 files. [20:18:03] yes, I know... yet I still fail sometimes [20:18:27] Platonides: why do you change the upload6 link from /data/project/upload6 to upload7 ? [20:19:05] for them moving everything to upload7 [20:19:46] seems production replaced upload6 with upload7 [20:20:17] <^demon> Yes, so upload7 is more like prod now. [20:20:32] ok [20:20:33] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes [20:20:38] so we have to apply your change [20:20:47] and then rename upload6 to upload7 [20:20:52] yes [20:20:54] moving back the captcha dir ;-) [20:21:33] just merge it :P [20:21:40] I wish I could [20:21:48] I thought you could [20:22:18] ssh: connect to host gerrit.wikimedia.org port 29418: Connection refused [20:22:19] :( [20:22:24] me too [20:22:25] wtf [20:22:43] Gerrit web interface is serving me 503s [20:22:48] Gerrit hates you [20:23:18] works now [20:23:45] hashar, I also sent https://gerrit.wikimedia.org/r/#/c/28442/ to change to upload7 the rest of the references [20:24:28] it was ^demon's fault [20:24:56] Platonides: so I fixed your change : https://gerrit.wikimedia.org/r/28391 ;D [20:26:16] will get someone to merge it [20:26:27] though SF Ops are probably all eating ;-] [20:26:41] Ryan_Lane just joined here [20:26:41] Ryan_Lane, can you merge 28391 and 28442 ? [20:26:51] !gerrit floating_ips [20:26:51] https://gerrit.wikimedia.org/ [20:26:55] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 3.85, 4.05, 4.98 [20:26:55] bah [20:26:59] !change 28391 [20:26:59] https://gerrit.wikimedia.org/r/#q,28391,n,z [20:27:09] !change 28442 [20:27:09] https://gerrit.wikimedia.org/r/#q,28442,n,z [20:27:55] <^demon> Ryan_Lane: I just restarted gerrit btw. Our custom build was slightly borked. [20:27:56] ummm [20:28:00] whatever floats your ..ips.. [20:28:15] Platonides: 28442 affects production [20:28:23] I'm not terribly comfortable with this [20:28:34] oh [20:28:34] waut [20:28:34] wait [20:28:35] no it doesn't [20:28:39] see the filename suffix [20:29:04] actually, the move of server in production broke labs [20:29:37] we are trying to move the labs names up to date with production ones [20:29:44] Platonides: Ryan merged the changes :]] [20:30:13] * Ryan_Lane nods [20:30:13] yeah. I didn't catch the suffix [20:30:27] !log deployment-prep moving /data/project/upload6 to /data/project/upload7 to match production. 
[20:30:27] see in an ideal world we'd just load wmflabs.php in from the settings then allow hashar to merge changes to that file without ops review
[20:30:27] Logged the message, Master
[20:30:32] * Damianz dreams about his perfect world
[20:30:55] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[20:31:28] and hashar moved it
[20:31:34] hashar, did you rerun puppet?
[20:31:43] no need
[20:31:46] but I need to update the conf
[20:32:15] no need?
[20:32:23] the webservers need /mnt/upload7
[20:32:35] or did you create them manually?
[20:33:17] I moved the file
[20:33:19] will run puppet to apply the new manifest
[20:33:19] well, night
[20:33:45] error: insufficient permission for adding an object to repository database .git/objects
[20:33:47] boabhahboabhab
[20:34:02] Platonides: thank you for pointing out the root cause
[20:34:27] it was quite easy
[20:34:44] I was seeing "out of captchas" = there are plenty of captchas there! xD
[20:34:55] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 2.89, 3.27, 4.46
[20:35:05] Platonides: yeah the error message is really bad
[20:35:09] so I obviously went to look where it was looking for them
[20:35:09] and spotted it
[20:35:23] well, the error message was right
[20:35:27] there were no captchas in the new path :)
[20:35:55] * Damianz captchas Platonides
[20:36:19] * Platonides sends a captcha flood to Damianz
[20:37:15] hmm
[20:37:51] what's the point of test wiki vs beta? can we kill test (after fixing beta)?
[20:38:29] Damianz: no
[20:38:53] test is meant to test a deploy
[20:38:53] how I hate duplicated stuff
[20:39:07] beta is meant to test changes before they are to be deployed
[20:39:07] surely you should stage a deploy not on prod?
[20:39:27] it's good to know your deploy isn't going to break things
[20:39:49] because that /never/ happens :)
[20:40:09] it happens less when people use test
[20:40:51] Damianz: test/test2 should arguably be called "staging" and/or "pre-production", environments directly connected to prod. beta is (becoming) a place to take more risks than are possible on test/test2
[20:40:59] 10/17/2012 - 20:40:58 - User sugar has been renamed, moving home directory in project(s): sugarcrm
[20:41:00] dunno, we use dev, uat (staging) and prod vms. Everything goes on staging before it hits prod and the staging servers get cloned from prod so the data is identical.
[20:41:13] also </3 vmware, I miss kvm
[20:42:22] Imo for a deploy we should 'package' what is being pushed, clone out a setup identical to production, run through the upgrade steps, qa test that then repeat in prod
[20:42:44] Since as I understand it you can break prod from test doing some things
[20:42:54] Damianz: beta is also a place where can point high-level automated tests. Again, we're not quite in a position where gerrit/Jenkins/beta are all lined up right, but it's coming.
[20:43:03] where we
[20:43:53] !log deployment-prep applying nfs::upload::labs to apache32 and 33. It is no longer applied by the role::applicationserver class (prod applies nfs::upload directly on nodes)
[20:43:55] Logged the message, Master
[20:44:22] what is this "identical to production" you speak of? ;)
[20:44:44] See imo water testing shouldn't be on beta - it should be part of CI and spun up on a clean setup with a known config.
[20:46:34] Damianz: beta isn't yet close enough to production for that [20:46:37] agreed that it should be part of CI. agreed that it should be a known config. [20:46:48] also, we run beta from trunk [20:46:50] not from wmf branches [20:46:57] yeah [20:47:05] where testing should be done against every branch supported. [20:47:09] and also master [20:47:15] but that depends on workflow [20:47:15] thus beta config piggybacks on operations/mediawiki-config in gerrit [20:47:16] agreed. that's not easy, either, though [20:47:19] and if master is stable or tags are [20:48:14] the prod mediawiki config is a freaking mess, it needs turning into roles with generic sections and cluster-specific overrides. Hacks on top of hacks make for a mess. [20:48:18] for that matter, prod itself is probably less of a "known config" than beta is. [20:48:54] The other problem is, large parts of prod aren't in puppet (think squid) and pushing the beta stuff back into prod won't be super easy due to the risk involved. [20:48:57] I keep stumbling over ad hoc stuff in prod to be automated for beta [20:49:15] Damianz: squid is going to be phased out I think [20:49:22] varnish? [20:49:26] * Damianz yays if so [20:49:30] Damianz: thus I said in an email a while back "the best way to improve beta is to use it" [20:49:31] plus that is honestly not the most sensible part for beta [20:49:55] or complain about what you don't like :P [20:50:04] chrismcmahon, we should have started by copying everything from production to beta [20:50:08] then improving there [20:50:21] Platonides: agreed. that's not what happened. [20:50:24] Well really we should have cloned prod boxes rather than reverse engineering blind [20:50:32] making things new (hopefully better) then makes merges difficult [20:50:48] But beta serves as an ok base to at least get everything configured and hope ops will push that back into prod so we can *easily* replicate the cluster [20:50:49] yes, but we weren't given that to clone :( [20:51:00] It doesn't help that puppet is a fucking mess and impossible to use outside prod/labs [20:51:04] as Mr. Weinberg said "things got this way because things got this way" [20:51:14] :) [20:51:28] good night [20:51:55] thanks Platonides [20:53:20] Damianz: that is more or less what I did [20:53:32] Damianz: when I landed on beta, I started using the production puppet classes [20:53:42] so beta is mostly running from the production puppet classes :-] [20:54:00] of course production keeps evolving / refactoring so it is hard to follow up [20:54:09] it's the mostly part that I hate :) Anything changed in production should be possible to test on beta first [20:54:21] ideally ops should work on beta [20:54:24] And there's no active feedback ops-wise for changes to make that really work [20:54:28] and they should also test beta... [20:54:37] Also we suck for things like lvs/bgp etc that's in prod [20:55:04] but then ops are overwhelmed with production issues and lots of problems are really only going to be faced on the production cluster [20:55:37] Firefighting doesn't improve stuff long term though, it just makes people grumpy and the issues reappear [20:55:58] this isn't firefighting [20:56:24] Imo as part of a change management process, unless it's service affecting, *nothing* should change on prod without going through beta. Anything changed for service-affecting issues should be retroactively put through beta [20:56:50] Damianz: there is a bit of firefighting I guess. But they are also doing a toooon of stuff behind the scenes.
So I am never going to blame ops. [20:57:08] and yeah, we definitely need industrial processes ;-] [20:57:14] we are still young and small hehe [20:57:35] I have been in a company where changing a comma in a shell script would need like a month to get deployed [20:57:50] Same issues at the day job heh. Trying to move from a weird setup to a totally automated setup. [20:58:13] before puppet, we had to poke ops online to change files that belong to the root user [20:58:15] that was a pity [20:58:28] nowadays I push to gerrit, wait a few days and eventually poke someone to have a look at it [20:58:31] much better for them [20:58:50] cause all the trivial stuff can be handled in batches during a code review sprint [20:59:04] Gerrit is ok, usually involves grabbing someone specific though. Not really actively reviewed imo (even if linked back from bz in some places). [20:59:06] but yeah, I definitely agree our processes need to improve :-] But I believe we are on the right track [20:59:09] Same with patches in bz, they get ignored. [20:59:18] (heh we deploy mediawiki every two weeks straight from master !!! ) [20:59:32] yeah bz patches have been a recurring issue :-( [20:59:50] I believe the pre-commit review workflow with gerrit helped a bit on that front [21:00:01] and Andree (the new bug wrangler) is probably looking at it [21:00:07] Sumanah helped a ton too by getting developers to look at the patches [21:00:17] hexmode did a lot of copying from bugzilla to gerrit [21:00:24] so we improved this year at least :) [21:00:29] (I'm trying to be optimistic hehe) [21:00:32] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 283 processes [21:00:45] The gerrit way encourages collab more (though I prefer the github PR format). The "hey, you need x, y, z, 2 hands, 1 foot and a nice person to get an svn account, *then* you can add code" thing just sucked. [21:00:51] (and by we I mean staff + contractor + community + volunteers … i.e. everyone) [21:00:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [21:00:57] Though the problem there is lots of tools are not in git yet [21:01:45] if you are aware of such a tool, please shout about it on wikitech-l or possibly bugzilla under the git/gerrit component [21:02:00] Chad and I do some svn to git conversion from time to time [21:02:10] or poke whoever wrote the tool to add it in a git repo [21:02:36] I don't think we have any list yet of what is missing [21:03:18] fwiw, the next major piece for beta is to deploy automatically from gerrit in Jenkins builds, so that a) we can see if a deploy build failed and b) hang automated tests off that build using Jenkins [21:03:34] chrismcmahon: so the captchas are still problematic and I have no real idea what is wrong there :/ Apparently it can't find the files [21:03:49] chrismcmahon: giving up for tonight. Too late and I've got a bunch of bugs to open before heading to bed [21:03:55] along the way we discover things like we don't actually manage the contents of the prod db very well. [21:04:02] thanks hashar [21:04:23] that's probably going to take a chunk of jenkins work but it would be nice to have proper (integration/water) testing in place along with unit tests... also btw do we fail builds on stuff like code coverage? [21:04:59] chrismcmahon: adding a nice error message would probably help. "Ran out of captcha images" does not help at all.
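(On the coverage question at 21:04:23: a hedged sketch of how a coverage report could be generated for MediaWiki's PHPUnit suite. It assumes a MediaWiki checkout with PHPUnit and Xdebug available; the output directory name is arbitrary.)

    # run the test suite through MediaWiki's PHPUnit wrapper and emit an
    # HTML coverage report; a fully green run that touches only a sliver
    # of the code shows up immediately in the report
    php tests/phpunit/phpunit.php --coverage-html coverage/

A CI job could then fail the build when reported coverage drops below an agreed threshold, which is the guard against the "100% successful, 1% covered" scenario Damianz raises at 21:05:51.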
[21:05:10] Damianz: any serious code coverage analysis is a ways away, but I know we've been discussing it for some time. [21:05:29] Damianz: I am still reimplementing the Jenkins jobs ;-( [21:05:46] Damianz: and yeah code coverage would be great. Chris is pushing for it ;D [21:05:51] This is the problem with unit tests, they can be 100% successful and cover 1% of the code but as long as it's 100% everything's great... [21:05:51] I just keep slowing him down hehe [21:06:24] actually it was Erik who first mentioned code coverage to me in the course of my hiring interviews, but some other things take precedence [21:07:43] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - major) [Bug 41132] live hack in beta mediawiki-config (tracking) - https://bugzilla.wikimedia.org/show_bug.cgi?id=41132 [21:07:56] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 10.48, 9.38, 6.18 [21:08:53] Btw, now that gerrit supports replication in a kinda sane way, were you going to move beta to that so it's updated near live, or stick to the git pull loop? [21:08:53] RECOVERY SSH is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:09:44] <^demon> gerrit can replicate to anywhere with git + ssh installed. [21:09:53] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 12.34, 11.01, 7.65 [21:09:53] * Damianz would really love to be able to deploy a tag, branch or ref in a nice ui for proper 'hey it's broken in version x' and you can step through to see what changed (code diffs) against what's happening (error logs/output). [21:10:05] <^demon> All it's doing is pushing refs. But since you'd want a non-bare copy, you'd have to do some magic. [21:10:33] <^demon> Either `git reset --hard HEAD` all the time. Or pull from the bare copy to the live copy. [21:10:33] ^demon: Probably the issue would be submodules, dunno. I just dislike the while true; do git pull; sleep; done thing [21:10:33] let's get master reliable first, eh? [21:10:49] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41133] beta all.dblist is a live hack - https://bugzilla.wikimedia.org/show_bug.cgi?id=41133 [21:10:58] master is boring and when you say that you ignore the broken unit tests in the previous versions :D [21:11:04] 10/17/2012 - 21:10:59 - User sugar has been renamed, moving home directory in project(s): sugarcrm [21:11:20] <^demon> The unit tests didn't even work until like a year and a half ago ;-) [21:12:14] ^demon, you could make gerrit run something different when pushing the replication [21:12:23] then that script could do the git pull [21:12:36] <^demon> There's no post-replicate hook :\ [21:12:47] ^demon, just the replicate action [21:13:15] and remember that even if gerrit sends a push command, we can override that locally in authorized_keys :) [21:13:37] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41134] beta wikiversions.dat is a live hack - https://bugzilla.wikimedia.org/show_bug.cgi?id=41134 [21:13:44] just call a script that accepts the pull [21:13:50] and then checks out [21:14:05] preferably pulling in another copy [21:14:10] it may not like getting pushed to a non-bare [21:14:20] <^demon> If the destination isn't a bare repo, you could just have it replicate to that copy.
Then the only thing you'd need to do is `git reset --hard HEAD` after replication. [21:14:23] although I think it can be forced [21:14:24] <^demon> Rather than having 2 clones. [21:14:34] <^demon> Yeah, you can force it but you have to reset after push [21:14:51] either way, it can be done easily [21:16:09] <^demon> Indeed. I was saying it's not hard. [21:16:16] <^demon> The replication is fairly robust. [21:16:30] except when we lose refs :P [21:16:52] The alternative is to replicate everything to an r/o share and rsync/git clone off that, which doesn't solve the auto-updating problem but might be useful for stuff like puppet. [21:16:53] PROBLEM SSH is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: Server answer: [21:17:00] <^demon> Platonides: Yeah, well it doesn't know any better :p [21:17:06] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41135] on beta wmf-config/extension-list is a live hack - https://bugzilla.wikimedia.org/show_bug.cgi?id=41135 [21:18:09] <^demon> Anyway, to shut git up about pushing to a non-bare repo you just have to `git config receive.denycurrentbranch ignore` on the destination end. [21:18:16] Damianz, git should have a mode where fetching refs doesn't overwrite existing ones [21:18:30] merge ? :-) [21:18:50] hashar? [21:19:26] [bz] (RESOLVED - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 39759] phase out deployment-backup - https://bugzilla.wikimedia.org/show_bug.cgi?id=39759 [21:19:48] bahhh [21:19:55] https://bugzilla.wikimedia.org/show_bug.cgi?id=40991 - squid001 gives "Zero Sized Reply" error on POST request [21:19:58] merge sucks for what we really want on beta [21:20:23] git reset --hard; git clean -f -d; git pull # more like it [21:20:52] I don't do merges in git [21:21:09] I do 'git up' [21:21:19] I have it as an alias: up = pull --ff-only [21:21:46] [bz] (ASSIGNED - created by: Sumana Harihareswara, priority: Normal - critical) [Bug 40991] squid001 gives "Zero Sized Reply" error on POST request - https://bugzilla.wikimedia.org/show_bug.cgi?id=40991 [21:21:46] so if I'm not on origin/master it doesn't automatically merge where it shouldn't [21:23:10] [bz] (RESOLVED - created by: Antoine "hashar" Musso, priority: Low - minor) [Bug 37061] rebuild localisation cache whenever needed - https://bugzilla.wikimedia.org/show_bug.cgi?id=37061 [21:23:27] so I am out for bed now :-] [21:23:28] The most annoying thing to do is forget you're on a detached branch and pull in the master branch, then end up in 'hey you have uncommitted stuff' hell.
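(Putting ^demon's replication pieces together: a minimal sketch of configuring a non-bare replication target, using the two commands quoted in the discussion above. The path is illustrative, and the last line only gestures at Platonides' authorized_keys idea with a hypothetical script name.)

    # on the destination, inside the live (non-bare) checkout
    cd /srv/beta/mediawiki         # illustrative path

    # stop git complaining when gerrit pushes to the checked-out branch
    git config receive.denycurrentbranch ignore

    # after each replication push, sync the working tree to the new ref
    git reset --hard HEAD

    # Platonides' variant: force a command in ~/.ssh/authorized_keys so
    # the incoming push triggers the reset automatically, e.g.
    # command="/usr/local/bin/receive-and-reset" ssh-rsa AAAA... (hypothetical script)

Without the reset, the working tree silently lags behind the pushed ref, which is why ^demon says you "have to reset after push".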
[21:23:39] * Damianz tucks hashar in and wishes a good night [21:24:07] me too [21:25:17] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Normal - normal) [Bug 36748] [OPS] syslog::server (in test) unusuable - https://bugzilla.wikimedia.org/show_bug.cgi?id=36748 [21:27:24] PROBLEM Current Users is now: WARNING on ve-nodejs i-00000245.pmtpa.wmflabs output: USERS WARNING - 7 users currently logged in [21:27:53] RECOVERY Current Load is now: OK on ve-roundtrip i-00000404.pmtpa.wmflabs output: OK - load average: 1.17, 2.59, 4.92 [21:29:53] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 1.17, 2.30, 4.75 [21:30:41] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 190 processes [21:30:55] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [21:32:23] RECOVERY Current Users is now: OK on ve-nodejs i-00000245.pmtpa.wmflabs output: USERS OK - 4 users currently logged in [21:35:01] baba [21:35:01] I am really out now [21:35:01] have a good night! [21:35:52] PROBLEM Disk Space is now: CRITICAL on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK CRITICAL - free space: / 286 MB (2% inode=89%): [21:40:52] PROBLEM Disk Space is now: WARNING on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK WARNING - free space: / 288 MB (3% inode=89%): [21:50:56] PROBLEM Disk Space is now: CRITICAL on maps-test4 i-000004ae.pmtpa.wmflabs output: DISK CRITICAL - free space: / 278 MB (2% inode=89%): [22:00:32] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 283 processes [22:00:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [22:22:52] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 8.55, 9.22, 6.04 [22:25:16] Yay to annoying random project people unintentionally, some people need to understand personality moar. [22:30:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 189 processes [22:30:52] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [22:35:52] PROBLEM Current Load is now: WARNING on ve-roundtrip i-00000404.pmtpa.wmflabs output: WARNING - load average: 8.61, 8.36, 6.25 [22:44:29] hmm, just tried to create an instance: There were no Nova credentials found for your user account. Please ask a Nova administrator to create credentials for you. [22:46:50] any nova administrators around? [22:47:07] Ryan_Lane: ping ;) [22:47:30] logout [22:47:31] login [22:47:35] known issue [22:50:20] ok, trying.. [22:50:33] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 96 processes [22:51:07] gwicke: did you change your preferences? [22:51:19] worked after logging in again [22:51:32] Ryan_Lane: not recently I think [22:52:28] I want to kill that message and just force a logout tbh [22:52:42] or actually fix mediawiki, but screw that [22:52:44] Failed to create instance. [22:52:55] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 154 processes [22:52:58] is there any way to debug why it failed to create an instance?
[22:53:09] yeah, ask Ryan_Lane to look at the logs [22:53:14] what settings/name did you use [22:53:18] gwicke: which project? [22:53:25] visualeditor [22:53:27] sec [22:53:47] tried to get an eight-core vm [22:53:51] for round-trip testing [22:54:39] we don't really need that much RAM btw, but definitely CPU time [22:55:04] you hit a quota [22:55:30] gimme a min [22:55:43] we really need less crap error handling and actually like report quota use [22:56:01] Ryan_Lane: we'll probably need some more instances if possible [22:56:14] the more CPU time we have, the more articles we can actually test [22:57:09] are you doing performance testing? [22:57:09] no, round-trip testing [22:57:09] ah [22:57:09] on a dump of wikipedia [22:57:09] * Ryan_Lane nods [22:57:31] how many are you creating? [22:57:41] gwicke: How did you select those 100k articles, BTW? [22:57:52] just did a random sample for now [22:58:06] Right [22:58:28] if we had the resources, we'd test all of them to be sure to catch all serious breakage [22:58:33] If you followed the article criteria (for the purposes of Special:Statistics), that's already like 5% of all articles [22:58:48] Ryan_Lane: how many could we use at most? [22:58:52] over 9000 [22:58:54] a lot, theoretically [22:59:20] Ryan_Lane: we don't need them forever, just a few big runs I guess [22:59:23] until December [22:59:33] could tear them down in the meantime if resources are tight [23:00:25] only need to apt-get install node and run something from nfs, so setup is minimal [23:00:46] use puppet [23:00:54] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [23:01:57] Damianz: yeah, should look into that [23:06:09] 10/17/2012 - 23:06:09 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:07:43] * Damianz wanders off to debate the point of life [23:11:18] 10/17/2012 - 23:11:18 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:12:53] RECOVERY Total processes is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS OK: 149 processes [23:15:59] Ryan_Lane: should instance creation work now? I still get the same error. [23:16:07] gwicke: sorry, I was fixing another problem [23:16:13] 10/17/2012 - 23:16:11 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:16:37] Ryan_Lane: ok, just pong when something changes [23:21:13] 10/17/2012 - 23:21:10 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:25:52] PROBLEM Total processes is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: PROCS WARNING: 156 processes [23:26:19] 10/17/2012 - 23:26:18 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:28:31] gwicke: try now [23:28:57] what. the. fuck is SugarCE-Full-6.5.5? [23:29:03] is that *really* a user? [23:29:25] and if it is, who did this?
:) [23:29:28] !project sugarcrm [23:29:28] There are multiple keys, refine your input: project-access, project-discuss, projects, [23:29:40] !projects sugarcrm [23:29:40] https://labsconsole.wikimedia.org/w/index.php?title=Special:Ask&q=[[Resource+Type%3A%3Aproject]]&p=format%3Dbroadtable%2Fheaders%3Dshow%2Flink%3Dall%2Fsearchlabel%3D%E2%80%A6-20further-20results%2Fclass%3Dsortable-20wikitable-20smwtable&po=%3FMember%0A%3FDescription%0A&limit=500&eq=no [23:29:47] !resource sugarcrm [23:29:47] https://labsconsole.wikimedia.org/wiki/Nova_Resource:sugarcrm [23:30:07] Ryan_Lane: success, thanks! [23:30:11] cool. yw [23:30:21] I increased your number of cores by 20 [23:30:29] maybe I should have done 24 [23:30:55] I may need to increase memory at some point too [23:31:04] 10/17/2012 - 23:31:04 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:32:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [23:34:01] Ryan_Lane: the instance still has a status 'error', is it just a matter of waiting until it is set up? [23:34:20] no [23:34:21] if it has error, there's a problem [23:34:27] gimme a sec [23:35:19] DetachedInstanceError.... [23:35:30] weird [23:36:13] 10/17/2012 - 23:36:12 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:36:59] gwicke: delete/create [23:37:08] I restarted nova-compute on a couple nodes [23:37:27] there's some open bugs that'll be fixed whenever ubuntu releases the update [23:38:53] ok, trying.. [23:39:22] RECOVERY dpkg-check is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: All packages OK [23:39:43] RECOVERY Current Load is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: OK - load average: 0.00, 0.06, 0.07 [23:39:58] failed to create instance [23:40:02] PROBLEM host: i-000004d2.pmtpa.wmflabs is DOWN address: i-000004d2.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000004d2.pmtpa.wmflabs) [23:40:13] it must not have deleted the old one yet [23:40:33] ugh, all the compute services are showing as down [23:40:43] RECOVERY Current Users is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: USERS OK - 0 users currently logged in [23:40:43] RECOVERY Free ram is now: OK on patchtest i-000000f1.pmtpa.wmflabs output: OK: 628% free memory [23:40:53] Successfully deleted instance, but failed to remove parsoid-roundtrip3 DNS entry. [23:40:57] yep [23:41:02] 10/17/2012 - 23:41:02 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:41:11] the compute services are still restarting [23:41:26] so until they are up, nova won't delete or create things [23:41:35] I need to prod canonical [23:41:39] they need to release the updates [23:41:48] ok ;) [23:41:54] so, give it a little bit [23:42:02] maybe 10-15 minutes [23:42:11] I got to run, will try again tomorrow [23:42:17] ok [23:42:17] delete & recreate [23:42:20] sorry it's not working for you [23:42:26] thanks for your efforts! [23:44:12] yw [23:46:01] 10/17/2012 - 23:46:01 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:47:25] PROBLEM dpkg-check is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [23:47:54] PROBLEM Current Load is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:48:52] PROBLEM Current Users is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:52] PROBLEM Free ram is now: CRITICAL on patchtest i-000000f1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [23:51:13] 10/17/2012 - 23:51:09 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm [23:56:17] 10/17/2012 - 23:56:16 - User SugarCE-Full-6.5.5 has been renamed, moving home directory in project(s): sugarcrm