[00:27:22] 10Tool-Labs: Clean up huge logs on toollabs - https://phabricator.wikimedia.org/T98652#1316282 (10scfc) With regard to `fiwiki-tools`, I have asked the maintainer [[https://fi.wikipedia.org/wiki/Keskustelu_k%C3%A4ytt%C3%A4j%C3%A4st%C3%A4:Zache#fiwiki-tools_rapidly_fills_error.log|here]]. [03:40:20] 10Tool-Labs, 10pywikibot-core: Support Debian package python-ipaddr - https://phabricator.wikimedia.org/T100603#1316518 (10jayvdb) 3NEW a:3jayvdb [03:41:55] 10Tool-Labs, 10pywikibot-core: Support Debian package python-ipaddr - https://phabricator.wikimedia.org/T100603#1316526 (10jayvdb) [03:44:41] 10Tool-Labs, 10pywikibot-core, 5Patch-For-Review: Support Debian package python-ipaddr - https://phabricator.wikimedia.org/T100603#1316531 (10jayvdb) Travis whitelist request posted: https://github.com/travis-ci/travis-ci/issues/3973 [03:47:17] (03CR) 10John Vandenberg: "The ipaddr package passes all our IP tests, so I have proposed adding support for it as T100603" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/209978 (https://phabricator.wikimedia.org/T86015) (owner: 10Merlijn van Deen) [03:58:47] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316552 (10JAufrecht) That seems to have fixed it. From a different computer: ``` joel@sanpolo:~$ ssh phab-01 The authenticity of host 'phab-01 ()' can't be established. ECDSA key fingerprint is 7d:9... [03:59:07] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316553 (10JAufrecht) 5Open>3Resolved [04:01:59] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316564 (10Negative24) An alternative (read: the way I do it) is I set ``` Host *.wmflabs.org User negative24 ``` [06:37:22] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ivorah was created, changed by Ivorah link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Ivorah edit summary: Created page with "{{Tools Access Request |Justification=I want to populate a graph database with wikipedia information for research purposes. |Completed=false |User Name=Ivorah }}" [06:49:41] PROBLEM - Puppet staleness on tools-master is CRITICAL 20.00% of data above the critical threshold [43200.0] [08:03:16] PROBLEM - Puppet staleness on tools-shadow is CRITICAL 55.56% of data above the critical threshold [43200.0] [08:46:47] !log deployment-prep test es-tool restart-fast on deployment-elastic05 [08:46:51] Logged the message, Master [09:40:16] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317056 (10yuvipanda) Yeah, I agree :) We could perhaps try switching it off after the move to designate. [09:44:04] 6Labs, 6operations: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1317060 (10ArielGlenn) this is now live on virt1000. https://gerrit.wikimedia.org/r/#/c/214314/ [11:07:37] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317231 (10fgiunchedi) I'd say mostly to avoid roundtrips to ldap? ``` $ sudo nscd -g | grep -e 'cache:$' -e 'rate' -e 'number of cached' passwd cache: 1% cache hit rate 38 current... [11:18:37] Can somebody please tell me what in the world is wrong with jsub [11:19:14] YuviPanda|brb, ^ [11:20:27] :/ [11:20:34] No one's ever here. 
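(Editorial aside) Negative24's `Host *.wmflabs.org` snippet above only pins the username; the wikitech ProxyCommand help pages linked later in this log cover the other half, tunnelling through a bastion to reach internal instances. A rough `~/.ssh/config` sketch combining the two, assuming the public `bastion.wmflabs.org` jump host and reusing `negative24` as a stand-in for the reader's own shell account:

```
# Sketch only; adjust user, bastion, and host patterns to your own setup.
Host bastion.wmflabs.org
    User negative24

Host *.wmflabs.org !bastion.wmflabs.org *.eqiad.wmflabs
    User negative24
    # Hop through the bastion; -W forwards the connection to the target host/port.
    ProxyCommand ssh -W %h:%p negative24@bastion.wmflabs.org
```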
[11:22:46] JohnFLewis, do you know what is going on with Labs [11:22:51] specifically jsub [11:23:05] petan, ^ [11:23:15] hello [11:23:18] maybe [11:23:30] there was some announcement on wikitech that master server is fucked up [11:23:34] but I think it was fixed [11:23:38] tools.cyberbot@tools-bastion-01:~$ qstat [11:23:38] error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") [11:23:38] error: unable to contact qmaster using port 6444 on host "tools-master" [11:23:41] Just now [11:23:48] sounds like that [11:23:56] I will check but my powers are very limited [11:24:03] So it's not fixed. [11:24:26] Great my bot is crying and I can't fix it. :-( [11:24:49] ty\ [11:25:15] !log tools rebooted tools-master in order to try fix that network issues [11:25:23] Logged the message, Master [11:25:53] tools.cyberbot@tools-bastion-01:~$ qstat [11:25:53] error: commlib error: got select error (Connection refused) [11:25:53] error: unable to send message to qmaster using port 6444 on host "tools-master": got send error [11:25:58] 2 packets transmitted, 2 received, 0% packet loss, time 1001ms [11:26:16] well, network indeed has some problems, pinging between local tools nodes takes almost 1 sec [11:27:28] I can't even connect now. [11:27:32] connect where [11:27:44] to where ever jsub is. [11:27:50] I need to restart a task. [11:28:10] My bot fell and got a booboo [11:28:20] !log tools syslog is full of these May 28 11:27:36 tools-master nslcd[1041]: [81823a] error writing to client: Broken pipe [11:28:24] Logged the message, Master [11:30:06] petan, is there a way around this. Can I ssh into the cyberbot queue and manually restart? [11:30:19] not really [11:30:27] queue is not on node but on master [11:30:44] Why not. I think I've done it before. [11:31:13] What? I thought I had a specially made cyberbot-exec node just for me. [11:32:16] yes but jobs are controlled from master server [11:32:24] which seems to be broken [11:32:34] it's possible to stop tasks running on your node but not restart them [11:32:46] :/' [11:34:58] it might be also broken LDAP [11:35:09] because syslog on all nodes is kind of full of errors [12:05:37] is there a local mirror of the gerrit repos somewhere? it seems a waste to clone them all [12:14:20] !log tools petrb: test [12:14:23] Logged the message, Master [12:15:03] !log tools petrb: shutting nscd off on tools-master [12:15:06] Logged the message, Master [12:22:33] !log tools petrb: inserted some local IP's to hosts file [12:22:37] Logged the message, Master [13:39:00] 6Labs: Upgrade labs controller to Trusty - https://phabricator.wikimedia.org/T90824#1317531 (10Andrew) [13:39:46] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1317536 (10yuvipanda) @bblack figured that the problems are perhaps caused by too big /etc/hosts files, and reverting them seems to have fixed the issues. 
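(Editorial aside) The diagnosis above ("too big /etc/hosts files") and the task filed shortly after both turn on whether gridengine trips over total file size or over individual line length. A quick way to measure both on a suspect host, using only stock coreutils/awk (nothing here is specific to the Tools setup):

```
# Total size of /etc/hosts in bytes
wc -c /etc/hosts

# Length of the longest line, in characters
awk 'length > max { max = length } END { print max }' /etc/hosts

# Lines with the most aliases, which is where labsdb-style entries would show up
awk '{ print NF, $1 }' /etc/hosts | sort -rn | head -5
```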
[14:04:30] 10Tool-Labs-tools-Other: Improve interface to KMLexport - https://phabricator.wikimedia.org/T94575#1317609 (10Aklapper) [14:04:42] RECOVERY - Puppet staleness on tools-master is OK Less than 1.00% above the threshold [3600.0] [14:22:32] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0] [14:33:16] RECOVERY - Puppet staleness on tools-shadow is OK Less than 1.00% above the threshold [3600.0] [14:43:22] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 11.11% of data above the critical threshold [0.0] [14:43:38] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:43:52] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:44:09] 10Tool-Labs: Test if grid engine master non-failure depends on the lengths of /etc/hosts lines - https://phabricator.wikimedia.org/T100660#1317680 (10scfc) 3NEW [14:44:22] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:44:42] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:45:34] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL 50.00% of data above the critical threshold [0.0] [14:45:44] 10MediaWiki-extensions-OpenStackManager: OpenStackManager special pages should link to in-wiki documentation - https://phabricator.wikimedia.org/T36500#1317690 (10Amire80) [14:45:56] 10MediaWiki-extensions-OpenStackManager: OpenStackManager special pages should link to in-wiki documentation - https://phabricator.wikimedia.org/T36500#388005 (10Amire80) [14:46:22] PROBLEM - Puppet failure on tools-redis is CRITICAL 40.00% of data above the critical threshold [0.0] [14:46:38] PROBLEM - Puppet failure on tools-redis-slave is CRITICAL 30.00% of data above the critical threshold [0.0] [14:46:39] 10Tool-Labs, 5Patch-For-Review: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1317695 (10yuvipanda) I've puppetized the fix (remove /etc/hosts generation for labsdb from tools-master and -shadow), and everything seems to be ok. Filing a load of followup bugs atm. Puppet is enabled and s... [14:46:40] * andrewbogott is fixing those puppet failures, maybe [14:46:44] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:47:10] 10Tool-Labs: Test if grid engine master non-failure depends on the lengths of /etc/hosts lines - https://phabricator.wikimedia.org/T100660#1317696 (10yuvipanda) I could consistently reproduce the failures by adding the lines back. Question is whether it was line length or total file size that's the problem. 
[14:47:18] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL 33.33% of data above the critical threshold [0.0] [14:47:56] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:48:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1410 is CRITICAL 22.22% of data above the critical threshold [0.0] [14:48:19] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 40.00% of data above the critical threshold [0.0] [14:48:27] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:49:07] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 66.67% of data above the critical threshold [0.0] [14:49:07] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL 66.67% of data above the critical threshold [0.0] [14:49:27] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 33.33% of data above the critical threshold [0.0] [14:49:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 44.44% of data above the critical threshold [0.0] [14:49:51] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 40.00% of data above the critical threshold [0.0] [14:49:51] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 60.00% of data above the critical threshold [0.0] [14:50:01] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 30.00% of data above the critical threshold [0.0] [14:50:07] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 22.22% of data above the critical threshold [0.0] [14:50:31] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 60.00% of data above the critical threshold [0.0] [14:51:09] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 55.56% of data above the critical threshold [0.0] [14:51:31] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 50.00% of data above the critical threshold [0.0] [14:51:45] PROBLEM - Puppet failure on tools-trusty is CRITICAL 30.00% of data above the critical threshold [0.0] [14:51:55] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 50.00% of data above the critical threshold [0.0] [14:52:17] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 66.67% of data above the critical threshold [0.0] [14:52:31] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 40.00% of data above the critical threshold [0.0] [14:52:51] PROBLEM - Puppet failure on tools-mail is CRITICAL 40.00% of data above the critical threshold [0.0] [14:53:12] 10Tool-Labs: Figure out why exec_environ was included in gridengine master / shadow - https://phabricator.wikimedia.org/T100662#1317726 (10yuvipanda) 3NEW [14:53:32] PROBLEM - Puppet failure on tools-master is CRITICAL 60.00% of data above the critical threshold [0.0] [14:53:34] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 50.00% of data above the critical threshold [0.0] [14:53:34] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 40.00% of data above the critical threshold [0.0] [14:53:44] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 50.00% of data above the critical threshold [0.0] [14:53:44] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL 40.00% of data above the critical threshold [0.0] [14:54:26] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 30.00% of data above the critical threshold [0.0] [14:54:32] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 22.22% of data above the critical threshold [0.0] 
[14:54:38] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL 55.56% of data above the critical threshold [0.0] [14:54:44] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 60.00% of data above the critical threshold [0.0] [14:55:18] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 22.22% of data above the critical threshold [0.0] [14:55:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 40.00% of data above the critical threshold [0.0] [14:55:46] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL 60.00% of data above the critical threshold [0.0] [14:55:50] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 40.00% of data above the critical threshold [0.0] [14:56:00] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:56:56] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 60.00% of data above the critical threshold [0.0] [14:57:00] ok, here come the recoveries [14:57:11] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 20.00% of data above the critical threshold [0.0] [14:57:19] maybe [14:57:45] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 30.00% of data above the critical threshold [0.0] [14:57:58] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL 50.00% of data above the critical threshold [0.0] [14:59:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL 60.00% of data above the critical threshold [0.0] [14:59:18] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 44.44% of data above the critical threshold [0.0] [15:00:43] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ivorah was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=160913 edit summary: [15:00:58] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 50.00% of data above the critical threshold [0.0] [15:01:17] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317738 (10yuvipanda) It's back on on tools-master now, and doesn't really seem to be causing any issues atm. [15:01:48] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317743 (10yuvipanda) @fgiunchedi but we also have nslcd which does that... [15:02:13] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317754 (10yuvipanda) [15:02:16] 10Tool-Labs, 5Patch-For-Review: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1317753 (10yuvipanda) [15:08:03] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1317787 (10yuvipanda) [15:08:06] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1317783 (10yuvipanda) 5Resolved>3Open Failover failed during outage caused by T100554. tools-shadow just didn't start a master process at all, even with explicit start attempts (... 
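(Editorial aside) For the reopened failover task above, one way to see what the grid itself believes is to inspect the cell's common files and the daemons directly. Paths below assume the Debian-style layout (SGE_ROOT under /var/lib/gridengine, cell "default"); the Tools install may differ, so treat this as a sketch rather than the actual runbook:

```
SGE_ROOT=/var/lib/gridengine   # assumed location; adjust to the real install
CELL=default

# Host the cluster currently records as the acting qmaster
cat "$SGE_ROOT/$CELL/common/act_qmaster"

# The qmaster refreshes this heartbeat file periodically; sge_shadowd watches it
# and is supposed to take over when it goes stale.
ls -l "$SGE_ROOT/$CELL/common/heartbeat"

# Are the master/shadow daemons running on this host at all?
pgrep -a sge_qmaster
pgrep -a sge_shadowd
```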
[15:08:12] 10Tool-Labs, 5Patch-For-Review: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315443 (10yuvipanda) [15:08:13] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1317789 (10yuvipanda) [15:09:29] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0] [15:11:00] 10Quarry: SQL String functions not working - https://phabricator.wikimedia.org/T100057#1317797 (10Aklapper) [15:11:56] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4: Move LabsDB aliases to DNS - https://phabricator.wikimedia.org/T63897#1317802 (10yuvipanda) [15:11:57] 10Tool-Labs, 5Patch-For-Review: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1317801 (10yuvipanda) [15:13:10] 10Tool-Labs, 5Patch-For-Review: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1317803 (10yuvipanda) p:5Unbreak!>3Normal (resetting priority now that gridengine masters are not down) [15:13:19] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0] [15:13:39] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0] [15:13:49] RECOVERY - Puppet failure on tools-exec-1203 is OK Less than 1.00% above the threshold [0.0] [15:14:07] RECOVERY - Puppet failure on tools-exec-1403 is OK Less than 1.00% above the threshold [0.0] [15:14:07] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [15:14:43] RECOVERY - Puppet failure on tools-exec-1213 is OK Less than 1.00% above the threshold [0.0] [15:14:51] RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0] [15:15:00] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1317810 (10yuvipanda) [15:16:08] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [15:16:24] RECOVERY - Puppet failure on tools-redis is OK Less than 1.00% above the threshold [0.0] [15:16:38] RECOVERY - Puppet failure on tools-redis-slave is OK Less than 1.00% above the threshold [0.0] [15:16:42] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK Less than 1.00% above the threshold [0.0] [15:17:14] RECOVERY - Puppet failure on tools-exec-1218 is OK Less than 1.00% above the threshold [0.0] [15:17:18] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0] [15:17:54] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0] [15:18:03] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1410 is OK Less than 1.00% above the threshold [0.0] [15:18:19] RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0] [15:18:29] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0] [15:18:35] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0] [15:19:27] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0] [15:19:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [15:19:43] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0] [15:19:49] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0] [15:20:01] RECOVERY - 
Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0] [15:20:09] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [15:20:31] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0] [15:20:31] RECOVERY - Puppet failure on tools-webproxy-02 is OK Less than 1.00% above the threshold [0.0] [15:20:47] RECOVERY - Puppet failure on tools-exec-1208 is OK Less than 1.00% above the threshold [0.0] [15:21:33] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0] [15:21:47] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0] [15:21:53] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0] [15:22:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [15:22:51] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0] [15:22:58] RECOVERY - Puppet failure on tools-checker-01 is OK Less than 1.00% above the threshold [0.0] [15:23:34] RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0] [15:23:34] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0] [15:23:46] RECOVERY - Puppet failure on tools-bastion-02 is OK Less than 1.00% above the threshold [0.0] [15:23:46] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [15:24:02] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK Less than 1.00% above the threshold [0.0] [15:24:40] RECOVERY - Puppet failure on tools-static-01 is OK Less than 1.00% above the threshold [0.0] [15:24:40] RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0] [15:24:40] RECOVERY - Puppet failure on tools-exec-1409 is OK Less than 1.00% above the threshold [0.0] [15:25:16] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0] [15:25:30] * YuviPanda writes fun incident report now [15:25:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0] [15:25:48] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [15:25:58] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0] [15:26:00] RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0] [15:26:51] 10Tool-Labs: Test if grid engine master non-failure depends on the lengths of /etc/hosts lines - https://phabricator.wikimedia.org/T100660#1317831 (10scfc) IIRC the aliases were in that instance's `/etc/hosts` previously, so the overall size should only have decreased, //but// (cf. `source/libs/uti/sge_hostname.... [15:26:54] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0] [15:27:12] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [15:27:46] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0] [15:29:00] 10Tool-Labs: Test if grid engine master non-failure depends on the lengths of /etc/hosts lines - https://phabricator.wikimedia.org/T100660#1317832 (10yuvipanda) My current fix is to get rid of the labsdb aliases on the master - they aren't being used from there at all. 
I agree that fixing puppet's host might be... [15:29:14] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0] [15:34:24] (03CR) 10Yuvipanda: [C: 04-2] "Me and Merlijn and Coren had a discussion around this at the hackathon, and I think in the long term we should just have people use virtua" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/209978 (https://phabricator.wikimedia.org/T86015) (owner: 10Merlijn van Deen) [15:34:59] 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1317849 (10Andrew) 3NEW a:3RobH [15:35:12] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1317857 (10Andrew) [15:37:10] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1317865 (10Andrew) p:5Triage>3Normal [15:48:29] 6Labs, 10hardware-requests, 6operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1317888 (10Andrew) p:5Triage>3Normal [15:53:45] 10Tool-Labs: Test if grid engine master non-failure depends on the lengths of /etc/hosts lines - https://phabricator.wikimedia.org/T100660#1317895 (10scfc) It could be that if the host name is already cached by `nscd` (for example by debugging on the command line), `gethostbyname_r()` doesn't even try to parse `... [15:59:59] 6Labs, 10hardware-requests, 6operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1317913 (10mark) So PowerDNS supports multiple instances I think. Wouldn't this be easy to do on a separate IP? From a resources perspective, it doesn't need another full server at all... [16:07:46] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317947 (10fgiunchedi) nslcd AFAIK doesn't cache responses, so querying nss will still result in a ldap lookup (through nslcd) if nscd isn't running [16:08:23] 6Labs, 10hardware-requests, 6operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1317948 (10Andrew) that's probably fine, I will try. [16:21:20] 10Tool-Labs, 5Patch-For-Review: Reduce amount of Tools-local packages - https://phabricator.wikimedia.org/T91874#1317977 (10scfc) [16:26:30] what credentials should I use to connect to phab mysql on phab-01? [16:34:42] 6Labs, 10hardware-requests, 6operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1318026 (10RobH) a:5RobH>3Andrew [16:35:16] 6Labs, 10hardware-requests, 6operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1285990 (10RobH) Since this is now going to be a test with a second IP, I'm pulling the #hardware-request project for now. If this goes back to needing new metal, just re-append it back on. [16:35:26] 6Labs, 6operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1318029 (10RobH) [16:36:46] also, are there any python packages on phab-01 (and hopefully production) for connecting to mysql? In python3, I've tried mysql, mysql.connector, mysqldb, MySQLdb, and peewee. [16:39:40] hi guys. i'm doing some inner joins with tool labs mediawiki tables and i noticed sth... [16:39:48] if i use describe in the table pagelinks [16:39:56] no primary keys appear. [16:40:10] while in the schema pl_from and pl_namespace, pl_title seem to be primary keys [16:40:18] does anybody know why is that? 
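(Editorial aside) On the pagelinks question just above: one plausible explanation, not confirmed anywhere in this log, is that the Tool Labs replicas expose the MediaWiki tables through views that filter private data, and MySQL reports no index or key metadata for a view, so `DESCRIBE pagelinks` lists the columns but none of the indexes that the schema defines on the underlying production table (the `(pl_from, pl_namespace, pl_title)` index the user mentions). A quick check of what the replica actually exposes, assuming the usual `sql` convenience wrapper on Tools (the wrapper and the `enwiki_p` name are assumptions, not taken from the log):

```
$ sql enwiki_p
mysql> SELECT table_type FROM information_schema.tables
    ->  WHERE table_schema = 'enwiki_p' AND table_name = 'pagelinks';
mysql> SHOW CREATE TABLE pagelinks\G
```

If the first query returns `VIEW`, that would account for the empty Key column in the `DESCRIBE` output; the index still exists on the base table on the production side and the optimizer can use it when the view is queried.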
[16:46:27] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1318052 (10yuvipanda) [16:46:29] 6Labs, 10Tool-Labs: Move tools to designate - https://phabricator.wikimedia.org/T96641#1318051 (10yuvipanda) [16:46:51] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1185706 (10yuvipanda) [16:46:53] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4: Move LabsDB aliases to DNS - https://phabricator.wikimedia.org/T63897#1318060 (10yuvipanda) [17:08:57] 6Labs: stray files created in /etc/ssh/userkeys - https://phabricator.wikimedia.org/T85814#1318138 (10faidon) a:5faidon>3None [17:09:34] 6Labs: stray files created in /etc/ssh/userkeys - https://phabricator.wikimedia.org/T85814#955415 (10faidon) [17:09:36] 6Labs: /etc/ssh/userkeys/ubuntu or /etc/ssh/userkeys/admin notices for every puppet run on labs instances - https://phabricator.wikimedia.org/T94866#1318146 (10faidon) [17:31:47] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1318204 (10JAufrecht) Was looking to update documentation for this, but am not sure where that would have been. There is https://wikitech.wikimedia.org/wiki/Help:Tool_Labs, which is very similar but not intended for phab-01 or b... [17:37:16] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1318225 (10Dzahn) https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29 https://wikitech.wikimedia.org/wiki/Help:Getting_Started#Using_ProxyCommand https://wikitech.w... [17:42:12] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 22.22% of data above the critical threshold [0.0] [17:42:38] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:44:00] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL 30.00% of data above the critical threshold [0.0] [17:44:02] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318238 (10Dzahn) The nameservers for wmflabs.org are set to: whois wmflabs.org ... Name Server:LABS-NS0.WIKIMEDIA.ORG Name Server:LABS-NS1.WIKIMEDIA.ORG The IP addresses of these are (fro... 
[17:44:22] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 30.00% of data above the critical threshold [0.0] [17:44:42] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:44:50] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:45:06] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 22.22% of data above the critical threshold [0.0] [17:45:21] 10:44 < icinga-wm> PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL - Plugin timed out while executing system call [17:45:22] PROBLEM - Puppet failure on tools-static-02 is CRITICAL 55.56% of data above the critical threshold [0.0] [17:45:24] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:45:30] hey [17:45:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:45:42] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL 30.00% of data above the critical threshold [0.0] [17:46:28] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:46:32] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL 60.00% of data above the critical threshold [0.0] [17:47:10] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 22.22% of data above the critical threshold [0.0] [17:47:22] PROBLEM - Puppet failure on tools-redis is CRITICAL 60.00% of data above the critical threshold [0.0] [17:47:35] PROBLEM - Puppet failure on tools-redis-slave is CRITICAL 40.00% of data above the critical threshold [0.0] [17:47:43] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:47:45] andrewbogott: YuviPanda: are you online? 
[17:47:53] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:48:17] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL 55.56% of data above the critical threshold [0.0] [17:48:17] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 33.33% of data above the critical threshold [0.0] [17:48:59] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:49:03] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1410 is CRITICAL 44.44% of data above the critical threshold [0.0] [17:49:21] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:49:25] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:49:33] PROBLEM - Puppet failure on tools-master is CRITICAL 30.00% of data above the critical threshold [0.0] [17:49:35] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:50:07] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL 66.67% of data above the critical threshold [0.0] [17:50:25] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 44.44% of data above the critical threshold [0.0] [17:50:35] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL 30.00% of data above the critical threshold [0.0] [17:50:43] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:50:53] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 60.00% of data above the critical threshold [0.0] [17:50:54] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 60.00% of data above the critical threshold [0.0] [17:51:01] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:51:11] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 44.44% of data above the critical threshold [0.0] [17:51:15] mutante: sge outage means no login to tools ? [17:51:44] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:52:33] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 50.00% of data above the critical threshold [0.0] [17:52:45] PROBLEM - Puppet failure on tools-trusty is CRITICAL 30.00% of data above the critical threshold [0.0] [17:52:49] matanya: i don't know. 
i just noticed alert about the labs DNS server [17:52:55] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:53:02] matanya: so when i lookup labs-ns0 from my laptop it is: [17:53:06] labs-ns0.wikimedia.org has address 208.80.154.19 [17:53:14] but when i do it from neon where icinga runs: [17:53:19] labs-ns0.wikimedia.org has address 208.80.152.33 [17:53:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 40.00% of data above the critical threshold [0.0] [17:53:39] and that made icinga go critical [17:53:43] mutante: i can't login at all [17:53:51] PROBLEM - Puppet failure on tools-mail is CRITICAL 50.00% of data above the critical threshold [0.0] [17:53:57] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL 20.00% of data above the critical threshold [0.0] [17:53:58] $ ssh encoding01.eqiad.wmflabs [17:53:58] ssh: Could not resolve hostname bastion.wmflabs.org: Name or service not known [17:53:59] ssh_exchange_identification: Connection closed by remote host [17:54:33] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 50.00% of data above the critical threshold [0.0] [17:54:35] matanya: i just saw this https://phabricator.wikimedia.org/T100665 [17:54:43] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL 60.00% of data above the critical threshold [0.0] [17:54:43] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 50.00% of data above the critical threshold [0.0] [17:54:54] so something is being changed about wmflabs.org DNS [17:54:59] but see my reply [17:55:03] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL 30.00% of data above the critical threshold [0.0] [17:55:27] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 44.44% of data above the critical threshold [0.0] [17:55:31] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 33.33% of data above the critical threshold [0.0] [17:56:15] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 33.33% of data above the critical threshold [0.0] [17:56:35] mutante: tbc in -operations [17:56:37] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 60.00% of data above the critical threshold [0.0] [17:56:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 50.00% of data above the critical threshold [0.0] [17:56:57] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 20.00% of data above the critical threshold [0.0] [17:57:01] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 30.00% of data above the critical threshold [0.0] [17:58:12] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 30.00% of data above the critical threshold [0.0] [17:58:18] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL 11.11% of data above the critical threshold [0.0] [17:58:48] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 40.00% of data above the critical threshold [0.0] [18:00:12] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 44.44% of data above the critical threshold [0.0] [18:00:20] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL 22.22% of data above the critical threshold [0.0] [18:00:22] PROBLEM - Puppet failure on tools-shadow is CRITICAL 40.00% of data above the critical threshold [0.0] [18:01:06] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL 22.22% of data above the critical threshold [0.0] [18:01:10] PROBLEM - Puppet failure on tools-submit is CRITICAL 22.22% of data above the critical 
threshold [0.0] [18:01:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL 60.00% of data above the critical threshold [0.0] [18:01:36] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL 40.00% of data above the critical threshold [0.0] [18:01:44] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 20.00% of data above the critical threshold [0.0] [18:02:00] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 40.00% of data above the critical threshold [0.0] [18:02:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 33.33% of data above the critical threshold [0.0] [18:02:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 30.00% of data above the critical threshold [0.0] [18:03:54] andrewbogott_afk / YuviPanda ^ Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item labs_puppet_master in any Hiera data file and no default supplied at /etc/puppet/modules/base/manifests/init.pp:50 on node i-000009aa.eqiad.wmflabs [18:04:02] PROBLEM - Puppet failure on tools-services-01 is CRITICAL 60.00% of data above the critical threshold [0.0] [18:04:39] "ssh: connect to host bastion.wmflabs.org port 22: Connection refused" [18:05:35] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 60.00% of data above the critical threshold [0.0] [18:08:00] what's up with labs DNS? [18:08:45] andrewbogott_afk: ? [18:08:58] I don't see any response at all on the dns port from labs-ns[01] [18:10:58] bblack: see this: [18:11:04] from neon / icinga: [18:11:09] labs-ns0.wikimedia.org has address 208.80.152.33 [18:11:15] from my laptop: [18:11:19] labs-ns0.wikimedia.org has address 208.80.154.19 [18:11:23] how is that? [18:11:35] yeah we know how labs-ns0 is messed up from the public DNS POV [18:11:40] and then this ticket: https://phabricator.wikimedia.org/T100665 [18:11:50] but regardless, the "correct" addresses for labs-ns[01] are not respoding to DNS requests, no servers running [18:12:05] and even the broken public side should reach one of those addresses [18:12:41] i see this recent change: [18:12:49] https://gerrit.wikimedia.org/r/#/c/213543/12/manifests/role/dns.pp [18:13:18] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318330 (10BBlack) The Nameserver IPs that the registrar stores are separate from the ones you're looking up in e.g. dig or whatever. They're part of the whois system. In whois, they have t... [18:13:36] seems that is the only one by andrew from today [18:13:40] that was merged [18:15:11] I'm trying a verbose puppet run on virt1000 (ns1) to see what it is or isn't currently trying to run there [18:15:16] it should have a DNS server running :/ [18:15:45] I assume the DNS issues are why metrics.wmflabs.org is down? [18:15:53] yes [18:16:02] thanks. 
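(Editorial aside) The mismatch above, labs-ns0.wikimedia.org answering as 208.80.152.33 from neon but 208.80.154.19 from a laptop, is consistent with the stale registrar host records that T100665 turned out to be about, since the parent zone can hand out its own copy of a name server's address. A hedged way to compare the different views with stock dig (nothing here is hard-coded to a particular resolver):

```
# What a normal recursive lookup returns from wherever you are
dig +short labs-ns0.wikimedia.org A

# Ask a .org TLD server directly, no recursion; the ADDITIONAL section may carry
# the registry's host (glue) records for the delegated name servers.
ORG_NS=$(dig +short NS org. | head -1)
dig +norecurse @"$ORG_NS" wmflabs.org NS

# And whether the delegated server itself is actually answering for labs names
dig +norecurse @labs-ns0.wikimedia.org tools-login.wmflabs.org A
```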
[18:16:12] bblack: but that was from the output of "whois" [18:16:18] Name Server:LABS-NS0.WIKIMEDIA.ORG [18:16:25] re: ticket response [18:19:00] I've manually fixed it for now, by starting the pdns daemon on virt1000 [18:19:13] it seems like puppet should have been starting that, but I'm not wading into that since I have no idea what's going on [18:19:35] it may take time for negative cache entries to expire or whatever, and there are delays from the bad IP at the registrars [18:19:41] but with sufficient timeouts it should work again now [18:19:54] bblack-mba:~ bblack$ host tools-login.wmflabs.org [18:19:54] tools-login.wmflabs.org has address 208.80.155.130 [18:23:16] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318348 (10Dzahn) but that was output from the "whois" command: Name Server:LABS-NS0.WIKIMEDIA.ORG Name Server:LABS-NS1.WIKIMEDIA.ORG [18:25:46] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318367 (10BBlack) Whois (well really the registrar data that's sent to the TLD origin servers, but that's what whois reflects) also stores IP addresses for those hosts, independently of howe... [18:27:44] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318383 (10BBlack) Heh - the last comment is incorrect! I re-ran the commands and pasted quickly without looking. The old bad data that was there earlier today was this: ``` bblack-mba:~ b... [18:29:09] 6Labs, 6operations: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318405 (10Andrew) 5Open>3Resolved Yes, registrar emailed directly an hour ago and said it was done. [18:34:41] YuviPanda, so how is it hard to get access to instance proxy logs to analyze a tool's usage stats? [19:01:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0] [19:01:39] MaxSem: I'm not sure if we store any, actually. If we store any, I guess we should follow the same process as in prod? 
(not sure what that process is, though) [19:02:12] we have some, they're not stored not for long though [19:02:52] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0] [19:02:56] yeah, there's some logs, rotated daily, one backup [19:03:10] (access.log, .log.1, error.log, .log.1) [19:03:51] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0] [19:03:57] RECOVERY - Puppet failure on tools-checker-01 is OK Less than 1.00% above the threshold [0.0] [19:04:45] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [19:05:03] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK Less than 1.00% above the threshold [0.0] [19:05:25] RECOVERY - Puppet failure on tools-static-01 is OK Less than 1.00% above the threshold [0.0] [19:05:33] RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0] [19:06:13] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0] [19:06:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [19:06:57] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0] [19:07:03] RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0] [19:08:11] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [19:08:17] RECOVERY - Puppet failure on tools-exec-1205 is OK Less than 1.00% above the threshold [0.0] [19:08:45] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0] [19:09:01] RECOVERY - Puppet failure on tools-services-01 is OK Less than 1.00% above the threshold [0.0] [19:13:18] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0] [19:13:19] RECOVERY - Puppet failure on tools-static-02 is OK Less than 1.00% above the threshold [0.0] [19:13:19] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0] [19:13:19] RECOVERY - Puppet failure on tools-exec-cyberbot is OK Less than 1.00% above the threshold [0.0] [19:13:20] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0] [19:13:20] RECOVERY - Puppet failure on tools-exec-1217 is OK Less than 1.00% above the threshold [0.0] [19:13:20] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0] [19:13:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK Less than 1.00% above the threshold [0.0] [19:13:21] RECOVERY - Puppet failure on tools-exec-1402 is OK Less than 1.00% above the threshold [0.0] [19:13:21] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0] [19:13:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0] [19:13:23] for the record: I’m going to break puppet again, but not until after I have some lunch. 
[19:13:23] sorry about all the noise [19:13:26] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0] [19:13:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0] [19:13:26] RECOVERY - Puppet failure on tools-redis is OK Less than 1.00% above the threshold [0.0] [19:13:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0] [19:13:27] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0] [19:13:53] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:13:57] RECOVERY - Puppet failure on tools-exec-1219 is OK Less than 1.00% above the threshold [0.0] [19:14:22] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0] [19:14:40] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0] [19:14:48] RECOVERY - Puppet failure on tools-exec-1203 is OK Less than 1.00% above the threshold [0.0] [19:15:21] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0] [19:15:31] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [19:15:39] RECOVERY - Puppet failure on tools-exec-1213 is OK Less than 1.00% above the threshold [0.0] [19:15:57] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:16:27] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:16:29] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0] [19:16:29] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:16:31] RECOVERY - Puppet failure on tools-webproxy-02 is OK Less than 1.00% above the threshold [0.0] [19:16:55] andrewbogott: np. Maybe +q shinken-wm? [19:17:11] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [19:17:15] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:17:32] valhallasw: How do I do that? 
[19:17:36] RECOVERY - Puppet failure on tools-redis-slave is OK Less than 1.00% above the threshold [0.0] [19:17:41] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK Less than 1.00% above the threshold [0.0] [19:17:45] andrewbogott: /msg chanserv op #wikimedia-labs [19:17:48] then /mode +q shinken-wm [19:17:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:17:50] I think [19:17:53] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0] [19:17:59] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 20.00% of data above the critical threshold [0.0] [19:18:01] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:18:17] RECOVERY - Puppet failure on tools-exec-1218 is OK Less than 1.00% above the threshold [0.0] [19:18:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0] [19:18:37] hm, seems not [19:18:55] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0] [19:19:03] hrm. maybe +b it then? as long as it isn't kicked, that should work [19:19:05] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1410 is OK Less than 1.00% above the threshold [0.0] [19:19:13] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:19:17] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL 22.22% of data above the critical threshold [0.0] [19:19:19] RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0] [19:19:25] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0] [19:19:33] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0] [19:19:34] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0] [19:19:45] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:20:07] RECOVERY - Puppet failure on tools-exec-1403 is OK Less than 1.00% above the threshold [0.0] [19:20:08] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [19:20:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0] [19:20:35] RECOVERY - Puppet failure on tools-exec-1409 is OK Less than 1.00% above the threshold [0.0] [19:20:43] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0] [19:20:51] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0] [19:20:54] RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0] [19:21:00] RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0] [19:21:05] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL 100.00% of data above the critical threshold [0.0] [19:21:07] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [19:21:13] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:21:21] PROBLEM - Puppet failure on tools-shadow is CRITICAL 60.00% of data above the critical threshold [0.0] [19:21:21] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL 22.22% of data above the critical 
threshold [0.0] [19:21:37] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 20.00% of data above the critical threshold [0.0] [19:21:47] RECOVERY - Puppet failure on tools-exec-1208 is OK Less than 1.00% above the threshold [0.0] [19:22:05] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:22:11] PROBLEM - Puppet failure on tools-submit is CRITICAL 33.33% of data above the critical threshold [0.0] [19:22:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:22:31] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0] [19:22:37] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:22:45] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0] [19:22:45] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:22:59] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:23:13] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 22.22% of data above the critical threshold [0.0] [19:23:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:23:23] PROBLEM - Puppet failure on tools-redis is CRITICAL 20.00% of data above the critical threshold [0.0] [19:23:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [19:23:36] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:23:37] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:24:32] RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0] [19:24:46] RECOVERY - Puppet failure on tools-bastion-02 is OK Less than 1.00% above the threshold [0.0] [19:24:58] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:25:02] PROBLEM - Puppet failure on tools-services-01 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:25:24] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:25:38] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:25:50] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:25:56] no dice [19:26:01] some of those are real, not my breakages. 
[19:26:08] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:26:08] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:26:11] Duplicate declaration: Class[Diamond::Collector_module] is already declared in file /etc/puppet/modules/diamond/manifests/collector.pp:76; cannot redeclare at /etc/puppet/modules/diamond/manifests/collector.pp:76 [19:26:22] PROBLEM - Puppet failure on tools-static-02 is CRITICAL 44.44% of data above the critical threshold [0.0] [19:26:24] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:26:42] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:26:52] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:27:26] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:27:34] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:28:08] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:28:36] PROBLEM - Puppet failure on tools-redis-slave is CRITICAL 50.00% of data above the critical threshold [0.0] [19:28:43] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:28:53] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:29:19] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 44.44% of data above the critical threshold [0.0] [19:29:19] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:29:57] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:30:05] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1410 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:30:19] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:30:25] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:30:31] PROBLEM - Puppet failure on tools-master is CRITICAL 40.00% of data above the critical threshold [0.0] [19:30:35] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:31:25] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:31:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 66.67% of data above the critical threshold [0.0] [19:31:37] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL 22.22% of data above the critical threshold [0.0] [19:31:45] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:31:49] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:32:03] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:32:09] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:32:39] PROBLEM - Puppet failure on 
tools-webgrid-lighttpd-1407 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:32:45] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:33:31] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 55.56% of data above the critical threshold [0.0] [19:33:45] PROBLEM - Puppet failure on tools-trusty is CRITICAL 40.00% of data above the critical threshold [0.0] [19:34:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 44.44% of data above the critical threshold [0.0] [19:34:52] PROBLEM - Puppet failure on tools-mail is CRITICAL 50.00% of data above the critical threshold [0.0] [19:35:45] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:35:45] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:36:33] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 70.00% of data above the critical threshold [0.0] [19:59:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [19:59:53] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0] [20:00:35] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0] [20:00:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [20:01:31] RECOVERY - Puppet failure on tools-static-01 is OK Less than 1.00% above the threshold [0.0] [20:01:31] RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0] [20:01:35] RECOVERY - Puppet failure on tools-exec-1409 is OK Less than 1.00% above the threshold [0.0] [20:01:43] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0] [20:02:39] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0] [20:02:45] RECOVERY - Puppet failure on tools-exec-1208 is OK Less than 1.00% above the threshold [0.0] [20:02:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [20:03:16] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0] [20:03:44] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0] [20:03:54] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0] [20:05:44] RECOVERY - Puppet failure on tools-bastion-02 is OK Less than 1.00% above the threshold [0.0] [20:05:56] RECOVERY - Puppet failure on tools-checker-01 is OK Less than 1.00% above the threshold [0.0] [20:06:02] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK Less than 1.00% above the threshold [0.0] [20:06:12] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0] [20:06:32] RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0] [20:07:16] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0] [20:07:58] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0] [20:08:00] RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0] [20:09:10] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [20:09:18] RECOVERY - Puppet failure 
on tools-exec-1205 is OK Less than 1.00% above the threshold [0.0] [20:09:45] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0] [20:10:01] RECOVERY - Puppet failure on tools-services-01 is OK Less than 1.00% above the threshold [0.0] [20:11:19] RECOVERY - Puppet failure on tools-static-02 is OK Less than 1.00% above the threshold [0.0] [20:11:19] RECOVERY - Puppet failure on tools-exec-cyberbot is OK Less than 1.00% above the threshold [0.0] [20:11:19] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0] [20:11:37] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0] [20:12:05] RECOVERY - Puppet failure on tools-exec-1217 is OK Less than 1.00% above the threshold [0.0] [20:12:09] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0] [20:12:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK Less than 1.00% above the threshold [0.0] [20:12:33] RECOVERY - Puppet failure on tools-webproxy-02 is OK Less than 1.00% above the threshold [0.0] [20:12:37] RECOVERY - Puppet failure on tools-exec-1402 is OK Less than 1.00% above the threshold [0.0] [20:12:43] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0] [20:13:01] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0] [20:13:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0] [20:13:25] RECOVERY - Puppet failure on tools-redis is OK Less than 1.00% above the threshold [0.0] [20:13:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0] [20:13:34] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0] [20:14:57] RECOVERY - Puppet failure on tools-exec-1219 is OK Less than 1.00% above the threshold [0.0] [20:15:18] RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0] [20:15:22] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0] [20:15:49] RECOVERY - Puppet failure on tools-exec-1203 is OK Less than 1.00% above the threshold [0.0] [20:16:07] RECOVERY - Puppet failure on tools-exec-1403 is OK Less than 1.00% above the threshold [0.0] [20:16:07] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [20:16:22] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0] [20:16:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0] [20:16:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [20:16:42] RECOVERY - Puppet failure on tools-exec-1213 is OK Less than 1.00% above the threshold [0.0] [20:16:50] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0] [20:16:54] RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0] [20:17:07] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [20:17:31] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0] [20:18:11] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [20:18:33] RECOVERY - Puppet 
failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0]
[20:18:37] RECOVERY - Puppet failure on tools-redis-slave is OK Less than 1.00% above the threshold [0.0]
[20:18:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK Less than 1.00% above the threshold [0.0]
[20:18:52] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0]
[20:19:17] RECOVERY - Puppet failure on tools-exec-1218 is OK Less than 1.00% above the threshold [0.0]
[20:19:18] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0]
[20:19:57] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0]
[20:20:04] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1410 is OK Less than 1.00% above the threshold [0.0]
[20:20:28] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0]
[20:20:34] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0]
[20:20:40] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0]
[20:20:42] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0]
[20:22:02] RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0]
[20:52:46] ok, now I’m (probably) going to break puppet everywhere again.
[21:52:17] running eval.php on jobrunner01 gives "bus error" http://pastebin.com/fzrjQ6LW
[21:52:34] * AaronSchulz wonders if hhvm changed lately
[22:17:19] (03PS1) 10Sitic: Fixed loading of config, added species logo [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/214498 (https://phabricator.wikimedia.org/T100702)
[22:17:48] (03CR) 10Sitic: [C: 032 V: 032] Fixed loading of config, added species logo [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/214498 (https://phabricator.wikimedia.org/T100702) (owner: 10Sitic)
[22:38:45] 6Labs: Don't hardcode virt1000 as the labs puppetmaster - https://phabricator.wikimedia.org/T100317#1319091 (10Andrew) 5Open>3Resolved ok, now we have service names labs-puppetmaster-eqiad and labs-puppetmaster-codfw. Instances are switching over to the new puppetmaster name; salt seems to be still working...
[22:39:29] I am stunned that that didn’t cause a puppet outage
[23:42:34] 6Labs, 10Labs-Infrastructure, 6operations, 10wikitech.wikimedia.org, 7Ipv6: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1319254 (10Krenair)
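To illustrate the T100317 change above (pointing agents at a puppetmaster service name instead of hardcoding the virt1000 host), here is a minimal, hypothetical sketch. The class name, hiera key, and the labs-puppetmaster-eqiad.wikimedia.org FQDN are assumptions for illustration, not the actual operations/puppet code.
```
# Hypothetical sketch: keep the puppetmaster out of the code and feed it in
# as data, so cutting over from a hostname (e.g. virt1000) to a service
# name only means changing one hiera value. Class name, hiera key and FQDN
# are illustrative assumptions.
class labs::puppet_agent (
  $puppetmaster = hiera('labs_puppetmaster', 'labs-puppetmaster-eqiad.wikimedia.org'),
) {
  # ini_setting comes from the puppetlabs-inifile module.
  ini_setting { 'puppet agent server':
    ensure  => present,
    path    => '/etc/puppet/puppet.conf',
    section => 'agent',
    setting => 'server',
    value   => $puppetmaster,
    notify  => Service['puppet'],
  }

  service { 'puppet':
    ensure => running,
    enable => true,
  }
}
```
With a setup along these lines, repointing every instance is a single data or DNS change behind the service name rather than a mass code edit, which is presumably why the switch-over described in T100317 could happen without a visible puppet outage.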