[00:06:37] <grrrit-wm>	 (03PS1) 10Dduvall: Better encapsulation of API-related functionality [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180079 
[00:06:39] <grrrit-wm>	 (03PS1) 10Dduvall: Fixed screenshots upon failed scenarios [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180080 
[00:18:48] <grrrit-wm>	 (03PS1) 10Dduvall: API helper method for creating accounts [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180084 
[00:19:55] <shinken-wm>	 RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[00:27:27] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[00:31:34] <wmf-insecte>	 Yippee, build fixed!
[00:31:34] <wmf-insecte>	 Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #347: FIXED in 36 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/347/
[00:34:54] <shinken-wm>	 PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[00:36:18] <shinken-wm>	 PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]  
[00:36:57] <shinken-wm>	 PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]  
[00:38:15] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[00:38:49] <shinken-wm>	 PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]  
[00:41:57] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]  
[00:42:09] <shinken-wm>	 PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]  
[00:42:47] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[00:43:05] <shinken-wm>	 PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]  
[00:46:17] <wmf-insecte>	 Project beta-scap-eqiad build #34079: FAILURE in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34079/
[00:46:21] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[00:46:39] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[00:50:51] <shinken-wm>	 PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]  
[00:52:25] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[00:52:35] <shinken-wm>	 PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[00:56:32] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[00:56:36] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[00:56:41] <wmf-insecte>	 Yippee, build fixed!
[00:56:41] <wmf-insecte>	 Project beta-scap-eqiad build #34080: FIXED in 2 min 46 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34080/
[00:57:32] <shinken-wm>	 PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[00:59:06] <shinken-wm>	 PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[00:59:50] <shinken-wm>	 PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[00:59:54] <shinken-wm>	 RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:01:59] <shinken-wm>	 RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:06:23] <shinken-wm>	 PROBLEM - Puppet failure on deployment-restbase03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[01:06:28] <wmf-insecte>	 Project beta-scap-eqiad build #34081: FAILURE in 2 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34081/
[01:07:09] <shinken-wm>	 RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:08:14] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:09:07] <shinken-wm>	 PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[01:09:17] <shinken-wm>	 PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[01:10:41] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[01:10:53] <shinken-wm>	 PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]  
[01:12:57] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[01:13:11] <shinken-wm>	 PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[01:16:18] <shinken-wm>	 RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:19:14] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[01:19:22] <shinken-wm>	 PROBLEM - Puppet failure on deployment-stream is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[01:21:18] <shinken-wm>	 RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:21:57] <shinken-wm>	 RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:22:38] <shinken-wm>	 RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:26:27] <wmf-insecte>	 Yippee, build fixed!
[01:26:27] <wmf-insecte>	 Project beta-scap-eqiad build #34083: FIXED in 2 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34083/
[01:26:33] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:27:19] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[01:27:47] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:29:48] <shinken-wm>	 RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:33:26] <shinken-wm>	 PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[01:36:27] <wmf-insecte>	 Project beta-scap-eqiad build #34084: FAILURE in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34084/
[01:36:42] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:37:53] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:38:13] <shinken-wm>	 RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:39:07] <shinken-wm>	 RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:40:33] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[01:40:39] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:40:49] <shinken-wm>	 RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:43:50] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]  
[01:44:15] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:44:23] <shinken-wm>	 RECOVERY - Puppet failure on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:46:08] <wmf-insecte>	 Yippee, build fixed!
[01:46:08] <wmf-insecte>	 Project beta-scap-eqiad build #34085: FIXED in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34085/
[01:47:32] <shinken-wm>	 RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:47:38] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[01:48:48] <shinken-wm>	 RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:50:06] <shinken-wm>	 PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[01:51:23] <shinken-wm>	 RECOVERY - Puppet failure on deployment-restbase03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:51:52] <shinken-wm>	 PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]  
[01:54:07] <shinken-wm>	 PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[02:01:34] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:04:06] <shinken-wm>	 RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:05:31] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:08:07] <shinken-wm>	 RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:08:45] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:12:20] <shinken-wm>	 RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:12:40] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:14:17] <shinken-wm>	 RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:15:09] <shinken-wm>	 RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:16:51] <shinken-wm>	 RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:19:04] <shinken-wm>	 RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:20:53] <shinken-wm>	 RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:23:27] <shinken-wm>	 RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[02:27:29] <wikibugs>	 3MediaWiki-extensions-TimedMediaHandler, Continuous-Integration: hhvm on CI slaves can not find PEAR.php - https://phabricator.wikimedia.org/T78556#849777 (10Krinkle) The usage for both CirrusSearch and TimedMediaHandler stems from OggHandler:  [[ https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandl...
[03:42:31] <wmf-insecte>	 Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #192: FAILURE in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/192/
[04:11:15] <shinken-wm>	 PROBLEM - Puppet staleness on deployment-sca-cache01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]  
[04:15:18] <wmf-insecte>	 Project beta-scap-eqiad build #34100: FAILURE in 1 min 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34100/
[04:35:12] <wmf-insecte>	 Yippee, build fixed!
[04:35:13] <wmf-insecte>	 Project beta-scap-eqiad build #34102: FIXED in 1 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34102/
[05:54:50] <wmf-insecte>	 Project beta-scap-eqiad build #34110: FAILURE in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34110/
[06:15:13] <wmf-insecte>	 Yippee, build fixed!
[06:15:14] <wmf-insecte>	 Project beta-scap-eqiad build #34112: FIXED in 1 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34112/
[07:40:38] <wmf-insecte>	 Yippee, build fixed!
[07:40:39] <wmf-insecte>	 Project browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce build #227: FIXED in 20 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce/227/
[08:15:30] <wikibugs>	 3Beta-Cluster: Puppet failures on deployment-bastion - https://phabricator.wikimedia.org/T75520#849860 (10Joe) I added the correct patch here https://gerrit.wikimedia.org/r/#/c/180126/  Now on beta you just need to configure role::deployment::deployment_servers::common::key_source to some file in the beta privat...
[08:28:53] <wmf-insecte>	 Yippee, build fixed!
[08:28:53] <wmf-insecte>	 Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #193: FIXED in 41 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/193/
[08:38:38] <grrrit-wm>	 (03PS4) 10Adrian Lang: Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 
[08:39:23] <grrrit-wm>	 (03CR) 10Adrian Lang: "Rebased. Also, Ife21da0b9989d5d292f73ad9f946e34215bb6be7 should fix the issue in WikibaseJavaScriptApi." [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang)
[09:17:12] <wmf-insecte>	 Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #348: FAILURE in 33 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/348/
[09:28:54] <Jagori>	 zeljkof:Hi!areyou busy?
[09:29:09] <zeljkof>	 sorry, in a meeting
[09:29:13] <zeljkof>	 Jagori: ^
[09:29:15] <Jagori>	 no issues
[09:31:47] <wmf-insecte>	 Yippee, build fixed!
[09:31:47] <wmf-insecte>	 Project browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce build #366: FIXED in 1 hr 23 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce/366/
[10:35:19] <wmf-insecte>	 Project beta-scap-eqiad build #34139: FAILURE in 1 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34139/
[10:36:55] <wikibugs>	 3MediaWiki-extensions-TimedMediaHandler, Multimedia, Continuous-Integration: hhvm on CI slaves can not find PEAR.php - https://phabricator.wikimedia.org/T78556#850078 (10Gilles)
[10:45:34] <wikibugs>	 3Release-Engineering, Phabricator.org, Phabricator: Answer questions about ongoing maintenance of phabricator customizations/extensions - https://phabricator.wikimedia.org/T78464#850106 (10mmodell) @dzahn: > just to be clear about it, this is _not T175 ?   I simply meant to say that the phabricator team is not /...
[10:55:27] <wmf-insecte>	 Yippee, build fixed!
[10:55:27] <wmf-insecte>	 Project beta-scap-eqiad build #34141: FIXED in 1 min 22 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34141/
[11:31:45] <wikibugs>	 3Quality-Assurance, Release-Engineering: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#850162 (10akosiaris)
[11:31:47] <wikibugs>	 3Quality-Assurance, Release-Engineering: role/phabricator.pp include a password class in the global puppet scope - https://phabricator.wikimedia.org/T78344#850159 (10akosiaris) 5Open>3Resolved a:3akosiaris After I ran a (noop) catalog compiling test, I merged and ran puppet on iridium. No changes as expect...
[12:51:54] <shinken-wm>	 PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[12:58:13] <shinken-wm>	 PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]  
[13:08:52] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[13:10:18] <shinken-wm>	 PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[13:21:47] <kart_>	 hashar: feel free to merge, https://gerrit.wikimedia.org/r/#/c/178772 :)
[13:21:55] <shinken-wm>	 RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:22:17] <wikibugs>	 3Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#850376 (10hashar)
[13:22:20] <wikibugs>	 3Multimedia, Continuous-Integration, MediaWiki-extensions-TimedMediaHandler: hhvm on CI slaves can not find PEAR.php - https://phabricator.wikimedia.org/T78556#850374 (10hashar) 5Open>3Resolved TimedMediaHandler is happy, thank you! https://integration.wikimedia.org/ci/job/mwext-CirrusSearch-testextension-hh...
[13:30:04] <shinken-wm>	 PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]  
[13:35:17] <shinken-wm>	 RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:37:13] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[13:38:51] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:40:13] <YuviPanda>	 fwiw, there are actually no puppet failures
[13:43:10] <shinken-wm>	 RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:50:59] <hashar>	 !log deleting integration-slave1001 and recreating it. It is blocked on boot and we can't console on it https://phabricator.wikimedia.org/T76250
[13:51:01] <qa-morebots>	 Logged the message, Master
[13:51:17] <shinken-wm>	 PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[13:52:55] <shinken-wm>	 PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]  
[13:55:07] <shinken-wm>	 RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:56:38] <wikibugs>	 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850455 (10hashar) a:3hashar Thanks @yuvipanda.  I have deleted the old instance and recreated it (with IP 10.68.17.119)...
[13:57:32] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[13:57:34] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]  
[14:00:16] <YuviPanda>	 hmm
[14:00:21] <YuviPanda>	 these are actually all real transient errors
[14:00:27] <YuviPanda>	 from DNS failure and LVM issues?!
[14:01:26] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[14:01:49] <wikibugs>	 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850484 (10hashar) 3NEW a:3hashar
[14:02:23] <wikibugs>	 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850493 (10hashar) Puppet choke on newly created Precise instances T78661  :-(
[14:03:05] <wikibugs>	 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850484 (10hashar)
[14:03:32] <hashar>	 YuviPanda: I haven't looked yet at the puppet failures
[14:04:08] <YuviPanda>	 hashar: hmm, it could possibly be lvm
[14:04:09] <hashar>	 cache-bits01 had a resolution error indeed 
[14:04:10] <hashar>	 Error: /Stage[main]/Role::Protoproxy::Ssl::Beta::Common/Install_certificate[star.wmflabs.org]/File[/etc/ssl/localcerts/star.wmflabs.org.crt]: Could not evaluate: getaddrinfo: Temporary failure in name resolution Could not retrieve file metadata for puppet:///files/ssl/star.wmflabs.org.crt: getaddrinfo: Temporary failure in name resolution
[14:04:25] <YuviPanda>	 hashar: yeah, and some others have lvm ones
[14:04:29] <hashar>	 is dnsmasq running on virt1000 ?
[14:04:42] <hashar>	 the lvm do a host lookup as well
[14:04:45] <YuviPanda>	 should be
[14:04:47] <YuviPanda>	 aah, hmm
[14:05:01] <YuviPanda>	 hashar: is the new integration-slave1001 having the same issue?
[14:05:15] <hashar>	 na seems a certname which is not recognized by the puppet master
[14:11:36] <YuviPanda>	 hashar: fixed :)
[14:11:41] <YuviPanda>	 manually cleaned and re-requested
[14:11:56] <YuviPanda>	 hashar: puppet is running now
[14:12:49] <hashar>	 YuviPanda: I am wondering whether it is transient or whether we suddenly lost the ability to create working Precise instances :D
[14:13:11] <YuviPanda>	 hashar: I hope it's the latter, but we'll find out the next time someone creates a precise instance :)
[14:16:12] <wikibugs>	 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850538 (10yuvipanda) Manually cleaned the old cert and requested a new one and it's alright now, for this instance. let's see if this recurs.
[14:16:20] <shinken-wm>	 RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]  
[14:17:59] <shinken-wm>	 RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[14:22:24] <wikibugs>	 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850549 (10hashar) 5Open>3Resolved Thanks, per our discussion lets close this and figure out later on when someone create another Precise instance. Might have been a tran...
[14:22:25] <wikibugs>	 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850551 (10hashar)
[14:22:33] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0]  
[14:22:37] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[14:27:16] <shinken-wm>	 PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[14:30:31] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] "All fine. Jobs are deployed, will reload Zuul when its merged." [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang)
[14:42:16] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang)
[14:45:24] <grrrit-wm>	 (03PS5) 10Hashar: Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang)
[14:46:21] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] "Was missing the qunit jenkins job https://gerrit.wikimedia.org/r/#/c/176232/4..5/jjb/mediawiki-extensions.yaml,unified" [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang)
[14:51:52] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang)
[14:57:16] <shinken-wm>	 RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]  
[14:57:32] <grrrit-wm>	 (03PS2) 10Hashar: Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm)
[15:00:10] <grrrit-wm>	 (03PS3) 10Hashar: Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm)
[15:00:32] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm)
[15:03:02] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm)
[15:05:59] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm)
[15:07:47] <wikibugs>	 3Wikibugs, Continuous-Integration: Set up jenkins jobs for labs/tools/wikibugs2 repo - https://phabricator.wikimedia.org/T75707#850640 (10hashar) 5Open>3Resolved a:3hashar The CI change have been deployed. Thanks @legoktm :)
[15:08:53] <grrrit-wm>	 (03PS1) 10Hashar: mwext-WikibaseJavaScriptApi-qunit non voting [integration/config] - 10https://gerrit.wikimedia.org/r/180189 
[15:10:03] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] mwext-WikibaseJavaScriptApi-qunit non voting [integration/config] - 10https://gerrit.wikimedia.org/r/180189 (owner: 10Hashar)
[15:10:51] <grrrit-wm>	 (03Merged) 10jenkins-bot: mwext-WikibaseJavaScriptApi-qunit non voting [integration/config] - 10https://gerrit.wikimedia.org/r/180189 (owner: 10Hashar)
[15:25:12] <wikibugs>	 3Phabricator, Phabricator.org, Release-Engineering: Answer questions about ongoing maintenance of phabricator customizations/extensions - https://phabricator.wikimedia.org/T78464#850658 (10Qgil) >>! In T78464#847841, @Dzahn wrote: >>>! In T78464#846063, @MZMcBride wrote: >> key selling point to switching to Phab...
[15:34:39] <wikibugs>	 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850672 (10hashar) So I deleted the old blocked instance and created a fresh one.  Applied the manifests I needed and did...
[15:35:34] <wikibugs>	 3Quality-Assurance: rubocop fixes in mediawiki/selenium - https://phabricator.wikimedia.org/T75898#850673 (10zeljkofilipin) @stan3, I do not think I have tried that so far, but looks like it at least confused gerrit.
[15:38:53] <wikibugs>	 3Quality-Assurance: rubocop fixes in mediawiki/selenium - https://phabricator.wikimedia.org/T75898#850685 (10hashar) You want to push your change in Gerrit with:   git push gerrit whateverlocalbranch:refs/for/env-abstraction-layer  This way Gerrit will know they are for the env-abstraction-layer branch.   It is...
[15:40:22] <wikibugs>	 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850690 (10hashar)
[15:40:32] <wikibugs>	 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#793736 (10hashar) [Get console output](https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=conso...
[15:45:59] <wikibugs>	 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850700 (10hashar) From coren, the relevant bits are: ``` tmpfs: Bad value 'jenkins-deploy' for mount option 'uid' ... An...
[15:51:36] <wikibugs>	 3Beta-Cluster: Beta labs Special:Contributions lags by a long time - https://phabricator.wikimedia.org/T78671#850716 (10Cmcmahon) 3NEW
[15:53:07] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[15:57:56] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 49150 bytes in 0.686 second response time  
[16:10:50] <wikibugs>	 3Release-Engineering: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#850768 (10Dzahn) i think this should only be disabled for files in ./role/ but _not_ for files in ./modules/.  Role classes will never be in autoload module classes should alway...
[16:12:07] <wikibugs>	 3Release-Engineering: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#850776 (10Dzahn) this might also be a duplicate of T1289
[16:13:18] <wikibugs>	 3Continuous-Integration: jenkins - operations-puppet-puppetlint-lenient - --no-autoloader_layout-check - https://phabricator.wikimedia.org/T75117#850781 (10Dzahn) also see T1289 . it might be a duplicate. i don't think T1289 should be resolved as originally requested though. let's not remove the entire check ple...
[16:13:45] <wikibugs>	 3Continuous-Integration: jenkins - operations-puppet-puppetlint-lenient - --no-autoloader_layout-check - https://phabricator.wikimedia.org/T75117#850786 (10Dzahn)
[16:14:01] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1818 bytes in 3.059 second response time  
[16:17:43] <wikibugs>	 3Beta-Cluster, MediaWiki-extensions-Flow: Beta labs Special:Contributions lags by a long time - https://phabricator.wikimedia.org/T78671#850716 (10hashar)
[16:18:26] <wikibugs>	 3Beta-Cluster, MediaWiki-extensions-Flow: Beta labs Special:Contributions lags by a long time - https://phabricator.wikimedia.org/T78671#850716 (10hashar) Looking at https://logstash-beta.wmflabs.org/ hhvm reports a bunch of slow queries such as. The Selenium user having user_id 820.  ```SlowTimer [59999ms] at r...
[16:18:35] * greg-g does the cafe/friend house hopping in search of wifi for our call
[16:19:06] <hashar>	 heading out to grab daughter /  grocery.  Be back for the call :D
[16:46:50] <chrismcmahon>	 beta labs is down "(Cannot contact the database server: Can't connect to MySQL server on '10.68.17.94' (4) (10.68.17.94))"
[16:49:31] <bd808>	 chrismcmahon: mutante is looking I think, see #-labs
[16:51:15] <greg-g>	 good timing
[17:02:24] <mutante>	 hi. could you paste the last lines from icinga-wm?
[17:02:37] <chrismcmahon>	 greg-g: we're all here
[17:08:47] <bd808>	 !log deployment-salt:/var/lib/git/operations/puppet is a rebase hell of cherry-picks that don't apply
[17:08:50] <qa-morebots>	 Logged the message, Master
[17:08:56] <wikibugs>	 3Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#850916 (10hashar) + RelEng project to make this task appear on our team work board.
[17:08:57] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 49150 bytes in 0.716 second response time  
[17:09:23] <bd808>	 !log trying to fix it without using important changes
[17:09:25] <qa-morebots>	 Logged the message, Master
[17:10:25] <mutante>	 there, beta is fixed by labs 
[17:10:34] <mutante>	 Coren restarted a db server instance
[17:13:42] <hashar>	 :-(
[17:13:50] <bd808>	 !log removed cherry pick of I6084f49e97c855286b86dbbd6ce8e80e94069492 (merged by Ori with a change)
[17:13:53] <qa-morebots>	 Logged the message, Master
[17:14:58] <bd808>	 !log removed cherry pick of I08c24578596506a1a8baedb7f4a42c2c78be295a (-2 by _joe_ in gerrit; replaced by Iba742c94aa3df7497fbff52a856d7ba16cf22cc7)
[17:15:01] <qa-morebots>	 Logged the message, Master
[17:15:43] <bd808>	 !log removed cherry pick of I3b6e37a2b6b9389c1a03bd572f422f898970c5b4 (modified in gerrit by bd808 and not repicked; merged)
[17:15:46] <qa-morebots>	 Logged the message, Master
[17:16:48] <bd808>	 !log removed cherry pick of Ib2a0401a7aa5632fb79a5b17c0d0cef8955cf990 (-2 by _joe_; replaced by Ibcad98a95413044fd6c5e9bd3c0a6fb486bd5fe9)
[17:16:51] <qa-morebots>	 Logged the message, Master
[17:17:26] <bd808>	 !log git-sync-upstream runs cleanly on deployment-salt again!
[17:17:29] <qa-morebots>	 Logged the message, Master
[17:19:24] <wikibugs>	 3Continuous-Integration: Jenkins: JSDuck should run on Ruby 1.9 instead of Ruby 1.8 - https://phabricator.wikimedia.org/T62138#850943 (10Krinkle) p:5Volunteer?>3Normal
[17:20:09] <wikibugs>	 3Continuous-Integration: Jenkins: integration-zuul-layoutdiff job says "No layout changes" when there are - https://phabricator.wikimedia.org/T73740#850945 (10Krinkle) p:5Triage>3Volunteer?
[17:21:23] <wikibugs>	 3Continuous-Integration: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#850948 (10Krinkle) p:5Triage>3Normal
[17:21:57] <wikibugs>	 3Continuous-Integration, Librarization: Set up composer validate job for operations/mediawiki-config - https://phabricator.wikimedia.org/T76621#850952 (10Krinkle) p:5Triage>3Normal
[17:22:19] <wikibugs>	 3Continuous-Integration, Wikimedia-Git-or-Gerrit: labs-tools-grrrit-yamllint job broken? - https://phabricator.wikimedia.org/T76508#850953 (10Krinkle) p:5Triage>3High
[17:22:37] <wikibugs>	 3Continuous-Integration: integration-config-tox-py27 fails existing master when repos get deleted from Gerrit - https://phabricator.wikimedia.org/T76853#850954 (10Krinkle) p:5Triage>3Low
[17:25:29] <wikibugs>	 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: MediaWiki QUnit test does not wait for all requests to complete, causing a race condition in Jenkins - https://phabricator.wikimedia.org/T78590#850957 (10Krinkle) Sounds like an issue with the code. Unit tests should mock anything they don't depe...
[17:26:14] <wikibugs>	 3Beta-Cluster: Puppet failures on deployment-bastion - https://phabricator.wikimedia.org/T75520#850960 (10mmodell) > @bd808: (copied from gerrit) >Keys can be placed in beta via local commits in deployment-salt:/var/lib/git/labs/private. That is how the ssh keypair for beta's scap wrapper were provisioned.  so w...
[17:27:19] <wikibugs>	 3Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#850970 (10hashar)
[17:27:31] <wikibugs>	 3Continuous-Integration: Empty .git/config for mediawiki/core.git clone in mediawiki-phpunit workspace on gallium - https://phabricator.wikimedia.org/T78474#850971 (10Krinkle)
[17:29:24] <wikibugs>	 3Continuous-Integration: Jenkins: JSDuck should run on Ruby 1.9 instead of Ruby 1.8 - https://phabricator.wikimedia.org/T62138#850973 (10Krinkle)
[17:34:49] <shinken-wm>	 PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]  
[17:42:56] <wikibugs>	 3Continuous-Integration: Jenkins: JSDuck should run on Ruby 1.9 instead of Ruby 1.8 - https://phabricator.wikimedia.org/T62138#851039 (10matmarex) Ruby 1.9.x will also be EOL'd in February, by the way, so the target should probably be 2.0 or 2.1.
[17:52:18] <Krinkle>	 hashar: Mmm any ideas on https://phabricator.wikimedia.org/T74184 ?
[17:52:43] <hashar>	 Krinkle: no clue. But I am sure WMDE folks will figure it out
[17:53:12] <hashar>	 not sure why mw-debug-log is non writable, should be 777
[17:53:40] <hashar>	 marxarelli: sorry about the rspec / puppet / MediaWikiVagrant mess early in the meeting
[17:53:40] <Krinkle>	 greg-g: had to leave, VE standup coming up. 
[17:53:54] <Krinkle>	 hashar: Yeah, is it still happening?
[17:53:55] <greg-g>	 Krinkle: no worries
[17:54:00] <marxarelli>	 hashar: ain't no thing
[17:54:10] <hashar>	 marxarelli: I don't really have a priority between ops/puppet vs  vagrant.  I would just love some kind of RSpec tests to start appearing magically on one of the repos using Puppet
[17:54:14] <greg-g>	 hollar back ya'll
[17:54:23] <marxarelli>	 hashar: i'm down to help with either case, whichever we think is higher priority
[17:54:44] <hashar>	 Krinkle: no idea :-(  I don't think it is much of a priority for them. IIRC they rely more on the selenium browser tests to verify whether their spring is done / non regressing
[17:54:49] <greg-g>	 probably, ideally, ops/puppet, so it has benefits for both production and beta cluser
[17:54:53] <Krinkle>	 hashar: I wrote to the log yesterday that I created a new precise instance (slave1005). I left 1001 as is (dead/shutdown) for labs admins to investigate if they find it interesting.
[17:54:54] <marxarelli>	 hashar: i mean, you already have us halfway there, so it might make sense to just figure out the remaining pieces in the context of ops puppet
[17:54:56] <greg-g>	 is my first level opinion at least
[17:55:28] <hashar>	 marxarelli: yeah I am sure the ops/puppet can be polished/finished up by someone with more ruby/RSpec  knowledge than I .  
[17:55:56] <hashar>	 marxarelli: I attempted to build a Rake entry point in ops/puppet to execute  rspec in each of the puppet module  having a 'spec' dir.  But that never worked fine :(
[17:55:58] <marxarelli>	 hashar: i'll take a look later this week and early next
[17:55:59] <hashar>	 marxarelli: a patch is https://gerrit.wikimedia.org/r/#/c/180162/
[17:56:40] <hashar>	 marxarelli: I wanted to:  rake spec,  then:   foreach module:  RSpec::core::run( somemodule )  && RSpec.reset ; end
[17:56:42] <marxarelli>	 hashar: ah. i can help fix that up
[17:56:59] <hashar>	 the idea is to have the rake task to use the RSPec provided by bundler
[17:57:06] <hashar>	 so the entry point would be called as:  bundle exec rspec spec
[17:57:12] <hashar>	 and run a rspec per module
[17:57:36] <hashar>	 Krinkle: yeah I deleted 1001 while investigating with YuviPanda
[17:57:37] <marxarelli>	 hashar: resetting _shouldn't_ be necessary as each example is setup/torndown 
[17:58:04] <hashar>	 Krinkle: coren looked at it, the issue is that the mount are released before network/ldap is available.  So the tmpfs  mounted for jenkins-deploy user fails because that user is not known yet :-(
[17:58:41] <marxarelli>	 hashar: but yeah, i have to dedicate today/tomorrow to mw-selenium. after that, i'm freed up to help with rspec/puppet
[18:00:56] <Krinkle>	 hashar: The user is used by other stuff fine though. The puppet class not existing is not the problem. It's just the ldap thing.
[18:01:09] <Krinkle>	 hashar: I remember working around this by using uid's instead (numerical)
[18:01:16] <Krinkle>	 must've forgotten to put that back in the patch
[18:01:23] <Krinkle>	 Which was Reedy's recommendation.
[18:02:23] <hashar>	 marxarelli: here is the lame idea that fails: https://gerrit.wikimedia.org/r/#/c/180215/1/rakefile
[18:02:43] <hashar>	 Krinkle: yeah it is not really an issue with your patch
[18:02:51] <hashar>	 Krinkle: more about the boot sequence  apparently
[18:03:08] <hashar>	 Krinkle: feel free to poke coren about it
[18:03:16] <marxarelli>	 gbtw, forgot to mention during the meeting that i have a dentist appointment at 11, and i'll be using that an excuse to visit my favorite burger joint in downtown
[18:03:28] <hashar>	 same heading out 
[18:03:31] <hashar>	 have a board meeting
[18:03:33] * hashar waves
[18:03:50] <hashar>	 thanks Krinkle to have joined the meeting :]  
[18:14:36] <wikibugs>	 3Continuous-Integration: common gating job for mediawiki core and extensions - https://phabricator.wikimedia.org/T60772#851226 (10Krinkle) Is this task ready to be closed?
[18:15:39] * greg-g is moving to a lunch location
[18:17:42] <wikibugs>	 3Release-Engineering: Create phpunit test in mediawiki-config repo to validate Parsoid settings - https://phabricator.wikimedia.org/T70532#851228 (10Krinkle) p:5Normal>3Triage
[18:57:16] <grrrit-wm>	 (03PS1) 10Awight: Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 
[19:12:26] <wikibugs>	 3VisualEditor, Continuous-Integration: Jenkins: Convert mwext qunit from grunt-contrib-qunit (PhantomJS) to grunt-karma (Chromium) - https://phabricator.wikimedia.org/T74063#851420 (10Krinkle) Test comment to see if mentioning #Patch-For-Review adds it.
[19:12:44] <wikibugs>	 3VisualEditor, Continuous-Integration: Jenkins: Convert mwext qunit from grunt-contrib-qunit (PhantomJS) to grunt-karma (Chromium) - https://phabricator.wikimedia.org/T74063#851424 (10Krinkle)
[19:15:04] <grrrit-wm>	 (03PS2) 10Krinkle: Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 (owner: 10Awight)
[19:16:49] <grrrit-wm>	 (03CR) 10Krinkle: [C: 032] Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 (owner: 10Awight)
[19:17:42] <grrrit-wm>	 (03Merged) 10jenkins-bot: Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 (owner: 10Awight)
[20:14:50] <spagewmf>	 just got "(Cannot contact the database server: Can't connect to MySQL server on '10.68.17.94' (4) (10.68.17.94))" on beta labs making a regular page edit
[20:15:21] <spagewmf>	 but retried and it succeeded.
[20:27:17] <chrismcmahon>	 UploadWizard wtf
[20:27:31] <chrismcmahon>	 spagewmf: weird, it did that this morning
[20:31:34] <spagewmf>	 chrismcmahon: the problem is intermittent beta labs failures like this cause intermittent test failures. I think the fix is for the cucumber run to retry any test that failed, and only if it fails twice does the test count as an error
[20:31:53] <spagewmf>	 I think Jon robson proposed retrying failed tests
[20:32:39] <bd808>	 The QA automation guy at my $DAYJOB-1 had a rather elaborate retry system for failed full-stack tests.
[20:33:00] <bd808>	 I think it too 3/5 runs failing to actually mark the test as failed
[20:33:13] <bd808>	 *took
[20:37:55] <chrismcmahon>	 I've written retry loops from scratch but never for Jenkins.  Right now I don't think we have enough time in the day to add a retry loop. 
[20:38:25] <spagewmf>	 chrismcmahon: you mean time for Jenkins or time for a developer to code it?
[20:38:33] <greg-g>	 spagewmf: for jenkins
[20:38:53] <greg-g>	 as it is resourced now, but we're adding more jenkins slaves soon
[20:38:59] <chrismcmahon>	 also, I like knowing that tests fail when beta labs is messed up
[20:39:23] <greg-g>	 there should be other ways of knowing that :)
[20:41:02] <chrismcmahon>	 the other problem with a retry loop is that when tests fail for real reasons it costs a lot of time. 
[20:41:20] <spagewmf>	 greg-g: maybe. Flow generally has 0-2 test failures out of 30, seems a small increase in load.   There are occasional meltdowns, e.g. https://integration.wikimedia.org/ci/view/BrowserTests/view/Echo+Flow/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/
[20:42:15] <greg-g>	 spagewmf: that graph is looking better :)
[20:43:11] <spagewmf>	 chrismcmahon: yes tests failing because beta labs had a transient problem is important, but it's hard to tease out when you're looking at a particular test like above.  "Jenkins, were other tests having problems that minute?" :)
[20:43:23] <spagewmf>	 qa=AI problem 
[20:44:15] <greg-g>	 actually... if we have a sufficient number of tests run near a specific point in time....
[20:44:18] <greg-g>	 :)
[20:47:50] <spagewmf>	 greg-g, chrismcmahon ^ I assume a QATAW (QA Test Analysis Wizard) knows how to take a single test failure and quickly establish if other test runs had problems. (Me, I get stuck converting "Last failure 1 hr 24 min" to UTC :) )
[20:52:01] <bd808>	 Somebody could feed all the test results in to beta's logstash
[20:52:35] <bd808>	 https://wiki.jenkins-ci.org/display/JENKINS/Logstash+Plugin
[20:55:59] <chrismcmahon>	 but that doesn't turn any tests from red to green. you still have to look at the red ones.
[20:56:24] <bd808>	 No, but it would make the "did everything fail or just this" easy to spot
[20:56:27] <greg-g>	 it just gives you trend data
[20:57:15] <greg-g>	 bd808: does logstash-beta have room to grow for a while, datastore wise?
[20:58:44] <marxarelli>	 limiting retry of failures do to 502/503's might be a good starting point
[20:58:51] <marxarelli>	 *due*
[20:59:03] <greg-g>	 +1
[21:00:19] <marxarelli>	 it can be done in mw-selenium as well. it wouldn't necessarily require anything in jenkins
[21:01:25] <bd808>	 greg-g: not sure... let me peek
[21:02:28] <shinken-wm>	 PROBLEM - Puppet failure on deployment-restbase03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[21:02:53] <bd808>	 greg-g: hehe. yeah. we are at 3% used on the /var/lib/elasticsearch mount
[21:03:12] <greg-g>	 bd808: awesome
[21:03:17] <bd808>	 log all the things!
[21:03:21] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[21:03:31] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[21:03:41] <greg-g>	 grrr, puppet
[21:03:46] <bd808>	 And we age out at 31 days there just like prod so we'd have to add a lot more data to cause a problem
[21:03:55] <bd808>	 grrrr dns probably
[21:04:22] <greg-g>	 huh, no wikibugs
[21:04:28] <greg-g>	 that's not a good sign either
[21:04:39] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[21:04:46] <greg-g>	 oh, they just restarted it
[21:05:18] <bd808>	 our openstack install uses what I would consider to be a dns resolver suitable for a small home network as the authoritative resolver for all of labs.
[21:06:10] <shinken-wm>	 PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[21:06:36] <greg-g>	 bd808: what makes it SOHO-quality?
[21:06:46] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[21:06:50] <bd808>	 http://www.thekelleys.org.uk/dnsmasq/doc.html -- Read the page title :)
[21:07:03] <YuviPanda>	 greg-g: yeah, wikibugs is just valhallasw playing with it
[21:07:03] <greg-g>	 well then..
[21:07:09] <greg-g>	 YuviPanda: /me nods
[21:07:18] <greg-g>	 bd808: pretty good reason
[21:07:39] <YuviPanda>	 bd808: did you check if it was DNS again? I saw some strange LVM errors in one as well
[21:07:50] <shinken-wm>	 PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]  
[21:08:01] <bd808>	 greg-g: dnsmasq is the dns server that is embedded in most home gateway routers. It's a tiny binary for a tiny job
[21:08:23] <bd808>	 YuviPanda: I didn't look, just assumed
[21:08:40] <YuviPanda>	 bd808: hmm, ok
[21:08:51] <YuviPanda>	 wouldn’t be surpriesd, though, since it’s a bunch of ‘em going down
[21:08:55] <greg-g>	 :(
[21:08:57] <shinken-wm>	 PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]  
[21:09:00] <YuviPanda>	 bd808: greg-g how’s the noisiness level of shinken-wm  now?
[21:09:08] <YuviPanda>	 hmm
[21:09:11] <YuviPanda>	 well, terrible question
[21:09:19] <YuviPanda>	 it’s spammy but that’s because of actual failures...
[21:09:33] <greg-g>	 hah, I was going to say
[21:09:51] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[21:10:10] <YuviPanda>	 I should really spend the 2h required to customize that stupif error message when something goes critical
[21:10:18] <bd808>	 -LVS_SERVICE_IPS="10.0.5.3" +LVS_SERVICE_IPS="" -- boom!
[21:10:59] <bd808>	 Could not retrieve file metadata for puppet:///modules/ldap/scripts/ldapkeys: Connection refused - connect(2)
[21:10:59] <greg-g>	 YuviPanda: is there credence to/and thoughts about bd808's "blame the SOHO-quality dns software"
[21:11:23] <YuviPanda>	 bd808: oh, it’s the puppetmaster failing?
[21:11:50] <YuviPanda>	 greg-g: it is a terrible DNS server, and coren’s been meaning to put in more patching in place...
[21:11:53] <bd808>	 it seems to be a wide variety of random shit
[21:13:14] <bd808>	 ummm.... firewall rules on deployment-salt may have changed to block everything if I'm reading this log right
[21:13:38] <greg-g>	 "more patching"? that sounds....
[21:13:41] <shinken-wm>	 PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[21:14:10] <YuviPanda>	 bd808: mutante was working on adding base::firewall to bastion roles...
[21:14:15] <YuviPanda>	 bd808: not sure if that got deployed...
[21:14:39] <bd808>	 Yeah it was deployment-bastion not deployment-salt
[21:15:51] <greg-g>	 bd808: (btw, feel free to unsub from that jenkins/logstash bug I just filed)
[21:16:02] <greg-g>	 ('tis a low prio)
[21:16:26] <YuviPanda>	 bd808: doesn’t seem to be merged yet
[21:17:40] <bd808>	 YuviPanda: I was just scanning through logs at https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/puppet%20runs and saw changes to DROP rules but now that I think about it I think the puppet role does that on every run
[21:19:54] <bd808>	 oomkiller is going nuts on deployment-pdf01
[21:20:19] * greg-g gets back on the travel train (plane)
[21:32:23] <shinken-wm>	 RECOVERY - Puppet failure on deployment-restbase03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:32:25] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[21:33:20] <shinken-wm>	 RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:33:28] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:34:38] <shinken-wm>	 RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:34:52] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:36:05] <shinken-wm>	 RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:36:40] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:37:37] <grrrit-wm>	 (03PS1) 10Gergő Tisza: Add jobs for Sentry [integration/config] - 10https://gerrit.wikimedia.org/r/180309 
[21:37:50] <shinken-wm>	 RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:38:42] <shinken-wm>	 RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:38:56] <shinken-wm>	 RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[21:50:47] <grrrit-wm>	 (03Abandoned) 10Dduvall: Try local gem cache in bundle job [integration/config] - 10https://gerrit.wikimedia.org/r/176818 (owner: 10Dduvall)
[21:54:47] <grrrit-wm>	 (03CR) 10Dduvall: [C: 032] "Self-merging to experimental branch." [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180079 (owner: 10Dduvall)
[21:55:18] <grrrit-wm>	 (03CR) 10Dduvall: [C: 032] "Self-merging to experimental branch." [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180080 (owner: 10Dduvall)
[21:55:40] <grrrit-wm>	 (03Merged) 10jenkins-bot: Fixed screenshots upon failed scenarios [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180080 (owner: 10Dduvall)
[21:56:06] <grrrit-wm>	 (03CR) 10Dduvall: [C: 032] "Self-merging to experimental branch." [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180084 (owner: 10Dduvall)
[22:02:24] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[22:07:02] <wikibugs>	 3Continuous-Integration, Phabricator: Phabricator project display names use awkward capitalization and hyphenation - https://phabricator.wikimedia.org/T911#851808 (10Krinkle) @MZMcBride: I don't think adding redirects would improve anything. Nothings points to those urls in the first place. It's safe to say we'r...
[22:25:12] <hashar>	 greg-g: hi, during our weekly checkin you said the beta cluster DB is dead/slow somehow.  If you could point to a phab task about it it would be nice, seems beta works fine for me
[22:25:23] <hashar>	 (by email since I am heading bed :D )
[22:27:52] <wikibugs>	 3Continuous-Integration: All repositories should pass jshint test (tracking) - https://phabricator.wikimedia.org/T62619#851852 (10hashar)
[22:33:07] <grrrit-wm>	 (03Abandoned) 10MaxSem: Disable npm checks for MobileFrontend [integration/config] - 10https://gerrit.wikimedia.org/r/179345 (owner: 10MaxSem)
[22:34:59] <wikibugs>	 3Beta-Cluster: Can not sudo on deployment-cache-mobile03 - https://phabricator.wikimedia.org/T78720#851871 (10hashar) 3NEW
[22:36:26] <grrrit-wm>	 (03CR) 10Hashar: "We probably could use a bug about npm being slow from time to time. Could be mitigated by having the jenkins slaves to use a caching proxy" [integration/config] - 10https://gerrit.wikimedia.org/r/179345 (owner: 10MaxSem)
[22:53:17] <grrrit-wm>	 (03PS1) 10Hashar: operations-software-ircyall-tox-flake8-trusty [integration/config] - 10https://gerrit.wikimedia.org/r/180335 
[22:53:32] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] operations-software-ircyall-tox-flake8-trusty [integration/config] - 10https://gerrit.wikimedia.org/r/180335 (owner: 10Hashar)
[22:58:24] <grrrit-wm>	 (03Merged) 10jenkins-bot: operations-software-ircyall-tox-flake8-trusty [integration/config] - 10https://gerrit.wikimedia.org/r/180335 (owner: 10Hashar)
[23:03:46] <grrrit-wm>	 (03CR) 10Krinkle: "@Hashar: That hasn't been the case for almost 2 years. The whole 0.8.x upgrade issues with ssl hacks was because the old version was relyi" [integration/config] - 10https://gerrit.wikimedia.org/r/179345 (owner: 10MaxSem)