[00:06:37] (03PS1) 10Dduvall: Better encapsulation of API-related functionality [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180079 [00:06:39] (03PS1) 10Dduvall: Fixed screenshots upon failed scenarios [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180080 [00:18:48] (03PS1) 10Dduvall: API helper method for creating accounts [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180084 [00:19:55] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:27:27] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [00:31:34] Yippee, build fixed! [00:31:34] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #347: FIXED in 36 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/347/ [00:34:54] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:36:18] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [00:36:57] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [00:38:15] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:38:49] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [00:41:57] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [00:42:09] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [00:42:47] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:43:05] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [00:46:17] Project beta-scap-eqiad build #34079: FAILURE in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34079/ [00:46:21] PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:46:39] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:50:51] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [00:52:25] RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0] [00:52:35] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [00:56:32] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:56:36] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [00:56:41] Yippee, build fixed! 
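For context on the 180080 change announced above ("Fixed screenshots upon failed scenarios"): the patch itself is not shown in this log, but the usual cucumber pattern it relates to looks roughly like the sketch below. This assumes a watir-webdriver browser object is exposed to hooks as @browser; that name and the screenshots/ directory are illustrative, not taken from the actual change.

```
# Illustrative sketch only, not the contents of gerrit change 180080.
# Save a screenshot from the assumed @browser (a watir-webdriver Browser)
# whenever a cucumber scenario fails.
require 'fileutils'

After do |scenario|
  if scenario.failed? && @browser
    # Build a filesystem-safe file name from the scenario name.
    name = scenario.name.gsub(/\W+/, '_')
    path = File.join('screenshots', "#{name}.png")
    FileUtils.mkdir_p(File.dirname(path))
    @browser.screenshot.save(path)
  end
end
```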
[00:56:41] Project beta-scap-eqiad build #34080: FIXED in 2 min 46 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34080/ [00:57:32] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:59:06] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:59:50] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:59:54] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:01:59] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [01:06:23] PROBLEM - Puppet failure on deployment-restbase03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [01:06:28] Project beta-scap-eqiad build #34081: FAILURE in 2 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34081/ [01:07:09] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:08:14] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:09:07] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [01:09:17] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:10:41] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:10:53] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:12:57] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:13:11] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:16:18] RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0] [01:19:14] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:19:22] PROBLEM - Puppet failure on deployment-stream is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:21:18] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [01:21:57] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:22:38] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0] [01:26:27] Yippee, build fixed! 
[01:26:27] Project beta-scap-eqiad build #34083: FIXED in 2 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34083/ [01:26:33] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [01:27:19] PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [01:27:47] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [01:29:48] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [01:33:26] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [01:36:27] Project beta-scap-eqiad build #34084: FAILURE in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34084/ [01:36:42] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [01:37:53] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:38:13] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:39:07] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [01:40:33] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [01:40:39] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [01:40:49] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:43:50] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [01:44:15] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:44:23] RECOVERY - Puppet failure on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [01:46:08] Yippee, build fixed! 
[01:46:08] Project beta-scap-eqiad build #34085: FIXED in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34085/ [01:47:32] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [01:47:38] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:48:48] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [01:50:06] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [01:51:23] RECOVERY - Puppet failure on deployment-restbase03 is OK: OK: Less than 1.00% above the threshold [0.0] [01:51:52] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:54:07] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:01:34] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [02:04:06] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [02:05:31] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [02:08:07] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [02:08:45] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [02:12:20] RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0] [02:12:40] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [02:14:17] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [02:15:09] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [02:16:51] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [02:19:04] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [02:20:53] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [02:23:27] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [02:27:29] 3MediaWiki-extensions-TimedMediaHandler, Continuous-Integration: hhvm on CI slaves can not find PEAR.php - https://phabricator.wikimedia.org/T78556#849777 (10Krinkle) The usage for both CirrusSearch and TimedMediaHandler stems from OggHandler: [[ https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandl... [03:42:31] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #192: FAILURE in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/192/ [04:11:15] PROBLEM - Puppet staleness on deployment-sca-cache01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [04:15:18] Project beta-scap-eqiad build #34100: FAILURE in 1 min 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34100/ [04:35:12] Yippee, build fixed! 
[04:35:13] Project beta-scap-eqiad build #34102: FIXED in 1 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34102/ [05:54:50] Project beta-scap-eqiad build #34110: FAILURE in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34110/ [06:15:13] Yippee, build fixed! [06:15:14] Project beta-scap-eqiad build #34112: FIXED in 1 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34112/ [07:40:38] Yippee, build fixed! [07:40:39] Project browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce build #227: FIXED in 20 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce/227/ [08:15:30] 3Beta-Cluster: Puppet failures on deployment-bastion - https://phabricator.wikimedia.org/T75520#849860 (10Joe) I added the correct patch here https://gerrit.wikimedia.org/r/#/c/180126/ Now on beta you just need to configure role::deployment::deployment_servers::common::key_source to some file in the beta privat... [08:28:53] Yippee, build fixed! [08:28:53] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #193: FIXED in 41 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/193/ [08:38:38] (03PS4) 10Adrian Lang: Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 [08:39:23] (03CR) 10Adrian Lang: "Rebased. Also, Ife21da0b9989d5d292f73ad9f946e34215bb6be7 should fix the issue in WikibaseJavaScriptApi." [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang) [09:17:12] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #348: FAILURE in 33 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/348/ [09:28:54] zeljkof:Hi!areyou busy? [09:29:09] sorry, in a meeting [09:29:13] Jagori: ^ [09:29:15] no issues [09:31:47] Yippee, build fixed! [09:31:47] Project browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce build #366: FIXED in 1 hr 23 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce/366/ [10:35:19] Project beta-scap-eqiad build #34139: FAILURE in 1 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34139/ [10:36:55] 3MediaWiki-extensions-TimedMediaHandler, Multimedia, Continuous-Integration: hhvm on CI slaves can not find PEAR.php - https://phabricator.wikimedia.org/T78556#850078 (10Gilles) [10:45:34] 3Release-Engineering, Phabricator.org, Phabricator: Answer questions about ongoing maintenance of phabricator customizations/extensions - https://phabricator.wikimedia.org/T78464#850106 (10mmodell) @dzahn: > just to be clear about it, this is _not T175 ? I simply meant to say that the phabricator team is not /... [10:55:27] Yippee, build fixed! 
[10:55:27] Project beta-scap-eqiad build #34141: FIXED in 1 min 22 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34141/ [11:31:45] 3Quality-Assurance, Release-Engineering: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#850162 (10akosiaris) [11:31:47] 3Quality-Assurance, Release-Engineering: role/phabricator.pp include a password class in the global puppet scope - https://phabricator.wikimedia.org/T78344#850159 (10akosiaris) 5Open>3Resolved a:3akosiaris After I ran a (noop) catalog compiling test, I merged and ran puppet on iridium. No changes as expect... [12:51:54] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [12:58:13] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [13:08:52] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [13:10:18] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:21:47] hashar: feel free to merge, https://gerrit.wikimedia.org/r/#/c/178772 :) [13:21:55] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:22:17] 3Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#850376 (10hashar) [13:22:20] 3Multimedia, Continuous-Integration, MediaWiki-extensions-TimedMediaHandler: hhvm on CI slaves can not find PEAR.php - https://phabricator.wikimedia.org/T78556#850374 (10hashar) 5Open>3Resolved TimedMediaHandler is happy, thank you! https://integration.wikimedia.org/ci/job/mwext-CirrusSearch-testextension-hh... [13:30:04] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [13:35:17] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [13:37:13] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [13:38:51] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:40:13] fwiw, there are actually no puppet failures [13:43:10] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:50:59] !log deleting integration-slave1001 and recreating it. It is blocked on boot and we can't console on it https://phabricator.wikimedia.org/T76250 [13:51:01] Logged the message, Master [13:51:17] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [13:52:55] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [13:55:07] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [13:56:38] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850455 (10hashar) a:3hashar Thanks @yuvipanda. I have deleted the old instance and recreated it (with IP 10.68.17.119)... 
[13:57:32] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:57:34] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:00:16] hmm [14:00:21] these are actually all real transient errors [14:00:27] from DNS failure and LVM issues?! [14:01:26] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:01:49] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850484 (10hashar) 3NEW a:3hashar [14:02:23] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850493 (10hashar) Puppet choke on newly created Precise instances T78661 :-( [14:03:05] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850484 (10hashar) [14:03:32] YuviPanda: I haven't looked yet at the puppet failures [14:04:08] hashar: hmm, it could possibly be lvm [14:04:09] cache-bits01 had a resolution error indeed [14:04:10] Error: /Stage[main]/Role::Protoproxy::Ssl::Beta::Common/Install_certificate[star.wmflabs.org]/File[/etc/ssl/localcerts/star.wmflabs.org.crt]: Could not evaluate: getaddrinfo: Temporary failure in name resolution Could not retrieve file metadata for puppet:///files/ssl/star.wmflabs.org.crt: getaddrinfo: Temporary failure in name resolution [14:04:25] hashar: yeah, and some others have lvm ones [14:04:29] is dnsmasq running on virt1000 ? [14:04:42] the lvm do a host lookup as well [14:04:45] should be [14:04:47] aah, hmm [14:05:01] hashar: is the new integration-slave1001 having the same issue? [14:05:15] na seems a certname which is not recognized by the puppet master [14:11:36] hashar: fixed :) [14:11:41] manually cleaned and re-requested [14:11:56] hashar: puppet is running now [14:12:49] YuviPanda: I am wondering whether it is transient or whether we suddenly lost the ability to create working Precise instances :D [14:13:11] hashar: I hope it's the latter, but we'll find out the next time someone creates a precise instance :) [14:16:12] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850538 (10yuvipanda) Manually cleaned the old cert and requested a new one and it's alright now, for this instance. let's see if this recurs. [14:16:20] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [14:17:59] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:22:24] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Puppet stalled on fresh Precise instance - https://phabricator.wikimedia.org/T78661#850549 (10hashar) 5Open>3Resolved Thanks, per our discussion lets close this and figure out later on when someone create another Precise instance. Might have been a tran... 
[14:22:25] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850551 (10hashar) [14:22:33] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:22:37] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:27:16] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:30:31] (03CR) 10Hashar: [C: 032] "All fine. Jobs are deployed, will reload Zuul when its merged." [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang) [14:42:16] (03CR) 10jenkins-bot: [V: 04-1] Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang) [14:45:24] (03PS5) 10Hashar: Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang) [14:46:21] (03CR) 10Hashar: [C: 032] "Was missing the qunit jenkins job https://gerrit.wikimedia.org/r/#/c/176232/4..5/jjb/mediawiki-extensions.yaml,unified" [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang) [14:51:52] (03Merged) 10jenkins-bot: Add jobs for WikibaseJavaScriptApi [integration/config] - 10https://gerrit.wikimedia.org/r/176232 (owner: 10Adrian Lang) [14:57:16] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [14:57:32] (03PS2) 10Hashar: Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm) [15:00:10] (03PS3) 10Hashar: Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm) [15:00:32] (03CR) 10Hashar: [C: 032] Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm) [15:03:02] (03CR) 10jenkins-bot: [V: 04-1] Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm) [15:05:59] (03Merged) 10jenkins-bot: Add jobs for labs/tools/wikibugs2 [integration/config] - 10https://gerrit.wikimedia.org/r/175607 (owner: 10Legoktm) [15:07:47] 3Wikibugs, Continuous-Integration: Set up jenkins jobs for labs/tools/wikibugs2 repo - https://phabricator.wikimedia.org/T75707#850640 (10hashar) 5Open>3Resolved a:3hashar The CI change have been deployed. Thanks @legoktm :) [15:08:53] (03PS1) 10Hashar: mwext-WikibaseJavaScriptApi-qunit non voting [integration/config] - 10https://gerrit.wikimedia.org/r/180189 [15:10:03] (03CR) 10Hashar: [C: 032] mwext-WikibaseJavaScriptApi-qunit non voting [integration/config] - 10https://gerrit.wikimedia.org/r/180189 (owner: 10Hashar) [15:10:51] (03Merged) 10jenkins-bot: mwext-WikibaseJavaScriptApi-qunit non voting [integration/config] - 10https://gerrit.wikimedia.org/r/180189 (owner: 10Hashar) [15:25:12] 3Phabricator, Phabricator.org, Release-Engineering: Answer questions about ongoing maintenance of phabricator customizations/extensions - https://phabricator.wikimedia.org/T78464#850658 (10Qgil) >>! In T78464#847841, @Dzahn wrote: >>>! In T78464#846063, @MZMcBride wrote: >> key selling point to switching to Phab... 
[15:34:39] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850672 (10hashar) So I deleted the old blocked instance and created a fresh one. Applied the manifests I needed and did... [15:35:34] 3Quality-Assurance: rubocop fixes in mediawiki/selenium - https://phabricator.wikimedia.org/T75898#850673 (10zeljkofilipin) @stan3, I do not think I have tried that so far, but looks like it at least confused gerrit. [15:38:53] 3Quality-Assurance: rubocop fixes in mediawiki/selenium - https://phabricator.wikimedia.org/T75898#850685 (10hashar) You want to push your change in Gerrit with: git push gerrit whateverlocalbranch:refs/for/env-abstraction-layer This way Gerrit will know they are for the env-abstraction-layer branch. It is... [15:40:22] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850690 (10hashar) [15:40:32] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#793736 (10hashar) [Get console output](https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=conso... [15:45:59] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#850700 (10hashar) From coren, the relevant bits are: ``` tmpfs: Bad value 'jenkins-deploy' for mount option 'uid' ... An... [15:51:36] 3Beta-Cluster: Beta labs Special:Contributions lags by a long time - https://phabricator.wikimedia.org/T78671#850716 (10Cmcmahon) 3NEW [15:53:07] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:56] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 49150 bytes in 0.686 second response time [16:10:50] 3Release-Engineering: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#850768 (10Dzahn) i think this should only be disabled for files in ./role/ but _not_ for files in ./modules/. Role classes will never be in autoload module classes should alway... [16:12:07] 3Release-Engineering: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#850776 (10Dzahn) this might also be a duplicate of T1289 [16:13:18] 3Continuous-Integration: jenkins - operations-puppet-puppetlint-lenient - --no-autoloader_layout-check - https://phabricator.wikimedia.org/T75117#850781 (10Dzahn) also see T1289 . it might be a duplicate. i don't think T1289 should be resolved as originally requested though. let's not remove the entire check ple... 
[16:13:45] 3Continuous-Integration: jenkins - operations-puppet-puppetlint-lenient - --no-autoloader_layout-check - https://phabricator.wikimedia.org/T75117#850786 (10Dzahn) [16:14:01] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1818 bytes in 3.059 second response time [16:17:43] 3Beta-Cluster, MediaWiki-extensions-Flow: Beta labs Special:Contributions lags by a long time - https://phabricator.wikimedia.org/T78671#850716 (10hashar) [16:18:26] 3Beta-Cluster, MediaWiki-extensions-Flow: Beta labs Special:Contributions lags by a long time - https://phabricator.wikimedia.org/T78671#850716 (10hashar) Looking at https://logstash-beta.wmflabs.org/ hhvm reports a bunch of slow queries such as. The Selenium user having user_id 820. ```SlowTimer [59999ms] at r... [16:18:35] * greg-g does the cafe/friend house hopping in search of wifi for our call [16:19:06] heading out to grab daughter / grocery. Be back for the call :D [16:46:50] beta labs is down "(Cannot contact the database server: Can't connect to MySQL server on '10.68.17.94' (4) (10.68.17.94))" [16:49:31] chrismcmahon: mutante is looking I think, see #-labs [16:51:15] good timing [17:02:24] hi. could you paste the last lines from icinga-wm? [17:02:37] greg-g: we're all here [17:08:47] !log deployment-salt:/var/lib/git/operations/puppet is a rebase hell of cherry-picks that don't apply [17:08:50] Logged the message, Master [17:08:56] 3Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#850916 (10hashar) + RelEng project to make this task appear on our team work board. [17:08:57] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 49150 bytes in 0.716 second response time [17:09:23] !log trying to fix it without using important changes [17:09:25] Logged the message, Master [17:10:25] there, beta is fixed by labs [17:10:34] Coren restarted a db server instance [17:13:42] :-( [17:13:50] !log removed cherry pick of I6084f49e97c855286b86dbbd6ce8e80e94069492 (merged by Ori with a change) [17:13:53] Logged the message, Master [17:14:58] !log removed cherry pick of I08c24578596506a1a8baedb7f4a42c2c78be295a (-2 by _joe_ in gerrit; replaced by Iba742c94aa3df7497fbff52a856d7ba16cf22cc7) [17:15:01] Logged the message, Master [17:15:43] !log removed cherry pick of I3b6e37a2b6b9389c1a03bd572f422f898970c5b4 (modified in gerrit by bd808 and not repicked; merged) [17:15:46] Logged the message, Master [17:16:48] !log removed cherry pick of Ib2a0401a7aa5632fb79a5b17c0d0cef8955cf990 (-2 by _joe_; replaced by Ibcad98a95413044fd6c5e9bd3c0a6fb486bd5fe9) [17:16:51] Logged the message, Master [17:17:26] !log git-sync-upstream runs cleanly on deployment-salt again! [17:17:29] Logged the message, Master [17:19:24] 3Continuous-Integration: Jenkins: JSDuck should run on Ruby 1.9 instead of Ruby 1.8 - https://phabricator.wikimedia.org/T62138#850943 (10Krinkle) p:5Volunteer?>3Normal [17:20:09] 3Continuous-Integration: Jenkins: integration-zuul-layoutdiff job says "No layout changes" when there are - https://phabricator.wikimedia.org/T73740#850945 (10Krinkle) p:5Triage>3Volunteer? 
[17:21:23] 3Continuous-Integration: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#850948 (10Krinkle) p:5Triage>3Normal [17:21:57] 3Continuous-Integration, Librarization: Set up composer validate job for operations/mediawiki-config - https://phabricator.wikimedia.org/T76621#850952 (10Krinkle) p:5Triage>3Normal [17:22:19] 3Continuous-Integration, Wikimedia-Git-or-Gerrit: labs-tools-grrrit-yamllint job broken? - https://phabricator.wikimedia.org/T76508#850953 (10Krinkle) p:5Triage>3High [17:22:37] 3Continuous-Integration: integration-config-tox-py27 fails existing master when repos get deleted from Gerrit - https://phabricator.wikimedia.org/T76853#850954 (10Krinkle) p:5Triage>3Low [17:25:29] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: MediaWiki QUnit test does not wait for all requests to complete, causing a race condition in Jenkins - https://phabricator.wikimedia.org/T78590#850957 (10Krinkle) Sounds like an issue with the code. Unit tests should mock anything they don't depe... [17:26:14] 3Beta-Cluster: Puppet failures on deployment-bastion - https://phabricator.wikimedia.org/T75520#850960 (10mmodell) > @bd808: (copied from gerrit) >Keys can be placed in beta via local commits in deployment-salt:/var/lib/git/labs/private. That is how the ssh keypair for beta's scap wrapper were provisioned. so w... [17:27:19] 3Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#850970 (10hashar) [17:27:31] 3Continuous-Integration: Empty .git/config for mediawiki/core.git clone in mediawiki-phpunit workspace on gallium - https://phabricator.wikimedia.org/T78474#850971 (10Krinkle) [17:29:24] 3Continuous-Integration: Jenkins: JSDuck should run on Ruby 1.9 instead of Ruby 1.8 - https://phabricator.wikimedia.org/T62138#850973 (10Krinkle) [17:34:49] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:42:56] 3Continuous-Integration: Jenkins: JSDuck should run on Ruby 1.9 instead of Ruby 1.8 - https://phabricator.wikimedia.org/T62138#851039 (10matmarex) Ruby 1.9.x will also be EOL'd in February, by the way, so the target should probably be 2.0 or 2.1. [17:52:18] hashar: Mmm any ideas on https://phabricator.wikimedia.org/T74184 ? [17:52:43] Krinkle: no clue. But I am sure WMDE folks will figure it out [17:53:12] not sure why mw-debug-log is non writable, should be 777 [17:53:40] marxarelli: sorry about the rspec / puppet / MediaWikiVagrant mess early in the meeting [17:53:40] greg-g: had to leave, VE standup coming up. [17:53:54] hashar: Yeah, is it still happening? [17:53:55] Krinkle: no worries [17:54:00] hashar: ain't no thing [17:54:10] marxarelli: I don't really have a priority between ops/puppet vs vagrant. I would just love some kind of RSpec tests to start appearing magically on one of the repos using Puppet [17:54:14] hollar back ya'll [17:54:23] hashar: i'm down to help with either case, whichever we think is higher priority [17:54:44] Krinkle: no idea :-( I don't think it is much of a priority for them. IIRC they rely more on the selenium browser tests to verify whether their spring is done / non regressing [17:54:49] probably, ideally, ops/puppet, so it has benefits for both production and beta cluser [17:54:53] hashar: I wrote to the log yesterday that I created a new precise instance (slave1005). 
I left 1001 as is (dead/shutdown) for labs admins to investigate if they find it interesting. [17:54:54] hashar: i mean, you already have us halfway there, so it might make sense to just figure out the remaining pieces in the context of ops puppet [17:54:56] is my first level opinion at least [17:55:28] marxarelli: yeah I am sure the ops/puppet can be polished/finished up by someone with more ruby/RSpec knowledge than I . [17:55:56] marxarelli: I attempted to build a Rake entry point in ops/puppet to execute rspec in each of the puppet module having a 'spec' dir. But that never worked fine :( [17:55:58] hashar: i'll take a look later this week and early next [17:55:59] marxarelli: a patch is https://gerrit.wikimedia.org/r/#/c/180162/ [17:56:40] marxarelli: I wanted to: rake spec, then: foreach module: RSpec::core::run( somemodule ) && RSpec.reset ; end [17:56:42] hashar: ah. i can help fix that up [17:56:59] the idea is to have the rake task to use the RSPec provided by bundler [17:57:06] so the entry point would be called as: bundle exec rspec spec [17:57:12] and run a rspec per module [17:57:36] Krinkle: yeah I deleted 1001 while investigating with YuviPanda [17:57:37] hashar: resetting _shouldn't_ be necessary as each example is setup/torndown [17:58:04] Krinkle: coren looked at it, the issue is that the mount are released before network/ldap is available. So the tmpfs mounted for jenkins-deploy user fails because that user is not known yet :-( [17:58:41] hashar: but yeah, i have to dedicate today/tomorrow to mw-selenium. after that, i'm freed up to help with rspec/puppet [18:00:56] hashar: The user is used by other stuff fine though. The puppet class not existing is not the problem. It's just the ldap thing. [18:01:09] hashar: I remember working around this by using uid's instead (numerical) [18:01:16] must've forgotten to put that back in the patch [18:01:23] Which was Reedy's recommendation. [18:02:23] marxarelli: here is the lame idea that fails: https://gerrit.wikimedia.org/r/#/c/180215/1/rakefile [18:02:43] Krinkle: yeah it is not really an issue with your patch [18:02:51] Krinkle: more about the boot sequence apparently [18:03:08] Krinkle: feel free to poke coren about it [18:03:16] gbtw, forgot to mention during the meeting that i have a dentist appointment at 11, and i'll be using that an excuse to visit my favorite burger joint in downtown [18:03:28] same heading out [18:03:31] have a board meeting [18:03:33] * hashar waves [18:03:50] thanks Krinkle to have joined the meeting :] [18:14:36] 3Continuous-Integration: common gating job for mediawiki core and extensions - https://phabricator.wikimedia.org/T60772#851226 (10Krinkle) Is this task ready to be closed? [18:15:39] * greg-g is moving to a lunch location [18:17:42] 3Release-Engineering: Create phpunit test in mediawiki-config repo to validate Parsoid settings - https://phabricator.wikimedia.org/T70532#851228 (10Krinkle) p:5Normal>3Triage [18:57:16] (03PS1) 10Awight: Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 [19:12:26] 3VisualEditor, Continuous-Integration: Jenkins: Convert mwext qunit from grunt-contrib-qunit (PhantomJS) to grunt-karma (Chromium) - https://phabricator.wikimedia.org/T74063#851420 (10Krinkle) Test comment to see if mentioning #Patch-For-Review adds it. 
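To make the per-module RSpec idea above concrete (hashar's rake entry point that iterates over every puppet module with a spec/ directory, run through bundler as `bundle exec rake spec`): a minimal Rakefile sketch follows. It leans on rspec's bundled rake task helper rather than calling RSpec::Core::Runner and resetting state by hand; it is not the rakefile from the gerrit change linked above, just an illustration of that approach.

```
# Sketch of a Rakefile entry point (not the actual operations/puppet rakefile)
# that runs RSpec once per puppet module shipping a spec/ directory. Each
# RSpec::Core::RakeTask spawns its own rspec process, so no RSpec.reset is
# needed between modules. Intended to be invoked as: bundle exec rake spec
require 'rspec/core/rake_task'

spec_dirs = Dir.glob('modules/*/spec').sort

namespace :spec do
  spec_dirs.each do |spec_dir|
    module_name = spec_dir.split('/')[1]
    RSpec::Core::RakeTask.new(module_name) do |t|
      # Limit this task to the module's own examples; real module specs may
      # also need their spec_helper/fixtures set up relative to the module.
      t.pattern = "#{spec_dir}/**/*_spec.rb"
    end
  end
end

desc 'Run RSpec for every puppet module that has a spec/ directory'
task spec: spec_dirs.map { |d| "spec:#{d.split('/')[1]}" }
```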
[19:12:44] 3VisualEditor, Continuous-Integration: Jenkins: Convert mwext qunit from grunt-contrib-qunit (PhantomJS) to grunt-karma (Chromium) - https://phabricator.wikimedia.org/T74063#851424 (10Krinkle) [19:15:04] (03PS2) 10Krinkle: Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 (owner: 10Awight) [19:16:49] (03CR) 10Krinkle: [C: 032] Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 (owner: 10Awight) [19:17:42] (03Merged) 10jenkins-bot: Make jslint voting for fundraising-dash [integration/config] - 10https://gerrit.wikimedia.org/r/180226 (owner: 10Awight) [20:14:50] just got "(Cannot contact the database server: Can't connect to MySQL server on '10.68.17.94' (4) (10.68.17.94))" on beta labs making a regular page edit [20:15:21] but retried and it succeeded. [20:27:17] UploadWizard wtf [20:27:31] spagewmf: weird, it did that this morning [20:31:34] chrismcmahon: the problem is intermittent beta labs failures like this cause intermittent test failures. I think the fix is for the cucumber run to retry any test that failed, and only if it fails twice does the test count as an error [20:31:53] I think Jon robson proposed retrying failed tests [20:32:39] The QA automation guy at my $DAYJOB-1 had a rather elaborate retry system for failed full-stack tests. [20:33:00] I think it too 3/5 runs failing to actually mark the test as failed [20:33:13] *took [20:37:55] I've written retry loops from scratch but never for Jenkins. Right now I don't think we have enough time in the day to add a retry loop. [20:38:25] chrismcmahon: you mean time for Jenkins or time for a developer to code it? [20:38:33] spagewmf: for jenkins [20:38:53] as it is resourced now, but we're adding more jenkins slaves soon [20:38:59] also, I like knowing that tests fail when beta labs is messed up [20:39:23] there should be other ways of knowing that :) [20:41:02] the other problem with a retry loop is that when tests fail for real reasons it costs a lot of time. [20:41:20] greg-g: maybe. Flow generally has 0-2 test failures out of 30, seems a small increase in load. There are occasional meltdowns, e.g. https://integration.wikimedia.org/ci/view/BrowserTests/view/Echo+Flow/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/ [20:42:15] spagewmf: that graph is looking better :) [20:43:11] chrismcmahon: yes tests failing because beta labs had a transient problem is important, but it's hard to tease out when you're looking at a particular test like above. "Jenkins, were other tests having problems that minute?" :) [20:43:23] qa=AI problem [20:44:15] actually... if we have a sufficient number of tests run near a specific point in time.... [20:44:18] :) [20:47:50] greg-g, chrismcmahon ^ I assume a QATAW (QA Test Analysis Wizard) knows how to take a single test failure and quickly establish if other test runs had problems. (Me, I get stuck converting "Last failure 1 hr 24 min" to UTC :) ) [20:52:01] Somebody could feed all the test results in to beta's logstash [20:52:35] https://wiki.jenkins-ci.org/display/JENKINS/Logstash+Plugin [20:55:59] but that doesn't turn any tests from red to green. you still have to look at the red ones. [20:56:24] No, but it would make the "did everything fail or just this" easy to spot [20:56:27] it just gives you trend data [20:57:15] bd808: does logstash-beta have room to grow for a while, datastore wise? 
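A rough sketch of the retry idea discussed above, narrowed (as suggested shortly after this) to transient beta-cluster failures such as 502/503 responses, so genuine regressions still fail on the first run. The helper name and the page-object call in the usage comment are invented for illustration; this is not existing mediawiki/selenium API.

```
# Hypothetical helper, not existing mediawiki/selenium API: retry a block a
# limited number of times, but only when the failure looks like a transient
# 502/503 from the beta cluster.
def with_transient_retry(attempts: 2, wait: 5)
  tries = 0
  begin
    yield
  rescue StandardError => e
    tries += 1
    transient = e.message =~ /\b50[23]\b|Bad Gateway|Service Unavailable/
    raise unless transient && tries < attempts
    sleep wait
    retry
  end
end

# Usage sketch inside a step definition (page and parameter names are illustrative):
#   with_transient_retry { visit(ArticlePage, using_params: { article: 'Selenium test page' }) }
```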
[20:58:44] limiting retry of failures do to 502/503's might be a good starting point [20:58:51] *due* [20:59:03] +1 [21:00:19] it can be done in mw-selenium as well. it wouldn't necessarily require anything in jenkins [21:01:25] greg-g: not sure... let me peek [21:02:28] PROBLEM - Puppet failure on deployment-restbase03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [21:02:53] greg-g: hehe. yeah. we are at 3% used on the /var/lib/elasticsearch mount [21:03:12] bd808: awesome [21:03:17] log all the things! [21:03:21] PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [21:03:31] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [21:03:41] grrr, puppet [21:03:46] And we age out at 31 days there just like prod so we'd have to add a lot more data to cause a problem [21:03:55] grrrr dns probably [21:04:22] huh, no wikibugs [21:04:28] that's not a good sign either [21:04:39] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [21:04:46] oh, they just restarted it [21:05:18] our openstack install uses what I would consider to be a dns resolver suitable for a small home network as the authoritative resolver for all of labs. [21:06:10] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [21:06:36] bd808: what makes it SOHO-quality? [21:06:46] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [21:06:50] http://www.thekelleys.org.uk/dnsmasq/doc.html -- Read the page title :) [21:07:03] greg-g: yeah, wikibugs is just valhallasw playing with it [21:07:03] well then.. [21:07:09] YuviPanda: /me nods [21:07:18] bd808: pretty good reason [21:07:39] bd808: did you check if it was DNS again? I saw some strange LVM errors in one as well [21:07:50] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [21:08:01] greg-g: dnsmasq is the dns server that is embedded in most home gateway routers. It's a tiny binary for a tiny job [21:08:23] YuviPanda: I didn't look, just assumed [21:08:40] bd808: hmm, ok [21:08:51] wouldn’t be surpriesd, though, since it’s a bunch of ‘em going down [21:08:55] :( [21:08:57] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [21:09:00] bd808: greg-g how’s the noisiness level of shinken-wm now? [21:09:08] hmm [21:09:11] well, terrible question [21:09:19] it’s spammy but that’s because of actual failures... [21:09:33] hah, I was going to say [21:09:51] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [21:10:10] I should really spend the 2h required to customize that stupif error message when something goes critical [21:10:18] -LVS_SERVICE_IPS="10.0.5.3" +LVS_SERVICE_IPS="" -- boom! [21:10:59] Could not retrieve file metadata for puppet:///modules/ldap/scripts/ldapkeys: Connection refused - connect(2) [21:10:59] YuviPanda: is there credence to/and thoughts about bd808's "blame the SOHO-quality dns software" [21:11:23] bd808: oh, it’s the puppetmaster failing? [21:11:50] greg-g: it is a terrible DNS server, and coren’s been meaning to put in more patching in place... 
[21:11:53] it seems to be a wide variety of random shit [21:13:14] ummm.... firewall rules on deployment-salt may have changed to block everything if I'm reading this log right [21:13:38] "more patching"? that sounds.... [21:13:41] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [21:14:10] bd808: mutante was working on adding base::firewall to bastion roles... [21:14:15] bd808: not sure if that got deployed... [21:14:39] Yeah it was deployment-bastion not deployment-salt [21:15:51] bd808: (btw, feel free to unsub from that jenkins/logstash bug I just filed) [21:16:02] ('tis a low prio) [21:16:26] bd808: doesn’t seem to be merged yet [21:17:40] YuviPanda: I was just scanning through logs at https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/puppet%20runs and saw changes to DROP rules but now that I think about it I think the puppet role does that on every run [21:19:54] oomkiller is going nuts on deployment-pdf01 [21:20:19] * greg-g gets back on the travel train (plane) [21:32:23] RECOVERY - Puppet failure on deployment-restbase03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:32:25] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [21:33:20] RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:33:28] RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:34:38] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:34:52] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:36:05] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [21:36:40] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [21:37:37] (03PS1) 10Gergő Tisza: Add jobs for Sentry [integration/config] - 10https://gerrit.wikimedia.org/r/180309 [21:37:50] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:38:42] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:38:56] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:50:47] (03Abandoned) 10Dduvall: Try local gem cache in bundle job [integration/config] - 10https://gerrit.wikimedia.org/r/176818 (owner: 10Dduvall) [21:54:47] (03CR) 10Dduvall: [C: 032] "Self-merging to experimental branch." [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180079 (owner: 10Dduvall) [21:55:18] (03CR) 10Dduvall: [C: 032] "Self-merging to experimental branch." [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180080 (owner: 10Dduvall) [21:55:40] (03Merged) 10jenkins-bot: Fixed screenshots upon failed scenarios [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180080 (owner: 10Dduvall) [21:56:06] (03CR) 10Dduvall: [C: 032] "Self-merging to experimental branch." 
[selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180084 (owner: 10Dduvall) [21:56:06] (03Merged) 10jenkins-bot: API helper method for creating accounts [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180084 (owner: 10Dduvall) [22:02:24] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [22:07:02] 3Continuous-Integration, Phabricator: Phabricator project display names use awkward capitalization and hyphenation - https://phabricator.wikimedia.org/T911#851808 (10Krinkle) @MZMcBride: I don't think adding redirects would improve anything. Nothing points to those urls in the first place. It's safe to say we'r... [22:25:12] greg-g: hi, during our weekly checkin you said the beta cluster DB is dead/slow somehow. If you could point to a phab task about it it would be nice, seems beta works fine for me [22:25:23] (by email since I am heading to bed :D ) [22:27:52] 3Continuous-Integration: All repositories should pass jshint test (tracking) - https://phabricator.wikimedia.org/T62619#851852 (10hashar) [22:33:07] (03Abandoned) 10MaxSem: Disable npm checks for MobileFrontend [integration/config] - 10https://gerrit.wikimedia.org/r/179345 (owner: 10MaxSem) [22:34:59] 3Beta-Cluster: Can not sudo on deployment-cache-mobile03 - https://phabricator.wikimedia.org/T78720#851871 (10hashar) 3NEW [22:36:26] (03CR) 10Hashar: "We probably could use a bug about npm being slow from time to time. Could be mitigated by having the jenkins slaves to use a caching proxy" [integration/config] - 10https://gerrit.wikimedia.org/r/179345 (owner: 10MaxSem) [22:53:17] (03PS1) 10Hashar: operations-software-ircyall-tox-flake8-trusty [integration/config] - 10https://gerrit.wikimedia.org/r/180335 [22:53:32] (03CR) 10Hashar: [C: 032] operations-software-ircyall-tox-flake8-trusty [integration/config] - 10https://gerrit.wikimedia.org/r/180335 (owner: 10Hashar) [22:58:24] (03Merged) 10jenkins-bot: operations-software-ircyall-tox-flake8-trusty [integration/config] - 10https://gerrit.wikimedia.org/r/180335 (owner: 10Hashar) [23:03:46] (03CR) 10Krinkle: "@Hashar: That hasn't been the case for almost 2 years. The whole 0.8.x upgrade issues with ssl hacks was because the old version was relyi" [integration/config] - 10https://gerrit.wikimedia.org/r/179345 (owner: 10MaxSem)
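The other env-abstraction-layer change merged above, "API helper method for creating accounts" (https://gerrit.wikimedia.org/r/180084), is likewise not reproduced in this log. Purely as an illustration of the general shape such a helper can take with the mediawiki_api gem, here is a sketch; the create_account call and the api_url parameter are assumptions, not a confirmed description of that change.

```
# Illustrative sketch only, not the gerrit 180084 change. Creates a throwaway
# wiki account through the API rather than through the signup form.
require 'mediawiki_api'
require 'securerandom'

def create_test_account(api_url:)
  client = MediawikiApi::Client.new(api_url)
  username = "Selenium user #{SecureRandom.hex(4)}"
  password = SecureRandom.hex(12)
  # create_account is assumed to exist on the client and to handle the
  # account-creation token round-trip.
  client.create_account(username, password)
  { username: username, password: password }
end
```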