[00:08:37] Deployment-Systems, MediaWiki-extensions-LocalisationUpdate, I18n, Patch-For-Review: Set l10nupdate cron to run Mon-Thursday - https://phabricator.wikimedia.org/T164035#3365979 (Dzahn) Open>Resolved ``` [tin:~] $ sudo crontab -u l10nupdate -l | tail -2 # Puppet Name: l10nupdate 0 2 * * 1,...
[00:17:37] (PS2) Thcipriani: puppet docker: add missing wrappers [integration/config] - https://gerrit.wikimedia.org/r/360363 (https://phabricator.wikimedia.org/T166888) (owner: Hashar)
[00:25:48] (CR) Thcipriani: [C: 2] puppet docker: add missing wrappers [integration/config] - https://gerrit.wikimedia.org/r/360363 (https://phabricator.wikimedia.org/T166888) (owner: Hashar)
[00:28:02] (Merged) jenkins-bot: puppet docker: add missing wrappers [integration/config] - https://gerrit.wikimedia.org/r/360363 (https://phabricator.wikimedia.org/T166888) (owner: Hashar)
[00:40:32] (PS1) MaxSem: Add Phan tests to LoginNotify [integration/config] - https://gerrit.wikimedia.org/r/360587
[01:34:33] PROBLEM - Puppet staleness on deployment-aqs01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[04:08:27] Yippee, build fixed!
[04:08:27] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,BrowserTests build #429: FIXED in 12 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=BrowserTests/429/
[05:20:39] Gerrit, Operations, Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3366375 (demon) Hmm, all this started after we tried swapping SysV init for systemd.
Funny how that correlates 🤔 😏
[05:23:17] Gerrit, Operations, Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3366376 (Dzahn) Better to raise it (https://gerrit.wikimedia.org/r/#/c//1) than not raise it. I am happy to build the new deb...
[07:19:43] Deployment-Systems, ArchCom-RfC, I18n: RFC: Reevaluate LocalisationUpdate extension for WMF - https://phabricator.wikimedia.org/T158360#3366521 (Nemo_bis)
[07:36:04] Scap, Discovery, Interactive-Sprint, Maps (Kartotherian), Patch-For-Review: Break Kartotherian scap3 deployment into 2 groups - https://phabricator.wikimedia.org/T147337#3366535 (Gehel) Patch merged, next deployment will use it. This can be closed.
[07:38:38] Continuous-Integration-Infrastructure, MediaWiki-History-or-Diffs, TCB-Team, wikidiff2, and 3 others: php-compile-hhvm jenkins job failing with "command not found" error - https://phabricator.wikimedia.org/T168410#3366547 (Tobi_WMDE_SW) Thank you @Paladox&@hashar!
[07:39:53] Continuous-Integration-Infrastructure, MediaWiki-History-or-Diffs, TCB-Team, wikidiff2, and 3 others: Migrate php-compile-hhvm test to jessie - https://phabricator.wikimedia.org/T168410#3366550 (Tobi_WMDE_SW) Open>Resolved
[08:34:27] (CR) Hashar: [C: 2] Add Phan tests to LoginNotify [integration/config] - https://gerrit.wikimedia.org/r/360587 (owner: MaxSem)
[08:36:16] (Merged) jenkins-bot: Add Phan tests to LoginNotify [integration/config] - https://gerrit.wikimedia.org/r/360587 (owner: MaxSem)
[08:37:24] Project beta-scap-eqiad build #160628: FAILURE in 1 min 49 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160628/
[08:45:58] Yippee, build fixed!
[08:45:59] Project beta-scap-eqiad build #160629: FIXED in 2 min 14 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160629/
[08:55:58] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[09:16:17] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[09:30:57] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[09:38:27] !log upgrading/Rebooting all instances from integration project to catch up with Linux kernel upgrades
[09:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[09:39:07] !log integration: deleting swift and swift-storage-01 unused
[09:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[09:39:44] PROBLEM - Puppet errors on saucelabs-01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[09:40:48] PROBLEM - Host swift is DOWN: CRITICAL - Host Unreachable (10.68.20.150)
[09:41:18] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:42:18] Browser-Tests-Infrastructure, Continuous-Integration-Infrastructure, Release-Engineering-Team (Next), Wikidata, and 2 others: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432#3366736 (Aleksey_WMDE) Hey! Today I received another pack of emails about failing...
[09:42:27] PROBLEM - Host swift-storage-01 is DOWN: CRITICAL - Host Unreachable (10.68.22.30)
[09:49:44] RECOVERY - Puppet errors on saucelabs-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:57:11] !log Upgrading puppet 3.7.2 ..
3.8.5 on integration-slave-docker-1001 and integration-slave-docker-1002 [09:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:59:28] !log integration: removing swift / python-swift from integration-puppetmaster01 [09:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:01:07] PROBLEM - Puppet errors on integration-slave-docker-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [10:04:29] PROBLEM - Puppet errors on integration-slave-docker-1002 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [10:04:33] PROBLEM - Puppet errors on integration-puppetmaster01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [10:09:56] PROBLEM - Puppet errors on integration-saltmaster is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0] [10:14:34] RECOVERY - Puppet errors on integration-puppetmaster01 is OK: OK: Less than 1.00% above the threshold [0.0] [10:16:06] RECOVERY - Puppet errors on integration-slave-docker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [10:18:17] PROBLEM - Puppet errors on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [10:19:29] RECOVERY - Puppet errors on integration-slave-docker-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [10:19:55] RECOVERY - Puppet errors on integration-saltmaster is OK: OK: Less than 1.00% above the threshold [0.0] [10:20:01] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Next), 10Wikidata, and 2 others: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432#3366845 (10Tobi_WMDE_SW) Yeah, most of them are failing due to ``` Sauce could not s... 
[10:21:56] !log deployment-zotero01 apt-get upgrade and rebooted [10:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:22:27] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [10:28:18] RECOVERY - Puppet errors on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [10:35:54] Project beta-scap-eqiad build #160640: 04FAILURE in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160640/ [10:35:59] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [10:36:07] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [10:46:08] RECOVERY - Puppet errors on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0] [10:46:12] Yippee, build fixed! [10:46:13] Project beta-scap-eqiad build #160641: 09FIXED in 2 min 29 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160641/ [10:48:50] PROBLEM - Puppet errors on deployment-eventlogging03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [10:51:01] RECOVERY - Puppet errors on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0] [10:59:32] PROBLEM - Puppet errors on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:00:28] !log deployment-prep apt-get upgrade and reboot all hosts [11:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [11:07:51] PROBLEM - Host Generic Beta Cluster is DOWN: PING CRITICAL - Packet loss = 100% [11:08:13] !log deployment-prep : rebooting deployment-tin deployment-mira deployment-cache-text04 deployment-cache-upload04 [11:08:15] PROBLEM - SSH on deployment-tin is CRITICAL: Connection refused [11:08:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [11:08:27] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 6.77 ms [11:11:15] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string 'Wikipedia' not found on 'https://en.m.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 2298 bytes in 0.048 second response time [11:12:00] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [11:12:12] !log varnish fails on deployment-cache-text04 [11:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [11:12:18] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [11:13:16] RECOVERY - SSH on deployment-tin is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [11:14:22] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:14:32] PROBLEM - Puppet errors on deployment-tin is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [11:15:00] !log deployment-cache-text04 : apt-get dist-upgrade [11:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [11:18:29] Backend host '"citoid.wmflabs.org"' could not be resolved to an IP address: [11:18:29] bah [11:18:31] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:18:51] RECOVERY - Puppet errors on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0] [11:22:17] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:23:59] 10Continuous-Integration-Infrastructure, 10MediaWiki-History-or-Diffs, 10TCB-Team, 10wikidiff2, and 3 others: Migrate php-compile-hhvm test to 
jessie - https://phabricator.wikimedia.org/T168410#3367000 (Paladox) You're welcome :)
[11:24:29] RECOVERY - Puppet errors on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[11:24:44] Project beta-code-update-eqiad build #160748: FAILURE in 11 min: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/160748/
[11:24:45] Project beta-update-databases-eqiad build #17909: FAILURE in 4 min 43 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/17909/
[11:25:47] PROBLEM - Puppet errors on deployment-poolcounter04 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0]
[11:26:54] Yippee, build fixed!
[11:26:55] Project beta-code-update-eqiad build #160749: FIXED in 51 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/160749/
[11:26:58] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[11:28:59] Project beta-scap-eqiad build #160644: FAILURE in 2 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160644/
[11:33:16] Beta-Cluster-Infrastructure: Beta cluster varnish fails VCL compilation because citoid.wmflabs.org does not resolve - https://phabricator.wikimedia.org/T168519#3367010 (hashar)
[11:35:07] Project beta-scap-eqiad build #160645: STILL FAILING in 1 min 24 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160645/
[11:35:49] RECOVERY - Puppet errors on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:36:19] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 33390 bytes in 2.236 second response time
[11:37:26] Beta-Cluster-Infrastructure: Could not find class role::etcd::common for deployment-conf03.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T168520#3367026 (hashar)
[11:42:30] (CR) Aude: [C: 2] Update Wikidata to wmf/1.30.0-wmf.6 [tools/release] -
https://gerrit.wikimedia.org/r/360351 (owner: Aude)
[11:43:20] (Merged) jenkins-bot: Update Wikidata to wmf/1.30.0-wmf.6 [tools/release] - https://gerrit.wikimedia.org/r/360351 (owner: Aude)
[11:45:06] Project beta-scap-eqiad build #160646: STILL FAILING in 1 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160646/
[11:45:09] Beta-Cluster-Infrastructure: Could not find class role::etcd::common for deployment-conf03.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T168520#3367047 (hashar) At https://horizon.wikimedia.org/project/prefixpuppet/ Changed `role::etcd::common` to `profile::etcd`
[11:45:49] Beta-Cluster-Infrastructure: Could not find class role::etcd::common for deployment-conf03.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T168520#3367049 (hashar)
[11:45:51] Beta-Cluster-Infrastructure: Beta cluster varnish fails VCL compilation because citoid.wmflabs.org does not resolve - https://phabricator.wikimedia.org/T168519#3367048 (hashar)
[11:51:22] Beta-Cluster-Infrastructure: Could not find class role::etcd::common for deployment-conf03.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T168520#3367052 (hashar) It is broken further due to `profile:etcd:discovery` not being set. In production that seems to point to a k8s cluster.
[11:54:23] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:55:08] Project beta-scap-eqiad build #160647: STILL FAILING in 1 min 27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160647/
[11:57:52] Beta-Cluster-Infrastructure, Patch-For-Review: Beta cluster varnish fails VCL compilation because citoid.wmflabs.org does not resolve - https://phabricator.wikimedia.org/T168519#3367071 (hashar) I have cherry picked https://gerrit.wikimedia.org/r/360639 on the beta cluster. Somehow that causes the citoid...
[11:58:31] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [11:59:01] !log armed keyholder on deployment-tin and deployment-mira [11:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [12:02:04] Yippee, build fixed! [12:02:04] Project beta-scap-eqiad build #160648: 09FIXED in 2 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160648/ [12:04:50] !log apt-get dist-upgrade on deployment-mediawiki hosts [12:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [12:07:01] 10Release-Engineering-Team, 10Page-Previews, 10Reading-Web-Backlog, 10Reading-Web-Kanban-Board: Create bot that automatically rebases and rebuilds patches to master - https://phabricator.wikimedia.org/T167181#3367096 (10Jhernandez) >>! In T167181#3363861, @Jdlrobson wrote: > Yep. I did originally do this f... [12:12:15] PROBLEM - SSH on deployment-mediawiki06 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:47] PROBLEM - Puppet errors on deployment-mediawiki06 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [12:17:05] RECOVERY - SSH on deployment-mediawiki06 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:17:18] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0] [12:17:56] ^^ that is me upgrading the hosts [12:18:20] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [12:22:16] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [12:23:20] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:25:01] PROBLEM - SSH on deployment-jobrunner02 is CRITICAL: Connection refused [12:25:49] RECOVERY - Puppet errors on deployment-mediawiki06 is OK: OK: Less than 
1.00% above the threshold [0.0] [12:26:18] Project beta-scap-eqiad build #160651: 04FAILURE in 2 min 29 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160651/ [12:28:22] !log deployment-imagescaler01 removed puppetmaster and puppetmaster-common packages [12:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [12:28:47] PROBLEM - Puppet errors on deployment-zookeeper02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [12:29:49] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [12:29:55] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [12:30:00] RECOVERY - SSH on deployment-jobrunner02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:30:32] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [12:33:48] PROBLEM - Puppet errors on deployment-trending01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:34:23] Yippee, build fixed! [12:34:24] Project beta-update-databases-eqiad build #17910: 09FIXED in 14 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/17910/ [12:36:04] Yippee, build fixed! 
[12:36:04] Project beta-scap-eqiad build #160652: 09FIXED in 2 min 19 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160652/ [12:37:10] PROBLEM - Puppet errors on deployment-aqs02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:38:47] RECOVERY - Puppet errors on deployment-zookeeper02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:39:09] PROBLEM - Puppet errors on deployment-puppetdb01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:39:21] PROBLEM - Puppet errors on deployment-cache-upload04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:39:25] PROBLEM - Puppet errors on deployment-fluorine02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [12:39:49] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:39:49] PROBLEM - Puppet errors on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [12:39:55] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [12:40:09] PROBLEM - Puppet errors on deployment-mcs01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:40:31] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [12:40:31] PROBLEM - Puppet errors on deployment-changeprop is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:42:08] PROBLEM - Puppet errors on deployment-memc04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:42:34] PROBLEM - Puppet errors on deployment-sca01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:42:54] PROBLEM - Puppet errors on deployment-sca03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [12:43:02] PROBLEM - Puppet errors on deployment-db04 is CRITICAL: CRITICAL: 
50.00% of data above the critical threshold [0.0] [12:43:16] PROBLEM - Puppet errors on deployment-sentry01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:43:16] PROBLEM - Puppet errors on deployment-mediawiki04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:43:34] PROBLEM - Puppet errors on deployment-restbase01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:43:47] RECOVERY - Puppet errors on deployment-trending01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:44:24] RECOVERY - Puppet errors on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:44:40] PROBLEM - Puppet errors on deployment-zotero01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:44:42] PROBLEM - Puppet errors on deployment-etcd-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [12:44:44] PROBLEM - Puppet errors on deployment-elastic05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:45:12] PROBLEM - Puppet errors on deployment-aqs03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [12:45:22] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [12:45:51] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:45:51] PROBLEM - Puppet errors on deployment-ms-be03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:45:55] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:47:03] PROBLEM - Puppet errors on deployment-puppetmaster02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:47:10] !log broke deployment-prep puppet master while upgrading it :( [12:47:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [12:48:19] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:49:19] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [12:49:31] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:49:39] PROBLEM - Puppet errors on deployment-salt02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:49:45] PROBLEM - Puppet errors on deployment-zookeeper02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:50:25] PROBLEM - Puppet errors on deployment-fluorine02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:51:48] PROBLEM - Puppet errors on deployment-poolcounter04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:52:26] PROBLEM - Puppet errors on deployment-kafka05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [12:52:58] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:53:00] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:53:06] PROBLEM - Puppet errors on deployment-kafka01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:53:44] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:53:46] PROBLEM - Puppet errors on deployment-ms-be04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [12:54:16] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:55:00] PROBLEM - Puppet errors on deployment-sca02 is CRITICAL: CRITICAL: 
50.00% of data above the critical threshold [0.0] [12:55:22] PROBLEM - Puppet errors on deployment-prometheus01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:55:48] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [12:56:32] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:56:49] PROBLEM - Puppet errors on deployment-mediawiki06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:58:49] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:59:27] PROBLEM - Puppet errors on deployment-parsoid09 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [13:00:31] PROBLEM - Puppet errors on deployment-tin is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [13:00:47] PROBLEM - Puppet errors on deployment-memc05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [13:01:58] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [13:03:08] PROBLEM - Puppet errors on deployment-elastic06 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:03:22] PROBLEM - Puppet errors on deployment-restbase02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:04:02] PROBLEM - Puppet errors on deployment-db03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [13:04:04] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:04:24] PROBLEM - Puppet errors on deployment-ircd is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:04:28] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 66.67% of data above 
the critical threshold [0.0]
[13:04:34] PROBLEM - Puppet errors on deployment-kafka03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[13:04:46] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[13:04:48] PROBLEM - Puppet errors on deployment-trending01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[13:06:34] Gerrit, Operations, Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3367211 (Paladox) According to bin/gerrit.sh status this is what the init script has GERRIT_FDS = 12000 (th...
[13:09:25] PROBLEM - Puppet errors on deployment-stream is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[13:09:49] PROBLEM - Puppet errors on deployment-ms-fe02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[13:09:57] PROBLEM - Puppet errors on deployment-kafka04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[13:12:07] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[13:20:51] Browser-Tests-Infrastructure, Release-Engineering-Team (Kanban), Patch-For-Review, User-zeljkofilipin: Run WebdriverIO tests in CI for extensions - https://phabricator.wikimedia.org/T164721#3367250 (zeljkofilipin) >>! In T164721#3355289, @Jdlrobson wrote: > The purpose is only for browser tests....
[13:26:35] !log deployment-prep: puppet master got erroneously upgraded to puppet* 4.8. Rolled it back to 3.8, which failed, and then back to 3.7!
[13:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[13:27:24] f**** that
[13:29:17] hashar what worked for me when that happened was downloading the debs for puppet manually.
[13:29:34] ie search on debian, download deb and install it.
[13:29:38] apt-get install puppetmaster=3.7.2-4+deb8u1 puppetmaster-common=3.7.2-4+deb8u1 puppet-common=3.7.2-4+deb8u1 puppetmaster-passenger=3.7.2-4+deb8u1
[13:29:39] works :)
[13:30:05] oh. I wonder why it is trying to update you to 4.8 :)
[13:41:13] ok maybe I have fixed something
[13:41:20] but still fails to find roles: could not find class role::beta::mediawiki
[13:50:13] RECOVERY - Puppet errors on deployment-mcs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:54:19] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:54:39] RECOVERY - Puppet errors on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:56:29] Release-Engineering-Team (Watching / External), Operations, Performance-Team, monitoring, Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#3367352 (Gilles) Open>Resolved a:Gilles Our new Grafana-ba...
[14:01:31] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [14:02:53] !log deployment-puppmaster (cd /etc/puppet && ln -s /var/lib/git/operations/puppet/manifests && ln -s /var/lib/git/operations/puppet/modules) [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:03:41] now they all fail with "Could not find data item discovery::app_routes in any Hiera data file and no default supplied at /etc/puppet/manifests/realm.pp:51" [14:04:29] PROBLEM - Host deployment-ms-be03 is DOWN: CRITICAL - Host Unreachable (10.68.22.125) [14:04:49] PROBLEM - Host deployment-etcd-01 is DOWN: CRITICAL - Host Unreachable (10.68.19.227) [14:04:54] :( [14:09:47] RECOVERY - Puppet errors on deployment-trending01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:11:09] PROBLEM - Puppet errors on deployment-mcs01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:17:21] !log finally fixed puppet on deployment-prep ! 
[14:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:18:54] ;( [14:19:51] RECOVERY - Host deployment-etcd-01 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [14:20:19] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [14:20:59] RECOVERY - Host deployment-ms-be03 is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms [14:21:54] !log deployment-prep: force running puppet on all instances [14:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:24:56] PROBLEM - Host deployment-ms-fe02 is DOWN: CRITICAL - Host Unreachable (10.68.19.247) [14:24:56] PROBLEM - Host deployment-eventlogging03 is DOWN: CRITICAL - Host Unreachable (10.68.18.111) [14:24:58] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.23.242) [14:25:00] PROBLEM - Host integration-slave-trusty-1001 is DOWN: CRITICAL - Host Unreachable (10.68.16.168) [14:25:26] RECOVERY - Puppet errors on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:25:38] PROBLEM - Host integration-slave-trusty-1006 is DOWN: CRITICAL - Host Unreachable (10.68.17.118) [14:25:41] Project beta-scap-eqiad build #160664: 04FAILURE in 1 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160664/ [14:27:05] RECOVERY - Puppet errors on deployment-puppetmaster02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:27:25] RECOVERY - Puppet errors on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:27:27] PROBLEM - Host deployment-salt02 is DOWN: CRITICAL - Host Unreachable (10.68.17.58) [14:27:33] PROBLEM - Host deployment-sca03 is DOWN: CRITICAL - Host Unreachable (10.68.21.183) [14:27:43] PROBLEM - Host deployment-sca01 is DOWN: CRITICAL - Host Unreachable (10.68.20.183) [14:28:16] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] 
[14:28:16] RECOVERY - Puppet errors on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:28:41] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [14:28:49] RECOVERY - Puppet errors on deployment-ms-be04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:28:51] PROBLEM - Host deployment-redis01 is DOWN: CRITICAL - Host Unreachable (10.68.16.177) [14:29:04] PROBLEM - Host deployment-memc05 is DOWN: CRITICAL - Host Unreachable (10.68.23.49) [14:29:30] RECOVERY - Puppet errors on deployment-ircd is OK: OK: Less than 1.00% above the threshold [0.0] [14:29:30] RECOVERY - Puppet errors on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:29:36] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:29:47] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.17.68) [14:29:47] RECOVERY - Puppet errors on deployment-zookeeper02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:29:59] RECOVERY - Puppet errors on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:30:51] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:30:51] RECOVERY - Puppet errors on deployment-ms-be03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:31:09] RECOVERY - Puppet errors on deployment-mcs01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:31:14] RECOVERY - Host integration-slave-trusty-1006 is UP: PING OK - Packet loss = 0%, RTA = 3.44 ms [14:31:24] RECOVERY - Host integration-slave-trusty-1001 is UP: PING OK - Packet loss = 0%, RTA = 4.82 ms [14:31:48] RECOVERY - Puppet errors on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:32:59] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:33:01] RECOVERY - Host 
deployment-eventlogging03 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [14:33:05] RECOVERY - Host deployment-ms-fe02 is UP: PING OK - Packet loss = 0%, RTA = 2.91 ms [14:33:07] RECOVERY - Puppet errors on deployment-kafka01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:33:15] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 7.87 ms [14:33:44] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [14:34:16] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:34:20] Project selenium-WikiLove » firefox,beta,Linux,BrowserTests build #430: FAILURE in 2 min 18 sec: https://integration.wikimedia.org/ci/job/selenium-WikiLove/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/430/ [14:34:26] RECOVERY - Puppet errors on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [14:34:40] RECOVERY - Puppet errors on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:34:58] RECOVERY - Puppet errors on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:35:20] RECOVERY - Puppet errors on deployment-prometheus01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:35:34] RECOVERY - Puppet errors on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0] [14:35:40] Project beta-scap-eqiad build #160665: STILL FAILING in 1 min 53 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160665/ [14:35:50] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [14:35:57] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [14:36:07] RECOVERY - Host deployment-memc05 is UP: PING OK - Packet loss = 0%, RTA = 1.75 ms [14:36:49] RECOVERY - Puppet errors on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold
[0.0] [14:37:07] RECOVERY - Puppet errors on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:37:11] RECOVERY - Puppet errors on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:37:17] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [14:37:35] RECOVERY - Host deployment-sca03 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [14:37:43] RECOVERY - Host deployment-sca01 is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms [14:38:23] RECOVERY - Puppet errors on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:38:30] RECOVERY - Host deployment-redis01 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [14:39:03] RECOVERY - Puppet errors on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:39:09] RECOVERY - Puppet errors on deployment-puppetdb01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:39:27] RECOVERY - Puppet errors on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0] [14:39:28] Project beta-scap-eqiad build #160666: STILL FAILING in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160666/ [14:39:36] RECOVERY - Puppet errors on deployment-kafka03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:40:12] PROBLEM - Host deployment-stream is DOWN: CRITICAL - Host Unreachable (10.68.17.106) [14:40:42] PROBLEM - Puppet errors on deployment-salt02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:40:44] PROBLEM - Host saucelabs-01 is DOWN: CRITICAL - Host Unreachable (10.68.21.186) [14:41:07] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [14:41:19] PROBLEM - Host deployment-zotero01 is DOWN: CRITICAL - Host Unreachable (10.68.17.102) [14:41:28] !log deployment-tmh01 is down for some reason [14:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:41:37] PROBLEM - Host
deployment-kafka04 is DOWN: CRITICAL - Host Unreachable (10.68.17.9) [14:42:22] ah no [14:42:25] hashar they are rebooting labvirts [14:42:27] that is just instances being repooled [14:42:34] PROBLEM - Host deployment-redis02 is DOWN: CRITICAL - Host Unreachable (10.68.16.231) [14:42:58] PROBLEM - Host deployment-tmh01 is DOWN: CRITICAL - Host Unreachable (10.68.16.211) [14:43:06] that explains it :) [14:43:48] RECOVERY - Puppet errors on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:44:02] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:44:50] RECOVERY - Puppet errors on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:44:52] RECOVERY - Puppet errors on deployment-ms-fe02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:45:45] Project beta-scap-eqiad build #160667: STILL FAILING in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160667/ [14:45:49] RECOVERY - Puppet errors on deployment-memc05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:45:55] PROBLEM - Puppet errors on integration-saltmaster is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [14:47:09] RECOVERY - Puppet errors on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0] [14:47:38] RECOVERY - Puppet errors on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:47:44] RECOVERY - Host deployment-stream is UP: PING OK - Packet loss = 0%, RTA = 2.93 ms [14:47:52] RECOVERY - Puppet errors on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:48:02] RECOVERY - Puppet errors on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:48:06] RECOVERY - Host deployment-tmh01 is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [14:48:16] RECOVERY - Puppet errors on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:48:52]
RECOVERY - Host deployment-zotero01 is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [14:49:04] RECOVERY - Host deployment-redis02 is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [14:49:21] RECOVERY - Puppet errors on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:49:45] RECOVERY - Puppet errors on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:49:49] RECOVERY - Host deployment-kafka04 is UP: PING OK - Packet loss = 0%, RTA = 3.41 ms [14:50:11] RECOVERY - Puppet errors on deployment-aqs03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:24] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:39] RECOVERY - Puppet errors on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:57] RECOVERY - Puppet errors on integration-saltmaster is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:12] RECOVERY - Host saucelabs-01 is UP: PING OK - Packet loss = 0%, RTA = 4.91 ms [14:52:26] RECOVERY - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is OK: OK: Less than 100.00% above the threshold [0.0] [14:52:57] RECOVERY - Puppet errors on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:53:35] RECOVERY - Puppet errors on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:55:18] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:55:44] PROBLEM - Puppet errors on saucelabs-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:55:48] PROBLEM - Host deployment-memc04 is DOWN: CRITICAL - Host Unreachable (10.68.23.25) [14:56:06] Yippee, build fixed! 
[14:56:07] Project beta-scap-eqiad build #160668: FIXED in 2 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/160668/ [14:56:27] PROBLEM - Host deployment-zookeeper02 is DOWN: CRITICAL - Host Unreachable (10.68.18.75) [14:57:34] PROBLEM - Host integration-slave-trusty-1003 is DOWN: CRITICAL - Host Unreachable (10.68.17.54) [14:57:48] PROBLEM - Puppet errors on deployment-mediawiki06 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:58:06] PROBLEM - Host deployment-aqs01 is DOWN: CRITICAL - Host Unreachable (10.68.18.237) [14:58:08] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.16.128) [14:58:41] PROBLEM - Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) [15:00:28] PROBLEM - Puppet errors on deployment-parsoid09 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:00:36] PROBLEM - Puppet errors on deployment-kafka03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:00:49] PROBLEM - Puppet errors on deployment-trending01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:01:30] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:01:49] PROBLEM - Puppet errors on deployment-memc05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:02:57] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [15:03:59] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:04:21] PROBLEM - Puppet errors on deployment-restbase02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:05:01] PROBLEM - Puppet errors on deployment-db03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:05:01] PROBLEM - Puppet errors on deployment-eventlogging04 is
CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:05:16] RECOVERY - Host deployment-puppetdb01 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [15:05:24] PROBLEM - Puppet errors on deployment-ircd is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:05:34] RECOVERY - Host deployment-aqs01 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [15:05:42] RECOVERY - Host deployment-zookeeper02 is UP: PING OK - Packet loss = 0%, RTA = 2.17 ms [15:05:42] RECOVERY - Puppet errors on saucelabs-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:05:42] RECOVERY - Host deployment-memc04 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [15:06:00] RECOVERY - Host integration-slave-trusty-1003 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [15:06:04] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 7.58 ms [15:06:26] PROBLEM - Puppet errors on deployment-stream is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:08:11] PROBLEM - Puppet errors on deployment-aqs02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:09:34] PROBLEM - Puppet errors on deployment-restbase01 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [0.0] [15:10:08] PROBLEM - Puppet errors on deployment-puppetdb01 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [0.0] [15:10:18] PROBLEM - Puppet errors on deployment-cache-upload04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:10:47] PROBLEM - Puppet errors on deployment-zookeeper02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:10:51] PROBLEM - Puppet errors on deployment-ms-fe02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:10:51] PROBLEM - Puppet errors on deployment-eventlogging03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:11:57] PROBLEM - Puppet errors on 
deployment-kafka04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:13:07] PROBLEM - Puppet errors on deployment-memc04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:14:25] PROBLEM - Host Generic Beta Cluster is DOWN: CRITICAL - Host Unreachable (en.wikipedia.beta.wmflabs.org) [15:14:28] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [15:15:08] PROBLEM - Host deployment-ms-be04 is DOWN: CRITICAL - Host Unreachable (10.68.16.139) [15:15:52] PROBLEM - Host deployment-sentry01 is DOWN: CRITICAL - Host Unreachable (10.68.19.148) [15:16:36] PROBLEM - Host deployment-cache-text04 is DOWN: CRITICAL - Host Unreachable (10.68.18.103) [15:17:02] PROBLEM - Host deployment-imagescaler01 is DOWN: CRITICAL - Host Unreachable (10.68.19.158) [15:17:04] PROBLEM - Host deployment-trending01 is DOWN: CRITICAL - Host Unreachable (10.68.18.186) [15:17:18] PROBLEM - Host deployment-aqs02 is DOWN: CRITICAL - Host Unreachable (10.68.17.90) [15:17:54] PROBLEM - Host deployment-fluorine02 is DOWN: CRITICAL - Host Unreachable (10.68.23.106) [15:18:08] RECOVERY - Puppet errors on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:20:45] RECOVERY - Puppet errors on deployment-zookeeper02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:21:31] RECOVERY - Host deployment-fluorine02 is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms [15:21:39] RECOVERY - Host deployment-aqs02 is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [15:22:01] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [15:22:01] RECOVERY - Host deployment-imagescaler01 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [15:22:23] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 4.98 ms [15:22:59] RECOVERY - Host deployment-cache-text04 is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms [15:22:59] RECOVERY - Host deployment-trending01 is UP: 
PING OK - Packet loss = 0%, RTA = 2.82 ms [15:23:03] RECOVERY - Host deployment-sentry01 is UP: PING OK - Packet loss = 0%, RTA = 2.91 ms [15:24:32] RECOVERY - Puppet errors on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:24:57] RECOVERY - Host deployment-ms-be04 is UP: PING OK - Packet loss = 0%, RTA = 3.20 ms [15:26:25] PROBLEM - Puppet errors on deployment-fluorine02 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0] [15:29:47] PROBLEM - Puppet errors on deployment-ms-be04 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]