[00:03:33] !log cleaned up redis leftovers on deployment-logstash1
[00:03:36] Logged the message, Master
[00:11:18] Continuous-Integration, developer-notice: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1229838 (Krinkle)
[00:17:24] !log beta cluster fatal monitor full of "Bad file descriptor: AH00646: Error writing to /data/project/logs/apache-access.log"
[00:17:26] Logged the message, Master
[00:29:14] !log cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205969/ (logstash: Convert $::realm switches to hiera)
[00:29:17] Logged the message, Master
[00:46:23] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1821: FAILURE in 22 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1821/
[00:50:05] Beta-Cluster, Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1229915 (Nuria) I have just enabled puppet again, no reason to have it disabled anymore (we did so for testing purposes couple weeks back)
[01:05:09] (PS1) Krinkle: Remove lint-js from test/gate pipeline where npm runs [integration/config] - https://gerrit.wikimedia.org/r/206044
[01:06:14] (Abandoned) Krinkle: Publish doxygen doc for the 'cdb' project [integration/config] - https://gerrit.wikimedia.org/r/174417 (https://bugzilla.wikimedia.org/73530) (owner: Hashar)
[01:06:40] Continuous-Integration: Publish cdb documentation to doc.wikimedia.org - https://phabricator.wikimedia.org/T75530#1229945 (Krinkle) a:hashar>Krinkle
[01:19:28] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #231: FAILURE in 1 min 28 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/231/
[01:20:44] Project beta-update-databases-eqiad build #9107: FAILURE in 44 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9107/
[01:25:27] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 2.157 second response time
[01:26:02] Beta-Cluster, VisualEditor: Cannot open any page with VE in Betalabs, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1229983 (Shizhao) me2
[01:40:28] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47222 bytes in 1.108 second response time
[01:48:03] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1520 bytes in 2.323 second response time
[01:53:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 46939 bytes in 1.050 second response time
[01:55:20] Project browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-os_x_10.10-iphone-sauce build #48: FAILURE in 1 min 20 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-os_x_10.10-iphone-sauce/48/
[02:06:59] Continuous-Integration, Wikimedia-Hackathon-2015: All new extensions should be setup automatically with Zuul - https://phabricator.wikimedia.org/T92909#1230005 (Krinkle) >>! In T92909#1222266, @Jdlrobson wrote: > When I do spend time in it it takes too long to get code reviewed/fixed and merged (it reall...
[02:07:00] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2072.68 ms
[02:11:11] PROBLEM - Host deployment-stream is DOWN: CRITICAL - Host Unreachable (10.68.17.106)
[02:11:11] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms
[02:12:19] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94)
[02:16:57] RECOVERY - Host deployment-stream is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[02:17:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 211.65 ms
[02:21:00] Yippee, build fixed!
[02:21:01] Project beta-update-databases-eqiad build #9108: FIXED in 1 min 0 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9108/
[02:25:29] Project browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce build #77: FAILURE in 3 min 28 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce/77/
[02:36:32] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #550: FAILURE in 3 min 32 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/550/
[02:42:14] Continuous-Integration, Wikimedia-Hackathon-2015: All new extensions should be setup automatically with Zuul - https://phabricator.wikimedia.org/T92909#1230032 (Jdlrobson) I understand.. but the reason I use github.com for personal development is the simplicity. Put frankly, I don't want to spend time on...
[02:45:05] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #514: FAILURE in 4 min 4 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/514/
[02:48:08] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51)
[02:50:44] PROBLEM - Host deployment-db1 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2734.00 ms
[02:57:27] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 30%, RTA = 5742.75 ms
[03:00:38] (PS1) Krinkle: Enable npm job for Gather [integration/config] - https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589)
[03:03:05] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51)
[03:05:08] (CR) Krinkle: "@Jdlrobson: Let us know when this can be enabled. Right now it's failing with 5 jscs errors." [integration/config] - https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: Krinkle)
[03:09:14] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[03:14:12] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 296.27 ms
[03:20:36] Project beta-update-databases-eqiad build #9109: FAILURE in 36 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9109/
[03:21:37] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3586.25 ms
[03:21:51] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187)
[03:22:00] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 108.56 ms
[03:22:44] PROBLEM - SSH on deployment-elastic07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:44] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[03:27:00] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 68.73 ms
[03:27:40] RECOVERY - SSH on deployment-elastic07 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[03:28:05] Continuous-Integration, operations, Puppet: Puppet (silently) fails to setup apache on new trusty instances - https://phabricator.wikimedia.org/T91832#1230073 (Krinkle)
[03:28:43] Browser-Tests, Puppet: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1230074 (Krinkle)
[03:36:27] Browser-Tests, Puppet: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1230094 (Krinkle) Open>Resolved a:Krinkle Haven't seen this error in the 2 instance re-creation sprints. Works for me.
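
(An aside on the tmpfs task logged at 00:11:18: a minimal sketch of what the T96230 switch could look like on a CI slave. The mount point, size, and commands below are illustrative assumptions, not values taken from the task itself; CI databases are throwaway, so losing the datadir on reboot is acceptable.)

    # /etc/fstab: keep the disposable CI MySQL datadir in RAM
    tmpfs  /var/lib/mysql  tmpfs  rw,size=1G,mode=0700  0  0

    # roughly: stop MySQL, mount tmpfs over the datadir, re-init, restart
    sudo service mysql stop
    sudo mount /var/lib/mysql && sudo chown mysql:mysql /var/lib/mysql
    sudo mysql_install_db --user=mysql
    sudo service mysql start
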
[03:44:39] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:01:36] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2222.41 ms
[04:03:12] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 0%, RTA = 690.79 ms
[04:07:06] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #422: FAILURE in 5.5 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/422/
[04:07:26] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3510.59 ms
[04:12:20] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 381.31 ms
[04:15:35] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #417: FAILURE in 23 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/417/
[04:21:58] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:50] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28266 bytes in 0.487 second response time
[04:30:00] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2526.07 ms
[04:30:27] (CR) Krinkle: [C: 2] Remove lint-js from test/gate pipeline where npm runs [integration/config] - https://gerrit.wikimedia.org/r/206044 (owner: Krinkle)
[04:31:55] (PS1) Krinkle: Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547)
[04:32:02] (Merged) jenkins-bot: Remove lint-js from test/gate pipeline where npm runs [integration/config] - https://gerrit.wikimedia.org/r/206044 (owner: Krinkle)
[04:32:08] (PS2) Krinkle: Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547)
[04:32:14] (CR) Krinkle: [C: 2] Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547) (owner: Krinkle)
[04:33:12] PROBLEM - Host deployment-cache-upload02 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 3690.88 ms
[04:34:12] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[04:34:24] (Merged) jenkins-bot: Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547) (owner: Krinkle)
[04:34:41] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[04:35:29] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/206044 and https://gerrit.wikimedia.org/r/206072
[04:35:34] Logged the message, Master
[04:51:33] FLAPPINGSTOP - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms
[04:51:37] Continuous-Integration: /var/lib/mysql/ filling up on old Precise slaves due to mysql usage - https://phabricator.wikimedia.org/T94138#1230223 (Krinkle) Open>declined a:Krinkle
[04:56:00] Continuous-Integration, Labs: Create an instance image like m1.small with 2 CPUs and 30GB space - https://phabricator.wikimedia.org/T96706#1230309 (Krinkle) Resolved>Open Our goal for 30G space was based on the following estimate: > 10G for system, 10G for git replication and 10G for workspace. How...
[04:56:02] Continuous-Integration: Convert pool from a few large slaves (4X) to more smaller slaves (1X) - https://phabricator.wikimedia.org/T96629#1230311 (Krinkle)
[05:06:47] (CR) Krinkle: Convert 'operations-puppet-doc' job to run on a labs slave (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/204982 (https://phabricator.wikimedia.org/T86659) (owner: Legoktm)
[05:12:21] FLAPPINGSTART - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[05:20:24] Yippee, build fixed!
[05:20:24] Project beta-update-databases-eqiad build #9111: FIXED in 23 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9111/
[05:21:09] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 27.21 ms
[05:36:34] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3219.94 ms
[05:44:16] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #48: FAILURE in 28 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/48/
[05:52:13] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2315.02 ms
[05:57:11] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms
[06:00:04] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 8747.20 ms
[06:02:02] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 85.15 ms
[06:03:39] (PS1) Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687)
[06:05:00] (CR) Krinkle: "Example output:" [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687) (owner: Krinkle)
[06:06:08] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 10.48 ms
[06:08:47] (PS2) Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687)
[06:08:52] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: No route to host
[06:10:11] PROBLEM - SSH on deployment-cache-mobile03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:10:28] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 100%
[06:11:19] !log integration-slave-trusty-1021 stays depooled (see T96629 and T96706)
[06:11:21] Logged the message, Master
[06:11:49] !log Running git-cache-update inside screen on integration-slave-trusty-1021 at /mnt/git
[06:11:52] Logged the message, Master
[06:18:20] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13)
[06:18:56] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28266 bytes in 8.717 second response time
[06:19:14] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #591: FAILURE in 13 sec: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/591/
[06:20:12] RECOVERY - SSH on deployment-cache-mobile03 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[06:20:33] Project beta-update-databases-eqiad build #9112: FAILURE in 32 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9112/
[06:21:26] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[06:21:35] FLAPPINGSTOP - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 249.68 ms
[06:22:51] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[06:23:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13)
[06:23:35] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 304.20 ms
[06:23:45] PROBLEM - SSH on deployment-kafka02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:26:11] PROBLEM - SSH on deployment-cache-mobile03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:29] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:42:51] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[06:43:05] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51)
[06:43:31] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[06:44:13] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 153.39 ms
[06:48:37] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:52:54] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[06:57:51] PROBLEM - Host deployment-mediawiki01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 4423.80 ms
[07:01:03] RECOVERY - Host deployment-mediawiki01 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[07:07:28] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:13:08] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 280.79 ms
[07:18:35] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:20:41] FLAPPINGSTOP - Host deployment-db1 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[07:22:10] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #15: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/15/
[07:26:05] PROBLEM - SSH on deployment-memc02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:27:14] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2709.28 ms
[07:30:58] RECOVERY - SSH on deployment-memc02 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[07:31:06] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 198.72 ms
[07:32:12] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 219.84 ms
[07:37:01] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3552.57 ms
[07:38:33] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 3.789 second response time
[07:38:50] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce-DEBUG build #2: FAILURE in 24 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce-DEBUG/2/
[07:41:06] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 43.66 ms
[07:42:11] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190)
[07:43:31] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.492 second response time
[08:00:19] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[08:10:05] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #578: FAILURE in 4.4 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/578/
[08:12:52] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[08:13:10] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 0%, RTA = 726.89 ms
[08:16:32] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 2516.26 ms
[08:21:38] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 3073.90 ms
[08:22:09] Project browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #560: FAILURE in 2 min 8 sec: https://integration.wikimedia.org/ci/job/browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/560/
[08:23:33] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 176.14 ms
[08:24:41] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156)
[08:30:37] RECOVERY - SSH on deployment-kafka02 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[08:38:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[08:48:13] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 161.23 ms
[08:49:37] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[08:54:37] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[08:56:32] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 3.73 ms
[08:59:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[09:00:23] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 100%
[09:06:37] FLAPPINGSTART - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 128.24 ms
[09:07:11] FLAPPINGSTOP - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[09:12:12] PROBLEM - SSH on deployment-restbase01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:19:36] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:23:22] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 100%
[09:23:22] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused
[09:31:46] Beta-Cluster, VisualEditor: Cannot open any page with VE in Betalabs, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1230772 (Aklapper) p:High>Un...
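
(On the "inside screen" pattern from the 06:11:49 !log entry above: roughly how such a long-running update can be detached from the SSH session. Only the host and /mnt/git come from the log; the exact invocation and log path are assumptions for illustration.)

    # start a detached screen session so the update survives SSH disconnects
    screen -dmS git-cache-update sh -c 'cd /mnt/git && git-cache-update 2>&1 | tee /tmp/git-cache-update.log'
    # reattach later to check on progress
    screen -r git-cache-update
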
[09:34:42] Release-Engineering, Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1230775 (Christopher) Default show Sprint Start and Sprint End date fields on Project Create form is available for test and review on https://phab08.wmflabs.org/project/create...
[09:38:23] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.376 second response time
[09:42:21] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #463: FAILURE in 5 min 20 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/463/
[09:43:02] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[09:49:03] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[09:55:03] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2169.48 ms
[10:02:51] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[10:03:09] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[10:07:20] FLAPPINGSTOP - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 10.39 ms
[10:12:48] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[10:17:20] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3205.42 ms
[10:18:38] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[10:21:52] PROBLEM - Host deployment-sentry2 is DOWN: CRITICAL - Host Unreachable (10.68.17.204)
[10:22:16] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94)
[10:22:54] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2214.64 ms
[10:26:32] Continuous-Integration, Release-Engineering, Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1230866 (Aklapper) Should tasks in Continuous-Integration-Config be automatically part of Continuous-Integration (subproject style), or should these...
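
(The HTTP checks quoted throughout, e.g. "HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.376 second response time" above, can be approximated by hand while triaging. A stand-in probe, not the check's actual definition; the beta URL is an assumed target:)

    # status code, body size and total time, with the same 10-second timeout the checks report
    curl -sS -o /dev/null -m 10 \
         -w 'HTTP %{http_code} - %{size_download} bytes in %{time_total} seconds\n' \
         http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
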
[10:27:10] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190)
[10:27:18] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[10:35:39] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace.root.byte_percentfree (<10.00%) WARN: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%)
[10:35:41] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:38:32] PROBLEM - Host deployment-mediawiki03 is DOWN: PING CRITICAL - Packet loss = 100%
[10:39:42] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[10:41:36] RECOVERY - Host deployment-mediawiki03 is UP: PING OK - Packet loss = 0%, RTA = 5.93 ms
[10:41:52] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 16%, RTA = 308.56 ms
[10:42:46] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[10:58:39] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[11:01:33] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2280.78 ms
[11:03:31] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 359.24 ms
[11:07:12] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[11:08:30] FLAPPINGSTART - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 395.52 ms
[11:12:28] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[11:12:40] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:13:00] Project beta-code-update-eqiad build #52911: FAILURE in 0.21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52911/
[11:20:05] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2063.23 ms
[11:24:36] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[11:27:34] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47225 bytes in 8.847 second response time
[11:34:31] Yippee, build fixed!
[11:34:31] Project beta-code-update-eqiad build #52913: FIXED in 1 min 30 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52913/
[11:37:13] FLAPPINGSTOP - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2793.19 ms
[11:38:36] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:40:22] thcipriani|afk: we still have issue with Beta/deployment-prep?
[11:42:36] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 74.65 ms
[11:44:18] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2179.11 ms
[11:46:16] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[11:49:13] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 447.99 ms
[11:49:55] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce build #226: FAILURE in 1 min 55 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce/226/
[11:53:00] Project beta-code-update-eqiad build #52915: FAILURE in 70 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52915/
[11:57:30] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:00:09] Project browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #87: FAILURE in 3 min 9 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/87/
[12:03:28] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 30%, RTA = 5695.72 ms
[12:07:54] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 5436.45 ms
[12:08:35] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 45382 bytes in 8.287 second response time
[12:09:17] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 57.36 ms
[12:09:37] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2445.42 ms
[12:10:03] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[12:10:21] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2071.84 ms
[12:14:11] PROBLEM - Host deployment-parsoid05 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 3498.98 ms
[12:14:15] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 264.04 ms
[12:14:18] PROBLEM - Host deployment-cache-upload02 is DOWN: PING CRITICAL - Packet loss = 25%, RTA = 2811.37 ms
[12:14:20] Yippee, build fixed!
[12:14:20] Project beta-code-update-eqiad build #52917: FIXED in 1 min 19 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52917/
[12:16:20] FLAPPINGSTART - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3923.35 ms
[12:18:11] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[12:21:01] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:24:15] PROBLEM - SSH on deployment-elastic06 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:24:25] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 4770.32 ms
[12:29:06] RECOVERY - SSH on deployment-elastic06 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[12:33:00] Project beta-code-update-eqiad build #52919: FAILURE in 69 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52919/
[12:34:16] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[12:35:14] PROBLEM - SSH on deployment-elastic06 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:40:07] RECOVERY - SSH on deployment-elastic06 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[12:43:43] PROBLEM - SSH on deployment-elastic07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:43:43] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[12:46:37] Yippee, build fixed!
[12:46:37] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1823: FIXED in 36 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1823/
[12:48:33] RECOVERY - SSH on deployment-elastic07 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[12:48:54] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #231: FAILURE in 1 min 53 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/231/
[12:54:28] Yippee, build fixed!
[12:54:29] Project beta-code-update-eqiad build #52921: FIXED in 1 min 28 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52921/
[12:55:21] Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #450: FAILURE in 1 min 21 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/450/
[12:56:34] PROBLEM - Host deployment-eventlogging02 is DOWN: CRITICAL - Host Unreachable (10.68.16.52)
[12:59:45] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 343.94 ms
[13:03:49] PROBLEM - Host deployment-elastic06 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 2678.28 ms
[13:04:45] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 185.76 ms
[13:06:51] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[13:08:12] FLAPPINGSTOP - Host deployment-memc02 is UP: PING WARNING - Packet loss = 16%, RTA = 1819.97 ms
[13:09:44] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3939.57 ms
[13:10:01] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 16%, RTA = 333.55 ms
[13:10:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[13:13:00] Project beta-code-update-eqiad build #52923: FAILURE in 66 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52923/
[13:13:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[13:18:11] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[13:19:59] FLAPPINGSTART - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[13:23:11] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[13:23:41] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:25:21] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING WARNING - Packet loss = 0%, RTA = 847.67 ms
[13:26:35] FLAPPINGSTART - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 233.65 ms
[13:33:28] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[13:34:38] Yippee, build fixed!
[13:34:39] Project beta-code-update-eqiad build #52925: FIXED in 1 min 38 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52925/
[13:39:40] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:40:36] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:40:45] PROBLEM - SSH on deployment-elastic07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:05] PROBLEM - Host deployment-videoscaler01 is DOWN: PING CRITICAL - Packet loss = 40%, RTA = 2988.20 ms
[13:45:57] RECOVERY - Host deployment-videoscaler01 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[13:46:29] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[13:48:34] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 336.35 ms
[13:52:36] (CR) Jforrester: [C: 1] Enable npm job for Gather [integration/config] - https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: Krinkle)
[13:53:01] Project beta-code-update-eqiad build #52927: FAILURE in 0.52 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52927/
[13:54:52] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[13:58:28] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:02:32] kart_: FWIW, I pinged some folks in -labs, not sure what's happening yet, could be ldap stuff, could be weird network strain.
[14:03:22] I'm off to do morning-type things, but I'll be idle here
[14:06:59] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[14:09:05] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #295: FAILURE in 6.8 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/295/
[14:09:39] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:14:13] Yippee, build fixed!
[14:14:13] Project beta-code-update-eqiad build #52929: FIXED in 1 min 12 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52929/
[14:16:31] FLAPPINGSTOP - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[14:22:11] FLAPPINGSTOP - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 419.76 ms
[14:23:19] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[14:26:55] PROBLEM - Host deployment-zookeeper01 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2725.34 ms
[14:28:29] PROBLEM - Host deployment-sentry2 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2545.83 ms
[14:31:05] FLAPPINGSTOP - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 78.41 ms
[14:32:19] PROBLEM - Host deployment-parsoid05 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:00] Project beta-code-update-eqiad build #52931: FAILURE in 62 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52931/
[14:36:49] RECOVERY - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 31.49 ms
[14:36:56] RECOVERY - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[14:38:04] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:43:52] Beta-Cluster: Cannot open pages in Beta Cluster, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1231265 (Jdforrester-WMF)
[14:45:36] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:45:56] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 2103.36 ms
[14:48:17] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:50:55] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[14:52:29] andrewbogott: have you been able to narrow down the network issues at all?
[14:52:35] (if it is/was network issues)
[14:53:55] Yippee, build fixed!
[14:53:56] Project beta-code-update-eqiad build #52933: FIXED in 55 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52933/
[14:56:05] thcipriani: not really, although I’m still poking.
[14:56:09] Are there actual results from the issue or just monitoring complaints?
[14:56:55] you mean what is the overall impact of this?
[14:57:42] yeah — any actual user consequences?
[14:58:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:58:52] starting to see tickets come in about it: https://phabricator.wikimedia.org/T96905
[14:59:05] I don't think anyone can't do their job, though, afaik
[14:59:39] oh — ok, that counts as a user consequence :)
[15:00:28] there was a long gap from about 10pm pst to about 6AM pst when I wasn’t migrating anything. It sounds like the issue was still occurring then, right?
[15:02:12] judging from scrollback it's been fairly continual since about 7pm pdt-ish
[15:02:21] continually intermittent
[15:02:23] he
[15:06:43] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 11%, RTA = 4710.61 ms
[15:06:55] FLAPPINGSTART - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[15:09:28] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #203: STILL FAILING in 47 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/203/
[15:12:45] Hi Reedy... how's it going? are you the one to ping for a beta cluster issue?
[15:12:54] Beta-Cluster, Labs: Migrate deployment-prep to new labvirt hosts - https://phabricator.wikimedia.org/T96678#1231284 (Andrew) Open>Resolved This is done!
[15:13:00] Project beta-code-update-eqiad build #52935: FAILURE in 67 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52935/
[15:14:22] AndyRussG: this is the right place but Reedy is probably not the right person. Probably best if you just describe the issue so that anyone here can follow up.
[15:14:38] (Reedy is, I believe, now studying to be a pilot.)
[15:15:16] I should probably remove his +v so he doesn't get pinged
[15:15:38] andrewbogott: ah cool thanks! Yeah besides the fact that I'm getting lots of server errors, I forgot my password for my WMF account there and didn't register an e-mail address (silly me ;P )
[15:16:18] Uhh and in addition (if I'm not already, might already be) I should be a CentralNotice admin there...
[15:16:40] define the server errors
[15:16:40] Does the ‘lots of server errors’ issue resemble this one? https://phabricator.wikimedia.org/T96905
[15:16:50] bah, nevermind
[15:16:58] Something odd is happening to the CentralNotice campaigns running on the beta cluster, and our browser tests depend on them
[15:17:20] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 6683.87 ms
[15:17:43] andrewbogott: yeah lots of those intermittently
[15:18:02] But sometimes the pages load fine (though even slower than usual it seems)
[15:18:15] Beta-Cluster: Can't connect to Beta Cluster database - https://phabricator.wikimedia.org/T96905#1231300 (greg)
[15:18:54] thcipriani: is ^^ due to the restarts yesterday?
[15:19:17] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #31: FAILURE in 1 min 16 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/31/
[15:19:44] I think it's due to the issues we've been seeing since about 7pm last night with the whole beta cluster intermittently
[15:19:57] greg-g: in theory I’m investigating that issue, but I’m starting to think it has nothing to do with me :(
[15:20:15] thcipriani: any leads?
[15:20:25] bug for that?
[15:20:41] not yet, filing now
[15:20:51] and no, no leads yet :(
[15:20:51] * greg-g was in migraine-ville yesterday after 3 or so, may have missed much
[15:20:54] :(
[15:21:46] http://shinken.wmflabs.org/problems?search=hg:deployment-prep looks not happy
[15:22:18] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231310 (thcipriani) NEW
[15:22:43] greg-g: yeah, the problem is, nothing is failing consistently
[15:22:43] greg-g: andrewbogott: so... who should I ask to reset my beta cluster account?
[15:22:48] and beta scap still hasn't run successfully since 2 days 10 hr ago
[15:22:59] AndyRussG: I don't know.
[15:23:11] AndyRussG: right now? no one in this room, we're trying to get the beta cluster working again
[15:23:23] greg-g: :((
[15:23:27] need help?
[15:23:27] legoktm might be able to help
[15:23:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0]
[15:23:35] bd808: :( maybe
[15:23:50] greg-g: K thanks sorry to interrupt ;/
[15:24:35] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[15:25:00] only thing that happened around that time in the SAL was 00:29 bd808: cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205969/ (logstash: Convert $::realm switches to hiera)
[15:25:29] Related: I unstuck puppet replication just before that
[15:25:40] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231321 (greg)
[15:25:42] but it was a good hour before anything started erroring out
[15:25:50] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231325 (greg) p:Triage>Unbreak!
[15:26:39] FLAPPINGSTART - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms
[15:26:40] Puppet on deployment-salt had been in a detached head state for some unknown time
[15:27:01] But that should have changed things sooner than an hour
[15:27:31] PROBLEM - SSH on deployment-test is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:40] right, it seems like the network is just super slow and there aren't actually any problems on the boxes
[15:28:06] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #512: FAILURE in 6.1 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/512/
[15:28:16] I can log into the boxes with problems, the load is fine, ssh is super slow, the stuff in syslog is complaining about an ldap group
[15:28:34] nslcd[1270]: [b9698b] error writing to client: Broken pipe
[15:28:45] thcipriani: it would help me out if you can detect any sort of pattern — like, is the network slower for some instances than others, and if so, which ones?
[15:28:46] that's known noise
[15:29:43] the nslcd log line is related to an ldap group that is too large to fit in a packet. We've had that all over labs for a very long time
[15:30:03] andrewbogott: deployment-parsoid05 was certainly the worst one I logged into this morning
[15:31:06] thcipriani: btw, if this takes a while, don't worry about our 1:1
[15:31:09] ok, that one’s on labvirt1005
[15:31:18] any others?
[15:32:45] iirc both of the deployment-db{1,2} were sticky as well
[15:32:50] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING WARNING - Packet loss = 0%, RTA = 533.97 ms
[15:33:06] oh and that one ^
[15:33:15] jebus
[15:33:34] (re the RTA)
[15:34:18] Yippee, build fixed!
[15:34:18] Project beta-code-update-eqiad build #52937: FIXED in 1 min 17 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52937/
[15:34:56] whoa, nice
[15:35:50] thcipriani: restbase01 is also on labvirt1005. The db hosts are not.
[15:37:04] I was just on db hosts, seem fine now, could have been mis-remembering, restbase and parsoid are definitely still very very slow
[15:37:39] PROBLEM - Host deployment-parsoid05 is DOWN: PING CRITICAL - Packet loss = 100%
[15:38:00] what other vms are on 1005?
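
(The instance-to-hypervisor mapping andrewbogott is reading off here can be pulled straight from OpenStack. A rough sketch with the nova CLI of the era, assuming admin credentials; the instance and host names come from the log, the exact output format varies by release:)

    # which compute node hosts a given instance (an admin-only attribute)
    nova show deployment-parsoid05 | grep 'OS-EXT-SRV-ATTR:host'
    # the reverse: list every VM on a suspect compute node
    nova hypervisor-servers labvirt1005
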
[15:38:49] Continuous-Integration, Continuous-Integration-Isolation, operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1231353 (mark) a:mark>faidon
[15:38:55] Hm… cvn-app4 is there and it is BUSY
[15:40:20] * greg-g wants dedicated virt hardware for beta cluster :/
[15:42:34] bye bye qa-morebots
[15:44:18] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 3099.60 ms
[15:49:15] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 393.48 ms
[15:50:57] PROBLEM - Host deployment-videoscaler01 is DOWN: CRITICAL - Host Unreachable (10.68.16.211)
[15:53:00] Project beta-code-update-eqiad build #52939: FAILURE in 58 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52939/
[15:54:14] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 0%, RTA = 661.74 ms
[15:54:57] deployment-restbase01 just gave me a kernel:[2042464.137735] BUG: soft lockup - CPU#2 stuck for 25s!
[15:55:53] is the compute node those vms are on dying?
[15:55:58] RECOVERY - Host deployment-videoscaler01 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[15:58:07] judging from icinga alerts in -operations labvirt1005 and labvirt1006 aren't thrilled today
[16:00:12] load on deployment-logstash1 is crazy. 12+
[16:00:29] PROBLEM - Host deployment-mediawiki03 is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:40] thcipriani: I’m moving that cvn instance to a new host. We’ll see if that improves things or at least alters the pattern
[16:00:49] kk
[16:01:15] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[16:01:53] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[16:03:04] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:19] FLAPPINGSTOP - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[16:09:13] FLAPPINGSTOP - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:19] Project beta-parsoid-update-eqiad build #937: FAILURE in 0.11 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/937/
[16:09:59] PROBLEM - Host deployment-memc04 is DOWN: CRITICAL - Host Unreachable (10.68.17.69)
[16:11:29] FLAPPINGSTOP - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24)
[16:12:20] FLAPPINGSTOP - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 100%
[16:14:21] Yippee, build fixed!
[16:14:22] Project beta-code-update-eqiad build #52941: FIXED in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52941/
[16:20:00] RECOVERY - Host deployment-memc04 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[16:21:50] FLAPPINGSTOP - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[16:33:00] Project beta-code-update-eqiad build #52943: FAILURE in 71 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52943/
[16:33:59] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:35:38] ^d: We need to split up InitialiseSettings. It's waaay too big
[16:35:56] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2599.41 ms
[16:36:19] <^d> bd808: Finish the config project and move it into a real storage instead of a PHP file :p
[16:36:30] +2
[16:36:47] If only legoktm hadn't been sent to flow land :(
[16:38:24] Continuous-Integration: Disable html format for xdebug/var_dump for Apache on Jenkins slaves - https://phabricator.wikimedia.org/T97040#1231531 (Krinkle) NEW
[16:41:01] <^d> bd808: Where does general architecture cleanups unrelated to reading/editing/performance/security get handled now?
[16:41:17] "magic elves"
[16:41:18] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[16:41:30] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[16:41:33] <^d> bd808: Ah, I missed those in the new org chart ;-)
[16:41:54] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[16:42:16] <^d> bd808: You could argue almost any of those groups take it
[16:42:19] I should edit the chart and add some marker for all the designated magic elves. I'm pretty sure I know who they are
[16:42:20] <^d> And argue they shouldn't :)
[16:42:42] RECOVERY - Host deployment-mediawiki03 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[16:43:08] Continuous-Integration, Patch-For-Review: Disable xdebug's html formatting of PHP errors for Apache on Jenkins slaves - https://phabricator.wikimedia.org/T97040#1231569 (Krinkle)
[16:43:17] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 353.07 ms
[16:43:22] <^d> bd808: I found 3 the other day. cross-ref my e-mail about ES ownership ;-)
[16:44:00] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[16:44:18] I think the current high level thinking is that (a) people will do these things because they feel compelled; (b) debt should only be cleaned up in connection with "real" work; and (c) "oops didn't think of that because I never did it"
[16:45:03] <^d> (a) happened before the reorg too, so I guess I can see that continuing
[16:45:54] imma gonna do the things I see that need doing until somebody physically stops me
[16:46:11] <^d> "What do you do?"
[16:46:14] <^d> "What needs to be done"
[16:46:21] <^d> "Ah, carry on!"
[16:46:23] *slides on sunglasses*
[16:46:52] thcipriani: Are things behaving now, at least for the moment?
[16:47:17] Somebody has to be ReedyBot until he gets tired of having a life with dreams and stuff
[16:47:36] !quip
[16:48:00] <^d> bd808: I've tried, I seem to require more sleep/food though
[16:48:17] *nod* I'm too old to do it properly too
[16:48:24] RECOVERY - SSH on deployment-test is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[16:48:41] FLAPPINGSTART - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187)
[16:48:49] my brain doesn't work fast enough to keep up with the smart kids
[16:49:01] and my hands seize up
[16:49:39] We should specifically recruit for hyperactive insomniacs ;)
[16:50:20] http://test2.wikipedia.org/wiki/Special:RecentChanges
[16:50:23] so many questions
[16:50:24] <^d> "Do you suffer from episodes of mania that prevent you from sleeping for days on end?"
[16:50:26] is this used?
[16:50:29] seeking coder who loves untangling messes and requires little sleep with access to free concert tickets
[16:51:05] mutante: test2 is for people to mess with semi-randomly
[16:51:28] <^d> test2 exists because we needed a way to test multiversion when test.wp was even more hacky than it is
[16:51:35] <^d> Now it's just a playground
[16:51:37] <^d> Runs group0
[16:51:53] FLAPPINGSTOP - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 261.54 ms
[16:52:14] test.wp.o is used in the automation tests and used to get you punched for messing up
[16:52:26] ok guys, so then i'm not using "delete test2 from DNS" as the random edit i need to make
[16:52:45] because if i dont edit the wikipedia.org zone file templates, the new "gom" language won't get added
[16:52:48] heh
[16:53:57] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28282 bytes in 7.221 second response time
[16:54:20] Yippee, build fixed!
[16:54:20] Project beta-code-update-eqiad build #52945: FIXED in 1 min 19 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52945/
[16:54:35] RECOVERY - SSH on deployment-elastic07 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[16:55:21] quality.wikipedia.org - This wiki has been closed and its content has been moved to meta
[16:55:41] <^d> Heh, qualitywiki
[16:55:47] (i'm just looking at the DNS template, what is not a language etc)
[16:58:30] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:58:42] is there a master bug / tracking project for the beta issues?
[16:59:16] (PS3) Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687)
[16:59:16] tgr: https://phabricator.wikimedia.org/tag/beta-cluster/
[16:59:56] tgr https://phabricator.wikimedia.org/T97033 is ongoing
[17:00:26] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:00:52] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[17:01:21] PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:02:05] oh, what now?
[17:02:19] andrewbogott: things did seem better there for a few
[17:03:16] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231643 (Tgr) Could be the cause for {T97047} (which does not seem to be intermittent though).
[17:03:42] now just seems like parsoidcache02 afai can tell
[17:04:28] PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[17:05:18] PROBLEM - Puppet failure on integration-slave-trusty-1017 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[17:07:52] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:07:53] deployment-cache-mobile03 doesn't seem to be doing too well either :\
[17:09:08] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 129.59 ms
[17:13:00] Project beta-code-update-eqiad build #52947: FAILURE in 57 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52947/
[17:14:29] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:15:59] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 6477.80 ms
[17:16:49] PROBLEM - Puppet failure on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[17:17:17] thcipriani: mind if I reboot deployment-parsoidcache02 just to see if it settles down? There’s nothing wrong in the infrastructure anymore, as far as I can tell.
[17:17:47] andrewbogott: sure should be fine
[17:17:48] PROBLEM - Puppet failure on integration-slave-trusty-1015 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:18:07] * andrewbogott grasps a straw, clings for dear life
[17:19:45] andrewbogott: thanks for taking the time on this today—finding my way from an individual instance to the libvirt host a bit opaque to me :)
[17:20:01] FLAPPINGSTOP - Host deployment-mx is UP: PING WARNING - Packet loss = 0%, RTA = 707.18 ms
[17:20:04] thcipriani: it doesn’t help that the instance pages on wikitech are out of date :(
[17:20:33] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 45372 bytes in 6.615 second response time
[17:21:17] FLAPPINGSTOP - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134)
[17:21:17] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[17:24:16] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[17:29:04] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:29:30] RECOVERY - Puppet failure on integration-zuul-packaged is OK: OK: Less than 1.00% above the threshold [0.0]
[17:29:43] thcipriani: shinken hasn’t caught up yet, but I can see that deployment-parsoidcache02 is up and reachable now. Does it look to you like things are mostly working?
[17:30:07] * thcipriani looks
[17:32:54] andrewbogott: yeah, varnish looks up, seems like all that should be running
[17:33:10] seems really responsive, so that's nice, too
[17:33:33] ok. labvirt1006 still looks a bit off to me, but I’ll keep an eye on it for a bit, see if it calms down.
[17:33:44] Meanwhile I’m going to step away for a bit.
[17:34:13] Yippee, build fixed!
[17:34:13] Project beta-code-update-eqiad build #52949: FIXED in 1 min 12 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52949/ [17:34:20] PROBLEM - Host deployment-cache-upload02 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 4527.02 ms [17:34:51] andrewbogott: kk, yeah, stuff is definitely still wacky :\ [17:35:00] dammit, that one isn’t on 1006 [17:35:47] what about deployment-logstash1? that machine is... unhappy :( [17:36:18] that may just be excess load though... [17:37:35] it’s on 1006 [17:38:10] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [17:39:34] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [17:43:10] FLAPPINGSTOP - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [17:45:22] (03PS4) 10Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687) [17:49:15] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [17:53:00] Project beta-code-update-eqiad build #52951: FAILURE in 62 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52951/ [17:55:30] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:58:11] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #518: FAILURE in 11 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/518/ [18:02:21] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [18:03:08] 10Beta-Cluster, 10Continuous-Integration, 6Release-Engineering, 10Parsoid: Parsoid patches don't update Beta Cluster automatically -- only deploy repo patches seem to update that code - https://phabricator.wikimedia.org/T92871#1231878 (10Mattflaschen) I don't know whether it's updating at deploy-time for F... [18:03:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [18:04:11] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [18:05:22] RECOVERY - Puppet failure on integration-slave-trusty-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [18:05:24] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [18:05:25] thcipriani: ok, one more big interruption — rebooting labvirt1005 fixed a bunch of things, going to see if the same is true of 1006 [18:06:13] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:18] RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:49] RECOVERY - Puppet failure on integration-slave-trusty-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:58] andrewbogott: awesome. [18:07:48] RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK: OK: Less than 1.00% above the threshold [0.0] [18:08:50] (03CR) 10Jdlrobson: [C: 031] "Timo you're a rock star. I'd merge if I could but I only have +1 here."
[integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:09:09] 10Continuous-Integration, 10Gather, 3Gather Sprint Forward, 5Patch-For-Review: Gather should be using its own Gruntfile in Jenkins - https://phabricator.wikimedia.org/T92589#1231886 (10Jdlrobson) [18:09:29] RECOVERY - Puppet failure on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [18:10:07] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [18:10:07] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [18:10:19] PROBLEM - Host deployment-upload is DOWN: CRITICAL - Host Unreachable (10.68.16.189) [18:10:51] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [18:12:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [18:12:43] (03CR) 10Krinkle: [C: 032] Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:12:46] (03PS2) 10Krinkle: Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) [18:12:51] (03CR) 10Krinkle: [C: 032] Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:13:36] FLAPPINGSTOP - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [18:14:08] Yippee, build fixed! [18:14:08] Project beta-code-update-eqiad build #52953: FIXED in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52953/ [18:14:15] RECOVERY - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [18:14:29] (03Merged) 10jenkins-bot: Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:14:36] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [18:15:06] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [18:15:19] !log Deploying Zuul config https://gerrit.wikimedia.org/r/206068 [18:16:55] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [18:16:59] Krinkle, can i get you to review a gerrit-wm patch to remove duplicate messages in #parsoid channel? [18:17:15] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [18:17:16] grrrit-wm [18:17:21] oh, you are busy. [18:17:49] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:09] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [18:19:06] RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:36] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [18:19:52] subbu: I don't know gerrit-wm well enough. And kind of busy at the moment. [18:20:05] np [18:20:27] Yippee, build fixed! 
[18:20:28] Project beta-parsoid-update-eqiad build #939: FIXED in 40 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/939/ [18:21:10] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [18:21:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [18:21:47] PROBLEM - Host deployment-zookeeper01 is DOWN: CRITICAL - Host Unreachable (10.68.17.157) [18:25:01] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2743.35 ms [18:26:19] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [18:26:33] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [18:26:49] RECOVERY - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [18:29:39] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [18:30:08] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #295: FAILURE in 7.5 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/295/ [18:30:10] thcipriani: I’m going to lunch now — we’ll see how much shinken chatter piles up while I’m gone. [18:30:33] RECOVERY - Puppet failure on integration-zuul-packaged is OK: OK: Less than 1.00% above the threshold [0.0] [18:33:00] Project beta-code-update-eqiad build #52955: FAILURE in 0.15 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52955/ [18:38:17] (03CR) 10Jdlrobson: "thank you!" [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:39:13] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 16%, RTA = 898.67 ms [18:41:37] <^d> mobrovac, twentyafterfour: Afterthought. On "statuses" -- I can think of 3 distinct statuses a node could have off the top of my head. [18:41:44] <^d> 1) Is there at least a process running [18:41:51] <^d> 2) Is it responding to a connection [18:42:11] <^d> 3) Is it actually returning something useful and non-error when you do connect? [18:42:47] yes, where "non-error" is user-defined (a 404 might be needed/wanted e.g.) [18:42:59] <^d> *nod* [18:43:44] also, firewalls and stuff play a part here, so in extreme cases "responding to outside conns" may differ from "responding to conns" [18:43:53] <^d> non-error is a bad phrase.
"responding as expected to a nominal query" [18:43:55] (these being, misconfig, etc) [18:44:29] ^d: exactly, something like assert(res.status == whatever_i_need_it_to_be) [18:45:08] <^d> Yep [18:45:11] (or res.body or headers or whatever) [18:45:42] <^d> In Elastic's case that's basically returning 200 for GET / :) [18:46:27] 10Deployment-Systems: Come up with an abstract deployment model that roughly addresses the needs of existing projects - https://phabricator.wikimedia.org/T97068#1232048 (10mmodell) 3NEW a:3mmodell [18:46:27] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1824: FAILURE in 26 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1824/ [18:46:34] that's easy enough to fit in any model :) [18:47:33] <^d> twentyafterfour: It should have 7 layers like any good model :P [18:48:06] <^d> Or maybe I mean salad [18:48:32] <^d> Now I want a 7 layer salad [18:48:48] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:54:24] Yippee, build fixed! [18:54:25] Project beta-code-update-eqiad build #52957: FIXED in 1 min 24 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52957/ [18:57:12] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [18:58:14] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:02:05] 10Beta-Cluster: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1232111 (10Jdlrobson) [19:02:06] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1232110 (10Jdlrobson) [19:02:10] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 255.42 ms [19:02:46] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1156443 (10Jdlrobson) @hashar remaining 2 test failures are due to T96920 and the required fron... [19:09:48] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3767.94 ms [19:10:02] FLAPPINGSTART - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 353.57 ms [19:13:00] Project beta-code-update-eqiad build #52959: FAILURE in 70 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52959/ [19:13:52] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [19:14:40] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3913.17 ms [19:19:29] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [19:19:43] Hallo. [19:20:11] Which ICU version is running on the Wikimedia production cluster? [19:22:49] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1232188 (10KLans_WMF) @Christopher Default show Sprint Start and Sprint End date fields on Project Create form is a great enhancement. I'm hoping this will save @AKlapper some h... 
[19:25:52] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [19:26:02] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1232209 (10KLans_WMF) I am remiss in not saying thank you @Christopher for your work here :-) [19:28:16] RECOVERY - Puppet failure on integration-slave-precise-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [19:34:11] Yippee, build fixed! [19:34:11] Project beta-code-update-eqiad build #52961: FIXED in 1 min 10 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52961/ [19:36:29] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 3949.59 ms [19:37:27] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 36%, RTA = 4572.32 ms [19:41:19] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [19:44:39] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2847.72 ms [19:45:01] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [19:53:00] Project beta-code-update-eqiad build #52963: FAILURE in 57 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52963/ [20:00:03] FLAPPINGSTOP - Host deployment-kafka02 is UP: PING WARNING - Packet loss = 0%, RTA = 658.45 ms [20:02:55] FLAPPINGSTART - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [20:04:37] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [20:06:01] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 2783.89 ms [20:07:21] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3804.36 ms [20:14:09] Yippee, build fixed!
[20:14:10] Project beta-code-update-eqiad build #52965: FIXED in 1 min 9 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52965/ [20:15:55] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:16:05] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [20:17:17] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [20:17:21] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3235.82 ms [20:26:36] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [20:31:35] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [20:33:00] Project beta-code-update-eqiad build #52967: FAILURE in 0.31 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52967/ [20:33:13] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 25%, RTA = 3542.82 ms [20:37:55] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING WARNING - Packet loss = 0%, RTA = 1384.07 ms [20:38:29] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [20:41:27] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [20:41:33] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [20:41:54] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [20:42:20] FLAPPINGSTART - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 94.68 ms [20:51:19] FLAPPINGSTART - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 93.59 ms [20:52:45] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:54:18] Yippee, build fixed! [20:54:19] Project beta-code-update-eqiad build #52969: FIXED in 1 min 18 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52969/ [21:01:22] PROBLEM - Host deployment-redis01 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4835.36 ms [21:01:38] PROBLEM - Host integration-saltmaster is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2415.78 ms [21:06:14] RECOVERY - Host deployment-redis01 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [21:09:33] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1232526 (10Christopher) @KLans_WMF thank you kindly for your help. I added a few more options to relieve the aggravation of the "mandatory" start and end dates. Rather than... [21:13:00] Project beta-code-update-eqiad build #52971: FAILURE in 60 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52971/ [21:17:48] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0] [21:22:52] FLAPPINGSTART - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 379.06 ms [21:32:45] 6Release-Engineering, 7Jenkins: Run the git whitespace checker as part of CirrusSearch V+2 - https://phabricator.wikimedia.org/T97086#1232648 (10EBernhardson) 3NEW [21:34:43] Yippee, build fixed! 
[21:34:44] Project beta-code-update-eqiad build #52973: FIXED in 1 min 43 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52973/ [21:36:36] FLAPPINGSTART - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 90.97 ms [21:42:20] FLAPPINGSTOP - Host deployment-db2 is UP: PING WARNING - Packet loss = 0%, RTA = 944.12 ms [21:51:20] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 12.81 ms [21:53:00] Project beta-code-update-eqiad build #52975: FAILURE in 68 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52975/ [21:53:34] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3573.36 ms [21:56:29] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 457.15 ms [21:58:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [22:04:26] if anyone happens to randomly know ... i'm following https://www.mediawiki.org/wiki/Continuous_integration/Jenkins_job_builder from a new laptop, but just getting a "KeyError: 'numToKeep'" exception. pastie here: https://phabricator.wikimedia.org/P550 [22:05:58] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3422.71 ms [22:06:17] (03PS1) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:06:46] (03CR) 10EBernhardson: [C: 04-1] "untested, just checking jenkins output here" [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) (owner: 10EBernhardson) [22:07:56] (03CR) 10jenkins-bot: [V: 04-1] Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) (owner: 10EBernhardson) [22:10:45] (03PS2) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:12:15] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2439.16 ms [22:15:33] Yippee, build fixed! [22:15:33] Project beta-code-update-eqiad build #52977: FIXED in 2 min 32 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52977/ [22:20:50] ^d: i noticed that you can approve new git repos. 
:) i just submitted a request for integration/raita (the browser test dashboard) [22:20:51] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [22:30:54] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [22:30:56] FLAPPINGSTART - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 180.85 ms [22:31:28] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [22:32:14] (03PS3) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:32:54] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 115.24 ms [22:33:32] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [22:35:45] (03PS4) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:40:44] (03CR) 10EBernhardson: "This patch is mostly guesswork, based on mediawiki-core-whitespaces job. would appreciate review. Also i understand it wont be voting, ba" [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) (owner: 10EBernhardson) [22:47:07] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2655.32 ms [22:51:08] PROBLEM - Host deployment-bastion is DOWN: CRITICAL - Host Unreachable (10.68.16.58) [22:51:28] PROBLEM - Host deployment-lucid-salt is DOWN: CRITICAL - Host Unreachable (10.68.17.49) [22:51:54] PROBLEM - Host deployment-cache-text02 is DOWN: CRITICAL - Host Unreachable (10.68.16.16) [22:52:02] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 351.68 ms [22:52:20] PROBLEM - Host deployment-salt is DOWN: CRITICAL - Host Unreachable (10.68.16.99) [22:52:32] PROBLEM - Host deployment-urldownloader is DOWN: CRITICAL - Host Unreachable (10.68.16.135) [22:53:15] PROBLEM - Host deployment-elastic08 is DOWN: CRITICAL - Host Unreachable (10.68.17.188) [22:53:17] PROBLEM - Host deployment-pdf01 is DOWN: CRITICAL - Host Unreachable (10.68.16.73) [22:53:27] PROBLEM - Host deployment-rsync01 is DOWN: CRITICAL - Host Unreachable (10.68.17.66) [22:53:37] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 40%, RTA = 7044.15 ms [22:54:05] PROBLEM - Host deployment-memc03 is DOWN: CRITICAL - Host Unreachable (10.68.16.15) [22:54:21] PROBLEM - Host Generic Beta Cluster is DOWN: CRITICAL - Host Unreachable (en.wikipedia.beta.wmflabs.org) [22:56:31] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [22:58:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [22:59:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [23:01:17] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:01:27] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [23:02:09] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:02:19] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:02:53] PROBLEM 
- Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:03:01] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:04:22] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:04:30] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:06:40] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:06:54] Project beta-code-update-eqiad build #52981: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52981/ [23:07:08] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:09:50] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:11:00] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:11:56] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:12:22] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:12:25] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [23:12:25] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:12:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:13:19] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:13:43] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [23:13:45] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:13:57] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:13:57] PROBLEM - Puppet failure on deployment-test is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:14:09] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:14:29] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:15:30] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:15:52] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [23:17:04] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:17:10] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:29] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:20:09] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:20:11] FLAPPINGSTOP - Host 
deployment-kafka02 is UP: PING WARNING - Packet loss = 73%, RTA = 121.96 ms [23:22:05] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 45152 bytes in 6.514 second response time [23:23:19] RECOVERY - Host deployment-memc03 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [23:23:39] RECOVERY - Host deployment-bastion is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:23:47] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:22] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [23:24:25] RECOVERY - Host deployment-rsync01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:25:02] RECOVERY - Host deployment-pdf01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:25:04] RECOVERY - Host deployment-elastic08 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [23:25:14] RECOVERY - Host deployment-cache-text02 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [23:25:22] RECOVERY - Host deployment-lucid-salt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [23:25:54] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [23:26:19] ^ andrewbogott did you ever know that you're my hero :) [23:26:32] RECOVERY - Host deployment-urldownloader is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [23:26:34] RECOVERY - Host deployment-salt is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [23:26:45] thcipriani: keep in mind that I also broke them in the first place... [23:26:59] And that’s only one of 6 hosts rebooted, so there will be similar outage storms coming up [23:27:15] Well, maybe not tonight since I want to make sure the fix on labvirt1001 really fixed things. [23:28:15] sure, well, thanks for digging deep on this one. Some days innocuous problems result in new kernels across clusters, I guess :P [23:28:31] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:28:51] YuviPanda: jouncebot is sick :( It's joining and responding to commands except it does nothing when "next" is asked [23:29:21] bd808: I’m going to restart it and see if that helps [23:29:22] thcipriani, the next step is to decide if these instances are officially working correctly: https://dpaste.de/A2pb [23:29:28] If so, I can apply the same medicine to other hosts [23:29:37] Meanwhile, time to put dinner in the oven [23:29:53] bd808: done [23:29:56] YuviPanda: k. I've cycled it a couple of times [23:30:23] * bd808 was about to add more not so secret debugging commands to it [23:30:32] heh [23:30:41] PROBLEM - Host deployment-db1 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2572.68 ms [23:30:41] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [23:31:18] YuviPanda: are we running it twice?
[23:31:19] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:31:23] PROBLEM - Puppet staleness on deployment-urldownloader is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [23:31:26] bd808: not sure where the other one is coming from [23:31:33] rebooting only bounced jouncebot_ [23:31:36] I started it on tools-bastion-01 [23:31:43] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:31:46] which has jouncebot nick [23:32:08] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [23:32:14] Yippee, build fixed! [23:32:14] Project beta-code-update-eqiad build #52982: FIXED in 1 min 44 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52982/ [23:32:26] $ job jouncebot -- 258195 [23:33:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1520 bytes in 2.109 second response time [23:33:47] bd808: let me shut it down and start it back up on trusty [23:33:56] k [23:34:50] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3195.11 ms [23:38:30] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [23:38:46] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0] [23:38:58] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:08] RECOVERY - Puppet failure on deployment-apertium01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:26] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:44] PROBLEM - Host deployment-eventlogging02 is DOWN: CRITICAL - Host Unreachable (10.68.16.52) [23:40:10] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:40:30] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [23:41:28] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [23:41:32] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [23:42:16] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 57%, RTA = 7134.16 ms [23:43:05] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [23:43:26] andrewbogott: the guests on that machine all seem to ping in sub 1 ms range consistently [23:43:59] thcipriani: that’s a good sign! [23:44:14] I’m doing labvirt1002 now, that will upset another set of instances… [23:44:19] then I’m going to lay off until the morning.
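(Editor's note: the quick guest sweep thcipriani describes -- ping each instance on the rebooted labvirt host and eyeball the round-trip times -- is easy to script. A rough sketch, assuming GNU ping on Linux; the host list is a placeholder standing in for the dpaste contents, which are not preserved in this log.)

```python
import subprocess

# Placeholder list: stands in for whatever the dpaste of guests on the
# rebooted labvirt host contained.
HOSTS = ["deployment-db1", "deployment-memc02", "deployment-logstash1"]

for host in HOSTS:
    # One echo request with a 2-second deadline (GNU ping's -w flag);
    # exit code 0 means a reply came back.
    result = subprocess.run(["ping", "-c", "1", "-w", "2", host],
                            capture_output=True, text=True)
    if result.returncode == 0:
        # The final line of ping's output summarises the round-trip time.
        print(host, "up:", result.stdout.strip().splitlines()[-1])
    else:
        print(host, "DOWN")
```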
[23:44:21] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [23:44:30] kk [23:44:31] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [23:44:33] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:44:49] I'll update the ticket once that is complete [23:44:53] PROBLEM - Host deployment-eventlogging02 is DOWN: CRITICAL - Host Unreachable (10.68.16.52) [23:45:07] PROBLEM - Host deployment-mathoid is DOWN: PING CRITICAL - Packet loss = 100% [23:45:33] PROBLEM - Host deployment-sca01 is DOWN: CRITICAL - Host Unreachable (10.68.17.54) [23:46:05] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [23:46:43] PROBLEM - Host deployment-memc04 is DOWN: CRITICAL - Host Unreachable (10.68.17.69) [23:46:45] PROBLEM - Host deployment-pdf02 is DOWN: CRITICAL - Host Unreachable (10.68.16.129) [23:46:45] PROBLEM - Host deployment-videoscaler01 is DOWN: CRITICAL - Host Unreachable (10.68.16.211) [23:47:05] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 179.12 ms [23:47:07] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:11] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:21] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:41] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [23:47:56] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:58] PROBLEM - Host deployment-db1 is DOWN: CRITICAL - Host Unreachable (10.68.16.193) [23:48:08] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51) [23:50:54] FLAPPINGSTOP - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [23:51:30] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:38] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [23:52:08] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [23:52:10] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:52:40] RECOVERY - Host deployment-sca01 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [23:53:06] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:53:12] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [23:53:18] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 5597.84 ms [23:53:24] RECOVERY - Host deployment-memc04 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [23:53:28] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:53:37] RECOVERY - Host deployment-videoscaler01 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [23:53:41] RECOVERY - Host deployment-pdf02 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [23:53:51] RECOVERY - Host deployment-mathoid is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [23:54:29] RECOVERY - Host deployment-db1 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [23:55:58] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2049.79 ms [23:56:00] 
RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:56:18] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 64%, RTA = 2129.21 ms [23:57:01] thcipriani: ok, that’s all the breaking and/or fixing that I’m going to do tonight. Hosts not on labvirt1001 or labvirt1002 will continue to stutter as they have been… if all looks well I’ll sort them out tomorrow. [23:57:02] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:57:22] RECOVERY - Puppet failure on deployment-zookeeper01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:57:28] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [23:57:33] andrewbogott: kk, I'll update the ticket and do a little clean up of the guests on rebooted hosts [23:57:47] do you have a list of labvirt1002 guests? [23:58:20] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:58:56] RECOVERY - Puppet failure on deployment-test is OK: OK: Less than 1.00% above the threshold [0.0] [23:59:06] thcipriani: labvirt1002: https://dpaste.de/ku27 [23:59:11] andrewbogott: thanks again for all your help—it was a problem I'd have never found :) [23:59:20] me neither, Coren found it :) [23:59:34] well it _was_ found [23:59:44] that's what's important here [23:59:48] :) [23:59:50] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]