[00:03:33] !log cleaned up redis leftovers on deployment-logstash1
[00:03:36] Logged the message, Master
[00:11:18] Continuous-Integration, developer-notice: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1229838 (Krinkle)
[00:17:24] !log beta cluster fatal monitor full of "Bad file descriptor: AH00646: Error writing to /data/project/logs/apache-access.log"
[00:17:26] Logged the message, Master
[00:29:14] !log cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205969/ (logstash: Convert $::realm switches to hiera)
[00:29:17] Logged the message, Master
[00:46:23] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1821: FAILURE in 22 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1821/
[00:50:05] Beta-Cluster, Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1229915 (Nuria) I have just enabled puppet again, no reason to have it disabled anymore (we did so for testing purposes couple weeks back)
[01:05:09] (PS1) Krinkle: Remove lint-js from test/gate pipeline where npm runs [integration/config] - https://gerrit.wikimedia.org/r/206044
[01:06:14] (Abandoned) Krinkle: Publish doxygen doc for the 'cdb' project [integration/config] - https://gerrit.wikimedia.org/r/174417 (https://bugzilla.wikimedia.org/73530) (owner: Hashar)
[01:06:40] Continuous-Integration: Publish cdb documentation to doc.wikimedia.org - https://phabricator.wikimedia.org/T75530#1229945 (Krinkle) a:hashar>Krinkle
[01:19:28] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #231: FAILURE in 1 min 28 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/231/
[01:20:44] Project beta-update-databases-eqiad build #9107: FAILURE in 44 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9107/
[01:25:27] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 2.157 second response time
[01:26:02] Beta-Cluster, VisualEditor: Cannot open any page with VE in Betalabs, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1229983 (Shizhao) me2
[01:40:28] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47222 bytes in 1.108 second response time
[01:48:03] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1520 bytes in 2.323 second response time
[01:53:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 46939 bytes in 1.050 second response time
[01:55:20] Project browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-os_x_10.10-iphone-sauce build #48: FAILURE in 1 min 20 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-os_x_10.10-iphone-sauce/48/
[02:06:59] Continuous-Integration, Wikimedia-Hackathon-2015: All new extensions should be setup automatically with Zuul - https://phabricator.wikimedia.org/T92909#1230005 (Krinkle) >>! In T92909#1222266, @Jdlrobson wrote: > When I do spend time in it it takes too long to get code reviewed/fixed and merged (it reall...
[02:07:00] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2072.68 ms
[02:11:11] PROBLEM - Host deployment-stream is DOWN: CRITICAL - Host Unreachable (10.68.17.106)
[02:11:11] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms
[02:12:19] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94)
[02:16:57] RECOVERY - Host deployment-stream is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[02:17:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 211.65 ms
[02:21:00] Yippee, build fixed!
[02:21:01] Project beta-update-databases-eqiad build #9108: FIXED in 1 min 0 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9108/
[02:25:29] Project browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce build #77: FAILURE in 3 min 28 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce/77/
[02:36:32] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #550: FAILURE in 3 min 32 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/550/
[02:42:14] Continuous-Integration, Wikimedia-Hackathon-2015: All new extensions should be setup automatically with Zuul - https://phabricator.wikimedia.org/T92909#1230032 (Jdlrobson) I understand.. but the reason I use github.com for personal development is the simplicity. Put frankly, I don't want to spend time on...
[02:45:05] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #514: FAILURE in 4 min 4 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/514/
[02:48:08] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51)
[02:50:44] PROBLEM - Host deployment-db1 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2734.00 ms
[02:57:27] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 30%, RTA = 5742.75 ms
[03:00:38] (PS1) Krinkle: Enable npm job for Gather [integration/config] - https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589)
[03:03:05] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51)
[03:05:08] (CR) Krinkle: "@Jdlrobson: Let us know when this can be enabled. Right now it's failing with 5 jscs errors." [integration/config] - https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: Krinkle)
[03:09:14] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[03:14:12] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 296.27 ms
[03:20:36] Project beta-update-databases-eqiad build #9109: FAILURE in 36 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9109/
[03:21:37] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3586.25 ms
[03:21:51] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187)
[03:22:00] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 108.56 ms
[03:22:44] PROBLEM - SSH on deployment-elastic07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:44] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[03:27:00] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 68.73 ms
[03:27:40] RECOVERY - SSH on deployment-elastic07 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[03:28:05] Continuous-Integration, operations, Puppet: Puppet (silently) fails to setup apache on new trusty instances - https://phabricator.wikimedia.org/T91832#1230073 (Krinkle)
[03:28:43] Browser-Tests, Puppet: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1230074 (Krinkle)
[03:36:27] Browser-Tests, Puppet: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1230094 (Krinkle) Open>Resolved a:Krinkle Haven't seen this error in the 2 instance re-creation sprints. Works for me.
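
(An aside on the tmpfs task logged at 00:11:18: a minimal sketch of what the T96230 switch could look like on a CI slave. The mount point, size, and commands below are illustrative assumptions, not values taken from the task itself; CI databases are throwaway, so losing the datadir on reboot is acceptable.)

    # /etc/fstab: keep the disposable CI MySQL datadir in RAM
    tmpfs  /var/lib/mysql  tmpfs  rw,size=1G,mode=0700  0  0

    # roughly: stop MySQL, mount tmpfs over the datadir, re-init, restart
    sudo service mysql stop
    sudo mount /var/lib/mysql && sudo chown mysql:mysql /var/lib/mysql
    sudo mysql_install_db --user=mysql
    sudo service mysql start
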
[03:44:39] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:01:36] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2222.41 ms
[04:03:12] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 0%, RTA = 690.79 ms
[04:07:06] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #422: FAILURE in 5.5 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/422/
[04:07:26] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3510.59 ms
[04:12:20] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 381.31 ms
[04:15:35] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #417: FAILURE in 23 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/417/
[04:21:58] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:50] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28266 bytes in 0.487 second response time
[04:30:00] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2526.07 ms
[04:30:27] (CR) Krinkle: [C: 2] Remove lint-js from test/gate pipeline where npm runs [integration/config] - https://gerrit.wikimedia.org/r/206044 (owner: Krinkle)
[04:31:55] (PS1) Krinkle: Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547)
[04:32:02] (Merged) jenkins-bot: Remove lint-js from test/gate pipeline where npm runs [integration/config] - https://gerrit.wikimedia.org/r/206044 (owner: Krinkle)
[04:32:08] (PS2) Krinkle: Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547)
[04:32:14] (CR) Krinkle: [C: 2] Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547) (owner: Krinkle)
[04:33:12] PROBLEM - Host deployment-cache-upload02 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 3690.88 ms
[04:34:12] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[04:34:24] (Merged) jenkins-bot: Enable npm job for Cite and BetaFeatures [integration/config] - https://gerrit.wikimedia.org/r/206072 (https://phabricator.wikimedia.org/T94547) (owner: Krinkle)
[04:34:41] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[04:35:29] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/206044 and https://gerrit.wikimedia.org/r/206072
[04:35:34] Logged the message, Master
[04:51:33] FLAPPINGSTOP - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms
[04:51:37] Continuous-Integration: /var/lib/mysql/ filling up on old Precise slaves due to mysql usage - https://phabricator.wikimedia.org/T94138#1230223 (Krinkle) Open>declined a:Krinkle
[04:56:00] Continuous-Integration, Labs: Create an instance image like m1.small with 2 CPUs and 30GB space - https://phabricator.wikimedia.org/T96706#1230309 (Krinkle) Resolved>Open Our goal for 30G space was based on the following estimate: > 10G for system, 10G for git replication and 10G for workspace. How...
[04:56:02] Continuous-Integration: Convert pool from a few large slaves (4X) to more smaller slaves (1X) - https://phabricator.wikimedia.org/T96629#1230311 (Krinkle)
[05:06:47] (CR) Krinkle: Convert 'operations-puppet-doc' job to run on a labs slave (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/204982 (https://phabricator.wikimedia.org/T86659) (owner: Legoktm)
[05:12:21] FLAPPINGSTART - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[05:20:24] Yippee, build fixed!
[05:20:24] Project beta-update-databases-eqiad build #9111: FIXED in 23 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9111/
[05:21:09] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 27.21 ms
[05:36:34] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3219.94 ms
[05:44:16] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #48: FAILURE in 28 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/48/
[05:52:13] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2315.02 ms
[05:57:11] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms
[06:00:04] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 8747.20 ms
[06:02:02] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 85.15 ms
[06:03:39] (PS1) Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687)
[06:05:00] (CR) Krinkle: "Example output:" [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687) (owner: Krinkle)
[06:06:08] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 10.48 ms
[06:08:47] (PS2) Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687)
[06:08:52] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: No route to host
[06:10:11] PROBLEM - SSH on deployment-cache-mobile03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:10:28] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 100%
[06:11:19] !log integration-slave-trusty-1021 stays depooled (see T96629 and T96706)
[06:11:21] Logged the message, Master
[06:11:49] !log Running git-cache-update inside screen on integration-slave-trusty-1021 at /mnt/git
[06:11:52] Logged the message, Master
[06:18:20] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13)
[06:18:56] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28266 bytes in 8.717 second response time
[06:19:14] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #591: FAILURE in 13 sec: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/591/
[06:20:12] RECOVERY - SSH on deployment-cache-mobile03 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[06:20:33] Project beta-update-databases-eqiad build #9112: FAILURE in 32 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9112/
[06:21:26] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[06:21:35] FLAPPINGSTOP - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 249.68 ms
[06:22:51] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[06:23:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13)
[06:23:35] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 304.20 ms
[06:23:45] PROBLEM - SSH on deployment-kafka02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:26:11] PROBLEM - SSH on deployment-cache-mobile03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:29] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:42:51] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[06:43:05] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51)
[06:43:31] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[06:44:13] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 153.39 ms
[06:48:37] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:52:54] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[06:57:51] PROBLEM - Host deployment-mediawiki01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 4423.80 ms
[07:01:03] RECOVERY - Host deployment-mediawiki01 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[07:07:28] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:13:08] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 280.79 ms
[07:18:35] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:20:41] FLAPPINGSTOP - Host deployment-db1 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[07:22:10] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #15: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/15/
[07:26:05] PROBLEM - SSH on deployment-memc02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:27:14] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2709.28 ms
[07:30:58] RECOVERY - SSH on deployment-memc02 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[07:31:06] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 198.72 ms
[07:32:12] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 219.84 ms
[07:37:01] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3552.57 ms
[07:38:33] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 3.789 second response time
[07:38:50] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce-DEBUG build #2: FAILURE in 24 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce-DEBUG/2/
[07:41:06] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 43.66 ms
[07:42:11] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190)
[07:43:31] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.492 second response time
[08:00:19] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[08:10:05] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #578: FAILURE in 4.4 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/578/
[08:12:52] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[08:13:10] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 0%, RTA = 726.89 ms
[08:16:32] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 2516.26 ms
[08:21:38] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 3073.90 ms
[08:22:09] Project browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #560: FAILURE in 2 min 8 sec: https://integration.wikimedia.org/ci/job/browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/560/
[08:23:33] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 176.14 ms
[08:24:41] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156)
[08:30:37] RECOVERY - SSH on deployment-kafka02 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[08:38:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[08:48:13] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 161.23 ms
[08:49:37] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[08:54:37] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[08:56:32] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 3.73 ms
[08:59:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[09:00:23] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 100%
[09:06:37] FLAPPINGSTART - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 128.24 ms
[09:07:11] FLAPPINGSTOP - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[09:12:12] PROBLEM - SSH on deployment-restbase01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:19:36] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:23:22] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 100%
[09:23:22] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused
[09:31:46] Beta-Cluster, VisualEditor: Cannot open any page with VE in Betalabs, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1230772 (Aklapper) p:High>Un...
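
(On the "inside screen" pattern from the 06:11:49 !log entry above: roughly how such a long-running update can be detached from the SSH session. Only the host and /mnt/git come from the log; the exact invocation and log path are assumptions for illustration.)

    # start a detached screen session so the update survives SSH disconnects
    screen -dmS git-cache-update sh -c 'cd /mnt/git && git-cache-update 2>&1 | tee /tmp/git-cache-update.log'
    # reattach later to check on progress
    screen -r git-cache-update
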
[09:34:42] Release-Engineering, Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1230775 (Christopher) Default show Sprint Start and Sprint End date fields on Project Create form is available for test and review on https://phab08.wmflabs.org/project/create...
[09:38:23] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.376 second response time
[09:42:21] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #463: FAILURE in 5 min 20 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/463/
[09:43:02] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[09:49:03] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[09:55:03] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2169.48 ms
[10:02:51] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[10:03:09] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[10:07:20] FLAPPINGSTOP - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 10.39 ms
[10:12:48] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227)
[10:17:20] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3205.42 ms
[10:18:38] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[10:21:52] PROBLEM - Host deployment-sentry2 is DOWN: CRITICAL - Host Unreachable (10.68.17.204)
[10:22:16] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94)
[10:22:54] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2214.64 ms
[10:26:32] Continuous-Integration, Release-Engineering, Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1230866 (Aklapper) Should tasks in Continuous-Integration-Config be automatically part of Continuous-Integration (subproject style), or should these...
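
(The HTTP checks quoted throughout, e.g. "HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.376 second response time" above, can be approximated by hand while triaging. A stand-in probe, not the check's actual definition; the beta URL is an assumed target:)

    # status code, body size and total time, with the same 10-second timeout the checks report
    curl -sS -o /dev/null -m 10 \
         -w 'HTTP %{http_code} - %{size_download} bytes in %{time_total} seconds\n' \
         http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
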
[10:27:10] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190)
[10:27:18] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[10:35:39] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace.root.byte_percentfree (<10.00%) WARN: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%)
[10:35:41] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:38:32] PROBLEM - Host deployment-mediawiki03 is DOWN: PING CRITICAL - Packet loss = 100%
[10:39:42] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[10:41:36] RECOVERY - Host deployment-mediawiki03 is UP: PING OK - Packet loss = 0%, RTA = 5.93 ms
[10:41:52] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 16%, RTA = 308.56 ms
[10:42:46] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[10:58:39] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[11:01:33] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2280.78 ms
[11:03:31] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 359.24 ms
[11:07:12] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[11:08:30] FLAPPINGSTART - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 395.52 ms
[11:12:28] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[11:12:40] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:13:00] Project beta-code-update-eqiad build #52911: FAILURE in 0.21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52911/
[11:20:05] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2063.23 ms
[11:24:36] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[11:27:34] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47225 bytes in 8.847 second response time
[11:34:31] Yippee, build fixed!
[11:34:31] Project beta-code-update-eqiad build #52913: FIXED in 1 min 30 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52913/
[11:37:13] FLAPPINGSTOP - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2793.19 ms
[11:38:36] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:40:22] thcipriani|afk: we still have issue with Beta/deployment-prep?
[11:42:36] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 74.65 ms
[11:44:18] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2179.11 ms
[11:46:16] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[11:49:13] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 447.99 ms
[11:49:55] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce build #226: FAILURE in 1 min 55 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce/226/
[11:53:00] Project beta-code-update-eqiad build #52915: FAILURE in 70 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52915/
[11:57:30] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:00:09] Project browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #87: FAILURE in 3 min 9 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/87/
[12:03:28] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 30%, RTA = 5695.72 ms
[12:07:54] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 5436.45 ms
[12:08:35] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 45382 bytes in 8.287 second response time
[12:09:17] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 57.36 ms
[12:09:37] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2445.42 ms
[12:10:03] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[12:10:21] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2071.84 ms
[12:14:11] PROBLEM - Host deployment-parsoid05 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 3498.98 ms
[12:14:15] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 264.04 ms
[12:14:18] PROBLEM - Host deployment-cache-upload02 is DOWN: PING CRITICAL - Packet loss = 25%, RTA = 2811.37 ms
[12:14:20] Yippee, build fixed!
[12:14:20] Project beta-code-update-eqiad build #52917: FIXED in 1 min 19 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52917/
[12:16:20] FLAPPINGSTART - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3923.35 ms
[12:18:11] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[12:21:01] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:24:15] PROBLEM - SSH on deployment-elastic06 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:24:25] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 4770.32 ms
[12:29:06] RECOVERY - SSH on deployment-elastic06 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[12:33:00] Project beta-code-update-eqiad build #52919: FAILURE in 69 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52919/
[12:34:16] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[12:35:14] PROBLEM - SSH on deployment-elastic06 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:40:07] RECOVERY - SSH on deployment-elastic06 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[12:43:43] PROBLEM - SSH on deployment-elastic07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:43:43] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[12:46:37] Yippee, build fixed!
[12:46:37] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1823: FIXED in 36 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1823/
[12:48:33] RECOVERY - SSH on deployment-elastic07 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[12:48:54] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #231: FAILURE in 1 min 53 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/231/
[12:54:28] Yippee, build fixed!
[12:54:29] Project beta-code-update-eqiad build #52921: FIXED in 1 min 28 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52921/
[12:55:21] Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #450: FAILURE in 1 min 21 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/450/
[12:56:34] PROBLEM - Host deployment-eventlogging02 is DOWN: CRITICAL - Host Unreachable (10.68.16.52)
[12:59:45] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 343.94 ms
[13:03:49] PROBLEM - Host deployment-elastic06 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 2678.28 ms
[13:04:45] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 185.76 ms
[13:06:51] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[13:08:12] FLAPPINGSTOP - Host deployment-memc02 is UP: PING WARNING - Packet loss = 16%, RTA = 1819.97 ms
[13:09:44] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3939.57 ms
[13:10:01] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 16%, RTA = 333.55 ms
[13:10:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[13:13:00] Project beta-code-update-eqiad build #52923: FAILURE in 66 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52923/
[13:13:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[13:18:11] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[13:19:59] FLAPPINGSTART - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[13:23:11] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14)
[13:23:41] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:25:21] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING WARNING - Packet loss = 0%, RTA = 847.67 ms
[13:26:35] FLAPPINGSTART - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 233.65 ms
[13:33:28] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[13:34:38] Yippee, build fixed!
[13:34:39] Project beta-code-update-eqiad build #52925: FIXED in 1 min 38 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52925/
[13:39:40] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:40:36] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:40:45] PROBLEM - SSH on deployment-elastic07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:05] PROBLEM - Host deployment-videoscaler01 is DOWN: PING CRITICAL - Packet loss = 40%, RTA = 2988.20 ms
[13:45:57] RECOVERY - Host deployment-videoscaler01 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[13:46:29] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[13:48:34] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 336.35 ms
[13:52:36] (CR) Jforrester: [C: 1] Enable npm job for Gather [integration/config] - https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: Krinkle)
[13:53:01] Project beta-code-update-eqiad build #52927: FAILURE in 0.52 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52927/
[13:54:52] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[13:58:28] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:02:32] kart_: FWIW, I pinged some folks in -labs, not sure what's happening yet, could be ldap stuff, could be weird network strain.
[14:03:22] I'm off to do morning-type things, but I'll be idle here
[14:06:59] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[14:09:05] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #295: FAILURE in 6.8 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/295/
[14:09:39] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:14:13] Yippee, build fixed!
[14:14:13] Project beta-code-update-eqiad build #52929: FIXED in 1 min 12 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52929/
[14:16:31] FLAPPINGSTOP - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[14:22:11] FLAPPINGSTOP - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 419.76 ms
[14:23:19] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[14:26:55] PROBLEM - Host deployment-zookeeper01 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2725.34 ms
[14:28:29] PROBLEM - Host deployment-sentry2 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2545.83 ms
[14:31:05] FLAPPINGSTOP - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 78.41 ms
[14:32:19] PROBLEM - Host deployment-parsoid05 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:00] Project beta-code-update-eqiad build #52931: FAILURE in 62 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52931/
[14:36:49] RECOVERY - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 31.49 ms
[14:36:56] RECOVERY - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[14:38:04] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:43:52] Beta-Cluster: Cannot open pages in Beta Cluster, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1231265 (Jdforrester-WMF)
[14:45:36] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:45:56] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 2103.36 ms
[14:48:17] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:50:55] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[14:52:29] andrewbogott: have you been able to narrow down the network issues at all?
[14:52:35] (if it is/was network issues)
[14:53:55] Yippee, build fixed!
[14:53:56] Project beta-code-update-eqiad build #52933: FIXED in 55 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52933/
[14:56:05] thcipriani: not really, although I’m still poking.
[14:56:09] Are there actual results from the issue or just monitoring complaints?
[14:56:55] you mean what is the overall impact of this?
[14:57:42] yeah — any actual user consequences?
[14:58:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:58:52] starting to see tickets come in about it: https://phabricator.wikimedia.org/T96905
[14:59:05] I don't think anyone can't do their job, though, afaik
[14:59:39] oh — ok, that counts as a user consequence :)
[15:00:28] there was a long gap from about 10pm pst to about 6AM pst when I wasn’t migrating anything. It sounds like the issue was still occurring then, right?
[15:02:12] judging from scrollback it's been fairly continual since about 7pm pdt-ish
[15:02:21] continually intermittent
[15:02:23] he
[15:06:43] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 11%, RTA = 4710.61 ms
[15:06:55] FLAPPINGSTART - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[15:09:28] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #203: STILL FAILING in 47 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/203/
[15:12:45] Hi Reedy... how's it going? are you the one to ping for a beta cluster issue?
[15:12:54] Beta-Cluster, Labs: Migrate deployment-prep to new labvirt hosts - https://phabricator.wikimedia.org/T96678#1231284 (Andrew) Open>Resolved This is done!
[15:13:00] Project beta-code-update-eqiad build #52935: FAILURE in 67 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52935/
[15:14:22] AndyRussG: this is the right place but Reedy is probably not the right person. Probably best if you just describe the issue so that anyone here can follow up.
[15:14:38] (Reedy is, I believe, now studying to be a pilot.)
[15:15:16] I should probably remove his +v so he doesn't get pinged
[15:15:38] andrewbogott: ah cool thanks! Yeah besides the fact that I'm getting lots of server errors, I forgot my password for my WMF account there and didn't register an e-mail address (silly me ;P )
[15:16:18] Uhh and in addition (if I'm not already, might already be) I should be a CentralNotice admin there...
[15:16:40] define the server errors
[15:16:40] Does the ‘lots of server errors’ issue resemble this one? https://phabricator.wikimedia.org/T96905
[15:16:50] bah, nevermind
[15:16:58] Something odd is happening to the CentralNotice campaigns running on the beta cluster, and our browser tests depend on them
[15:17:20] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 6683.87 ms
[15:17:43] andrewbogott: yeah lots of those intermittently
[15:18:02] But sometimes the pages load fine (though even slower than usual it seems)
[15:18:15] Beta-Cluster: Can't connect to Beta Cluster database - https://phabricator.wikimedia.org/T96905#1231300 (greg)
[15:18:54] thcipriani: is ^^ due to the restarts yesterday?
[15:19:17] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #31: FAILURE in 1 min 16 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/31/
[15:19:44] I think it's due to the issues we've been seeing since about 7pm last night with the whole beta cluster intermittently
[15:19:57] greg-g: in theory I’m investigating that issue, but I’m starting to think it has nothing to do with me :(
[15:20:15] thcipriani: any leads?
[15:20:25] bug for that?
[15:20:41] not yet, filing now
[15:20:51] and no, no leads yet :(
[15:20:51] * greg-g was in migraine-ville yesterday after 3 or so, may have missed much
[15:20:54] :(
[15:21:46] http://shinken.wmflabs.org/problems?search=hg:deployment-prep looks not happy
[15:22:18] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231310 (thcipriani) NEW
[15:22:43] greg-g: yeah, the problem is, nothing is failing consistently
[15:22:43] greg-g: andrewbogott: so... who should I ask to reset my beta cluster account?
[15:22:48] and beta scap still hasn't run successfully since 2 days 10 hr ago
[15:22:59] AndyRussG: I don't know.
[15:23:11] AndyRussG: right now? no one in this room, we're trying to get the beta cluster working again
[15:23:23] greg-g: :((
[15:23:27] need help?
[15:23:27] legoktm might be able to help
[15:23:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0]
[15:23:35] bd808: :( maybe
[15:23:50] greg-g: K thanks sorry to interrupt ;/
[15:24:35] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[15:25:00] only thing that happened around that time in the SAL was 00:29 bd808: cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205969/ (logstash: Convert $::realm switches to hiera)
[15:25:29] Related: I unstuck puppet replication just before that
[15:25:40] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231321 (greg)
[15:25:42] but it was a good hour before anything started erroring out
[15:25:50] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231325 (greg) p:Triage>Unbreak!
[15:26:39] FLAPPINGSTART - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms
[15:26:40] Puppet on deployment-salt had been in a detached head state for some unknown time
[15:27:01] But that should have changed things sooner than an hour
[15:27:31] PROBLEM - SSH on deployment-test is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:40] right, it seems like the network is just super slow and there aren't actually any problems on the boxes
[15:28:06] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #512: FAILURE in 6.1 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/512/
[15:28:16] I can log into the boxes with problems, the load is fine, ssh is super slow, the stuff in syslog is complaining about an ldap group
[15:28:34] nslcd[1270]: [b9698b] error writing to client: Broken pipe
[15:28:45] thcipriani: it would help me out if you can detect any sort of pattern — like, is the network slower for some instances than others, and if so, which ones?
[15:28:46] that's known noise
[15:29:43] the nslcd log line is related to an ldap group that is too large to fit in a packet. We've had that all over labs for a very long time
[15:30:03] andrewbogott: deployment-parsoid05 was certainly the worst one I logged into this morning
[15:31:06] thcipriani: btw, if this takes a while, don't worry about our 1:1
[15:31:09] ok, that one’s on labvirt1005
[15:31:18] any others?
[15:32:45] iirc both of the deployment-db{1,2} were sticky as well
[15:32:50] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING WARNING - Packet loss = 0%, RTA = 533.97 ms
[15:33:06] oh and that one ^
[15:33:15] jebus
[15:33:34] (re the RTA)
[15:34:18] Yippee, build fixed!
[15:34:18] Project beta-code-update-eqiad build #52937: FIXED in 1 min 17 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52937/
[15:34:56] whoa, nice
[15:35:50] thcipriani: restbase01 is also on labvirt1005. The db hosts are not.
[15:37:04] I was just on db hosts, seem fine now, could have been mis-remembering, restbase and parsoid are definitely still very very slow
[15:37:39] PROBLEM - Host deployment-parsoid05 is DOWN: PING CRITICAL - Packet loss = 100%
[15:38:00] what other vms are on 1005?
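
(The instance-to-hypervisor mapping andrewbogott is reading off here can be pulled straight from OpenStack. A rough sketch with the nova CLI of the era, assuming admin credentials; the instance and host names come from the log, the exact output format varies by release:)

    # which compute node hosts a given instance (an admin-only attribute)
    nova show deployment-parsoid05 | grep 'OS-EXT-SRV-ATTR:host'
    # the reverse: list every VM on a suspect compute node
    nova hypervisor-servers labvirt1005
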
[15:38:49] Continuous-Integration, Continuous-Integration-Isolation, operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1231353 (mark) a:mark>faidon
[15:38:55] Hm… cvn-app4 is there and it is BUSY
[15:40:20] * greg-g wants dedicated virt hardware for beta cluster :/
[15:42:34] bye bye qa-morebots
[15:44:18] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 3099.60 ms
[15:49:15] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 393.48 ms
[15:50:57] PROBLEM - Host deployment-videoscaler01 is DOWN: CRITICAL - Host Unreachable (10.68.16.211)
[15:53:00] Project beta-code-update-eqiad build #52939: FAILURE in 58 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52939/
[15:54:14] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 0%, RTA = 661.74 ms
[15:54:57] deployment-restbase01 just gave me a kernel:[2042464.137735] BUG: soft lockup - CPU#2 stuck for 25s!
[15:55:53] is the compute node those vms are on dying?
[15:55:58] RECOVERY - Host deployment-videoscaler01 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[15:58:07] judging from icinga alerts in -operations labvirt1005 and labvirt1006 aren't thrilled today
[16:00:12] load on deployment-logstash1 is crazy. 12+
[16:00:29] PROBLEM - Host deployment-mediawiki03 is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:40] thcipriani: I’m moving that cvn instance to a new host. We’ll see if that improves things or at least alters the pattern
[16:00:49] kk
[16:01:15] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186)
[16:01:53] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[16:03:04] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:19] FLAPPINGSTOP - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12)
[16:09:13] FLAPPINGSTOP - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:19] Project beta-parsoid-update-eqiad build #937: FAILURE in 0.11 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/937/
[16:09:59] PROBLEM - Host deployment-memc04 is DOWN: CRITICAL - Host Unreachable (10.68.17.69)
[16:11:29] FLAPPINGSTOP - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24)
[16:12:20] FLAPPINGSTOP - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 100%
[16:14:21] Yippee, build fixed!
[16:14:22] Project beta-code-update-eqiad build #52941: FIXED in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52941/
[16:20:00] RECOVERY - Host deployment-memc04 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[16:21:50] FLAPPINGSTOP - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[16:33:00] Project beta-code-update-eqiad build #52943: FAILURE in 71 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52943/
[16:33:59] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:35:38] ^d: We need to split up InitialiseSettings. It's waaay too big
[16:35:56] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2599.41 ms
[16:36:19] <^d> bd808: Finish the config project and move it into a real storage instead of a PHP file :p
[16:36:30] +2
[16:36:47] If only legoktm hadn't been sent to flow land :(
[16:38:24] Continuous-Integration: Disable html format for xdebug/var_dump for Apache on Jenkins slaves - https://phabricator.wikimedia.org/T97040#1231531 (Krinkle) NEW
[16:41:01] <^d> bd808: Where does general architecture cleanups unrelated to reading/editing/performance/security get handled now?
[16:41:17] "magic elves"
[16:41:18] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[16:41:30] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[16:41:33] <^d> bd808: Ah, I missed those in the new org chart ;-)
[16:41:54] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[16:42:16] <^d> bd808: You could argue almost any of those groups take it
[16:42:19] I should edit the chart and add some marker for all the designated magic elves. I'm pretty sure I know who they are
[16:42:20] <^d> And argue they shouldn't :)
[16:42:42] RECOVERY - Host deployment-mediawiki03 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[16:43:08] Continuous-Integration, Patch-For-Review: Disable xdebug's html formatting of PHP errors for Apache on Jenkins slaves - https://phabricator.wikimedia.org/T97040#1231569 (Krinkle)
[16:43:17] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 353.07 ms
[16:43:22] <^d> bd808: I found 3 the other day. cross-ref my e-mail about ES ownership ;-)
[16:44:00] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[16:44:18] I think the current high level thinking is that (a) people will do these things because they feel compelled; (b) debt should only be cleaned up in connection with "real" work; and (c) "oops didn't think of that because I never did it"
[16:45:03] <^d> (a) happened before the reorg too, so I guess I can see that continuing
[16:45:54] imma gonna do the things I see that need doing until somebody physically stops me
[16:46:11] <^d> "What do you do?"
[16:46:14] <^d> "What needs to be done"
[16:46:21] <^d> "Ah, carry on!"
[16:46:23] *slides on sunglasses*
[16:46:52] thcipriani: Are things behaving now, at least for the moment?
[16:47:17] Somebody has to be ReedyBot until he gets tired of having a life with dreams and stuff
[16:47:36] !quip
[16:48:00] <^d> bd808: I've tried, I seem to require more sleep/food though
[16:48:17] *nod* I'm too old to do it properly too
[16:48:24] RECOVERY - SSH on deployment-test is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[16:48:41] FLAPPINGSTART - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187)
[16:48:49] my brain doesn't work fast enough to keep up with the smart kids
[16:49:01] and my hands seize up
[16:49:39] We should specifically recruit for hyperactive insomniacs ;)
[16:50:20] http://test2.wikipedia.org/wiki/Special:RecentChanges
[16:50:23] so many questions
[16:50:24] <^d> "Do you suffer from episodes of mania that prevent you from sleeping for days on end?"
[16:50:26] is this used?
[16:50:29] seeking coder who loves untangling messes and requires little sleep with access to free concert tickets
[16:51:05] mutante: test2 is for people to mess with semi-randomly
[16:51:28] <^d> test2 exists because we needed a way to test multiversion when test.wp was even more hacky than it is
[16:51:35] <^d> Now it's just a playground
[16:51:37] <^d> Runs group0
[16:51:53] FLAPPINGSTOP - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 261.54 ms
[16:52:14] test.wp.o is used in the automation tests and used to get you punched for messing up
[16:52:26] ok guys, so then i'm not using "delete test2 from DNS" as the random edit i need to make
[16:52:45] because if i dont edit the wikipedia.org zone file templates, the new "gom" language won't get added
[16:52:48] heh
[16:53:57] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28282 bytes in 7.221 second response time
[16:54:20] Yippee, build fixed!
[16:54:20] Project beta-code-update-eqiad build #52945: FIXED in 1 min 19 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52945/
[16:54:35] RECOVERY - SSH on deployment-elastic07 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[16:55:21] quality.wikipedia.org - This wiki has been closed and its content has been moved to meta
[16:55:41] <^d> Heh, qualitywiki
[16:55:47] (i'm just looking at the DNS template, what is not a language etc)
[16:58:30] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:58:42] is there a master bug / tracking project for the beta issues?
[16:59:16] (PS3) Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687)
[16:59:16] tgr: https://phabricator.wikimedia.org/tag/beta-cluster/
[16:59:56] tgr https://phabricator.wikimedia.org/T97033 is ongoing
[17:00:26] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:00:52] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[17:01:21] PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:02:05] oh, what now?
[17:02:19] andrewbogott: things did seem better there for a few
[17:03:16] Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231643 (Tgr) Could be the cause for {T97047} (which does not seem to be intermittent though).
[17:03:42] now just seems like parsoidcache02 afai can tell
[17:04:28] PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[17:05:18] PROBLEM - Puppet failure on integration-slave-trusty-1017 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[17:07:52] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:07:53] deployment-cache-mobile03 doesn't seem to be doing too well either :\
[17:09:08] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 129.59 ms
[17:13:00] Project beta-code-update-eqiad build #52947: FAILURE in 57 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52947/
[17:14:29] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:15:59] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 6477.80 ms
[17:16:49] PROBLEM - Puppet failure on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[17:17:17] thcipriani: mind if I reboot deployment-parsoidcache02 just to see if it settles down? There’s nothing wrong in the infrastructure anymore, as far as I can tell.
[17:17:47] andrewbogott: sure should be fine
[17:17:48] PROBLEM - Puppet failure on integration-slave-trusty-1015 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:18:07] * andrewbogott grasps a straw, clings for dear life
[17:19:45] andrewbogott: thanks for taking the time on this today—finding my way from an individual instance to the libvirt host a bit opaque to me :)
[17:20:01] FLAPPINGSTOP - Host deployment-mx is UP: PING WARNING - Packet loss = 0%, RTA = 707.18 ms
[17:20:04] thcipriani: it doesn’t help that the instance pages on wikitech are out of date :(
[17:20:33] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 45372 bytes in 6.615 second response time
[17:21:17] FLAPPINGSTOP - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134)
[17:21:17] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[17:24:16] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[17:29:04] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:29:30] RECOVERY - Puppet failure on integration-zuul-packaged is OK: OK: Less than 1.00% above the threshold [0.0]
[17:29:43] thcipriani: shinken hasn’t caught up yet, but I can see that deployment-parsoidcache02 is up and reachable now. Does it look to you like things are mostly working?
[17:30:07] * thcipriani looks
[17:32:54] andrewbogott: yeah, varnish looks up, seems like all that should be running
[17:33:10] seems really responsive, so that's nice, too
[17:33:33] ok. labvirt1006 still looks a bit off to me, but I’ll keep an eye on it for a bit, see if it calms down.
[17:33:44] Meanwhile I’m going to step away for a bit.
[17:34:13] Yippee, build fixed!
[17:34:13] Project beta-code-update-eqiad build #52949: FIXED in 1 min 12 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52949/ [17:34:20] PROBLEM - Host deployment-cache-upload02 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 4527.02 ms [17:34:51] andrewbogott: kk, yeah, stuff is definitely still wacky :\ [17:35:00] dammit, that one isn’t on 1006 [17:35:47] what about deployment-logstash1? that machine is... unhappy :( [17:36:18] that may just be excess load though... [17:37:35] it’s on 1006 [17:38:10] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [17:39:34] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [17:43:10] FLAPPINGSTOP - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [17:45:22] (03PS4) 10Krinkle: [WIP] Implement git-cache-update script [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206074 (https://phabricator.wikimedia.org/T96687) [17:49:15] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [17:53:00] Project beta-code-update-eqiad build #52951: FAILURE in 62 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52951/ [17:55:30] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:58:11] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #518: FAILURE in 11 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/518/ [18:02:21] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [18:03:08] 10Beta-Cluster, 10Continuous-Integration, 6Release-Engineering, 10Parsoid: Parsoid patches don't update Beta Cluster automatically -- only deploy repo patches seem to update that code - https://phabricator.wikimedia.org/T92871#1231878 (10Mattflaschen) I don't know whether it's updating at deploy-time for F... [18:03:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [18:04:11] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [18:05:22] RECOVERY - Puppet failure on integration-slave-trusty-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [18:05:24] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [18:05:25] thcipriani: ok, one more big interruption — rebooting labvirt1005 fixed a bunch of things, going to see if the same is true of 1006 [18:06:13] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:18] RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:49] RECOVERY - Puppet failure on integration-slave-trusty-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:58] andrewbogott: awesome. [18:07:48] RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK: OK: Less than 1.00% above the threshold [0.0] [18:08:50] (03CR) 10Jdlrobson: [C: 031] "Timo you're a rock star. I'd merge if I could but I only have +1 here."
[integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:09:09] 10Continuous-Integration, 10Gather, 3Gather Sprint Forward, 5Patch-For-Review: Gather should be using its own Gruntfile in Jenkins - https://phabricator.wikimedia.org/T92589#1231886 (10Jdlrobson) [18:09:29] RECOVERY - Puppet failure on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [18:10:07] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [18:10:07] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [18:10:19] PROBLEM - Host deployment-upload is DOWN: CRITICAL - Host Unreachable (10.68.16.189) [18:10:51] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [18:12:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [18:12:43] (03CR) 10Krinkle: [C: 032] Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:12:46] (03PS2) 10Krinkle: Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) [18:12:51] (03CR) 10Krinkle: [C: 032] Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:13:36] FLAPPINGSTOP - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [18:14:08] Yippee, build fixed! [18:14:08] Project beta-code-update-eqiad build #52953: FIXED in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52953/ [18:14:15] RECOVERY - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [18:14:29] (03Merged) 10jenkins-bot: Enable npm job for Gather [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:14:36] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [18:15:06] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [18:15:19] !log Deploying Zuul config https://gerrit.wikimedia.org/r/206068 [18:16:55] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [18:16:59] Krinkle, can i get you to review a gerrit-wm patch to remove duplicate messages in #parsoid channel? [18:17:15] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [18:17:16] grrrit-wm [18:17:21] oh, you are busy. [18:17:49] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:09] FLAPPINGSTART - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [18:19:06] RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:36] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [18:19:52] subbu: I don't know gerrit-wm well enough. And kind of busy at the moment. [18:20:05] np [18:20:27] Yippee, build fixed! 
[18:20:28] Project beta-parsoid-update-eqiad build #939: FIXED in 40 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/939/ [18:21:10] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [18:21:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [18:21:47] PROBLEM - Host deployment-zookeeper01 is DOWN: CRITICAL - Host Unreachable (10.68.17.157) [18:25:01] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2743.35 ms [18:26:19] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [18:26:33] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [18:26:49] RECOVERY - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [18:29:39] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [18:30:08] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #295: FAILURE in 7.5 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/295/ [18:30:10] thcipriani: I’m going to lunch now — we’ll see how much shinken chatter piles up while I’m gone. [18:30:33] RECOVERY - Puppet failure on integration-zuul-packaged is OK: OK: Less than 1.00% above the threshold [0.0] [18:33:00] Project beta-code-update-eqiad build #52955: FAILURE in 0.15 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52955/ [18:38:17] (03CR) 10Jdlrobson: "thank you!" [integration/config] - 10https://gerrit.wikimedia.org/r/206068 (https://phabricator.wikimedia.org/T92589) (owner: 10Krinkle) [18:39:13] FLAPPINGSTOP - Host deployment-cache-upload02 is UP: PING WARNING - Packet loss = 16%, RTA = 898.67 ms [18:41:37] <^d> mobrovac, twentyafterfour: Afterthought. On "statuses" -- I can think of 3 distinct statuses a node could have off the top of my head. [18:41:44] <^d> 1) Is there at least a process running [18:41:51] <^d> 2) Is it responding to a connection [18:42:11] <^d> 3) Is it actually returning something useful and non-error when you do connect? [18:42:47] yes, where "non-error" is user-defined (a 404 might be needed/wanted e.g.) [18:42:59] <^d> *nod* [18:43:44] also, firewalls and stuff play a part here, so in extreme cases "responding to outside conns" may differ from "responding to conns" [18:43:53] <^d> non-error is a bad phrase.
"responding as expected to a nominal query" [18:43:55] (these being, misconfig, etc) [18:44:29] ^d: exactly, something like assert(res.status == whatever_i_need_it_to_be) [18:45:08] <^d> Yep [18:45:11] (or res.body or headers or whatever) [18:45:42] <^d> In Elastic's case that's basically returning 200 for GET / :) [18:46:27] 10Deployment-Systems: Come up with an abstract deployment model that roughly addresses the needs of existing projects - https://phabricator.wikimedia.org/T97068#1232048 (10mmodell) 3NEW a:3mmodell [18:46:27] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1824: FAILURE in 26 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1824/ [18:46:34] that's easy enough to fit in any model :) [18:47:33] <^d> twentyafterfour: It should have 7 layers like any good model :P [18:48:06] <^d> Or maybe I mean salad [18:48:32] <^d> Now I want a 7 layer salad [18:48:48] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:54:24] Yippee, build fixed! [18:54:25] Project beta-code-update-eqiad build #52957: FIXED in 1 min 24 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52957/ [18:57:12] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [18:58:14] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:02:05] 10Beta-Cluster: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1232111 (10Jdlrobson) [19:02:06] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1232110 (10Jdlrobson) [19:02:10] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 255.42 ms [19:02:46] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1156443 (10Jdlrobson) @hashar remaining 2 test failures are due to T96920 and the required fron... [19:09:48] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3767.94 ms [19:10:02] FLAPPINGSTART - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 353.57 ms [19:13:00] Project beta-code-update-eqiad build #52959: FAILURE in 70 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52959/ [19:13:52] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [19:14:40] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3913.17 ms [19:19:29] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [19:19:43] Hallo. [19:20:11] Which ICU version is running on the Wikimedia production cluster? [19:22:49] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1232188 (10KLans_WMF) @Christopher Default show Sprint Start and Sprint End date fields on Project Create form is a great enhancement. I'm hoping this will save @AKlapper some h... 
[19:25:52] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [19:26:02] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1232209 (10KLans_WMF) I am remiss in not saying thank you @Christopher for your work here :-) [19:28:16] RECOVERY - Puppet failure on integration-slave-precise-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [19:34:11] Yippee, build fixed! [19:34:11] Project beta-code-update-eqiad build #52961: FIXED in 1 min 10 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52961/ [19:36:29] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 3949.59 ms [19:37:27] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 36%, RTA = 4572.32 ms [19:41:19] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [19:44:39] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2847.72 ms [19:45:01] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [19:53:00] Project beta-code-update-eqiad build #52963: FAILURE in 57 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52963/ [20:00:03] FLAPPINGSTOP - Host deployment-kafka02 is UP: PING WARNING - Packet loss = 0%, RTA = 658.45 ms [20:02:55] FLAPPINGSTART - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [20:04:37] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [20:06:01] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 2783.89 ms [20:07:21] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3804.36 ms [20:14:09] Yippee, build fixed!
[20:14:10] Project beta-code-update-eqiad build #52965: FIXED in 1 min 9 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52965/ [20:15:55] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:16:05] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [20:17:17] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [20:17:21] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3235.82 ms [20:26:36] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [20:31:35] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [20:33:00] Project beta-code-update-eqiad build #52967: FAILURE in 0.31 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52967/ [20:33:13] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 25%, RTA = 3542.82 ms [20:37:55] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING WARNING - Packet loss = 0%, RTA = 1384.07 ms [20:38:29] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [20:41:27] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [20:41:33] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [20:41:54] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [20:42:20] FLAPPINGSTART - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 94.68 ms [20:51:19] FLAPPINGSTART - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 93.59 ms [20:52:45] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:54:18] Yippee, build fixed! [20:54:19] Project beta-code-update-eqiad build #52969: FIXED in 1 min 18 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52969/ [21:01:22] PROBLEM - Host deployment-redis01 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4835.36 ms [21:01:38] PROBLEM - Host integration-saltmaster is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2415.78 ms [21:06:14] RECOVERY - Host deployment-redis01 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [21:09:33] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1232526 (10Christopher) @KLans_WMF thank you kindly for your help. I added a few more options to relieve the aggravation of the "mandatory" start and end dates. Rather than... [21:13:00] Project beta-code-update-eqiad build #52971: FAILURE in 60 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52971/ [21:17:48] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0] [21:22:52] FLAPPINGSTART - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 379.06 ms [21:32:45] 6Release-Engineering, 7Jenkins: Run the git whitespace checker as part of CirrusSearch V+2 - https://phabricator.wikimedia.org/T97086#1232648 (10EBernhardson) 3NEW [21:34:43] Yippee, build fixed! 
[21:34:44] Project beta-code-update-eqiad build #52973: FIXED in 1 min 43 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52973/ [21:36:36] FLAPPINGSTART - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 90.97 ms [21:42:20] FLAPPINGSTOP - Host deployment-db2 is UP: PING WARNING - Packet loss = 0%, RTA = 944.12 ms [21:51:20] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 12.81 ms [21:53:00] Project beta-code-update-eqiad build #52975: FAILURE in 68 ms: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52975/ [21:53:34] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3573.36 ms [21:56:29] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 457.15 ms [21:58:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [22:04:26] if anyone happens to randomly know ... i'm following https://www.mediawiki.org/wiki/Continuous_integration/Jenkins_job_builder from a new laptop, but just getting a "KeyError: 'numToKeep'" exception. pastie here: https://phabricator.wikimedia.org/P550 [22:05:58] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3422.71 ms [22:06:17] (03PS1) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:06:46] (03CR) 10EBernhardson: [C: 04-1] "untested, just checking jenkins output here" [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) (owner: 10EBernhardson) [22:07:56] (03CR) 10jenkins-bot: [V: 04-1] Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) (owner: 10EBernhardson) [22:10:45] (03PS2) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:12:15] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2439.16 ms [22:15:33] Yippee, build fixed! [22:15:33] Project beta-code-update-eqiad build #52977: FIXED in 2 min 32 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52977/ [22:20:50] ^d: i noticed that you can approve new git repos. 
:) i just submitted a request for integration/raita (the browser test dashboard) [22:20:51] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [22:30:54] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [22:30:56] FLAPPINGSTART - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 180.85 ms [22:31:28] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [22:32:14] (03PS3) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:32:54] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 115.24 ms [22:33:32] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [22:35:45] (03PS4) 10EBernhardson: Run the whitespace checker for CirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) [22:40:44] (03CR) 10EBernhardson: "This patch is mostly guesswork, based on mediawiki-core-whitespaces job. would appreciate review. Also i understand it wont be voting, ba" [integration/config] - 10https://gerrit.wikimedia.org/r/206298 (https://phabricator.wikimedia.org/T97086) (owner: 10EBernhardson) [22:47:07] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2655.32 ms [22:51:08] PROBLEM - Host deployment-bastion is DOWN: CRITICAL - Host Unreachable (10.68.16.58) [22:51:28] PROBLEM - Host deployment-lucid-salt is DOWN: CRITICAL - Host Unreachable (10.68.17.49) [22:51:54] PROBLEM - Host deployment-cache-text02 is DOWN: CRITICAL - Host Unreachable (10.68.16.16) [22:52:02] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 351.68 ms [22:52:20] PROBLEM - Host deployment-salt is DOWN: CRITICAL - Host Unreachable (10.68.16.99) [22:52:32] PROBLEM - Host deployment-urldownloader is DOWN: CRITICAL - Host Unreachable (10.68.16.135) [22:53:15] PROBLEM - Host deployment-elastic08 is DOWN: CRITICAL - Host Unreachable (10.68.17.188) [22:53:17] PROBLEM - Host deployment-pdf01 is DOWN: CRITICAL - Host Unreachable (10.68.16.73) [22:53:27] PROBLEM - Host deployment-rsync01 is DOWN: CRITICAL - Host Unreachable (10.68.17.66) [22:53:37] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 40%, RTA = 7044.15 ms [22:54:05] PROBLEM - Host deployment-memc03 is DOWN: CRITICAL - Host Unreachable (10.68.16.15) [22:54:21] PROBLEM - Host Generic Beta Cluster is DOWN: CRITICAL - Host Unreachable (en.wikipedia.beta.wmflabs.org) [22:56:31] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [22:58:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [22:59:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [23:01:17] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:01:27] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [23:02:09] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:02:19] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:02:53] PROBLEM 
- Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:03:01] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:04:22] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:04:30] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:06:40] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:06:54] Project beta-code-update-eqiad build #52981: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52981/ [23:07:08] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:09:50] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:11:00] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:11:56] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:12:22] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:12:25] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [23:12:25] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:12:35] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:13:19] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:13:43] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [23:13:45] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:13:57] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:13:57] PROBLEM - Puppet failure on deployment-test is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:14:09] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:14:29] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:15:30] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:15:52] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [23:17:04] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:17:10] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:29] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:20:09] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:20:11] FLAPPINGSTOP - Host 
deployment-kafka02 is UP: PING WARNING - Packet loss = 73%, RTA = 121.96 ms [23:22:05] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 45152 bytes in 6.514 second response time [23:23:19] RECOVERY - Host deployment-memc03 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [23:23:39] RECOVERY - Host deployment-bastion is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:23:47] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:22] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [23:24:25] RECOVERY - Host deployment-rsync01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:25:02] RECOVERY - Host deployment-pdf01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:25:04] RECOVERY - Host deployment-elastic08 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [23:25:14] RECOVERY - Host deployment-cache-text02 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [23:25:22] RECOVERY - Host deployment-lucid-salt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [23:25:54] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [23:26:19] ^ andrewbogott did you ever know that you're my hero :) [23:26:32] RECOVERY - Host deployment-urldownloader is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [23:26:34] RECOVERY - Host deployment-salt is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [23:26:45] thcipriani: keep in mind that I also broke them in the first place... [23:26:59] And that’s only one of 6 hosts rebooted, so there will be similar outage storms coming up [23:27:15] Well, maybe not tonight since I want to make sure the fix on labvirt1001 really fixed things. [23:28:15] sure, well, thanks for digging deep on this one. Some days innocuous problems result in new kernels across clusters, I guess :P [23:28:31] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:28:51] YuviPanda: jouncebot is sick :( It's joining and responding to commands except it does nothing when "next" is asked [23:29:21] bd808: I’m going to restart it and see if that helps [23:29:22] thcipriani, the next step is to decide if these instances are officially working correctly: https://dpaste.de/A2pb [23:29:28] If so, I can apply the same medicine to other hosts [23:29:37] Meanwhile, time to put dinner in the oven [23:29:53] bd808: done [23:29:56] YuviPanda: k. I've cycled it a couple of times [23:30:23] * bd808 was about to add more not so secret debugging commands to it [23:30:32] heh [23:30:41] PROBLEM - Host deployment-db1 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2572.68 ms [23:30:41] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [23:31:18] YuviPanda: are we running it twice?
[23:31:19] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:31:23] PROBLEM - Puppet staleness on deployment-urldownloader is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [23:31:26] bd808: not sure where the other one is coming from [23:31:33] rebooting only bounced jouncebot_ [23:31:36] I started it on tools-bastion-01 [23:31:43] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:31:46] which has jouncebot nick [23:32:08] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [23:32:14] Yippee, build fixed! [23:32:14] Project beta-code-update-eqiad build #52982: FIXED in 1 min 44 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/52982/ [23:32:26] $ job jouncebot -- 258195 [23:33:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1520 bytes in 2.109 second response time [23:33:47] bd808: let me shut it down and start it back up on trusty [23:33:56] k [23:34:50] PROBLEM - Host deployment-eventlogging02 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3195.11 ms [23:38:30] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [23:38:46] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0] [23:38:58] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:08] RECOVERY - Puppet failure on deployment-apertium01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:26] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:44] PROBLEM - Host deployment-eventlogging02 is DOWN: CRITICAL - Host Unreachable (10.68.16.52) [23:40:10] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:40:30] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [23:41:28] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [23:41:32] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [23:42:16] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 57%, RTA = 7134.16 ms [23:43:05] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [23:43:26] andrewbogott: the guests on that machine all seem to ping in sub 1 ms range consistently [23:43:59] thcipriani: that’s a good sign! [23:44:14] I’m doing labvirt1002 now, that will upset another set of instances… [23:44:19] then I’m going to lay off until the morning.
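(Editor's note: the quick guest sweep thcipriani describes -- ping each instance on the rebooted labvirt host and eyeball the round-trip times -- is easy to script. A rough sketch, assuming GNU ping on Linux; the host list is a placeholder standing in for the dpaste contents, which are not preserved in this log.)

```python
import subprocess

# Placeholder list: stands in for whatever the dpaste of guests on the
# rebooted labvirt host contained.
HOSTS = ["deployment-db1", "deployment-memc02", "deployment-logstash1"]

for host in HOSTS:
    # One echo request with a 2-second deadline (GNU ping's -w flag);
    # exit code 0 means a reply came back.
    result = subprocess.run(["ping", "-c", "1", "-w", "2", host],
                            capture_output=True, text=True)
    if result.returncode == 0:
        # The final line of ping's output summarises the round-trip time.
        print(host, "up:", result.stdout.strip().splitlines()[-1])
    else:
        print(host, "DOWN")
```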
[23:44:21] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [23:44:30] kk [23:44:31] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [23:44:33] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:44:49] I'll update the ticket once that is complete [23:44:53] PROBLEM - Host deployment-eventlogging02 is DOWN: CRITICAL - Host Unreachable (10.68.16.52) [23:45:07] PROBLEM - Host deployment-mathoid is DOWN: PING CRITICAL - Packet loss = 100% [23:45:33] PROBLEM - Host deployment-sca01 is DOWN: CRITICAL - Host Unreachable (10.68.17.54) [23:46:05] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [23:46:43] PROBLEM - Host deployment-memc04 is DOWN: CRITICAL - Host Unreachable (10.68.17.69) [23:46:45] PROBLEM - Host deployment-pdf02 is DOWN: CRITICAL - Host Unreachable (10.68.16.129) [23:46:45] PROBLEM - Host deployment-videoscaler01 is DOWN: CRITICAL - Host Unreachable (10.68.16.211) [23:47:05] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 179.12 ms [23:47:07] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:11] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:21] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:41] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [23:47:56] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:47:58] PROBLEM - Host deployment-db1 is DOWN: CRITICAL - Host Unreachable (10.68.16.193) [23:48:08] PROBLEM - Host deployment-cache-upload02 is DOWN: CRITICAL - Host Unreachable (10.68.17.51) [23:50:54] FLAPPINGSTOP - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [23:51:30] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:38] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [23:52:08] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [23:52:10] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:52:40] RECOVERY - Host deployment-sca01 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [23:53:06] RECOVERY - Host deployment-eventlogging02 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:53:12] RECOVERY - Host deployment-cache-upload02 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [23:53:18] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 5597.84 ms [23:53:24] RECOVERY - Host deployment-memc04 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [23:53:28] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:53:37] RECOVERY - Host deployment-videoscaler01 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [23:53:41] RECOVERY - Host deployment-pdf02 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [23:53:51] RECOVERY - Host deployment-mathoid is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [23:54:29] RECOVERY - Host deployment-db1 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [23:55:58] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2049.79 ms [23:56:00] 
RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:56:18] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 64%, RTA = 2129.21 ms [23:57:01] thcipriani: ok, that’s all the breaking and/or fixing that I’m going to do tonight. Hosts not on labvirt1001 or labvirt1002 will continue to stutter as they have been… if all looks well I’ll sort them out tomorrow. [23:57:02] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:57:22] RECOVERY - Puppet failure on deployment-zookeeper01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:57:28] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [23:57:33] andrewbogott: kk, I'll update the ticket and do a little clean up of the guests on rebooted hosts [23:57:47] do you have a list of labvirt1002 guests? [23:58:20] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:58:56] RECOVERY - Puppet failure on deployment-test is OK: OK: Less than 1.00% above the threshold [0.0] [23:59:06] thcipriani: labvirt1002: https://dpaste.de/ku27 [23:59:11] andrewbogott: thanks again for all your help—it was a problem I'd have never found :) [23:59:20] me neither, Coren found it :) [23:59:34] well it _was_ found [23:59:44] that's what's important here [23:59:48] :) [23:59:50] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]