[01:07:07] Project beta-scap-eqiad build #56117: FAILURE in 3 min 14 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56117/
[01:08:41] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[01:15:59] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL - Socket timeout after 10 seconds
[01:16:29] Yippee, build fixed!
[01:16:29] Project beta-scap-eqiad build #56118: FIXED in 2 min 35 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56118/
[01:18:33] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47611 bytes in 0.622 second response time
[01:19:27] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[01:25:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[01:28:18] Project beta-scap-eqiad build #56119: FAILURE in 4 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56119/
[01:35:14] Yippee, build fixed!
[01:35:14] Project beta-scap-eqiad build #56120: FIXED in 1 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56120/
[01:57:02] Project beta-scap-eqiad build #56122: FAILURE in 3 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56122/
[02:15:51] Yippee, build fixed!
[02:15:51] Project beta-scap-eqiad build #56124: FIXED in 2 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56124/
[02:16:59] (PS1) Legoktm: Don't require "wf" prefix on functions that are namespaced [tools/codesniffer] - https://gerrit.wikimedia.org/r/216517 (https://phabricator.wikimedia.org/T101623)
[02:40:38] Project beta-scap-eqiad build #56126: FAILURE in 5 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56126/
[02:40:50] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #595: FAILURE in 7 min 49 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/595/
[03:01:26] Yippee, build fixed!
[03:01:26] Project beta-scap-eqiad build #56128: FIXED in 7 min 35 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56128/
[03:06:19] Project beta-scap-eqiad build #56129: FAILURE in 2 min 31 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56129/
[05:13:08] PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[05:13:08] PROBLEM - HHVM Queue Size on deployment-mediawiki02 is CRITICAL 50.00% of data above the critical threshold [80.0]
[05:13:08] PROBLEM - HHVM Queue Size on deployment-mediawiki01 is CRITICAL 50.00% of data above the critical threshold [80.0]
[05:13:08] PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[05:13:17] PROBLEM - SSH on deployment-videoscaler01 is CRITICAL: Server answer
[05:13:22] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #467: FAILURE in 6 min 11 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/467/
[05:13:24] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #442: FAILURE in 6 min 11 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/442/
[05:14:07] RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3896 bytes in 0.001 second response time
[05:15:01] PROBLEM - Puppet failure on integration-saltmaster is CRITICAL 20.00% of data above the critical threshold [0.0]
[05:15:05] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL 22.22% of data above the critical threshold [0.0]
[05:15:17] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL 22.22% of data above the critical threshold [0.0]
[05:15:21] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30313 bytes in 0.601 second response time
[05:15:35] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL 20.00% of data above the critical threshold [0.0]
[05:16:13] RECOVERY - SSH on deployment-videoscaler01 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[05:16:43] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL 30.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-sca02 is CRITICAL 55.56% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL 60.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on integration-slave-trusty-1017 is CRITICAL 22.22% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL 20.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-test is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL 60.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-salt is CRITICAL 60.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on integration-zuul-server is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on integration-slave-precise-1013 is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL 55.56% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on integration-raita is CRITICAL 60.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL 60.00% of data above the critical threshold [0.0]
[05:20:00] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL 40.00% of data above the critical threshold [0.0]
[05:20:03] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL 44.44% of data above the critical threshold [0.0]
[05:20:27] PROBLEM - Puppet failure on integration-puppetmaster is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:20:37] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL 30.00% of data above the critical threshold [0.0]
[05:20:41] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL 50.00% of data above the critical threshold [0.0]
[05:21:27] PROBLEM - Puppet failure on deployment-bastion is CRITICAL 33.33% of data above the critical threshold [0.0]
[05:22:09] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL 55.56% of data above the critical threshold [0.0]
[05:22:11] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL 55.56% of data above the critical threshold [0.0]
[05:22:45] RECOVERY - App Server bits response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 3896 bytes in 0.001 second response time
[05:22:51] RECOVERY - HHVM Queue Size on deployment-mediawiki01 is OK Less than 30.00% above the threshold [10.0]
[05:22:53] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 47625 bytes in 0.633 second response time
[05:25:49] Yippee, build fixed!
[05:25:49] Project beta-scap-eqiad build #56143: FIXED in 1 min 25 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/56143/
[05:29:32] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #93: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/93/
[05:32:00] RECOVERY - HHVM Queue Size on deployment-mediawiki02 is OK Less than 30.00% above the threshold [10.0]
[05:38:18] RECOVERY - Puppet failure on deployment-sca02 is OK Less than 1.00% above the threshold [0.0]
[05:38:20] RECOVERY - Puppet failure on deployment-elastic05 is OK Less than 1.00% above the threshold [0.0]
[05:38:38] RECOVERY - Puppet failure on deployment-test is OK Less than 1.00% above the threshold [0.0]
[05:38:42] RECOVERY - Puppet failure on deployment-sca01 is OK Less than 1.00% above the threshold [0.0]
[05:38:57] RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK Less than 1.00% above the threshold [0.0]
[05:39:13] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK Less than 1.00% above the threshold [0.0]
[05:39:36] RECOVERY - Puppet failure on integration-slave-precise-1011 is OK Less than 1.00% above the threshold [0.0]
[05:40:04] RECOVERY - Puppet failure on deployment-zotero01 is OK Less than 1.00% above the threshold [0.0]
[05:40:16] RECOVERY - Puppet failure on deployment-restbase02 is OK Less than 1.00% above the threshold [0.0]
[05:40:34] RECOVERY - Puppet failure on deployment-memc03 is OK Less than 1.00% above the threshold [0.0]
[05:41:41] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK Less than 1.00% above the threshold [0.0]
[05:43:30] RECOVERY - Puppet failure on deployment-cxserver03 is OK Less than 1.00% above the threshold [0.0]
[05:43:32] RECOVERY - Puppet failure on integration-slave-trusty-1017 is OK Less than 1.00% above the threshold [0.0]
[05:43:38] RECOVERY - Puppet failure on deployment-redis01 is OK Less than 1.00% above the threshold [0.0]
[05:43:48] RECOVERY - Puppet failure on integration-zuul-server is OK Less than 1.00% above the threshold [0.0]
[05:43:48] RECOVERY - Puppet failure on deployment-salt is OK Less than 1.00% above the threshold [0.0]
[05:43:54] RECOVERY - Puppet failure on integration-slave-precise-1013 is OK Less than 1.00% above the threshold [0.0]
[05:44:00] RECOVERY - Puppet failure on deployment-memc04 is OK Less than 1.00% above the threshold [0.0]
[05:44:19] RECOVERY - Puppet failure on integration-raita is OK Less than 1.00% above the threshold [0.0]
[05:44:54] RECOVERY - Puppet failure on deployment-sentry2 is OK Less than 1.00% above the threshold [0.0]
[05:45:00] RECOVERY - Puppet failure on integration-saltmaster is OK Less than 1.00% above the threshold [0.0]
[05:45:04] RECOVERY - Puppet failure on deployment-mathoid is OK Less than 1.00% above the threshold [0.0]
[05:45:28] RECOVERY - Puppet failure on integration-puppetmaster is OK Less than 1.00% above the threshold [0.0]
[05:45:38] RECOVERY - Puppet failure on deployment-jobrunner01 is OK Less than 1.00% above the threshold [0.0]
[05:45:42] RECOVERY - Puppet failure on deployment-pdf02 is OK Less than 1.00% above the threshold [0.0]
[05:46:27] RECOVERY - Puppet failure on deployment-bastion is OK Less than 1.00% above the threshold [0.0]
[05:47:09] RECOVERY - Puppet failure on deployment-apertium01 is OK Less than 1.00% above the threshold [0.0]
[05:47:12] RECOVERY - Puppet failure on deployment-elastic06 is OK Less than 1.00% above the threshold [0.0]
[06:18:38] Continuous-Integration-Infrastructure, MediaWiki-Codesniffer, HHVM, Patch-For-Review: MediaWiki Codesniffer tests are failing on pre-HHVM 3.5.0 versions - https://phabricator.wikimedia.org/T100544#1343788 (Legoktm) Open>Resolved a:Legoktm This is fixed now, in MW-CS we're using 2.3.0 expli...
[06:18:51] Continuous-Integration-Infrastructure, HHVM, Patch-For-Review: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core due to hhvm upgrade from 3.3.1+dfsg1-1+wm3.1 to 3.6.1+dfsg1-1+wm2 - https://phabricator.wikimedia.org/T98876#1343792 (Legoktm)
[06:36:44] RECOVERY - Free space - all mounts on deployment-bastion is OK All targets OK
[06:54:42] RECOVERY - Free space - all mounts on deployment-videoscaler01 is OK All targets OK
[09:22:25] PROBLEM - Puppet failure on deployment-mx is CRITICAL 100.00% of data above the critical threshold [0.0]
[11:30:43] PROBLEM - Free space - all mounts on deployment-videoscaler01 is CRITICAL deployment-prep.deployment-videoscaler01.diskspace._var.byte_percentfree (<40.00%)
[11:42:07] Beta-Cluster, Graphoid, Services: git deploy sync on beta-cluster never finishes fetch (2/4) - https://phabricator.wikimedia.org/T101633#1343952 (mobrovac) p:Triage>Low
[13:21:03] PROBLEM - HHVM Queue Size on deployment-mediawiki03 is CRITICAL 66.67% of data above the critical threshold [80.0]
[14:34:29] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #526: FAILURE in 8 min 29 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/526/
[15:02:32] PROBLEM - Puppet staleness on deployment-mediawiki03 is CRITICAL 20.00% of data above the critical threshold [43200.0]
[15:03:44] PROBLEM - Puppet staleness on integration-slave-trusty-1013 is CRITICAL 10.00% of data above the critical threshold [43200.0]
[15:05:44] PROBLEM - Puppet staleness on integration-slave-trusty-1012 is CRITICAL 40.00% of data above the critical threshold [43200.0]
[15:06:18] PROBLEM - Puppet staleness on deployment-restbase01 is CRITICAL 55.56% of data above the critical threshold [43200.0]
[15:06:55] PROBLEM - Puppet staleness on integration-slave-trusty-1015 is CRITICAL 50.00% of data above the critical threshold [43200.0]
[15:23:28] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[16:50:17] Beta-Cluster, ContentTranslation-cxserver, Patch-For-Review: CXServer on beta is writing Logs to NFS - https://phabricator.wikimedia.org/T101240#1344298 (yuvipanda) a:yuvipanda>None
[16:51:05] Beta-Cluster, ContentTranslation-cxserver, Patch-For-Review: CXServer on beta is writing Logs to NFS - https://phabricator.wikimedia.org/T101240#1333386 (yuvipanda) Not sure why this was assigned to me.
[16:51:56] Beta-Cluster: Setup monitoring for database servers in beta cluster - https://phabricator.wikimedia.org/T87093#1344303 (yuvipanda) a:yuvipanda>None
[16:52:29] Staging, Patch-For-Review: Setup staging-tin as deployment host - https://phabricator.wikimedia.org/T88442#1344308 (yuvipanda) a:yuvipanda>None
[16:57:05] Beta-Cluster, Tracking: Setup monitoring for Beta Cluster (tracking) - https://phabricator.wikimedia.org/T53497#1344332 (yuvipanda) a:yuvipanda>None
[17:02:01] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[17:02:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[17:06:19] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30313 bytes in 0.821 second response time
[17:06:31] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47611 bytes in 0.800 second response time
[17:14:25] Beta-Cluster, operations, Puppet: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1344368 (yuvipanda) a:yuvipanda>None
[17:14:39] Continuous-Integration-Infrastructure: Jenkins: Consistently getting 503 Varnish in response to successful login - https://phabricator.wikimedia.org/T63710#1344371 (Krinkle) Resolved>Open Still happening.
[17:14:47] Continuous-Integration-Infrastructure, Jenkins: Jenkins: Consistently getting 503 Varnish in response to successful login - https://phabricator.wikimedia.org/T63710#1344373 (Krinkle)
[17:20:06] Continuous-Integration-Infrastructure: MediaWiki Jenkins jobs stuck for 20 minutes - https://phabricator.wikimedia.org/T101653#1344382 (Krinkle) NEW
[17:51:29] !log integration-slave-trusty-1012, trusty-1013 and 1015 unresponsive to pings or ssh. Other trusty slaves still reachable.
[17:51:32] Logged the message, Master
[17:58:51] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #563: FAILURE in 50 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/563/
[18:21:31] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL 100.00% of data above the critical threshold [0.0]
[19:00:56] Beta-Cluster, operations, Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1344425 (yuvipanda) a:yuvipanda>None
[19:46:41] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[20:16:04] !log Per Yuvi's advice, disabled "Shared project storage" (/data/project NFS mount) for the integration project. Mostly unused. Two existing directories were archived to /home/krinkle/integration-nfs-data-project/
[20:16:09] Logged the message, Master
[20:22:18] Continuous-Integration-Infrastructure, Labs: Continuous integration should not depend on labs NFS - https://phabricator.wikimedia.org/T90610#1344487 (Krinkle) As a first step, I disabled "Shared project storage" (/data/project NFS mount) in the [Nova Project management](https://wikitech.wikimedia.org/w/in...
[20:42:38] Continuous-Integration-Infrastructure: integration-slave-trusty-1012, -1013, and -1015 unresponsive - https://phabricator.wikimedia.org/T101658#1344492 (Krinkle) NEW
[20:42:59] !log Rebooting integration-slave-trusty-1015 to see if it comes back so we can inspect logs (T101658)
[20:43:02] Logged the message, Master
[20:49:32] PROBLEM - Puppet failure on integration-dev is CRITICAL 100.00% of data above the critical threshold [0.0]
[21:01:54] RECOVERY - Puppet staleness on integration-slave-trusty-1015 is OK Less than 1.00% above the threshold [3600.0]
[21:17:25] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL 100.00% of data above the critical threshold [43200.0]
[21:18:25] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL 100.00% of data above the critical threshold [0.0]
[21:19:41] PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:01:41] https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/19823/console <-- ???
[22:01:45] 30m timeout/
[22:02:02] hmm, that's on one of the stuck slaves
[22:02:38] Continuous-Integration-Infrastructure: integration-slave-trusty-1012, -1013, and -1015 unresponsive - https://phabricator.wikimedia.org/T101658#1344521 (Legoktm) https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/19823/consoleFull was on integration-slave-trusty-1012 and timed out after 30 mi...
[22:40:54] Beta-Cluster, Graphoid, Services: git deploy sync on beta-cluster never finishes fetch (2/4) - https://phabricator.wikimedia.org/T101633#1344558 (thcipriani) Seems this is an unintended result of the dns change in deployment prep. deploy/graphoid should have 2 targets: `deployment-sca01` and `deploym...
[23:52:21] Continuous-Integration-Infrastructure: integration-slave-trusty-1012, -1013, and -1015 unresponsive - https://phabricator.wikimedia.org/T101658#1344592 (Legoktm) I marked all 3 slaves as offline in jenkins.