[00:03:58] Project beta-scap-eqiad build #164825: 04STILL FAILING in 14 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164825/ [00:17:22] Project beta-scap-eqiad build #164826: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164826/ [00:30:44] Project beta-scap-eqiad build #164827: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164827/ [00:44:00] Project beta-scap-eqiad build #164828: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164828/ [00:57:20] Project beta-scap-eqiad build #164829: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164829/ [00:59:57] 00:56:59 bash: /var/lib/mwdeploy/.bashrc: Permission denied [01:00:11] 00:56:59 00:56:59 ['/usr/bin/scap', 'pull', '--no-update-l10n'] on deployment-mediawiki05.deployment-prep.eqiad.wmflabs returned [70]: Could not chdir to home directory /var/lib/mwdeploy: Permission denied [01:00:11] 00:56:59 bash: /var/lib/mwdeploy/.bashrc: Permission denied [01:00:43] blerg, I think I know what's happening... [01:01:07] we lose our connection to ldap and puppet then creates a local mwdeploy user that shadows the ldap user [01:01:17] rsync gets confused because of different uids [01:02:46] oh, i guess delete the local user and run puppet? :) [01:03:32] yeah, vipw [01:05:09] thanks :) [01:05:35] * paladox goes again - 02:05am [01:10:32] Project beta-scap-eqiad build #164830: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164830/ [01:20:14] alright, I think after this next failure beta-scap-eqiad should work again... [01:21:28] Project beta-scap-eqiad build #164831: 04STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164831/ [01:30:55] and after this next failure :( [01:31:13] Project beta-scap-eqiad build #164832: 04STILL FAILING in 8 min 57 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164832/ [01:37:21] Yippee, build fixed! [01:37:21] Project beta-scap-eqiad build #164833: 09FIXED in 5 min 24 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/164833/ [01:38:26] !log scap on beta was failing because during the ldap downtime puppet created a shadow mwdeploy user, fixed using vipw and vigr [01:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [01:46:18] 10Continuous-Integration-Config, 10Release-Engineering-Team, 10MW-1.30-release-notes, 10MediaWiki-Core-Tests, and 7 others: Parser tests fail if default Skin for unit tests makes use of doEditSectionLink - https://phabricator.wikimedia.org/T170880#3455178 (10Legoktm) @Jdlrobson I basically did the same con... [01:47:50] 10Continuous-Integration-Config, 10Release-Engineering-Team, 10MW-1.30-release-notes, 10MediaWiki-Core-Tests, and 7 others: Parser tests fail if default Skin for unit tests makes use of doEditSectionLink - https://phabricator.wikimedia.org/T170880#3455180 (10Legoktm) (Also whenever I would try testing it w...
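The diagnosis above — LDAP drops out, puppet provisions a local mwdeploy account that shadows the LDAP one under a different uid, and rsync/scap then trip over the mismatched ownership — can be spotted by checking whether a supposedly LDAP-only user also has an entry in the local passwd file. A minimal sketch of that check, assuming a Debian-style /etc/passwd; the mwdeploy name is taken from the log, and the script is only illustrative — the actual fix was simply deleting the local entry with vipw/vigr and re-running puppet:

```
#!/usr/bin/env python3
"""Report when a user that should come from LDAP also has a local
/etc/passwd entry (a shadow account, possibly with a different uid)."""
import pwd


def local_uid(name, passwd_file="/etc/passwd"):
    """uid from the local passwd file only, or None if not defined locally."""
    with open(passwd_file) as fh:
        for line in fh:
            fields = line.rstrip("\n").split(":")
            if len(fields) > 2 and fields[0] == name:
                return int(fields[2])
    return None


def resolved_uid(name):
    """uid as the rest of the system resolves it (local files win over LDAP)."""
    try:
        return pwd.getpwnam(name).pw_uid
    except KeyError:
        return None


if __name__ == "__main__":
    user = "mwdeploy"  # account that is expected to live in LDAP only
    local, resolved = local_uid(user), resolved_uid(user)
    if local is not None:
        print(f"{user} is defined locally (uid {local}, resolves to {resolved}): "
              "remove the local entry with vipw/vigr and re-run puppet")
    else:
        print(f"{user} has no local entry; nothing is shadowing LDAP")
```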
[02:14:31] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [02:15:05] PROBLEM - Puppet errors on deployment-zotero01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [02:18:46] PROBLEM - Puppet errors on deployment-salt02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [02:20:42] PROBLEM - Puppet errors on deployment-kafka05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [02:21:06] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:21:36] PROBLEM - Puppet errors on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [02:23:06] PROBLEM - Puppet errors on integration-slave-jessie-1002 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:23:43] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [02:23:45] PROBLEM - Puppet errors on deployment-fluorine02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:24:43] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [02:25:17] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [02:25:29] PROBLEM - Puppet errors on deployment-poolcounter04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:28:00] PROBLEM - Puppet errors on deployment-mediawiki06 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [02:29:46] PROBLEM - Puppet errors on integration-slave-docker-1002 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [02:31:28] PROBLEM - Puppet errors on deployment-restbase02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [02:34:03] PROBLEM - Puppet errors on deployment-db03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:34:11] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [02:34:51] PROBLEM - Puppet errors on deployment-ircd is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [02:35:00] PROBLEM - Puppet errors on deployment-stream is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [02:35:11] PROBLEM - Puppet errors on saucelabs-03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [02:35:25] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:35:59] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [02:36:47] PROBLEM - Puppet errors on integration-publishing is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [02:36:53] PROBLEM - Puppet errors on integration-slave-docker-1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [02:38:54] PROBLEM - Puppet errors on integration-slave-trusty-1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [02:39:46] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:39:48] PROBLEM - Puppet errors on saucelabs-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:43:37] PROBLEM - Puppet errors on deployment-restbase01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [02:46:24] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:54:31] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [02:55:41] RECOVERY - Puppet errors on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [02:56:05] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [02:58:04] RECOVERY - Puppet errors on integration-slave-jessie-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [02:58:44] RECOVERY - Puppet errors on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [02:58:46] RECOVERY - Puppet errors on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:00:18] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [03:00:28] RECOVERY - Puppet errors on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0] [03:01:13] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:36] RECOVERY - Puppet errors on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [03:03:43] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:03:45] PROBLEM - Puppet errors on deployment-elastic06 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [03:04:45] RECOVERY - Puppet errors on integration-slave-docker-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [03:04:45] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [03:08:02] RECOVERY - Puppet errors on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0] [03:08:04] PROBLEM - Puppet errors on deployment-aqs02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:08:51] PROBLEM - Puppet errors on deployment-ms-fe02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [03:09:05] RECOVERY - Puppet errors on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0] [03:09:12] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:09:28] PROBLEM - Puppet errors on deployment-kafka04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [03:09:40] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [03:09:54] PROBLEM - Puppet errors on deployment-cache-upload04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [03:11:04] PROBLEM - Puppet errors on deployment-memc04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [03:11:30] RECOVERY - Puppet errors on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:11:32] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [03:11:36] PROBLEM - Puppet errors on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [03:11:48] 
RECOVERY - Puppet errors on integration-publishing is OK: OK: Less than 1.00% above the threshold [0.0] [03:12:19] PROBLEM - Puppet errors on deployment-sentry01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:12:21] PROBLEM - Puppet errors on deployment-db04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:13:22] PROBLEM - Puppet errors on integration-puppetmaster01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:14:46] PROBLEM - Puppet errors on deployment-elastic05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [03:14:47] RECOVERY - Puppet errors on saucelabs-02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:15:33] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:15:35] PROBLEM - Puppet errors on integration-slave-jessie-android is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:15:43] PROBLEM - Puppet errors on deployment-aqs03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [03:16:01] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [03:16:25] PROBLEM - Puppet errors on deployment-mediawiki04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [03:19:39] PROBLEM - Puppet errors on deployment-zookeeper02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [03:19:43] PROBLEM - Puppet errors on deployment-fluorine02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [03:19:43] PROBLEM - Puppet errors on deployment-salt02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [03:20:44] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:21:29] PROBLEM - Puppet errors on deployment-poolcounter04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:21:41] PROBLEM - Puppet errors on deployment-kafka05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [03:22:05] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [03:22:33] PROBLEM - Puppet errors on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [03:23:40] PROBLEM - Puppet errors on jenkinstest is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [03:23:52] PROBLEM - Puppet errors on integration-saltmaster is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [03:24:04] PROBLEM - Puppet errors on integration-slave-jessie-1002 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [03:24:42] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [03:24:46] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [03:25:16] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:25:45] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:26:17] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [03:26:40] PROBLEM - 
Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [03:26:40] PROBLEM - Puppet errors on saucelabs-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [03:26:54] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [03:29:11] PROBLEM - Puppet errors on deployment-imagescaler01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [03:29:43] PROBLEM - Puppet errors on deployment-puppetdb01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [03:30:22] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [03:31:16] PROBLEM - Puppet errors on deployment-parsoid09 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [03:35:13] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:38:43] RECOVERY - Puppet errors on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [03:40:11] RECOVERY - Puppet errors on saucelabs-03 is OK: OK: Less than 1.00% above the threshold [0.0] [03:44:51] RECOVERY - Puppet errors on deployment-ircd is OK: OK: Less than 1.00% above the threshold [0.0] [03:45:00] RECOVERY - Puppet errors on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [03:46:38] RECOVERY - Puppet errors on integration-slave-trusty-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [03:46:54] RECOVERY - Puppet errors on integration-slave-docker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [03:47:19] RECOVERY - Puppet errors on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0] [03:47:19] RECOVERY - Puppet errors on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:48:05] RECOVERY - Puppet errors on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:48:52] RECOVERY - Puppet errors on deployment-ms-fe02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:48:52] RECOVERY - Puppet errors on integration-slave-trusty-1004 is OK: OK: Less than 1.00% above the threshold [0.0] [03:49:28] RECOVERY - Puppet errors on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [03:49:40] RECOVERY - Puppet errors on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0] [03:49:48] RECOVERY - Puppet errors on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [03:49:56] RECOVERY - Puppet errors on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0] [03:50:34] RECOVERY - Puppet errors on integration-slave-jessie-android is OK: OK: Less than 1.00% above the threshold [0.0] [03:50:42] RECOVERY - Puppet errors on deployment-aqs03 is OK: OK: Less than 1.00% above the threshold [0.0] [03:51:01] RECOVERY - Puppet errors on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [03:51:25] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [03:51:27] RECOVERY - Puppet errors on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [03:51:33] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [03:53:24] RECOVERY - Puppet errors on integration-puppetmaster01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:53:40] RECOVERY - 
Puppet errors on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:55:06] RECOVERY - Puppet errors on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:55:33] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:56:41] RECOVERY - Puppet errors on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [03:57:07] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [03:59:05] RECOVERY - Puppet errors on integration-slave-jessie-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [03:59:41] RECOVERY - Puppet errors on deployment-zookeeper02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:59:44] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:59:45] RECOVERY - Puppet errors on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [03:59:47] RECOVERY - Puppet errors on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [04:00:44] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [04:00:44] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [04:01:17] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [04:01:29] RECOVERY - Puppet errors on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0] [04:01:39] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [04:01:55] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [04:02:34] RECOVERY - Puppet errors on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [04:03:40] RECOVERY - Puppet errors on jenkinstest is OK: OK: Less than 1.00% above the threshold [0.0] [04:03:50] RECOVERY - Puppet errors on integration-saltmaster is OK: OK: Less than 1.00% above the threshold [0.0] [04:04:46] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [04:05:16] RECOVERY - Puppet errors on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [04:06:17] RECOVERY - Puppet errors on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0] [04:06:39] RECOVERY - Puppet errors on saucelabs-01 is OK: OK: Less than 1.00% above the threshold [0.0] [04:08:22] Yippee, build fixed! 
[04:08:22] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,BrowserTests build #458: 09FIXED in 12 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=BrowserTests/458/ [04:08:28] (03PS1) 10Legoktm: Reduce false positives in ReferenceThisSniff [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/366504 (https://phabricator.wikimedia.org/T170316) [04:09:10] RECOVERY - Puppet errors on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [04:09:42] RECOVERY - Puppet errors on deployment-puppetdb01 is OK: OK: Less than 1.00% above the threshold [0.0] [04:10:13] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [04:10:21] RECOVERY - Puppet errors on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [04:10:25] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0] [04:18:46] Yippee, build fixed! [04:18:46] Project selenium-MultimediaViewer » firefox,beta,Linux,BrowserTests build #458: 09FIXED in 22 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/458/ [05:25:42] 10Continuous-Integration-Config, 10Release-Engineering-Team, 10MW-1.30-release-notes, 10MediaWiki-Core-Tests, and 7 others: Parser tests fail if default Skin for unit tests makes use of doEditSectionLink - https://phabricator.wikimedia.org/T170880#3455358 (10Jdlrobson) Legroom you1 [05:53:27] 10Beta-Cluster-Infrastructure, 10Recommendation-API: recommendation_api module breaking beta labs puppet - https://phabricator.wikimedia.org/T171075#3455380 (10Joe) Please apply the same role/profile we use in production to beta too. [06:00:10] PROBLEM - Puppet errors on deployment-imagescaler01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [06:34:08] RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK [06:40:09] RECOVERY - Puppet errors on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [07:05:07] <_joe_> !log adding myself to projectadmins for integration, trying to troubleshoot castor [07:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [07:09:55] <_joe_> !log rebooting castor, jobs are failing, and no one seems able to login [07:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [07:34:56] _joe_: aye, it seems CI isn't working (well) indeed. [07:34:58] Status? [07:35:48] <_joe_> Krinkle: the integration puppetmaster is broken, we are waiting for hashar to come online as we know little about the ci infrastructure [07:36:04] It seems the jobs start fine, but then timeout on trying to write to castor [07:36:16] which is the very last step [07:36:22] <_joe_> yeah, and I cannot log into castor either [07:36:24] e.g. https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit-jessie/36027/console [07:36:27] <_joe_> as puppet is broken there [07:37:02] been hanging for 34 minutes at "00:02:45.352 Waiting for the completion of castor-save" – after "00:02:45.233 Done. 00:02:45.348 [PostBuildScript] - Execution post build scripts." 
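The hang described above is the cache-save step at the end of each job waiting on a castor host that no longer answers. Purely as an illustration — assuming for the sketch that saving the cache amounts to an rsync push to the castor instance, with placeholder host and paths rather than the real job configuration — a save step with an explicit timeout that treats failure as non-fatal would let builds finish instead of blocking:

```
#!/usr/bin/env python3
"""Toy cache-save step with a hard timeout, so an unreachable cache host
makes the step fail fast instead of blocking the whole build."""
import subprocess


def save_cache(src="/home/jenkins/cache/",
               dest="rsync://castor02.integration.eqiad.wmflabs/caches/demo/",
               timeout=300):
    """Push the local cache to the central cache host; give up quietly on error."""
    try:
        subprocess.run(["rsync", "-a", "--delete", src, dest],
                       check=True, timeout=timeout)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as err:
        # Saving the cache is an optimisation, not a build requirement.
        print(f"castor-save skipped: {err}")


if __name__ == "__main__":
    save_cache()
```

In the log the same effect is reached more bluntly: hashar disables castor-save entirely and later points the jobs at a freshly built castor02 instance.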
[07:37:05] <_joe_> so we need someone with admin rights _before_ the incident to log onto the puppetmaster (which has been broken for ~ 10 days I'd say) [07:37:16] <_joe_> Krinkle: you could just cancel that job [07:37:20] <_joe_> castor-save I mean [07:37:35] https://integration.wikimedia.org/ci/job/castor-save/ [07:37:41] No, because the job hasn't been created yet [07:37:51] _joe_: which exact wmflabs server? [07:38:03] <_joe_> Krinkle: heh, lemme check [07:38:15] I might be able [07:38:19] <_joe_> Krinkle: if you can log into castor, then you can log into the project [07:38:29] other way around I assume [07:38:30] but yes [07:38:32] which server :) [07:38:47] <_joe_> I have to find out, one sec [07:39:02] only 1 at https://tools.wmflabs.org/openstack-browser/project/integration [07:39:04] so I'll take that one [07:39:16] <_joe_> yeah integration-puppetmaster01 [07:39:26] hm.. key denied at castor.integration.eqiad.wmflabs [07:39:50] integration-puppetmaster01 works fine though [07:39:54] what do you want me to do [07:40:02] <_joe_> dpkg -l apache2 [07:40:06] <_joe_> for starters [07:41:25] _joe_: for the record - https://gist.github.com/Krinkle/e8f07deadc3963d42be24721bc82f30b [07:41:41] Desired=Unknown/Install/Remove/Purge/Hold [07:41:41] | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend [07:41:41] |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) [07:41:41] ||/ Name Version Architecture Description [07:41:41] +++-==========================================-==========================-==========================-========================================================================================== [07:41:41] ii apache2 2.4.10-10+deb8u9+wmf1 amd64 Apache HTTP Server [07:41:50] <_joe_> so it's at the correct version [07:41:55] <_joe_> damn [07:42:09] <_joe_> hashar: can you check why castor cannot run puppet? [07:42:53] hashar: CI is down (mostly) jobs start and run but timeout at saving to castor (also, why are non-gate jobs trying to save to castor? maybe we can make it skip earlier somehow based on pipeline) [07:43:45] https://integration.wikimedia.org/ci/job/castor-save/494271/console [07:45:05] OK. I gotta run unfortunately. It's been a long day. [07:45:06] o/ [07:45:14] <_joe_> so if puppet is not failing globally it's a castor issue [07:54:02] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148#3455468 (10hashar) [07:55:05] !log Refreshing all Jenkins jobs defined in JJB in order to then disable castor entirely for T171148 [07:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [07:55:11] T171148: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148 [07:55:14] Krinkle: yeah filed it as https://phabricator.wikimedia.org/T171148 [07:55:19] seems something somehow is broken entirely [07:55:23] I am going to disable castor [07:59:55] (03PS1) 10Hashar: Disable castor entirely [integration/config] - 10https://gerrit.wikimedia.org/r/366520 (https://phabricator.wikimedia.org/T171148) [08:00:31] !log Disabled castor entirely via https://gerrit.wikimedia.org/r/366520 . The instance is broken - T171148
[08:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:00:34] T171148: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148 [08:05:35] so castor should be disabled now and no longer blocks jobs [08:05:53] _joe_: the instance is no longer reachable by any means :( [08:06:06] we usually use salt as a fallback but the minion is not responding [08:08:49] <_joe_> hashar: did you try the openstack console? [08:09:01] <_joe_> else ask someone with working root access in labs for assistance [08:09:19] we don't have access to it. It is probably easier to just recreate the instance [08:09:41] (hoping a newly created instance is actually reachable) [08:13:20] <_joe_> wait [08:13:33] <_joe_> ask someone with a working root labs account to help you [08:13:48] <_joe_> mine was outdated, so it doesn't work on castor [08:13:53] ah directly attaching to the kvm host? [08:16:08] _joe_: it is probably easier to just recreate it from scratch. The instance is in puppet, we would just lose the cache that can be repopulated manually for the busiest repos [08:20:20] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Patch-For-Review, 10User-Joe: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3455526 (10Joe) [08:38:37] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148#3455549 (10hashar) From the console log, puppet-agent on boot reports: ``` SSL_connect returned=1 errno=0 state=error... [08:41:46] (03CR) 10Hashar: [C: 032] "Jobs refreshed. I will restore it when a new instance is ready." [integration/config] - 10https://gerrit.wikimedia.org/r/366520 (https://phabricator.wikimedia.org/T171148) (owner: 10Hashar) [08:44:10] 10Release-Engineering-Team (Kanban), 10Cloud-VPS: Labs Jessie images come with puppet 3.7.2, should be 3.8.5 - https://phabricator.wikimedia.org/T168511#3455552 (10hashar) 05Open>03Resolved a:03hashar I have booted a Jessie instance with the latest labs image and it comes with puppet 3.8.5: ``` apt-cache... [08:53:56] !log Created castor02.integration.eqiad.wmflabs with puppet role role::ci::castor::server and adding it to Jenkins. Will then update the Jenkins jobs to point to it - T171148 [08:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:54:00] T171148: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148 [08:57:13] (03PS1) 10Hashar: Revert "Disable castor entirely" [integration/config] - 10https://gerrit.wikimedia.org/r/366523 (https://phabricator.wikimedia.org/T171148) [08:57:15] (03PS1) 10Hashar: Point Castor to castor02.integration.eqiad.wmflabs [integration/config] - 10https://gerrit.wikimedia.org/r/366524 (https://phabricator.wikimedia.org/T171148) [08:59:18] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148#3455575 (10hashar) p:05Triage>03Unbreak!
a:03hashar [09:01:35] (03CR) 10Hashar: [C: 032] "Transient change" [integration/config] - 10https://gerrit.wikimedia.org/r/366523 (https://phabricator.wikimedia.org/T171148) (owner: 10Hashar) [09:02:11] (03CR) 10Hashar: [C: 032] Point Castor to castor02.integration.eqiad.wmflabs [integration/config] - 10https://gerrit.wikimedia.org/r/366524 (https://phabricator.wikimedia.org/T171148) (owner: 10Hashar) [09:03:12] !log Restoring castor by updating all jobs to point to castor02 ( https://gerrit.wikimedia.org/r/366524 ) Starts with a cold cache :( - T171148 [09:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:03:15] T171148: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148 [09:03:39] (03Merged) 10jenkins-bot: Revert "Disable castor entirely" [integration/config] - 10https://gerrit.wikimedia.org/r/366523 (https://phabricator.wikimedia.org/T171148) (owner: 10Hashar) [09:03:44] (03Merged) 10jenkins-bot: Point Castor to castor02.integration.eqiad.wmflabs [integration/config] - 10https://gerrit.wikimedia.org/r/366524 (https://phabricator.wikimedia.org/T171148) (owner: 10Hashar) [09:13:13] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148#3455601 (10hashar) I have manually repopulated the cache for operations/puppet.git by triggering https://integration.... [09:13:18] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148#3455602 (10hashar) 05Open>03Resolved [09:15:55] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Set up experimental Docker CI slave - https://phabricator.wikimedia.org/T150502#3455621 (10hashar) I have removed `integration-slave-docker-1000` since puppet is completely broken on it.
[09:17:31] PROBLEM - Host castor is DOWN: CRITICAL - Host Unreachable (10.68.23.216) [09:17:34] !log Spawning and pooling integration-slave-docker-1003 as replacement to integration-slave-docker-1000 (broken) - T150502 [09:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:17:38] T150502: Set up experimental Docker CI slave - https://phabricator.wikimedia.org/T150502 [09:18:09] PROBLEM - Host integration-slave-docker-1000 is DOWN: CRITICAL - Host Unreachable (10.68.19.131) [09:25:54] ^^^ I have deleted both castor and integration-slave-docker-1000 [10:10:14] (03CR) 10Zfilipin: [C: 031] Set up CI for ReadingLists extension [integration/config] - 10https://gerrit.wikimedia.org/r/366248 (https://phabricator.wikimedia.org/T168975) (owner: 10Gergő Tisza) [10:12:49] zeljkof: you can deploy that one :-) [10:13:02] hashar: will do [10:15:43] (03CR) 10Zfilipin: [C: 032] Set up CI for ReadingLists extension [integration/config] - 10https://gerrit.wikimedia.org/r/366248 (https://phabricator.wikimedia.org/T168975) (owner: 10Gergő Tisza) [10:16:34] (03Merged) 10jenkins-bot: Set up CI for ReadingLists extension [integration/config] - 10https://gerrit.wikimedia.org/r/366248 (https://phabricator.wikimedia.org/T168975) (owner: 10Gergő Tisza) [10:18:02] (03PS1) 10Zfilipin: WIP Run WebdriverIO tests in CI for extensions [integration/config] - 10https://gerrit.wikimedia.org/r/366531 (https://phabricator.wikimedia.org/T164721) [10:19:04] (03CR) 10jerkins-bot: [V: 04-1] WIP Run WebdriverIO tests in CI for extensions [integration/config] - 10https://gerrit.wikimedia.org/r/366531 (https://phabricator.wikimedia.org/T164721) (owner: 10Zfilipin) [10:20:19] !log Reloading Zuul to deploy 80b9d855443a2f572d877b280783110684344c5d [10:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:20:49] (03CR) 10Zfilipin: "Deployed." [integration/config] - 10https://gerrit.wikimedia.org/r/366248 (https://phabricator.wikimedia.org/T168975) (owner: 10Gergő Tisza) [11:01:05] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3455895 (10hashar) [11:01:36] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3455907 (10hashar) [11:15:51] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3455925 (10hashar) labnet1001.eqiad.wmnet has a lot of such errors in /var/log/nova/nova-network.log* The first suspicious one:... [11:16:19] 10Release-Engineering-Team, 10Cloud-Services, 10Operations, 10Patch-For-Review: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10hashar) That is happening again after something got restarted yesterday. 
Filled as T171158 [11:21:55] (03Abandoned) 10Zfilipin: Use RelatedArticles' LocalSettings.php when running Selenium tests [integration/config] - 10https://gerrit.wikimedia.org/r/366236 (https://phabricator.wikimedia.org/T164721) (owner: 10Zfilipin) [11:25:02] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3455941 (10hashar) The Nodepool launch errors https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=12&fullscreen&orgId=1&... [11:37:15] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3455950 (10hashar) Seems the nova database is on `m5-master.eqiad.wmnet` db name `nova`. [11:39:25] (03PS2) 10Zfilipin: WIP Run WebdriverIO tests in CI for extensions [integration/config] - 10https://gerrit.wikimedia.org/r/366531 (https://phabricator.wikimedia.org/T164721) [11:41:19] hashar the reason why castor.integration.eqiad.wmflabs was inaccessible was that it needed two services restarted [11:41:30] due to the ldap certificate being updated [11:41:40] nscd and nslcd [11:43:02] we did reboot it [11:43:14] but puppet was broken on the instance so the new CA was not provisioned on the host [11:43:24] ah [11:43:34] so even with a restart, the instance still had the old/obsolete cert and thus would not connect [11:43:47] yeh [11:44:00] did you manage to salt in? [11:44:11] no it was broken as well [11:44:16] so I just deleted the instance [11:44:16] oh [11:44:39] hashar you can recreate it :). It should work now. [11:44:43] yeah it does [11:44:55] oh i see castor02 [11:45:00] next issue is that openstack is broken and refuses to spawn more instances [11:45:08] oh [11:45:11] and there are tens of instances on beta cluster which are broken [11:45:25] hashar can it ssh? [11:45:42] I don't know [11:45:49] ok [11:46:16] nodepool probably needs to pick up the new certificate for ldap [11:46:25] since the images are rebuilt at 2pm every day [11:46:40] 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3455974 (10Ladsgroup) [11:49:09] hmm how do i fix [11:49:11] Could not chdir to home directory /home/paladox: Permission denied [11:50:41] (03PS3) 10Zfilipin: WIP Run WebdriverIO tests in CI for extensions [integration/config] - 10https://gerrit.wikimedia.org/r/366531 (https://phabricator.wikimedia.org/T164721) [12:01:37] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3455996 (10Luke081515) p:05Triage>03High [12:20:06] Project beta-update-databases-eqiad build #18593: 04FAILURE in 5.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/18593/ [12:25:34] Yippee, build fixed!
[12:25:34] Project beta-update-databases-eqiad build #18594: 09FIXED in 1 min 39 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/18594/ [12:32:01] PROBLEM - Puppet errors on deployment-memc05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:35:32] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:37:22] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:37:28] PROBLEM - Puppet errors on deployment-restbase02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:37:58] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:39:42] PROBLEM - Puppet errors on deployment-elastic06 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:40:03] PROBLEM - Puppet errors on deployment-db03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [12:40:53] PROBLEM - Puppet errors on deployment-ircd is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:41:01] PROBLEM - Puppet errors on deployment-stream is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:41:13] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:43:17] PROBLEM - Puppet errors on deployment-sentry01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:44:03] PROBLEM - Puppet errors on deployment-aqs02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:45:29] PROBLEM - Puppet errors on deployment-kafka04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:45:38] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [12:45:54] PROBLEM - Puppet errors on deployment-cache-upload04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:47:02] PROBLEM - Puppet errors on deployment-memc04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:47:40] PROBLEM - Puppet errors on deployment-aqs03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [12:48:18] PROBLEM - Puppet errors on deployment-db04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:49:55] PROBLEM - Puppet errors on deployment-ms-fe02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:50:50] PROBLEM - Puppet errors on deployment-elastic05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:51:06] PROBLEM - Puppet errors on deployment-zotero01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:52:27] PROBLEM - Puppet errors on deployment-mediawiki04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [12:52:29] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:52:33] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:52:35] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:54:37] PROBLEM - Puppet errors on deployment-restbase01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] 
[12:55:38] PROBLEM - Puppet errors on deployment-zookeeper02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:55:44] PROBLEM - Puppet errors on deployment-salt02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [12:57:41] PROBLEM - Puppet errors on deployment-kafka05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:57:45] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:58:07] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:59:29] hashar does zuul support ecdsa keys? [12:59:33] i get this warning [12:59:34] UserWarning: Unknown ssh-rsa host key for [127.0.0.1]:29418: 61424736f3dea4ebc8cd59f27ec94a20 [12:59:52] paladox: it uses Paramiko, a python implementation of ssh [13:00:03] ah /me checks paramiko [13:00:04] thanks [13:00:06] that message is because you have to manually accept the gerrit ssh key [13:00:13] oh, i did [13:00:41] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [13:00:45] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [13:00:45] PROBLEM - Puppet errors on deployment-fluorine02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [13:02:16] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:02:30] PROBLEM - Puppet errors on deployment-poolcounter04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [13:02:40] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:02:44] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [13:02:54] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [13:04:00] PROBLEM - Puppet errors on deployment-mediawiki06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [13:04:10] hashar how do i manually verify it? [13:04:18] i've ssh'd into it by using the zuul user [13:04:23] to store it in known_hosts [13:04:27] but that does not seem to work [13:04:48] PROBLEM - Puppet errors on deployment-prometheus01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [13:05:10] paladox: double check that it is actually in /var/lib/zuul/.ssh/known_hosts ?
[13:05:17] ok [13:05:31] I guess you can do something like: sudo su - zuul [13:05:34] nope that's not in there [13:05:40] ssh -p 29418 127.0.0.1 [13:05:48] it has the new key from the gerrit server [13:05:56] (it's now ecdsa) [13:05:58] so maybe you added it to /home/paladox/.ssh/known_hosts :D [13:06:17] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:06:20] ssh -p 29418 jenkins@127.0.0.1 works [13:07:17] PROBLEM - Puppet errors on deployment-parsoid09 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [13:10:45] PROBLEM - Puppet errors on deployment-puppetdb01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [13:12:47] :-} [13:13:33] but it doesn't work in zuul it seems, getting unknown ssh-rsa key [13:13:43] PROBLEM - Puppet errors on deployment-tin is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [13:13:49] PROBLEM - Puppet errors on deployment-kafka03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:15:02] hashar this https://github.com/paramiko/paramiko/issues/67 may be related [13:15:38] https://github.com/paramiko/paramiko/issues/88 [13:16:11] PROBLEM - Puppet errors on deployment-imagescaler01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [13:16:35] aha [13:16:44] hashar we are using a very old version of it [13:17:02] oh [13:17:07] 1.8.0<2.0.0 [13:17:29] 10Continuous-Integration-Infrastructure, 10monitoring: tune gearman alarms - https://phabricator.wikimedia.org/T168085#3456140 (10faidon) p:05Triage>03Low [13:20:40] paladox: sorry I can't investigate it [13:20:54] ok [13:21:32] paladox: but on CI we seem to have paramiko 1.15.1 from jessie [13:21:49] oh [13:21:54] on stretch i have 2.0.0 [13:23:00] most probably zuul does not work with it ? [13:23:03] I have no idea really [13:26:00] It seems zuul only works with rsa [13:26:09] as i managed to get the rsa key into known_hosts [13:27:17] i will file a task so that we can try to fix that for wmf (as when we upgrade to gerrit 2.14 that will be a problem if we do ssh instead of ssh -o HostKeyAlgorithms=ssh-rsa -p 29418 jenkins@127.0.0.1) [13:27:41] (that adds it to known_hosts, you won't need to do that part again once it's added :)) [13:29:09] 10Release-Engineering-Team, 10Zuul: Add support for edcsa keys in zuul - https://phabricator.wikimedia.org/T171165#3456188 (10Paladox) [13:39:08] 10Release-Engineering-Team, 10Zuul: Add support for ecdsa keys in zuul - https://phabricator.wikimedia.org/T171165#3456245 (10Paladox) [13:39:15] 10Release-Engineering-Team, 10Zuul: Add support for ecdsa keys in zuul - https://phabricator.wikimedia.org/T171165#3456188 (10Paladox) [13:49:17] hashar aha [13:49:21] i think https://github.com/paramiko/paramiko/commit/0ddb28f3313e793cf574ed5fed42761be1adf6d5 this fixes it [13:51:44] hashar yep that fixes it [13:51:56] * paladox tested it [13:52:09] 10Release-Engineering-Team, 10Zuul: Add support for ecdsa keys in zuul - https://phabricator.wikimedia.org/T171165#3456278 (10Paladox) Fixed by https://github.com/paramiko/paramiko/commit/0ddb28f3313e793cf574ed5fed42761be1adf6d5 [13:56:49] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3456299 (10Andrew) 05Open>03Resolved a:03Andrew I resolved this by running the query in https://ask.openstack.org/en/quest...
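The underlying issue in the exchange above: Zuul verifies Gerrit's host key through paramiko, and the old paramiko release pinned by Zuul only understands ssh-rsa host keys, so the ecdsa entry that a modern ssh client stores in known_hosts is never matched (the paramiko commit linked above adds that support). A minimal sketch of the verification step using the paramiko API — the 127.0.0.1:29418 endpoint and the zuul user's known_hosts path come from the log, while the warning-only policy is an assumption for illustration, not Zuul's actual connection code:

```
#!/usr/bin/env python3
"""Connect to Gerrit's SSH port the way a paramiko-based client does,
matching the offered host key against an existing known_hosts file."""
import paramiko


def gerrit_client(host="127.0.0.1", port=29418, user="jenkins",
                  known_hosts="/var/lib/zuul/.ssh/known_hosts"):
    client = paramiko.SSHClient()
    client.load_host_keys(known_hosts)
    # Host keys are matched by type: a paramiko release without ECDSA
    # support can only negotiate an ssh-rsa host key, so a known_hosts
    # file holding just the server's ecdsa entry never matches and this
    # policy emits the "Unknown ssh-rsa host key" UserWarning quoted above.
    client.set_missing_host_key_policy(paramiko.WarningPolicy())
    client.connect(host, port=port, username=user)
    return client


if __name__ == "__main__":
    # Interim workaround from the discussion above: record an RSA host key
    # first with `ssh -o HostKeyAlgorithms=ssh-rsa -p 29418 jenkins@127.0.0.1`.
    gerrit_client().close()
```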
[13:57:39] 10Continuous-Integration-Infrastructure, 10Cloud-VPS: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3456302 (10hashar) I can confirm that resolved the issue completely. Thank you! [14:04:39] (03CR) 10Zfilipin: "Tested using mediawiki-core-qunit-selenium-337602-jessie job." [integration/config] - 10https://gerrit.wikimedia.org/r/366531 (https://phabricator.wikimedia.org/T164721) (owner: 10Zfilipin) [14:12:09] (03CR) 10Zfilipin: "One more test for core when EXT_NAME is not set." [integration/config] - 10https://gerrit.wikimedia.org/r/366531 (https://phabricator.wikimedia.org/T164721) (owner: 10Zfilipin) [14:12:38] (03PS4) 10Zfilipin: Run WebdriverIO tests in CI for extensions [integration/config] - 10https://gerrit.wikimedia.org/r/366531 (https://phabricator.wikimedia.org/T164721) [14:31:10] !log deployment-prep: manually cleaned out the puppet master configuration. It was all screwed up. Notably I removed bits about the puppetdb [14:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:34:24] PROBLEM - Puppet errors on integration-puppetmaster01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [14:39:23] RECOVERY - Puppet errors on integration-puppetmaster01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:41:11] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:42:31] RECOVERY - Puppet errors on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:42:43] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services: puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456426 (10hashar) [14:43:00] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:43:18] RECOVERY - Puppet errors on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:45:50] RECOVERY - Puppet errors on deployment-ircd is OK: OK: Less than 1.00% above the threshold [0.0] [14:45:54] RECOVERY - Puppet errors on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:46:00] RECOVERY - Puppet errors on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [14:48:13] 10Beta-Cluster-Infrastructure, 10Recommendation-API: recommendation_api module breaking beta labs puppet - https://phabricator.wikimedia.org/T171075#3456464 (10mobrovac) This is a general problem with `service::node` in beta, it seems, closing as duplicate. 
[14:48:18] RECOVERY - Puppet errors on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:48:30] 10Beta-Cluster-Infrastructure, 10Recommendation-API: recommendation_api module breaking beta labs puppet - https://phabricator.wikimedia.org/T171075#3456466 (10mobrovac) [14:48:32] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services: puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456469 (10mobrovac) [14:49:03] RECOVERY - Puppet errors on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:49:53] RECOVERY - Puppet errors on deployment-ms-fe02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:02] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next): puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456426 (10mobrovac) [14:50:29] RECOVERY - Puppet errors on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:39] RECOVERY - Puppet errors on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:47] RECOVERY - Puppet errors on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:01] RECOVERY - Puppet errors on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:24] RECOVERY - Puppet errors on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:26] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:32] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:34] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:42] RECOVERY - Puppet errors on deployment-aqs03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:50] zeljkof: https://gerrit.wikimedia.org/r/#/c/366248/ was deployed (thanks!) but CI still does not seem to work: https://gerrit.wikimedia.org/r/#/c/365986/ [14:52:56] did I miss something? [14:53:10] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Cloud-Services, 10Operations, 10Services: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456488 (10hashar) [14:54:04] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148#3455468 (10hashar) Beta cluster instances have the exact same issue. Filled as T171174 [14:54:06] tgr: maybe I made a mistake while deploying, will try again, just a minute [14:54:36] RECOVERY - Puppet errors on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:55:03] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Cloud-Services, 10Operations, 10Services: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456488 (10Paladox) Now that puppet is fixed, you can either wait a few hours for puppet t... 
[14:55:36] RECOVERY - Puppet errors on deployment-zookeeper02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:55:44] RECOVERY - Puppet errors on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:55:56] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Cloud-Services, 10Operations, 10Services: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456519 (10hashar) [14:56:06] RECOVERY - Puppet errors on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:57:08] !log reloading Zuul to deploy 80b9d85 [14:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:57:42] RECOVERY - Puppet errors on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:57:44] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:58:06] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:58:11] tgr: should be fine now, sorry, looks like I did not deploy correctly [15:00:29] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next): puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456558 (10hashar) deployment-trending01.deployment-prep.eqiad.wmflabs has a similar issue: ``` (Exec[trendingedits config deploy] => Se... [15:00:41] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:00:45] RECOVERY - Puppet errors on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:00:47] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:02:17] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next): puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456565 (10mobrovac) Yeah, there seems to be something weird in the Scap3 config deploy part of `service::node`. The difference between Beta...
[15:02:28] RECOVERY - Puppet errors on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:02:40] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [15:02:44] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [15:02:57] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:03:10] yeah [15:03:27] RECOVERY - Puppet errors on deployment-puppetmaster02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:04:47] RECOVERY - Puppet errors on deployment-prometheus01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:06:16] RECOVERY - Puppet errors on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:07:00] RECOVERY - Puppet errors on deployment-memc05 is OK: OK: Less than 1.00% above the threshold [0.0] [15:07:01] PROBLEM - Puppet errors on deployment-stream is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:07:15] RECOVERY - Puppet errors on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0] [15:07:15] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [15:07:21] RECOVERY - Puppet errors on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:08:10] !log removed profile::recommendation_api from deployment-sca01 to try to fix the ssh access for mobrovac T171173 T171174 [15:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:08:15] T171174: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174 [15:08:15] T171173: puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173 [15:08:41] RECOVERY - Puppet errors on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [15:08:49] RECOVERY - Puppet errors on deployment-kafka03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:08:56] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:09:00] RECOVERY - Puppet errors on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0] [15:09:20] PROBLEM - Puppet errors on deployment-db04 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:09:22] PROBLEM - Puppet errors on deployment-sentry01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:09:44] RECOVERY - Puppet errors on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [15:10:03] 10Browser-Tests-Infrastructure, 10Release-Engineering-Team (Kanban), 10MW-1.30-release-notes (WMF-deploy-2017-07-11_(1.30.0-wmf.9)), 10Patch-For-Review, 10User-zeljkofilipin: Run WebdriverIO tests in CI for extensions - https://phabricator.wikimedia.org/T164721#3456600 (10zeljkofilipin) Done: - I have c... [15:10:04] thx! 
[15:10:04] PROBLEM - Puppet errors on deployment-aqs02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:10:06] RECOVERY - Puppet errors on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:10:22] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next): puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456602 (10hashar) On deployment-sca01 I have removed `profile::recommendation_api` puppet then fails with: ``` Error: Failed to apply catal... [15:10:33] RECOVERY - Puppet errors on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:10:43] RECOVERY - Puppet errors on deployment-puppetdb01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:10:53] PROBLEM - Puppet errors on deployment-ms-fe02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:11:09] RECOVERY - Puppet errors on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:11:31] PROBLEM - Puppet errors on deployment-kafka04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:11:39] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:11:50] PROBLEM - Puppet errors on deployment-elastic05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:11:56] PROBLEM - Puppet errors on deployment-cache-upload04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:13:04] PROBLEM - Puppet errors on deployment-memc04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:13:24] PROBLEM - Puppet errors on deployment-mediawiki04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:13:26] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:13:36] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:13:43] PROBLEM - Puppet errors on deployment-aqs03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:13:43] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, 10VPS-Projects: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456611 (10bd808) [15:15:15] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, 10VPS-Projects: New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3456618 (10Ottomata) [15:15:36] PROBLEM - Puppet errors on deployment-restbase01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:16:45] tgr: sorry for messing up, I do zuul deploys rarely, I'm not even sure what I did wrong, since it is just one command :/ [15:16:58] anyway, deploying again worked :) [15:17:04] PROBLEM - Puppet errors on deployment-zotero01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:17:15] zeljkof: thanks for the quick fix! [15:17:23] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next): puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456637 (10hashar) I have added `profile::recommendation_api` back on deployment-sca01. 
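[editor's note] For anyone chasing the deployment-sca dependency loop discussed above (T171173), here is a minimal diagnostic sketch for visualizing a Puppet dependency cycle. The `--graph` flag and the graphs directory are standard Puppet 3 defaults of that era, not details taken from this log, and the paths may differ on these hosts.

```bash
# Hedged sketch: render a dependency cycle like the one reported for deployment-sca.
sudo puppet agent --test --noop --graph   # prints "Found 1 dependency cycle: ..." when a loop exists
ls /var/lib/puppet/state/graphs/          # with --graph, a cycles.dot file should appear here (Puppet 3 default path)
dot -Tsvg /var/lib/puppet/state/graphs/cycles.dot -o /tmp/cycles.svg   # requires graphviz
```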
[15:18:33] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, 10VPS-Projects: New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3456656 (10hashar) [15:18:35] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:18:41] PROBLEM - Puppet errors on deployment-kafka05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:18:48] tgr: not sure if you can deploy, but it's just `fab deploy_zuul` [15:18:56] https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Deploy_configuration [15:19:05] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:19:07] in case you have to do it yourself one of these days [15:20:09] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, 10VPS-Projects: New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3456618 (10hashar) Seems the initial puppet run refuses to process for whatever rea... [15:20:44] thanks, that's good to know [15:21:23] 10Gerrit, 10Developer-Relations, 10Documentation: [[mw:Gerrit/Tutorial]] is way too much information for new contributors - https://phabricator.wikimedia.org/T161901#3456688 (10Aklapper) [15:21:39] PROBLEM - Puppet errors on deployment-zookeeper02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:21:41] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:21:43] PROBLEM - Puppet errors on deployment-fluorine02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:21:43] PROBLEM - Puppet errors on deployment-salt02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:21:49] What the fuck :( [15:22:10] Error: Failed to apply catalog: Could not find dependent Service[eventlogging/init] for File[/usr/local/lib/eventlogging/filters.py] at /etc/puppet/modules/eventlogging/manifests/plugin.pp:49 [15:23:15] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:23:28] PROBLEM - Puppet errors on deployment-poolcounter04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:23:40] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:23:42] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:23:42] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:24:26] PROBLEM - Puppet errors on deployment-puppetmaster02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:25:48] PROBLEM - Puppet errors on deployment-prometheus01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:26:46] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:27:16] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:28:16] PROBLEM - Puppet errors on deployment-parsoid09 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] 
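[editor's note] Picking up the `fab deploy_zuul` pointer earlier in this stretch, a minimal sketch of the Zuul configuration deploy as described on the linked wiki page. The checkout path is hypothetical, and the assumption that the fabric task lives in a local integration/config working copy is not confirmed by the log.

```bash
# Hedged sketch of the Zuul config deploy mentioned above.
cd ~/src/integration/config   # hypothetical checkout path for the fabfile
git pull --ff-only            # assumed step: make sure the merged layout change (e.g. 80b9d85) is present
fab deploy_zuul               # the single command referenced above; pushes the config and reloads Zuul
```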
[15:29:41] PROBLEM - Puppet errors on deployment-tin is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:29:49] PROBLEM - Puppet errors on deployment-kafka03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:29:59] PROBLEM - Puppet errors on deployment-mediawiki06 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:30:29] !log deployment-prep : removing project wide puppet classes from https://horizon.wikimedia.org/project/puppet/ All are role::eventlogging::analytics::* [15:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:30:45] PROBLEM - Puppet errors on deployment-elastic06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:31:03] PROBLEM - Puppet errors on deployment-db03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:31:31] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:31:43] PROBLEM - Puppet errors on deployment-puppetdb01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:33:01] PROBLEM - Puppet errors on deployment-memc05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:33:13] PROBLEM - Puppet errors on deployment-imagescaler01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:33:13] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:33:29] PROBLEM - Puppet errors on deployment-restbase02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:33:59] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:38:22] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:42:49] hashar: do you need help from cloud team? they were dealing with the CA issues yesterday... [15:43:16] RECOVERY - Puppet errors on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0] [15:43:28] greg-g: yeah I reached out to andrew as soon as he connected and fixed up an issue with nodepool [15:43:34] and provided guidance for the ssh/ldap etc. issues [15:43:39] it is mostly sorted out now [15:43:44] I am filling my bits in https://wikitech.wikimedia.org/wiki/Incident_documentation/20170719-ldap [15:44:06] cool, just saw the "WHAT THE FUCK" and was worried :) [15:45:04] RECOVERY - Puppet errors on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:46:29] RECOVERY - Puppet errors on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:46:39] RECOVERY - Puppet errors on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0] [15:46:55] RECOVERY - Puppet errors on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:46:59] RECOVERY - Puppet errors on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [15:48:14] greg-g: how the WTF is me being exhausted.
But I found the reason, someone added some faulty puppet classes on all beta instances which broke puppet :} [15:48:34] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [15:48:40] RECOVERY - Puppet errors on deployment-aqs03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:48:54] hashar: heh, "great" [15:49:19] RECOVERY - Puppet errors on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:49:20] RECOVERY - Puppet errors on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:49:21] greg-g: Though better for Staging than Prod. ;-) [15:49:52] James_F: indeed, then only hashar/I get upset, as opposed to all of Ops ;) [15:49:55] greg-g: https://phabricator.wikimedia.org/p/hashar/ more or less captures my day [15:50:00] RECOVERY - Puppet errors on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0] [15:50:16] Indeed. [15:50:36] RECOVERY - Puppet errors on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:50:50] RECOVERY - Puppet errors on deployment-ms-fe02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:51:48] RECOVERY - Puppet errors on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [15:52:06] RECOVERY - Puppet errors on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:53:01] RECOVERY - Puppet errors on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:53:27] RECOVERY - Puppet errors on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:53:27] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [15:53:33] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:35] RECOVERY - Puppet errors on deployment-zookeeper02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:43] RECOVERY - Puppet errors on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:58:28] RECOVERY - Puppet errors on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:58:42] RECOVERY - Puppet errors on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [15:58:44] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:59:05] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:43] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:47] RECOVERY - Puppet errors on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:47] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:02:17] RECOVERY - Puppet errors on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:03:15] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [16:03:40] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [16:03:44] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [16:03:56] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:04:26] RECOVERY - Puppet errors on deployment-puppetmaster02 is OK: OK: Less than 1.00% 
above the threshold [0.0] [16:04:51] 10Browser-Tests-Infrastructure, 10Release-Engineering-Team (Kanban), 10MW-1.30-release-notes (WMF-deploy-2017-07-11_(1.30.0-wmf.9)), 10Patch-For-Review, 10User-zeljkofilipin: Run WebdriverIO tests in CI for extensions - https://phabricator.wikimedia.org/T164721#3456823 (10Jdlrobson) > looks like a page i... [16:05:46] RECOVERY - Puppet errors on deployment-prometheus01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:47] RECOVERY - Puppet errors on deployment-puppetdb01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:08:01] RECOVERY - Puppet errors on deployment-memc05 is OK: OK: Less than 1.00% above the threshold [0.0] [16:08:09] RECOVERY - Puppet errors on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:08:23] RECOVERY - Puppet errors on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:08:29] RECOVERY - Puppet errors on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:09:41] RECOVERY - Puppet errors on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [16:09:42] 10Continuous-Integration-Config, 10Release-Engineering-Team, 10MW-1.30-release-notes, 10MediaWiki-Core-Tests, and 5 others: Parser tests fail if default Skin for unit tests makes use of doEditSectionLink - https://phabricator.wikimedia.org/T170880#3456857 (10Jdlrobson) [16:09:48] RECOVERY - Puppet errors on deployment-kafka03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:10:46] RECOVERY - Puppet errors on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:04] RECOVERY - Puppet errors on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:33] RECOVERY - Puppet errors on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:13:10] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:14:01] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [16:18:00] (03Abandoned) 10Jdlrobson: Include Vector in phpunit tests for MobileFrontend [integration/config] - 10https://gerrit.wikimedia.org/r/366470 (https://phabricator.wikimedia.org/T170880) (owner: 10Jdlrobson) [16:19:10] (03CR) 10Umherirrender: [C: 031] Reduce false positives in ReferenceThisSniff [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/366504 (https://phabricator.wikimedia.org/T170316) (owner: 10Legoktm) [16:26:20] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next), 10User-Joe: puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456941 (10mobrovac) I can't make sense of this part: > `User[deploy-service] => Exec[recommendation_api config deploy]` I j... [16:29:48] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Wikimedia-Incident: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148#3456944 (10hashar) https://wikitech.wikimedia.org/wiki/Incident_documentation/20170719-ldap#CI.2Fbeta [16:30:05] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Cloud-VPS, 10Wikimedia-Incident: contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded - https://phabricator.wikimedia.org/T171158#3456946 (10hashar) https://wikitech.wikimedia.org/wiki/Incident_d... 
[16:30:21] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456948 (10hashar) https://wikitech.wikimedia.org/wiki/Incident_documentation/20170719-ldap#CI.2... [16:35:39] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456966 (10hashar) So the state as I understand it right now: The puppet master was broken, I h... [16:35:47] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456969 (10hashar) p:05Triage>03High [16:36:07] greg-g: I have filled my bits on the incident report ( warning it is long, you wanna skip reading https://wikitech.wikimedia.org/wiki/Incident_documentation/20170719-ldap#CI.2Fbeta ) :D [16:36:23] the aftermath for beta is to fix up ssh on all the instances https://phabricator.wikimedia.org/T171174 [16:36:33] I had a bunch fixed by running puppet and mass restarting nslcd [16:36:41] but I haven't verified whether they all work. No idea how to do that [16:36:51] at least I have left some instructions [16:37:31] hashar: I rebooted deployment-eventlog01, no luck [16:37:43] ottomata: recreate it I guess [16:37:56] hm, really? but i just created it this morning [16:38:10] can i delete it and recreate it with the same name? [16:41:33] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10Services, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456998 (10hashar) Announced on the QA list pointing back to this task [16:41:35] ottomata: I don't know [16:41:38] hah ok [16:41:45] new name it is :/ [16:41:47] ottomata: but labs / beta puppet master were all f**d up today [16:41:51] ya [16:41:52] so it does not surprise me it is broken somehow [16:41:57] I would say [16:41:58] delete it [16:42:02] wait a minute or so [16:42:10] and create with same name [16:42:10] with no class applied [16:42:22] PROBLEM - Host deployment-eventlog01 is DOWN: CRITICAL - Host Unreachable (10.68.22.64) [16:42:23] (sometimes if you apply a class to an instance, puppet will fail the first provisioning) [16:42:47] !log How to fix ssh access on beta cluster instances: https://phabricator.wikimedia.org/T171174#3456966 [16:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:54:04] hashar as you fixed puppet on the beta cluster, you can now tell users to restart their instance once and wait 5-10 mins then restart it again.
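[editor's note] A hedged sketch of the per-instance fix hashar describes above (run puppet, then restart nslcd). The host list and the use of plain ssh are assumptions; the log does not say how the mass restart was actually driven.

```bash
# Hedged sketch, not the exact commands used; host names below are examples only.
for host in deployment-mediawiki05 deployment-sca01; do
  ssh "${host}.deployment-prep.eqiad.wmflabs" \
    'sudo puppet agent --test; sudo service nslcd restart'
done
```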
[16:54:09] They should regain access :) [16:54:31] potentiall [16:54:32] y [16:54:40] but I did that for most of them [16:54:47] the rest are instances for which puppet does not run properly [16:55:11] and hence the new CA Certificate is not provisioned, thus even a reboot would not fix it :( [16:55:21] I am heading back home [16:55:23] been a busy day [16:55:32] ok [16:55:39] hashar: magic, i'm into the new instance [16:55:39] thanks [16:56:43] ottomata: \O/ [16:57:10] ottomata: I guess the previous one had a bad initial provisioning which prevented it from running puppet [16:57:12] ottomata: I am happy to see it fixed :} [16:57:32] aye [16:57:34] ya thanks [16:59:48] hm hashar except, puppet won't run to connect to the deployment-prep puppetmaster :( [16:59:55] -;( [16:59:56] certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster02.deployment-prep.eqiad.wmflabs [17:00:00] ah yeah [17:00:24] ottomata: puppet is broken for new instances when the project has a puppet master [17:00:25] https://phabricator.wikimedia.org/T152941 [17:00:28] haha [17:00:29] that has the workaround to copy paste [17:00:30] :((( [17:00:32] ok [17:01:14] have a good afternoon! [17:01:45] laters! [17:01:48] have a good one [17:01:50] (gonna keep posting here, feel free to ignore) [17:04:57] few ok, some version of the workaround worked, but not quite the one(s) pasted [17:04:59] phew* [17:05:10] PROBLEM - Puppet errors on deployment-eventlog02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [17:06:49] say whaaa? [17:06:50] Provider scap3 is not functional on this host [17:06:51] ?? [17:10:17] no scap package for trusty? [17:10:29] thcipriani: ? [17:11:21] ottomata: which host? [17:11:28] deployment-eventlog02 [17:11:31] in deployment-prep [17:12:02] missing python-semver ? [17:12:16] hrm, scap is not installed on that host, but there is an available...oh [17:12:32] its a brand new instance [17:12:37] i'm trying to spin up a new beta eventlogging host there [17:12:41] since the old one seems bugsted [17:12:42] busted [17:12:46] and we've wanted to make a new one anyway [17:13:40] blerg. I thought we added that dependency a while ago, but all the trusty instances probably had scap installed by that point. [17:13:56] why a new trusty host? [17:14:48] install python-semver from xenial [17:14:50] works for me [17:14:58] lemme see where we're using semver, I can't remember...maybe we can move it to suggests [17:15:06] RainbowSprinkles: ^ [17:15:08] http://mirrors.kernel.org/ubuntu/pool/universe/p/python-semver/python-semver_2.0.1-1_all.deb [17:15:12] thcipriani: i'm replacing an old trusty host [17:15:13] because it broke [17:15:19] EL stuff still uses upstart [17:15:22] I haven't used it yet [17:15:24] big task to change [17:15:30] It was just to get the dependency in for future use [17:15:32] ottomata why not go with jessie? [17:15:47] need upstart for now [17:15:57] that's on jessie too [17:16:03] oh? [17:16:05] hmmm [17:16:10] https://packages.debian.org/jessie/upstart [17:16:25] RainbowSprinkles: can we move it to suggests for now? 
[17:16:32] That's fine [17:16:35] * thcipriani does [17:16:39] paladox: interesting [17:16:45] i would prefer to go with trusty if we can for now [17:16:49] but that might help us migrate faster in the future [17:16:50] it's not installed by default though [17:16:53] we're on trusty in prod [17:16:56] oh [17:17:13] and we have this instance in beta to test prod deployments (and puppet) so ya [17:18:06] iridium would have problems updating scap too [17:18:10] as it's also trusty [17:18:33] ottomata wget http://mirrors.kernel.org/ubuntu/pool/universe/p/python-semver/python-semver_2.0.1-1_all.deb [17:18:39] dpkg -i python-semver_2.0.1-1_all.deb [17:18:42] apt-get install scap [17:18:45] that should work [17:18:46] :) [17:18:47] n2it :) [17:19:16] RainbowSprinkles: could you bless https://phabricator.wikimedia.org/D724 [17:19:20] thanks paladox [17:19:23] (because python-semver is not on trusty but is in xenial. Tested on a trusty instance myself and found no conflicts) [17:19:26] your welcome :) [17:19:27] aye [17:19:46] thcipriani: Ok, I did the holy incantations and lit some incense. [17:20:07] :D [17:20:21] thcipriani we could backport https://packages.ubuntu.com/xenial/python-semver onto the trusty wikimedia apt repo. [17:20:30] it has no conflicts [17:20:32] Or we could just stop using trusty ;-) [17:20:37] yeh :) [17:20:44] (continuing to /make trusty work/ is a losing battle :)) [17:24:04] ottomata: FWIW, whenever https://phabricator.wikimedia.org/D724 makes its way through the pipes (you'll see the scap version update on beta to something that contains 20170720) you should be able to install [17:24:34] "the pipes" == "jenkins debian glue" [17:38:15] 10Release-Engineering-Team, 10Packaging, 10Release: MediaWiki 1.29 tarball comes with the wrong extensions - and misses some - https://phabricator.wikimedia.org/T171197#3457185 (10Joergi123) [17:41:04] 10Release-Engineering-Team, 10Packaging, 10Release: MediaWiki 1.29 tarball comes with the wrong extensions - and misses some - https://phabricator.wikimedia.org/T171197#3457199 (10Joergi123) Reedy wrote on IRC: SimpleAntiSpam was there in 1.22 and removed in 1.23 Vector was removed in 1.23 Looks like a bug... [17:44:23] hm, other q [17:44:27] i'm on deployment-tin [17:44:28] in [17:44:37] oh wait, i think i know [17:45:12] RECOVERY - Puppet errors on deployment-eventlog02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:54:06] Yippee, build fixed! [17:54:06] Project mediawiki-core-code-coverage build #2895: 09FIXED in 2 hr 54 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2895/ [18:02:07] 10Release-Engineering-Team, 10Release: MediaWiki 1.29 tarball comes with the wrong extensions - and misses some - https://phabricator.wikimedia.org/T171197#3457290 (10Aklapper) [18:02:57] 10Release-Engineering-Team (Kanban), 10Operations, 10Phabricator: replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3457291 (10Cmjohnson) a:05Cmjohnson>03RobH Disk has been replaced: Return shipping info is USPS 9202 3946 5301 2436 1520 81 FEDEX 96119... [18:03:08] 10Release-Engineering-Team, 10MW-1.29-release, 10Release: MediaWiki 1.29 tarball comes with the wrong extensions - and misses some - https://phabricator.wikimedia.org/T171197#3457293 (10greg) [18:09:11] hi! 
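[editor's note] The trusty workaround paladox spells out above, gathered into one copy-pasteable block. It assumes sudo on the target host; the package URL is the xenial build he links.

```bash
# Same three steps as quoted above: pull python-semver from xenial, install it,
# then install scap normally.
wget http://mirrors.kernel.org/ubuntu/pool/universe/p/python-semver/python-semver_2.0.1-1_all.deb
sudo dpkg -i python-semver_2.0.1-1_all.deb
sudo apt-get install scap
```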
[18:09:26] i'm struggling to figure out why these tests are failing: https://integration.wikimedia.org/ci/job/mwext-donationinterfacecore-REL1_27-zend56-jessie/76/console [18:09:33] when they are not failing locally for me or ejegg [18:20:32] PROBLEM - Host deployment-eventlogging03 is DOWN: CRITICAL - Host Unreachable (10.68.18.111) [18:28:10] 10Scap, 10ORES, 10Scoring-platform-team-Backlog: ORES deployment finish "successfully" even when uwsgi and celery fail to successfully start up - https://phabricator.wikimedia.org/T170950#3457373 (10Ladsgroup) [18:34:23] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10VPS-Projects, 10Services (watching): New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3457390 (10mobrovac) [18:35:05] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10VPS-Projects, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3457392 (10mobrovac) [18:37:50] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next), 10User-Joe: puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3457398 (10mobrovac) p:05Triage>03High Setting to high prio, as this is now precluding us from logging into the boxes and... [18:40:56] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next), 10User-Joe: puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3456426 (10Paladox) @mobrovac you could remove the puppet class from the instance. Restart the instance after that wait 5-10 m... [18:54:45] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:34:43] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [20:08:40] Project selenium-MinervaNeue » chrome,beta,Linux,BrowserTests build #16: 04FAILURE in 1 hr 19 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/16/ [20:15:33] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services (next), 10User-Joe: puppet dependency loop on deployment-sca hosts - https://phabricator.wikimedia.org/T171173#3457781 (10hashar) I did remove `profile::recommendation_api` on deployment-sca01 earlier but was hitting another puppet issue... [20:19:03] 10Release-Engineering-Team, 10MW-1.29-release, 10Release: MediaWiki 1.29 tarball comes with the wrong extensions - and misses some - https://phabricator.wikimedia.org/T171197#3457792 (10MacFan4000) a:03MacFan4000 [20:19:21] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10VPS-Projects, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3457796 (10hashar) [20:19:26] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban), 10Operations, 10VPS-Projects, 10Services (watching): New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3457793 (10hashar) 05Open>03Resolved a:03Ottomata Andrew has delet... 
[20:23:39] RECOVERY - Puppet errors on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:25:15] twentyafterfour hi, question for you since your git-ssh codfw patch will probaly take alot longer to be reviewed, we could use git-ssh from eqiad instead of it being in codfw for now. The question is can traffic in codfw reach eqiad git-ssh? [20:34:29] 10Release-Engineering-Team (Kanban), 10Operations, 10Phabricator: replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3457839 (10RobH) [20:39:39] well, i just realized i installed phab1001 with jessie, and I guess i should have asked if it could be stretch [20:39:53] paladox: jessie fine or should this be strech? [20:39:54] stretch [20:40:25] robh uh, will have to forward that to releng (twentyafterfour, greg-g) [20:40:41] i pinged since you asked me about the install [20:40:43] but stretch wont work, will have to be jessie i think but will leave that up to releng [20:40:45] i assumed you were involved ;] [20:40:51] i would assume if stretch supports everything phab needs i see why not [20:40:58] itd be a nice way to find out no? [20:41:05] Zppix php7 is not supported by phabricator [20:41:12] php7.1 is but that's not in stretch [20:41:21] you could downgrade no? [20:41:26] well, jessie is being isntalled now but it could be changed to stretch [20:41:29] installed even [20:41:32] ok thanks [20:41:54] Zppix no it carn't php5 wont work on stretch [20:42:04] needs to be compiled by someone. [20:42:52] paladox: can't [20:42:55] there's never an r in it [20:43:09] woops sorry. thanks [20:44:24] 10Release-Engineering-Team (Kanban), 10Release, 10Train Deployments: MW 1.30.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T168050#3457863 (10demon) 05Open>03Resolved [20:44:39] PROBLEM - Puppet errors on deployment-sca01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:53:48] robh: please wait until twentyafterfour gives you an answer [20:54:05] greg-g: well, i had already had jessie installing when i asked ;] [20:54:13] robh: jessie should be fine [20:54:15] so its really already done, but same amount of work to reimage [20:54:23] robh: mostly just "mukunda is authority" :) [20:54:25] ie: if stretch seems something to try im happy to reimage whenever! 
[20:58:41] RECOVERY - Puppet errors on deployment-sca04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:15] (03PS1) 10Thcipriani: Dockerfiles use build container pattern [integration/config] - 10https://gerrit.wikimedia.org/r/366726 (https://phabricator.wikimedia.org/T166888) [21:30:53] (03PS2) 10Thcipriani: Dockerfiles use build container pattern [integration/config] - 10https://gerrit.wikimedia.org/r/366726 (https://phabricator.wikimedia.org/T166888) [22:13:40] 10MediaWiki-Codesniffer: MediaWiki.ExtraCharacters.CharacterBeforePHPOpeningTag.Found broken on hhvm-fatal-error.php - https://phabricator.wikimedia.org/T171234#3458354 (10Reedy) [22:14:29] 10MediaWiki-Codesniffer: MediaWiki.ExtraCharacters.CharacterBeforePHPOpeningTag.Found broken on hhvm-fatal-error.php - https://phabricator.wikimedia.org/T171234#3458370 (10Reedy) [22:17:29] 10Browser-Tests-Infrastructure, 10MinervaNeue, 10Reading-Web-Backlog: MinervaNeue browser test are flaking (waiting for {:class=>"mw-notification", :tag_name=>"div"} to become present ) - https://phabricator.wikimedia.org/T170890#3458397 (10Jdlrobson) [22:17:32] 10Browser-Tests-Infrastructure, 10MinervaNeue, 10Reading-Web-Backlog: MinervaNeue browser test are flaking (waiting for {:class=>"mw-notification", :tag_name=>"div"} to become present ) - https://phabricator.wikimedia.org/T170890#3446357 (10Jdlrobson) p:05Normal>03High This is happening more and more oft... [22:37:41] (03CR) 10Thcipriani: [C: 032] "This is live now and appears working." [integration/config] - 10https://gerrit.wikimedia.org/r/366726 (https://phabricator.wikimedia.org/T166888) (owner: 10Thcipriani) [22:38:42] (03Merged) 10jenkins-bot: Dockerfiles use build container pattern [integration/config] - 10https://gerrit.wikimedia.org/r/366726 (https://phabricator.wikimedia.org/T166888) (owner: 10Thcipriani)
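[editor's note] The change merged above names the "build container pattern". As a general illustration only (file names and image tags below are hypothetical, not the actual integration/config Dockerfiles), the idea is to build artifacts in a throwaway fat image and copy only the result into the image that actually runs.

```bash
# Generic sketch of the build-container pattern, not the merged change itself.
docker build -t ci-example-build -f Dockerfile.build .     # fat image with the full toolchain
docker run --rm -v "$PWD/dist:/out" ci-example-build \
  cp /build/app /out/                                      # extract the built artifact onto the host
docker build -t ci-example-run -f Dockerfile.run .         # slim runtime image that copies in ./dist/app
```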