[00:56:28] PROBLEM - Puppet failure on wmfbranch is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:01:59] (03PS3) 10Legoktm: [FEAT] Report multiple errors on each line [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247047 (owner: 10XZise) [01:02:05] (03CR) 10Legoktm: [C: 032] [FEAT] Report multiple errors on each line [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247047 (owner: 10XZise) [01:02:52] 10Deployment-Systems, 6operations: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1883535 (10Dzahn) after taking a look at tin and mira (and finding more inconsistencies and "system" users with UIDs over 10000) i suggest we do UID 120 for l10nupdate [01:02:53] (03Merged) 10jenkins-bot: [FEAT] Report multiple errors on each line [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247047 (owner: 10XZise) [01:04:31] (03PS6) 10Legoktm: [FEAT] Require phabricator bug ids [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247048 (owner: 10XZise) [01:06:03] (03CR) 10Legoktm: [C: 032] [FEAT] Require phabricator bug ids [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247048 (owner: 10XZise) [01:06:50] (03Merged) 10jenkins-bot: [FEAT] Require phabricator bug ids [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247048 (owner: 10XZise) [01:15:48] 10Continuous-Integration-Infrastructure: Move commit-message-validator.py tool out of integration/jenkins repository - https://phabricator.wikimedia.org/T121609#1883558 (10Legoktm) 3NEW [01:17:48] 10Deployment-Systems, 6operations, 5Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1856865 (10Dzahn) [01:31:26] RECOVERY - Puppet failure on wmfbranch is OK: OK: Less than 1.00% above the threshold [0.0] [01:57:28] PROBLEM - Puppet failure on wmfbranch is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:37:31] RECOVERY - Puppet failure on wmfbranch is OK: OK: Less than 1.00% above the threshold [0.0] [04:48:45] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:48:45] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:43] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:43] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:44] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:45] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:47] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:47] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:48] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:49] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:33:23] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [08:33:23] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [08:33:41] ^ labs nfs issues, everything is down [08:33:44] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #286: 04FAILURE in 2 min 2 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/286/ [08:33:53] RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0] [08:33:53] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:33:53] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:33:54] RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0] [08:34:17] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #838: 04FAILURE in 2 min 5 sec: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/838/ [08:34:39] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #3154: 04FAILURE in 1 hr 11 min: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/3154/ [08:34:40] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #815: 04FAILURE in 2 min 6 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/815/ [08:34:43] Project browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #798: 04FAILURE in 2 min 5 sec: https://integration.wikimedia.org/ci/job/browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/798/ [08:34:48] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [08:34:48] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [08:34:48] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [08:34:48] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [08:34:49] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [08:34:49] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [08:34:50] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39736 bytes in 0.863 second response time [08:34:50] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [08:34:50] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.938 second response time [08:34:50] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [08:34:50] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.887 second response time [08:34:50] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [08:34:50] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [08:34:50] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.876 second response time [08:34:50] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [08:34:56] PROBLEM - Puppet failure on deployment-urldownloader is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [08:35:02] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [08:35:34] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [08:35:46] PROBLEM - Puppet failure on deployment-tmh01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [08:35:48] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [08:35:57] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.876 second response time [08:35:57] PROBLEM - Puppet failure on deployment-logstash2 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [08:36:35] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39737 bytes in 1.718 second response time [08:36:35] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 39411 bytes in 1.389 second response time [08:37:02] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31061 bytes in 0.975 second response time [08:37:10] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [08:37:34] PROBLEM - Puppet failure on mira is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [08:38:13] PROBLEM - Puppet failure on deployment-kafka04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [08:38:51] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39736 bytes in 0.863 second response time [08:38:53] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 39413 bytes in 1.162 second response time [08:40:29] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [08:40:39] PROBLEM - Puppet failure on deployment-conf03 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [08:40:53] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [08:43:11] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [08:45:47] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0] [08:48:03] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.056 second response time [08:49:30] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.044 second response time [08:53:58] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:58] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:58] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:58] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:57:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.460 second response time [08:57:32] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39744 bytes in 0.584 second response time [08:57:32] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.545 second response time [08:58:01] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31061 bytes in 0.533 second response time [09:01:29] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [09:02:54] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [09:10:48] RECOVERY - Puppet failure on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:13:56] RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:15:11] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0] [09:15:51] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [09:17:14] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:20:31] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:20:55] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [09:22:48] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [09:23:13] RECOVERY - Puppet failure on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [09:24:28] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [09:25:36] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [09:25:48] RECOVERY - Puppet failure on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [09:25:48] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [09:27:02] RECOVERY - Puppet failure on deployment-apertium01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:27:25] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [09:27:41] RECOVERY - Puppet failure on mira is OK: OK: Less than 1.00% above the threshold [0.0] [09:28:06] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [09:29:54] RECOVERY - Puppet failure on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [09:30:55] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [09:34:34] RECOVERY - Puppet failure on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:38:23] Yippee, build fixed! [09:38:24] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #705: 09FIXED in 1 min 21 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/705/ [10:16:04] 10Deployment-Systems, 6Release-Engineering-Team: Where should we branch for Wikimedia wikis? - https://phabricator.wikimedia.org/T121570#1884027 (10hashar) We would need to speed up the make-wmf-branch. From a quick glance at the code it: - `git clone` everything afresh from Gerrit or `git pull` if --path has... [10:20:55] (03PS1) 10Hashar: Revert "Deleted failing Flow browsertests Jenkins job" [integration/config] - 10https://gerrit.wikimedia.org/r/259470 [10:20:59] (03PS2) 10Hashar: Revert "Deleted failing Flow browsertests Jenkins job" [integration/config] - 10https://gerrit.wikimedia.org/r/259470 [10:23:34] (03PS3) 10Hashar: Restore Flow browser tests on Win8.0/IE 10 [integration/config] - 10https://gerrit.wikimedia.org/r/259470 [10:24:14] (03CR) 10Hashar: [C: 04-1] "Need the gem to be bumped in Flow https://gerrit.wikimedia.org/r/#/c/257338/" [integration/config] - 10https://gerrit.wikimedia.org/r/259470 (owner: 10Hashar) [10:28:10] (03CR) 10Hashar: "With mediawiki_selenium >=1.6.3 , we can restore the IE/Android jobs." [integration/config] - 10https://gerrit.wikimedia.org/r/242880 (https://phabricator.wikimedia.org/T94151) (owner: 10Zfilipin) [11:03:26] 7Browser-Tests, 10VisualEditor, 5Patch-For-Review: Delete or fix failed VisualEditor browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94162#1884172 (10zeljkofilipin) a:3zeljkofilipin [11:17:13] PROBLEM - Host deployment-cache-parsoid04 is DOWN: CRITICAL - Host Unreachable (10.68.19.197) [11:36:28] zeljkof: good morning :-} [11:36:49] zeljkof: so Flow can probably get IE added :-} [11:36:52] hashar: morning! [11:36:58] will test that today [11:37:23] need to pay a bill or two and back to T114241 [11:37:46] if you get time this afternoon, I would love to discuss about CentralAuth failing ( https://integration.wikimedia.org/ci/job/browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/ ). Somehow it does not send the mediawiki_user but some previous value of `user` variable :-( [11:38:00] and I can't reproduce locally (i.e.: it just works for me) [11:38:32] hashar: I should be working on that in 30-60 minutes [11:38:44] about to head to lunch [11:38:57] I am really curious about CentralAuth, seems to be a low hanging fruit to get it green again [11:39:14] and there is most probably a bug in the way we run the tests on CI [11:41:54] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884245 (10hashar) That... [11:43:08] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884251 (10hashar) [11:44:39] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#747903 (10hashar) [11:45:04] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884276 (10hashar) a:5... [11:57:43] 10Beta-Cluster-Infrastructure: Beta not picking up merged change - https://phabricator.wikimedia.org/T75659#1884288 (10hashar) [11:57:48] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884290 (10hashar) [12:19:10] (03PS1) 10Paladox: [BlueSpiceFoundation] Update Jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/259481 [12:19:54] hashar: Could you merge https://gerrit.wikimedia.org/r/#/c/259481/ please. [12:22:26] paladox: will look in a few; digging logs right now :( [12:26:36] hashar: Ok thanks. [12:34:36] (03CR) 10Hashar: [C: 032] [BlueSpiceFoundation] Update Jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox) [12:34:44] (03PS1) 10Phedenskog: If one batch fails for WPT, fail the whole job. [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365) [12:34:44] paladox: great thank you :-) [12:36:18] (03Merged) 10jenkins-bot: [BlueSpiceFoundation] Update Jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox) [12:38:03] (03CR) 10Paladox: "Thanks." [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox) [12:38:17] (03CR) 10Hashar: "Deployed. Did a recheck on the last merged change https://gerrit.wikimedia.org/r/#/c/255446/ and it is all green:" [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox) [12:38:21] paladox: all good thanks! [12:38:28] paladox: Ok thanks. [12:46:25] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-ContentTranslation, 7WorkType-Maintenance: ContentTranslation phpunit run very slow due to inclusion of Scribunto and Wikibase - https://phabricator.wikimedia.org/T121595#1884317 (10Amire80) p:5Triage>3Normal [12:46:35] Yippee, build fixed! [12:46:36] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #3155: 09FIXED in 34 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/3155/ [12:49:10] (03PS2) 10Hashar: If one batch fails for WPT, fail the whole job. [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365) (owner: 10Phedenskog) [12:49:46] phedenskog: do you know how to deploy Jenkins jobs ? :-} would be happy to introduce you to the magic [12:52:12] (03CR) 10Hashar: [C: 032] "Tested on a Trusty instance with bash 4.3.11. I have refreshed both jobs:" [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365) (owner: 10Phedenskog) [12:53:50] (03Merged) 10jenkins-bot: If one batch fails for WPT, fail the whole job. [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365) (owner: 10Phedenskog) [12:54:50] Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #691: 04FAILURE in 49 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/691/ [12:58:25] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-ContentTranslation, 7WorkType-Maintenance: ContentTranslation phpunit run very slow due to inclusion of Scribunto and Wikibase - https://phabricator.wikimedia.org/T121595#1884328 (10Amire80) > I'm not sure why this extension is loading these two... [13:15:55] !log Gerrit , on mediawiki/services/mathoid force pushed gh-pages branch from Github to Gerrit repo . Attempting to fix Gerrit replication issue ( https://phabricator.wikimedia.org/T121635 ) [13:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [13:18:00] !log Gerrit created https://github.com/wikimedia/operations-debs-bloomd | https://phabricator.wikimedia.org/T121635 [13:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [13:18:57] !log Gerrit created https://github.com/wikimedia/thumbor-svg-engine | https://phabricator.wikimedia.org/T121635 [13:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [13:37:43] hashar: there is nothing like list of unit tests in order of time spent, to kill some low hanging fruit? [13:48:38] PROBLEM - Puppet failure on deployment-conf03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [13:54:44] Nikerabbit: hey [13:55:02] Nikerabbit: the Jenkins job runs phpunit with JUnit output which capture the time to run each test [13:55:16] Nikerabbit: that is parsed by Jenkins and show up on each build under 'Test results' [13:55:48] so from https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/9524/console [13:55:56] on the left side bar click 'Test Results' [13:56:20] you get a summary of all tests with a '(root)' link that gives further details [13:56:44] ie https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/9524/testReport/(root)/ [13:56:49] you can then sort by "Duration" [13:57:17] For Zend, I have extracted the parsertests to their own Jenkins job https://integration.wikimedia.org/ci/job/mediawiki-phpunit-parsertests-zend/250/testReport/(root)/ [13:57:52] hashar: wow [13:57:58] there one can see that a test takes at a minimum 140ms and we have more than 1300 [13:58:16] a bunch of them takes 200-300 ms [13:58:33] one of the trouble is the overhead caused by our phpunit wrapper [13:58:45] we do a lot of things before/after each parser tests that slow down stuff :-( [13:58:54] hashar: https://integration.wikimedia.org/ci/job/mwext-testextension-zend/17047/testReport/(root)/ here the thing to blame is very obvious [13:59:14] Nikerabbit: yup [13:59:35] Nikerabbit: though the tests probably consume a dataprovider [13:59:44] so it is really a lot more tests being run behind the hood [14:00:04] hashar: naturally, but what is the empty named test? [14:00:17] the 50 seconds one Scribunto_LuaUriLibraryTests::testLua apparently has ~300 tests https://integration.wikimedia.org/ci/job/mwext-testextension-zend/17047/testReport/(root)/Scribunto_LuaUriLibraryTests__testLua/ [14:00:35] the empty test, I have no idea :( [14:00:44] something is bugged somewhere :-(((( [14:01:20] 1000 tests 30 ms each [14:01:50] that is how you end up with 30 sec delay :-} [14:02:30] iirc Scribunto shells out to a binary [14:03:30] Hashar: no I don't yes please. let setup a time so we can talk. [14:08:01] phedenskog: I guess anytime tomorrow between 11am and 5pm CET (UTC+1) would do [14:08:44] phedenskog: either over hangouts or IRC I dont mind. We could even talk about having Jenkins to run the webperf tests in parallel to provide earlier feedback or run them more often [14:09:09] hashar: great, I'll send an invite, how much time do you think we should have? [14:09:27] Nikerabbit: for core I found out we update the Sqlite/Mysql SearchEngine on each tests even when there is no reason to do so. https://gerrit.wikimedia.org/r/#/c/247560/ [14:09:49] Nikerabbit: and I have a very lame patch to enable the profiler when running phpunit.php https://gerrit.wikimedia.org/r/#/c/247554/ :} [14:10:11] phedenskog: if you are familiar with python, 30 - 40 minutes would do [14:10:17] phedenskog: else 1 hour or so. [14:10:28] hashar: ok! [14:10:29] phedenskog: we can have several sessions as needed [14:11:45] phedenskog: my calendar is on amusso @ wikimedia.org and should be up-to-date [14:11:53] hashar: perfect, thanks:) [14:13:47] yeah 12 hours should covers everything 11:00 – 23:45 :D [14:14:45] if you feel adventurous this afternoon, there is a basic tutorial at https://www.mediawiki.org/wiki/CI/JJB [14:14:48] hashar: thanks for all the work again ;) [14:14:54] if you get some spare time, might be worth looking at [14:15:34] Nikerabbit: maybe I should fill a tracking tasks to improve the phpunit.php wrapper speed and a bunch of child tasks with ideas/low hanging fruits [14:18:23] I have no idea how much overhead the wrapper adds [14:28:34] Nikerabbit: well each time we do an article edit, the search engine is updated. But in almost case we never look at the search db so those inserts/deletes are unneeded [14:28:47] Nikerabbit: the parser test also keep trying and deleting a bunch of files on each test :/ [14:34:19] (03PS3) 10Hashar: Mathoid: Use mathoid-deploy-npm for the deploy repo patches [integration/config] - 10https://gerrit.wikimedia.org/r/259282 (owner: 10Mobrovac) [14:35:39] (03CR) 10Hashar: [C: 032] "\O/ thank you!" [integration/config] - 10https://gerrit.wikimedia.org/r/259282 (owner: 10Mobrovac) [14:36:55] (03Merged) 10jenkins-bot: Mathoid: Use mathoid-deploy-npm for the deploy repo patches [integration/config] - 10https://gerrit.wikimedia.org/r/259282 (owner: 10Mobrovac) [14:40:23] mobrovac: I have screwed up some beta cluster puppet patches apparently :( [14:40:36] ah? [14:40:44] mobrovac: /var/lib/git/operations/puppet on deployment-puppetmaster had a dirty space [14:40:44] so I did a rebase --abort [14:41:10] hashar: not a pb, my rb patches have been merged today :) [14:41:14] http://paste.openstack.org/show/482081/ [14:41:14] oh [14:41:17] yeah that is the reason [14:41:18] will rebase [14:41:46] !log beta puppetmaster: drop cherry pick 02c2006 RESTBase: Switch to service::node -- got merged [14:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:43:26] !log beta puppetmaster is rebased and up-to-date (upstream at d3e1f70 ) [14:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:43:30] mobrovac: solved! [14:43:37] :) [14:53:42] RECOVERY - Puppet failure on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:53:54] (03PS4) 10Hashar: Restore Flow browser tests on Win8.0/IE 10 [integration/config] - 10https://gerrit.wikimedia.org/r/259470 [14:55:06] (03CR) 10Sbisson: [C: 031] "Sounds good, but stabilizing the tests in IE is not gonna be a walk in the park..." [integration/config] - 10https://gerrit.wikimedia.org/r/259470 (owner: 10Hashar) [14:56:46] (03CR) 10Hashar: "Well we can hold on this until Flow developers stabilize the tests on their local machine? There is little point in having a job spammin" [integration/config] - 10https://gerrit.wikimedia.org/r/259470 (owner: 10Hashar) [15:11:42] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884499 (10zeljkofilipin) [15:19:01] hashar: is there some kind of hidden queue not shown on https://integration.wikimedia.org/zuul/ ? [15:19:14] I don't see https://gerrit.wikimedia.org/r/#/c/259509/ there [15:20:11] ok, now it appeared there in back of the test queue [15:21:33] (03CR) 10Hashar: [C: 04-1] Enable submodules for operations/mediawiki-config phpunit tests (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson) [15:22:05] Nikerabbit: there is a 5 - 10 secs delay [15:23:02] hashar: more like 25 seconds ;) [15:29:12] 10MediaWiki-Releasing, 6Developer-Relations, 10Wikimedia-Blog-Content, 3DevRel-December-2015, 5MW-1.26-release: Write blog post announcing MW 1.26 - https://phabricator.wikimedia.org/T112842#1884527 (10Qgil) 5Open>3Resolved It was published on Monday: http://blog.wikimedia.org/2015/12/14/new-mediawik... [15:34:20] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884537 (10zeljkofilipin) [15:34:41] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1689201 (10zeljkofilipin) [15:42:27] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884554 (10zeljkofilipin) [15:50:12] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884581 (10zeljkofilipin) [15:51:00] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1689201 (10zeljkofilipin) [15:56:19] thcipriani: we’re still seeing evidence of a bunch of puppet breakage in staging. Did you get a change to look through those? [15:56:27] latest list is at the bottom here: https://etherpad.wikimedia.org/p/remaining-ldap [16:05:21] hey, am reading new (to me) scap3 docs, thanks so much for updating! docs looking very nice :) [16:06:39] i have a few qs, lemme know if someone is around who might answer (mostly about scap/ directory) [16:07:53] andrewbogott: I'll check it today. I thought I got them all, but I must've missed some. There's not too much use of the staging project at the moment, so it's not terribly critical. Thanks for the heads up. [16:08:20] ottomata: I could probably help out, what's up? [16:08:48] just not sure what the best practice is around where to put that...maybe there isn't one yet [16:08:53] i don't really want to add it to my repo [16:09:01] i suppose a separate git repo would be fine, that I just clone on tin. [16:09:04] but, it would be nice if that was puppetized [16:09:05] so [16:09:16] mabye I should puppetize a git::clone on tin at $deploypath/scap [16:09:16] ? [16:10:54] yeah, one thing that may not be mentioned in the docs is that scap is agnostic about how you include the directory in the deploy path [16:11:02] yeah, it kinda says that [16:11:16] maybe i'll just do a git repo and manually clone for now? hmm. [16:11:32] that would work, sure. [16:11:40] here is a quick easy q: [16:11:44] doc says "When you specify a service_port, the port specified will be checked to see if it is accepting connections. " [16:11:47] waht does accespting connections mean? [16:11:48] just tcp? [16:12:11] yeah, basically does: nc -vz localhost port [16:12:16] k [16:12:19] but fancier and python-y [16:12:36] you can also specify the timeout for that (not sure if that's in docs or not) [16:12:58] i think i tis [16:13:01] also custom checks [16:13:16] thcipriani: is scap3 used in prod with any other service yet? [16:13:18] (or in beta?) [16:13:30] yeah, marxarelli did a lot of cool work with those checks. [16:14:13] ottomata: no, we're working on getting service things moved over. RESTBase used it briefly, then reverted back to trebuchet as it diverged too much with production. [16:14:37] (using scap for the config files diverged too much with production) [16:15:56] that being said, the way scap works from an end-user perspective is unlikely to change, mostly internal refactoring, docs, puppet work being done at this point. [16:16:04] k cool [16:16:12] gonna try in in beta now [16:16:54] ottomata: let me know how it goes, if anything fails for you I'm happy to troubleshoot. The more usage feedback we get the better the tool becomes. [16:17:45] thcipriani: if I made an environments/beta/mockbase file [16:17:49] that had targets for deployment in beta [16:17:59] would that override a plain mockbase file in scap/ [16:17:59] ? [16:18:27] only if you do: deploy -e beta [16:18:31] aye [16:18:36] so it will override dsh targets too? [16:18:46] you have to pass the environment specifically at that point. [16:18:52] then my scap.cfg wouldn't have to specify different server_groups? [16:19:08] let me doublecheck real quick. I did make a stink about that on the patch that added this functionality :P [16:20:28] hm, how does ssh_user work? [16:20:33] does it need a special key? [16:21:42] so yes to your question: an environments/beta/mockbase file should override a base mockbase file if you pass -e on the command line. [16:22:03] so ssh_user should be a user that can ssh into each target from tin. [16:23:02] 10Deployment-Systems, 6operations, 5Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1884698 (10bd808) [16:23:02] since port forwarding is disallowed in production, keyholder is likely the only way to make it happen at this point. [16:23:08] ok, so management of that user isn't magic [16:23:24] keyholder? as in i need to puppetize keys for that user appropriately? [16:23:56] HMmMMM https://wikitech.wikimedia.org/wiki/Keyholder [16:23:59] not heard of this... [16:24:04] indeed. I outlined a lot of what I've done thus-far for services here: https://doc.wikimedia.org/mw-tools-scap/scap3/ssh-access.html [16:24:34] awesome, reading... [16:29:04] cool thcipriani, and if i wanted to puppetize the git::clone of my scap config repo, i could add it to role::deployment::mything, since that is only applied on deploy servers, ja? [16:29:30] 6Release-Engineering-Team, 3Scap3, 7Security-General: Scap should be aware of security patches - https://phabricator.wikimedia.org/T118477#1884740 (10hashar) git apply trials ------- D80 is straightforward. It attempts to dry-run revert the security patches on the branch and will error out whenever it can... [16:30:40] ottomata: yeah, I believe that you also then have to add role::deployment::mything to role::deployment::server otherwise it wouldn't get applied anywhere, IIRC. [16:30:51] right [16:31:06] aye [16:31:36] 6Release-Engineering-Team, 3Scap3, 7Security-General: Scap should be aware of security patches - https://phabricator.wikimedia.org/T118477#1884753 (10hashar) If you could find a way to record a patch has been published in Gerrit, it would be helpful to propose to get rid of the associated .patch file. That... [16:32:22] (03PS7) 10Hashar: Enable submodules for operations/mediawiki-config phpunit tests [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson) [16:39:18] (03PS8) 10Hashar: Enable submodules for operations/mediawiki-config phpunit tests [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson) [16:41:00] thcipriani: [16:41:04] " 1. Adds the keyfile at puppet:///private/ssh/tin/deploy-mockbase_rsa to the keyholder" ? [16:41:09] on deployment-bastion as well? [16:41:19] i see there /etc/puppet/private/files/ssh/tin, which is a little funny? [16:41:21] !log updating operations-mw-config-phpunit to have it process submodules [16:41:23] sorry [16:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:41:26] deployment-puppetmaster? [16:41:29] (03CR) 10Hashar: "I have added a comment as to why the git definition is overridden. Made submodule to be explicit." [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson) [16:41:40] thcipriani: puppet master was crashed on staging-palladium and also there’s a merge error in /var/lib/git/operations/puppet which prevents puppet runs. [16:41:47] So, fix that and probably everything else will fall in line :) [16:41:57] * andrewbogott stops meddling [16:43:55] ottomata: you'd add keys to: deployment-puppetmaster:/var/lib/git/labs/private/files/ssh [16:45:04] (03CR) 10Hashar: [C: 032] "I did a recheck on https://gerrit.wikimedia.org/r/#/c/255135/" [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson) [16:45:29] ebernhardson: tyler pointed me to your Jenkins patch to get submodule processed on mw-config ( https://gerrit.wikimedia.org/r/#/c/256979/ ) [16:45:30] hashar: thanks! i was kinda taking a stab in the dark with that patch [16:45:35] ebernhardson: kudos, deployed and working fine! [16:45:54] ebernhardson: it was fine. I just added a comment and explicitly defined submodule: disabled: false [16:46:23] (03Merged) 10jenkins-bot: Enable submodules for operations/mediawiki-config phpunit tests [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson) [16:46:30] ebernhardson: it is deployed. So I guess now you can write some more tests that rely on whatever is in the submodules :-)))) Thank you a ton! [16:46:48] (03CR) 10Hashar: "For your tirelessly work on Jenkins, CI, tests, Librarifizification, here is the official Continuous integration barnstar!" [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson) [16:46:56] ebernhardson: here is your CI barnstorm ^^^^ [16:47:00] :) [16:47:00] barn star [16:47:10] thcipriani: aye, but why is there a tin directly in there? [16:47:14] :D [16:47:34] /var/lib/git/labs/private/files/ssh/tin [16:48:07] thcipriani, ottomata: we haven't used trebuchet in about a year [16:49:01] we used custom shell loops for a while, but then switched to ansible [16:49:14] ottomata: that is where puppetmaster on labs look for $passwords:: [16:49:30] and I guess tin is hardcoded in the puppet manifest [16:49:33] :( [16:50:03] but maybe some hiera data let us vary the file name :-} [16:51:54] thcipriani: why would you set recurse => true on file { '/srv/deployment/mockbase': [16:51:54] in the deployment target? [16:52:03] wouldn't that cause puppet to delete the deployed files? [16:52:09] since they are not managed by puppet? [16:53:10] ottomata: the intent was to ensure that permissions for all files in that directory are the same. [16:54:40] thcipriani: have you tried it? i think it deletes any non puppet managed file in that dir [16:55:27] ottomata: this was in a cherry-picked manifest on deployment-puppetmaster and I hadn't observed that [16:55:42] hm [16:56:20] This also enables the purge attribute, which can delete unmanaged files from a directory. See the description of purge for more details. [16:56:29] purge [16:56:29] Whether unmanaged files should be purged. This option only makes sense when ensure => directory and recurse => true. [16:56:47] maybe the defeault for purge is false [16:57:03] and 'enables the purge attribute' just means you can now use the purge attribute [16:57:20] guess i will try it :) [16:58:08] ottomata: if it is the case that the unmanaged files are purged, let me know and I'll update docs. [16:58:51] k [16:58:53] Niharika will send a patch to add a new extension (PageAssessment ) in integration/config [16:59:02] straightforward, should just change Zuul layout config. [16:59:33] patch and her should show up here in a few :} [16:59:37] I am off! *wave* [17:05:17] (03PS1) 10Niharika29: Setup CI for PageAssessments extension [integration/config] - 10https://gerrit.wikimedia.org/r/259529 [17:19:33] thcipriani: does the data.yaml group deploy-mockbase need to be added to the deploy server ? [17:19:39] "Finally, you’ll want to modify: modules/admin/data/data.yaml in the operations/puppet repo to create the deploy-mockbase group and add users to that group. [17:19:39] " [17:19:46] just adding the group isn't enough, right? [17:19:57] or, is the group membership looked up elsewhere than posix group membership on the deploy server? [17:23:23] so data.yaml is only used for production, so you'll have to add the group and the group members to ldap [17:23:44] but data.yaml does handle group and group membership. [17:23:58] to ldap? [17:24:01] hm [17:24:19] thcipriani: data.yaml handles it? [17:24:21] ok well [17:24:22] in prod [17:24:24] i'm asking [17:24:30] does the posix group need to exist on tin [17:24:36] with its members [17:24:37] or [17:24:47] is just the presense of the group config in data.yaml enough? [17:24:48] e.g. [17:25:05] the group needs to exist on tin with its members. [17:25:21] ok, so somewhere in puppet tin will need to declare all of the groups it needs [17:25:22] hm. [17:25:50] via admin::groups: [17:25:53] i guess [17:26:22] ergh, which probably means that every deployment target will need a new group created, eh? [17:27:01] i guess i'll defer that until production...although i'm not sure what group to use for keyholder trusted_group [17:27:22] every deployed thing will likely need a group of deployers. [17:27:49] aye, yeah, we have a few eventlogging groups right now, i can use one of those i think [17:27:59] we'll have to figure out some smarter sudo rules [17:29:44] thcipriani: how do I add a group in ldap/labs? [17:30:27] also, what is the proper way to add a private key to labs private? [17:30:34] just manually put it in there, and git commit ? [17:30:39] on deployment-puppetmaster? [17:30:57] and the proper path is /var/lib/git/labs/private/ssh/tin/deploy-mockbase_rsa [17:30:58] ? [17:31:20] ottomata: I was just digging through wikitech to try to figure that out if there's a way to add a group via that interface. This may be a question for -labs :\ [17:31:34] yes, that is the correct way to add a key for deployment-prep. [17:31:42] ook [17:38:19] thcipriani: oh, i see you added a passphrase file too? [17:38:24] for servicedeploy [17:38:32] or well, somewone did [17:38:35] should I do that? [17:38:42] or shoudl the key be passphrase less? [17:49:07] ottomata: yeah, I think the best practice is to have a passphrase although I'm not sure that's a hard-requirement. I just did it like that because it was done that way for the others. [17:53:13]