[00:56:28] PROBLEM - Puppet failure on wmfbranch is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:01:59] (03PS3) 10Legoktm: [FEAT] Report multiple errors on each line [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247047 (owner: 10XZise)
[01:02:05] (03CR) 10Legoktm: [C: 032] [FEAT] Report multiple errors on each line [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247047 (owner: 10XZise)
[01:02:52] 10Deployment-Systems, 6operations: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1883535 (10Dzahn) after taking a look at tin and mira (and finding more inconsistencies and "system" users with UIDs over 10000) i suggest we do UID 120 for l10nupdate
[01:02:53] (03Merged) 10jenkins-bot: [FEAT] Report multiple errors on each line [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247047 (owner: 10XZise)
[01:04:31] (03PS6) 10Legoktm: [FEAT] Require phabricator bug ids [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247048 (owner: 10XZise)
[01:06:03] (03CR) 10Legoktm: [C: 032] [FEAT] Require phabricator bug ids [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247048 (owner: 10XZise)
[01:06:50] (03Merged) 10jenkins-bot: [FEAT] Require phabricator bug ids [integration/jenkins] - 10https://gerrit.wikimedia.org/r/247048 (owner: 10XZise)
[01:15:48] 10Continuous-Integration-Infrastructure: Move commit-message-validator.py tool out of integration/jenkins repository - https://phabricator.wikimedia.org/T121609#1883558 (10Legoktm) 3NEW
[01:17:48] 10Deployment-Systems, 6operations, 5Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1856865 (10Dzahn)
[01:31:26] RECOVERY - Puppet failure on wmfbranch is OK: OK: Less than 1.00% above the threshold [0.0]
[01:57:28] PROBLEM - Puppet failure on wmfbranch is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[02:37:31] RECOVERY - Puppet failure on wmfbranch is OK: OK: Less than 1.00% above the threshold [0.0]
[04:48:45] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:48:45] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:43] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:44] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:47] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:47] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:48] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:33:23] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[08:33:41] ^ labs nfs issues, everything is down
[08:33:44] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #286: 04FAILURE in 2 min 2 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/286/
[08:33:53] RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[08:33:53] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:34:17] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #838: 04FAILURE in 2 min 5 sec: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/838/
[08:34:39] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #3154: 04FAILURE in 1 hr 11 min: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/3154/
[08:34:40] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #815: 04FAILURE in 2 min 6 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/815/
[08:34:43] Project browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #798: 04FAILURE in 2 min 5 sec: https://integration.wikimedia.org/ci/job/browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/798/
[08:34:48] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[08:34:48] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[08:34:48] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[08:34:48] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[08:34:49] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[08:34:49] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[08:34:50] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39736 bytes in 0.863 second response time
[08:34:50] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[08:34:50] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.938 second response time
[08:34:50] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[08:34:50] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.887 second response time
[08:34:50] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[08:34:50] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[08:34:50] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.876 second response time
[08:34:50] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[08:34:56] PROBLEM - Puppet failure on deployment-urldownloader is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[08:35:02] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[08:35:34] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[08:35:46] PROBLEM - Puppet failure on deployment-tmh01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[08:35:48] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[08:35:57] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.876 second response time
[08:35:57] PROBLEM - Puppet failure on deployment-logstash2 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[08:36:35] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39737 bytes in 1.718 second response time
[08:36:35] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 39411 bytes in 1.389 second response time
[08:37:02] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31061 bytes in 0.975 second response time
[08:37:10] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[08:37:34] PROBLEM - Puppet failure on mira is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[08:38:13] PROBLEM - Puppet failure on deployment-kafka04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[08:38:51] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39736 bytes in 0.863 second response time
[08:38:53] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 39413 bytes in 1.162 second response time
[08:40:29] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[08:40:39] PROBLEM - Puppet failure on deployment-conf03 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[08:40:53] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[08:43:11] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[08:45:47] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0]
[08:48:03] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.056 second response time
[08:49:30] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.044 second response time
[08:53:58] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:53:58] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:53:58] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:53:58] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:57:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.460 second response time
[08:57:32] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39744 bytes in 0.584 second response time
[08:57:32] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 39410 bytes in 0.545 second response time
[08:58:01] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31061 bytes in 0.533 second response time
[09:01:29] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:02:54] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:10:48] RECOVERY - Puppet failure on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:13:56] RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:15:11] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:15:51] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:17:14] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:20:31] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:20:55] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]
[09:22:48] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[09:23:13] RECOVERY - Puppet failure on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:24:28] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:25:36] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:25:48] RECOVERY - Puppet failure on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:25:48] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:27:02] RECOVERY - Puppet failure on deployment-apertium01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:27:25] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:27:41] RECOVERY - Puppet failure on mira is OK: OK: Less than 1.00% above the threshold [0.0]
[09:28:06] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]
[09:29:54] RECOVERY - Puppet failure on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0]
[09:30:55] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:34:34] RECOVERY - Puppet failure on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:38:23] Yippee, build fixed!
[09:38:24] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #705: 09FIXED in 1 min 21 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/705/
[10:16:04] 10Deployment-Systems, 6Release-Engineering-Team: Where should we branch for Wikimedia wikis? - https://phabricator.wikimedia.org/T121570#1884027 (10hashar) We would need to speed up the make-wmf-branch. From a quick glance at the code it: - `git clone` everything afresh from Gerrit or `git pull` if --path has...
[10:20:55] (03PS1) 10Hashar: Revert "Deleted failing Flow browsertests Jenkins job" [integration/config] - 10https://gerrit.wikimedia.org/r/259470
[10:20:59] (03PS2) 10Hashar: Revert "Deleted failing Flow browsertests Jenkins job" [integration/config] - 10https://gerrit.wikimedia.org/r/259470
[10:23:34] (03PS3) 10Hashar: Restore Flow browser tests on Win8.0/IE 10 [integration/config] - 10https://gerrit.wikimedia.org/r/259470
[10:24:14] (03CR) 10Hashar: [C: 04-1] "Need the gem to be bumped in Flow https://gerrit.wikimedia.org/r/#/c/257338/" [integration/config] - 10https://gerrit.wikimedia.org/r/259470 (owner: 10Hashar)
[10:28:10] (03CR) 10Hashar: "With mediawiki_selenium >=1.6.3 , we can restore the IE/Android jobs." [integration/config] - 10https://gerrit.wikimedia.org/r/242880 (https://phabricator.wikimedia.org/T94151) (owner: 10Zfilipin)
[11:03:26] 7Browser-Tests, 10VisualEditor, 5Patch-For-Review: Delete or fix failed VisualEditor browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94162#1884172 (10zeljkofilipin) a:3zeljkofilipin
[11:17:13] PROBLEM - Host deployment-cache-parsoid04 is DOWN: CRITICAL - Host Unreachable (10.68.19.197)
[11:36:28] zeljkof: good morning :-}
[11:36:49] zeljkof: so Flow can probably get IE added :-}
[11:36:52] hashar: morning!
[11:36:58] will test that today
[11:37:23] need to pay a bill or two and then back to T114241
[11:37:46] if you get time this afternoon, I would love to discuss the CentralAuth failures ( https://integration.wikimedia.org/ci/job/browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/ ). Somehow it does not send the mediawiki_user but some previous value of the `user` variable :-(
[11:38:00] and I can't reproduce locally (i.e.: it just works for me)
[11:38:32] hashar: I should be working on that in 30-60 minutes
[11:38:44] about to head to lunch
[11:38:57] I am really curious about CentralAuth, seems to be a low-hanging fruit to get it green again
[11:39:14] and there is most probably a bug in the way we run the tests on CI
[11:41:54] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884245 (10hashar) That...
[11:43:08] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884251 (10hashar)
[11:44:39] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#747903 (10hashar)
[11:45:04] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884276 (10hashar) a:5...
[11:57:43] 10Beta-Cluster-Infrastructure: Beta not picking up merged change - https://phabricator.wikimedia.org/T75659#1884288 (10hashar)
[11:57:48] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 7Jenkins, 7Upstream: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#1884290 (10hashar)
[12:19:10] (03PS1) 10Paladox: [BlueSpiceFoundation] Update Jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/259481
[12:19:54] hashar: Could you merge https://gerrit.wikimedia.org/r/#/c/259481/ please.
[12:22:26] paladox: will look in a few; digging logs right now :(
[12:26:36] hashar: Ok thanks.
[12:34:36] (03CR) 10Hashar: [C: 032] [BlueSpiceFoundation] Update Jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox)
[12:34:44] (03PS1) 10Phedenskog: If one batch fails for WPT, fail the whole job. [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365)
[12:34:44] paladox: great thank you :-)
[12:36:18] (03Merged) 10jenkins-bot: [BlueSpiceFoundation] Update Jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox)
[12:38:03] (03CR) 10Paladox: "Thanks." [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox)
[12:38:17] (03CR) 10Hashar: "Deployed. Did a recheck on the last merged change https://gerrit.wikimedia.org/r/#/c/255446/ and it is all green:" [integration/config] - 10https://gerrit.wikimedia.org/r/259481 (owner: 10Paladox)
[12:38:21] paladox: all good thanks!
[12:38:28] paladox: Ok thanks.
[12:46:25] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-ContentTranslation, 7WorkType-Maintenance: ContentTranslation phpunit run very slow due to inclusion of Scribunto and Wikibase - https://phabricator.wikimedia.org/T121595#1884317 (10Amire80) p:5Triage>3Normal
[12:46:35] Yippee, build fixed!
[12:46:36] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #3155: 09FIXED in 34 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/3155/
[12:49:10] (03PS2) 10Hashar: If one batch fails for WPT, fail the whole job. [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365) (owner: 10Phedenskog)
[12:49:46] phedenskog: do you know how to deploy Jenkins jobs ? :-} would be happy to introduce you to the magic
[12:52:12] (03CR) 10Hashar: [C: 032] "Tested on a Trusty instance with bash 4.3.11. I have refreshed both jobs:" [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365) (owner: 10Phedenskog)
[12:53:50] (03Merged) 10jenkins-bot: If one batch fails for WPT, fail the whole job. [integration/config] - 10https://gerrit.wikimedia.org/r/259484 (https://phabricator.wikimedia.org/T120365) (owner: 10Phedenskog)
[12:54:50] Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #691: 04FAILURE in 49 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/691/
[12:58:25] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-ContentTranslation, 7WorkType-Maintenance: ContentTranslation phpunit run very slow due to inclusion of Scribunto and Wikibase - https://phabricator.wikimedia.org/T121595#1884328 (10Amire80) > I'm not sure why this extension is loading these two...
[13:15:55] !log Gerrit: on mediawiki/services/mathoid, force pushed the gh-pages branch from GitHub to the Gerrit repo. Attempting to fix Gerrit replication issue ( https://phabricator.wikimedia.org/T121635 )
[13:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:18:00] !log Gerrit created https://github.com/wikimedia/operations-debs-bloomd | https://phabricator.wikimedia.org/T121635
[13:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:18:57] !log Gerrit created https://github.com/wikimedia/thumbor-svg-engine | https://phabricator.wikimedia.org/T121635
[13:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:37:43] hashar: is there nothing like a list of unit tests in order of time spent, to kill some low-hanging fruit?
[13:48:38] PROBLEM - Puppet failure on deployment-conf03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[13:54:44] Nikerabbit: hey
[13:55:02] Nikerabbit: the Jenkins job runs phpunit with JUnit output which captures the time to run each test
[13:55:16] Nikerabbit: that is parsed by Jenkins and shows up on each build under 'Test results'
[13:55:48] so from https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/9524/console
[13:55:56] on the left side bar click 'Test Results'
[13:56:20] you get a summary of all tests with a '(root)' link that gives further details
[13:56:44] ie https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/9524/testReport/(root)/
[13:56:49] you can then sort by "Duration"
[13:57:17] For Zend, I have extracted the parsertests to their own Jenkins job https://integration.wikimedia.org/ci/job/mediawiki-phpunit-parsertests-zend/250/testReport/(root)/
[13:57:52] hashar: wow
[13:57:58] there one can see that a test takes at a minimum 140ms and we have more than 1300
[13:58:16] a bunch of them take 200-300 ms
[13:58:33] one of the troubles is the overhead caused by our phpunit wrapper
[13:58:45] we do a lot of things before/after each parser test that slow down stuff :-(
[13:58:54] hashar: https://integration.wikimedia.org/ci/job/mwext-testextension-zend/17047/testReport/(root)/ here the thing to blame is very obvious
[13:59:14] Nikerabbit: yup
[13:59:35] Nikerabbit: though the tests probably consume a dataprovider
[13:59:44] so it is really a lot more tests being run under the hood
[14:00:04] hashar: naturally, but what is the empty-named test?
[14:00:17] the 50 seconds one Scribunto_LuaUriLibraryTests::testLua apparently has ~300 tests https://integration.wikimedia.org/ci/job/mwext-testextension-zend/17047/testReport/(root)/Scribunto_LuaUriLibraryTests__testLua/
[14:00:35] the empty test, I have no idea :(
[14:00:44] something is bugged somewhere :-((((
[14:01:20] 1000 tests 30 ms each
[14:01:50] that is how you end up with a 30 sec delay :-}
[14:02:30] iirc Scribunto shells out to a binary
[14:03:30] Hashar: no I don't. Yes please, let's set up a time so we can talk.
[14:08:01] phedenskog: I guess anytime tomorrow between 11am and 5pm CET (UTC+1) would do
[14:08:44] phedenskog: either over hangouts or IRC, I don't mind. We could even talk about having Jenkins run the webperf tests in parallel to provide earlier feedback, or run them more often
[14:09:09] hashar: great, I'll send an invite, how much time do you think we should have?
[14:09:27] Nikerabbit: for core I found out we update the Sqlite/Mysql SearchEngine on each test even when there is no reason to do so. https://gerrit.wikimedia.org/r/#/c/247560/
[14:09:49] Nikerabbit: and I have a very lame patch to enable the profiler when running phpunit.php https://gerrit.wikimedia.org/r/#/c/247554/ :}
[14:10:11] phedenskog: if you are familiar with python, 30 - 40 minutes would do
[14:10:17] phedenskog: else 1 hour or so.
[14:10:28] hashar: ok!
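As an aside, the per-test timings hashar points at above can also be pulled out of a JUnit XML report directly (PHPUnit's --log-junit option produces one). A minimal sketch, assuming a local junit.xml file — the file name and the "top 20" cutoff are illustrative:

```python
import xml.etree.ElementTree as ET

# Parse a JUnit XML report and print tests sorted by duration, slowest first.
tree = ET.parse('junit.xml')
cases = [
    (float(tc.get('time') or 0), tc.get('classname', ''), tc.get('name', ''))
    for tc in tree.iter('testcase')
]
for seconds, classname, name in sorted(cases, reverse=True)[:20]:
    print('%8.3fs  %s::%s' % (seconds, classname, name))
```

This is essentially what sorting the Jenkins "Test Results" page by Duration does, just offline.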
[14:10:29] phedenskog: we can have several sessions as needed
[14:11:45] phedenskog: my calendar is on amusso @ wikimedia.org and should be up-to-date
[14:11:53] hashar: perfect, thanks:)
[14:13:47] yeah 12 hours should cover everything 11:00 – 23:45 :D
[14:14:45] if you feel adventurous this afternoon, there is a basic tutorial at https://www.mediawiki.org/wiki/CI/JJB
[14:14:48] hashar: thanks for all the work again ;)
[14:14:54] if you get some spare time, might be worth looking at
[14:15:34] Nikerabbit: maybe I should file a tracking task to improve the phpunit.php wrapper speed, and a bunch of child tasks with ideas/low-hanging fruit
[14:18:23] I have no idea how much overhead the wrapper adds
[14:28:34] Nikerabbit: well each time we do an article edit, the search engine is updated. But in almost every case we never look at the search db, so those inserts/deletes are unneeded
[14:28:47] Nikerabbit: the parser tests also keep trying and deleting a bunch of files on each test :/
[14:34:19] (03PS3) 10Hashar: Mathoid: Use mathoid-deploy-npm for the deploy repo patches [integration/config] - 10https://gerrit.wikimedia.org/r/259282 (owner: 10Mobrovac)
[14:35:39] (03CR) 10Hashar: [C: 032] "\O/ thank you!" [integration/config] - 10https://gerrit.wikimedia.org/r/259282 (owner: 10Mobrovac)
[14:36:55] (03Merged) 10jenkins-bot: Mathoid: Use mathoid-deploy-npm for the deploy repo patches [integration/config] - 10https://gerrit.wikimedia.org/r/259282 (owner: 10Mobrovac)
[14:40:23] mobrovac: I have screwed up some beta cluster puppet patches apparently :(
[14:40:36] ah?
[14:40:44] mobrovac: /var/lib/git/operations/puppet on deployment-puppetmaster had a dirty workspace
[14:40:44] so I did a rebase --abort
[14:41:10] hashar: not a pb, my rb patches have been merged today :)
[14:41:14] http://paste.openstack.org/show/482081/
[14:41:14] oh
[14:41:17] yeah that is the reason
[14:41:18] will rebase
[14:41:46] !log beta puppetmaster: drop cherry pick 02c2006 RESTBase: Switch to service::node -- got merged
[14:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:43:26] !log beta puppetmaster is rebased and up-to-date (upstream at d3e1f70 )
[14:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:43:30] mobrovac: solved!
[14:43:37] :)
[14:53:42] RECOVERY - Puppet failure on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:53:54] (03PS4) 10Hashar: Restore Flow browser tests on Win8.0/IE 10 [integration/config] - 10https://gerrit.wikimedia.org/r/259470
[14:55:06] (03CR) 10Sbisson: [C: 031] "Sounds good, but stabilizing the tests in IE is not gonna be a walk in the park..." [integration/config] - 10https://gerrit.wikimedia.org/r/259470 (owner: 10Hashar)
[14:56:46] (03CR) 10Hashar: "Well we can hold on this until Flow developers stabilize the tests on their local machine? There is little point in having a job spammin" [integration/config] - 10https://gerrit.wikimedia.org/r/259470 (owner: 10Hashar)
[15:11:42] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884499 (10zeljkofilipin)
[15:19:01] hashar: is there some kind of hidden queue not shown on https://integration.wikimedia.org/zuul/ ?
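(Aside: the puppetmaster recovery hashar !logs above boils down to a short git sequence on deployment-puppetmaster. A sketch, assuming the usual setup where local cherry-picks ride on top of operations/puppet's production branch:)

```
cd /var/lib/git/operations/puppet
git rebase --abort                          # clear the half-finished rebase / dirty tree
git fetch origin
git log --oneline origin/production..HEAD   # list local cherry-picks; drop any merged upstream
git rebase origin/production                # replay the remaining picks onto upstream
```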
[15:19:14] I don't see https://gerrit.wikimedia.org/r/#/c/259509/ there
[15:20:11] ok, now it appeared there at the back of the test queue
[15:21:33] (03CR) 10Hashar: [C: 04-1] Enable submodules for operations/mediawiki-config phpunit tests (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson)
[15:22:05] Nikerabbit: there is a 5 - 10 secs delay
[15:23:02] hashar: more like 25 seconds ;)
[15:29:12] 10MediaWiki-Releasing, 6Developer-Relations, 10Wikimedia-Blog-Content, 3DevRel-December-2015, 5MW-1.26-release: Write blog post announcing MW 1.26 - https://phabricator.wikimedia.org/T112842#1884527 (10Qgil) 5Open>3Resolved It was published on Monday: http://blog.wikimedia.org/2015/12/14/new-mediawik...
[15:34:20] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884537 (10zeljkofilipin)
[15:34:41] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1689201 (10zeljkofilipin)
[15:42:27] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884554 (10zeljkofilipin)
[15:50:12] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1884581 (10zeljkofilipin)
[15:51:00] 6Release-Engineering-Team, 10Browser-Tests-Infrastructure, 6Security, 5MW-1.27-release-notes, and 2 others: Update all repositories that use mediawiki_selenium Ruby gem to version 1.6.x - https://phabricator.wikimedia.org/T114241#1689201 (10zeljkofilipin)
[15:56:19] thcipriani: we're still seeing evidence of a bunch of puppet breakage in staging. Did you get a chance to look through those?
[15:56:27] latest list is at the bottom here: https://etherpad.wikimedia.org/p/remaining-ldap
[16:05:21] hey, am reading the new (to me) scap3 docs, thanks so much for updating! docs looking very nice :)
[16:06:39] i have a few qs, lemme know if someone is around who might answer (mostly about the scap/ directory)
[16:07:53] andrewbogott: I'll check it today. I thought I got them all, but I must've missed some. There's not too much use of the staging project at the moment, so it's not terribly critical. Thanks for the heads up.
[16:08:20] ottomata: I could probably help out, what's up?
[16:08:48] just not sure what the best practice is around where to put that...maybe there isn't one yet
[16:08:53] i don't really want to add it to my repo
[16:09:01] i suppose a separate git repo would be fine, that I just clone on tin.
[16:09:04] but, it would be nice if that was puppetized
[16:09:05] so
[16:09:16] maybe I should puppetize a git::clone on tin at $deploypath/scap
[16:09:16] ?
[16:10:54] yeah, one thing that may not be mentioned in the docs is that scap is agnostic about how you include the directory in the deploy path
[16:11:02] yeah, it kinda says that
[16:11:16] maybe i'll just do a git repo and manually clone for now? hmm.
[16:11:32] that would work, sure.
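A rough sketch of what puppetizing that clone could look like, using the git::clone define from operations/puppet — the repository name and paths here are hypothetical placeholders, not an actual repo:

```puppet
# Keep the scap configuration repo checked out next to the deployed code
# on the deployment server (e.g. tin).
git::clone { 'eventlogging/scap-config':
    ensure    => 'latest',
    directory => '/srv/deployment/eventlogging/eventbus/scap',
    origin    => 'https://gerrit.wikimedia.org/r/eventlogging/scap-config',
}
```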
[16:11:40] here is a quick easy q:
[16:11:44] doc says "When you specify a service_port, the port specified will be checked to see if it is accepting connections. "
[16:11:47] what does accepting connections mean?
[16:11:48] just tcp?
[16:12:11] yeah, basically does: nc -vz localhost port
[16:12:16] k
[16:12:19] but fancier and python-y
[16:12:36] you can also specify the timeout for that (not sure if that's in docs or not)
[16:12:58] i think it is
[16:13:01] also custom checks
[16:13:16] thcipriani: is scap3 used in prod with any other service yet?
[16:13:18] (or in beta?)
[16:13:30] yeah, marxarelli did a lot of cool work with those checks.
[16:14:13] ottomata: no, we're working on getting service things moved over. RESTBase used it briefly, then reverted back to trebuchet as it diverged too much with production.
[16:14:37] (using scap for the config files diverged too much with production)
[16:15:56] that being said, the way scap works from an end-user perspective is unlikely to change, mostly internal refactoring, docs, puppet work being done at this point.
[16:16:04] k cool
[16:16:12] gonna try it in beta now
[16:16:54] ottomata: let me know how it goes, if anything fails for you I'm happy to troubleshoot. The more usage feedback we get the better the tool becomes.
[16:17:45] thcipriani: if I made an environments/beta/mockbase file
[16:17:49] that had targets for deployment in beta
[16:17:59] would that override a plain mockbase file in scap/
[16:17:59] ?
[16:18:27] only if you do: deploy -e beta
[16:18:31] aye
[16:18:36] so it will override dsh targets too?
[16:18:46] you have to pass the environment specifically at that point.
[16:18:52] then my scap.cfg wouldn't have to specify different server_groups?
[16:19:08] let me doublecheck real quick. I did make a stink about that on the patch that added this functionality :P
[16:20:28] hm, how does ssh_user work?
[16:20:33] does it need a special key?
[16:21:42] so yes to your question: an environments/beta/mockbase file should override a base mockbase file if you pass -e on the command line.
[16:22:03] so ssh_user should be a user that can ssh into each target from tin.
[16:23:02] 10Deployment-Systems, 6operations, 5Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1884698 (10bd808)
[16:23:02] since port forwarding is disallowed in production, keyholder is likely the only way to make it happen at this point.
[16:23:08] ok, so management of that user isn't magic
[16:23:24] keyholder? as in i need to puppetize keys for that user appropriately?
[16:23:56] HMmMMM https://wikitech.wikimedia.org/wiki/Keyholder
[16:23:59] not heard of this...
[16:24:04] indeed. I outlined a lot of what I've done thus far for services here: https://doc.wikimedia.org/mw-tools-scap/scap3/ssh-access.html
[16:24:34] awesome, reading...
[16:29:04] cool thcipriani, and if i wanted to puppetize the git::clone of my scap config repo, i could add it to role::deployment::mything, since that is only applied on deploy servers, ja?
[16:29:30] 6Release-Engineering-Team, 3Scap3, 7Security-General: Scap should be aware of security patches - https://phabricator.wikimedia.org/T118477#1884740 (10hashar) git apply trials ------- D80 is straightforward. It attempts to dry-run revert the security patches on the branch and will error out whenever it can...
[16:30:40] ottomata: yeah, I believe that you also then have to add role::deployment::mything to role::deployment::server, otherwise it wouldn't get applied anywhere, IIRC.
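The service_port probe quoted from the docs above is, per thcipriani, just nc -vz done "fancier and python-y": a plain TCP connect with a timeout. A minimal self-contained sketch of that idea — the function name and defaults are illustrative, not scap's actual internals:

```python
import socket

def port_is_accepting(host='localhost', port=8080, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, unreachable, or timed out
        return False
```

Note this only proves something is listening on the port; the custom checks mentioned above exist for deeper "is the service actually healthy" validation.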
[16:30:51] right
[16:31:06] aye
[16:31:36] 6Release-Engineering-Team, 3Scap3, 7Security-General: Scap should be aware of security patches - https://phabricator.wikimedia.org/T118477#1884753 (10hashar) If you could find a way to record a patch has been published in Gerrit, it would be helpful to propose to get rid of the associated .patch file. That...
[16:32:22] (03PS7) 10Hashar: Enable submodules for operations/mediawiki-config phpunit tests [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson)
[16:39:18] (03PS8) 10Hashar: Enable submodules for operations/mediawiki-config phpunit tests [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson)
[16:41:00] thcipriani:
[16:41:04] " 1. Adds the keyfile at puppet:///private/ssh/tin/deploy-mockbase_rsa to the keyholder" ?
[16:41:09] on deployment-bastion as well?
[16:41:19] i see there /etc/puppet/private/files/ssh/tin, which is a little funny?
[16:41:21] !log updating operations-mw-config-phpunit to have it process submodules
[16:41:23] sorry
[16:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:41:26] deployment-puppetmaster?
[16:41:29] (03CR) 10Hashar: "I have added a comment as to why the git definition is overridden. Made submodule to be explicit." [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson)
[16:41:40] thcipriani: puppet master was crashed on staging-palladium and also there's a merge error in /var/lib/git/operations/puppet which prevents puppet runs.
[16:41:47] So, fix that and probably everything else will fall in line :)
[16:41:57] * andrewbogott stops meddling
[16:43:55] ottomata: you'd add keys to: deployment-puppetmaster:/var/lib/git/labs/private/files/ssh
[16:45:04] (03CR) 10Hashar: [C: 032] "I did a recheck on https://gerrit.wikimedia.org/r/#/c/255135/" [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson)
[16:45:29] ebernhardson: tyler pointed me to your Jenkins patch to get submodules processed on mw-config ( https://gerrit.wikimedia.org/r/#/c/256979/ )
[16:45:30] hashar: thanks! i was kinda taking a stab in the dark with that patch
[16:45:35] ebernhardson: kudos, deployed and working fine!
[16:45:54] ebernhardson: it was fine. I just added a comment and explicitly defined submodule: disabled: false
[16:46:23] (03Merged) 10jenkins-bot: Enable submodules for operations/mediawiki-config phpunit tests [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson)
[16:46:30] ebernhardson: it is deployed. So I guess now you can write some more tests that rely on whatever is in the submodules :-)))) Thank you a ton!
[16:46:48] (03CR) 10Hashar: "For your tirelessly work on Jenkins, CI, tests, Librarifizification, here is the official Continuous integration barnstar!" [integration/config] - 10https://gerrit.wikimedia.org/r/256979 (owner: 10EBernhardson)
[16:46:56] ebernhardson: here is your CI barnstorm ^^^^
[16:47:00] :)
[16:47:00] barn star
[16:47:10] thcipriani: aye, but why is there a tin directory in there?
[16:47:14] :D
[16:47:34] /var/lib/git/labs/private/files/ssh/tin
[16:48:07] thcipriani, ottomata: we haven't used trebuchet in about a year
[16:49:01] we used custom shell loops for a while, but then switched to ansible
[16:49:14] ottomata: that is where the puppetmaster on labs looks for $passwords::
[16:49:30] and I guess tin is hardcoded in the puppet manifest
[16:49:33] :(
[16:50:03] but maybe some hiera data lets us vary the file name :-}
[16:51:54] thcipriani: why would you set recurse => true on file { '/srv/deployment/mockbase':
[16:51:54] in the deployment target?
[16:52:03] wouldn't that cause puppet to delete the deployed files?
[16:52:09] since they are not managed by puppet?
[16:53:10] ottomata: the intent was to ensure that permissions for all files in that directory are the same.
[16:54:40] thcipriani: have you tried it? i think it deletes any non-puppet-managed file in that dir
[16:55:27] ottomata: this was in a cherry-picked manifest on deployment-puppetmaster and I hadn't observed that
[16:55:42] hm
[16:56:20] This also enables the purge attribute, which can delete unmanaged files from a directory. See the description of purge for more details.
[16:56:29] purge
[16:56:29] Whether unmanaged files should be purged. This option only makes sense when ensure => directory and recurse => true.
[16:56:47] maybe the default for purge is false
[16:57:03] and 'enables the purge attribute' just means you can now use the purge attribute
[16:57:20] guess i will try it :)
[16:58:08] ottomata: if it is the case that the unmanaged files are purged, let me know and I'll update docs.
[16:58:51] k
[16:58:53] Niharika will send a patch to add a new extension (PageAssessments) in integration/config
[16:59:02] straightforward, should just change the Zuul layout config.
[16:59:33] she and the patch should show up here in a few :}
[16:59:37] I am off! *wave*
[17:05:17] (03PS1) 10Niharika29: Setup CI for PageAssessments extension [integration/config] - 10https://gerrit.wikimedia.org/r/259529
[17:19:33] thcipriani: does the data.yaml group deploy-mockbase need to be added to the deploy server ?
[17:19:39] "Finally, you'll want to modify: modules/admin/data/data.yaml in the operations/puppet repo to create the deploy-mockbase group and add users to that group."
[17:19:46] just adding the group isn't enough, right?
[17:19:57] or, is the group membership looked up elsewhere than posix group membership on the deploy server?
[17:23:23] so data.yaml is only used for production, so you'll have to add the group and the group members to ldap
[17:23:44] but data.yaml does handle group and group membership.
[17:23:58] to ldap?
[17:24:01] hm
[17:24:19] thcipriani: data.yaml handles it?
[17:24:21] ok well
[17:24:22] in prod
[17:24:24] i'm asking
[17:24:30] does the posix group need to exist on tin
[17:24:36] with its members
[17:24:37] or
[17:24:47] is just the presence of the group config in data.yaml enough?
[17:24:48] e.g.
[17:25:05] the group needs to exist on tin with its members.
[17:25:21] ok, so somewhere in puppet tin will need to declare all of the groups it needs
[17:25:22] hm.
[17:25:50] via admin::groups:
[17:25:53] i guess
[17:26:22] ergh, which probably means that every deployment target will need a new group created, eh?
[17:27:01] i guess i'll defer that until production...although i'm not sure what group to use for the keyholder trusted_group
[17:27:22] every deployed thing will likely need a group of deployers.
[17:27:49] aye, yeah, we have a few eventlogging groups right now, i can use one of those i think
[17:27:59] we'll have to figure out some smarter sudo rules
[17:29:44] thcipriani: how do I add a group in ldap/labs?
[17:30:27] also, what is the proper way to add a private key to labs private?
[17:30:34] just manually put it in there, and git commit ?
[17:30:39] on deployment-puppetmaster?
[17:30:57] and the proper path is /var/lib/git/labs/private/ssh/tin/deploy-mockbase_rsa
[17:30:58] ?
[17:31:20] ottomata: I was just digging through wikitech to try to figure out whether there's a way to add a group via that interface. This may be a question for -labs :\
[17:31:34] yes, that is the correct way to add a key for deployment-prep.
[17:31:42] ook
[17:38:19] thcipriani: oh, i see you added a passphrase file too?
[17:38:24] for servicedeploy
[17:38:32] or well, someone did
[17:38:35] should I do that?
[17:38:42] or should the key be passphrase-less?
[17:49:07] ottomata: yeah, I think the best practice is to have a passphrase, although I'm not sure that's a hard requirement. I just did it like that because it was done that way for the others.
[17:53:13] 7Browser-Tests, 5Patch-For-Review, 3Reading Web Sprint 62 - DJ-Jazzy-Jeff-and-the-Fresh-Sprints: Investigate QuickSurveys browser tests failures - https://phabricator.wikimedia.org/T113534#1884983 (10KLans_WMF)
[18:04:16] 10Deployment-Systems, 10Architecture, 10Wikimedia-Developer-Summit-2016-Organization, 7Availability: WikiDev 16 working area: Software engineering - https://phabricator.wikimedia.org/T119032#1885039 (10daniel) Here is my current take on what sessions we should have to cover the Software Engineering (aka Co...
[18:54:25] PROBLEM - Puppet failure on wmfbranch is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[19:15:54] thcipriani: k another q
[19:16:04] if i want to do different deployments to different targets
[19:16:10] in production, from the same codebase
[19:16:22] would it be better to make a new deployment source and target
[19:16:31] or, to use scap config files (maybe environments?) to manage this?
[19:17:23] so you mean that you would be using different dsh_targets?
[19:18:21] yeah, either with that
[19:18:28] or just different dsh files in different environments
[19:18:32] although, maybe that is an abuse
[19:18:42] i think just a list of different dsh_targets is wrong, since that does staged deploys, right?
[19:19:05] that's not what I want, i want to manage deployments of the same codebase for different purposes (e.g. analytics, production production (eventbus), etc.)
[19:19:15] i'm thinking a different deploy target is more appropriate, eh?
[19:19:25] so, if you have different dsh_targets in the same scap.cfg file, then, yeah, that's staged deploys
[19:19:53] if you want a completely different set of dsh_targets for deployment-prep vs production, then I'd say that's the use-case we built environments for.
[19:20:03] hmm, it's more than that
[19:20:04] in production
[19:20:09] i want different dsh_targets
[19:20:47] as there are many different services that are run out of the same repo
[19:21:10] i really only need 2 right now
[19:21:17] analytics, and uhhh let's say 'eventbus'
[19:21:27] but, i want to be able to deploy new code to each of those separately
[19:21:43] so, environments would work
[19:21:48] but, maybe that is an abuse of environments
[19:22:05] yeah, that seems like it may not be the cleanest way to do things in that instance.
[19:22:14] as the clone of the repo on the deploy server would have to be checked out at certain commits or tags to deploy the desired commit
[19:22:20] rather than the usual git pull and deploy the latest
[19:22:43] yeah, will make another entry in deployments.yaml for the new deployment case
[19:22:50] and configure scap3 for that one for now
[19:23:29] for any deploy target, all that is necessary is that the configured user can ssh there and copy files to a deploy path, right?
[19:23:47] there's no special config needed from scap.cfg or from puppet other than that, ja?
[19:24:16] scap will need to be installed on the target as well.
[19:24:24] oh, what does that?
[19:24:52] one second, there's a puppet class for this, lemme find it.
[19:24:54] ah
[19:24:55] i see it
[19:24:57] scap
[19:24:58] module
[19:24:59] ja?
[19:25:00] just include scap?
[19:25:07] oh maybe not
[19:25:28] that looks like old scap
[19:26:52] so there are two classes: scap::init and role::scap::target. First installs scap + dependencies, second makes sure there's a ferm rule for scap.
[19:28:03] # Using trebuchet provider while scap service deployment is under
[19:28:03] # development—chicken and egg things
[19:28:03] #
[19:28:03] # This should be removed once scap3 is in a final state (i.e. packaged
[19:28:03] # or deployed via another method)
[19:28:03] package { 'scap':
[19:28:06] that is the proper scap?
[19:28:25] indeed.
[19:28:53] oh i see, it is just using trebuchet's git clone to get the scap3 code to /srv/deployment/scap/scap?
[19:29:10] ok
[19:29:16] we have done the work for packaging; however, we'd like to get more things deployed in beta before committing to using the package exclusively.
[19:29:20] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:29:35] why doesn't role::scap::target just include scap
[19:29:36] ?
[19:30:05] or, better yet, why not put scap::target in the scap module?
[19:30:20] role doesn't quite make sense for that i think... (I also don't like including roles from other modules)
[19:30:37] anyone having trouble cloning from diffusion? i thought git clone https://phabricator.wikimedia.org/diffusion/APAW should work but i'm getting a 503
[19:31:09] nevermind. just magically started working now :|
[19:31:46] ottomata: yeah, the puppet stuff for deployment is in need of some work. Most of this split was done to get all the scap stuff out of the mediawiki module.
[19:31:58] aye k
[19:33:31] thcipriani: i'm going to make a patch, and add you as reviewer, please add whoever else you think should look at it
[19:33:38] (for that role thing :) )
[19:33:41] should be really easy
[19:33:45] ottomata: awesome! Thanks.
[19:34:23] RECOVERY - Puppet failure on wmfbranch is OK: OK: Less than 1.00% above the threshold [0.0]
[19:36:23] thcipriani: do you know if there is a specific reason (role::)scap::target doesn't just include scap itself?
[19:36:34] i'd think if you wanted the ferm rule, you'd also want the package, otherwise, no point?
[19:37:18] hm, i guess because some of the old mediawiki stuff doesn't use the new scap3? hmmm
[19:37:56] iirc it's mostly to do with role::deployment::mediawiki
[19:38:21] yeah ok, i won't touch that
[19:38:23] will just add a comment
[19:40:20] kk, yeah, this is definitely something that should be addressed; however, pulling on that thread can definitely eat some time.
[19:41:12] yeah
[19:41:12] hehe
[19:41:36] thcipriani: what is your gerrit nick?
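A rough sketch of the two-piece shape thcipriani describes above (install scap, plus a ferm rule so the deploy server can reach the target); class and rule names are illustrative, not the actual manifests:

```puppet
# What a scap3 deploy target needs, per the discussion above.
class role::scap::target {
    include scap  # installs scap + its dependencies (via trebuchet, for now)

    # Allow ssh from the deployment hosts; $DEPLOYMENT_HOSTS is assumed to
    # be a ferm macro covering tin & friends.
    ferm::service { 'scap-ssh':
        proto  => 'tcp',
        port   => '22',
        srange => '$DEPLOYMENT_HOSTS',
    }
}
```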
[19:41:43] oh same
[19:41:43] nm
[19:41:45] just typed it wrong
[19:41:47] found it :)
[19:41:49] https://gerrit.wikimedia.org/r/#/c/259542/
[19:41:57] yeah, I'm thcipriani or tcipriani everywhere :)
[19:57:04] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[20:00:41] 10MediaWiki-Releasing, 6Developer-Relations, 10Wikimedia-Blog-Content, 3DevRel-December-2015, 5MW-1.26-release: Write blog post announcing MW 1.26 - https://phabricator.wikimedia.org/T112842#1885513 (10hashar) Thank you @Qgil (and other writers)
[20:00:42] don't think that's me...but i'm the only one making changes sooOoo..
[20:01:05] hmm maybe it's my new deployment target
[20:01:58] thcipriani: typo in your ssh-access docs
[20:01:59] source => puppet://modules/mockbase/deploy-test_rsa.pub,
[20:02:02] should have 3 slashes
[20:02:06] source => puppet:///modules/mockbase/deploy-test_rsa.pub,
[20:04:33] ottomata: right. good catch—will update.
[20:18:17] hmmMm ok not sure what i'm doing wrong atm
[20:18:19] so
[20:18:23] i defined a new deployment target here
[20:18:23] https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep
[20:18:26] eventlogging/eventbus:
[20:18:27] upstream: https://gerrit.wikimedia.org/r/eventlogging
[20:18:27] checkout_submodules: true
[20:18:44] shouldn't a puppet run make that show up at /srv/deployment/eventlogging/eventbus on deployment-bastion
[20:18:45] ?
[20:19:48] thcipriani: ^?
[20:22:10] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]
[20:23:46] oh maybe i need a puppet run elsewhere...
[20:24:02] ahh yes, on deployment-salt
[20:28:12] Cool ja there it is
[20:32:11] ok thcipriani trying my first deploy with beta environment dsh_targets
[20:32:17] it didn't pick up the proper dsh file
[20:32:19] still around?
[20:32:23] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #724: 04FAILURE in 1 min 46 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/724/
[20:33:01] PROBLEM - Puppet staleness on deployment-logstash2 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[20:33:03] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[20:35:33] ottomata: yup, I'm around
[20:35:48] k, btw, not a question, but yes, keyholder keys need a passphrase
[20:35:52] /etc/keyholder.d/eventlogging_rsa is not an acceptable key. Does it have a passphrase?
[20:35:54] thcipriani: ja so
[20:36:01] on deployment-bastion
[20:36:15] look at my scap config in /srv/deployment/eventlogging/eventbus/scap
[20:36:58] (oh it looks like keyholder arm added my new key anyway)
[20:37:00] guess it was just a warning?
[20:37:23] Agent admitted failure to sign using the key.
[20:37:24] maybe not
[20:37:24] :)
[20:37:34] thcipriani: anyway, with my scap/ config
[20:37:35] i have
[20:37:39] thcipriani: ostriches: Reedy: I'm thinking of bringing back the weekly mails about log warnings, broken down by project. We used to do this a while back. We need to re-introduce this sense of responsibility.
[20:37:45] scap/environments/beta/eventbus
[20:37:51] which lists the beta hosts i want to deploy to
[20:37:53] when I do
[20:37:56] Krinkle: :)
[20:37:56] deploy -e beta
[20:38:02] it pulls the hosts out of scap/eventbus
[20:38:04] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]
[20:38:05] I'm also thinking of proposing a zero-tolerance policy where projects are not allowed to roll out new functionality before fixing warnings from previous train rides.
[20:38:24] Krinkle: +1000
[20:38:59] The error count is so massive right now that it makes it impossible to find regressions that stand out from the large amount of noise, and also hard to find out what error might have caused a hard-to-pinpoint regression. For example, this week our Save Timing metric regressed by 200ms. Being able to quickly see (or rule out) using error logs would've been useful.
[20:39:11] ottomata: hmm, lemme double check something, might be a scap out of date thing.
[20:39:17] k
[20:39:22] Thanks, I'll write one up then (warnings) to start it off again
[20:39:40] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[20:41:22] Krinkle: You're familiar with https://grafana.wikimedia.org/dashboard/db/production-logging, right? We should use those graphs to back up our e-mails.
[20:41:37] ie: "Log volume has increased by 10% this week. See? That's bad"
[20:41:38] Etc.
[20:42:06] Yippee, build fixed!
[20:42:07] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #725: 09FIXED in 1 min 32 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/725/
[20:43:56] (03Merged) 10jenkins-bot: Setup CI for PageAssessments extension [integration/config] - 10https://gerrit.wikimedia.org/r/259529 (owner: 10Niharika29)
[20:44:37] (03CR) 10Hashar: "Well done. I have deployed the configuration change :-}" [integration/config] - 10https://gerrit.wikimedia.org/r/259529 (owner: 10Niharika29)
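For context, the layout ottomata is describing here, assembled from the names in the log (the annotations are explanatory comments, not file contents, and the scap.cfg keys shown are only the ones discussed in this conversation):

```
scap/
├── scap.cfg                # e.g. ssh_user, server_groups, dsh_targets, lock_file
├── eventbus                # default dsh targets, one host per line
└── environments/
    └── beta/
        └── eventbus        # expected to override scap/eventbus when run as: deploy -e beta
```

The bug being debugged in the next stretch of the log is exactly that the environments/beta/eventbus override was not being picked up by `deploy -e beta` (filed as T121705 below).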
I have deployed the configuration change :-}" [integration/config] - 10https://gerrit.wikimedia.org/r/259529 (owner: 10Niharika29) [20:47:50] !log update scap on deployment-prep [20:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:49:25] ottomata: give that a shot, if that doesn't work, I'll have to take a look at the scap.cfg (can't see it right now since it's in your ~ :)) [20:49:49] :p [20:49:59] i can fix that, i was doing that before i had things working better [20:53:23] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [20:53:59] 10Beta-Cluster-Infrastructure, 7Easy, 7Mobile, 5Patch-For-Review: MobileFrontEnd on Beta Cluster should display a different logo so that it is clearly not a production site - https://phabricator.wikimedia.org/T115078#1885775 (10Jdlrobson) [20:54:29] naw thcipriani no good [20:54:34] PROBLEM - Puppet failure on mira is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [20:54:36] it still grabbed the main dsh file [20:54:38] not the -e beta one [20:54:49] :( [20:54:52] $ deploy -e beta [20:54:52] 20:54:17 Started deploy_eventlogging/eventbus [20:54:52] Updated tag 'scap/sync/2015-12-16/0003' (was 0000000) [20:54:52] Entering 'config/schemas' [20:54:52] 20:54:17 [20:54:52] == DEFAULT == [20:54:52] :* kafka1001.eqiad.wmnet [20:54:53] :* kafka1002.eqiad.wmnet [20:55:17] yeah, I was running `deploy-log` so I could see that :) [20:55:36] my keyholder thing isn't working yet iether, so i'll have to figure that out too [20:55:41] lemme check the scap.cfg real quick... [20:55:43] ok [20:55:50] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:56:48] hm, i can switch that weird git_deploy_dir back [20:56:54] i'm not worried about that now that I have a new target [20:59:25] ottomata: give it another try, I'll watch deploy-log, this is a bug, I just tried to fix ad-hoc. [21:00:07] k just a few, got a couple of things i need to fix too [21:00:10] in puppet stuff [21:07:00] ok thcipriani trying now! [21:07:12] hmm locked? [21:07:44] someone deploying mw? [21:07:51] can only one deploy happen at once? [21:07:58] oh jenkins! [21:07:58] hm! [21:08:15] I thought we merged something for this, hang on. [21:09:34] ottomata: oh, right, just add: lock_file: /tmp/eventbus to your config [21:09:46] hm ok [21:10:04] maybe the default should be ./scap/lock ? [21:10:04] it's configurable and defaults to the scap lockfile. There is a bug for this. [21:10:23] can I set a relative path? :p [21:10:29] * bd808 needs to get rid of his "scap" stalk word [21:10:40] bd808: :D [21:10:48] ottomata: no, not at this time. [21:10:57] ok [21:11:06] can I refer to other configs within the config file? [21:11:18] e.g. $git_repo/$git_deploy_dir ? [21:11:18] :p [21:11:31] Krinkle: Warning: data error in /srv/mediawiki/php-1.27.0-wmf.9/includes/Revision.php on line 1325 [21:11:31] Warning: Revision::decompressRevisionText: gzinflate() failed [Called from Revision::decompressRevisionText in /srv/mediawiki/php-1.27.0-wmf.9/includes/Revision.php at line 1328] in /srv/mediawiki/php-1.27.0-wmf.9/includes/debug/MWDebug.php on line 300 [21:11:37] bad gzip? 
[21:11:31] Krinkle: Warning: data error in /srv/mediawiki/php-1.27.0-wmf.9/includes/Revision.php on line 1325
[21:11:31] Warning: Revision::decompressRevisionText: gzinflate() failed [Called from Revision::decompressRevisionText in /srv/mediawiki/php-1.27.0-wmf.9/includes/Revision.php at line 1328] in /srv/mediawiki/php-1.27.0-wmf.9/includes/debug/MWDebug.php on line 300
[21:11:37] bad gzip?
[21:11:41] on a rev
[21:11:53] (i assume not :) )
[21:12:02] ottomata: you assume correctly :)
[21:12:42] * Krinkle edits https://grafana.wikimedia.org/dashboard/db/production-logging
[21:12:50] added history view and added a per-channel breakdown on the bottom
[21:13:58] rats
[21:14:00] thcipriani:
[21:14:00] $ deploy -e beta
[21:14:00] 21:13:46 Started deploy_eventlogging/eventbus
[21:14:00] Updated tag 'scap/sync/2015-12-16/0004' (was 0000000)
[21:14:00] Entering 'config/schemas'
[21:14:00] 21:13:46
[21:14:00] == DEFAULT ==
[21:14:01] :* kafka1001.eqiad.wmnet
[21:14:01] :* kafka1002.eqiad.wmnet
[21:14:55] sigh. ok, lemme check my fix again here.
[21:15:09] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[21:15:14] ottomata: can I try deploys to test some stuff?
[21:15:48] Krinkle: looks good, thx for the improvements
[21:15:53] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[21:18:14] yes thcipriani go ahead
[21:18:21] i haven't got keyholder to work yet
[21:18:27] kk
[21:18:31] but you can at least attempt the deploy
[21:18:34] to see what hosts get chosen
[21:24:59] 10Deployment-Systems, 3Scap3: scap environment-specific host file not working - https://phabricator.wikimedia.org/T121705#1885896 (10thcipriani) 3NEW
[21:25:56] ^ ottomata I filed a bug, for the time being for beta, it'll be easier to use an absolute path. I'll try to get that bug fixed fairly quickly, should be quick.
[21:26:31] absolute path to dsh file?
[21:27:29] yep. That will work.
[21:27:50] oh or to env dir?
[21:28:20] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[21:28:31] wrt the keyholder, I just checked that auth-d yaml file for keyholder. It seems you set the authorized group to project_deployment-prep. I'm in project-deployment-prep
[21:28:53] if you use an absolute path, you can set it to anywhere.
[21:30:50] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[21:32:22] hmm
[21:33:06] thcipriani: i saw that too
[21:33:08] have fixed
[21:33:24] still same result though... oh maybe key needs rearmed?
[21:33:48] hmm, no that shouldn't matter, right?
[21:33:59] yeah, changing the auth needs a restart of keyholder then a key rearm IIRC.
[21:34:10] loads up perms at service start
[21:34:17] oh restart
[21:34:18] ok
[21:34:30] twentyafterfour: ostriches https://gerrit.wikimedia.org/r/#/c/259560/
[21:34:34] let's get real ip's
[21:34:52] thcipriani: the st-*.services-testbed instances also look unmaintained. May I delete them?
[21:35:28] thcipriani: rats
[21:35:29] restarted
[21:35:31] re-armed
[21:35:33] with passphrases
[21:35:37] looks good in keyholder status
[21:35:39] but still
[21:35:41] chasemp: +1'd
[21:35:42] Agent admitted failure to sign using the key.
[21:36:04] andrewbogott: looking
[21:36:38] ottomata: which instance are you trying to get to?
[21:36:52] deployment-eventlogging04
[21:36:59] in auth.log there
[21:37:09] i don't see it trying to use the key fingerprint from keyholder
[21:37:17] i think keyholder isn't allowing me access to the agent
[21:38:37] ha, not much helpful in upstart logs
[21:38:37] debug1: type 11
[21:38:38] debug1: XXX shrink: 3 < 4
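For readers unfamiliar with keyholder, a sketch of the debugging steps implied here, based on the wikitech Keyholder page linked later in this log; the socket paths are the documented defaults and may differ on a given host:

    # on the deploy host (deployment-bastion), check what the shared agent
    # holds and what the filtering proxy is willing to offer your group:
    $ sudo keyholder status                                     # arming state
    $ sudo SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add -l   # keys in agent
    $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l        # keys you can use
    # "Agent admitted failure to sign" typically means the proxy refused the
    # signing request, e.g. because your group isn't listed in its auth-d yaml
    # (note the underscore/hyphen mismatch spotted above).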
[21:38:46] andrewbogott: I'd say you can delete everything except st-trusty and proba, you'd have to check with services about those, everything else I set up for scap3 things. May want to doublecheck in -services about all those, though.
[21:39:04] thcipriani: sounds good, thank you
[21:39:37] ottomata: which user?
[21:39:46] eventlogging
[21:41:47] hmm, yes, that is strange...
[21:43:42] 10Deployment-Systems, 7I18n: Translation from translatewiki.net (MediaWiki:View-foreign/nl) don't show - https://phabricator.wikimedia.org/T66230#1885965 (10Nemo_bis)
[21:43:53] 10Deployment-Systems, 7I18n: Translation from translatewiki.net (MediaWiki:View-foreign/nl) don't show - https://phabricator.wikimedia.org/T66230#1885966 (10Nemo_bis) 5Open>3Invalid
[21:46:16] thcipriani: fyi i can confirm that deploy -e /srv/deployment/eventlogging/eventbus/scap/environments/beta does the right thing
[21:47:36] kk, looking at this keyholder thing, strange that it's not working
[21:49:50] ja afaict everything is in line...
[21:50:52] indeed.
[21:51:43] thcipriani: maybe the keyholder-proxy needs restarted too?
[21:52:45] ottomata: yep
[21:53:00] did you restart it?
[21:53:02] if not i will try now
[21:53:10] I did not, please do
[21:53:14] i just noticed that keyholder restart only restarts the main agent
[21:53:54] better!
[21:53:59] now it is using the proper key
[21:53:59] 10Beta-Cluster-Infrastructure, 10Staging, 6Collaboration-Team-Backlog, 10DBA: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1885983 (10Mattflaschen)
[21:54:08] buuut not yet allowed to get into the target
[21:54:17] Failed publickey for eventlogging from 10.68.16.58 port 57033 ssh2: RSA 02:9b:99:e2:f0:16:70:a3:d2:5a:e6:02:a3:73:0e:b0 :)
[21:54:27] thcipriani: should keyholder restart restart the proxy then too?
[21:54:38] "/sbin/${command}" keyholder-proxy
[21:54:39] ?
[21:54:53] if so I can make patch...
[21:55:30] I can't offhand think of a circumstance in which you would restart one without restarting the other.
[21:55:52] ja if you are trying to reload configs, the proxy will need restarted in order to pick up the auth dir changes
[21:55:58] k submitting patch, i'll add you and ori as reviewers
[21:58:30] thcipriani: https://gerrit.wikimedia.org/r/#/c/259596/
[22:00:03] SO CLOSE!
[22:00:19] now what is wrong with this keyYyYYYy
[22:00:20] hm
[22:01:24] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[22:01:40] thcipriani: is it possible I have to do something with this eventlogging user in ldap in labs?
[22:01:54] i see it doing error: AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1
[22:01:57] pam_access(sshd:account): access denied for user `eventlogging' from `deployment-bastion.deployment-prep.eqiad.wmflabs'
[22:02:08] but, it is trying to use the proper key
[22:02:23] and, the public key is at /etc/ssh/userkeys/eventlogging
[22:02:45] um
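Those two auth.log lines point at two separate gates, which is what the rest of the conversation untangles. A sketch of the sshd-side configuration they imply; the option values are assumptions reconstructed from the log messages, not a copy of the labs config:

    # /etc/ssh/sshd_config (sketch)
    # gate 1: where sshd looks for public keys --
    AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup  # LDAP first; returns 1
    AuthorizedKeysCommandUser nobody                     #   for system users (harmless)
    AuthorizedKeysFile /etc/ssh/userkeys/%u              # puppet-managed fallback,
                                                         #   where eventlogging's key lives
    # gate 2: account validity --
    UsePAM yes   # pam_access then consults /etc/security/access.conf, and that
                 # is what rejects `eventlogging' here, not the key lookup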
[22:03:00] Does andrewbogott know you're talking about this?
[22:03:10] me? :)
[22:03:17] i suppose now he will :)
[22:03:20] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[22:03:36] i aint changin no pam things, don't you worry
[22:03:43] ok :)
[22:04:09] i *think* for this case it shouldn't matter, ssh is just trying to lookup a username in ldap and it is failing (because it doesn't exist there)
[22:04:13] so, it doesn't find a public key for it
[22:04:15] which is fine
[22:04:17] i think it shouldn't matter
[22:04:22] that does sound like a pam thing
[22:04:29] yeah, its trying to use pam to do that
[22:04:32] as it will for all logins i guess
[22:04:39] but, i'm trying to log in with a systemuser account that is not in ldap
[22:04:44] but configuring ssh keys via keyholder in puppet
[22:05:01] so, there is a private key in an agent on deployment-bastion, and a public key deployed via puppet on my target host
[22:05:03] I think this might be a known problem
[22:05:22] hmmm yeah, maybe the pam thing is causing a problem
[22:05:39] is labs configured in such a way that ONLY ldap public keys are allowed?
[22:06:05] ottomata: as opposed to…?
[22:06:18] as opposed to this https://wikitech.wikimedia.org/wiki/Keyholder
[22:06:50] andrewbogott: i'm trying to use scap3 to deploy
[22:06:51] in beta
[22:07:03] i've set up private key in keyholder on deployment-bastion
[22:07:12] and a public key in /etc/ssh/userkeys
[22:07:13] is this something that you think ever worked in labs?
[22:07:14] i think its all set
[22:07:19] Yes
[22:07:22] andrewbogott: no idea! i'm doing this for the first time
[22:07:33] ok...
[22:07:37] it seems like it should, according to these instructions, but maybe I need to add a public key for a user in ldap?
[22:07:44] I don't know. It's quite possible that pam changes broke it
[22:07:45] that would be unfortunate if so :)
[22:07:52] this user doesn't exist in ldap
[22:07:56] it is a system user
[22:07:59] managed by puppet
[22:08:13] I'd suggest you open a ticket, cc: faidon, and make sure you include something along the lines of when/how it used to work
[22:08:23] thcipriani: do you know if it used to work?
[22:08:32] or are all your scap3 system users in ldap in labs?
[22:08:35] with public keys?
[22:09:18] they are, I haven't tried just using a key in userkeys, just assumed it should work.
[22:09:46] mediawiki deployments still appear to work from deployment-bastion
[22:10:05] hm, what is a user that scap3 deploys with from -bastion?
[22:10:08] mwdeploy
[22:10:09] ?
[22:10:11] yeah, mwdeploy
[22:10:35] hmm, /usr/sbin/ssh-key-ldap-lookup mwdeploy doesn't seem to find a public key...
[22:10:36] I don't know about scap3, but normal scap uses keyholder
[22:10:50] Krenair: yeah, to get the private key, i think that is working
[22:11:01] same mechanism for scap v. scap3
[22:11:05] it's that pam may not be configured to auth with a public key not in ldap
[22:11:10] on the target host
[22:11:52] do either of you know how those keys are added to ldap?
[22:12:35] via wikitech?
[22:12:57] hmmm nawww, can't be :)
[22:13:14] wikitech controls the ssh keys in ldap, yes
[22:13:20] but for system users?
[22:13:22] i guess:
[22:13:26] system users shouldn't be in ldap
[22:13:29] does mwdeploy have a public key in ldap, and if so, how did it get there?
[22:14:01] no it doesn't
[22:14:04] ok
[22:14:09] so that must not be my issue then
[22:14:32] hmm, thcipriani is there something else deployed with scap3 in beta right now? maybe i can check auth.log on a target for that and compare
[22:15:13] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms
[22:15:15] restbase01 and restbase02 were deployed (as recently as a few weeks ago)
[22:15:25] using deploy-service
[22:15:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[22:16:24] ok via the deploy-service user?
[22:16:29] yup
[22:17:33] ok cool, i see the same ldap error for that
[22:17:35] so that is not my problem
[22:18:20] Hmmmm
[22:18:30] no public key in /etc/ssh/userkeys for deploy-service
[22:18:32] HMmMM
[22:19:00] Hmmm where could it be?!
[22:20:37] for what it is worth deployment-aqs01 logs "Error: Cannot find module '/srv/deployment/restbase/deploy/restbase/server.js'"
[22:21:17] and
[22:21:18] fatal: Access denied for user eventlogging by PAM account configuration [preauth]
[22:21:30] (instances syslog are sent to logstash)
[22:22:01] aye yeah
[22:22:01] k
[22:22:31] dunno anything about that aqs one, but the pam eventlogging one is what i'm trying to solve
[22:23:15] is that `eventlogging` user in the deployment-prep project?
[22:23:36] hm, no
[22:23:37] but I am
[22:24:05] does it have to be?
[22:24:26] might be wrong
[22:24:33] but IIRC pam is configured to lookup in ldap
[22:24:35] i guess i can just manually add and try
[22:24:39] yah, i can see it doing that
[22:24:42] and if the user is not in the project, it is rejected magically
[22:24:55] but then, some changes have been made to pam recently so...
[22:25:07] i thiiiink that is not my problem, but i will try it...
[22:25:36] ottomata: I am wrong
[22:25:44] oh?
[22:25:44] the mwdeploy user is not in the group :D
[22:26:37] aye
[22:26:39] hm yeah
[22:26:47] ok, right now, i'm looking at the deploy-service user, because that one certainly uses scap3
[22:27:04] deployment-aqs01:~$ getent passwd deploy-service
[22:27:04] deploy-service:x:998:998::/var/lib/scap:/bin/bash
[22:27:08] and, it was used to deploy restbase to deployment-restbase01 recently
[22:27:09] but
[22:27:19] deployment-aqs01:~$ getent passwd eventlogging
[22:27:20] $
[22:27:21] and i see the same ldap key lookup error
[22:27:26] right hashar
[22:27:28] i am not deploying to aqs
[22:27:31] deployment-eventlogging04
[22:27:45] aqs is totally unrelated
[22:28:10] oh sorry ignore me
[22:28:15] I am not being any useful :-/
[22:28:17] hehe
[22:28:17] :)
[22:28:30] well, almost, right now, i need to find where the deploy-service public key is on restbase01
[22:28:40] that may be a clue
[22:28:53] or, maybe its possible whatever crazy pam changes happened over the weekend just broke scap3
[22:29:07] the last restbase01 deploy i'm looking at happened on dec 8
[22:29:11] so, it seems like, the deploy-service user for restbase01 no longer works
[22:29:30] ah!
[22:29:30] this could be because of the un-cherry-picked puppetmaster patch
[22:29:31] ok
[22:29:34] * hashar blames puppet
[22:29:39] oh?
[22:29:44] by me? no
[22:29:47] right?
[22:30:22] * thcipriani walks home from café
[22:30:30] I'll be back in a minute
[22:30:32] :)
[22:30:33] k
[22:40:14] thcipriani: ok, so, i guess, if deploy-service user isn't working, then neither will this, eh?
[22:40:21] i still am not sure how the deploy-service thing works at all anyway
[22:40:31] since I can't find its public key on restbase01
[22:41:06] ah
[22:41:07] modules/beta/manifests/deployaccess.pp
[22:41:16] that is the crazy rule I was looking for
[22:41:48] that tells pam to allow deploy-service and mwdeploy users from the deployment-bastion
[22:41:57] bwaaa
[22:42:04] bwaaaaaa
[22:42:15] I am not sure how / where it is applied though
[22:43:00] looks like in wikitech ui
[22:43:02] 2015-11-03 22:05 bd808: applied ::beta::deployaccess on deployment-bastion via Special:NovaInstance
[22:43:02] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=088a0575-c09b-4c42-88a5-4ef57d8705c0&project=deployment-prep&region=eqiad
[22:43:05] oooook
[22:43:22] hmmm i guess i should just add my user there, although, seems a little crazy!
[22:44:21] jenkins-deploy is not listed there though
[22:44:32] yeah i dunno man lots of stuff going on here...
[22:44:34] that is the user used by Jenkins to ssh in a few instances
[22:44:43] hm i can manually add access.conf.d entry and just try it i guess
[22:44:45] on my target
[22:45:33] but then jenkins-deploy user is in ldap
[22:45:39] (created via wikitech)
[22:45:49] and it is a member of the group project-deployment-prep
[22:45:56] so pam let it in
[22:46:03] hm ja ok
[22:46:04] trying this then
[22:46:38] for mwdeploy, we had it added to admins data.yaml with a 6xx uid
[22:46:48] and a corresponding entry has been added to ldap iirc
[22:46:57] so we have the same UID on all instances
[22:47:16] but the mwdeploy is unknown to wikitech and thus is not in the project-deployment-prep group
[22:47:23] probably why there is the beta::access class
[22:48:10] Accepted publickey for eventlogging from 10.68.16.58 port 58491 ssh2: RSA 02:9b:99:e2:f0:16:70:a3:d2:5a:e6:02:a3:73:0e:b0 :)
[22:48:11] :D
[22:48:20] ok so this is def the problem
[22:48:21] cool
[22:48:32] oh
[22:48:40] !!!!!!!!!!!!!!!!
[22:48:42] now
[22:48:56] you get a fully qualified ticket
[22:49:03] going to add this?
[22:49:04] security::access::config { 'beta-allow-eventlogging':
[22:49:04] content => "+ : eventlogging : ${bastion_ip}\n",
[22:49:05] time to escalate to level 3 support :D
[22:49:44] hmmmm, this is pretty cumbersome though
[22:49:55] so many steps to take in order to deploy a new repo in beta
[22:50:21] what is the eventlogging user for anyway?
[22:50:25] for every new deploy user, one must add a special pam exception
[22:50:28] for eventlogging!
[22:50:34] we were deploying with trebuchet
[22:50:37] oh
[22:50:38] which is root for everything
[22:50:42] i'm trying to do scap3
[22:50:48] and will now deploy using scap3 as eventlogging ?
[22:50:56] so that would be the equivalent of service-deploy right ?
[22:50:58] well, the eventlogging-service for eventbus
[22:51:06] next quarter we'll migrate the other stuff over
[22:51:07] but ja
[22:51:09] yes
[22:51:10] exactly
[22:51:26] so don't quote me
[22:51:44] but i would assign eventlogging-service a UID in 6xx via data.yaml
[22:51:50] add it in LDAP with matching uid
[22:52:10] then do the security::access::config { 'beta-allow-eventlogging':
[22:52:21] hmm, i dunno, the eventlogging user already exists in prod and on a few beta hosts
[22:52:23] it is a system user
[22:52:28] managed by puppet
[22:52:42] why would i need a ldap user?
[22:52:47] fixed uid ?
[22:53:06] why though?
[22:53:10] nfs? :)
[22:53:12] no nfs here :)
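For context, the security::access::config snippet quoted above manages a pam_access rule. A sketch of roughly what it renders to on the target; the bastion IP is filled in from the ssh log lines above, and the catch-all deny is an assumption about how the labs project-membership policy is expressed:

    # /etc/security/access.conf fragment (sketch); the format is
    #   permission : users : origins
    # pam_access stops at the first matching rule, so the "+" exception
    # has to sort before the project-membership deny:
    + : eventlogging : 10.68.16.58
    # ... later, the labs default rejects anyone not in the project:
    - : ALL EXCEPT (project-deployment-prep) root : ALL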
[22:53:29] hashar: i would do it if that was the standard for system users, but i don't think it is :/
[22:54:37] i'd prefer to not have to deal with ldap here, especially since we won't be doing that in prod anyway
[22:54:38] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[22:54:42] so maybe security::access::config { 'beta-allow-eventlogging': is enough
[22:54:46] yeah
[22:54:49] it will let me in
[22:54:52] pretty sure
[22:54:57] you can craft a puppet patch + cherry pick on beta cluster puppetmaster
[22:55:07] then add the puppet class on the node and see what happens
[22:55:10] should do
[22:55:33] would be worth asking on ops whether the system user should be given a unique id across the fleet
[22:55:44] i think i need to give it a real shell though, instead of /bin/false
[22:56:01] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[22:56:13] that causes other troubles with puppet though. If puppet can't reach ldap while doing User[johndoe] puppet creates a local johndoe user with a local uid
[22:56:45] naw, its a systemuser
[22:56:55] so it should create a local user
[22:56:57] i want it to
[22:57:02] ok ok :-}
[22:57:31] so seems security::access::config is the unblocker
[22:57:34] yeah
[22:57:35] thanks!
[22:57:37] patching now
[22:57:45] it seems crazy though that I have to do all that! :) but ok!
[22:57:59] thcipriani: is gonna have to update his https://doc.wikimedia.org/mw-tools-scap/scap3/ssh-access.html docs :)
[22:58:05] yeah that is a constraint due to labs
[22:58:22] yeah, i guess it is just a labs thing, but mehHhH :0
[22:58:23] :)
[22:58:26] all labs projects share the same underlying infra (same ldap / no network isolation etc)
[22:58:50] so the isolation has been done at the instance level using tricks such as the pam rules
[22:59:14] I think coren and andrewbogott are working on better isolating labs projects and definitely reworked the labs pam system
[22:59:24] ottomata: there have been mucho changes in this arena as of late
[22:59:35] so it is getting better. 2015 largely improved labs for sure
[22:59:39] chasemp: aye i have heard only a little :)
[22:59:43] and some really weird consequences fyi, if you wouldn't mind making a ticket for what you are seeing
[22:59:49] but, i think that we have found the problem
[22:59:51] only because, I'm sure whatever it is, isn't just you
[22:59:59] chasemp: i think its just a new scap3 related thing
[23:00:17] i'm the first non releng person trying to use it
[23:00:19] (i think)
[23:00:39] scap3 tries to ssh as a configured system user
[23:00:41] * hashar looks for a scap3 barn star to offer
[23:00:56] still worth a ticket :) ...if you can
[23:00:57] and labs doesn't allow that
[23:00:59] oooOk
[23:01:04] otherwise good odds we stomp on the fix man
[23:01:08] as we are known to do
[23:01:16] should I assign to you? what project?
[23:01:25] labs project and you can leave triage
[23:01:29] and I'll talk to ppls
[23:02:12] cc Bryan Davis to it since apparently he has hit that wall previously and introduced the magic pam rule
[23:02:45] which one? The deploy keys in beta cluster?
[23:03:19] bd808: sshing to an instance with a user not being part of deployment-prep project
[23:03:37] worked around via modules/beta/manifests/deployaccess.pp
[23:03:51] yeah. there is deep puppet magic I copied from somewhere else
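A sketch of the "puppet patch + cherry pick" flow hashar suggests above; the puppetmaster hostname, repo path, and change ref are placeholders, since none are given in the log:

    # on the beta cluster puppetmaster (hostname assumed):
    $ ssh deployment-puppetmaster.deployment-prep.eqiad.wmflabs
    $ cd /var/lib/git/operations/puppet          # repo path assumed
    $ sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet \
        refs/changes/NN/NNNNNN/N                 # the change under review
    $ sudo git cherry-pick FETCH_HEAD
    # then apply the class to the node (via the Special:NovaInstance
    # configure page mentioned above) and run puppet on the target:
    $ sudo puppet agent --test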
[23:03:56] which tells pam to allow deploy-service / mwdeploy
[23:04:05] andrew is trying to add another user for eventlogging deployment via scap3
[23:04:41] ah, yes. that would be the same issue
[23:05:09] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[23:05:29] jenkins-deploy (used by Jenkins to ssh to instances) is a member of the labs project
[23:05:36] It seems like there used to be a pam hack too
[23:05:37] so not impacted :-}
[23:05:46] but maybe that's not needed anymore
[23:05:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[23:05:56] chasemp: https://phabricator.wikimedia.org/T121721
[23:05:58] danke
[23:06:20] for now i'm going to add an exception so i can test my deploys, but hopefully we won't have to eventually :)
[23:07:14] * thcipriani just got home
[23:07:22] oh no
[23:07:25] tyler missed all the fun
[23:07:34] evidently.
[23:07:46] just catching up on the scrollback.
[23:07:47] TL;DR: modules/beta/manifests/deployaccess.pp
[23:08:05] assuming pam auto rejects users not being part of deployment-prep
[23:08:26] we need a custom pam rule to whitelist users not being members of the project
[23:08:37] jenkins-deploy (used by Jenkins) works just fine because it is a member
[23:08:49] mwdeploy does thanks to the pam rule
[23:09:10] eventlogging lacks a proper passport/visa and is rejected at the border
[23:10:11] ah, that makes sense.
[23:10:19] thcipriani: so so so close!
[23:10:26] i need sudo rules?
[23:10:30] eventlogging is not allowed to run sudo on deployment-eventlogging04. This incident will be reported.
[23:11:05] ottomata: I think I know what this is, let me verify
[23:11:13] ah sudo
[23:11:16] Command 'mkdir -p '/srv/deployment/eventlogging/eventbus-cache/cache'' returned non-zero exit status 1
[23:11:44] hm oh or do I need to make /srv/deployment/eventlogging have proper perms?
[23:12:18] i have to go very SOoOoON :)
[23:12:35] deployment-prep has a bunch of sudo rules on https://wikitech.wikimedia.org/wiki/Special:NovaSudoer which I never really understood :/
[23:12:36] https://phabricator.wikimedia.org/P2431
[23:12:49] some might be provisioned via puppet as well
[23:12:55] so the eventlogging user needs a sudo rule to sudo as itself
[23:13:02] AH HA
[23:13:03] ook
[23:13:04] :p
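A sketch of the "sudo as itself" rule just described, written in plain sudoers syntax for illustration; on labs this is actually managed through the Special:NovaSudoer page linked above, and the exact spec there may differ:

    # scap3 runs its remote commands as "sudo -u <deploy user> ..." even when
    # it is already connected as that user, so the deploy user must be allowed
    # to sudo to itself, non-interactively:
    eventlogging ALL = (eventlogging) NOPASSWD: ALL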
[23:13:19] wrong test!
[23:13:37] thcipriani: i think it will fail anyway
[23:13:38] since
[23:13:46] /srv/deployment/eventlogging
[23:13:48] is owned by root
[23:14:04] because it was originally created by trebuchet
[23:14:13] but, i think it shouldn't have to be owned by the deploy user
[23:14:37] what if I deployed different services as different users into that dir (unlikely, but maybe!)
[23:14:45] (in labs maybe it is likely )
[23:15:06] is the location of the cache dir configurable?
[23:15:37] i have 5 mins! Ahhhh i so wanted to make this work before I left today, oh well! lots of progress though!
[23:15:45] man, so many thanks to all of you
[23:15:48] so much help all day here today
[23:15:49] heck you can ssh to it!
[23:16:00] progress [#####___________]
[23:16:19] closer than that!
[23:16:19] damn it is past midnight already
[23:16:29] where is the midnightkickbot when i need it
[23:16:43] 10Continuous-Integration-Config, 10Fundraising-Backlog: Continuous integration: wikimedia/fundraising/tools/DjangoBannerStats needs V+2 jobs - https://phabricator.wikimedia.org/T121723#1886315 (10awight)
[23:16:47] ottomata: the cache dir is always: git_deploy_dir / git_repo '_cache'
[23:17:08] ottomata: thanks for trying it out! :)
[23:17:59] thcipriani: i think that will have problems, since git_deploy_dir is managed by scap, and git_deploy_dir/eventlogging/eventbus is managed by scap
[23:18:04] git_deploy_dir/eventlogging is not
[23:18:30] so attempting to create git_deploy_dir/eventlogging/eventbus_cache could easily fail
[23:18:53] i think either put cache in git_deploy_dir/git_repo/.cache (or something)
[23:18:55] or make it configurable
[23:18:58] or put it in /tmp
[23:18:59] :p
[23:19:24] or
[23:19:31] configurable is not a bad idea. Generally the thinking was that the transition would need some initial puppet work.
[23:19:34] git_deploy_dir/.cache/git_repo
[23:19:49] well, one thing that is really cool about this
[23:20:14] is that technically, in puppet the target doesn't even have to know the final deploy_dir (if scap can create it)
[23:20:20] so, if I were to set
[23:20:32] git_deploy_dir: /home/otto/path/to/anything/i/want
[23:20:44] (03PS1) 10Krinkle: Set up test pipeline for rcstream repository [integration/config] - 10https://gerrit.wikimedia.org/r/259616
[23:20:57] (03CR) 10Krinkle: [C: 032] Set up test pipeline for rcstream repository [integration/config] - 10https://gerrit.wikimedia.org/r/259616 (owner: 10Krinkle)
[23:20:57] i think scap should work, as long as deploy user can write in /home/otto/path/to/anything/i/want
[23:21:09] it shouldn't require puppet work to do it :)
[23:21:37] SOOoOOO, all places that deploy user needs to write to should be known by scap
[23:21:40] git_deploy_dir
[23:21:41] is fine
[23:21:44] git_deploy_dir/git_repo
[23:21:45] is fine
[23:21:46] but
[23:21:56] git_deploy_dir/something/something_cache isn't
[23:22:10] because git_deploy_dir/something might be owned by who knows what :)
[23:22:28] OK
[23:22:31] yes, time to go
[23:22:38] (03Merged) 10jenkins-bot: Set up test pipeline for rcstream repository [integration/config] - 10https://gerrit.wikimedia.org/r/259616 (owner: 10Krinkle)
[23:22:40] thank you So mUch! I'll probably bug you some more tomorrow
[23:22:42] we are really close
[23:22:43] :D
[23:22:51] so the thinking was that in most cases the git_deploy_dir is /srv/deployment and since we were trying to scope deployments of a particular service to a particular user it would be easier to ensure the remote user could write to /srv/deployment/[repo]-cache than /srv/deployment/.cache/[repo]
[23:23:06] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/259616
[23:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[23:23:18] thcipriani: i think that would be fine, if repo was not two directories
[23:23:23] e.g. eventlogging/eventbus
[23:24:14] repo being enforced as two directories is sort of an arbitrary decision by trebuchet whose reasoning is a bit opaque. Wanted to move away from enforcing that, generally.
[23:26:41] sleepy time. *wave*
[23:28:03] kudos on scap3 / eventbus progressing somehow!
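A sketch of the path arithmetic behind the cache-directory problem discussed above; the paths are the ones from the log, and the rule "cache dir = git_deploy_dir / git_repo + cache suffix" is as thcipriani states it (the log writes the suffix both as _cache and -cache; the mkdir error shows -cache):

    # with a two-level repo name, the cache path gains an intermediate directory:
    git_deploy_dir=/srv/deployment
    git_repo=eventlogging/eventbus
    echo "${git_deploy_dir}/${git_repo}-cache"
    # -> /srv/deployment/eventlogging/eventbus-cache
    # scap manages /srv/deployment and .../eventlogging/eventbus, but not the
    # intermediate /srv/deployment/eventlogging -- here owned by root (trebuchet
    # created it), so the deploy user's "mkdir -p .../eventbus-cache/cache" fails.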
[23:28:26] toodles hashar
[23:48:01] I'd like to go on the record that I hate PEP8, 80 character lines are for suckers
[23:52:58] twentyafterfour: does that not make your code better?
[23:54:05] short lines make code better? Not necessarily. Often it makes code worse, because it forces me to do stupid things
[23:54:43] most of pep8 is ok but wtf is wrong with #comment or ## comment
[23:55:11] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[23:56:11] and I haven't used an 80 column editor since wide-screen monitors became standard and unavoidable. Even if I run my screen in portrait orientation it's a lot more than 80 cols.
[23:56:36] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[23:56:43] pep8 related https://youtu.be/wf-BqAjZb8M
[23:57:07] (my comment was sarcasm, by the by :))
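For reference, the two complaints here map to specific pep8 (now pycodestyle) checks: line length is E501, and the comment styles twentyafterfour mentions are E265/E266. A quick reproduction, assuming the pep8 tool is installed:

    $ printf '#comment\n## comment\nx = 1\n' > t.py
    $ pep8 t.py
    t.py:1:1: E265 block comment should start with '# '
    t.py:2:1: E266 too many leading '#' for block comment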