[00:05:16] marxarelli: Hi it says this is down integration-trusty-1026 [00:05:19] ^^ [00:06:50] (03PS1) 10Paladox: [GoogleMaps] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280359 [00:07:06] Or legoktm or jzerebecki: Or Krinkle PROBLEM - Host integration-trusty-1026 is DOWN: CRITICAL - Host Unreachable (10.68.17.98) [00:07:55] paladox: strange. i never created a trusty-1026 [00:08:41] This is normal when a host is first created or shortly after deletion. [00:09:05] but i never created it or deleted it :) [00:09:17] marxarelli: Yes strange since that is not shown in https://integration.wikimedia.org/ci/ and i have seen those errors related to that over the last few days. [00:09:42] i did recreate a trusty-1025 instance, however [00:09:54] It was never created or is not publicly viewable since i saw that error a few days ago and so i looked on https://integration.wikimedia.org/ci/ and it did not show so strange [00:12:19] oh! hashar created it on 2/11 [00:12:43] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL#2016-02-11 [00:13:27] (03PS1) 10Paladox: [GooglePlusOne] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280362 [00:14:13] so perhaps it's still around in ldap? [00:14:15] hrm. [00:14:30] Oh [00:17:00] marxarelli: I think they were created because we switched to php 5.5 and then alot of tests went through causing zuul to fill up taking hours after pressing c+2 for patches to merge. [00:18:50] paladox: i see. the instance has since been deleted (over a month ago) so it shouldn't be cause for worry. i've asked in #-labs about the ldap records [00:19:06] Oh ok, thanks. [00:20:56] (03PS1) 10Paladox: [GroupsSidebar] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280365 [00:38:12] (03CR) 10Krinkle: [C: 04-1] "Pushing commits is only allowed by authorised accounts. Jenkins job runners do not and must not have that authorisation." 
[integration/config] - https://gerrit.wikimedia.org/r/279738 (owner: Paladox)
[00:48:54] (CR) Paladox: "Yes, but wouldn't it fail if it tries to push? So we can use the script to generate the files we need and push to doc.wikimedia.org." [integration/config] - https://gerrit.wikimedia.org/r/279738 (owner: Paladox)
[01:26:39] (CR) Krinkle: ""sync-gh-pages.sh" is specifically for the purpose of building the demos and pushing them to gh-pages." [integration/config] - https://gerrit.wikimedia.org/r/279738 (owner: Paladox)
[01:52:46] Continuous-Integration-Infrastructure: Switch Gerrit submit type of integration repos to merge if necessary - https://phabricator.wikimedia.org/T131008#2159779 (Krinkle) I agree with @Legoktm. While unintended side-effects from unconflicted merges are unlikely. There is no question that deploying something w...
[02:02:23] Beta-Cluster-Infrastructure, Staging, DBA, Collaboration-Team-Current, and 3 others: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#2159788 (Mattflaschen) @demon The script should work now. Do you want to pair again on this, or should I just do it?
[02:09:48] Release-Engineering-Team, Developer-Relations, Team-Practices, User-greg: Set up Code Review office hours - https://phabricator.wikimedia.org/T128371#2159792 (Dereckson)
[02:28:44] Release-Engineering-Team, Developer-Relations, Team-Practices, User-greg: Set up Code Review office hours - https://phabricator.wikimedia.org/T128371#2159817 (greg) Thanks @dereckson
[03:05:42] PROBLEM - Host deployment-mediawiki01 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:26] MediaWiki-Codesniffer, Patch-For-Review: Position of boolean operators inside an if condition - https://phabricator.wikimedia.org/T116561#2159879 (Aashaka) I understand. At that time, I had forgotten that Reviewer-bot asked reviewers to do patch review too.
[03:23:25] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #1027: 04FAILURE in 41 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce/1027/ [04:23:20] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #766: 04FAILURE in 31 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/766/ [06:34:31] PROBLEM - Puppet run on integration-slave-trusty-1025 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [06:46:06] RECOVERY - Puppet run on deployment-ores-web is OK: OK: Less than 1.00% above the threshold [0.0] [06:50:55] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce build #773: 04FAILURE in 25 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce/773/ [07:14:39] RECOVERY - Puppet run on integration-slave-trusty-1025 is OK: OK: Less than 1.00% above the threshold [0.0] [07:29:34] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [07:35:35] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [07:40:33] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [07:45:51] blah good morning [07:46:16] (03PS3) 10Nikerabbit: Utility script for trimming i18n files. [tools/code-utils] - 10https://gerrit.wikimedia.org/r/190825 (owner: 10Daniel Kinzler) [07:47:44] (03CR) 10Hashar: "James: maybe you will be interested in porting such i18n feature to the banana lint checker?" [tools/code-utils] - 10https://gerrit.wikimedia.org/r/190825 (owner: 10Daniel Kinzler) [07:48:29] (03CR) 10Nikerabbit: [C: 032] Utility script for trimming i18n files. 
[tools/code-utils] - https://gerrit.wikimedia.org/r/190825 (owner: Daniel Kinzler)
[07:51:34] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer
[07:56:34] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[08:41:01] zeljkof: and I learned about rspec "subject" :D
[08:42:25] hashar: I will have to refresh my rspec-fu :)
[08:42:32] (CR) Hashar: "Next patch will be way nicer" (2 comments) [selenium] - https://gerrit.wikimedia.org/r/280283 (owner: Hashar)
[08:47:05] (PS3) Hashar: (WIP) Test rake task (WIP) [selenium] - https://gerrit.wikimedia.org/r/280283
[08:47:10] zeljkof: ^^^^
[08:48:28] the thing I was struggling with was to override MediawikiSelenium.load_default() which is invoked by the Rakefile
[08:48:52] spent like 2 hours figuring out a very bad solution and Dan pointed out I could simply use: expect(MediawikiSelenium::Environment).to receive(:load_default).and_return(env)
[08:48:53] :D
[08:49:11] looks good!
[08:49:17] (where 'env' is a hash of default settings merged with some context hash) [08:49:41] so one can easily inject additional env/settings with: let(:extra_env) {  "SOME_ENV": "value" } [08:49:49] it { is_magic() } [08:51:41] zeljkof: also I got some comment on PS7 which are not addressed https://gerrit.wikimedia.org/r/#/c/275820/7..9/lib/mediawiki_selenium/rake_task.rb,cm :D [08:52:06] we probably want to squash my lame rspec test into your change [08:52:14] can do it myself if you want [08:52:23] hashar: go ahead [08:52:30] just make sure the tests are green ;) [08:54:35] (03Abandoned) 10Hashar: (WIP) Test rake task (WIP) [selenium] - 10https://gerrit.wikimedia.org/r/280283 (owner: 10Hashar) [08:54:40] (03PS10) 10Hashar: Provide Rake task to serve as a CI entrypoint [selenium] - 10https://gerrit.wikimedia.org/r/275820 (https://phabricator.wikimedia.org/T128190) (owner: 10Zfilipin) [08:55:06] (03CR) 10Hashar: "Squashed in the rspec I wrote with the help of Dan on https://gerrit.wikimedia.org/r/#/c/280283/" [selenium] - 10https://gerrit.wikimedia.org/r/275820 (https://phabricator.wikimedia.org/T128190) (owner: 10Zfilipin) [08:55:11] {done} [08:59:06] PROBLEM - Free space - all mounts on deployment-fluorine is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine.diskspace._srv.byte_percentfree (<33.33%) [09:05:12] zeljkof: and I am refactoring to have the rake file inject the test_directory relatively to the Rakefile [09:05:22] cool [09:17:36] bah [09:17:47] undefined method `test_directory=' for # [09:18:01] but I did define that method via: attr_accessor :test_directory [09:24:43] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:31] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29510 bytes in 0.522 second response time [09:33:45] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #763: 
04FAILURE in 12 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/763/ [09:35:32] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL - No data received from host [09:37:01] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #818: 04FAILURE in 1.5 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/818/ [09:38:24] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL - No data received from host [09:39:17] probably my fault ^ [09:39:40] (load testing http connection pooling) [09:59:57] \O/ [10:01:51] dcausse: if you stress load varnish, the varnish cache can route traffic to a dedicated app server [10:02:12] dcausse: one would set the header X-Wikimedia-Security-Audit: 1 , and that will make Varnish to route the traffic to deployment-mediawiki03 [10:02:33] which is not serving any traffic beside requests flagged with that header [10:02:35] hashar: I used X-Wikimedia-Debug: true [10:02:45] I am not sure how that one behave on beta cluster [10:03:03] oh [10:03:16] dcausse: X-Wikimedia-Debug ends up being routed to deployment-mediawiki01 [10:03:35] which is serving regular traffic together with deployment-mediawiki02 [10:04:12] so in short: default => [01, 02] , X-Wikimedia-Debug => [ 01 ], X-Wikimedia-Security-Audit => [ 03 ] [10:04:30] (found via hieradata/labs.yaml ) [10:04:34] hashar: thanks (looking) [10:07:46] dcausse: it really depends on what you want to test ;-D if you are overloading the varnish cache, there is little you can do :( [10:08:27] hashar: I wanted to test http connection pooling between mediawiki and elastic [10:08:36] but looks like I broke something else [10:11:29] PROBLEM - Free space - all mounts on deployment-sentry2 is CRITICAL: CRITICAL: deployment-prep.deployment-sentry2.diskspace._var.byte_percentfree (<10.00%) 
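Annotation: the backend routing summarised above (default => [01, 02], X-Wikimedia-Debug => [01], X-Wikimedia-Security-Audit => [03]) can be sketched as a small lookup. This is only a hypothetical Python illustration, not the actual Varnish configuration or the contents of hieradata/labs.yaml. The function name `backends_for` and the precedence used when both headers are present are assumptions made for the example.

```python
# Hypothetical sketch of the beta cluster backend selection described in the log.
# Host names come from the conversation; everything else is illustrative.
DEFAULT_BACKENDS = ["deployment-mediawiki01", "deployment-mediawiki02"]
DEBUG_BACKENDS = ["deployment-mediawiki01"]
SECURITY_AUDIT_BACKENDS = ["deployment-mediawiki03"]


def backends_for(headers):
    """Return the app servers a request could be routed to, given its headers."""
    # HTTP header names are case-insensitive, so compare in lower case.
    names = {name.lower() for name in headers}
    # Assumption: the security-audit header wins if both headers are set.
    if "x-wikimedia-security-audit" in names:
        return SECURITY_AUDIT_BACKENDS
    if "x-wikimedia-debug" in names:
        return DEBUG_BACKENDS
    return DEFAULT_BACKENDS


if __name__ == "__main__":
    print(backends_for({}))
    print(backends_for({"X-Wikimedia-Debug": "true"}))
    print(backends_for({"X-Wikimedia-Security-Audit": "1"}))
```

Under this reading, a load test sent with X-Wikimedia-Debug still lands on deployment-mediawiki01, which also serves regular traffic, while X-Wikimedia-Security-Audit isolates the test on deployment-mediawiki03.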
[10:16:55] hashar: Hi could you merge https://gerrit.wikimedia.org/r/#/c/279529/ and https://gerrit.wikimedia.org/r/#/c/279709/ and https://gerrit.wikimedia.org/r/#/c/280236/ please. [10:17:02] PROBLEM - Puppet run on deployment-ores-web is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [10:22:30] !log cherry-picking 280403 to beta puppetmaster and manually running puppet agent in deployment-ores-web [10:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:24:25] (03PS11) 10Paladox: Fix dirty VisualEditor submodule [integration/config] - 10https://gerrit.wikimedia.org/r/262432 (https://phabricator.wikimedia.org/T121479) [10:24:53] (03CR) 10Paladox: "Ok done." [integration/config] - 10https://gerrit.wikimedia.org/r/262432 (https://phabricator.wikimedia.org/T121479) (owner: 10Paladox) [10:29:59] hmm, this error is strange: E: Version '3.0.3-1' for 'scap' was not found [10:31:02] hashar: Ive almost gone through half of the extensions i think converting them to npm. I think soon we will have completed the migration to npm. [10:35:54] ladsgroup@deployment-ores-web:~$ apt-cache madison scap [10:35:54] scap | 3.1.0-1 | http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages [10:35:54] scap | 3.1.0-1 | http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main Sources [10:36:50] should we change this in puppet and use scap 3.1 instead? or add 3.0.3-1 to keep compatibility? [10:37:01] any of releng team around? 
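Annotation: the error above (E: Version '3.0.3-1' for 'scap' was not found) is consistent with the apt-cache madison output, which lists only 3.1.0-1. A minimal Python sketch of that comparison follows. It parses the madison table exactly as quoted in the log; the function names are illustrative and not part of any Wikimedia tooling.

```python
# Sketch: check whether a pinned package version appears in `apt-cache madison` output.
# The sample text is copied from the log; the helper names are hypothetical.
MADISON_OUTPUT = """\
scap | 3.1.0-1 | http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages
scap | 3.1.0-1 | http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main Sources
"""


def available_versions(madison_output, package):
    """Collect the distinct versions listed for `package`."""
    versions = set()
    for line in madison_output.splitlines():
        # Each row has the form: package | version | source
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 3 and fields[0] == package:
            versions.add(fields[1])
    return versions


def can_install(madison_output, package, pinned_version):
    """True if the pinned version is offered by the configured repositories."""
    return pinned_version in available_versions(madison_output, package)


if __name__ == "__main__":
    print(can_install(MADISON_OUTPUT, "scap", "3.0.3-1"))  # False
    print(can_install(MADISON_OUTPUT, "scap", "3.1.0-1"))  # True
```

This matches the two options raised in the channel: either bump the version pinned in puppet to 3.1.0-1, or publish 3.0.3-1 again so the existing pin resolves.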
[10:37:12] (03PS1) 10Paladox: [HashTables] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280410 [10:37:31] hashar: hey, ^ I'm trying to make ores in beta (so we have ores.beta.wmflabs.org) [10:37:53] the worker is in sca01, we have a web node and a redis instance [10:37:59] as ops told us [10:50:19] 10Browser-Tests-Infrastructure, 13Patch-For-Review: Simplify creating of Jenkins jobs for running browser tests daily - https://phabricator.wikimedia.org/T128190#2160339 (10zeljkofilipin) [10:56:48] 10Deployment-Systems, 3Scap3, 10scap: Update Debian Package for Scap3 v3.1.0 - https://phabricator.wikimedia.org/T130902#2160348 (10mobrovac) 5Resolved>3Open a:5fgiunchedi>3mobrovac Reopening as it needs a version bump in `ops/puppet` too. [10:57:32] 10Deployment-Systems, 3Scap3, 10scap: Update Debian Package for Scap3 v3.1.0 - https://phabricator.wikimedia.org/T130902#2160355 (10mobrovac) [10:57:34] 10Deployment-Systems, 3Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2160354 (10mobrovac) [10:58:27] lunch [11:04:01] !log cherry-picked 280413/1 in beta puppetmaster, manually running puppet agent in deployment-ores-web [11:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [11:24:38] (03CR) 10Nikerabbit: "Where's Jenkins?" [tools/code-utils] - 10https://gerrit.wikimedia.org/r/190825 (owner: 10Daniel Kinzler) [11:29:05] (03PS1) 10Paladox: [HeaderTabs] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280417 [11:29:49] (03CR) 10Paladox: "Need A C+2 again." [tools/code-utils] - 10https://gerrit.wikimedia.org/r/190825 (owner: 10Daniel Kinzler) [11:33:04] (03CR) 10Paladox: "Needs v+2 since this repo only is running php53lint in jenkins. Maybe php53lint is not allowed in gate and submit. 
Maybe adding composer w" [tools/code-utils] - 10https://gerrit.wikimedia.org/r/190825 (owner: 10Daniel Kinzler) [11:33:33] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [11:46:40] (03PS1) 10Paladox: [HelpCommons] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280427 [11:52:53] (03PS1) 10Paladox: [HidePrefix] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280429 [11:53:06] hashar: Hi [11:57:03] Project browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #437: 04FAILURE in 2.9 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/437/ [12:01:09] Project selenium-RelatedArticles » chrome,beta-desktop,Linux,,contintLabsSlave && UbuntuTrusty build #9: 04FAILURE in 9.1 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-desktop,PLATFORM=Linux,VERSION=,label=contintLabsSlave%20&&%20UbuntuTrusty/9/ [12:01:11] Project selenium-RelatedArticles » chrome,beta-mobile,Linux,,contintLabsSlave && UbuntuTrusty build #9: 04FAILURE in 11 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,VERSION=,label=contintLabsSlave%20&&%20UbuntuTrusty/9/ [12:16:58] !log deployment-prep restarting varnish on deployment-cache-text04 [12:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [12:18:24] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 40562 bytes in 0.864 second response time [12:20:34] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29514 bytes in 4.449 second response time [12:24:01] 10Continuous-Integration-Config, 10Dumps-Generation, 6Operations, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should 
pass flake8 - https://phabricator.wikimedia.org/T114249#2160482 (10ArielGlenn) All files in production use are now in operations/dumps/xmldumpsbackup (not the subdirec... [12:34:39] paladox: sorry busy doing ruby/rake/rspec stuff :d [12:34:52] hashar: Oh ok. [12:35:14] zeljkof: got a patch for Environment to look up environments.yaml in a given dir [12:35:19] it is totally bad though [12:38:21] zeljkof: my idea is when creating the Environment, to be able to pass the base directory from which to look up for environments.yml , which is then also passed to cucumber for it look up features [12:38:33] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [12:40:34] hashar: will take a look [12:40:44] just to finish something MMV related [12:44:34] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [12:52:12] (03PS1) 10Paladox: [BaseHooks] Add jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/280439 [12:56:47] Project beta-scap-eqiad build #95985: 04FAILURE in 2 min 1 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/95985/ [13:05:56] Yippee, build fixed! [13:05:57] Project beta-scap-eqiad build #95986: 09FIXED in 1 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/95986/ [13:09:32] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [13:17:22] hashar: Hi is it possible for ci-jessie-wikimedia-* to have three nodes like integration-slave-trusty-* have. [13:17:25] please [13:17:33] PROBLEM - Host cache-rsync is DOWN: CRITICAL - Host Unreachable (10.68.23.165) [13:17:35] paladox: what do you mean? [13:17:51] there is a pool of ten of thems [13:18:11] hashar: Well they have 1 for ci-jessie but three for trusty. 
https://integration.wikimedia.org/ci/ [13:20:06] (03PS1) 10Paladox: [HostStats] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280441 [13:20:33] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [13:20:37] paladox: there are ten ci-jessie-* instances right now [13:21:24] hashar: Oh, i mean 3 per instances like trusty have. Or is it because ci-jessie has pool which is different to how we do it for trusty. [13:25:25] hashar: There might be a new windows 10 build today since it is microsoft build in an hour. :) [13:25:33] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [13:26:32] it seems that we are using HHVM 3.12 on production, but 3.6 in beta cluster. What should I do to upgrade it? [13:28:19] gehel: apt-get dist-upgrade [13:28:19] The following packages have been kept back: [13:28:19] hhvm hhvm-dbg hhvm-luasandbox hhvm-tidy hhvm-wikidiff2 [13:29:51] Inst hhvm [3.6.5+dfsg1-1+wm8] (3.12.1+dfsg-1 Wikimedia:14.04/trusty-wikimedia [amd64]) [13:30:54] strange, actually, on deployment-mediawiki01 adn 02 we already have hhvm 3.12, but not on deployment-mediawiki03 [13:31:05] heh [13:31:08] I was just poking 3 [13:31:26] me too, dcausse is smarter than me and checked on 01 ... [13:31:30] Removed some old kernels [13:31:49] I'd be tempted to just dist-upgrade and reboot it then [13:32:00] Reedy: I'm tempted too... [13:32:14] Want me to do it then it's my fault when it breaks? :P [13:32:26] but I think I've reached my quota of broken things for today... [13:32:46] sigh, need to login again to wikitech [13:32:59] I'd be happy to let you do it. You can always say I'm the one who asked you ... [13:33:49] paladox: I dont understand what you mean sorry. 
It is non sense [13:34:15] paladox: there is a pool of 10 instances ci-jessie-wikimedia-* which are deleted whenever a build completes [13:34:40] paladox: and the pool is replenished by Nodepool (which spawn instances on demand to meet the target of 10 available instances) [13:35:08] paladox: and I dont see how Microsoft Windows 10 build relate to that.. [13:35:09] !log deployment-prep upgrade hhvm on deployment-mediawiki03 and reboot [13:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [13:35:18] Reedy: thanks! [13:35:27] hashar: Oh, if you have a look at https://integration.wikimedia.org/ci/ and look at ci-jessie-wikimedia-* it shows 1. idle and then look at integration-slave-trusty-* which has 1. idle 2. idle 3. idle [13:36:00] hashar: Yes but would it be possible to one 3 per pool. [13:39:26] gehel: There's a few other boxes on 3.6.5 too [13:39:41] root@deployment-salt:/home/reedy # salt '*' cmd.run 'hhvm --version' | grep HipHop [13:39:41] HipHop VM 3.12.1 (rel) [13:39:41] HipHop VM 3.6.5 (rel) [13:39:41] HipHop VM 3.6.5 (rel) [13:39:41] HipHop VM 3.6.5 (rel) [13:39:42] HipHop VM 3.12.1 (rel) [13:39:44] HipHop VM 3.12.1 (rel) [13:39:46] HipHop VM 3.6.5 (rel) [13:39:48] HipHop VM 3.12.1 (rel) [13:41:08] Reedy: hhvm is still 3.6 on teribum (prod) so maybe good to keep some nodes with 3.6 on beta ? 
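Annotation: the filtered salt output above mixes two HHVM versions across the beta cluster hosts. A short Python sketch for tallying such output is given here. The sample lines are copied from the log, and the helper name `tally_hhvm_versions` is hypothetical.

```python
# Sketch: count HHVM versions in the grep-filtered salt output quoted in the log.
from collections import Counter

SALT_GREP_OUTPUT = """\
HipHop VM 3.12.1 (rel)
HipHop VM 3.6.5 (rel)
HipHop VM 3.6.5 (rel)
HipHop VM 3.6.5 (rel)
HipHop VM 3.12.1 (rel)
HipHop VM 3.12.1 (rel)
HipHop VM 3.6.5 (rel)
HipHop VM 3.12.1 (rel)
"""


def tally_hhvm_versions(output):
    """Return a Counter mapping version string to number of hosts reporting it."""
    counts = Counter()
    for line in output.splitlines():
        parts = line.split()
        # Expected shape: "HipHop VM <version> (rel)"
        if len(parts) >= 3 and parts[:2] == ["HipHop", "VM"]:
            counts[parts[2]] += 1
    return counts


if __name__ == "__main__":
    for version, hosts in sorted(tally_hhvm_versions(SALT_GREP_OUTPUT).items()):
        print(version, hosts)
```

Because the grep drops the host names, this only shows how many hosts are behind, not which ones. The follow-up command with -B 1 in the log keeps the host line for that purpose.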
[13:41:21] tmh01 [13:41:46] root@deployment-salt:/home/reedy # salt '*' cmd.run 'hhvm --version' | grep "HipHop VM 3\.6" -B 1 [13:41:46] deployment-tin.deployment-prep.eqiad.wmflabs: [13:41:46] HipHop VM 3.6.5 (rel) [13:41:46] -- [13:41:46] deployment-tmh01.deployment-prep.eqiad.wmflabs: [13:41:47] HipHop VM 3.6.5 (rel) [13:41:49] -- [13:41:51] mira.deployment-prep.eqiad.wmflabs: [13:41:54] HipHop VM 3.6.5 (rel) [13:41:56] -- [13:41:58] deployment-puppetmaster.deployment-prep.eqiad.wmflabs: [13:42:00] HipHop VM 3.6.5 (rel) [13:42:15] looks like Debian unattended upgrade is no more enabled [13:42:20] it is supposed to magically upgrade package for us [13:42:26] hashar: They're dist-upgrade upgrades [13:42:41] paladox: I finally understood. You were asking for the ci-jessie-wikimedia nodes to have 3 executors instead of just 1 [13:42:42] The unattended upgrade just meant a fucktonne of kernels were installed [13:43:06] * Reedy checks if tmh can be upgraded [13:43:07] paladox: the answer is no. The ci-jessie-wikimedia are put offline and deleted as soon as a build is complete so it makes no sense to have more than 1 executor on them [13:43:31] Yeah, I'll do that too then [13:43:36] hashar: Oh, but if there are three woulden that increase the reasources for them. [13:43:38] paladox: we also want the build to start on a fresh / empty environment. So if you had two executors, the second build might have an environment corrupted by the first build [13:43:52] hashar: Oh ok [13:44:07] Reedy: kernels are auto upgraded via a specific apt.conf / different from unattended upgrade [13:44:16] hashar: ah. 
still stupid :)
[13:45:07] including the puppet class apt::unattendedupgrades would enable it
[13:45:15] not sure why it is not (or no more) the case
[13:45:24] modules/base/manifests/labs.pp: include apt::unattendedupgrades,
[13:46:23] dcausse: Only obvious reason I could see terbium not being upgraded is due to some issue with hhvm running the cronjobs
[13:46:45] Or ops haven't got round to it
[13:46:53] Reedy: I have no idea :/
[13:47:10] hashar: Question: As php is installed on nodepool, it doesn't matter currently that it runs php 5.6; that only matters when we are converting our other tests that need to use php 5.5, like qunit tests and unit tests. But would we be able to enable nodepool to find the php dir and use it? We could install composer manually per test by using composer.phar for now.
[13:47:12] * Reedy asks opsen
[13:48:17] !log deployment-prep upgrade hhvm on deployment-mediawiki01 and reboot
[13:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:48:20] ffs
[13:48:30] !log deployment-prep Make that deployment-tmh01
[13:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:51:03] hashar: How would we make the php dir visible? We can install php5.5 sometime later.
[13:52:14] Deployment-Systems, Scap3, scap, Patch-For-Review: Update Debian Package for Scap3 v3.1.0 - https://phabricator.wikimedia.org/T130902#2160579 (mobrovac) p:Triage>High
[13:52:47] Reedy: thanks for those upgrades!
[13:52:55] (PS1) Paladox: [Hovergallery] Add npm test and composer-test [integration/config] - https://gerrit.wikimedia.org/r/280444
[13:53:21] mira, deployment-puppetmaster and deployment-tin to be done
[13:53:31] Why has the puppetmaster got hhvm?
[13:53:33] lol
[13:54:03] Reedy: someone is probably trying to rewrite puppet in php instead of ruby ...
[13:54:24] phppet
[13:57:10] PROBLEM - Host integration-dev is DOWN: CRITICAL - Host Unreachable (10.68.17.81)
[14:01:29] dcausse: gehel: Terbium wasn't upgraded due to long running scripts still running
[14:01:51] So just needs a heads up, and they'll upgrade it
[14:01:54] Reedy: thanks for all the others...
[14:02:26] Reedy: oh ok, thanks
[14:04:57] So we can upgrade it if we want
[14:04:57] (on labs)
[14:04:57] Reedy: (unrelated), I think we're ok to bump elastica version to 2.3.1, maybe you can rebase your patch and we'll merge it next week after the branch cut (so we'll have 1 week to test)
[14:04:57] I rebased it yesterday
[14:04:57] oh thanks
[14:04:57] :)
[14:04:57] We've only just branched ;)
[14:05:10] It'll probably be James_F|Away that'll cause me to have to rebase again
[14:05:22] :)
[14:05:37] https://gerrit.wikimedia.org/r/#/c/260159/
[14:08:41] Reedy: I need to +2 this one but also this one: https://gerrit.wikimedia.org/r/#/c/279998/ right?
[14:08:56] Yup, exactly
[14:09:02] ok
[14:33:38] hey, I'm so stuck, I'm adding this patch to the puppetmaster of beta and I run it in one of the targets using "sudo puppet agent --test --verbose" (I opened permission of files so I can test with my account, running the scap under "deploy-service" user directly is not easy)
[14:33:38] it fails with this
[14:33:50] Error: Execution of '/usr/bin/deploy-local --repo ores/deploy -D log_json:False' returned 70: http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git
[14:34:03] then everything fails as failed dependency
[14:34:19] but "/usr/bin/deploy-local --repo ores/deploy -D log_json:False" works fine for me!
[14:37:15] Computers suck [14:38:05] oh I forgot the patch: https://gerrit.wikimedia.org/r/#/c/280403/ [14:40:51] (03PS11) 10Hashar: Provide Rake task to serve as a CI entrypoint [selenium] - 10https://gerrit.wikimedia.org/r/275820 (https://phabricator.wikimedia.org/T128190) (owner: 10Zfilipin) [14:42:07] (03CR) 10jenkins-bot: [V: 04-1] Provide Rake task to serve as a CI entrypoint [selenium] - 10https://gerrit.wikimedia.org/r/275820 (https://phabricator.wikimedia.org/T128190) (owner: 10Zfilipin) [14:43:41] (03CR) 10Hashar: "Points from PS7 got addressed" (033 comments) [selenium] - 10https://gerrit.wikimedia.org/r/275820 (https://phabricator.wikimedia.org/T128190) (owner: 10Zfilipin) [14:44:51] (03CR) 10Hashar: "PS11 make the rake task to no more depends on relative paths and normalize them based on Rake.original_dir . So one can run 'bundle exec r" [selenium] - 10https://gerrit.wikimedia.org/r/275820 (https://phabricator.wikimedia.org/T128190) (owner: 10Zfilipin) [14:47:45] :( [15:00:58] 10Beta-Cluster-Infrastructure, 6Operations, 7Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#2160841 (10Nikerabbit) [15:06:08] hashar: It seems to be slow at https://integration.wikimedia.org/zuul/ [15:07:15] Project selenium-MultimediaViewer » firefox,mediawiki,Linux,,contintLabsSlave && UbuntuTrusty build #1: 04FAILURE in 2 min 47 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=mediawiki,PLATFORM=Linux,VERSION=,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [15:07:35] Project selenium-MultimediaViewer » internet_explorer,beta,Windows 7,9.0,contintLabsSlave && UbuntuTrusty build #1: 04FAILURE in 3 min 7 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=internet_explorer,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Windows%207,VERSION=9.0,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [15:07:45] Project selenium-MultimediaViewer » 
internet_explorer,beta,Windows 7,11.0,contintLabsSlave && UbuntuTrusty build #1: 04FAILURE in 3 min 17 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=internet_explorer,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Windows%207,VERSION=11.0,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [15:07:46] Project selenium-MultimediaViewer » internet_explorer,beta,Windows 8.1,11.0,contintLabsSlave && UbuntuTrusty build #1: 04FAILURE in 3 min 17 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=internet_explorer,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Windows%208.1,VERSION=11.0,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [15:07:46] Project selenium-MultimediaViewer » internet_explorer,beta,Windows 8,10.0,contintLabsSlave && UbuntuTrusty build #1: 04FAILURE in 3 min 17 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=internet_explorer,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Windows%208,VERSION=10.0,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [15:09:18] zeljkof: can you turn off irc notifications for that job until it's stable/your done testing? [15:09:25] you're* [15:09:25] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:44] Project selenium-MultimediaViewer » firefox,beta,Linux,,contintLabsSlave && UbuntuTrusty build #1: 04FAILURE in 5 min 16 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,VERSION=,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [15:10:26] greg-g: spoil sport [15:11:04] greg-g: argh, will do [15:11:14] :) [15:11:25] PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:26] I did that already, but had to recreate job and then forgot to manually turn of irc [15:11:32] I jsut remember yesterday(?) 
where it was spamming in an uholy way :) [15:11:37] zeljkof: ah, gotcha [15:11:39] there was a big storm I think yesterday [15:12:38] it is getting better, MMV job used to create 100 or so child jobs, now just 8 :) [15:13:08] soooo much better :) [15:13:14] that'll be better anyways then [15:14:16] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 40223 bytes in 1.054 second response time [15:17:32] 10Browser-Tests-Infrastructure, 13Patch-For-Review: Simplify creating of Jenkins jobs for running browser tests daily - https://phabricator.wikimedia.org/T128190#2160948 (10zeljkofilipin) [15:21:02] 10Deployment-Systems, 3Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2160960 (10mobrovac) [15:25:03] PROBLEM - Puppet run on deployment-stream is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:26:08] Project beta-scap-eqiad build #96000: 04FAILURE in 1 min 22 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96000/ [15:26:12] Project selenium-MultimediaViewer » chrome,beta,OS X 10.9,,contintLabsSlave && UbuntuTrusty build #1: 04FAILURE in 21 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,VERSION=,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [15:26:29] PROBLEM - Puppet run on deployment-kafka02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:35:34] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [15:35:56] Yippee, build fixed! 
[15:35:57] Project beta-scap-eqiad build #96001: FIXED in 1 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96001/
[15:37:36] !log Gerrit has trouble sending emails T131189
[15:37:37] T131189: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189
[15:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:48:10] Deployment-Systems, Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2160995 (mmodell) Open>Resolved a:mmodell
[15:51:27] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,,contintLabsSlave && UbuntuTrusty build #1: FAILURE in 46 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,VERSION=,label=contintLabsSlave%20&&%20UbuntuTrusty/1/
[15:51:31] Deployment-Systems, Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2160998 (mmodell) a:mmodell>thcipriani
[15:55:26] Continuous-Integration-Infrastructure, Regression: Jobs sometimes fail with "/usr/local/bin/npm: No such file or directory" - https://phabricator.wikimedia.org/T129617#2161025 (hashar) p:Unbreak!>High Lowering priority. Having an outdated npm version is currently not a concern for the repositories...
[16:00:32] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[16:04:31] Deployment-Systems, Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2161044 (mobrovac) Resolved>Open Reopening as the provider bit is still missing.
[16:05:03] RECOVERY - Puppet run on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:33] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [16:11:29] RECOVERY - Puppet run on deployment-kafka02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:16:32] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [16:19:18] 3Scap3, 6Labs, 10Tool-Labs, 13Patch-For-Review: Setup a proper deployment strategy for Kubernetes - https://phabricator.wikimedia.org/T129311#2101987 (10mmodell) p:5Triage>3High [16:19:57] 3releng-201617-q4, 10Scap3 (Scap3-MediaWiki-MVP): Use scap3's canary deploys for MediaWiki - https://phabricator.wikimedia.org/T131120#2161062 (10mmodell) [16:20:05] 3releng-201617-q4, 10Scap3 (Scap3-MediaWiki-MVP): Use scap3's canary deploys for MediaWiki - https://phabricator.wikimedia.org/T131120#2156589 (10mmodell) p:5Triage>3Normal [16:20:26] 3Scap3: Support multiple service restart by supporting one service_name per service group - https://phabricator.wikimedia.org/T130361#2133742 (10mmodell) p:5Triage>3Normal [16:20:35] 3Scap3: Scap3 should support virtualenv for deployment of python packages - https://phabricator.wikimedia.org/T130205#2161082 (10mmodell) p:5Triage>3Normal [16:20:55] 10Deployment-Systems, 3Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2161084 (10mmodell) p:5Triage>3Normal [16:21:08] 10Deployment-Systems, 3Scap3: Scap3 Service Restart Permissions - https://phabricator.wikimedia.org/T129897#2161086 (10mmodell) p:5Triage>3Normal [16:21:31] 10Deployment-Systems, 3Scap3, 10scap: Scap3 checks.yaml should be environment specific - https://phabricator.wikimedia.org/T130558#2161089 (10mmodell) p:5Triage>3Normal [16:24:43] Project beta-code-update-eqiad build #98091: 04FAILURE in 1 min 43 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/98091/ [16:32:31] paladox: off again. 
Will process your npm/composer related changes in bulk tomorrow morning ;-} [16:32:51] hashar: Ok thanks. [16:33:11] hashar: parsoid patch that updates npm to 4.3 and removes npm 0.8 [16:33:27] paladox: they still want 0.8 / 0.10 iirc [16:33:51] hashar: I asked them on IRC and they said they are using npm 4.3 locally so we can drop npm 0.8 [16:33:56] They don't need it any more [16:34:43] Yippee, build fixed! [16:34:44] Project beta-code-update-eqiad build #98092: FIXED in 1 min 43 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/98092/ [16:35:05] hashar: https://gerrit.wikimedia.org/r/#/c/279529/ comment 2. [16:36:07] paladox: great ;-) [16:36:11] will be for tomorrow! I am off [16:36:13] ;- [16:36:18] hashar: Ok :) [17:19:14] how do I add a group on deployment-prep? ldap...something? [17:21:12] puppet [17:21:44] modules/admin/files/data.yaml? [17:22:13] no [17:22:19] deployment-prep, mobrovac [17:22:22] i.e. not production [17:22:24] so no admin module [17:22:44] ah right [17:23:06] doesn't jenkins-deploy have some setup like this? maybe check how that works [17:29:02] Amir1: deployment-ores-web sshd[11438]: Invalid user service-deploy from 10.68.17.240 so there are a couple of problems. [17:30:17] thcipriani: how can we fix that? [17:30:31] deploy inside the target fails too [17:30:43] I'm trying to get that worked out [17:31:41] if you use service-deploy as your ssh user, you need to: add the service-deploy user to the target, make sure that the public key is in that user's authorized_keys file [17:32:59] if you want to use keyholder, you'll need to add the private half of the public key used in authorized_keys for the service-deploy user to keyholder [17:33:12] thcipriani: https://gerrit.wikimedia.org/r/#/c/280403/ [17:33:18] and make sure you're in a group that has access to use that keyholder key [17:33:19] do you mean something like this?
[17:33:29] I added it to the beta's puppetmaster [17:34:43] * thcipriani catches up [17:37:38] thcipriani: when I log in to the target, do puppet agent, it fails [17:37:53] yeah, I was just looking at that. [17:38:17] saying scap3 failed with status 70 "'/usr/bin/deploy-local --repo ores/deploy -D log_json:False'" [17:38:29] when I do the command under my username it works [17:39:10] but now I am able to run it under deploy-service user name and it gives me the error I'm looking for [17:42:28] (03PS1) 10Dduvall: Support browser version as part of `BROWSER` [selenium] - 10https://gerrit.wikimedia.org/r/280470 (https://phabricator.wikimedia.org/T128190) [17:43:33] (03CR) 10Dduvall: [C: 04-2] "The rake task tests are currently failing but let's wait until I627d0603487ab88e375fe5aa4fca2f8bb2a07790 is merged before attempting to re" [selenium] - 10https://gerrit.wikimedia.org/r/280470 (https://phabricator.wikimedia.org/T128190) (owner: 10Dduvall) [17:43:42] (03CR) 10jenkins-bot: [V: 04-1] Support browser version as part of `BROWSER` [selenium] - 10https://gerrit.wikimedia.org/r/280470 (https://phabricator.wikimedia.org/T128190) (owner: 10Dduvall) [17:44:34] (03CR) 1020after4: "is there a task for this? If you need help with deployment I will do what I can to help out." [integration/config] - 10https://gerrit.wikimedia.org/r/277563 (owner: 10Awight) [17:45:10] Amir1: seems that there may be submodules that ar ein a weird state on deployment-tin [17:45:26] exactly [17:45:31] I think I fixed it now [17:45:35] sweet. [17:45:44] testing [17:46:12] (might have to use --force for deploy-local) [17:46:32] deleted the revs folder [17:46:34] :D [17:46:35] easier [17:46:39] or that [17:46:46] works [17:46:52] running the puppet agent [17:47:10] failed [17:47:37] boo! [17:47:50] deploy-local failed in the puppet run? [17:48:34] https://www.irccloud.com/pastebin/7J7z9pWq/ [17:48:38] yup ^ [17:49:20] running with verbose makes too many lines that even I can't scroll to find. 
(I'd probably use > log.log) [17:49:29] if you need that [17:51:17] Command 'ln -sfT 'deploy-cache/revs/a8f4d6f27cd8bc21893e96e33d7fc0b1d0b01b7c' 'deploy'' returned non-zero exit status 1 [17:51:42] so it can't link the final location :\ [17:52:52] it was a bug [17:53:04] I deleted that part several hours ago [17:53:19] deleted the revs folder and re-ran [17:54:00] It was the last line of checks [17:55:34] Continuous-Integration-Config, Release-Engineering-Team, MediaWiki-extensions-DonationInterface: jjb: run composer install in DonationInterface - https://phabricator.wikimedia.org/T131264#2161549 (mmodell) [17:55:48] (PS4) 20after4: Run composer during wacky Fundraising back-compat [integration/config] - https://gerrit.wikimedia.org/r/277563 (https://phabricator.wikimedia.org/T131264) (owner: Awight) [17:55:54] Amir1: give it a try now [17:56:16] puppet I mean [17:56:22] sure [17:57:25] can I ask what you did? I want to learn [17:57:25] I think this may depend on: https://gerrit.wikimedia.org/r/#/c/279415/ [17:57:26] so /srv/ores was owned by www-data [17:57:46] since scap is configured to use service-deploy, it was trying to link: /srv/ores/deploy-cache/revs/[sha1] to /srv/ores/deploy [17:57:52] as service-deploy [17:58:03] and since /srv/ores was owned by www-data, that failed.
[17:58:35] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [17:58:47] the way I came to find that out is: I did switched to the service-deploy user [17:58:57] (03PS5) 10Greg Grossmeier: Run composer during wacky Fundraising back-compat [integration/config] - 10https://gerrit.wikimedia.org/r/277563 (https://phabricator.wikimedia.org/T131264) (owner: 10Awight) [17:58:59] I already changed ownership of /srv/ores/deploy to service-deploy [17:59:05] in the puppet [17:59:29] failed :( [18:00:17] heh, /srv/ores is owned by www-data again :P [18:00:25] no, I didn't [18:00:29] RECOVERY - Host cache-rsync is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [18:00:34] puppet must be doing it...not sure why [18:00:43] (03CR) 1020after4: "JanZerebecki: this was brought to my attention on today's scrum of scrums. I created a task so that further work can be tracked and to be " [integration/config] - 10https://gerrit.wikimedia.org/r/277563 (https://phabricator.wikimedia.org/T131264) (owner: 10Awight) [18:00:46] https://gerrit.wikimedia.org/r/#/c/280403/10/modules/ores/manifests/base.pp [18:01:04] It's mentioned in puppet [18:01:10] ores pupet [18:01:23] thcipriani: fixing it is easy [18:01:43] but it's something that breaks other instance (if we merge the patch) [18:02:09] hmmm [18:02:36] we are not merging yet [18:02:36] so whatever [18:02:36] :D [18:05:32] PROBLEM - Host cache-rsync is DOWN: CRITICAL - Host Unreachable (10.68.23.165) [18:05:59] so the deploy_user has to be able to manipulate files in under the directory into which it deploys (i.e., so it can make symlinks like /srv/ores/deploy) [18:06:15] not sure the most equitable way to square that is... 
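Editor's note: the ownership problem discussed above (the deploy user must be able to create the `deploy` symlink inside the parent directory) can be illustrated with a minimal sketch. This is not scap's code; the function names `can_write_dir` and `swap_symlink` are hypothetical, and the approach (create a temporary symlink, then rename it over the final name) is only one way to get an effect similar to `ln -sfT`.

```python
import os


def can_write_dir(path: str) -> bool:
    """Return True if the current user may create and replace entries in `path`."""
    return os.access(path, os.W_OK | os.X_OK)


def swap_symlink(target: str, link_path: str) -> None:
    """Point `link_path` at `target`, replacing an existing symlink if present.

    The parent directory of `link_path` must be writable by the calling
    user; ownership of the link target alone is not enough.
    """
    parent = os.path.dirname(os.path.abspath(link_path))
    if not can_write_dir(parent):
        raise PermissionError(f"{parent} is not writable by uid {os.getuid()}")
    # Hypothetical temporary name inside the same directory, so the
    # rename below stays on one filesystem.
    tmp = os.path.join(parent, f".{os.path.basename(link_path)}.tmp-{os.getpid()}")
    os.symlink(target, tmp)
    # rename(2) replaces the old symlink itself rather than following it.
    os.replace(tmp, link_path)
```

Run as the deploy user against a parent directory owned by someone else, `swap_symlink` raises `PermissionError` before touching anything, which mirrors the failure seen here when /srv/ores was owned by www-data.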
[18:10:49] thcipriani: it failed even though /srv/ores is owned by deploy-service [18:11:40] (I made changes in puppet and added it to the puppetmaster) [18:11:57] yup, failing differently now :) [18:12:00] Check 'setup_virtualenv' failed: bash: /srv/ores/ores-wikimedia-config/scap/cmd.sh: No such file or directory [18:12:14] 10Deployment-Systems, 3Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2161648 (10mmodell) So the provider needs to call `deploy-local --init` after cloning the repository. That seems straightforward enough but for some reason I'm afraid that I'm missing something. [18:12:48] also: Check 'restart_worker' failed: Failed to restart celery-ores-worker.service: Unit celery-ores-worker.service failed to load: No such file or directory. [18:15:34] I'm getting this just doing: su deploy-service; /usr/bin/deploy-local -f --repo ores/deploy -D log_json:False by the way [18:15:34] 10Deployment-Systems, 3Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2161666 (10mmodell) doh! totally had that backwards. We need the provider to clone the repo on deployment hosts and then run `deploy --init` on the master.. [18:15:57] thcipriani: fixed now :) [18:16:08] Amir1: \o/ [18:19:05] 10Deployment-Systems, 3Scap3: Create `deploy-init` command for scap3 - https://phabricator.wikimedia.org/T129906#2161697 (10mmodell) So the obvious question without an obvious answer is this: How do we coordinate the master and target? Right now we don't use the provider directly on the master. We now need t... [18:20:58] 6Release-Engineering-Team, 6Developer-Relations, 6Team-Practices, 15User-greg: Set up Code Review office hours - https://phabricator.wikimedia.org/T128371#2161703 (10ksmith) I would click on a poll option along the lines of "I'm in, as an observer/facilitator/advisor", if there were one. 
[18:22:29] why it tries to run worker checks, I defined them not to be in flower (and web) group [18:29:17] thcipriani: the puppet still fails but my guess that's okay since the checks tries to run checks in another group as well and it expectedly fails [18:29:18] (03CR) 10Legoktm: [C: 032] "..." [tools/code-utils] - 10https://gerrit.wikimedia.org/r/190825 (owner: 10Daniel Kinzler) [18:29:59] back to the real issue, how to connect from tin to web nodes [18:30:49] Amir1: indeed. One good way to test keyholder is to use: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l [user] [host] [18:31:08] PROBLEM - Host integration-dev is DOWN: CRITICAL - Host Unreachable (10.68.17.81) [18:31:13] you can also see what keys are available in keyholder that way [18:32:24] ladsgroup@deployment-tin:/srv/deployment/ores/deploy$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service deployment-ores-web [18:32:24] Permission denied (publickey). [18:34:53] hmmm /var/log/auth.log on the target shows Failed publickey for deploy-service from 10.68.17.240 port 37670 ssh2: RSA e6:d0:61:5e:e5:c7:5d:2d:3e:8e:c8:a5:eb:f3:c2:63 [18:36:00] ladsgroup@deployment-tin:/srv/deployment/ores/deploy$ sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l [18:36:00] 4096 /etc/keyholder.d/servicedeploy_rsa (RSA) [18:36:46] looks like they key fingerprint on the ores box doesn't match [18:37:11] ssh-keygen -l -f /etc/ssh/userkeys/deploy-service [18:37:18] 2048 6d:54:92:8b:39:10:f5:9b:84:40:36:ef:3c:9a:6d:d8 deploy-service (RSA) [18:37:57] is it in target or tin? [18:38:14] that's the public key on the target [18:39:34] 4096 e6:d0:61:5e:e5:c7:5d:2d:3e:8e:c8:a5:eb:f3:c2:63 /etc/keyholder.d/servicedeploy_rsa (RSA) [18:39:42] yeah, they don't match [18:39:50] what should we do? 
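Editor's note: the mismatch found above was spotted by comparing the colon-separated fingerprints printed by `ssh-keygen -l -f` and `ssh-add -l`. As a minimal sketch, the same legacy MD5 fingerprint can be computed from a public key line directly. The helper names `md5_fingerprint` and `keys_match` are hypothetical, and the sketch assumes each input line has the plain `type base64-blob [comment]` form, with no authorized_keys options in front.

```python
import base64
import hashlib


def md5_fingerprint(pubkey_line: str) -> str:
    """Return the colon-separated MD5 fingerprint of an OpenSSH public key line."""
    fields = pubkey_line.split()
    # fields[0] is the key type, fields[1] the base64 key blob,
    # anything after that is a free-form comment.
    blob = base64.b64decode(fields[1])
    digest = hashlib.md5(blob).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def keys_match(pubkey_a: str, pubkey_b: str) -> bool:
    """True if both public key lines carry the same key material."""
    return md5_fingerprint(pubkey_a) == md5_fingerprint(pubkey_b)
```

Feeding it the contents of the key file on the target and the `.pub` file used for keyholder would show whether the two sides hold the same key, independent of comments or file names.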
[18:41:12] gotta figure out where both the keys are coming from, get a matching key set in keyholder and on the target [18:43:05] hmm [18:43:08] https://gerrit.wikimedia.org/r/#/c/279198/8 is my attempt to clean up the key management so that it's not spread all over in a bunch of different places within operations/puppet [18:43:24] hopefully that can merge soon [18:43:41] I think we can simply copy paste public key from puppet master key holder into the file in target [18:44:07] Amir1: the public key for service deploy is in private/ssh/tin/servicedeploy_rsa.pub on deployment-puppetmaster looks like [18:44:18] yes [18:44:22] that key [18:44:36] we just copy that into the target [18:44:43] let me give it a try [18:45:07] so should just be: puppet:///private/ssh/tin/servicedeploy_rsa.pub [18:46:20] that's for prod I guess [18:46:45] Project beta-scap-eqiad build #96018: 04FAILURE in 11 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96018/ [18:46:50] and I have no access to that [18:48:36] (03PS1) 10Paladox: [HSTS] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280485 [18:48:48] ^ that beta-scap-eqiad failure is a full disk during sync [18:49:00] * twentyafterfour isn't sure which node is full. investigating [18:49:06] Amir1: you should have sudo access on deployment-prep [18:49:21] yup [18:49:27] and I just copied that key [18:49:29] testing [18:50:25] Agent admitted failure to sign using the key. [18:50:33] fingerprints are the same [18:50:47] I think need to delete that fingerprint from known hosts [18:50:55] Project beta-scap-eqiad build #96019: 04STILL FAILING in 2 min 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96019/ [18:51:35] Host key verification failed. [18:51:45] I think I need to do the yes [18:52:29] Amir1: are you in the group associated with the key? 
keyholder won't sign unless you are in the configured group [18:52:41] Amir1: so the "Agent admitted failure to sign using the key" message is what you get if you aren't in a group that is allowed to use that key according to keyholder. [18:52:46] yeah, what twentyafterfour said :) [18:53:34] I'm not [18:53:45] should I add myself? [18:54:03] yep [18:55:52] thcipriani: what is the name of group? [18:56:43] it's defined in /etc/keyholder-auth.d/deploy-service.yml looks like, "deploy-service" [18:56:51] confusingly enough :) [18:57:13] Project beta-scap-eqiad build #96020: 04STILL FAILING in 2 min 27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96020/ [18:57:19] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Generate code coverage reports for extensions - https://phabricator.wikimedia.org/T71685#2161873 (10Jdlrobson) [18:58:30] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Generate code coverage reports for extensions - https://phabricator.wikimedia.org/T71685#738524 (10Jdlrobson) [19:04:07] stupid VMs with tiny root partitions [19:04:09] thcipriani: I'm in the group now, deploy stil fails [19:04:22] fatal: Access denied for user deploy-service by PAM account configuration [preauth] [19:04:31] in /var/log/auth.log [19:04:47] Amir1: hmmm, you may need to add beta::deployaccess to the target [19:04:58] (this is a beta-only thing) [19:05:11] ok [19:05:14] on it [19:07:02] Project beta-scap-eqiad build #96021: 04STILL FAILING in 2 min 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96021/ [19:07:16] added [19:07:23] rebooting them [19:08:28] why is nutcracker logging like mad on beta mediawiki servers? 
[19:09:24] (03PS1) 10Paladox: [HTMLets] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280492 [19:10:04] Project beta-scap-eqiad build #96022: 04STILL FAILING in 2 min 18 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96022/ [19:12:45] fails with this: [19:12:52] Mar 30 19:11:54 deployment-ores-web sshd[2051]: error: AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1 [19:12:52] Mar 30 19:11:54 deployment-ores-web sshd[2051]: Failed publickey for deploy-service from 10.68.17.240 port 38577 ssh2: RSA 68:e2:51:d4:33:fc:46:3f:9e:a4:25:51:eb:b4:f8:84 [19:12:52] Mar 30 19:11:54 deployment-ores-web sshd[2051]: Connection closed by 10.68.17.240 [preauth] [19:14:14] oh, fingerprint changed again [19:15:35] fixed [19:15:45] the old issue again [19:18:51] Yippee, build fixed! [19:18:52] Project beta-scap-eqiad build #96023: 09FIXED in 4 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/96023/ [19:25:00] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [19:27:36] thcipriani: any ideas? 
[19:27:56] Amir1: sorry, deploying, got distracted [19:28:07] oh, okay [19:28:17] tell me when you have some time [19:28:35] thank you :) [19:28:45] looks like fingerprints are still different between keyholder and /etc/ssh/userkeys/deploy-service [19:29:09] I fixed that [19:29:13] still the same [19:30:48] https://www.irccloud.com/pastebin/56CwWirH/ [19:30:48] ^ [19:30:48] that's crazy, every time we want to reboot the key changes [19:30:48] (PS1) Paladox: [HTMLTags] Add npm test and composer-test [integration/config] - https://gerrit.wikimedia.org/r/280496 [19:31:19] !log deleted some nutcracker and hhvm log files on deployment-mediawiki01 to free space [19:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:37:39] Amir1: ssh works now [19:37:39] public key was different on the target than what was in keyholder [19:37:39] (I updated it manually just now to test if that was it) [19:38:15] very very strange [19:39:13] I copied that from /var/lib/git/labs/private/files/ssh/tin/servicedeploy_rsa.pub [19:39:18] in puppet master [19:39:31] I need to copy that to another server [19:46:12] even though ssh works, scap doesn't [19:46:28] let me check keyholder [19:48:35] what is scap doing? [19:50:39] 19:50:32 ['/usr/bin/deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'flower', 'fetch'] on deployment-ores-web.deployment-prep.eqiad.wmflabs returned [255]: Host key verification failed. [19:51:25] PROBLEM - Host Generic Beta Cluster is DOWN: check_ping: Invalid hostname/address - en.wikipedia.beta.wmflabs.org [19:51:30] thcipriani: ^ [19:52:16] Amir1: give it another try [19:52:36] working [19:52:39] wow [19:52:41] why?
[19:52:41] I just accepted the hostkey as your user [19:53:14] we need to do it in deployment-ores-worker [19:53:20] Amir1: https://phabricator.wikimedia.org/P2835 [19:53:20] sorry [19:53:54] awesome [19:54:03] let me duplivate the work in another target [19:54:08] *duplicate [20:03:42] (03PS1) 10Paladox: [IframePage] Add npm test and composer-test [integration/config] - 10https://gerrit.wikimedia.org/r/280504 [20:04:19] http://en.wikipedia.beta.wmflabs.org/ doesn't resolve anymore .. what is the domain name for it? [20:08:18] subbu: http://en.wikipedia.beta.wmflabs.org/ works for me. And it is a beta version of wikipedia to test things before they go out. [20:08:46] maybe have been a temporary outage then. it resolves for me as well now. [20:09:29] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [20:33:34] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [20:34:43] anybody wants to review https://gerrit.wikimedia.org/r/#/c/280494/ ? search team is mostly in israel so someone needs to review this for me or production will remain broken [20:36:54] 6Release-Engineering-Team, 6Developer-Relations, 6Team-Practices, 15User-greg: Set up Code Review office hours - https://phabricator.wikimedia.org/T128371#2162563 (10Dereckson) Would have thought it were included in CR+1 role. [20:38:00] MaxSem: done [20:38:10] (didn't realize you needed me to review :)) [20:38:28] :) [20:39:34] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [20:40:22] heh, tyler beat lego by less than a minute [20:40:41] cherrypicking... [20:40:59] (03CR) 10Ejegg: "JanZerebecki:" [integration/config] - 10https://gerrit.wikimedia.org/r/277563 (https://phabricator.wikimedia.org/T131264) (owner: 10Awight) [20:41:09] legoktm: Hi do you know how we could fix https://phabricator.wikimedia.org/T131309 please. 
[20:48:54] thcipriani: and now everything fails [20:48:59] https://www.irccloud.com/pastebin/NXUvGns9/ [20:49:26] https://www.irccloud.com/pastebin/rUUJBHL4/ [20:51:23] Amir1: oh good :) [20:51:33] Krenair: looks like there https://gethttpsforfree.com/ is which creates the certificates with letsencrypt [20:51:49] why? :( [20:52:06] not sure just yet. [20:52:17] okay [20:52:18] paladox, we were in the private letsencrypt beta, we just never got around to using it. there is a private task blocked on response from Brandon Black in the operations team on this subject [20:52:43] Oh ok. [20:55:04] I have a couple of ideas to move it forward [20:55:54] Amir1: deploy fails because you're running `deploy` from /srv/deployment/ores and not from /srv/deployment/ores/deploy; ssh fails because the key on deployment-ores-web is incorrect [20:56:39] phew [20:56:42] okay [20:56:43] thanks [20:56:48] yw :) [20:56:54] I'm over working [20:56:57] making mistakes [20:57:00] sorry for that [20:57:43] it happens, no worries :) [20:57:50] thcipriani: where is public key of deploy-service [20:58:06] /var/lib/git/labs/private/files/ssh/tin/servicedeploy_rsa.pub [20:58:13] I copied this from puppetmaster [20:58:17] ^ yup on puppetmaster [21:00:46] so if you change your scap::target public_key_source to: puppet:///private/ssh/tin/servicedeploy_rsa.pub it _should_ work [21:02:19] Ok [21:04:25] thcipriani: it worked [21:04:29] one last thing [21:04:39] I have three different groups [21:04:52] one is worker, flower and web [21:05:12] flower and web have one target (and they are the same) [21:05:25] so my scap looks like this [21:05:49] https://github.com/wiki-ai/ores-wikimedia-config/blob/master/scap/scap.cfg [21:06:26] but it only worked on flower and worker group and didn't touch the default (web group) [21:06:31] https://www.irccloud.com/pastebin/BXPBftVz/ [21:08:56] this is something we built in, if you have two groups that contain the same host the first group wins. 
Also, if a group is empty (which is what would happen if the first group won the host), it doesn't run that group. [21:09:44] why are you running two deploys on the same host? Is this to use hooks? [21:10:59] the flower only runs at the first web node [21:11:10] other web node doesn't need flower [21:11:30] but in this case since we have only one web node, you guess what happens [21:12:01] thcipriani: is there a chance to run deployment of the default group explicitly? [21:12:15] I think it has an option in deploy [21:13:02] nope it doesn't [21:14:02] bd808, thcipriani: When either of you get a chance, https://gerrit.wikimedia.org/r/#/c/280480/ could use a +1 or -1 [21:17:50] okay, s [21:17:51] so [21:17:53] upload cache is broken [21:18:03] 15 FetchError c no backend connection [21:19:35] RECOVERY - SSH on deployment-ms-be02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [21:20:40] So... [21:20:47] Why is deployment-upload shutoff? [21:20:49] andrewbogott? [21:20:52] https://horizon.wikimedia.org/project/instances/eba7ec1f-8fcf-4ab3-a616-33f486cfb099/ [21:21:10] Stop March 23, 2016, 10:06 p.m. [21:21:14] Start March 23, 2016, 10:09 p.m. novaadmin Error [21:21:15] huh [21:21:18] useful log [21:21:44] this is fun [21:21:44] Success: Started Instance: deployment-upload [21:21:48] go to action log [21:22:03] no action listed [21:22:36] Hm, were there security reboots on the 23rd? [21:23:17] I think that one of the virt hosts went briefly oom during the reboots and shut some things down [21:23:22] it should be safe to just start it now [21:23:28] (unless you already tried that?) [21:24:04] Well my actions show in the log now [21:24:04] but [21:24:05] req-6f146145-4b20-4242-8b8e-ad97a76387c7 Start March 30, 2016, 9:21 p.m. krenair Error [21:24:05] req-c8f000cf-977e-4054-bf7d-b728333a5141 Start March 30, 2016, 9:21 p.m. krenair Error [21:24:29] Message is Error... 
After it displayed "Success: Started Instance: deployment-upload" to me in the UI [21:25:32] PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [22:12:03] Browser-Tests-Infrastructure, MobileFrontend: Net::ReadTimeout in MobileFrontend browser tests when visiting Watchlist page - https://phabricator.wikimedia.org/T129328#2163400 (Jdlrobson) [22:12:19] Browser-Tests, MobileFrontend: `Generic special page features.Search from Watchlist` test failing - https://phabricator.wikimedia.org/T130971#2163407 (Jdlrobson) [22:17:19] Beta-Cluster-Infrastructure, Labs: deployment-upload won't start, upload.beta.wmflabs.org down - https://phabricator.wikimedia.org/T131322#2163472 (Krenair) [22:27:12] (PS1) Paladox: [IfTemplates] Add npm test and composer-test [integration/config] - https://gerrit.wikimedia.org/r/280594 [22:36:47] (PS1) Paladox: [ImageLink] Add npm test and composer-test [integration/config] - https://gerrit.wikimedia.org/r/280597 [23:06:44] Continuous-Integration-Infrastructure: Investigate crashing mysql on integration slaves - https://phabricator.wikimedia.org/T130951#2163685 (greg) [23:07:13] Continuous-Integration-Infrastructure: Investigate crashing mysql on integration slaves - https://phabricator.wikimedia.org/T130951#2151857 (greg) Open>Resolved a:dduvall This was fixed later that day, right? @dduval I think fixed it. [23:44:18] Continuous-Integration-Config, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: CI: wikimedia/fundraising/crm/civicrm repo should automatically submit and merge after CR+2 V+2 - https://phabricator.wikimedia.org/T131330#2163845 (awight)