[00:04:30] (03PS1) 10Legoktm: make-wmf-branch: Stop branching Gather (undeployed) [tools/release] - 10https://gerrit.wikimedia.org/r/289342 (https://phabricator.wikimedia.org/T128568)
[00:04:42] ostriches: ^
[00:06:21] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2303541 (10Dzahn)
[00:06:45] legoktm: go ahead and self merge. I'm on my phone right now
[00:07:28] ok
[00:07:40] (03CR) 10Legoktm: [C: 032] "Approved by ^d" [tools/release] - 10https://gerrit.wikimedia.org/r/289342 (https://phabricator.wikimedia.org/T128568) (owner: 10Legoktm)
[00:08:44] (03Merged) 10jenkins-bot: make-wmf-branch: Stop branching Gather (undeployed) [tools/release] - 10https://gerrit.wikimedia.org/r/289342 (https://phabricator.wikimedia.org/T128568) (owner: 10Legoktm)
[00:18:37] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2303604 (10Dzahn) This would add a new group for unprivilege...
[00:22:42] bd808: during a full scap: cannot delete non-empty directory: php-1.27.0-wmf.19/cache/l10n
[00:23:14] bd808: meanwhile on Tin, we only have php-1.27.0-wmf.20+
[00:27:21] Yippee, build fixed!
[00:27:21] Project selenium-Flow » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #22: 09FIXED in 11 min: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/22/
[00:28:44] Project selenium-Flow » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #22: 04FAILURE in 12 min: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/22/
[00:29:42] Dereckson: hrm. this is probably some cleanup gone awry. That directory is full of cdb files, which we don't rsync from tin, which is probably why rsync is freaking out about them. If it's not causing any breakage I wouldn't worry too much about it for now.
[00:30:32] this may be left over files from the train today is my best guess.
[00:31:15] twentyafterfour: ^ did you notice this as part of the train?
[00:31:18] It's not causing breakage.
[00:31:41] that's good. I *think* all that needs to happen is: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Remove_left_over_files_from_expired_branches
[00:32:16] but I haven't seen that problem before.
[01:19:22] 06Release-Engineering-Team: Prune /srv/mediawiki/php-1.27.0-wmf.19 - https://phabricator.wikimedia.org/T135580#2303731 (10Dereckson)
[03:00:19] I did attempt to clean up some older branches during the train today.
[03:00:41] and I _have_ seen that sort of problem before
[03:01:04] in my experience the l10n cache directories are always problematic in that way
[09:22:48] PROBLEM - Free space - all mounts on deployment-cache-upload04 is CRITICAL: CRITICAL: deployment-prep.deployment-cache-upload04.diskspace.root.byte_percentfree (<40.00%)
[09:26:41] umm. no. Can anyone fix logstash-beta, I saved cxserver as default home.
[09:27:15] mobrovac: we need to do git pull/git submodule update before scap deploy in tin, right?
[09:27:34] yes kart_
[09:27:43] OK!
[09:27:54] I'll update cxserver at my evening time.
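The tin-side sequence kart_ and mobrovac just agreed on, as a rough shell sketch; the cxserver deploy-repo path follows the usual /srv/deployment convention and is an assumption, not taken from the log:
```
# Sketch of a Scap3 service deploy from tin (repo path assumed).
cd /srv/deployment/cxserver/deploy
git pull                       # update the deploy repo itself
git submodule update --init    # refresh the src/node_modules submodules
scap deploy 'Update cxserver'  # push to targets, with a log message
```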
[09:28:21] Beta seems OK with new node_modules update.
[10:04:26] 07Browser-Tests, 10MediaWiki-extensions-UniversalLanguageSelector, 10ULS-CompactLinks, 03Language-Q4-2016-Sprint 3: Cross browser test Compact language links - https://phabricator.wikimedia.org/T135586#2304321 (10Danny_B)
[10:46:18] 05Gerrit-Migration, 10Differential: Find way to use Differential with plain git (i.e.: without requiring arc) - https://phabricator.wikimedia.org/T127#2304394 (10Qgil) From @faidon in wikitech-l: > If we have spare budget for the FY, a good start, I think, would be (properly) implementing https://secure.phabri...
[10:56:04] 03Scap3, 10scap, 13Patch-For-Review: scap::target shouldn't allow users to redefine the user's key - https://phabricator.wikimedia.org/T132747#2304413 (10mobrovac) >>! In T132747#2303453, @thcipriani wrote: > @mobrovac the immediate problem of not being able to deploy on beta should be resolved temporarily u...
[10:56:14] 10Beta-Cluster-Infrastructure, 03Scap3, 10Citoid, 06Services, 10VisualEditor: Can't deploy Citoid in Beta - https://phabricator.wikimedia.org/T132666#2304428 (10mobrovac)
[10:56:16] 03Scap3, 10scap, 13Patch-For-Review: scap::target shouldn't allow users to redefine the user's key - https://phabricator.wikimedia.org/T132747#2304429 (10mobrovac)
[10:56:44] 10Beta-Cluster-Infrastructure, 03Scap3, 10Citoid, 06Services, 10VisualEditor: Can't deploy Citoid in Beta - https://phabricator.wikimedia.org/T132666#2206203 (10mobrovac) 05Open>03Resolved a:03mobrovac
[11:29:49] !log Marked mediawiki/core/vendor repository as hidden in Gerrit. It got moved to mediawiki/vendor including the whole history. Settings page: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/core/vendor
[11:31:24] 03Scap3, 06Operations: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2304520 (10mobrovac)
[11:33:31] bah
[11:36:27] qa-morebots:
[11:36:27] I am a logbot running on tools-exec-1206.
[11:36:27] Messages are logged to https://tools.wmflabs.org/sal/releng.
[11:36:27] To log a message, type !log .
[11:36:30] !log Marked mediawiki/core/vendor repository as hidden in Gerrit. It got moved to mediawiki/vendor including the whole history. Settings page: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/core/vendor
[11:36:34] !log Restarted qa-morebots
[11:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:59:31] 06Release-Engineering-Team: Prune /srv/mediawiki/php-1.27.0-wmf.19 - https://phabricator.wikimedia.org/T135580#2303731 (10hashar) Might be my fault. When doing the deployment train I am reluctant to delete files because I don't quite understand the impact or how to rollback/recreate them in case of mistake.
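For context on T135580: the cleanup step referenced earlier (Remove_left_over_files_from_expired_branches on wikitech) amounts to something like the sketch below. The staging path is the conventional one on tin and the safety check is illustrative, so treat this as an outline rather than the documented procedure:
```
# Outline of pruning an expired branch from the staging area on tin.
cd /srv/mediawiki-staging
# make sure nothing still points at the old branch before deleting it
grep 'php-1.27.0-wmf.19' wikiversions.json && echo 'still referenced, abort!'
rm -rf php-1.27.0-wmf.19   # then sync the deletion out to the cluster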
[12:21:11] (03PS1) 10Zfilipin: WIP selenium-Wikidata Jenkins job [integration/config] - 10https://gerrit.wikimedia.org/r/289396 (https://phabricator.wikimedia.org/T128190)
[12:22:06] * Luke081515 waves
[12:25:47] (03CR) 10Zfilipin: "The job is tested here:" [integration/config] - 10https://gerrit.wikimedia.org/r/289396 (https://phabricator.wikimedia.org/T128190) (owner: 10Zfilipin)
[12:46:08] !log (re)cherry-picking c/284078 to deployment-prep
[12:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[12:57:10] 03Scap3, 06Operations: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2304817 (10fgiunchedi) I can't reproduce the systemctl/service behaviour you mentioned on a different service, `restart` seems to do the right thing on a jessie labs instance even on a just...
[12:57:26] 10Continuous-Integration-Infrastructure: Running composer-php55-trusty and composer-hhvm-trusty giving out warnnings - https://phabricator.wikimedia.org/T135338#2304818 (10hashar) I have mailed the package maintainer László Böszörményi (GCS) asking how php5-xhprof can be updated on Jessie stable.
[13:03:30] 10Continuous-Integration-Infrastructure: Running composer-php55-trusty and composer-hhvm-trusty giving out warnnings - https://phabricator.wikimedia.org/T135338#2304833 (10hashar) And for Trusty: https://bugs.launchpad.net/ubuntu/+source/xhprof/+bug/1583162
[13:14:42] (03PS1) 10Hashar: [Scribunto] enable composer test [integration/config] - 10https://gerrit.wikimedia.org/r/289404
[13:15:06] (03CR) 10Hashar: [C: 032] [Scribunto] enable composer test [integration/config] - 10https://gerrit.wikimedia.org/r/289404 (owner: 10Hashar)
[13:16:03] (03Merged) 10jenkins-bot: [Scribunto] enable composer test [integration/config] - 10https://gerrit.wikimedia.org/r/289404 (owner: 10Hashar)
[13:16:43] !log deploying a05e830 to ores nodes (sca01 and ores-web)
[13:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:24:55] and it errored
[13:34:25] releng people, is this normal:
[13:34:30] https://www.irccloud.com/pastebin/oU2cm0HG/
[13:38:28] 03Scap3, 06Operations: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2304941 (10mobrovac) Indeed @fgiunchedi, using the commands directly on the target yields a different result: ``` root@deployment-mathoid:~# systemctl stop mathoid root@deployment-mathoid...
[13:52:33] mobrovac: http://pastebin.com/A0wE0r8c - seen during cxserver/deploy patch submission
[13:52:54] mobrovac: lots of unavoidable conflict Not de-duplicating
[13:53:52] kart_: that's normal, no worries
[13:55:48] mobrovac: okay. thanks.
[13:58:50] 10Continuous-Integration-Infrastructure: Running composer-php55-trusty and composer-hhvm-trusty giving out warnnings - https://phabricator.wikimedia.org/T135338#2304973 (10Paladox) @hashar thanks.
[14:04:42] paladox: I will try to figure out a fix for that php5-xhprof ini file
[14:04:56] paladox: probably have a puppet recipe that provisions a proper ini file. that should be straightforward
[14:32:18] 03Scap3, 06Operations: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2305049 (10Joe) p:05Triage>03Normal
[14:39:51] mobrovac: question: I did not update node_modules but https://gerrit.wikimedia.org/r/#/c/289421/ contains it. Why?
[14:39:59] mobrovac: is it always updated?
[14:40:43] brb.
[14:44:05] hasharMetting: Ok thanks.
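hashar's plan above is a puppet recipe shipping a sane ini file for php5-xhprof; hand-rolled in shell, the same idea would look roughly like the following. The file path, ini contents, and module name are assumptions for illustration, not the actual fix:
```
# Hypothetical manual version of the puppet fix: provide a minimal xhprof ini.
cat > /etc/php5/mods-available/xhprof.ini <<'EOF'
; managed by puppet, replacing the packaged file (contents assumed)
extension=xhprof.so
EOF
php5enmod xhprof          # Debian/Ubuntu helper to enable the module
php5 -m | grep -i xhprof  # confirm it loads without warnings
```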
[14:44:06] :)
[14:50:38] 03Scap3, 06Operations: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2305071 (10Joe) So, I watched the auth log on deployment-mathoid while running scap and it seems that when mathoid is up before we run scap we do issue a restart: ``` May 18 14:47:22 deplo...
[14:51:21] 03Scap3, 06Operations: Scap3 calls the checker script before restarting the service, not able to restart a service if it's down. - https://phabricator.wikimedia.org/T135609#2305072 (10Joe)
[14:52:30] ostriches: Hi, I've tested using what you said, doing it manually, and it worked. :)
[14:53:05] I'm thinking of how to do it easier for Phabricator, since I did it about 10-20 folders at a time.
[14:53:11] So what about following https://review.typo3.org/plugins/replication/Documentation/config.html
[14:53:28] for push = +refs/changes/*:refs/changes/*
[14:53:46] replacing * with the folder number so it pushes each folder differently.
[15:01:37] hashar: It seems https://integration.wikimedia.org/zuul/ has frozen, unless someone force-merged.
[15:01:44] Because jessie tests are not loading.
[15:02:03] oh
[15:02:12] it is out of instances
[15:02:50] Oh
[15:06:58] mobrovac: let me know when you're around. I should be back after dinner.
[15:12:32] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305151 (10hashar)
[15:13:50] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[15:14:33] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305194 (10hashar) `openstack server delete 6f07110f-4f2f-4f46-bddc-1ea30192ab02` worked fine though :)
[15:19:40] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305257 (10hashar) I have stopped nodepool on labnodepool1001.eqiad.wmnet in case it is adding load to the OpenStack labs. To restart it: $ ssh l...
[15:23:06] hashar: jzerebecki: I've filed https://phabricator.wikimedia.org/T135635
[15:23:10] James_F ^^
[15:23:36] Breaks php 5.6 due to https://github.com/wikimedia/mediawiki-vendor/commit/4bd8ba35627e7d70036de02eeec5e52ba64b9250
[15:24:10] jzerebecki: Can we ask upstream to support some type of blocking for generating the autoload_static file if we don't want it, to keep compatibility with php 5.5?
[15:27:55] jzerebecki: I've opened an issue upstream https://github.com/composer/composer/issues/5354 about autoload_static
[15:28:07] bd808 ^^
[15:29:07] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305341 (10hashar) The first failure was apparently at 14:35 UTC ``` 2016-05-18 14:35:17,112 INFO nodepool.NodePool: Need to launch 1 ci-jessie-wik...
[15:31:16] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305351 (10hashar) Maybe that is the images that weren't correct I have deleted them ``` hashar@labnodepool1001:/var/log/nodepool$ nodepool image-li...
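The truncated comment above is part of the usual nodepool triage on labnodepool1001; nodepool's own CLI covers the first pass (the log file name below is an assumption, though the shell prompt in the comment shows /var/log/nodepool as the directory):
```
# First-pass nodepool diagnostics on the nodepool host.
nodepool list         # instances nodepool owns, with their states
nodepool image-list   # snapshot images available per provider
tail -f /var/log/nodepool/nodepool.log   # log file name assumed
```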
[15:32:40] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[15:33:37] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2305375 (10Luke081515)
[15:33:59] twentyafterfour: FYI, please look at that task, before deploying to group1 ^
[15:36:32] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305405 (10hashar) I have restarted Nodepool, it is supposed to spawn instances out of yesterday snapshots: ``` $ nodepool image-list +-----+-------...
[15:39:03] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2265934 (10Paladox) Should we add https://phabricator.wikimedia.org/T135635 since php 5.6 is currently broken?
[15:40:46] 07Browser-Tests, 10MobileFrontend, 10Reading-Web-Backlog: `Generic special page features.Search from Watchlist` test failing - https://phabricator.wikimedia.org/T130971#2152450 (10Jhernandez) @Jdlrobson followup comment
[15:40:51] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[15:45:48] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2305469 (10Luke081515)
[15:50:46] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305493 (10Luke081515) p:05Triage>03Unbreak! Blocks Zuul.
[15:52:08] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2305498 (10Paladox)
[16:02:58] hm, a few contint instances got created actually
[16:03:31] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[16:05:00] Luke081515 do you mean nodepool
[16:05:04] Or something else.
[16:05:13] yep, those instances:
[16:05:17] 05Gerrit-Migration, 10Differential: Find way to use Differential with plain git (i.e.: without requiring arc) - https://phabricator.wikimedia.org/T127#2305587 (10bd808) >>! In T127#2304394, @Qgil wrote: > From @faidon in wikitech-l: >> If we have spare budget for the FY, a good start, I think, would be (proper...
[16:05:18] 18:04 <@rc-pmtpa> [[Nova Resource:Ci-jessie-wikimedia-106273.contintcloud.eqiad.wmflabs]] NB https://wikitech.wikimedia.org/w/index.php?oldid=536353&rcid=783947 * Labslogbot * (+626) Auto update of instance info.
[16:05:27] they are used by nodepool
[16:05:30] IIRC
[16:05:35] Ok, hashar: Is aware. He recommends v+2 for now
[16:05:54] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2305591 (10Tgr)
[16:06:12] So if you know it won't break anything, v+2. If you're unsure we should wait until it is fixed. Or add a fail-safe
[16:06:23] So if nodepool fails, fall back to ubuntu trusty instances.
[16:06:43] hashar ^^
[16:06:52] back
[16:07:03] Is it possible to add a fail-safe?
[16:07:08] please
[16:07:08] hm, but tests like rake-jessie or composer-php55-trusty aren't running yet
[16:07:16] Nope, because they're nodepool
[16:08:08] 05Gerrit-Migration, 10Differential: Find way to use Differential with plain git (i.e.: without requiring arc) - https://phabricator.wikimedia.org/T127#2305611 (10Paladox) @bd808 yep but one problem. if you accept a diff it gets merged automatically.
[16:12:53] contint just spawns trusty at the moment
[16:12:58] 05Gerrit-Migration, 10Differential: Find way to use Differential with plain git (i.e.: without requiring arc) - https://phabricator.wikimedia.org/T127#2305628 (10valhallasw) >>! In T127#2305611, @Paladox wrote: > @bd808 yep but one problem. if you accept a diff it gets merged automatically. Why would the revi...
[16:14:06] 05Gerrit-Migration, 10Differential: Find way to use Differential with plain git (i.e.: without requiring arc) - https://phabricator.wikimedia.org/T127#2305629 (10Paladox) Because you have to have permission to merge whereas arcyd would merge for you.
[16:15:46] duh
[16:15:48] Luke081515: It doesn't seem to spawn trusty.
[16:15:58] revi... pings me lol
[16:17:18] the first jessie instances had changes
[16:17:45] but only things like https://wikitech.wikimedia.org/w/index.php?diff=536694&oldid=536680
[16:18:35] Oh
[16:19:17] hashar: The instance spawn issue looks like a general issue; if I try to spawn instances at my project, they end up in the "error" state every time
[16:19:52] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[16:19:52] Luke081515: yup looks like some labs issue of some sort :(
[16:22:07] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305657 (10hashar) Nodepool eventually restarted due to puppet. Horizon interface shows up instances are blocked on various tasks: in Spawning, Sc...
[16:25:04] paladox: there is not much fallback possible
[16:25:15] paladox: and the integration-slave* instances are going to be phased out entirely
[16:32:20] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[16:33:41] puppet restarts nodepool automagically ..
[16:33:52] seems the issue is with labs infra and andrew/chase are looking at it
[16:34:20] going off for a few, it is dinner time and I hear kids' agitation in the background
[16:41:20] (03PS1) 10Hashar: zuul status: notice about ongoing outage [integration/docroot] - 10https://gerrit.wikimedia.org/r/289451 (https://phabricator.wikimedia.org/T135631)
[16:42:09] (03CR) 10Hashar: [C: 032] zuul status: notice about ongoing outage [integration/docroot] - 10https://gerrit.wikimedia.org/r/289451 (https://phabricator.wikimedia.org/T135631) (owner: 10Hashar)
[16:44:30] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[16:44:52] (03Merged) 10jenkins-bot: zuul status: notice about ongoing outage [integration/docroot] - 10https://gerrit.wikimedia.org/r/289451 (https://phabricator.wikimedia.org/T135631) (owner: 10Hashar)
[16:46:30] hashar: Oh ok
[16:46:51] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305151 (10Andrew) 2016-05-18 16:45:29.980 5375 ERROR nova.compute.manager [req-86afc675-0c57-44f4-a164-e1a8320c845b novaadmin...
[16:47:11] I have added a lame notice message on the Zuul status page with https://gerrit.wikimedia.org/r/#/c/289451/ :D
[16:47:44] hasharAway good idea
[16:47:45] Thanks
[17:05:52] 03Scap3, 06Operations: Scap3 calls the checker script before restarting the service, not able to restart a service if it's down. - https://phabricator.wikimedia.org/T135609#2305902 (10thcipriani) Oh! I know exactly what is happening. To implement `scap --service-restart` we added a `restart_service` stage (se...
[17:12:25] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth, 13Patch-For-Review, and 2 others: "Nonce already used" regularly occurring on beta cluster - https://phabricator.wikimedia.org/T109173#2305914 (10jayvdb) As mentioned in T129763 , we are seeing this again now that the oauth lo...
[17:13:30] hasharAway you can now stop jobs in zuul https://phabricator.wikimedia.org/T127#2305628
[17:13:34] wrong link
[17:13:40] this one
[17:13:40] https://github.com/openstack-infra/zuul/commit/19233fbff5ee1b252ab014d2ab908fb66d70751a
[17:18:39] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2305942 (10Legoktm)
[17:19:58] twentyafterfour: I added a new deployment blocker fyi ^
[17:22:42] legoktm: could you merge https://gerrit.wikimedia.org/r/#/c/289434/ please
[17:22:51] v+2 too please, due to the CI issue.
[17:24:02] legoktm: thanks for the heads-up
[17:31:15] OK, I'm restarting nodepool, I think nova is sorted
[17:31:55] 06Release-Engineering-Team, 05Release, 05WMF-deploy-2016-05-17_(1.28.0-wmf.2): MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2306039 (10mmodell)
[17:32:00] hasharAway ^^
[17:32:34] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[17:34:24] zuul is working now.
[17:34:29] Luke081515 ^^
[17:35:17] thcipriani: hey, is this related to your recent announcement?
https://www.irccloud.com/pastebin/oU2cm0HG/
[17:35:38] * thcipriani looks
[17:35:40] (03PS1) 10Paladox: Revert "zuul status: notice about ongoing outage" [integration/docroot] - 10https://gerrit.wikimedia.org/r/289463
[17:36:11] (03CR) 10Paladox: "Zuul now works." [integration/docroot] - 10https://gerrit.wikimedia.org/r/289463 (owner: 10Paladox)
[17:40:35] Amir1: in a meeting, give me a minute. I think this error is different.
[17:40:45] kk
[17:40:51] thanks
[17:42:13] twentyafterfour: I'm going to test it on mw1017 right now
[17:43:35] twentyafterfour: actually, tgr already tested it, so cherry-pick is good to go
[17:44:27] also, I'm hoping we're not going to deploy while CI is still down :S
[17:46:40] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2306116 (10Andrew) 05Open>03Resolved a:03Andrew This is resolved now, and I don't know what went wrong :(
[17:47:11] legoktm: thanks
[17:47:52] legoktm: regarding deploy without CI - I don't know, do we have a policy about what to do in that situation?
[17:48:49] legoktm: as far as I know, CI is not down. thcipriani it's back, right?
[17:49:11] (03CR) 10Physikerwelt: [C: 031] Moritz is owner of selenium-Math Jenkins job [integration/config] - 10https://gerrit.wikimedia.org/r/289231 (https://phabricator.wikimedia.org/T134492) (owner: 10Zfilipin)
[17:49:37] andrewbogott: hmm, it seems to be creating new boxes; there is 1 job that seems to be plugging up the works right now...
[17:52:41] (03PS1) 10Hashar: Revert "zuul status: notice about ongoing outage" [integration/docroot] - 10https://gerrit.wikimedia.org/r/289484
[17:52:47] (03CR) 10Hashar: [C: 032] Revert "zuul status: notice about ongoing outage" [integration/docroot] - 10https://gerrit.wikimedia.org/r/289484 (owner: 10Hashar)
[17:53:07] andrewbogott: it looks to be back, but super backlogged
[17:53:38] once this job finishes it should be back in business https://integration.wikimedia.org/ci/job/mediawiki-phpunit-php55-trusty/125/
[17:53:52] or, at least, that should clear a lot out
[17:58:22] twentyafterfour: I don't think there's a policy per se, but deploying backports without CI is a bit concerning to me. Just bad timing for everything to break at the same time -.-
[17:59:11] legoktm: there are only two trusty nodepool instances. It will be super slow
[17:59:16] if any repo uses composer
[17:59:48] why are there only two?
[17:59:56] aaand it just dumped the entire queue because of an Echo failure.
[18:00:16] legoktm: I'm not sure, but hashar has asked for more nodepool
[18:00:20] To increase the size.
[18:00:28] legoktm: I don't see any reason to be concerned about this particular one-line fix to unbreak group0. In general though CI is indeed important
[18:00:40] But they should use more than two, hashar said.
[18:01:33] legoktm: See https://phabricator.wikimedia.org/T133911
[18:01:34] please
[18:01:43] twentyafterfour: right, I think this was fine, but just in general. Plus if we stop deploys because CI is broken, maybe more people will start learning how to fix it ;)
[18:02:11] paladox: aha, thanks
[18:02:20] You're welcome.
[18:02:23] legoktm: indeed :)
[18:02:42] legoktm: But I'm not sure what happens when we have a CI issue.
[18:02:45] legoktm: but in the case of nodepool it's kinda out of our hands currently
[18:02:54] And jessie is backed up
[18:03:06] because releng still do not have enough access to fix it
[18:04:44] legoktm: I think hashar plans to reduce integration-trusty once we migrate the rest of mw tests to nodepool
[18:06:56] Amir1: so the message you pasted looks like a logging error
[18:07:43] hasharAway legoktm we should create something like travis ci to test multiple php versions at once.
[18:07:48] Amir1: looking at: scap deploy-log for ores I see this line: 14:41:27 [deployment-sca01.deployment-prep.eqiad.wmflabs] Check 'setup_virtualenv' failed: Already using interpreter /usr/bin/python3
[18:07:53] thcipriani: yeah, I didn't quite realize why. I'm not touching anything related to logging
[18:08:04] paladox: there's a bug for that somewhere
[18:08:08] which, I think, is the root cause of the failure.
[18:08:12] Since I caught php 5.6 errors because it broke mediawiki/vendor
[18:08:31] hmm
[18:08:37] let me check
[18:08:42] thanks
[18:08:53] legoktm: Could we create an instance with php 5.6 to test against mw core please
[18:09:11] paladox: can you file a bug for that?
[18:09:13] to prevent php 5.6 from being broken again.
[18:09:17] legoktm: Ok
[18:13:03] 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling: Create some instances with php5.6 on them - https://phabricator.wikimedia.org/T135666#2306242 (10Paladox)
[18:13:57] 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling: Create some instances with php5.6 on them - https://phabricator.wikimedia.org/T135666#2306216 (10Paladox) @hashar can I add you to this task please. I'm not sure if you're watching the project, which means you already got the email.
[18:14:07] legoktm: https://phabricator.wikimedia.org/T135666
[18:15:53] legoktm: Seems we could install php 5.3, 5.5, 5.6; then we can remove the services and then add the services every time we run a job
[18:16:08] I managed to do it on trusty.
[18:16:31] On the webserver it was running php 5.5 but I installed php 7
[18:17:24] But because services had php5 it started that
[18:21:14] thcipriani: I checked, the underlying issue is that targets don't clone submodules
[18:21:24] so the submodules are empty
[18:21:46] e.g. /srv/ores/deploy/submodules/wheels is empty in deployment-sca01
[18:22:14] I enabled submodules in scap.cfg by "git_submodules: True"
[18:22:15] Amir1: hmm it should clone submodules from tin...
[18:22:26] maybe that's changed
[18:22:43] I retried with --force
[18:22:52] git submodule update --init?
[18:23:03] Usually you need to run that to get the submodules populated.
[18:23:19] halfak: that should be done in scap itself
[18:23:26] but it seems it's not being done
[18:23:38] if submodules aren't populated on tin, then it won't be able to clone from tin, so, yeah, that'll have to be done on tin, but scap should handle targets.
[18:23:48] so 1- we didn't configure properly 2- scap has issues
[18:24:04] thcipriani: It's populated in tin
[18:24:14] I checked that explicitly
[18:24:28] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2251644 (10Legoktm) >>! In T133992#2304203, @gerritbot wrote...
[18:24:31] Amir1: could you try git_upstream_submodules: True in scap.cfg?
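As a sketch of what thcipriani is suggesting: the two submodule switches side by side in scap's INI-style scap.cfg. The [global] section name follows scap convention, the heredoc assumes the file doesn't already contain that section, and the exact semantics of each key should be checked against scap's docs:
```
# Append the submodule options being discussed to the repo's scap.cfg.
cat >> scap.cfg <<'EOF'
[global]
git_submodules: True            # populate submodules on targets
git_upstream_submodules: True   # fetch them from upstream, not from tin
EOF
```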
[18:24:40] sure
[18:25:36] legoktm: Seems zuul is about ready to crash internet explorer
[18:25:39] Slow loading.
[18:25:51] thcipriani: nothing changed
[18:26:39] eh, /srv/ores/deploy/submodules/wheels is populated on deployment-sca01
[18:27:40] er /srv/deployment/ores/deploy/submodules/wheels
[18:28:31] thcipriani: ops told us we need to change directories from /srv/ores/deploy to /srv/deployment/ores
[18:28:38] Is that what you need?
[18:29:02] let me rephrase, is it possible via scap3?
[18:29:22] the path for most services is: /srv/deployment/[service]/deploy
[18:29:32] not on tin
[18:29:34] on nodes
[18:29:46] right, for most services on targets that is the path.
[18:29:47] (or you mean in nodes too)
[18:29:51] exactly.
[18:29:54] okay
[18:30:09] I will change the scap configs
[18:30:54] ladsgroup@deployment-sca01:/srv/ores/deploy/submodules/wheels$ ls
[18:30:54] ladsgroup@deployment-sca01:/srv/ores/deploy/submodules/wheels$
[18:31:44] yup
[18:32:02] right, but check: /srv/deployment/ores/deploy/submodules/wheels
[18:33:31] https://phabricator.wikimedia.org/P3126
[18:34:44] so if you want to move to /srv/deployment/ores/deploy then the checks.yaml scripts will have to be updated with the new path, which I think is what's failing.
[18:36:22] yup
[18:36:52] did it right now and it just deployed successfully
[18:39:08] \o/
[18:39:38] does that seem right? Is that the way you want it to work?
[18:39:49] i.e. the directory structure
[18:42:16] thcipriani: yeah
[18:42:23] I'm making the commit right now
[18:43:05] okie doke, wanted to make sure I wasn't misunderstanding what was happening. Thanks!
[18:46:24] o/
[18:46:25] \o/
[18:48:43] paladox: You can abandon https://gerrit.wikimedia.org/r/#/c/289463/, another revert was already merged ;)
[18:49:28] (03Abandoned) 10Paladox: Revert "zuul status: notice about ongoing outage" [integration/docroot] - 10https://gerrit.wikimedia.org/r/289463 (owner: 10Paladox)
[18:49:34] Luke081515 ok thanks
[19:03:31] (03Merged) 10jenkins-bot: Revert "zuul status: notice about ongoing outage" [integration/docroot] - 10https://gerrit.wikimedia.org/r/289484 (owner: 10Hashar)
[19:03:59] (03PS1) 10Chad: Always use --3way merging or patches conflict too easily [tools/release] - 10https://gerrit.wikimedia.org/r/289500
[19:04:34] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T135559#2306432 (10Krinkle)
[19:05:46] 06Release-Engineering-Team, 05Release: MW-1.28.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T135559#2303065 (10Krinkle) Specifically, the regression detailed at T121793#2306414 as caused by https://gerrit.wikimedia.org/r/273108 (MediaWiki core) and https://gerrit.wikimedia.org/r/273105 (V...
[19:20:49] 06Release-Engineering-Team, 05Release, 05WMF-deploy-2016-05-17_(1.28.0-wmf.2): MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2306482 (10Anomie)
[19:21:38] 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2306486 (10hashar) ``` File "/usr/lib/python2.7/dist-packages/libvirt.py", line 896, in if ret == -1: raise libvirtError ('...
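The truncated traceback hashar pasted above comes from the libvirt python bindings on a nova compute node; generic first checks on such a node look like the following (nothing WMF-specific, and the service unit name varies by distro, so treat it as an assumption):
```
# Generic libvirt sanity checks on a compute node (illustrative only).
virsh list --all                              # can we reach libvirtd at all?
systemctl status libvirtd                     # unit name is an assumption
journalctl -u libvirtd --since '1 hour ago'   # recent libvirt errors
```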
[19:21:48] legoktm: Hi, could you re-c+2 https://gerrit.wikimedia.org/r/#/c/289434/ please
[19:23:29] paladox: people would notice the test failed and they will +2 :-) not sure there is a point in pinging everyone ;}
[19:23:39] Oh sorry
[19:23:55] CI should be back around now
[19:24:03] it mostly caught up with the pile of pending changes
[19:24:07] Yep
[19:24:38] hasharAway I created https://phabricator.wikimedia.org/T135666
[19:24:43] for php 5.6
[19:24:56] Due to me finding out php 5.6 broke in mw core.
[19:25:08] due to composer 1.1.0 being used to update mediawiki/vendor
[19:26:13] 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling: Create some instances with php5.6 on them - https://phabricator.wikimedia.org/T135666#2306519 (10hashar) I am watching all the Continuous-Integration projects :-) https://phabricator.wikimedia.org/project/profile/401/ for example has t...
[19:26:34] 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling: Create some instances with php5.6 on them - https://phabricator.wikimedia.org/T135666#2306520 (10Paladox) Oh ok, thanks.
[19:35:39] thcipriani: I'm updating the cherry-pick on beta for ores scap configs. What was your modification to the commit?
[19:38:12] Amir1: for the time being scap::target is in flux so I added 2 lines to ores::scapdeploy to ensure that puppet continued to work in beta. above scap::target I added: $key_file = hiera('service::deploy::scap::public_key_file', 'puppet:///private/ssh/tin/servicedeploy_rsa.pub')
[19:38:27] and then in the scap::target definition I added: public_key_source => $key_file,
[19:39:08] thanks
[19:40:30] If you made the $key_file line: $key_file = hiera('service::deploy::scap::public_key_file', 'puppet:///modules/service/servicedeploy_rsa.pub') that would probably be better (now that I'm reviewing it).
[19:45:55] thcipriani: I added a local hack commit, we can simply amend it
[19:46:06] 06Release-Engineering-Team, 05Release, 05WMF-deploy-2016-05-17_(1.28.0-wmf.2): MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2306648 (10Paladox)
[19:59:52] 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service: ores-beta is down - https://phabricator.wikimedia.org/T135677#2306754 (10Ladsgroup)
[20:00:24] 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service: ores-beta is down - https://phabricator.wikimedia.org/T135677#2306768 (10Ladsgroup) Super sleepy now but I'll take a look at it soon-ish
[20:09:36] 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service: ores-beta is down - https://phabricator.wikimedia.org/T135677#2306754 (10Halfak) ``` halfak@deployment-ores-web:/srv$ source ores/venv/bin/activate (venv)halfak@deployment-ores-web:/srv$ python Python 3.4.2 (default, Oct 8 2014, 10:45:20) [GCC 4...
[20:12:11] 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service: ores-beta is down - https://phabricator.wikimedia.org/T135677#2306792 (10Halfak) ``` (venv)halfak@deployment-ores-web:/srv/deployment/ores/deploy$ python Python 3.4.2 (default, Oct 8 2014, 10:45:20) [GCC 4.9.1] on linux Type "help", "copyright",...
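Following the T135677 excerpts above, a quick way to confirm the relocated ores tree is usable on a target; the venv location under the new path is an assumption, and the import is only a smoke test:
```
# Hypothetical smoke test for the relocated ores deploy tree on a target.
ls /srv/deployment/ores/deploy/submodules/wheels | head  # should not be empty
source /srv/deployment/ores/deploy/venv/bin/activate     # venv path assumed
python -c 'import ores; print(ores.__name__)'            # import smoke test
```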
[20:23:53] 06Release-Engineering-Team, 05Release, 05WMF-deploy-2016-05-17_(1.28.0-wmf.2): MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2306834 (10mmodell) https://gerrit.wikimedia.org/r/#/c/289543/
[20:45:31] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth, 13Patch-For-Review, and 2 others: "Nonce already used" regularly occurring on beta cluster - https://phabricator.wikimedia.org/T109173#2306970 (10Tgr) @jayvdb: I don't have much time this month but do you want to shepherd https...
[21:10:15] PROBLEM - Puppet run on deployment-cache-upload04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[21:14:45] (03PS1) 10Chad: Helps to apply the patch to the right place [tools/release] - 10https://gerrit.wikimedia.org/r/289560
[21:32:21] group1 rolled back to wmf.1
[21:32:56] 06Release-Engineering-Team, 05Release, 05WMF-deploy-2016-05-17_(1.28.0-wmf.2): MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2307224 (10mmodell)
[21:33:22] 06Release-Engineering-Team, 05Release, 05WMF-deploy-2016-05-17_(1.28.0-wmf.2): MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2265934 (10mmodell) rolled back group1 to 1.28.0-wmf.1 due to T135690
[21:35:01] (03CR) 10Chad: [C: 032] Always use --3way merging or patches conflict too easily [tools/release] - 10https://gerrit.wikimedia.org/r/289500 (owner: 10Chad)
[21:35:06] (03CR) 10Chad: [C: 032] Helps to apply the patch to the right place [tools/release] - 10https://gerrit.wikimedia.org/r/289560 (owner: 10Chad)
[21:36:47] (03Merged) 10jenkins-bot: Always use --3way merging or patches conflict too easily [tools/release] - 10https://gerrit.wikimedia.org/r/289500 (owner: 10Chad)
[21:36:49] (03Merged) 10jenkins-bot: Helps to apply the patch to the right place [tools/release] - 10https://gerrit.wikimedia.org/r/289560 (owner: 10Chad)
[22:31:18] 06Release-Engineering-Team, 13Patch-For-Review, 05Release, 05WMF-deploy-2016-05-17_(1.28.0-wmf.2): MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450#2307433 (10mmodell)
[22:32:12] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta puppetmaster cherry-pick process - https://phabricator.wikimedia.org/T135427#2307434 (10Tgr) ``` * 5cde85b 2016-01-07 (4 months ago) Gergő Tisza * 3eb89da 2016-01-06 (4 months ago) Gergő Tisza ``` Those can be dropped; I added them for testing but...
[23:16:48] (03PS1) 10Niedzielski: Gate Android patches on instrumentation test cc too [integration/config] - 10https://gerrit.wikimedia.org/r/289576
[23:17:39] (03CR) 10Niedzielski: "This is ready for review but I mucked up my Jenkins Job Builder configuration so I haven't been able to deploy it yet." [integration/config] - 10https://gerrit.wikimedia.org/r/289576 (owner: 10Niedzielski)
[23:27:58] RECOVERY - Keyholder status on deployment-tin is OK: OK: Less than 100.00% above the threshold [0.0]
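The closing RECOVERY line refers to keyholder, the shared ssh-agent proxy that scap uses on deploy hosts; per the wikitech Keyholder docs, checking and re-arming it looks roughly like this:
```
# Inspect and re-arm keyholder on a deploy host such as deployment-tin.
sudo keyholder status   # reports whether the agent is armed
sudo keyholder arm      # re-arm after a restart; prompts for key passphrases
```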