[00:05:58] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[00:20:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms
[00:35:57] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[00:45:54] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[00:55:23] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[00:56:49] (03PS1) 10Legoktm: Remove VisualEditor from extension-gate [integration/config] - 10https://gerrit.wikimedia.org/r/259959 (https://phabricator.wikimedia.org/T116258)
[01:02:13] 10MediaWiki-Releasing: Ready-to-use Docker package for MediaWiki - https://phabricator.wikimedia.org/T92826#1889599 (10GWicke) The latest version now comes with MediaWiki 1.26 and VisualEditor out of the box.
[01:04:21] 10Differential, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken - https://phabricator.wikimedia.org/T121830#1889600 (10Tgr) 3NEW
[01:05:50] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[01:05:51] 10Gitblit-Deprecate: A gerrit project's Access > "History:" links to gitblit - https://phabricator.wikimedia.org/T107981#1889612 (10Tgr) These now point to Diffusion although they still cannot be browsed there. Filed as T121830.
[01:07:00] 10Differential, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken after being switched from Gitblit to Diffusion - https://phabricator.wikimedia.org/T121830#1889616 (10Tgr)
[01:07:21] 10Differential, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken after being switched from Gitblit to Diffusion - https://phabricator.wikimedia.org/T121830#1889600 (10Tgr)
[01:10:30] (03CR) 10Legoktm: [C: 031] "I've deployed the jjb part of this, but not the zuul part yet." [integration/config] - 10https://gerrit.wikimedia.org/r/259959 (https://phabricator.wikimedia.org/T116258) (owner: 10Legoktm)
[01:11:35] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[01:15:53] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[01:19:49] someone broke zuul :( https://integration.wikimedia.org/zuul/
[01:20:03] or well, at least i've never seen anywhere near that number of patches in the queue
[01:20:09] (03CR) 10Legoktm: [C: 04-1] "Aaaand reverted. Other extensions in the gate depend upon VisualEditor :/" [integration/config] - 10https://gerrit.wikimedia.org/r/259959 (https://phabricator.wikimedia.org/T116258) (owner: 10Legoktm)
[01:20:29] oh, it's all lego :P
[01:20:52] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[01:20:55] I blame ostriches.
[01:21:38] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[01:22:34] 49 patches. 6 branches.
[01:23:00] ebernhardson: long day ^
[01:25:40] ebernhardson: http://status.openstack.org/zuul/ is what their zuul normally looks like :P
[01:27:12] This is gonna take forever.
[01:27:24] 10Differential, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken after being switched from Gitblit to Diffusion - https://phabricator.wikimedia.org/T121830#1889651 (10Paladox) Viewing open patches doesn't work well in Phabricator because the ref it looks at is hard coded and not possible to...
[01:27:27] What, Jenkins V-1 everything?
[01:27:43] 10Differential, 10Diffusion, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken after being switched from Gitblit to Diffusion - https://phabricator.wikimedia.org/T121830#1889655 (10Paladox)
[01:29:17] Should we just bypass jenkins to get them through?
[01:29:26] I'm tempted for all the non-master ones.
[01:29:28] legoktm: ?
[01:29:48] 1.23 is gonna keep failing because of that missing FauxRequest method at the very least.
[01:29:51] it's not going to speed it up
[01:30:12] well, from a jenkins perspective. it's going to run the tests on the force merged patches anyways
[01:30:13] if jenkins is just going to -1 everything
[01:30:13] It'll let the changes land and let me get done tagging and call it a night :p
[01:30:17] I can ignore jenkins :p
[01:30:17] oh yeah
[01:30:18] lmao
[01:30:24] then...yeah :P
[01:30:37] We can kill the jobs it triggers.
[01:30:51] * legoktm forgot about the people who deploy MW using git...like himself
[01:30:58] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[01:37:22] just master to deal with
[01:38:50] Ok, all 4 branches tagged and signed and pushed
[01:38:55] So ya, just master.
[01:39:15] Imma let zuul chill out a tad before touching those last few.
[01:50:55] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms
[02:16:03] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[02:30:50] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[02:49:19] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[02:57:44] ostriches: I'm not a zuul pro, but 1hr 50 min and we still have 45 changes (from 49, if I read your conversation with legoktm correctly?) in the zuul gate-and-submit queue, is this normal, that it takes sooo long? :)
[02:57:55] RECOVERY - salt-minion processes on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:58:37] FlorianSW: when you drop 49 changes on it at once, yes
[02:58:49] It still hasn't caught up
[02:59:16] ok, cool, thanks for the quick answer :)
[02:59:27] It's 49 times however many times they got +2d
[02:59:37] Plus anything else people pushed
[03:00:36] Really they should queue by branch but meh
[03:00:56] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 3.64 ms
[03:30:55] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[03:46:01] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms
[04:12:19] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[04:35:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[04:54:46] 10Beta-Cluster-Infrastructure: Review rights removal by user Vogone - https://phabricator.wikimedia.org/T121168#1889768 (10doctaxon) 5stalled>3Resolved a:3doctaxon Case closed
[04:56:35] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[05:16:33] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms
[05:39:19] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[06:05:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[06:25:23] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[06:50:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms
[07:05:51] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[07:16:24] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[07:45:56] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[07:58:20] 10Deployment-Systems, 10Architecture, 10Wikimedia-Developer-Summit-2016-Organization, 7Availability: WikiDev 16 working area: Software engineering - https://phabricator.wikimedia.org/T119032#1889857 (10RobLa-WMF) I've been thinking about the central question I think this area should be addressing, and I th...
[07:58:49] 10Deployment-Systems, 10Architecture, 10Wikimedia-Developer-Summit-2016-Organization, 7Availability: WikiDev 16 working area: Software engineering - https://phabricator.wikimedia.org/T119032#1889858 (10RobLa-WMF)
[08:00:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[08:20:59] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[08:30:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms
[09:01:36] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[09:02:05] still 8h queue :/
[09:04:59] It's going down.
[09:05:07] * ostriches has been refreshing all night
[09:06:03] 10Beta-Cluster-Infrastructure, 6operations, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1889935 (10mobrovac)
[09:15:03] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #817: 04FAILURE in 21 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/817/
[09:15:54] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms
[09:16:25] !log mass cancelling jobs for changes that got force merged
[09:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:18:10] !log killing Zuul
[09:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:24:09] 10Deployment-Systems, 10Architecture, 10Wikimedia-Developer-Summit-2016-Organization, 7Availability: WikiDev 16 working area: Software engineering - https://phabricator.wikimedia.org/T119032#1889958 (10Nemo_bis) > how do we simultaneously optimize the following conditions? The question seems to assume a f...
[09:28:56] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #655: 15ABORTED in 7 min 55 sec: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/655/
[09:31:33] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[09:50:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[09:52:36] hashar: thanks
[09:55:24] !sal
[09:55:24] https://tools.wmflabs.org/sal/releng
[09:56:53] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1889996 (10hashar) This issue is still happening despite https://gerrit.wikimedia.org/r/#/c/258634/ :(
[09:58:56] could someone merge in the contint monitoring change https://gerrit.wikimedia.org/r/#/c/257568/ ? It is to monitor that a Zmq daemon is listening properly. Tested on gallium :-}
[10:04:07] !log rechecking mediawiki/core REL branches ( REL1_26 https://gerrit.wikimedia.org/r/#/c/247327/ ) ( REL1_25 https://gerrit.wikimedia.org/r/#/c/247337/ ) ( REL1_24 https://gerrit.wikimedia.org/r/#/c/179886/ ) ( https://gerrit.wikimedia.org/r/#/c/143591/ | REL1_23 )
[10:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[10:25:44] 7Browser-Tests, 10Wikidata: Wikidata Feature: Item smoke test: fails to find cancel button - https://phabricator.wikimedia.org/T117582#1890063 (10Lydia_Pintscher) p:5Triage>3Normal
[10:42:55] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #656: 04STILL FAILING in 42 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/656/
[10:45:19] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[11:25:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[11:31:34] 6Release-Engineering-Team, 10DBA, 7Epic: Implement a system to automatically deploy schema changes without needing DBA intervention - https://phabricator.wikimedia.org/T121857#1890202 (10jcrespo) 3NEW
[11:43:31] food time
[12:00:24] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[12:59:42] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #882: 04FAILURE in 27 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/882/
[14:38:49] (03PS1) 10Hashar: Dump stat() for tmp directory [integration/jenkins] - 10https://gerrit.wikimedia.org/r/260007 (https://phabricator.wikimedia.org/T120824)
[14:40:00] (03CR) 10Hashar: [C: 032] Dump stat() for tmp directory [integration/jenkins] - 10https://gerrit.wikimedia.org/r/260007 (https://phabricator.wikimedia.org/T120824) (owner: 10Hashar)
[14:40:56] (03Merged) 10jenkins-bot: Dump stat() for tmp directory [integration/jenkins] - 10https://gerrit.wikimedia.org/r/260007 (https://phabricator.wikimedia.org/T120824) (owner: 10Hashar)
[14:42:16] (03CR) 10Hashar: "Deployed:" [integration/jenkins] - 10https://gerrit.wikimedia.org/r/260007 (https://phabricator.wikimedia.org/T120824) (owner: 10Hashar)
[14:59:40] 10Deployment-Systems, 10Architecture, 10Wikimedia-Developer-Summit-2016-Organization, 7Availability: WikiDev 16 working area: Software engineering - https://phabricator.wikimedia.org/T119032#1890442 (10daniel) @RobLa-WMF I think your "lead question" sums it up pretty well. My question is: how many sessions...
[15:03:01] (03PS1) 10Hashar: Tie mwext-mw-selenium* jobs to specific slaves [integration/config] - 10https://gerrit.wikimedia.org/r/260008 (https://phabricator.wikimedia.org/T120824)
[15:03:37] (03CR) 10Hashar: [C: 032] Tie mwext-mw-selenium* jobs to specific slaves [integration/config] - 10https://gerrit.wikimedia.org/r/260008 (https://phabricator.wikimedia.org/T120824) (owner: 10Hashar)
[15:05:44] (03Merged) 10jenkins-bot: Tie mwext-mw-selenium* jobs to specific slaves [integration/config] - 10https://gerrit.wikimedia.org/r/260008 (https://phabricator.wikimedia.org/T120824) (owner: 10Hashar)
[15:05:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[15:06:46] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1890453 (10hashar) ArticlePlaceholder has `mwext-mw-selenium-composer` in the experimental pipeline. I crafted...
[15:11:52] 7Browser-Tests, 10Browser-Tests-Infrastructure, 10CirrusSearch, 6Discovery, and 2 others: Upgrade CirrusSearch browser tests to use mediawiki_selenium 1.x - https://phabricator.wikimedia.org/T99653#1890454 (10zeljkofilipin)
[15:17:58] ostriches: it looks like openstack's zuul works well with that many because they auto-scale the worker nodes ;P
[15:19:03] !log Created Github repo https://github.com/wikimedia/thumbor-video-engine
[15:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:19:23] ebernhardson: that is more or less what we do but not for the PHP jobs :-(
[15:37:13] !log Deleting mediawiki/tools/codesniffer.git branch wmf-deploy (was 358c2e7bdec269cec999af89e3412951bb463dc0 )
[15:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:37:26] qa-morebots: ping
[15:37:26] I am a logbot running on tools-exec-1211.
[15:37:26] Messages are logged to https://tools.wmflabs.org/sal/releng.
[15:37:26] To log a message, type !log .
[15:41:57] 10Continuous-Integration-Config, 5Patch-For-Review: Deprecate global CodeSniffer rules repo - https://phabricator.wikimedia.org/T66371#1890487 (10hashar) Cherry picked on integration puppet master. I have deleted /srv/deployment/integration/mediawiki-tools-codesniffer
[15:43:04] !log salt '*slave*' cmd.run 'rm -fR /srv/deployment/integration/mediawiki-tools-codesniffer' https://phabricator.wikimedia.org/T66371
[15:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:44:06] 10Continuous-Integration-Config, 5Patch-For-Review: Deprecate global CodeSniffer rules repo - https://phabricator.wikimedia.org/T66371#697969 (10hashar) **done** This is just pending for puppet patch https://gerrit.wikimedia.org/r/260018 to be merged by ops.
[15:45:57] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[16:00:40] 10Differential, 10Diffusion, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken after being switched from Gitblit to Diffusion - https://phabricator.wikimedia.org/T121830#1890535 (10mmodell)
[16:01:33] 10Differential, 10Diffusion, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken after being switched from Gitblit to Diffusion - https://phabricator.wikimedia.org/T121830#1889600 (10mmodell)
[16:03:13] (03PS1) 10Unicornisaurous: Whitelist unicornisaurous (for Verified+2 tests) [integration/config] - 10https://gerrit.wikimedia.org/r/260028
[16:06:12] twentyafterfour: Could you review https://gerrit.wikimedia.org/r/#/c/260023/ please.
[16:09:34] paladox: looks good.
[16:09:49] twentyafterfour: Thanks.
[16:11:55] !log Nodepool force refreshing image to make sure zuul is up to date (should be 2.1.0-60-g1cc37f7-wmf4jessie1 )
[16:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:12:24] twentyafterfour: It seems to cause an error: https://phabricator.wikimedia.org/diffusion/PHEX/browse// adds a double //.
[16:15:56] !log Image ci-jessie-wikimedia-1450455076 in wmflabs-eqiad is ready ( still has the wrong Zuul version grr)
[16:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:16:44] twentyafterfour: Could you review https://gerrit.wikimedia.org/r/#/c/260032/ since the patch that was approved broke the redirect script because it is adding a double //.
[16:25:59] paladox: fixed already
[16:26:14] twentyafterfour: Thanks.
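The double `//` paladox reports above is the classic naive URL-join bug. A minimal sketch of the pattern and the usual fix; the function names are illustrative, not the actual redirect script's code:

```python
def redirect_naive(base, path):
    # Naive concatenation: when base already ends with "/" the
    # unconditional separator produces a double slash.
    return base + "/" + path

def redirect_fixed(base, path):
    # Normalize both sides before joining.
    return base.rstrip("/") + "/" + path.lstrip("/")

print(redirect_naive("https://phabricator.wikimedia.org/diffusion/PHEX/", "browse"))
print(redirect_fixed("https://phabricator.wikimedia.org/diffusion/PHEX/", "browse"))
```

The same normalize-then-join idea applies whatever language the real script uses; the bug disappears once neither side is trusted to carry the separator.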
[16:26:37] everything works now except weird refs like meta/config
[16:27:23] (03PS1) 10Hashar: dib: attempt force apt-get update for snapshot [integration/config] - 10https://gerrit.wikimedia.org/r/260035
[16:28:13] (03CR) 10Hashar: [C: 032] dib: attempt force apt-get update for snapshot [integration/config] - 10https://gerrit.wikimedia.org/r/260035 (owner: 10Hashar)
[16:29:23] (03Merged) 10jenkins-bot: dib: attempt force apt-get update for snapshot [integration/config] - 10https://gerrit.wikimedia.org/r/260035 (owner: 10Hashar)
[16:30:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms
[16:33:58] (03PS1) 10Hashar: Revert "dib: attempt force apt-get update for snapshot" [integration/config] - 10https://gerrit.wikimedia.org/r/260039
[16:34:10] (03CR) 10Hashar: [C: 032] Revert "dib: attempt force apt-get update for snapshot" [integration/config] - 10https://gerrit.wikimedia.org/r/260039 (owner: 10Hashar)
[16:36:00] (03Merged) 10jenkins-bot: Revert "dib: attempt force apt-get update for snapshot" [integration/config] - 10https://gerrit.wikimedia.org/r/260039 (owner: 10Hashar)
[16:36:59] (03PS1) 10Hashar: nodepool: force apt-get update on snapshot creation [integration/config] - 10https://gerrit.wikimedia.org/r/260040
[16:37:20] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[16:37:31] (03CR) 10Hashar: [C: 032] nodepool: force apt-get update on snapshot creation [integration/config] - 10https://gerrit.wikimedia.org/r/260040 (owner: 10Hashar)
[16:38:21] (03Merged) 10jenkins-bot: nodepool: force apt-get update on snapshot creation [integration/config] - 10https://gerrit.wikimedia.org/r/260040 (owner: 10Hashar)
[16:40:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[16:42:41] !log Nodepool instances now have Zuul 2.1.0-60-g1cc37f7-wmf4jessie1 finally
[16:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:43:11] !log Nodepool: Image ci-jessie-wikimedia-1450456713 in wmflabs-eqiad is ready
[16:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:46:07] 7Browser-Tests, 10Browser-Tests-Infrastructure, 10CirrusSearch, 6Discovery, and 2 others: Upgrade CirrusSearch browser tests to use mediawiki_selenium 1.x - https://phabricator.wikimedia.org/T99653#1890688 (10zeljkofilipin)
[16:49:19] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[16:50:57] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[16:53:06] 10Continuous-Integration-Infrastructure, 5Patch-For-Review, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1890742 (10hashar) Nodepool snapshots were not properly upgrading packages from apt and were left with the version that was in the image...
[16:53:18] 10Continuous-Integration-Infrastructure, 5Patch-For-Review, 7Upstream, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1890752 (10hashar)
[16:53:28] 10Continuous-Integration-Infrastructure, 7Upstream, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1233116 (10hashar)
[16:56:36] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[16:57:21] 7Browser-Tests, 10Browser-Tests-Infrastructure, 10CirrusSearch, 6Discovery, and 2 others: Upgrade CirrusSearch browser tests to use mediawiki_selenium 1.x - https://phabricator.wikimedia.org/T99653#1890787 (10zeljkofilipin)
[17:05:45] 3Scap3: Bug in scap3 git submodule url rewriting - https://phabricator.wikimedia.org/T121884#1890798 (10Ottomata) 3NEW
[17:06:14] 3Scap3: Bug in scap3 git submodule url rewriting - https://phabricator.wikimedia.org/T121884#1890805 (10Ottomata) Although, now that I am redeploying (instead of the first checkout), the url is still as I said, but the submodule checkout is working?
[17:07:04] 3Scap3: Bug in scap3 git submodule url rewriting - https://phabricator.wikimedia.org/T121884#1890824 (10Ottomata) Ahh, no its not. I believe the submodule was cached in -cache somehow. If I remove the whole target path, and start again, I get: ``` Submodule 'config/schemas' (http://tin.eqiad.wmnet/eventloggin...
[17:09:14] 10Continuous-Integration-Infrastructure: Update jobs to use zuul-cloner with git cache - https://phabricator.wikimedia.org/T97098#1890837 (10hashar) From T97106#1890742 now has hardlink support from git bare repos simply by using `--cache-dir /srv/git` on Nodepool instances.
[17:10:53] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[17:11:01] 10Differential, 10Diffusion, 10Gerrit, 10Gitblit: Gerrit repository browsing links are broken after being switched from Gitblit to Diffusion - https://phabricator.wikimedia.org/T121830#1890842 (10Paladox) Branches redirection link now works. All that is left is for us to be able to view open patches from r...
[17:14:30] 10Gitblit-Deprecate, 10Diffusion, 5Patch-For-Review: redirect gerrit repo paths to diffusion callsigns - https://phabricator.wikimedia.org/T110607#1890845 (10Paladox) Branches redirection link has been fixed, all working now. Except for refs/meta, but that's because phabricator isn't pulling from that ref....
[17:20:25] 10Gitblit-Deprecate, 10Diffusion: Linkify "Ibcc1b499" changeset hash etc. in commit messages on diffusion - https://phabricator.wikimedia.org/T89939#1890875 (10Paladox)
[17:25:54] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[17:35:55] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 2.66 ms
[17:44:49] 10Deployment-Systems, 3Scap3: scap environment-specific host file not working - https://phabricator.wikimedia.org/T121705#1890943 (10thcipriani)
[17:49:15] 10Deployment-Systems, 3Scap3: scap environment-specific host file not working - https://phabricator.wikimedia.org/T121705#1890946 (10thcipriani) p:5Triage>3Normal
[17:49:26] 10Deployment-Systems, 3Scap3: scap environment-specific host file not working - https://phabricator.wikimedia.org/T121705#1890948 (10thcipriani) a:3thcipriani
[18:06:00] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[18:15:18] 10Deployment-Systems, 10Architecture, 10Wikimedia-Developer-Summit-2016-Organization, 7Availability: WikiDev 16 working area: Software engineering - https://phabricator.wikimedia.org/T119032#1891041 (10daniel) Some thoughts after a conversation with @RobLa: For the main "Software Engineering" slot at the...
[18:31:36] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms
[19:12:10] Was the zuul queue dropped yesterday?
[19:12:12] https://gerrit.wikimedia.org/r/#/c/251008/
[19:13:48] label:Code-Review+2 status:open only showed one other patch, so I +2'd it
[19:17:24] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[19:40:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[19:53:30] thcipriani: HIiiII :)
[19:54:02] ottomata: howdy.
[19:54:21] so, things are working! :) now i want to make the scap/ dir setup better
[19:54:31] i know you may not yet be opinionated, but I am looking for advice
[19:54:42] thus far, most scap/ dirs are committed to the actual code repos, right?
[19:55:27] i don't want to do that, but i see scap.cfg can live anywhere
[19:55:32] so my options are:
[19:56:54] - new 'scap' or 'deploy' repo, cloned by puppet to my deploy dir on tin
[19:56:55] - global(ish) puppetized scap/ configs that go to /etc/scap/... (?) or /srv/scap/scap.cfg??
[19:57:46] thus far, my scap configs on deploy server are limited to setting up ssh stuff
[19:57:47] https://github.com/wikimedia/operations-puppet/blob/production/modules/eventlogging/manifests/deployment/source.pp
[19:58:05] which is nice, because it can be used to deploy any target for which the public key is setup
[19:58:34] in my case, i have both /srv/deployment/eventlogging/{eventlogging,eventbus} which are the same code repo, but deployed to different servers for different purposes
[19:58:55] i include eventlogging::deployment::source once on the deploy server
[19:59:14] and then do the scap::target { ... thing for any target i want to deploy with the eventlogging user
[19:59:34] so, it'd be nice to be able to put scap/ config stuff in my eventlogging::deployment::source class
[19:59:44] i guess I could make a repo and then manually clone on tin
[19:59:52] Buuut, that isn't ideal for automation :/
[20:02:44] so in this instance the global scap config is currently used for scapping mediawiki only, the scap dir in the deployment repo is a hard requirement. I could see how, if it weren't committed to the repo, it could be tricky to automate.
[20:03:59] ottomata: the ideal in your case would be something like: put your scap directory in /etc/scap/[repo] and not keep it in your code at all?
[20:04:24] (that doesn't currently work, just asking if that would be your ideal way of using it)
[20:05:25] hm, no not necessarily
[20:05:57] i don't mind puppetizing a special clone into $my_repo_path/scap, not sure though
[20:05:58] hm
[20:06:20] its just extra stuff to do on the deploy server. but, /etc/scap is the same i guess
[20:06:20] hm
[20:07:39] hmmm
[20:08:02] if I did abuse the environments for this, i could avoid the problem of having multiple source repos on the deploy server
[20:08:17] and then only have one scap/ dir for all different targets/envs of that repo
[20:08:22] -e eventbus, -e analytics
[20:08:22] etc.
[20:08:34] we talked about that some other time, it does feel hacky
[20:09:22] yeah, if you've got them split into separate deploy repos on tin.
[20:09:41] yeah that's what i'm doing now
[20:09:53] the puppetization is just more cumbersome, because now I have to do special things on tin
[20:09:55] hmmm
[20:10:12] hmm, so the actual cloning of repos on tin is still handled by trebuchet, right?
[20:10:24] hm, is it trebuchet? or just deployment.yaml stuff?
[20:10:55] maybe we could put the information about the scap config repo to clone in deployment.yaml...
[20:10:56] for the time being it is trebuchet. it's something like the deploy.deploy_server_init salt command.
[20:11:01] hm
[20:11:05] like
[20:11:25] scap_repo: http://gerrit.wikimedia.org/blabla/scap/eventbus
[20:11:32] and the initial clone could just pull it down
[20:11:37] hm
[20:11:53] currently deployment.yaml is the only place that makes special stuff happen on tin
[20:12:28] also, FWIW, the contents of deployment.yaml are put in a salt pillar that is used by trebuchet.
[20:14:24] ahh, yeah am just now seeing that
[20:14:25] hmmm
[20:14:37] don't really want to hack trebuchet salt stuff to do this :/
[20:15:10] ok, maybe thcipriani i'll just do it manually for now, and when yall get around to taking over the deploy server config from trebuchet we can think about this more
[20:16:24] heheh i could ALSO use exported resources...
[20:16:30] make the target export a git::clone
[20:16:38] and realize them on the deployment server
[20:16:44] after the salt execs..
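The deployment.yaml idea ottomata floats above could look roughly like this. Everything here is hypothetical: the `scap_repo` key is the proposed addition from the conversation, not an actual Trebuchet or scap feature, and the pillar is modeled as a plain dict as it would arrive from Salt:

```python
# Hypothetical deployment.yaml-style pillar, with the proposed extra
# scap_repo key for repos whose scap/ config lives in a separate repo.
deployment_pillar = {
    "eventlogging/eventbus": {
        "grain": "eventlogging",
        "scap_repo": "http://gerrit.wikimedia.org/blabla/scap/eventbus",
    },
    "eventlogging/eventlogging": {
        "grain": "eventlogging",
    },
}

def scap_config_source(repo, pillar):
    """Return the separate scap config repo to clone, or None when the
    repo keeps its scap/ dir committed in the code repo itself."""
    return pillar.get(repo, {}).get("scap_repo")

print(scap_config_source("eventlogging/eventbus", deployment_pillar))
print(scap_config_source("eventlogging/eventlogging", deployment_pillar))
```

The `None` fallback mirrors the status quo described in the log, where most repos commit scap/ alongside the code; only repos that opt in would need the initial clone to fetch a second repo.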
[20:16:50] buuut that would also be hacky nasty
[20:16:54] and maybe wouldn't work in labs
[20:20:15] hmm, yeah, I like the idea of reusing deployment.yaml in some kind of puppet context. Adding `scap_dir` as an additional key would be a good thing, too, I think. I'm less sure about how exported resources would look.
[20:26:35] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[20:31:45] thcipriani: it is probably a bad (but cool) idea :)
[20:32:02] it would look slick in puppet, but be bad in practice
[21:05:52] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[21:15:49] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[21:17:19] Hey folks, looks like there are CI issues?
[21:17:29] Lots of things are getting V-1ed because one of the jobs hits a permissions error
[21:17:31] e.g. https://integration.wikimedia.org/ci/job/npm/42776/console
[21:17:35] 21:12:56 chmod: changing permissions of ‘/mnt/home/jenkins-deploy/tmpfs/jenkins-0’: Operation not permitted
[21:17:44] 21:12:56 rm: cannot remove ‘/mnt/home/jenkins-deploy/tmpfs/jenkins-0/lessphp_805a13p17e8skww04swww4s0w48w00w.lesscache’: Permission denied
[21:17:45] 21:12:56 rm: cannot remove ‘/mnt/home/jenkins-deploy/tmpfs/jenkins-0/lessphp_h8b1csvw54gs4gwk0o4kgkww044040w.lesscache’: Permission denied
[21:18:10] or https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/43610/console
[21:18:20] What's weird is it's not every job all the time, but at least one job for most commits
[21:18:27] So maybe some servers are broken and others are OK?
[21:21:21] RoanKattouw: blerg. Sounds like https://phabricator.wikimedia.org/T120824
[21:22:24] thcipriani: Thanks for finding that, I'll comment there with those links
[21:22:59] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1891537 (10Catrope) Looks like this or something similar is happening again: ``` [13:17] RoanKattouw Hey folks,...
[21:24:28] greg-g: so this isn't really an emergency...but with no deployments for a few weeks it's kind of a pain. The new beta feature we put out yesterday isn't collecting the data we intended. I have two patches (core & cirrus) which amount to 7 lines of javascript changes. Doesn't change any logic, just provides a proper string for event logging.
[21:25:50] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[21:26:47] ebernhardson: AIUI greg-g is on vacation and ostriches is his delegate
[21:27:26] RoanKattouw: oh that's right, i remember something like that
[21:27:41] ostriches: so this isn't really an emergency...but with no deployments for a few weeks it's kind of a pain. The new beta feature we put out yesterday isn't collecting the data we intended. I have two patches (core & cirrus) which amount to 7 lines of javascript changes. Doesn't change any logic, just provides a proper string for event logging.
[21:27:51] RoanKattouw: I removed those lesscache files from integration-slave-trusty-1014 (looked like the main box that was misbehaving)
[21:27:53] Just caught scrollback. Lemme see.
[21:27:56] ebernhardson: ^
[21:27:57] links
[21:28:07] ostriches: https://gerrit.wikimedia.org/r/260080 https://gerrit.wikimedia.org/r/260081
[21:29:18] it's kind of a silly oversight...
[21:29:29] Those are fine by me.
[21:29:43] ostriches: thanks
[21:30:56] !log salt --show-timeout '*slave*' cmd.run 'rm -fR /mnt/home/jenkins-deploy/tmpfs/jenkins-?/*'
[21:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[21:32:36] Continuous-Integration-Infrastructure, Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1891551 (JanZerebecki) I just executed what is explained in the description, so it should be work again for a...
[21:32:57] RoanKattouw: ^^
[21:33:10] Release-Engineering-Team, Scap3, Security-General: Scap should be aware of security patches - https://phabricator.wikimedia.org/T118477#1891556 (thcipriani) Open>Resolved
[21:33:16] Thanks, I'll recheck the ones that failed
[21:34:52] jzerebecki: oh I tied the mwext-mw-selenium jobs to specific slaves
[21:35:10] since I think that the job causing the weird tmpfs being owned by www-data :(
[21:36:38] hashar: so that it appeared now on integration-slave-trusty-1014 means it is a different job?
[21:37:09] RoanKattouw: on your IRC log at https://phabricator.wikimedia.org/T120824#1891537 the time is PST isn't it ?
[21:37:16] hashar: Yes, sorry
[21:37:29] I copypasted it from my client instead of from the public log
[21:37:33] Continuous-Integration-Infrastructure, Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1891570 (hashar) Roan mentioned https://integration.wikimedia.org/ci/job/npm/42776/ which ran on executor 0 of...
[21:37:35] it is ok :-)
[21:37:46] Add 8h for UTC, 9h for your local time
[21:37:50] RoanKattouw: I don't expect you to paste a log from 4am :-
[21:37:59] (soon to be my local time, starting Sunday)
[21:38:03] jzerebecki: yeah
[21:38:35] jzerebecki: I have trouble mapping jobs build and the executor they run on. jenkins.log shows the Gearman executor number but that is not the same as EXECUTOR_NUMBER hehe
[21:38:35] Deployment-Systems, Release-Engineering-Team, Performance-Team, operations, HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1891582 (Krinkle)
[21:38:52] BUT
[21:38:57] I have added a few stat() calls :D
[21:40:19] /mnt/home/jenkins-deploy/tmpfs/jenkins-0 -> Access: 2015-12-18 20:51:37.791584293 +0000
[21:40:41] Continuous-Integration-Infrastructure, Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1891600 (hashar) The `stat` calls on each tmp directory I added show on https://integration.wikimedia.org/ci/j...
[21:40:47] that bug is killing me
[21:40:49] the worth
[21:40:59] is that whenever it get fixed I will face palm at how simple the fix is
[21:42:22] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[21:47:40] Project beta-scap-eqiad build #82796: FAILURE in 1 min 39 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/82796/
[21:50:30] Continuous-Integration-Infrastructure, Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1891615 (hashar) From gallium `/var/log/jenkins/jenkins.log` and some manual filtering, here are the job start...
[21:53:42] Dec 18, 2015 20:52:04 integration-slave-trusty-1014_exec-0 https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/43608
[21:53:56] failed the same way
[22:04:28] hashar: any luck finding the previous job?
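The `stat` calls hashar mentions adding to the job scripts could look something like the sketch below: print the owner and access time of each per-executor tmp directory so a build console records who created it. The helper name and the expected-owner check are illustrative assumptions; the real paths in the log are under /mnt/home/jenkins-deploy/tmpfs.

```shell
# check_tmpdirs: report owner and access time for each per-executor tmp
# directory, and warn when one is not owned by the expected user
# (jenkins-deploy in the log; www-data is the bad owner seen in builds).
# Hypothetical helper, not the actual slave-scripts code.
check_tmpdirs() {
    base="$1"
    expected="$2"
    for dir in "$base"/jenkins-*; do
        [ -d "$dir" ] || continue
        owner=$(stat -c '%U' "$dir")          # GNU stat: %U = owning user
        atime=$(stat -c '%x' "$dir")          # %x = last access time
        echo "$dir owner=$owner atime=$atime"
        if [ "$owner" != "$expected" ]; then
            echo "WARNING: $dir owned by $owner, expected $expected" >&2
        fi
    done
}

# Demo against a scratch directory instead of the real tmpfs path:
demo=$(mktemp -d)
mkdir "$demo/jenkins-0"
check_tmpdirs "$demo" "$(id -un)"
rm -rf "$demo"
```

Running this at the start and end of each build would have pinpointed when the ownership flipped to www-data, which is exactly what the stat output at 21:40:19 was for.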
[22:04:35] digging :(
[22:05:33] marxarelli: I have updated the list of jobs https://phabricator.wikimedia.org/T120824#1891615
[22:05:37] that ran before the one roan mentionned
[22:05:49] with the executor number extracted from the console output
[22:06:47] so https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/43608 does show /mnt/home/jenkins-deploy/tmpfs/jenkins-0 belonging to www-data and created at 20:51:37
[22:07:05] and the build before on that executor was https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/23883
[22:07:09] which got cancelled by Zuul
[22:07:48] and
[22:08:13] something must be running as www-data with an assigned TMPDIR then, righ?
[22:08:15] 20:51:37 has some karma/chromium error
[22:08:19] yeah
[22:08:27] and creating / changing owner
[22:08:33] https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/23883/consoleFull
[22:08:46] it runs mediawiki core karma tests
[22:08:48] right, but before our jenkins script can have the chance to create it
[22:08:51] which is similar to selenium iirc
[22:09:00] hashar: would now be a bad time for me to deploy a zuul change? (https://gerrit.wikimedia.org/r/#/c/260028/1 specifically)
[22:09:06] it wouldn't be able to set the user otherwise
[22:09:09] in this qunit job at 20:50:43
[22:09:20] I added stat calls to show whether the script properly create the files
[22:09:42] and both /mnt/home/jenkins-deploy/tmpfs/jenkins-0 and /tmp/jenkins-0 belong to jenkins-deploy
[22:10:25] right, but if they exist before hand, the `mkdir -p` aren't going to show any failure
[22:10:40] legoktm: i am doing forensic in logs. This zuul layout change is harmless you can deploy ! ;)
[22:10:46] legoktm: thank you to have asked
[22:10:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 2.05 ms
[22:11:06] (CR) Hashar: [C: 1] Whitelist unicornisaurous (for Verified+2 tests) [integration/config] - https://gerrit.wikimedia.org/r/260028 (owner: Unicornisaurous)
[22:11:27] (PS2) Legoktm: Whitelist unicornisaurous (for Verified+2 tests) [integration/config] - https://gerrit.wikimedia.org/r/260028 (owner: Unicornisaurous)
[22:11:33] ok, thanks :)
[22:11:40] (CR) Legoktm: [C: 2] Whitelist unicornisaurous (for Verified+2 tests) [integration/config] - https://gerrit.wikimedia.org/r/260028 (owner: Unicornisaurous)
[22:12:33] marxarelli: yup
[22:12:39] marxarelli: but the stat calls indicate they still belong to jenkins-deploy
[22:16:31] (Merged) jenkins-bot: Whitelist unicornisaurous (for Verified+2 tests) [integration/config] - https://gerrit.wikimedia.org/r/260028 (owner: Unicornisaurous)
[22:17:23] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[22:18:19] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[22:20:53] !log deploying https://gerrit.wikimedia.org/r/260028
[22:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[22:21:47] (CR) Legoktm: "And deployed. Thanks for your contributions so far! :-)" [integration/config] - https://gerrit.wikimedia.org/r/260028 (owner: Unicornisaurous)
[22:21:53] Continuous-Integration-Infrastructure, Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1891702 (hashar) So the build order show: https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit
[22:22:07] so
[22:22:33] https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/23883/consoleFull dies exactly at the point the dir are owned by www-data
[22:23:09] and should have deleted both directories via rm -fr
[22:23:14] shouldn't mw-teardown-mysql.sh fail if the tmpdir is owned by www-data? but in https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/23883/console it is not.
[22:23:25] so if one belonged to www-data it should have died in the postbuild
[22:23:33] yeah
[22:23:46] that is what I would expect
[22:23:56] which would mean that it changed ownership happend just after, which I'm not sure makes sense
[22:24:18] even when phrased in proper english ;)
[22:25:00] (CR) Unicornisaurous: "Thanks!" [integration/config] - https://gerrit.wikimedia.org/r/260028 (owner: Unicornisaurous)
[22:25:11] :-))))))))))
[22:25:18] I will take some german lessons one day
[22:26:11] I am going to make the post build rm verbose
[22:29:36] (PS1) Hashar: global-teardown: be more verbose [integration/jenkins] - https://gerrit.wikimedia.org/r/260098 (https://phabricator.wikimedia.org/T120824)
[22:30:06] (CR) Hashar: [C: 2] global-teardown: be more verbose [integration/jenkins] - https://gerrit.wikimedia.org/r/260098 (https://phabricator.wikimedia.org/T120824) (owner: Hashar)
[22:30:46] salt '*slave*' cmd.run 'cd /srv/deployment/integration/slave-scripts && git pull'
[22:34:17] hashar: It seems that alot of tests are failing such as https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/43632/console and https://integration.wikimedia.org/ci/job/mediawiki-core-npm/8357/console
[22:34:24] REPRODUCED !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
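A "more verbose post build rm", as hashar describes deploying via the global-teardown change, could be sketched like this. The function name and the failure message are made up for illustration; the real change lives in integration/jenkins (Gerrit 260098).

```shell
# Hypothetical verbose teardown step: remove each per-executor tmp
# directory with -v so the build console records every deleted entry,
# and surface any failure (e.g. files owned by www-data) instead of
# failing silently.
verbose_teardown() {
    base="$1"
    for dir in "$base"/jenkins-*; do
        [ -d "$dir" ] || continue
        rm -rfv "$dir" || echo "teardown: could not fully remove $dir" >&2
    done
}

# Demo on a scratch tree mimicking the lesscache files from the log:
scratch=$(mktemp -d)
mkdir -p "$scratch/jenkins-0"
touch "$scratch/jenkins-0/lessphp_example.lesscache"
verbose_teardown "$scratch"
rm -rf "$scratch"
```

The point is diagnostic: with `-v` output in the console, the next failing build shows exactly which file the teardown choked on and when the bad ownership appeared.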
[22:34:28] youuu
[22:35:29] (Merged) jenkins-bot: global-teardown: be more verbose [integration/jenkins] - https://gerrit.wikimedia.org/r/260098 (https://phabricator.wikimedia.org/T120824) (owner: Hashar)
[22:37:18] Continuous-Integration-Config: Add a check to verify that [mediawiki/core]/autoload.php doesn't change after running maintenance/generateLocalAutoload.php - https://phabricator.wikimedia.org/T121921#1891738 (matmarex) NEW
[22:37:43] paladox: yeah
[22:39:34] !log salt -v '*slave*' cmd.run 'find /mnt/home/jenkins-deploy/tmpfs -user www-data -delete'
[22:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[22:43:33] hashar: is it possible that some request to an hhvm process wouldn't be entirely complete before a job is finished?
[22:43:46] and result in something being written to a temporary file
[22:43:47] ohhhhhh
[22:44:14] after the teardown script has run, etc.
[22:44:36] yup that would explain it
[22:44:46] it seems we should probably set MW's tmpdir to somewhere different
[22:44:53] the qunit jobs often fails because of an ajax timeout
[22:45:49] so the karma runner does hit hhvm
[22:45:58] timeout for some reason or the job get canceled
[22:46:03] post build run
[22:46:08] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[22:46:10] it's the only way to explain the www-data ownership
[22:46:11] and hhvm continue running the system
[22:46:15] right
[22:46:16] yeah
[22:46:39] given Zuul cancel jobs
[22:46:47] and Jenkins probably kill -9 every sub process
[22:47:03] I am not sure how we can have post build watch / wait till hhvm finished
[22:47:05] right, but not hhvm (for good reason)
[22:47:18] i'm not sure we need to, really
[22:47:44] we just need to make sure that hhvm/MW processes write to their own tmpdir
[22:47:58] or instead of deleting the directory we can delete the files underneath? This way the dir still belong to jenkins-deploy
[22:48:04] oh
[22:48:04] one that's cleaned up perdiocally, but outside the scope of each job
[22:48:23] but that needs to be scoped per job :(
[22:48:53] if we leave the tmpdir as owned by jenkins-deploy, and hhvm tries to write to it, it will bork
[22:48:59] right? what's the mode?
[22:49:04] it is 777
[22:49:12] ah, ok
[22:49:13] so that might work
[22:49:22] we did that because MediaWiki under hhvm does write to the cache dir
[22:49:33] and the process has no idea of the job context
[22:49:33] i kind of think the cleanup schedule should mirror the process(es) lifespan though
[22:49:57] hhvm processes are long running and separate from job processes themselves
[22:50:54] or we can namespace tmpdirs based on jenkins context again
[22:51:03] i.e. using the job/build number or some hash of it
[22:51:24] yeah, but apache/hhvm isnt' being run by jenkins
[22:51:59] i don't think it necessarily needs it's tmpdir scoped to the executors that run against it
[22:52:11] we can inject it via the mediawiki.d which would set $wgTmpDirectory = getenv('TMPDIR') / getenv('BUILD_TAG')
[22:52:20] we just need to ensure that something is triggered or run periodically to clean up hhvm temp files
[22:52:58] paladox: Although your enthusiasm is appreciated, I don't think we should be taking things like https://secure.phabricator.com/D14813 to the upstream project. They simply aren't interested. Gerrit isn't at all important to most of the world.
[22:53:24] twentyafterfour: Ok.
[22:54:10] hashar: we _could_ have some hhvm maintenance endpoint that's requested when jobs complete
[22:54:40] paladox: we may try to address the issue our fork. If we come up with a good solution, anyone else who wants to follow our lead can apply patches to their own instance of phab.
[22:54:54] ...in our fork
[22:54:56] and let it clean up its own shite, as www-data
[22:54:59] Ok.
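The per-build tmpdir scoping being proposed (a directory keyed on Jenkins' `BUILD_TAG`, world-writable so the www-data hhvm process can use it, with the PHP side wiring it up via something like `$wgTmpDirectory = getenv('TMPDIR') . '/' . getenv('BUILD_TAG')` in mediawiki.d) might be sketched on the shell side like this. The demo `BUILD_TAG` value is made up; `BUILD_TAG` itself is a standard Jenkins environment variable, and mode 777 matches what the log says the dirs already use.

```shell
# Sketch of per-build tmpdir scoping, assuming Jenkins' BUILD_TAG env var.
# Pre-build: create a world-writable directory the web backend can use.
BUILD_TAG="${BUILD_TAG:-jenkins-mediawiki-extensions-qunit-23883}"  # demo value
JOB_TMP="${TMPDIR:-/tmp}/$BUILD_TAG"

mkdir -p "$JOB_TMP"
chmod 777 "$JOB_TMP"   # hhvm runs as www-data and must be able to write here
echo "MediaWiki tmpdir for this build: $JOB_TMP"

# Post-build / periodic cleanup: delete only what the web server wrote,
# mirroring the `find ... -user www-data -delete` salt run in the !log
# above, so the jenkins-deploy-owned directory itself survives.
find "$JOB_TMP" -user www-data -delete 2>/dev/null || true
```

Because the directory name embeds the build, a late www-data write from a cancelled job lands in that build's own directory and can be swept up later without racing the next build's `mkdir -p`.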
[22:55:26] hashar: it would have to be a sync operation to avoid race conditions
[22:56:55] marxarelli: so embedded in hhvm ?
[22:57:39] hashar: i.e. assign MW's tmpdir to /whatever/jenkins-0/hhvm, and at the end of the job, request http://127.0.0.1/jenkins/cleanup.php?dir=/whatever/jenkins-0/hhvm
[22:57:52] or some other wacky shit like that :)
[22:58:08] but you can still have a long running process standing bye
[22:58:10] by
[22:58:36] hashar: grr...
[22:58:38] you're right :)
[22:58:57] maybe we could use the hhvm console to list threads based on some info they might have
[22:59:05] maybe apache pass the virtual host to it
[22:59:11] `while lsof +D /tmp/jenkins-0; sleep` :)
[22:59:13] and the vhost has the job name + build #
[22:59:18] ahaha
[22:59:25] paladox: thanks for your help with the redirect url stuff, at least we got that mostly straightened out.
[22:59:58] twentyafterfour: Yes, Most should be working now.
[23:00:18] marxarelli: would you mind summarizing your suggestion at https://phabricator.wikimedia.org/T120824 ? I don't want to steal your finding :-}
[23:00:37] hashar: sure. mind if i just copy/paste?
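marxarelli's `while lsof +D /tmp/jenkins-0; sleep` one-liner could be fleshed out as below: poll until nothing holds a file open under the job tmpdir, then remove it. `lsof +D` exits non-zero when it finds no open files, which is what ends the loop. The function name, timeout, and retry interval are illustrative, not from the log.

```shell
# Hypothetical teardown helper: wait (bounded) for processes such as a
# lingering hhvm worker to close their files under the job tmpdir, then
# remove it. lsof +D scans the directory tree; a non-zero exit means
# nothing has files open there anymore.
wait_and_remove() {
    dir="$1"
    tries="${2:-30}"   # give up after ~30s rather than hang the build
    while [ "$tries" -gt 0 ] && lsof +D "$dir" >/dev/null 2>&1; do
        tries=$((tries - 1))
        sleep 1
    done
    rm -rf "$dir"
}

# Demo on a scratch directory (the log's real path is /tmp/jenkins-0):
scratch=$(mktemp -d)
touch "$scratch/leftover.tmp"
wait_and_remove "$scratch" 5
```

As the conversation notes, this only helps if hhvm eventually finishes its request; a hard timeout keeps a stuck worker from blocking teardown forever.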
[23:00:43] yeah
[23:00:57] will be fine
[23:01:00] 22:33:05 chmod: changing permissions of ‘/mnt/home/jenkins-deploy/tmpfs/jenkins-3’: Operation not permitted
[23:01:03] 22:33:05 Build step 'Execute shell' marked build as failure
[23:01:09] https://integration.wikimedia.org/ci/job/mediawiki-core-npm/8357/console
[23:01:16] thedj: yup we are talking about it right now
[23:01:21] been hurting us for a while now :(((
[23:01:59] thedj: that is https://phabricator.wikimedia.org/T120824
[23:02:29] thedj: I have rechecked https://gerrit.wikimedia.org/r/#/c/219629/4 we will see
[23:02:37] hashar: damn limechat copy/paste is terrible
[23:03:36] marxarelli: I am using https://github.com/Codeux/Textual.git @ some tag with a self signed apple dev certificate of some sort
[23:04:45] marxarelli: with a patch on top of it https://phabricator.wikimedia.org/P2441
[23:05:11] which might not work :-}
[23:09:57] Yippee, build fixed!
[23:09:57] Project beta-scap-eqiad build #82799: FIXED in 23 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/82799/
[23:12:26] Continuous-Integration-Infrastructure, Patch-For-Review: Dozens of jobs failing on integration-slave-trusty-1012 because chmod fails for /tmp/jenkins-2 - https://phabricator.wikimedia.org/T120824#1891863 (dduvall) From IRC with @hashar (TL;DR: parallelism killed Kenny): ``` 14:43 marxarelli: hashar: is i...
[23:19:27] hashar: would it be possible to run hhvm outside of apache? i.e. startup a web server owned by jenkins-deploy that serves the mw instance?
[23:21:03] if only we had container based CI ...
[23:23:39] we do ! :D
[23:23:54] we can move the jobs to Nodepool instances
[23:24:04] "i need x, y, z services to test against", the job said. "sure thing boss, here's your own little cluster of services. let me know when you're done and i'll nuke em all to hell for ya" CI replied
[23:24:17] that is actually what I wanted to play test this week
[23:24:26] but instead spent time figuring out that issue
[23:24:37] yeah
[23:24:47] nodepool even supports spawning several instances
[23:25:08] it publish the context as env variable under something like /etc/nodepool/context
[23:25:20] so a testruner can then grab ip address of other nodes and set them up
[23:25:21] i think we should talk with Yuvi at the dev summit and see what we can borrow from the kubernetes setup they're building
[23:25:31] hashar: oh wait, you're not going to be there! :(
[23:25:33] yeah that as well
[23:25:47] doesn't prevent you guys to talk and figure out whatever solution!!!!
[23:26:23] it is not my infra, I surely have the most experience for it but I am responsible for a bunch of shortsighted/faults
[23:26:42] so any fresh view would definitely help
[23:28:27] i like the idea of hhvm as a standalone server
[23:28:34] we don't build it for Jessie though :/
[23:30:10] whaaa, hashar isn't coming in january? :'(
[23:30:26] nop
[23:30:42] this way you guys can reimagine whatever you want :-D
[23:30:51] RECOVERY - Host integration-t102459 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[23:31:15] hashar: whatever it is, we'll call it HASH-R
[23:31:33] or HASH-R2
[23:31:38] cause, ya know
[23:31:51] hehe
[23:32:04] hey
[23:32:34] I just remembered that mediawiki has some tools to use the PHP 5.4 or 5.5 embedded server
[23:32:40] in maintenance/dev/
[23:32:51] so we could spawn it pre build
[23:32:54] and kill it postbuild
[23:33:24] would it be a problem to not run it on hhvm?
[23:33:28] nop
[23:33:39] i don't think it will cause any troubles
[23:33:47] the mediawiki vhost is used for karma/qunit and selenium
[23:33:58] I don't think the PHP flavor in the backend matter that much
[23:34:04] we might see a perf degredation
[23:34:07] it shouldn't
[23:34:29] and if it does for selenium tests, there's a problem
[23:34:42] probably
[23:34:49] Jessie has 5.6
[23:35:09] so perf would be in between 5.3 and hhvm I guess
[23:35:47] perf testing in ci on labs seems problematic anyhow
[23:36:27] but we should check in with Krinkle and crew
[23:37:12] I mean, people are just going to complain that jobs have gotten slower. But if it's not breaking all the time, I think they'll be happier :)
[23:37:22] oh
[23:37:39] wait for us to add the browser tests on mediawiki/core ;-}
[23:37:53] legoktm: selenium jobs are slow as hell anyway, it will be negligible there
[23:38:34] i.e. the webdriver overhead dwarfs any additional backend performance difference
[23:38:55] * marxarelli says with 78% certainty
[23:39:04] Fatal error: Call to undefined function xhprof_enable() in /projects/mediawiki/core/includes/libs/Xhprof.php on line 84
[23:39:05] pff
[23:39:18] need php5-xhprof installed
[23:48:09] marxarelli: php -S localhost:8080 maintenance/dev/includes/router.php
[23:48:23] legoktm: yeah don't have it on the php 5.6 build-in for mac
[23:48:27] and my homebrew one is a 5.3
[23:49:40] marxarelli: Phabricator supports irc syntax highlight for https://phabricator.wikimedia.org/T120824#1891863
[23:49:43] ```
[23:49:45] lang=irc
[23:49:55] 00:00 johndoe: hello world
[23:49:57] ```
[23:49:59] profit
[23:50:04] :-D
[23:50:09] thank you for the past!
[23:50:51] paste
[23:51:04] 1am
[23:51:08] root cause is probably found
[23:51:16] guess I will get some rest / think about it over the week-end
[23:51:28] hashar: i tried lang=irc but it didn't seem to do much
[23:51:32] :(
[23:51:37] hashar: yeah, get some sleep!
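The spawn-pre-build / kill-post-build idea around hashar's `php -S localhost:8080 maintenance/dev/includes/router.php` could be wrapped roughly as follows. The MediaWiki checkout path and port are illustrative; the router script path is the one quoted in the log. Because the server runs as the job's own user (jenkins-deploy) and dies with the job, it cannot leave www-data-owned temp files behind.

```shell
# Sketch: run MediaWiki under PHP's built-in web server for the duration
# of a build, instead of a long-lived apache/hhvm backend. Assumed
# values: MW_DIR and PORT.
MW_DIR="${MW_DIR:-/srv/mediawiki/core}"
PORT="${PORT:-8080}"

# Pre-build: start the built-in server with MediaWiki's dev router
php -S "localhost:$PORT" "$MW_DIR/maintenance/dev/includes/router.php" &
SERVER_PID=$!

# ... run the qunit/karma/selenium jobs against http://localhost:$PORT ...

# Post-build: the backend is a child of the job, so tearing it down is
# just a kill; nothing outlives the build to race the tmpdir cleanup
kill "$SERVER_PID" 2>/dev/null
wait "$SERVER_PID" 2>/dev/null
```

One caveat raised right after in the channel: the built-in server is single-threaded PHP rather than hhvm, so some performance difference in the jobs is expected.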
[23:51:47] hashar: thanks for troubleshooting
[23:52:34] ohh
[23:52:39] i actually have to take off soon myself. holiday party #2 at my wife's office
[23:52:40] your client format is not recognized
[23:53:13] not surprised. limechat is the worst
[23:53:58] marxarelli: https://phabricator.wikimedia.org/F3121853 :D
[23:54:34] bd808 introduced Textual to me
[23:54:35] 5/5
[23:54:45] nice, i'll give it a go
[23:56:14] now
[23:56:29] I am looking for the IRC client "pirc" from the 90's
[23:56:34] and google has nothing to offer
[23:56:36] PROBLEM - Host integration-t102459 is DOWN: CRITICAL - Host Unreachable (10.68.16.67)
[23:57:04] it is worrying how stuff disappears from Google / the www :(
[23:57:10] hashar: try WP search ;)
[23:58:04] oh
[23:58:05] I had it wrong https://en.wikipedia.org/wiki/PIRCH !!
[23:58:26] anyway yeah will try to figure out a solution
[23:59:12] Deployment-Systems, Architecture, Wikimedia-Developer-Summit-2016-Organization, Availability: WikiDev 16 working area: Software engineering - https://phabricator.wikimedia.org/T119032#1891985 (bd808) >>! In T119032#1891041, @daniel wrote: > @bd808 Rob mentioned you might be able to help with struc...