[01:50:33] 10Beta-Cluster-Infrastructure: [Regression pre-wmf.21] Search for pages/templates in VE is not working in Beta cluster - https://phabricator.wikimedia.org/T132408#2197861 (10Krenair) [01:52:26] 10Beta-Cluster-Infrastructure, 06Discovery-Search-Backlog: [Regression pre-wmf.21] Search for pages/templates in VE is not working in Beta cluster - https://phabricator.wikimedia.org/T132408#2197341 (10Krenair) [03:25:23] PROBLEM - Puppet staleness on deployment-restbase03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0] [04:28:48] (03PS1) 10Tim Starling: Fix three bugs in BlockMotionSearch [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282845 [04:28:50] (03PS1) 10Tim Starling: Improve intermediate output [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282846 [04:29:53] (03PS1) 10Tim Starling: Revert "Make it compile without C++11" [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282847 [04:30:00] (03PS2) 10Tim Starling: Revert "Make it compile without C++11" [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282847 [04:31:20] Yippee, build fixed! [04:31:21] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #782: 09FIXED in 24 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/782/ [04:33:06] (03CR) 10Tim Starling: "This fixes the es:Minotauro test case, and presumably any other test case where zero motion was excluded from the search at the bottom of " [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282845 (owner: 10Tim Starling) [05:01:19] (03CR) 10Subramanya Sastry: [C: 031] "Let us get this in and deployed on promethium." [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282845 (owner: 10Tim Starling) [05:01:35] (03CR) 10Tim Starling: [C: 032] Fix three bugs in BlockMotionSearch [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282845 (owner: 10Tim Starling) [05:01:44] (03CR) 10Tim Starling: [V: 032] Fix three bugs in BlockMotionSearch [integration/uprightdiff] - 10https://gerrit.wikimedia.org/r/282845 (owner: 10Tim Starling) [05:05:53] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2197940 (10phuedx) [05:06:17] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend, 03Reading Web Sprint 70 L: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2196252 (10phuedx) p:05Triage>03Unbreak! [06:33:12] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #77: 04FAILURE in 6 min 11 sec: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/77/ [07:57:15] good morning [07:58:55] i'd appreciate some thoughts on https://phabricator.wikimedia.org/T132379, browser tests are suddenly not running properly on jenkins (and blocking all patches). See https://integration.wikimedia.org/ci/job/mwext-mw-selenium/5443/console . IIRC the browser that ran the tests was chrome, but it is complaining about not getting a firefox connection in the [07:58:55] logs. Thanks!
[08:06:25] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10RESTBase, 06Services: iptables rules on deployment-restbase02 are wrong - https://phabricator.wikimedia.org/T125038#2198099 (10mobrovac) 05Open>03Resolved a:03mobrovac Yup. Thnx @Krenair for following up on it. [08:20:55] 10Beta-Cluster-Infrastructure, 03Scap3, 07Puppet: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2192991 (10mobrovac) This is likely because `.git/DEPLOY_HEAD` does not exist for these repos on `deployment-tin` [08:22:34] 10Beta-Cluster-Infrastructure, 06Services, 07Puppet, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2198117 (10mobrovac) a:03mobrovac *sigh*. A simple `service X... [08:22:44] 10Beta-Cluster-Infrastructure, 06Services, 07Puppet, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2198120 (10mobrovac) p:05Triage>03Normal [08:33:17] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #938: 04FAILURE in 23 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/938/ [09:28:55] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend, 13Patch-For-Review, 03Reading Web Sprint 70 L: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2198258 (10Jdlrobson) I compared the last passing build w... [09:35:00] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend, 13Patch-For-Review, 03Reading Web Sprint 70 L: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2198286 (10Jdlrobson) I'll be back later today, but if an... [09:42:13] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #777: 04FAILURE in 21 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/777/ [09:44:45] Project selenium-CentralAuth » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #28: 04FAILURE in 13 sec: https://integration.wikimedia.org/ci/job/selenium-CentralAuth/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/28/ [09:45:30] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:59:51] RECOVERY - Puppet run on integration-slave-precise-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [10:48:59] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend, 13Patch-For-Review, 03Reading Web Sprint 70 L: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2198355 (10phuedx) This looks similar to, if not the same... [11:00:06] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend, 13Patch-For-Review, 03Reading Web Sprint 70 L: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2198386 (10Jdlrobson) @JanZerebecki may have some idea gi...
[11:25:55] 07Browser-Tests, 10Browser-Tests-Infrastructure, 10Continuous-Integration-Config, 10Wikidata, and 2 others: [Task] Remove Wikidata performance-tests jobs - https://phabricator.wikimedia.org/T130017#2198407 (10Tobi_WMDE_SW) [11:34:15] !log Jenkins upgrading "Script Security Plugin" from 1.17 to 1.18.1 https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-04-11 [11:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [11:34:28] ^^^ got released overnight .. [11:34:55] PROBLEM - Host integration-trusty-1026 is DOWN: CRITICAL - Host Unreachable (10.68.17.98) [11:37:31] 10Continuous-Integration-Config, 10Wikidata, 03Wikidata-Sprint-2016-03-01, 03Wikidata-Sprint-2016-04-12: [Task] Setup a Jenkins job to run Wikidata browsertests on test.wikidata.org - https://phabricator.wikimedia.org/T101499#2198431 (10Tobi_WMDE_SW) [11:37:58] 07Browser-Tests, 10Wikidata, 10Wikidata-Gadgets, 03Wikidata-Sprint-2016-03-01, 03Wikidata-Sprint-2016-04-12: smoke test Feature: Authority control gadget test fails - https://phabricator.wikimedia.org/T131144#2198432 (10Tobi_WMDE_SW) [13:09:52] 10Continuous-Integration-Infrastructure, 07Upstream, 07Zuul: Zuul-cloner failing to acquire .git/config lock sometimes - https://phabricator.wikimedia.org/T86730#2198681 (10Tgr) A different error: https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/62842/console ``` 11:55:45 INFO:zuul.Cloner:Creati... [13:14:26] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Simplify creating Jenkins jobs for running browser tests daily - https://phabricator.wikimedia.org/T128190#2198699 (10zeljkofilipin) [13:15:08] (03PS34) 10Zfilipin: WIP Simplify creating Jenkins jobs for running browser tests daily [integration/config] - 10https://gerrit.wikimedia.org/r/274136 (https://phabricator.wikimedia.org/T128190) [13:17:02] (03PS35) 10Zfilipin: WIP Simplify creating Jenkins jobs for running browser tests daily [integration/config] - 10https://gerrit.wikimedia.org/r/274136 (https://phabricator.wikimedia.org/T128190) [13:26:34] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Simplify creating Jenkins jobs for running browser tests daily - https://phabricator.wikimedia.org/T128190#2198761 (10zeljkofilipin) [13:32:25] 10Beta-Cluster-Infrastructure: Make MathML rendering default on BETA wikipedia - https://phabricator.wikimedia.org/T104550#2198814 (10mobrovac) [13:32:51] 10Beta-Cluster-Infrastructure: Make MathML rendering default on BETA wikipedia - https://phabricator.wikimedia.org/T104550#1420076 (10mobrovac) [14:17:17] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Simplify creating Jenkins jobs for running browser tests daily - https://phabricator.wikimedia.org/T128190#2198914 (10zeljkofilipin) [14:27:56] 10Beta-Cluster-Infrastructure, 06Services, 07Puppet, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2198924 (10Krenair) Thank you @mobrovac [14:29:40] (03PS36) 10Zfilipin: WIP Simplify creating Jenkins jobs for running browser tests daily [integration/config] - 10https://gerrit.wikimedia.org/r/274136 (https://phabricator.wikimedia.org/T128190) [14:44:35] (03CR) 10Zfilipin: "recheck" [selenium] - 10https://gerrit.wikimedia.org/r/274938 (https://phabricator.wikimedia.org/T128860) (owner: 10Zfilipin) [14:53:51] 10Beta-Cluster-Infrastructure, 
06Discovery-Search-Backlog: [Regression pre-wmf.21] Search for pages/templates in VE is not working in Beta cluster - https://phabricator.wikimedia.org/T132408#2197341 (10greg) >>! In T132408#2197858, @Jdforrester-WMF wrote: > Looks like the Beta Cluster search servers are broken... [15:12:11] (03PS1) 10Jforrester: Add Coren (marc@uberbox.org) to CI V+2ers [integration/config] - 10https://gerrit.wikimedia.org/r/282952 (https://phabricator.wikimedia.org/T132345) [15:19:27] Yippee, build fixed! [15:19:27] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #393: 09FIXED in 1 min 26 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/393/ [15:24:12] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend, 10Unplanned-Sprint-Work, and 2 others: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2199110 (10MBinder_WMF) [15:24:18] 10Beta-Cluster-Infrastructure, 03Scap3, 07Puppet: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199111 (10thcipriani) aqs deployment failures === It looks like the scap puppet provider is attempting to deploy `analytics/aqs/deploy` from `deployment-t... [15:25:46] 10Beta-Cluster-Infrastructure, 06Operations, 06Services, 07Tracking: Move Node.JS services to Jessie and Node 4 (tracking) - https://phabricator.wikimedia.org/T124989#2199121 (10mobrovac) [15:45:20] 07Browser-Tests, 06Release-Engineering-Team, 10MobileFrontend, 10Unplanned-Sprint-Work, and 2 others: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2199284 (10Jdlrobson) https://gerrit.wikimedia.org/r/#/c/280470/ wa... [15:48:26] (03CR) 10Jdlrobson: "We believe this has caused T132379" [selenium] - 10https://gerrit.wikimedia.org/r/280470 (https://phabricator.wikimedia.org/T128190) (owner: 10Dduvall) [15:55:58] (03CR) 10Zfilipin: "No, it did not. This commit is only in master. To cause any effect, the gem would have to be released (not done yet) and a repository (Mob" [selenium] - 10https://gerrit.wikimedia.org/r/280470 (https://phabricator.wikimedia.org/T128190) (owner: 10Dduvall) [15:57:59] 10Browser-Tests-Infrastructure, 06Release-Engineering-Team, 10MobileFrontend, 10Unplanned-Sprint-Work, and 3 others: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2199350 (10zeljkofilipin) [16:01:46] 10Browser-Tests-Infrastructure, 10MobileFrontend, 10Unplanned-Sprint-Work, 13Patch-For-Review, and 2 others: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2199372 (10zeljkofilipin) [16:02:39] 10Browser-Tests-Infrastructure, 10MobileFrontend, 10Unplanned-Sprint-Work, 13Patch-For-Review, and 2 others: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2196252 (10zeljkofilipin) a:03zeljkofilipin [16:06:11] 10Beta-Cluster-Infrastructure, 07Puppet: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2199383 (10scfc) The general issue with exported resources is that they require only friendlies on the same `puppetmaster`. In Labs (almost) anyone can be `root... 
[16:06:26] 10Browser-Tests-Infrastructure, 10MobileFrontend, 10Unplanned-Sprint-Work, 13Patch-For-Review, and 2 others: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2199384 (10zeljkofilipin) 05Open>03Resolved [[ https://g... [16:13:13] 10Browser-Tests-Infrastructure, 10MobileFrontend, 10Unplanned-Sprint-Work, 13Patch-For-Review, and 2 others: " unable to obtain stable firefox connection in 60 seconds" blocking merges on MobileFrontend - https://phabricator.wikimedia.org/T132379#2199452 (10zeljkofilipin) Caused by 7c160a9154e7? [16:19:41] 10Beta-Cluster-Infrastructure, 10Monitoring, 07Shinken: Monitor keyholder on deployment-bastion - https://phabricator.wikimedia.org/T111064#2199469 (10scfc) Ah, sorry, I wasn't aware that #Beta has its own #Shinken server. So in the special case of #Beta there might be an easier way, but something along the... [16:21:14] 10Beta-Cluster-Infrastructure, 07Puppet: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2199473 (10Krenair) AFAIK everyone with access to deployment-prep has root on deployment-puppetmaster, which is the puppetmaster for all instances in the project... [16:25:09] 10Beta-Cluster-Infrastructure, 10Monitoring, 07Shinken: Monitor keyholder on deployment-bastion - https://phabricator.wikimedia.org/T111064#2199478 (10Krenair) It doesn't. That file goes to shinken-01.shinken.eqiad.wmflabs:/etc/shinken/customconfig/betacluster-hosts.cfg [16:30:14] 10Beta-Cluster-Infrastructure, 06Discovery-Search-Backlog, 13Patch-For-Review: [Regression pre-wmf.21] Search for pages/templates in VE is not working in Beta cluster - https://phabricator.wikimedia.org/T132408#2199513 (10EBernhardson) a:03EBernhardson [16:30:42] 10Beta-Cluster-Infrastructure, 06Discovery-Search-Backlog, 13Patch-For-Review: [Regression pre-wmf.21] Search for pages/templates in VE is not working in Beta cluster - https://phabricator.wikimedia.org/T132408#2197341 (10EBernhardson) 05Open>03Resolved Identified issue and pushed config update, it was b... [16:30:59] 10Beta-Cluster-Infrastructure, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 13Patch-For-Review: [Regression pre-wmf.21] Search for pages/templates in VE is not working in Beta cluster - https://phabricator.wikimedia.org/T132408#2199516 (10EBernhardson) [16:31:16] 10Beta-Cluster-Infrastructure, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 13Patch-For-Review: Switching the prod cluster to query from codfw as part of the DC switchover broke beta cluster - https://phabricator.wikimedia.org/T132408#2199518 (10greg) [16:31:33] thanks for the quick turn around ebernhardson [16:32:22] np [16:33:59] 06Release-Engineering-Team, 05Release: MW-1.27.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T131555#2199539 (10greg) a:05dduvall>03demon [17:01:15] 10Beta-Cluster-Infrastructure, 03Scap3, 07Puppet: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199651 (10Ladsgroup) @thcipriani I was dealing with this issue while I was working on using scap I thought I solved it. That's what I've got so far: 70 me... [17:01:48] 10Beta-Cluster-Infrastructure, 03Scap3, 07Puppet: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199659 (10Ladsgroup) I'll be around in IRC if you want to discuss or ask question. 
[17:02:04] 10Beta-Cluster-Infrastructure, 03Scap3, 06Revision-Scoring-As-A-Service, 07Puppet: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199660 (10Ladsgroup) [17:02:29] thcipriani: https://phabricator.wikimedia.org/T132267#2199651 [17:02:32] hey btw [17:02:47] Amir1: hiya [17:02:51] * thcipriani looks [17:03:55] I think I explained it very vaguely, please ask questions or tell me to re-tell the whole story [17:05:08] !log ran puppet agent in sca01 manually in /srv directory [17:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:05:44] Amir1: but...why does that make a difference? [17:06:31] puppet shouldn't need to change directories to run deploy-local, right? [17:07:41] I think git clone in scap has issues with that [17:08:07] that's shocking to me [17:08:13] hmmm...but only when run by puppet. Got to be some issue with the provider I guess. [17:08:30] running as the deploy-service user deploy-local seems to run fine from anywhere, seemingly. [17:08:42] it fails if you run it from your home directory [17:09:01] oh boy. /me tries that. [17:09:18] running deploy-local or puppet agent? [17:09:27] deploy-local [17:09:53] if you want to run deploy-local from your home directory it fails [17:10:23] Amir1: are you running it as the deploy-service user? [17:10:24] but when you do sudo -i deploy-service it automatically changes your directory to the deploy-service directory [17:10:33] yes [17:10:47] *deploy-service home folder [17:11:14] how are you running the command? I can't seem to make it fail as the deploy-service user. [17:11:38] sure [17:11:53] sudo -u deploy-service && cd /home/ladsgroup [17:12:16] it fails with permission denied [17:12:54] or try to run "puppet agent" in /home/ladsgroup [17:14:18] ah sudo -u deploy-service -- /usr/bin/deploy-local --repo ores/deploy -D log_json:False I see what you're saying. [17:15:14] Oh dear. [17:15:16] "wmf/1.27-wmf.21" [17:15:24] Not "wmf/1.27.0-wmf.21"? [17:17:03] Amir1: ok, lemme futz with this and see if I can find how this is affecting the provider. Thank you for walking me through the problem. [17:17:14] James_F: being fixed now. [17:17:33] thcipriani: Cool, thanks [17:17:35] happy to be helpful :) [17:17:46] thcipriani: tell me if I can do anything [17:20:25] James_F: reason #194919 why we should have automatic branch cuts: fewer typos :) [17:20:33] Amir1: will do, I suspect there is a simple fix somewhere in here. [17:20:35] greg-g: Tell me about it. :-) [17:21:59] 10Beta-Cluster-Infrastructure, 03Scap3, 06Revision-Scoring-As-A-Service, 07Puppet: Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199720 (10Krenair) [17:27:25] Amir1: looks like the utils.cd context manager we use to change directories before a git command: https://github.com/wikimedia/scap/blob/master/scap/utils.py#L337 since you've switched users, os.chdir blows up if you're in a directory in which the deployment user doesn't have access when you start the process (since it tries to cd back to that directory once the context closes) [17:28:00] good catch, thank you. [17:28:13] \o/ [17:28:26] so we have to wait until the next release comes? [17:29:00] yeah, this is likely going to require a scap change [17:29:21] I thought it would be a puppet change (in provider) [17:29:42] hmm..yeah..I was just thinking that perhaps there is a workaround in the provider that can be applied.
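
The os.chdir failure thcipriani pins down above can be sketched in a few lines of Python. This is a simplified stand-in for scap's utils.cd helper, not the actual scap code:

    import contextlib
    import os

    @contextlib.contextmanager
    def cd(dirname):
        old_dir = os.getcwd()  # e.g. /home/ladsgroup, inherited through sudo
        os.chdir(dirname)
        try:
            yield
        finally:
            # If the process was started from a directory the deploy-service
            # user cannot enter, restoring the old working directory raises
            # PermissionError here and the whole deploy-local run fails.
            os.chdir(old_dir)

This is also why the workaround discussed next (having the puppet provider chdir into the deployment directory before invoking deploy-local) sidesteps the bug: the saved directory is then always one the deploy-service user can re-enter.
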
[17:30:02] like moving to the deployment directory before starting. [17:30:10] yeah [17:30:22] a temporary fix [17:30:29] is it okay with you? [17:32:34] yeah, that seems fine to me. Shouldn't hurt anything with the provider to do FileUtils.cd :) [17:49:14] Project beta-scap-eqiad build #97935: 04FAILURE in 14 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97935/ [17:53:39] Project beta-scap-eqiad build #97936: 04STILL FAILING in 2 min 27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97936/ [17:54:42] blerg. No space on mediawiki01 again. [17:57:56] Project beta-scap-eqiad build #97937: 04STILL FAILING in 2 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97937/ [17:59:47] 05Gitblit-Deprecate, 06Release-Engineering-Team, 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2199866 (10Dzahn) [18:07:23] Project beta-scap-eqiad build #97938: 04STILL FAILING in 2 min 29 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97938/ [18:17:23] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50833 bytes in 0.026 second response time [18:17:26] Project beta-scap-eqiad build #97939: 04STILL FAILING in 2 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97939/ [18:17:56] PROBLEM - Host deployment-restbase03 is DOWN: CRITICAL - Host Unreachable (10.68.16.252) [18:19:17] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [18:22:25] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 40592 bytes in 0.825 second response time [18:27:26] Project beta-scap-eqiad build #97940: 04STILL FAILING in 2 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97940/ [18:30:42] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #655: 04FAILURE in 41 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/655/ [18:37:29] Project beta-scap-eqiad build #97941: 04STILL FAILING in 2 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97941/ [18:37:50] RECOVERY - Host deployment-restbase03 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [18:44:19] PROBLEM - Host deployment-restbase03 is DOWN: CRITICAL - Host Unreachable (10.68.22.121) [18:47:34] Project beta-scap-eqiad build #97942: 04STILL FAILING in 2 min 35 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97942/ [18:48:20] greg-g: beta wmflabs login is unavailable :( [18:50:04] 10Beta-Cluster-Infrastructure, 10Flow, 03Collab-Team-2016-Q4, 13Patch-For-Review: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#2200108 (10jmatazzoni) [18:53:57] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #658: 04FAILURE in 56 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/658/ [18:55:00] etonkovidova, I think beta is broken in general...
looking [18:55:38] Krenair: lol true [18:57:26] Project beta-scap-eqiad build #97943: 04STILL FAILING in 2 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97943/ [18:58:51] thcipriani: it would be great if you take a look at this: https://gerrit.wikimedia.org/r/282992 [18:59:33] BASE_PATH = '/srv/deployment [19:00:00] Amir1: maybe the cd path should be deploy_root? [19:00:58] sure [19:02:18] I think something is wrong on deployment-mediawiki01 [19:02:34] 10Beta-Cluster-Infrastructure, 10Flow, 03Collab-Team-2016-Q4: Set up second External Store cluster on Beta - https://phabricator.wikimedia.org/T128417#2200189 (10jmatazzoni) [19:03:22] the projects section in this phab card is so colorful https://phabricator.wikimedia.org/T132267 [19:03:27] Project beta-scap-eqiad build #97944: 04STILL FAILING in 2 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97944/ [19:03:37] Krenair: blerg. deployment-mediawiki01 has been running out of space pretty regularly :( [19:03:38] thcipriani: done [19:05:07] Amir1: +1'd feel free to cherry pick on beta. [19:05:10] ah, disk full [19:05:12] didn't notice that [19:05:32] thanks :) [19:06:15] (03CR) 10Legoktm: [C: 032] Update phpunit to 4.8.24 [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/281817 (owner: 10Paladox) [19:06:56] 1.2G /srv/mediawiki/php-master/cache/l10n/upstream/.~tmp~ [19:07:25] Krenair: hmmm [19:07:47] Project beta-scap-eqiad build #97945: 04STILL FAILING in 2 min 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97945/ [19:08:23] !log manually cherry-picked 282992/2 into the puppetmaster [19:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:10:04] ori, there are some disk space problems on deployment-mediawiki01, /home/ori/debs is >400M.. do you need it? [19:10:21] in particular -rw-r--r-- 1 ori wikidev 405M Mar 8 02:50 hhvm-dbg_3.12.1+dfsg-1_amd64.deb [19:10:27] No. You can delete it. Or, if you like, I can. [19:10:37] I don't mind [19:10:41] Tnx [19:10:44] !log manually rebooted deployment-ores-web [19:10:45] I have a shell open though, so I'll do it. thanks [19:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:10:49] we really need more disk space on a lot of beta instances [19:12:14] yeah :( [19:13:33] Chrome puts "Close other tabs" *way* too close to "Close tab" [19:16:56] ah, just as I was about to tell her I fixed it :( [19:17:21] twentyafterfour, do we have a task about recurring disk space issues? I should probably know by now, but... [19:17:38] I think there was something about diamond [19:18:48] the atop /var/log one [19:18:52] is the best off the top of my head [19:19:03] Yippee, build fixed! [19:19:03] Project beta-scap-eqiad build #97946: 09FIXED in 4 min 6 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/97946/ [19:19:17] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 40237 bytes in 1.528 second response time [19:34:44] eh, is puppet broken for CI hosts? [19:34:51] integration-slave-trusty-1013 is a CI Jenkins slave for running tests in local browsers (role::ci::slave::localbrowser) [19:34:51] The last Puppet run was at Mon Apr 11 14:50:25 UTC 2016 (1247 minutes ago). [19:39:13] Krenair: on deployment-mediawiki01 it looks like the biggest disk hog is /var/cache/hhvm/fcgi.hhbc.sq3 (2.3G).
[19:39:47] Krenair: judging by the output of `sqlite3 fcgi.hhbc.sq3 .tables` we aren't cleaning that file up when we deploy new HHVM builds [19:40:48] a "feature" of hhbc files is that they version the cache to match the exact HHVM build [19:41:10] so when we deploy a new hhvm binary it starts a fresh cache segment [19:41:22] but nothing automatically cleans up the old cache data [19:43:08] it looks like we have 9 versions of each table in that particular hhbc cache [19:46:38] !log Cleaned up large hhbc cache file on deployment-mediawiki01 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start` [19:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:47:16] (03Merged) 10jenkins-bot: Update phpunit to 4.8.24 [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/281817 (owner: 10Paladox) [19:47:18] !log Cleaned up large hhbc cache file on deployment-mediawiki02 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start` [19:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:47:47] !log Cleaned up large hhbc cache file on deployment-mediawiki03 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start` [19:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:48:13] That got back about 2G on each MW server [19:48:27] bd808, interesting, thanks [19:48:34] will try that in future [19:48:41] is it documented somewhere? :) [19:49:29] heh. well, probably not [19:49:32] should we just have that done every week? :) [19:49:42] it's in the backscroll here now :) [19:50:00] greg-g: good question [19:50:21] I'm not sure if we are managing this in prod either [19:51:00] in prod there would be other interesting cruft that builds up due to the weekly path changes [19:51:29] * greg-g nods [19:52:05] this is something that FB never deals with because they use repo authoritative. that means they ship a new hhbc cache with each deploy [19:52:13] right right [19:52:38] I was looking around but didn't find something I thought I had built 2 years ago [19:52:55] I'm pretty sure I wrote a python program to prune an hhbc file [19:52:59] but I can't find it [19:53:16] also, time flies [19:53:49] it totally doesn't seem like it has been 2 years since we started the hhvm conversion project [19:54:29] 10Beta-Cluster-Infrastructure, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 13Patch-For-Review, and 2 others: Switching the prod cluster to query from codfw as part of the DC switchover broke beta cluster - https://phabricator.wikimedia.org/T132408#2200611 (10Ryasmeen) [19:56:16] right? [19:56:24] I remember what you're talking about now [19:56:39] or was that about core dumps?
[19:56:54] probably core dumps [19:57:33] If I had saved it, it would be in https://github.com/bd808/bug-67168 [19:57:37] but it is not [19:58:12] there is a bash script in there to build repo authoritative caches as part of scap :) [19:58:18] I just looked at /var/cache/hhvm/fcgi.hhbc.sq3 in prod [19:58:32] Some of those things are 5GB [19:59:03] probably not the best use of sqlite [19:59:55] Loads in codfw at 1.1GB, probably barely touched [20:00:12] 10Beta-Cluster-Infrastructure: Creating wiki at beta cluster for the Dutch Wikipedia - https://phabricator.wikimedia.org/T118005#2200659 (10Natuur12) @Krenair: yes, I am interested though I do have less time than when I asked but it should be manageable with some help. [20:00:47] the one on mw1017 is only 230M. I wonder if that means that ori has cleaned it up manually before [20:03:07] Hi, some CI expert here? I got a gerrit patch for /mw/extension/MassMessage from another author, and json lint is failing, but none of us found the reason [20:03:18] Luke081515: link? [20:03:19] https://gerrit.wikimedia.org/r/#/c/282977/2 <-- can someone take a look? [20:03:40] Luke081515: well, it's not valid JSON :) [20:04:04] you have two quotes before ActionFilteredLogs [20:04:23] and then you're using {} for arrays, which should be [] instead [20:04:31] ah, ok [20:04:37] thx :) [20:04:56] (It's not my patch, but thanks for explaining, I didn't find the reason) [20:05:07] http://jsonlint.com/ usually gives good advice [20:05:30] bd808: the maximum database size for sqlite is actually around 140 terabytes. multi-gigabyte sqlite databases are common and perform well, in my experience. [20:05:59] ori: good to know [20:06:04] I *think* I purged the hhbc file when I upgraded to the latest version. I may have neglected to do the same for labs [20:06:40] beta cluster was certainly stale. Looked to have 9 copies of each table [20:08:38] I also think you are right about there being (or there having been) a script to automate the pruning [20:08:40] looking that up atm [20:10:39] there are 2 kinds of cleanup we might do: dropping tables for old repo schema versions (easy) and removing rows for old release branches (harder) [20:12:01] The current repo version is in the `hhvm -version` output. Any table that doesn't end with that sha1 is junk [20:12:16] my take: [20:12:40] - The deb's post-upgrade script should wipe hhbc files [20:12:51] - In the absence of that, I should have pruned them and neglected to [20:13:12] - a 5 GB file really ought not to be an issue in 2016 [20:14:11] yeah. the mediawiki nodes in beta cluster have very small disks [20:14:21] VMs blah blah blah [20:15:05] we could benchmark how long it takes HHVM to warm up when there is no preexisting hhbc file. If it's not substantially worse, the service job could delete it before starting hhvm [20:15:50] bd808: have you already pruned the files on labs, or should I? [20:16:16] I killed the ones on the mw hosts [20:16:16] apart from that I think I'm just going to add a note to wikitech on the HHVM upgrading / packaging page to remind myself or whomever else performs the next upgrade to prune bytecode cache files. [20:16:30] sounds like a good plan [20:16:33] thanks!
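
A rough sketch of the "easy" pruning bd808 describes above, dropping tables for old repo schema versions. The `Repo schema:` line format in the `hhvm --version` output and the exact table-naming convention are assumptions here, and it should only be run with HHVM stopped, as in the !log entries above:

    import re
    import sqlite3
    import subprocess

    # Assumption: `hhvm --version` prints a line like "Repo schema: <sha1>".
    out = subprocess.check_output(['hhvm', '--version']).decode()
    schema = re.search(r'Repo schema:\s*(\S+)', out).group(1)

    db = sqlite3.connect('/var/cache/hhvm/fcgi.hhbc.sq3')
    tables = [row[0] for row in
              db.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    for name in tables:
        if not name.endswith(schema):
            # Cache segment from a previous HHVM build: junk, per above.
            db.execute('DROP TABLE "{0}"'.format(name))
    db.commit()
    db.execute('VACUUM')  # sqlite will not shrink the file on its own
    db.close()
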
[20:16:42] (for killing the ones on the mw hosts) [20:16:45] it's not a huge thing at all, just an annoyance [20:17:21] 10Beta-Cluster-Infrastructure, 03Scap3, 06Revision-Scoring-As-A-Service, 13Patch-For-Review, 07Puppet: Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2200743 (10Ladsgroup) I cherry-picked this patch into the beta puppetmaster. So the issu... [20:18:34] Krenair: another poorly documented cleanup script is ~bd808/cleanup-var-crap.sh on deployment-tin [20:18:55] that one gets rid of lots of logs [20:19:25] it's probably not needed so much now as it was on deployment-bastion where we had the old tiny /var partition [20:19:54] 06Release-Engineering-Team, 06Team-Practices, 10Developer-Relations (Jul-Sep-2016): Developer Summit 2017: Work with TPG and RelEng on solution to event documenting - https://phabricator.wikimedia.org/T132400#2200753 (10Rfarrand) p:05Triage>03Normal [20:25:59] Krenair: not that I know of [20:31:27] greg-g: i'm unsure yet if it's needed, but just in case: may we have a deployment window tomorrow for a full scap that updates wikidata in the branch that is on group0? [20:32:29] jzerebecki: sure, go ahead and put one after the deploy window [20:32:35] something with gate-and-submit is wrong? After one change was merged, zuul aborted all tests and restarted them [20:32:42] and this for php55 tests [20:33:06] greg-g: I meant before the deploy window [20:33:23] s/deploy/train/ [20:34:04] jzerebecki: still ok, just fix the rest of my response [20:34:07] :) [20:34:09] thx [20:34:16] Luke081515: I've been noticing that, not sure what's going on [20:34:46] I hope gate-and-submit is empty till the next SWAT starts :D [20:35:14] If this continues, I guess Zuul will need 2-3 hours more time [20:38:15] greg-g: I will watch the next extension-php55, let me see if this continues, I hope not [20:38:59] * greg-g nods [20:42:06] greg-g: [20:42:09] 20:40:27 ........Build was aborted [20:42:10] 20:40:27 Aborted by anonymous [20:42:13] uhhhh [20:42:14] 20:40:28 Finished: ABORTED [20:42:21] I will file a bug [20:42:22] which? [20:42:31] https://integration.wikimedia.org/ci/job/mwext-testextension-php55/7770/console [20:42:43] this was core [20:43:21] also started by anonymous [20:43:28] yeah, that's default [20:43:29] the anonymous part is probably a red-herring [20:43:33] I'm going to file a task now [20:44:08] * greg-g 's unawareness of anonymous, obviously [20:44:26] anonymous is zuul/gearman [20:44:53] why would zuul abort a job though? [20:45:04] * twentyafterfour doesn't know much about zuul. maybe time to learn [20:45:07] * Luke081515 going to watch another build [20:45:22] this time mobilefrontend [20:47:18] twentyafterfour: it is a bad spot in our skill matrix: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Skill_matrix :) [20:48:09] twentyafterfour: if someone submits a new patchset, then it'll abort the already running jobs [20:48:33] 10Continuous-Integration-Infrastructure: Zuul restarts all builds of gate-and-submit after the first set of the queue is merged - https://phabricator.wikimedia.org/T132506#2200921 (10Luke081515) [20:49:03] so, I know that zuul usually does this sort of thing if the previous build fails. The log is full of nothing but successful merges. Not sure what could cause this.
[20:49:07] meh, gate-and-submit has currently 10 sets again [20:49:22] DEBUG:zuul.Cloner:Project mediawiki/core in Zuul does not have ref refs/zuul/master/Z7ecc71b531ce46c4b7b3465db3b85a52 [20:49:24] 20:40:55 INFO:zuul.Cloner:Falling back to branch master [20:49:40] Yippee, build fixed! [20:49:40] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #983: 09FIXED in 23 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/983/ [20:49:48] could this be related to the new branch that got cut twice? ostriches? [20:50:18] I'm watching one build progressing but very slowly [20:50:25] something is slowing down the phpunit execution [20:50:53] I don't understand, what's the problem? [20:51:02] I don't know either [20:51:21] seems to me things are just overloaded? [20:51:25] so when a patch is merged in the gate-and-submit pipeline, all the other patches in the pipeline's jobs are getting queued and restarted. [20:51:39] so the queue is just getting longer and longer. [20:51:53] A patch just merged and that didn't happen? [20:51:58] If a patch fails, it'll reset the queue [20:52:10] but it didn't fail [20:52:17] no the current patch [20:52:22] twentyafterfour: Maybe? [20:52:27] just one at the bottom, and this patch is still queued [20:53:23] I don't think anything is broken [20:53:24] Just slow [20:53:34] slow != broken? [20:53:39] ;) [20:53:53] legoktm: But normally long-running php55 tests don't restart after a merge of another patch? [20:54:04] 1 hr 11 min gate queue time isn't right [20:54:24] where do you see that, twentyafterfour? [20:54:33] I guess we can guess the next example, https://gerrit.wikimedia.org/r/#/c/282313/ is going to be merged: https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/2698/console [20:54:34] twentyafterfour: well, it's a different kind of broken. [20:54:42] and the next build is https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/2700/console [20:54:58] so this will be aborted if it happens again [20:55:19] oh, nevermind, I see it [20:55:28] Luke081515: No it won't? [20:56:13] legoktm: We will see :D. the next merge needs a bit at the moment [20:56:23] currently 91% of php55 finished, it's the last test [20:56:24] it's definitely not supposed to :P [20:56:27] ok, 96 [20:57:19] hm, doesn't happen this time [20:57:45] hrm. Maybe there were just a handful of jobs where php55 tests failed. Jamming up the queue. [20:58:07] didn't look like that, monitoring zuul... [20:58:11] I hope it won't happen again, currently nearly all tests finished, so gate-and-submit will become empty quickly, if the next test finishes [20:58:34] Also it depends on which slaves the jobs get assigned to, and if a slave fills up multiple executor slots, the tests are going to run slower [20:59:20] yeah, but they won't be aborted IIRC? [21:00:14] and restarted.... [21:00:15] hmm, just did it again :( [21:00:29] but two times it didn't happen, but now again [21:00:45] seems like an issue which only happens if mwcore is merged? Or am I wrong? [21:01:22] 21:00:07 .............Build was aborted [21:01:24] 21:00:07 Aborted by anonymous [21:01:36] 21:00:07 /srv/deployment/integration/slave-scripts/bin/mw-run-phpunit-allexts.sh: line 22: 7724 Terminated $PHP_BIN phpunit.php --log-junit "$JUNIT_DEST" --testsuite extensions [21:02:49] Exception: Gerrit error executing gerrit review --project translatewiki --message "Gate pipeline build succeeded.
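
The Gerrit error just above and the mass abort that goes with it are the dependent-pipeline behaviour diagnosed in the next few messages. A heavily simplified illustration of the logic, not zuul's actual code (all names here are hypothetical):

    # When zuul reports a gated change back to Gerrit, it expects to perform
    # the submit itself. If someone has already force-merged the change, the
    # submit fails, and zuul treats that like a gate failure: everything
    # queued behind the change is cancelled and relaunched against the new
    # branch tip.
    def report_item(queue, item):
        merged = gerrit_submit(item.change)  # fails on a force-merged change
        if item.all_jobs_succeeded and not merged:
            for behind in queue.items_behind(item):
                behind.cancel_jobs()
                behind.restart_jobs()
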
[21:02:52] 10Continuous-Integration-Infrastructure: Zuul restarts all builds of gate-and-submit after the first set of the queue is merged - https://phabricator.wikimedia.org/T132506#2200989 (10Luke081515) Happened again: https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/2706/console [21:03:01] 2016-04-12 21:00:06,282 INFO zuul.DependentPipelineManager: Reported change status: all-succeeded: True, merged: False [21:03:01] 2016-04-12 21:00:06,282 INFO zuul.DependentPipelineManager: Resetting builds for change because the item ahead, in gate-and-submit>, failed to merge [21:03:06] there you go [21:03:20] but the changes were merged [21:03:27] someone force merged, so zuul couldn't merge, so it took that as a failure, and reset the queue [21:03:44] https://gerrit.wikimedia.org/r/#/c/283006/ was force merged [21:04:02] can we avoid this? This is annoying [21:04:20] Yes, tell people not to force merge. [21:04:20] so that zuul aborts gate-and-submit if it's force merged [21:04:26] or like that [21:08:30] legoktm: Do you know if there is a possibility to let zuul abort the builds for a change, if it gets force merged? [21:08:45] what? [21:09:25] if someone is using force merge, zuul would not try to merge. this is an alternative way too [21:09:33] is that possible? [21:10:18] zuul is not designed to work like that [21:10:37] it expects to be in sole control of merging things into repositories that it manages [21:10:55] which is why we've slowly been removing submit permissions in repositories as it makes sense [21:10:58] one more for my list of zuul gripes [21:11:20] there's no good reason for people to bypass CI, except when CI itself is broken. [21:11:46] The main problem IMO is that we haven't done a good job educating people of the impact of force-merging [21:12:18] I certainly wasn't aware of this [21:13:08] legoktm: Are there other side effects than skipping gate-and-submit tests? [21:13:39] I don't understand what you're asking [21:17:56] legoktm: sry. I'm interested if there are other disadvantages too if someone uses force merge [21:19:11] bypasses CI and breaks zuul, mostly [21:19:21] [21:20:10] hm, ok [21:20:24] gate-and-submit is ok now, the oldest patch is only 8 minutes in the queue now [21:24:55] (03PS4) 10Legoktm: Run mw-apply-settings for parsoidsvc-php-parsertests [integration/config] - 10https://gerrit.wikimedia.org/r/274863 (owner: 10Arlolra) [21:28:47] (03CR) 10Legoktm: [C: 032] "Deployed" [integration/config] - 10https://gerrit.wikimedia.org/r/274863 (owner: 10Arlolra) [21:30:08] (03Merged) 10jenkins-bot: Run mw-apply-settings for parsoidsvc-php-parsertests [integration/config] - 10https://gerrit.wikimedia.org/r/274863 (owner: 10Arlolra) [21:31:20] (03CR) 10Legoktm: "Passed: https://integration.wikimedia.org/ci/job/parsoidsvc-php-parsertests/7150/console" [integration/config] - 10https://gerrit.wikimedia.org/r/274863 (owner: 10Arlolra) [21:41:18] PROBLEM - Puppet run on integration-slave-trusty-1024 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:16:19] RECOVERY - Puppet run on integration-slave-trusty-1024 is OK: OK: Less than 1.00% above the threshold [0.0] [22:20:41] Project beta-update-databases-eqiad build #7810: 04FAILURE in 41 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/7810/ [22:24:38] PROBLEM - Host cache-rsync is DOWN: CRITICAL - Host Unreachable (10.68.23.165) [22:36:56] Yippee, build fixed!
[22:36:56] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #1017: 09FIXED in 25 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/1017/ [23:20:31] Yippee, build fixed! [23:20:32] Project beta-update-databases-eqiad build #7811: 09FIXED in 31 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/7811/