[00:07:11] (03PS1) 10Dduvall: Refactored EAL configuration overrides [selenium] - 10https://gerrit.wikimedia.org/r/180681 [00:22:36] Project beta-update-databases-eqiad build #6245: FAILURE in 2 min 35 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/6245/ [00:24:30] (03PS1) 10Dduvall: Fallback to base defined mediawiki_password [selenium] - 10https://gerrit.wikimedia.org/r/180685 [00:27:33] (03Abandoned) 10Dduvall: Fallback to base defined mediawiki_password [selenium] - 10https://gerrit.wikimedia.org/r/180685 (owner: 10Dduvall) [00:28:05] (03Abandoned) 10Dduvall: Refactored EAL configuration overrides [selenium] - 10https://gerrit.wikimedia.org/r/180681 (owner: 10Dduvall) [00:28:15] Yippee, build fixed! [00:28:16] Project beta-scap-eqiad build #34359: FIXED in 2 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34359/ [01:21:23] Yippee, build fixed! [01:21:24] Project beta-update-databases-eqiad build #6246: FIXED in 1 min 23 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/6246/ [02:14:56] Project beta-scap-eqiad build #34369: FAILURE in 59 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34369/ [02:35:55] Yippee, build fixed! [02:35:55] Project beta-scap-eqiad build #34371: FIXED in 1 min 50 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34371/ [03:42:46] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #196: FAILURE in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/196/ [04:11:15] PROBLEM - Puppet staleness on deployment-sca-cache01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [04:41:29] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #422: FAILURE in 1 hr 4 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/422/ [04:56:45] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #251: SUCCESS in 15 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/251/ [05:04:16] Project browsertests-VisualEditor-test2.wikipedia.org-linux-chrome-sauce build #373: STILL FAILING in 49 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-chrome-sauce/373/ [05:11:54] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #181: SUCCESS in 56 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/181/ [05:20:29] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce build #196: STILL FAILING in 1 hr 4 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce/196/ [05:55:24] Project beta-scap-eqiad build #34392: FAILURE in 1 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34392/ [06:05:32] Yippee, build fixed! [06:05:33] Project beta-scap-eqiad build #34393: FIXED in 1 min 34 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34393/ [06:06:24] (03PS2) 10Krinkle: Job template to run composer scripts [integration/config] - 10https://gerrit.wikimedia.org/r/174410 (owner: 10Hashar) [06:22:51] Yippee, build fixed! 
[06:22:51] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #475: FIXED in 1 hr 18 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/475/ [06:35:29] Project beta-scap-eqiad build #34396: FAILURE in 1 min 29 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34396/ [06:49:10] (03PS3) 10Krinkle: Add job template for running composer scripts [integration/config] - 10https://gerrit.wikimedia.org/r/174410 (owner: 10Hashar) [06:49:12] (03PS1) 10Krinkle: Add test to tolerate -composer instead of php-composer-validate [integration/config] - 10https://gerrit.wikimedia.org/r/180726 [06:50:10] (03CR) 10jenkins-bot: [V: 04-1] Add test to tolerate -composer instead of php-composer-validate [integration/config] - 10https://gerrit.wikimedia.org/r/180726 (owner: 10Krinkle) [06:50:32] (03PS3) 10Krinkle: Replace cdb-phpunit with cdb-composer [integration/config] - 10https://gerrit.wikimedia.org/r/174411 (owner: 10Hashar) [06:51:04] (03PS2) 10Krinkle: Add test to tolerate -composer instead of php-composer-validate [integration/config] - 10https://gerrit.wikimedia.org/r/180726 [06:52:28] (03CR) 10jenkins-bot: [V: 04-1] Add test to tolerate -composer instead of php-composer-validate [integration/config] - 10https://gerrit.wikimedia.org/r/180726 (owner: 10Krinkle) [06:53:29] (03PS3) 10Krinkle: Add test to tolerate -composer instead of php-composer-validate [integration/config] - 10https://gerrit.wikimedia.org/r/180726 [06:54:22] (03CR) 10jenkins-bot: [V: 04-1] Add test to tolerate -composer instead of php-composer-validate [integration/config] - 10https://gerrit.wikimedia.org/r/180726 (owner: 10Krinkle) [06:56:54] Yippee, build fixed! [06:56:54] Project beta-scap-eqiad build #34398: FIXED in 2 min 31 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34398/ [07:05:38] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #340: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/340/ [07:32:49] 3OOjs-UI, Continuous-Integration: PHP docs should be auto-generated - https://phabricator.wikimedia.org/T74454#750834 (10Krinkle) When this happens, please maintain a redirect from https://doc.wikimedia.org/oojs-ui/master to https://doc.wikimedia.org/oojs-ui/master/php. [08:33:38] 3Phabricator, Continuous-Integration: Phabricator project display names use awkward capitalization and hyphenation - https://phabricator.wikimedia.org/T911#931834 (10Nemo_bis) > @Nemo_bis: See T75892 instead. Why? I don't want scattered discussions, I want consistency. [08:35:33] Project beta-scap-eqiad build #34408: FAILURE in 1 min 5 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34408/ [08:38:36] 3Beta-Cluster: Puppet keeps restarting jobrunner service - https://phabricator.wikimedia.org/T76999#931843 (10Joe) This doesn't happen in prod FYI [08:56:04] Yippee, build fixed! [08:56:04] Project beta-scap-eqiad build #34411: FIXED in 2 min 1 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34411/ [09:13:18] !log enabled MediaWiki core 'structure' PHPUnit tests for all extensions. Will require folks to fix their incorrect AutoLoader and ResourceLoader entries. 
{{gerrit|180496}} {{bug|T78798}} [09:13:21] Logged the message, Master [09:24:01] (03PS1) 10Hashar: MobileFrontend depends on VisualEditor [integration/config] - 10https://gerrit.wikimedia.org/r/180740 [09:31:00] Yippee, build fixed! [09:31:01] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #351: FIXED in 35 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/351/ [09:34:21] Project beta-scap-eqiad build #34412: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34412/ [09:39:55] (03PS2) 10Hashar: MobileFrontend depends on Echo,VisualEditor [integration/config] - 10https://gerrit.wikimedia.org/r/180740 [09:43:31] 3Continuous-Integration: [OPS] hhvm 3.3.0-20140925+wmf3 has some annoying build dependency - https://phabricator.wikimedia.org/T73413#932079 (10Joe) [09:48:55] Yippee, build fixed! [09:48:56] Project beta-scap-eqiad build #34414: FIXED in 4 min 54 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34414/ [10:38:02] 3Continuous-Integration: AbuseFilter requires the AntiSpoof - https://phabricator.wikimedia.org/T84859#932161 (10hashar) 3NEW [10:39:02] 3Continuous-Integration: AbuseFilter requires the AntiSpoof - https://phabricator.wikimedia.org/T84859#932161 (10hashar) ``` There were 2 skipped tests: 1) AbuseFilterParserTest::testParser with data set #4 ('ccnorm', 'ccnorm("aanyone") = "AANY0NE"', true) Parser test ccnorm requires the AntiSpoof extension 2)... [10:56:29] (03CR) 10Hashar: [C: 032] "Jobs passing properly now. The VisualEditor submodule fetch is really a hack, will need to be done in zuul-cloner one day" [integration/config] - 10https://gerrit.wikimedia.org/r/180740 (owner: 10Hashar) [11:01:16] (03Merged) 10jenkins-bot: MobileFrontend depends on Echo,VisualEditor [integration/config] - 10https://gerrit.wikimedia.org/r/180740 (owner: 10Hashar) [11:08:38] (03PS3) 10Hashar: (WIP) gating extensions together (WIP) [integration/config] - 10https://gerrit.wikimedia.org/r/180494 [11:09:01] (03CR) 10Hashar: "rebased and added Echo" [integration/config] - 10https://gerrit.wikimedia.org/r/180494 (owner: 10Hashar) [11:55:24] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [12:09:14] RECOVERY - Puppet failure on deployment-cache-mobile03 is OK: OK: Less than 1.00% above the threshold [0.0] [12:14:24] RECOVERY - Puppet failure on deployment-pdf01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:19:49] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [12:20:27] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:28:02] hashar: whom to page about issues of Parsoid on Beta? [12:31:38] PROBLEM - Puppet failure on deployment-mx is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:35:32] kart_, I was talking to YuviPanda in -labs and marcoil in #mediawiki-parsoid [12:35:52] and subbu [12:38:41] Krenair, I was trying to figure out why the git checkout was left in a dirty state .. but i am not sure. so, just reset and did an update .. will restart in a bit. [12:39:11] when you go to VE, parsoid's varnish cache gives a 503 error. [12:43:14] subbu: the git working dir ends up corrupted from time to time. must be some io failure [12:43:19] so, i think there are two different issues .. 
one is that automatic deploy from gerrit has been failing for a week or so. i am not sure what is going on there yet. [12:43:20] kart_: filing a task is sometimes effective :] [12:43:39] and second thing is andrew merged an old puppet patch [12:44:02] https://gerrit.wikimedia.org/r/#/c/166610/ [12:45:24] hashar, yes, sudo git reset --hard origin/master; sudo git checkout master isn't doing the trick there. [12:45:41] so, probably i/o failure then. [12:45:53] subbu: aren't you supposed to sleep ? [12:46:12] looking at the jenkins job [12:47:10] https://integration.wikimedia.org/ci/view/Beta/ [12:47:12] it's red! [12:47:51] https://integration.wikimedia.org/ci/view/Beta/job/beta-parsoid-update-eqiad/611/console [12:47:59] there is a local change in tests/mocha/lintertest.js for some reason [12:48:07] stupid git [12:48:16] hashar, there isn't any change. [12:48:20] hashar, i woke up very early :) [12:48:33] but, feeling very sleepy again now. [12:49:08] the deploy is a two step process [12:49:14] the repos are checked out in the job workspace [12:49:31] then they are rsynced to the final destination and parsoid reloaded [12:50:01] ya, the first step seems to be failing. [12:50:09] the repo in /mnt/jenkins-workspace/workspace/beta-parsoid-update-eqiad/deploy [12:50:09] [12:50:10] is not clean [12:50:18] -Subproject commit d16dd2db6b3ca56e73439e169d52258214f0aeb2 [12:50:18] +Subproject commit cd7b5b0b165e260c7e413a4c270807a6f05a3786-dirty [12:50:57] hey the files belong to root! [12:51:37] yes, that is what i found when i looked. [12:52:04] !log deleting the workspace for the beta-parsoid-update-eqiad jenkins job on deployment-parsoid04 . Some files belong to root, which prevents the job from proceeding [12:52:09] Logged the message, Master [12:52:18] let's retrigger the last merged change [12:53:46] !log reenqueuing last merged change of Parsoid in Zuul postmerge pipeline in order to trigger the beta-parsoid-update-eqiad job properly. zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/services/parsoid --change 180671,2 [12:53:48] Logged the message, Master [12:53:48] Yippee, build fixed! [12:53:49] Project beta-parsoid-update-eqiad build #612: FIXED in 32 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/612/ [12:53:56] subbu: you are our hero :] [12:53:58] kart_: fixed [12:54:05] apparently someone used root to update the repo [12:54:15] hashar, ?? i am? [12:54:49] it has been failing since https://gerrit.wikimedia.org/r/#/c/178871/ .. almost a week now ... https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/590/console [12:55:39] so, not sure who used root when .. but it is not me. but yes, i used sudo just now to fix it :) but, that was not the right fix clearly. [12:55:44] Project beta-scap-eqiad build #34433: FAILURE in 1 min 42 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34433/ [13:02:20] subbu: that might have been me :-/ [13:02:31] I was on the machine a few minutes before that failure [13:08:21] ok. glad beta parsoid is back up now. [13:10:11] subbu, it is? 
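A hedged sketch of a fuller cleanup for the dirty-workspace failure diagnosed above: `git reset --hard` leaves untracked files and dirty submodules alone, and root-owned files defeat it entirely, which matches what hashar found at 12:50. The workspace path is from the log; the job user name (jenkins-deploy) is an assumption.

```
cd /mnt/jenkins-workspace/workspace/beta-parsoid-update-eqiad/deploy

# List files not owned by the user the job runs as (jenkins-deploy is an
# assumption; substitute whatever user Jenkins uses on this slave)
find . ! -user jenkins-deploy -ls

# Reset the superproject and every submodule, then drop untracked files;
# sudo is needed here precisely because some files ended up owned by root
sudo git fetch origin
sudo git reset --hard origin/master
sudo git submodule update --init --recursive
sudo git submodule foreach --recursive 'git reset --hard && git clean -xdff'
sudo git clean -xdff

# Hand ownership back to the job user so the next run does not hit this again
sudo chown -R jenkins-deploy:jenkins-deploy .
```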
[13:11:08] Krenair, hashar oh, i assumed it was because of that above: " Project beta-parsoid-update-eqiad build #612: FIXED in 32 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/612/" [13:11:22] the error is still there for me [13:11:47] well the code should be up to date [13:11:48] Error 503 Service Unavailable from Varnish cache server [13:11:53] and hopefully parsoid is restarted [13:11:53] hashar, i think this is because of the puppet patch getting merged .. [13:12:15] iirc, you had manually set up a localsettings.js file while the puppet patch was in review. [13:12:23] maybe the parsoidcache varnish is borked? [13:12:48] aren't there logs that would tell us this? [13:14:21] hashar, 12:53:48 + ln -s /srv/deployment/parsoid/localsettings.js /srv/deployment/parsoid/parsoid/api/localsettings.js [13:14:33] that no longer works because i don't see the files there. [13:14:42] that is from: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/612/console [13:15:01] hmm [13:15:56] it should be /srv/deployment/parsoid/deploy/conf/wmf/betalabs.localsettings.js [13:16:23] the localsettings.js might be provided via puppet [13:16:34] https://gerrit.wikimedia.org/r/#/c/166610/5/manifests/role/parsoid.pp [13:17:21] hmm .. ps shows the right config file now based on the new puppet settings [13:17:23] parsoid 12034 0.1 1.0 754332 42668 ? Sl 12:53 0:01 /usr/bin/nodejs /srv/deployment/parsoid/deploy/src/api/server.js -c /srv/deployment/parsoid/deploy/conf/wmf/betalabs.localsettings.js [13:18:26] yeah the ln -s is apparently of no use nowadays [13:18:37] the betalabs.localsettings.js must be set in the upstart configuration [13:19:13] !log rebased puppetmaster repo [13:19:13] and /data/project/parsoid/parsoid.log shows the workers are up .. [13:19:15] Logged the message, Master [13:19:16] {"name":"parsoid","hostname":"deployment-parsoid04","pid":12034,"level":30,"logType":"info","process":{"name":"worker","pid":12034},"msg":"ready on :8000","longMsg":"ready on :8000","time":"2014-12-18T12:54:04.768Z","v":0} [13:21:45] (03PS1) 10Hashar: Beta parsoid conf is now in deploy repo [integration/config] - 10https://gerrit.wikimedia.org/r/180783 [13:22:12] subbu: ^^^^ bye bye ln -s [13:22:33] ssastry@deployment-parsoid04:~$ curl http://localhost:8000 .. also returns the right result [13:23:45] !log updated labs/private on puppet master to fix a puppet dependency cycle with sudo-ldap [13:23:46] Logged the message, Master [13:23:57] subbu: sounds great [13:24:33] !log apt-get upgrade on parsoidcache02 and parsoid04 [13:24:35] Logged the message, Master [13:25:24] what about parsoid05? [13:25:31] what it is ? [13:25:34] is it [13:25:35] whatever [13:25:54] subbu: Thank you :) [13:25:55] does that need an upgrade as well? [13:26:05] Krenair: I have no idea what that instance is for :/ [13:26:28] hashar, but, http://en.wikipedia.beta.wmflabs.org/w/index.php?title=0.921072816063511&veaction=edit still returns a 503 [13:26:45] that is what Krenair is reporting as well .. [13:26:54] maybe someone created parsoid05 [13:26:58] and pointed the puppet conf to it [13:27:10] well I hope someone created it rather than it just appearing one day :) [13:27:21] :) [13:27:21] CAUSE IT HAS PARSOID RUNNING ON IT!!!!!!!!!!!!!!!! [13:28:00] * hashar digs [13:28:03] subbu: affirmative. still happens. 
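The varnish questions above ("maybe the parsoidcache varnish is borked?", "aren't there logs that would tell us this?") can be answered from the cache host itself. A sketch, assuming the Varnish 3.x CLI tools these instances shipped with; host names are from the log:

```
# On the cache host, list configured backends and their health state
ssh deployment-parsoidcache02.eqiad.wmflabs
sudo varnishadm backend.list

# Watch backend-side traffic for fetch errors while reproducing the 503
sudo varnishlog -b | grep -E 'FetchError|Backend'

# On the parsoid host, bypass varnish entirely (as subbu does above with
# curl http://localhost:8000)
curl -s http://localhost:8000/_version
```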
[13:28:08] lets see where the varnish cache points to [13:28:13] (03CR) 10Subramanya Sastry: [C: 031] Beta parsoid conf is now in deploy repo [integration/config] - 10https://gerrit.wikimedia.org/r/180783 (owner: 10Hashar) [13:33:17] Krenair: there is no more varnish backend running on parsoidcache02 :-( [13:35:27] pffff [13:35:31] what a f***** mess [13:35:36] Jenkins updates parsoid04 [13:35:42] but the varnish conf points to parsoid05 [13:35:47] which I have ZERO idea how it is updated [13:39:21] (03CR) 10Hashar: [C: 032] "Jenkins job updated:" [integration/config] - 10https://gerrit.wikimedia.org/r/180783 (owner: 10Hashar) [13:41:58] ah [13:42:02] parsoid05 is using Trusty [13:43:44] (03Merged) 10jenkins-bot: Beta parsoid conf is now in deploy repo [integration/config] - 10https://gerrit.wikimedia.org/r/180783 (owner: 10Hashar) [13:45:52] found out https://gerrit.wikimedia.org/r/#/c/169622/2 [13:45:55] that does the switch [13:47:03] subbu: kart_ Krenair: Parsoid hasn't been updated since October 28th [13:47:25] does parsoid still run with node 0.8? [13:47:40] hashar: that's bad :( [13:47:53] wasn't it switched to 0.10? [13:48:06] pfff [13:48:21] I will finish the migration [13:49:40] hashar, are you moving everything to 04 or 05? [13:49:53] hashar, node 0.10 [13:50:03] yeah [13:50:11] so I am going to migrate the beta updating job to parsoid05 [13:50:15] and add parsoid05 as a jenkins slave [13:50:25] delete all the conf / repos fetched [13:50:27] and run the job [13:50:36] whoever caught that: kudos! [13:51:01] hashar: you need to submit new patch set since 169622 is merged? [13:51:21] no idea [13:51:26] some path got changed https://gerrit.wikimedia.org/r/#/c/169622/2/manifests/role/parsoid.pp [13:51:32] I will just rm -fR * [13:51:37] and let the jenkins job figure it out [13:52:03] !log making parsoid05 a Jenkins slave to replace parsoid04 [13:52:05] Logged the message, Master [13:53:34] hashar: Hey, sorry for making this mess, and thanks for cleaning it up for me :) [13:53:45] * hashar raises fist [13:54:00] RoanKattouw: 50%my fault for never having documented how the parsoid is updated on beta :D [13:54:18] subbu: would be nice to have parsoid expose its version / git sha1 date via the API [13:54:41] subbu: so we could write an icinga monitor to ensure the parsoid service uses the latest commit of the repo [13:54:50] RoanKattouw: no worry :] [13:55:40] the curious thing is why this wasn't broken till now .. and what broke it today. [13:55:45] hashar, /_version [13:55:59] http://parsoid-lb.eqiad.wikimedia.org/_version for ex. [13:56:00] !log parsoid05: disabling puppet, stopping parsoid, rm -fR /srv/deployment/parsoid ; rerunning the Jenkins beta-parsoid-update-eqiad to hopefully recreate everything properly [13:56:02] Logged the message, Master [13:56:17] hashar, remember you added it to parsoid actually! 
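A minimal sketch of the Icinga-style freshness check hashar asks for at 13:54, comparing the sha reported by the /_version endpoint subbu points to against the deployed checkout. The endpoint shape and the /srv/deployment/parsoid/deploy/src path are taken from this log; everything else is an assumption, not how any actual monitor was written.

```
#!/bin/bash
# Alert when the running parsoid sha differs from the deployed checkout.
running=$(curl -s http://localhost:8000/_version \
  | grep -o '"sha":"[0-9a-f]*"' | cut -d'"' -f4)
deployed=$(cd /srv/deployment/parsoid/deploy/src && git rev-parse HEAD)

if [ "$running" = "$deployed" ]; then
    echo "OK: parsoid is running the deployed sha $running"
    exit 0
else
    echo "CRITICAL: parsoid runs ${running:-unknown} but $deployed is checked out"
    exit 2   # Nagios/Icinga convention for CRITICAL
fi
```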
:) [13:56:26] !sal [13:56:27] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:56:41] hashar: Yeah it looks like I tried to port everything to 05 but didn't do a very good job at it [13:56:51] !log applying latest changes of Parsoid on parsoid05 via: zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/services/parsoid --change 180671,2 [13:56:53] Logged the message, Master [13:57:07] Project beta-parsoid-update-eqiad build #613: FAILURE in 13 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/613/ [13:57:11] :-( [13:57:23] mkdir: cannot create directory ‘/srv/deployment/parsoid’: Permission denied [13:58:16] and of course puppet is broken [13:58:17] bah [13:59:00] Yippee, build fixed! [13:59:01] Project beta-parsoid-update-eqiad build #614: FIXED in 3 sec: https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/614/ [13:59:32] subbu: Krenair some parsoid managed to start on parsoid05 [13:59:59] # curl http://localhost:8000/_version [13:59:59] {"name":"parsoid","version":"0.2.0-git","sha":"d16dd2db6b3ca56e73439e169d52258214f0aeb2"} [14:00:36] !log parsoid05 seems happy: curl http://localhost:8000/_version
[14:00:36] {"name":"parsoid","version":"0.2.0-git","sha":"d16dd2db6b3ca56e73439e169d52258214f0aeb2"} [14:00:38] bah [14:00:38] Logged the message, Master [14:00:47] !log parsoid05 seems happy: curl http://localhost:8000/_version: {"name":"parsoid","version":"0.2.0-git","sha":"d16dd2db6b3ca56e73439e169d52258214f0aeb2"} [14:00:50] Logged the message, Master [14:03:24] hashar, varnish still not happy though [14:03:36] yeah it lost the varnish backend process [14:04:03] it fails to start :( [14:08:34] !log restarted varnish backend on parsoidcache02 [14:08:35] Logged the message, Master [14:08:49] "Welcome to VisualEditor" ! [14:09:32] subbu: RoanKattouw kart_ Krenair : Parsoid is fixed! I have managed to land an edit http://en.wikipedia.beta.wmflabs.org/w/index.php?title=0.921072816063511&diff=171873&oldid=60217 [14:09:36] will poke qa list about it [14:09:47] hashar, it works! thanks. [14:10:05] hashar, thanks [14:16:30] damn [14:16:38] it is only 3pm I can't even grab a beer to celebrate [14:17:05] !log deleting instance deployment-parsoid04 and removing it from Jenkins [14:17:07] Logged the message, Master [14:17:51] \O/ [14:19:34] (03PS4) 10Hashar: Limit browser tests history to 31 days [integration/config] - 10https://gerrit.wikimedia.org/r/176649 [14:20:29] hashar: \0/ [14:20:40] works fine! [14:20:48] gotta relaunch some browser tests :D [14:21:11] hashar: I can open some Goan wine if that works :) [14:22:12] +1 [14:22:43] (03CR) 10Hashar: [C: 032] "I forgot about this patch apparently. For your review everyone, this is now being deployed and Jenkins will drop history after 31 days." [integration/config] - 10https://gerrit.wikimedia.org/r/176649 (owner: 10Hashar) [14:22:51] PROBLEM - Host deployment-parsoid04 is DOWN: CRITICAL - Host Unreachable (10.68.16.17) [14:23:35] ah [14:23:54] YuviPanda: deployment-parsoid04 can be removed from Shinken. It has been replaced by parsoid05 :-] [14:25:09] hashar: ah, then just delete the instance? it should go away in shinken in about 30m [14:25:19] YuviPanda: excellent! [14:25:35] YuviPanda: I deleted it already, hence shinken complaint [14:25:46] hashar: yeah, will disappear in about 30m... [14:25:49] when shinkengen runs again [14:27:10] (03Merged) 10jenkins-bot: Limit browser tests history to 31 days [integration/config] - 10https://gerrit.wikimedia.org/r/176649 (owner: 10Hashar) [14:27:36] hashar: Thanks for fixing this all! [14:27:46] I'm glad we have Parsoid running on trusty now [14:27:53] RoanKattouw: we will really want to use trebuchet instead of jenkins+shell scripts [14:27:55] Though I do wish that labs VMs would stop randomly locking up like this [14:27:58] yeah that is nice [14:28:12] Every 3-6 months it just gets stuck beyond repair and I have to rebuild a new instance from scratch [14:28:13] locking up? what do you mean? 
[14:28:24] Like, not responding to anything, no ssh, reboots don't work [14:28:33] Maybe this is an OpenStack problem or something [14:28:36] yeah that used to happen fairly frequently [14:28:42] it is much more stable since we migrated to eqiad [14:28:45] But occasionally instances just kill themselves [14:28:54] yeah if the underlying server dies [14:28:55] Yeah it has only happened once in the past year I think [14:29:00] Aaah OK [14:29:00] we lose the instance on it [14:29:14] that happened a few weeks ago and we lost a couple instances [14:29:28] but assuming everything is in puppet, it is all about creating new instances, applying the class [14:29:35] then maybe do a few IP changes in the configuration files [14:29:41] Yeah the Parsoid instance isn't very well puppetized [14:29:58] well [14:30:04] I have deleted the whole /srv/deployment/parsoid [14:30:13] and ran jenkins which repopulated everything [14:30:16] so seems fine to me :] [14:30:32] Oh OK awesome [14:30:36] I think I may have fixed it last time then [14:30:43] I would love to have Jenkins puppetized as well [14:35:22] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [14:35:30] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [14:45:23] RECOVERY - Puppet failure on deployment-pdf01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:45:32] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [15:07:53] RECOVERY - Host deployment-parsoid04 is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [15:59:54] I have no idea what's happening right now. [15:59:55] https://integration.wikimedia.org/ci/job/mwext-UploadWizard-testextension-zend/16/console [16:00:14] hashar: Is this a new job for extensions that I'm not aware of? Is it broken in some way? [16:01:18] hello!!!!!!! [16:01:19] marktraceur: [16:01:28] marktraceur: that is PHPUnit using PHP Zend [16:01:31] Right [16:01:35] but [16:01:38] earlier today at 9am UTC [16:01:48] I enabled the mediawiki core 'structure' test suite on extensions [16:01:58] which checks the resourceloader and autoloader for sanity [16:02:13] marktraceur: there the UploadWizardSimpleForm class is not registered in the extension $wgAutoloadClasses [16:02:20] should be a one line change [16:02:38] Ah. [16:02:45] hashar: In the extension, or in core? [16:02:52] Or in a magical QA repo? [16:03:31] in the extension [16:04:03] you can reproduce locally with : [16:04:05] cd tests/phpunit [16:04:11] php phpunit.php --testsuite structure [16:04:21] K [16:05:13] marktraceur: sorry for breaking it. I could not find another way to get the 600+ extensions we host to conform to the norm [16:06:14] No problem! 
[16:06:21] Simple enough patch for me, anyway [16:35:12] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [16:40:46] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:45:08] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:46:21] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:49:54] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:51:24] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:51:52] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:52:58] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:54:38] hashar: Thanks for fixing Parsoid! [16:54:50] James_F: team work!!!! :-] [16:55:00] hashar: OK, thanks for helping. :-) [16:55:06] James_F: gabriel said roan was maintaining the parsoid vm in beta cluster, is that true? [16:55:08] James_F: that also means that parsoid has been fairly stable since last october :] [16:55:19] yeah Roan did set up a new Trusty instance [16:55:28] but I / he / we forgot to pull Jenkins on it [16:55:37] greg-g: Apparently. [16:55:45] I did notice his gerrit change but that never tickled in my mind i had to update jenkins to run on the new instance :( [16:55:46] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:55:51] hashar: Yeah, the API doesn't change much from week to week. [16:56:03] meanwhile, puppet is broken on all beta instances [16:56:14] there is some cycle in puppet config I haven't managed to figure out [16:56:33] James_F: does that make sense for him to be the point person? I'm sure you can probably guess my preference of who maintains specific things in BC :) [16:57:16] greg-g: I am one commit away from making the mediawiki core tests under hhvm voting! :] [16:57:24] yay! [16:57:41] greg-g: No, it's just a holdover from when Roan was the only person with root who cared about Parsoid. [16:58:10] James_F: can you convince him to pawn that off to a more appropriate person? [16:58:29] * James_F looks slowly from greg-g to subbu|away and gwicke. ;-) [16:59:18] subbu|away: gwicke context: parsoid service maintainership in Beta Cluster, doesn't make sense for it to be Roan, you two should figure out who needs to be on point for it. Go! :) [17:01:19] (03PS1) 10Hashar: Make mediawiki-phpunit-hhvm voting!!!11!!BBQ!! [integration/config] - 10https://gerrit.wikimedia.org/r/180815 [17:02:25] greg-g: I think Roan gave back his root, so right now none of us has root or puppet +2 [17:03:12] gwicke: Beta Cluster [17:03:33] on beta cluster, one can just cherry-pick a puppet patch on the puppetmaster deployment-salt.eqiad.wmflabs ( under /var/lib/git/operations/puppet/ ) [17:04:09] wouldn't this be a qa or rel eng responsibility? [17:04:27] the beta cluster stuff is pretty custom [17:06:20] gwicke: we can't maintain it all [17:06:26] gwicke: especially all the various services [17:06:54] (03CR) 10Hashar: [C: 032] Make mediawiki-phpunit-hhvm voting!!!11!!BBQ!! 
[integration/config] - 10https://gerrit.wikimedia.org/r/180815 (owner: 10Hashar) [17:06:56] just as devs are responsible for their MW code, services devs should be responsible for their code [17:07:15] greg-g: IMHO we should work on getting rid of differences between beta and prod [17:07:29] gwicke: of course, that is also happening, but is tangential [17:07:47] (03Merged) 10jenkins-bot: Make mediawiki-phpunit-hhvm voting!!!11!!BBQ!! [integration/config] - 10https://gerrit.wikimedia.org/r/180815 (owner: 10Hashar) [17:07:49] bbiab, on a call [17:07:49] well, a lot of the extra beta work is caused by those differences [17:08:00] (doesn't matter, help fix it :) ) [17:09:02] I'm happy to help with a staged deployment system to the best of my abilities [17:10:45] but realistically I don't have the bandwidth to maintain all kinds of services in a separate qa cluster [17:11:49] well I am off [17:11:59] might show up a bit tomorrow morning / noon [17:12:18] * hashar waves [17:12:39] bye hashar ! [17:13:54] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #76: FAILURE in 2 hr 51 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/76/ [17:14:03] g'night hashar :) [17:17:16] (03PS1) 10Hashar: Make mediawiki-core-regression-hhvm-master voting [integration/config] - 10https://gerrit.wikimedia.org/r/180820 [17:18:04] (03CR) 10Hashar: [C: 032] Make mediawiki-core-regression-hhvm-master voting [integration/config] - 10https://gerrit.wikimedia.org/r/180820 (owner: 10Hashar) [17:19:04] (03Merged) 10jenkins-bot: Make mediawiki-core-regression-hhvm-master voting [integration/config] - 10https://gerrit.wikimedia.org/r/180820 (owner: 10Hashar) [17:20:06] now I can head to my kid's year-end party at school!  \O/ [17:22:15] 3Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#933163 (10hashar) mediawiki-core now has a HHVM based job that is voting. Next step is the extensions, which are a different beast. [17:48:03] gwicke: basically, the issue/summary is: In the SOA future, the small (compared to the rest of engineering) RelEng team can't maintain/troubleshoot/etc every service and MW and puppet and swift and and and and. It's just reality. So, the way to address that is how we address the MW issue: If something breaks in MW on Beta Cluster the RelEng team doesn't fix it, the MW developers do (whoever is responsible). That same ownership must be applied to ever [17:48:16] and really, it's not just "in the future", it's now, today. [17:54:17] greg-g, wait, you're suggesting non-wikimedia mediawiki developers fix beta when it breaks? :/ [17:54:31] greg-g: it's always been like that. the history of beta labs is about people adding the stuff they need to it as it is needed. each addition makes it more valuable to everyone. [17:55:52] Krenair: if anyone commits a change that breaks the build, as it where, yes, they should fix their commit. Doesn't matter where (what repo) the commit lives. 
[17:56:00] s/where/were/ [17:56:15] Well, as long as it's actually breaking MediaWiki itself, yes [17:56:18] gah, nondeterministic find/replace [17:56:23] Krenair: that's my point [17:56:33] but not if it just requires some change to make beta itself compatible with the new version [17:56:50] well, ish [17:57:16] bryan has made changes to MW (monolog/vendor repo) that would break Beta, and he has shepherded the change and made Beta work with it [17:57:24] that's how you're supposed to do it [17:57:35] otherwise, you break our only integration cluster for everyone [17:57:41] Wikimedia has as much responsibility to keep up with MW development as other users of MediaWiki do. [17:57:47] yes.... [17:58:00] not sure what that means in this context [17:58:39] you can't imagine that the BC is some thing magical that will just fix itself when you break it (however you break it). We're people too, and the person closest to the breakage (who broke it) is the best to diagnose and fix it [17:58:58] Well when, for example, Wikihow (I just picked a random MW user) half-updates to a new version and doesn't make the necessary changes required, that's not the MW developer's fault. [17:59:06] basically, without shared ownership of Beta Cluster, we might as well shut it down [17:59:09] Wikimedia can't be special in that respect. [17:59:34] that's not an apt comparison, or I'm missing your point [18:00:08] the Beta Cluster is not "some WMF thing that's outside of the mediawiki world" [18:00:18] it IS the mediawiki integration cluster [18:00:34] used heavily by WMF, of course, but it's for *Mediawiki* development [18:00:38] just like Jenkins [18:01:08] I'm really confused on the clouding of this discussion by bringing up "wmf vs community" [18:01:11] not helpful [18:02:16] and happens to replicate a large part of the WMF cluster. sure. [18:02:54] which is, guess what, the only way to test the extensions! [18:04:48] so, summary: [18:04:55] I don't really want to argue about this. [18:05:02] You're just going to have to try writing it into the +2 policy I guess. [18:05:29] and probably deployer rights policy. [18:06:10] yes. 
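For reference, a sketch of the beta puppetmaster cherry-pick workflow hashar describes at 17:03 above. The host and path are from the log; the change/patchset numbers below are illustrative only (166610/5 is the parsoid role change discussed earlier, shown here purely as an example ref).

```
# On the beta cluster puppetmaster
ssh deployment-salt.eqiad.wmflabs
cd /var/lib/git/operations/puppet

# Fetch a pending gerrit patchset and cherry-pick it onto the local branch
# (refs/changes/<last two digits>/<change>/<patchset>)
sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/10/166610/5
sudo git cherry-pick FETCH_HEAD

# Instances pick the change up on their next puppet agent run
```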
[18:14:58] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:15:36] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:16:31] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:20:30] (03CR) 10Krinkle: MobileFrontend depends on Echo,VisualEditor (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/180740 (owner: 10Hashar) [18:23:45] (03CR) 10Krinkle: (WIP) gating extensions together (WIP) (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/180494 (owner: 10Hashar) [18:24:11] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:27:16] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:36:25] PROBLEM - Puppet failure on deployment-stream is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:40:07] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:40:15] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:42:49] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:44:52] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #144: FAILURE in 42 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/144/ [18:44:57] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:45:33] (03PS4) 10Krinkle: Add test to tolerate -composer instead of php-composer-validate [integration/config] - 10https://gerrit.wikimedia.org/r/180726 [18:45:40] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:45:42] 3OOjs-UI, Continuous-Integration: PHP docs should be auto-generated - https://phabricator.wikimedia.org/T74454#933517 (10Jdforrester-WMF) >>! In T74454#931711, @Krinkle wrote: > When this happens, please maintain a redirect from https://doc.wikimedia.org/oojs-ui/master to https://doc.wikimedia.org/oojs-ui/master... [18:46:15] 3OOjs-UI, Continuous-Integration: PHP docs should be auto-generated - https://phabricator.wikimedia.org/T74454#933518 (10Krinkle) >>! In T74454#933517, @Jdforrester-WMF wrote: >>>! In T74454#931711, @Krinkle wrote: >> When this happens, please maintain a redirect from https://doc.wikimedia.org/oojs-ui/master to... 
[18:46:34] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [18:47:36] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:49:13] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:49:41] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:50:39] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:53:08] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:54:16] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:56:59] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:57:11] RECOVERY - Puppet failure on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:58:17] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:04:59] Yippee, build fixed! [19:05:00] Project browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce build #233: FIXED in 25 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce/233/ [19:05:01] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:05:06] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:05:18] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:08:05] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [19:09:41] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:09:49] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0] [19:10:27] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [19:12:35] RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:15:28] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:17:06] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [19:18:35] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:20:42] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:21:16] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:21:32] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:21:36] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:22:33] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 55.56% of data 
above the critical threshold [0.0] [19:26:11] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [19:26:27] RECOVERY - Puppet failure on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [19:27:47] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:09] any of those recoveries you, marxarelli ? [19:28:15] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:29:07] !log seeing "Could not evaluate: getaddrinfo: Temporary failure in name resolution" in the deployment-* puppet logs [19:29:15] Logged the message, Master [19:29:26] !log restarted puppetmaster on deployment-salt [19:29:28] Logged the message, Master [19:29:34] greg-g: uh, maybe [19:29:35] nice [19:29:36] we'll see [19:29:41] :) [19:30:08] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:16] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:31:40] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:34:39] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [19:39:12] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [19:42:01] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [19:43:51] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:46:31] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [19:46:37] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:50:02] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [19:50:26] RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:53:05] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0] [19:53:13] RECOVERY - Puppet failure on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:54:40] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [19:54:51] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:56:18] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [19:56:42] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:57:06] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:58:10] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [19:58:18] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [20:00:05] 3Release-Engineering, Monitoring, Wikimedia-Logstash: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090#933705 (10Gage) [20:00:15] greg-g: coren is looking at our DNS situation again now. 
[20:00:22] (which is what most of these were) [20:00:26] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:00:39] (https://phabricator.wikimedia.org/T72076 is appropriate ticket) [20:02:33] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [20:02:57] YuviPanda: great [20:03:35] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [20:04:04] Stupid stuck jobs are stupid -- https://integration.wikimedia.org/ci/job/mwext-DonationInterface-npm/579/console [20:05:07] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [20:06:15] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:08:46] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [20:10:39] 3Continuous-Integration: Install and use load based balancer plugin - https://phabricator.wikimedia.org/T84911#933736 (10bd808) 3NEW [20:16:22] * AndyRussG jumps out from behind a laptop and says "boo!" [20:16:48] Hi, can anyone help me target http://en.m.wikipedia.beta.wmflabs.org/ with a CentralNotice campaign? [20:17:44] AndyRussG: If you have beta cluster specific issues, sure, not sure anyone here has any Central Notice expertise. [20:18:09] AndyRussG: what specifically are you having problems with? I don't know anything about CentralNotice but I can point to beta things in git and wikitech [20:18:25] greg-g: thanks! It's for running CentralNotice cross-browser tests that use beta.wmflabs.org [20:18:28] bd808: ^ [20:18:46] and ... [20:19:07] as u can see here... https://integration.wikimedia.org/ci/view/BrowserTests/view/CentralNotice/ [20:19:10] https://integration.wikimedia.org/ci/view/BrowserTests/view/CentralNotice/ [20:19:11] gah [20:19:13] :) [20:19:20] the tests are working except for the one that runs on mobile [20:19:29] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #352: FAILURE in 1 hr 27 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/352/ [20:19:59] Basically we're running a campaign that sends useless little banners to all projects on beta.wmflabs.org [20:20:03] This is the config: [20:20:03] http://meta.wikimedia.beta.wmflabs.org/w/index.php?title=Special:CentralNotice&method=listNoticeDetail&notice=CN+browser+tests [20:20:33] Works good on the desktop version: [20:20:33] http://en.wikipedia.beta.wmflabs.org/wiki/0.49399334335813494_Moved [20:20:42] (the banner is the little "one" at the top) [20:21:02] http://saucelabs.com/jobs/bae045c80222454ba67936f8a7e2d254 [20:21:14] using: "id" [20:21:15] value: "centralnotice_testbanner_name" [20:21:19] can't be found [20:22:04] greg-g: exactly, the banner doesn't show up on the mobile site: http://en.m.wikipedia.beta.wmflabs.org/wiki/0.018848229735723754#/random [20:22:31] However, according to the CentralNotice campaign's configuration, it should be targeting it (again: http://meta.wikimedia.beta.wmflabs.org/w/index.php?title=Special:CentralNotice&method=listNoticeDetail&notice=CN+browser+tests) [20:22:32] 3Continuous-Integration, MediaWiki-Unit-tests: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#933790 (10Jdforrester-WMF) Can we mark as closed? 
Seems to be done to me… [20:23:25] greg-g: bd808: ooops unexpected unavoidable family work interruption!!! brb, sorry, thanks!!! (I'll still get backscroll, tho) [20:38:00] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #252: FAILURE in 10 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/252/ [20:38:39] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#933855 (10Krinkle) 3NEW [20:42:37] 3Continuous-Integration, MediaWiki-Unit-tests: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#933874 (10Krinkle) 5Open>3Resolved [20:42:47] 3Continuous-Integration, MediaWiki-Unit-tests: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#853473 (10Krinkle) p:5Triage>3Normal [20:47:08] 3Continuous-Integration, MediaWiki-Unit-tests: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#933895 (10hashar) Well it is only going to pass on master branch, the REL branches will need fixes to be backported as well :-( [20:50:56] greg-g: bd808: back, sorry and thanks again for any insight on this... [21:02:13] * greg-g is going for a walk to attempt to rid himself of his latest headache [21:02:21] bbiab [21:02:33] 3Continuous-Integration, MediaWiki-Unit-tests: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#933917 (10Jdforrester-WMF) >>! In T78798#933895, @hashar wrote: > Well it is only going to pass on master branch, the REL branches will need fixes to be backported... [21:05:27] AndyRussG: looks like a legit bug caught by the browser test, maybe related to mobile/cn interaction. the next step i would take is to try and reproduce it locally (using a mw-vagrant environment, etc.) [21:06:30] AndyRussG: you can reproduce it manually against beta by navigating to http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page?random=0.25 and evaluating $('#centralnotice_testbanner_name') in the js console [21:07:01] (i.e. the result is an empty array, there's no banner element) [21:13:25] marxarelli: thanks!! (sorry just finished standup) ... mm I'm pretty sure CentralNotice is working fine on the mobile site [21:17:35] AndyRussG: perhaps it's just the ?random=0.25 query string that's not working with mobile (implemented solely for the tests afaik). you might have to talk with awight about that [21:18:28] AndyRussG: have you asked about it in #wikimedia-mobile? they use beta labs a lot [21:19:33] chrismcmahon: marxarelli: from the JavaScript console I can see that banner choices are either not being sent from the server or they're not getting into the browser on the mobile view on beta.wmflabs.org. I think it may be that CN is not correctly targeting the mobile site [21:19:48] 3MediaWiki-Unit-tests, Continuous-Integration: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#933950 (10hashar) As long as we maintain them probably. We will soon have a Jenkins job common to core + some extensions, and that job might end up failing when we r... [21:20:00] chrismcmahon: yeah may ask there too... 
[21:21:11] chrismcmahon: marxarelli: on my local install, I activate the mobile view with ?useformat=mobile, and the banners do appear correctly [21:22:35] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#933954 (10hashar) Yeah I messed up with it on monday and haven't fixed it. The slaves install out of the labs branch which is obsolete. The master branch has a wrong commit. Additionally Zuul req... [21:23:45] AndyRussG: you'll want to make sure all your dependencies are current: `vagrant git-update` if you're using mw-vagrant [21:24:02] AndyRussG: stepping out for a bite but i'll try to reproduce it when i get back [21:24:55] marxarelli: I'm 99% sure it's a beta.wmflabs.org configuration issue [21:25:54] marxarelli: thanks also BWT [21:25:54] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #476: FAILURE in 47 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/476/ [21:26:02] AndyRussG: probably so. I just have no idea where that config would be. [21:27:01] marxarelli: you can reproduce it just by going to http://en.m.wikipedia.beta.wmflabs.org/wiki/0.01884822973572375 . You should see a little banner, either "one", "two" or "three", at the top, just like you do on http://en.wikipedia.beta.wmflabs.org/wiki/0.01884822973572375 [21:27:34] s/BWT/BTW/ [21:28:11] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#933990 (10Krinkle) @hashar Which change(s) do we revert to restore it to the way that worked? It's obviously not doing anything useful by being broken so I assume it's uncontroversial to re... [21:57:54] !log integration-slave1005 is not ready. It's incompletely set up due to https://phabricator.wikimedia.org/T84917 [21:57:55] Logged the message, Master [22:01:08] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:17:03] marxarelli|lunch: chrismcmahon: greg-g: bd808: update in case you're interested: found the issue, and in fact it is something we didn't know about, so, yay for browser tests! What happened was that http://meta.wikimedia.beta.wmflabs.org/wiki/Special:BannerLoader, the URL that the banner is loaded from, was getting redirected to http://meta.m.wikimedia.beta.wmflabs.org/wiki/Special:BannerLoader, which is not configured, and in fact that shouldn't [22:17:03] really happen. So we did catch a bug! [22:17:45] AndyRussG: good news [22:17:45] i.e., a similar redirect is also happening on production, although it works on production, it shouldn't need to redirect in that case [22:17:52] yeah! [22:17:57] good job, Beta Cluster :) [22:17:59] thanks much for your help! [22:18:04] Heh indeed :) [22:18:31] AndyRussG: glad we could help :) [22:18:39] AndyRussG: also, have a good holidays :) [22:18:52] greg-g: thanks! likewise :) [22:18:58] 3VisualEditor, Beta-Cluster: Beta Cluster: API PrefixSearch is taking a very long time to return, and returns nothing when it does - https://phabricator.wikimedia.org/T74332#934140 (10Ryasmeen) Yeah, I am not getting any slowness in any search anymore. Closing the bug for now. 
[22:19:22] 3VisualEditor, Beta-Cluster: Beta Cluster: API PrefixSearch is taking a very long time to return, and returns nothing when it does - https://phabricator.wikimedia.org/T74332#934141 (10Ryasmeen) 5Open>3Resolved [22:19:29] Mmm I don't know why meta.m.wikimedia.org is set up but meta.m.wikimedia.beta.wmflabs.org isn't, but anyway I'm glad that was the case [22:20:45] I don't think we've ever needed the meta wiki for anything [22:21:45] chrismcmahon: on production it's set up as the wiki that controls banners on all the other wikis [22:21:52] also on the beta cluster [22:21:58] yeah, it's the m. part [22:22:35] chrismcmahon: it's in use on BC for all of the CentralNotice browser tests, afaik [22:23:08] yes, we have an ongoing banner campaign set up there to run the browser tests against [22:23:38] adding meta.m is probably just a matter of setting up the apache vhost in operations/puppet [22:23:46] greg-g: yeah, I mis-spoke, never done anything with a mobile view of meta before, so it hasn't had any attention [22:23:51] * greg-g nods [22:24:00] chrismcmahon: bd808: greg-g: but... [22:24:06] bd808: wonder why the diff with prod though.... [22:24:17] the issue now is that we're getting an extra round-trip to the server to show banners on all mobile devices [22:24:33] so that's _not_ so desirable [22:24:39] because beta's apache config is a manual branch with totally different apache configs [22:25:05] bd808: well then... [22:25:30] either we need specifically meta.wikimedia.org/wiki/Special:BannerLoader _not_ to redirect regardless of the device [22:25:51] Or we need to change the banner controller to call meta.m.wikimedia.org/wiki/Special:BannerLoader instead [22:25:53] AndyRussG: ideally, whatever happens in prod should happen in beta cluster [22:26:06] whatever *should* happen in prod, that is :) [22:26:09] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [22:26:10] greg-g: and that is essentially what's happening, yes [22:26:19] Yeah now I'm talking about production [22:26:56] greg-g: hysterical raisins... we had a branch in the old apache-config repo because (dunno) and then when that was folded into operations/puppet by _joe_ we ended up adding a whole set of custom vhosts because the apache configs are not written to have the host names change. They are just files, not templates [22:27:25] right [22:28:00] They probably *could* be made templates with some hiera naming magic today [22:28:01] Mmmm is it easy to add an exception filter for the redirects to the .m. URLs? [22:28:30] "exception filter"? [22:28:32] For me the most urgent issue is fixing the production problem this uncovered, which is the extra round-trip to the server to show banners [22:28:47] That causes an added delay in the banner page bump [22:29:09] I don't know how redirection for mobile devices works exactly [22:29:21] But if we could make meta.wikimedia.org/wiki/Special:BannerLoader specifically not redirect regardless of the device [22:29:29] I think that'd be a nice option [22:31:55] Does that make sense? who should I ask about that? [22:33:29] AndyRussG: we never hear much about this sort of thing. who is it that works on Special:BannerLoader ? [22:34:19] * AndyRussG points to self and ejegg and awight, mainly [22:34:26] hahaha [22:34:29] that is, wrt the code itself [22:35:11] I assume the redirect happens at the varnish layer? [22:35:18] or is it lower than that? 
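One hedged way to answer that last question is to look at the response headers and see which layer emits the mobile redirect. The URL is from the log; the mobile user agent trick assumes the redirect is UA-driven, as it is for the other .m. domains, and the exact headers present may differ on this setup:

```
# Request Special:BannerLoader as a mobile browser and inspect the 30x
curl -sI -A 'Mozilla/5.0 (iPhone)' \
  'http://meta.wikimedia.beta.wmflabs.org/wiki/Special:BannerLoader' \
  | grep -iE '^(HTTP|Location|Server|Via|X-Cache)'

# Headers like Via/X-Cache implicate varnish; a bare Apache/MediaWiki
# response (Location set with no cache headers) points to a lower layer.
```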
[22:35:27] * chrismcmahon reasons: AndyRussG seems to me that the redirect would have to be in the Apache config, so ops might be helpful
[22:35:41] or maybe not?
[22:36:33] absolutely no idea
[22:38:09] I guess mobile or operations is where to ask :)
[22:39:42] AndyRussG: seems like a start. either that or start spelunking in the config code. asking seems easier.
[22:40:21] Yeah, I just got steamrolled by grep -ri mobile puppet/
[22:42:41] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#934189 (10hashar) I have tagged most of my deploys. Reviewing them none would work out of the box unfortunately :-( Tags from december were experiments to teach zuul-cloner how to use ref-update...
[22:51:17] (03Abandoned) 10Dduvall: Refactored EAL configuration overrides [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180640 (owner: 10Dduvall)
[22:52:55] (03PS1) 10Dduvall: Refactored EAL configuration overrides [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180985
[22:52:57] (03PS1) 10Dduvall: Fallback to base defined mediawiki_password [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180986
[22:59:09] (03CR) 10Dduvall: [C: 032] "Self-merging simple fix into experimental branch." [selenium] (env-abstraction-layer) - 10https://gerrit.wikimedia.org/r/180639 (owner: 10Dduvall)
[23:09:23] 3operations, Continuous-Integration: Acquire old production API servers for use in CI - https://phabricator.wikimedia.org/T84940#934261 (10greg) 3NEW
[23:16:48] 3Continuous-Integration: remove operations-apache-config-lint on operations/mediawiki-config - https://phabricator.wikimedia.org/T78782#934288 (10Krinkle) a:3Krinkle
[23:17:01] 3Continuous-Integration: remove operations-apache-config-lint on operations/mediawiki-config - https://phabricator.wikimedia.org/T78782#853100 (10Krinkle) p:5Triage>3Normal
[23:18:43] I'm not seeing latest Flow code on beta labs. Known issue?
[23:19:35] spagewmf: shouldn't be. want to file a bug?
[23:20:06] chrismcmahon: will do, is it project deploy? No new Flow code since Wednesday.
[23:20:44] spagewmf: project: beta-cluster I think
[23:23:04] we can move it to the actual root cause project as needed later :)
[23:25:10] (03PS1) 10Krinkle: Don't run operations-apache-config-lint on operations/mediawiki-config [integration/config] - 10https://gerrit.wikimedia.org/r/180994
[23:25:17] 3Continuous-Integration: remove operations-apache-config-lint on operations/mediawiki-config - https://phabricator.wikimedia.org/T78782#934312 (10Krinkle) It was enabled in change [[ https://gerrit.wikimedia.org/r/#/c/163813/ | 3ecc0a8d52 ]] in integration/zuul-config. I guess the idea was that, since the actual...
[23:30:37] 3Continuous-Integration: Investigate whether zuul-cloner recognize tags just like branches - https://phabricator.wikimedia.org/T76088#934328 (10Krinkle) 5Open>3declined a:3Krinkle Tags don't change. They don't receive changes and as such won't get a test pipeline triggered. When we freeze a milestone, we f...
[23:32:23] 3Phabricator, Continuous-Integration: Create tag for zuul-cloner - https://phabricator.wikimedia.org/T84945#934331 (10Krinkle) 3NEW a:3greg
[23:32:33] 3Beta-Cluster: Extension:Flow not updated to master on beta cluster for 23 hours - https://phabricator.wikimedia.org/T84946#934339 (10Spage) 3NEW
[23:32:56] chrismcmahon , greg-g ^ there you go
[23:33:47] maybe because https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ is failing ?
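
Following up on the grep that "steamrolled" AndyRussG at 22:40 above: narrowing the search to redirect directives in Apache config and template files keeps the output manageable. A sketch only; the file extensions and layout inside the puppet checkout are assumptions:

    # Search only Apache config/template files for redirect rules that
    # mention the mobile (m.) hostnames, instead of every line that
    # says "mobile" anywhere in the repo.
    grep -rniE --include='*.conf' --include='*.erb' \
      '(RewriteRule|RewriteCond|Redirect).*\bm\.' puppet/
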
[23:34:20] thanks spagewmf
[23:35:10] 3Beta-Cluster: Extension:Flow not updated to master on beta cluster for 23 hours - https://phabricator.wikimedia.org/T84946#934358 (10Spage) > I know there is some Jenkins job that updates Beta cluster  Perhaps it's https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ , which has been failing for 10 hours.
[23:36:07] whole bunch of "23:24:50 Permission denied (publickey)." in the console output
[23:36:25] fuuuuuuuuuuuuc
[23:36:35] 3Beta-Cluster: Extension:Flow not updated to master on beta cluster for 23 hours - https://phabricator.wikimedia.org/T84946#934359 (10greg) https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated :)  But, beta-scap-eqiad is failing: https://integration.wikimedia.org/ci/job/beta-sc...
[23:36:38] * bd808 will look
[23:36:57] I bet I know what causes it
[23:37:02] bd808: thanks. maybe we shouldn't bother with these pesky keys 8-)
[23:37:28] 3Mobile-Web, Continuous-Integration, MediaWiki-General-or-Unknown: MediaWiki QUnit test does not wait for all requests to complete, causing a race condition in Jenkins - https://phabricator.wikimedia.org/T78590#934361 (10Jdlrobson) This is frequently hitting us today. Cookie for whoever can fix it! :)
[23:38:20] spagewmf: stupid security
[23:38:47] 3Beta-Cluster: beta-scap-eqiad failing due to key error - https://phabricator.wikimedia.org/T84946#934367 (10greg)
[23:39:30] stupid puppet refactoring without care for damage to beta is where I'm putting my money
[23:39:42] bd808: likely
[23:42:06] can we get that shinken-wm or wmf-insecte to tell us if beta-scap-eqiad fails more than twice in a row?
[23:42:50] we should just have the jenkins job notify here on every failure rather than on state transition
[23:43:14] bd808: for certain jobs, sure
[23:43:32] yeah well that job specifically
[23:46:15] 3Beta-Cluster: Announce beta-scap-eqiad failures in -qa every time - https://phabricator.wikimedia.org/T84947#934379 (10greg) 3NEW
[23:47:07] !log Updated scap to latest HEAD version
[23:47:11] Logged the message, Master
[23:48:19] 3Continuous-Integration: Zuul cloner fails on extension jobs against a wmf branch - https://phabricator.wikimedia.org/T73133#934393 (10Krinkle) 5Open>3stalled
[23:48:50] 3Continuous-Integration: Zuul-cloner should check out submodules - https://phabricator.wikimedia.org/T84942#934405 (10Krinkle) p:5Triage>3High
[23:52:57] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: MediaWiki QUnit test does not wait for all requests to complete, causing a race condition in Jenkins - https://phabricator.wikimedia.org/T78590#934410 (10Krinkle) >>! In T78590#934361, @Jdlrobson wrote: > This is frequently hitting us today. Cook...
[23:53:34] !log Restarted salt-minion on deployment-bastion
[23:53:35] Logged the message, Master
[23:53:47] !log Restarted udp2log-mw on deployment-bastion
[23:53:49] Logged the message, Master
[23:54:30] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: QUnit job unable to collect logs, test still has active http requests writing to them - https://phabricator.wikimedia.org/T78590#934411 (10Krinkle)
[23:56:24] !log killed some ancient screen sessions on deployment-bastion
[23:56:27] Logged the message, Master
[23:57:32] !log temporarily disabled jenkins scap job
[23:57:35] Logged the message, Master
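
For the "Permission denied (publickey)" failures at 23:36 above, the usual first step is to reproduce by hand the ssh hop that scap makes. A sketch under stated assumptions: the deploy user and target host names here are illustrative, since the log doesn't show the beta cluster's actual scap/dsh configuration:

    # From the deployment host, try the same non-interactive hop scap makes.
    # BatchMode fails fast instead of prompting for a password; -v shows
    # which keys are offered and why the server rejects them.
    ssh -v -o BatchMode=yes mwdeploy@target-host.eqiad.wmflabs true
    # "Permission denied (publickey)" here usually means the target's
    # authorized_keys no longer carries the deploy key -- e.g. after a
    # puppet refactor, which matches bd808's suspicion above.
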