[00:12:23] PROBLEM - Puppet staleness on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [03:04:11] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:11:18] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:38:25] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree.value (<44.44%) [04:55:25] PROBLEM - Puppet failure on deployment-stream is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [04:58:31] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [05:02:10] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [05:04:48] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [05:05:37] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [05:07:40] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [05:08:22] PROBLEM - Puppet failure on deployment-restbase03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [05:11:05] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [05:12:14] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [05:14:55] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [05:18:37] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 
44.44% of data above the critical threshold [0.0] [05:24:04] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [05:26:04] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [05:28:20] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [05:29:46] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [05:32:06] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [05:32:40] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [05:32:54] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [05:33:24] RECOVERY - Puppet failure on deployment-restbase03 is OK: OK: Less than 1.00% above the threshold [0.0] [05:33:26] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [05:35:36] RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:36:08] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [05:37:13] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [05:37:57] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [05:39:09] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [05:39:57] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:40:47] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 44.44% of 
data above the critical threshold [0.0] [05:43:36] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:44:22] PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [05:45:24] RECOVERY - Puppet failure on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [05:48:41] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [05:52:10] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [05:53:50] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [05:57:07] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [05:58:18] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [06:01:03] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0] [06:03:09] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [06:03:25] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [06:04:10] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0] [06:05:48] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [06:09:04] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [06:13:40] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [06:14:18] RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0] [06:17:54] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: 
Less than 1.00% above the threshold [0.0] [06:18:48] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [06:22:57] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [06:23:33] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [06:28:08] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [06:38:24] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [06:39:38] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [07:04:39] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:54:04] (PS1) Adrian Lang: (Does not work) Make mwext-WikibaseJavaScriptApi-qunit voting [integration/config] - https://gerrit.wikimedia.org/r/180418 [08:54:54] (CR) jenkins-bot: [V: -1] (Does not work) Make mwext-WikibaseJavaScriptApi-qunit voting [integration/config] - https://gerrit.wikimedia.org/r/180418 (owner: Adrian Lang) [08:57:00] (PS2) Adrian Lang: (Does not work) Make mwext-WikibaseJavaScriptApi-qunit voting [integration/config] - https://gerrit.wikimedia.org/r/180418 [09:06:43] (CR) Gilles: [C: 1] Add jobs for Sentry [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [09:17:05] Continuous-Integration: common gating job for mediawiki core and extensions - https://phabricator.wikimedia.org/T60772#852789 (hashar) Not yet. We have the `mediawiki-gate` dummy jobs used to enforce MediaWiki related changes to share the same queue in the Zuul gate-and-submit pipeline. That bug is about mak...
[10:09:42] (CR) Hashar: Add jobs for Sentry (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [10:09:52] (PS2) Hashar: Add jobs for Sentry [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [10:10:28] (CR) Hashar: [C: 2] "Jobs created:" [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [10:24:10] (PS3) Hashar: Add jobs for Sentry [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [10:24:20] (CR) Hashar: [C: 2] Add jobs for Sentry [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [10:25:02] (CR) jenkins-bot: [V: -1] Add jobs for Sentry [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [10:30:57] (Merged) jenkins-bot: Add jobs for Sentry [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [10:38:02] (PS1) Hashar: Drop {name}-{ext-name}-testextension [integration/config] - https://gerrit.wikimedia.org/r/180437 [10:50:47] (CR) Hashar: [C: 2] Drop {name}-{ext-name}-testextension [integration/config] - https://gerrit.wikimedia.org/r/180437 (owner: Hashar) [10:57:49] (Merged) jenkins-bot: Drop {name}-{ext-name}-testextension [integration/config] - https://gerrit.wikimedia.org/r/180437 (owner: Hashar) [11:35:50] Project beta-scap-eqiad build #34296: FAILURE in 1 min 41 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34296/ [11:55:55] Yippee, build fixed!
[11:55:56] Project beta-scap-eqiad build #34298: FIXED in 1 min 39 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34298/ [12:43:27] Continuous-Integration: remove operations-apache-config-lint on operations/mediawiki-config - https://phabricator.wikimedia.org/T78782#853100 (Dzahn) NEW [12:44:15] Continuous-Integration: remove operations-apache-config-lint on operations/mediawiki-config - https://phabricator.wikimedia.org/T78782#853100 (Dzahn) example links: https://gerrit.wikimedia.org/r/#/c/180451/ operations-apache-config-lint FAILURE in 29s (non-voting) https://integration.wikimedia.org/ci/job/o... [13:27:44] (PS1) Hashar: Prevent hhvm on REL1_19 and REL1_22 [integration/config] - https://gerrit.wikimedia.org/r/180469 [13:34:54] (PS1) QChris: Add jobs for analytics/blog [integration/config] - https://gerrit.wikimedia.org/r/180470 [13:35:06] Project beta-scap-eqiad build #34308: FAILURE in 1 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34308/ [13:41:26] (CR) Hashar: [C: 2] Prevent hhvm on REL1_19 and REL1_22 [integration/config] - https://gerrit.wikimedia.org/r/180469 (owner: Hashar) [13:42:45] (Merged) jenkins-bot: Prevent hhvm on REL1_19 and REL1_22 [integration/config] - https://gerrit.wikimedia.org/r/180469 (owner: Hashar) [13:43:03] (PS2) Hashar: Add jobs for analytics/blog [integration/config] - https://gerrit.wikimedia.org/r/180470 (owner: QChris) [13:45:01] (CR) Hashar: [C: 2] "Jobs deployed, thanks!" [integration/config] - https://gerrit.wikimedia.org/r/180470 (owner: QChris) [13:49:51] (Merged) jenkins-bot: Add jobs for analytics/blog [integration/config] - https://gerrit.wikimedia.org/r/180470 (owner: QChris) [13:55:27] Yippee, build fixed!
[13:55:27] Project beta-scap-eqiad build #34310: FIXED in 1 min 27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34310/ [14:03:28] (PS1) Hashar: Force color for tox [integration/config] - https://gerrit.wikimedia.org/r/180474 [14:09:52] (CR) Hashar: [C: 2] Force color for tox [integration/config] - https://gerrit.wikimedia.org/r/180474 (owner: Hashar) [14:14:22] (Merged) jenkins-bot: Force color for tox [integration/config] - https://gerrit.wikimedia.org/r/180474 (owner: Hashar) [14:14:35] Continuous-Integration: remove operations-apache-config-lint on operations/mediawiki-config - https://phabricator.wikimedia.org/T78782#853270 (Dzahn) also see T72068 [14:34:34] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:53:34] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:04:32] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [15:16:11] (PS1) Hashar: Drop mediawiki-phpunit [integration/config] - https://gerrit.wikimedia.org/r/180487 [15:21:19] (CR) Hashar: [C: 2] Drop mediawiki-phpunit [integration/config] - https://gerrit.wikimedia.org/r/180487 (owner: Hashar) [15:25:42] (Merged) jenkins-bot: Drop mediawiki-phpunit [integration/config] - https://gerrit.wikimedia.org/r/180487 (owner: Hashar) [16:03:40] (PS1) Hashar: (WIP) gating extensions together (WIP) [integration/config] - https://gerrit.wikimedia.org/r/180494 [16:11:01] (CR) Hashar: [C: -2] (WIP) gating extensions together (WIP) (2 comments) [integration/config] - https://gerrit.wikimedia.org/r/180494 (owner: Hashar) [16:14:53] MediaWiki-Unit-tests, Continuous-Integration: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#853473 (hashar) NEW [16:38:13] Continuous-Integration: HHVM 
Jenkins job throw: Unable to set CoreFileSize to 8589934592: Operation not permitted (1) - https://phabricator.wikimedia.org/T78799#853512 (hashar) NEW [16:44:45] (PS2) Hashar: (WIP) gating extensions together (WIP) [integration/config] - https://gerrit.wikimedia.org/r/180494 [16:45:18] (CR) Hashar: "Added VisualEditor and added a hack to initialize the submodule." [integration/config] - https://gerrit.wikimedia.org/r/180494 (owner: Hashar) [16:54:05] (CR) Hashar: "Failures can be seen at https://integration.wikimedia.org/ci/job/mediawiki-phpunit-integration-hhvm/6/#showFailuresLink" [integration/config] - https://gerrit.wikimedia.org/r/180494 (owner: Hashar) [17:07:50] Yippee, build fixed! [17:07:51] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #75: FIXED in 2 hr 45 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/75/ [17:29:19] off [17:43:46] ryasmeen: I think I fixed all the browser test failures, I kicked off a build to see what happens: https://integration.wikimedia.org/ci/view/BrowserTests/view/-All/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/474/ [17:46:24] chrismcmahon: nice! [17:47:52] ryasmeen: the only issue that was not just a new locator was needing to clear the default text first from the Media search text box: https://gerrit.wikimedia.org/r/#/c/180524/ [17:48:01] hooray for Page Objects :-) [17:55:38] Project beta-scap-eqiad build #34335: FAILURE in 1 min 34 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34335/ [18:16:00] Yippee, build fixed!
[18:16:00] Project beta-scap-eqiad build #34337: FIXED in 1 Minute 53 Sekunden: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34337/ [18:36:17] Project browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce build #231: FAILURE in 22 Minuten: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce/231/ [18:38:03] Yippee, build fixed! [18:38:03] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #142: FIXED in 36 Minuten: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/142/ [18:40:11] Scrum-of-Scrums, Beta-Cluster: beta cluster: deployment-cache-upload02 does not seem to purge content when getting PURGE - https://phabricator.wikimedia.org/T67683#853799 (Cmcmahon) Open>Resolved [18:40:33] Scrum-of-Scrums, Beta-Cluster: beta cluster: deployment-cache-upload02 does not seem to purge content when getting PURGE - https://phabricator.wikimedia.org/T67683#698061 (Cmcmahon) spoke to Brandon on IRC, as far as anyone can tell this is now working correctly [19:14:31] Krinkle: Why has Jenkins switched into German? [19:14:47] it seems to be in random languages recently [19:14:52] Hmm. [19:23:10] James_F: Yeah, been happening for the past 12 months. [19:23:18] Sometimes it adopts one of the languages of the users. [19:25:17] James_F: Fixed by changing default language from en to en. (Yeah, makes so much sense, right) [19:25:35] Krinkle: Helpful. :-) [19:25:48] Krinkle: Can we ban users from setting their interface language to stop it breaking? [19:26:33] James_F: Annoyingly, there are several areas where messages are substituted and now stay in French, Spanish and German. [19:26:36] E.g. build logs [19:27:00] most of the build scripts are in English, but the bootstrap from Jenkins itself is actually localised [19:27:02] Even better. [19:27:10] It's also not well localised.
[19:27:17] Half the messages "in German" were still in English. [19:37:42] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:43:56] (PS1) Ejegg: Add composer check for DonationInterface/vendor [integration/config] - https://gerrit.wikimedia.org/r/180566 [19:48:58] (Abandoned) Ejegg: Add composer check for DonationInterface/vendor [integration/config] - https://gerrit.wikimedia.org/r/180566 (owner: Ejegg) [19:54:55] Project beta-scap-eqiad build #34347: FAILURE in 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34347/ [20:16:08] Yippee, build fixed! [20:16:08] Project beta-scap-eqiad build #34349: FIXED in 1 min 50 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34349/ [20:26:16] I am getting different values for the user_id field in the bounce_record table when running PHPUnit tests on my local install and later on Jenkins - https://gerrit.wikimedia.org/r/#/c/176366/ - and of course Jenkins fails [20:26:47] probably the Jenkins user table and the user table created by my PHPUnit run do not match :\ [20:26:52] any way it can be sorted out? [20:29:25] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree.value (<11.11%) [20:39:25] (PS1) Legoktm: Setup php-composer-validate for operations/mediawiki-config [integration/config] - https://gerrit.wikimedia.org/r/180591 [21:02:16] !log cancelled all browser tests, suspecting them to deadlock Jenkins somehow :( [21:02:21] Logged the message, Master [21:03:07] eh?
[21:09:26] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree.value (<11.11%) [21:29:42] (PS1) Dduvall: Removed inclusion of pry-byebug [selenium] (env-abstraction-layer) - https://gerrit.wikimedia.org/r/180639 [21:29:44] (PS1) Dduvall: Refactored EAL configuration overrides [selenium] (env-abstraction-layer) - https://gerrit.wikimedia.org/r/180640 [22:10:27] (CR) Gergő Tisza: Add jobs for Sentry (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/180309 (owner: Gergő Tisza) [22:10:55] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:14:25] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree.value (<11.11%) [22:26:35] Release-Engineering: Add shinken output for Beta Cluster to -operations channel - https://phabricator.wikimedia.org/T1334#854342 (yuvipanda) Bump?
[22:39:26] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [22:39:58] Release-Engineering: Mukunda ready to do deploy on 12/16 - https://phabricator.wikimedia.org/T76049#854388 (greg) Open>Resolved a:greg This was done (and successful) :) [22:40:06] Release-Engineering: Mukunda ready to do deploy on 12/16 - https://phabricator.wikimedia.org/T76049#854393 (greg) p:High>Triage [22:40:10] twentyafterfour: congratulations :] [22:40:56] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [22:54:01] Project beta-scap-eqiad build #34356: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34356/ [22:58:02] Continuous-Integration, MediaWiki-Unit-tests: MediaWiki core 'structure' tests are not run for extensions - https://phabricator.wikimedia.org/T78798#854427 (hashar) a:hashar Announced on wikitech https://lists.wikimedia.org/pipermail/wikitech-l/2014-December/079864.html [23:03:20] hashar: congratulations? for surviving my first deploy? [23:14:01] twentyafterfour: yup :] [23:14:47] twentyafterfour: and hopefully it gives you some good ideas to refine the deploy process [23:19:51] hashar: shinken puppet failures are driving me mad. [23:19:53] Can we do something about that? [23:20:02] It seems most of them aren't about aptitude [23:20:27] When I look in syslog, I see puppet is locked. Its previous run still has the lock in place. Then the next time is fine. [23:20:37] well it is past midnight, about to sleep [23:20:39] Presumably taking longer than the iteration duration. [23:20:53] but in short, labs had transient DNS issues for most of the week, which caused random puppet failures [23:20:56] hashar: jenkins is about to shut down from what it looks like.
[23:21:06] I assume you scheduled that [23:21:08] /var/vdb/ is reported as full on the varnish instances, because varnish actually fills the disk [23:21:17] Krinkle: yeah to remove a bunch of deadlocks [23:21:20] will restart Zuul as well [23:21:23] and then go to sleep [23:21:56] there is a nasty interaction between plugins that causes execution slots to end up locked :( [23:22:22] been trying to solve it without disturbing devs too much [23:24:09] hashar: The static file server for build logs seems easy enough. Write stdout/stderr and artefacts to a unique directory that's mounted as some-thing.wikimedia.org (or wmfusercontent.org) and cronjob to purge directories older than 30 days. Zuul config can take custom urls for build result (openstack uses it already). And remove Jenkins forever :P. Goal for next quarter? Seems relatively light weight. [23:24:18] compared to other plans, like disposable vms [23:25:27] The main thing I guess we'll need to do in addition is figure out what we use Jenkins for. Right now it's mostly execution. We're already bypassing most of the plugins (e.g. zuul-cloner instead of jenkins git). [23:25:49] And execution of the job I think still depends on Jenkins. I'm curious how that would work otherwise. Can Zuul/Gearman connect directly? [23:25:59] creation of workspace on the slaves [23:28:07] hashar: Looking at openstack, I think they still use Jenkins. They just don't link to it. The build output looks like it's generated by Jenkins (e.g. "Started by user anonymous") [23:28:47] yeah they scp iirc [23:28:52] or maybe send to swift can't remember [23:29:08] we could indeed get some entry under wmfusercontent.org with a server having a bunch of disk space [23:29:16] Looks like they use jenkins build html export, and then sync to the build log server with scp. [23:29:25] then use LOG_PATH ( ex: 63/180663/1/test/mwext-MobileFrontend-npm/741798c ) [23:29:44] hashar: wouldn't need much space. gallium seems to handle it fine as well. 
It's just text files and a few artefacts (log files). [23:29:51] yup [23:29:58] we'll have to purge manually, since jenkins wouldn't do that for us anymore. [23:30:23] hashar: I'm curious if the performance slowdown would get better if we still have Jenkins but without it keeping any builds in memory. [23:30:29] we have a bunch of old app servers we can probably reuse [23:30:33] That should speed things up a bit, but ideally we'd cut it out entirely. [23:30:37] feel free to get one allocated for that :] [23:31:44] hashar: well, we can just mount it on gallium. like doc.wikimedia.org [23:31:50] wouldn't be too difficult. [23:31:54] Same space, different directory. [23:32:18] Separate it as a module first, then move to separate server later. [23:33:15] well ideally I would prefer to get rid of gallium one day [23:33:48] hashar: well, then we'll need more people to do it for us. I don't see us having time for that. Or priority. [23:35:54] hashar: Hm.. did you research this before? Would save me some time. E.g. how to export it and from where we'd hook in to do that (e.g. post-build task inside the jenkins job? or is there a way to do it globally?) [23:36:05] if we get a spec defining what we want, I am sure ops will be happy to comply [23:36:24] i.e. something like: low CPU/mem + 1 TB disk with a 10.0.0.0 IP [23:36:35] have cilog.wmfusercontent.org point to misc varnish [23:36:36] hashar: it'll need to be puppetised, which means it'll take me a month or Infinity for someone else to write it. [23:36:46] configure misc varnish for that DNS entry to point to the new host [23:37:07] then get the 500GB disk mounted on /srv/ and ask for a lame apache virtual host [23:37:14] If I do it on gallium I can finish it in 2 days and be done with it. And then leave it to ops or the next guy to puppetise it. Terrible terrible, I know. But may be more realistic.
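The log server idea above (a unique directory per build, plus a cron job that purges anything older than 30 days) could be sketched roughly as follows. This is only an illustration, not anything that exists on gallium or in Zuul: `log_path` and `purge_old_builds` are hypothetical names, and the layout (last two digits of the change number, then change, patchset, pipeline, job and a short build id) is inferred solely from the `63/180663/1/test/mwext-MobileFrontend-npm/741798c` example.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def log_path(change: int, patchset: int, pipeline: str, job: str, build_uuid: str) -> str:
    """Return a relative directory shaped like the LOG_PATH example in the chat."""
    # Assumption: the first component is the last two digits of the change
    # number, as the example 63/180663/... suggests.
    shard = f"{change % 100:02d}"
    return f"{shard}/{change}/{patchset}/{pipeline}/{job}/{build_uuid[:7]}"


def purge_old_builds(root: Path, max_age_days: int = 30,
                     now: Optional[float] = None) -> List[Path]:
    """Delete leaf directories under root whose mtime is older than max_age_days."""
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed = []
    # Collect the directory list up front so deletions do not disturb the walk.
    for directory in sorted(p for p in root.rglob("*") if p.is_dir()):
        # A build directory is assumed to be a leaf: it holds files, no subdirectories.
        is_leaf = not any(child.is_dir() for child in directory.iterdir())
        if is_leaf and directory.stat().st_mtime < cutoff:
            shutil.rmtree(directory)
            removed.append(directory)
    # Emptied parent directories are left in place; a later run would treat
    # them as leaves once they are old enough.
    return removed
```

A cron entry would then only need to call `purge_old_builds(Path("/srv/..."))` once a day, where the root path is whatever the log host ends up using.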
[23:37:39] I am sure it can be done reasonably fast on a new server [23:38:19] or maybe on lanthanum [23:38:27] it has 380GB free :] [23:39:05] but then that is still a Precise machine :/ [23:42:35] hashar: https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/8310/console 23:39:52 java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: request to write '1272' bytes exceeds size in header of '1219' bytes for entry 'log/mw-debug-www.log' [23:42:40] causing the submit job to fail [23:42:45] ah yeah [23:42:50] was looking at that task in phabricator [23:43:18] legoktm: https://phabricator.wikimedia.org/T78590 [23:43:30] legoktm: the qunit tests end before all apache requests have been completed [23:43:46] weird [23:43:49] ok [23:43:50] at the end of the qunit test, Jenkins compresses all the logs using gzip [23:43:52] but [23:43:57] only the submit jobs are failing [23:43:59] the test ones passed [23:44:00] then uses tar to gather them [23:44:18] but an apache thread ends up still writing to the log file and tar complains the file has changed while it was processing it [23:44:33] https://gerrit.wikimedia.org/r/#/c/180653/ the "recheck" one passed, then after I +2'd, it failed [23:44:34] yeah that is a race condition in the publishing :-( [23:44:38] blagh [23:44:56] there must be some request that is not properly waited for. Timo commented on the task [23:45:05] if you remove the +2 and revote, that might pass [23:46:03] MediaWiki-General-or-Unknown, Mobile-Web, Continuous-Integration: MediaWiki QUnit test does not wait for all requests to complete, causing a race condition in Jenkins - https://phabricator.wikimedia.org/T78590#854539 (Legoktm) Also happening on https://gerrit.wikimedia.org/r/#/c/180653/ intermittently.
[23:46:30] PROBLEM - Free space - all mounts on deployment-cache-upload02 is CRITICAL: CRITICAL: deployment-prep.deployment-cache-upload02.diskspace._srv_vdb.byte_percentfree.value (<100.00%) [23:48:35] passed :D [23:49:02] legoktm: :-] [23:49:09] legoktm: one will have to investigate what happens though
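The archiving race discussed above (a log file still growing while it is being added to the tarball, so the payload no longer matches the size already written in the header) can be illustrated with a small sketch. This is not what Jenkins does and not the fix tracked in T78590, which is about making the QUnit run wait for outstanding requests. It only shows one way to keep header and payload consistent: read each file once and archive that snapshot. `archive_snapshot` is a hypothetical helper name.

```python
import io
import tarfile
from pathlib import Path
from typing import Iterable


def archive_snapshot(log_files: Iterable[Path], archive_path: Path) -> None:
    """Write a gzip-compressed tar built from one-shot snapshots of each log file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in log_files:
            path = Path(path)
            # Single read: anything appended after this point is simply not archived.
            data = path.read_bytes()
            info = tarfile.TarInfo(name=path.name)
            # The header size is taken from the bytes actually held in memory,
            # so it cannot disagree with the payload that follows.
            info.size = len(data)
            info.mtime = int(path.stat().st_mtime)
            tar.addfile(info, io.BytesIO(data))
```

The trade-off is that lines written after the snapshot are lost from the archive, so this hides the symptom rather than the late Apache request behind it.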