[00:03:41] (03CR) 10Krinkle: [C: 032] Don't run operations-apache-config-lint on operations/mediawiki-config [integration/config] - 10https://gerrit.wikimedia.org/r/180994 (owner: 10Krinkle) [00:04:30] (03Merged) 10jenkins-bot: Don't run operations-apache-config-lint on operations/mediawiki-config [integration/config] - 10https://gerrit.wikimedia.org/r/180994 (owner: 10Krinkle) [00:06:37] !log restored local commit with ssh keys for scap to deployment-salt [00:06:39] Logged the message, Master [00:06:57] 3Continuous-Integration: remove operations-apache-config-lint on operations/mediawiki-config - https://phabricator.wikimedia.org/T78782#934438 (10Krinkle) 5Open>3Resolved [00:07:54] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: QUnit job unable to collect logs, test still has active http requests writing to them - https://phabricator.wikimedia.org/T78590#934443 (10Jdlrobson) @Krinkle this is not the same bug. What does a log file running out of space have to do with te... [00:08:28] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: QUnit job unable to collect logs, test still has active http requests writing to them - https://phabricator.wikimedia.org/T78590#934444 (10Jdlrobson) (you'll see the tests pass but it is Archiving artifacts that fails) [00:10:39] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: QUnit job unable to collect logs, test still has active http requests writing to them - https://phabricator.wikimedia.org/T78590#934447 (10Jdlrobson) (Running locally I'm seeing no ajax requests. I just verified that again.) [00:11:08] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: QUnit job unable to collect logs, test still has active http requests writing to them - https://phabricator.wikimedia.org/T78590#934449 (10Krinkle) >>! In T78590#934443, @Jdlrobson wrote: > @Krinkle this is not the same bug. > What does a log fi... [00:12:23] PROBLEM - Puppet staleness on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [00:13:35] 3Multimedia, MediaWiki-File-management, Beta-Cluster: Thumbnail generation should happen via the same setup in the beta cluster and in production - https://phabricator.wikimedia.org/T84950#934455 (10Tgr) 3NEW [00:16:51] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:17:53] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:18:25] 3Continuous-Integration, MediaWiki-General-or-Unknown, Mobile-Web: QUnit job unable to collect logs, test still has active http requests writing to them - https://phabricator.wikimedia.org/T78590#934465 (10Jdlrobson) @krinkle I'm not seeing anything obvious so if you can find the time to take a look at our tests... 
[00:19:05] 3Multimedia, MediaWiki-File-management, Beta-Cluster: Thumbnail generation should happen via the same setup in the beta cluster and in production (tracking) - https://phabricator.wikimedia.org/T84950#934466 (10greg) p:5Triage>3Normal [00:19:56] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:20:07] 3Multimedia, MediaWiki-File-management, Beta-Cluster: Thumbnail generation should happen via the same setup in the beta cluster and in production (tracking) - https://phabricator.wikimedia.org/T84950#934455 (10greg) [00:20:30] 3Multimedia, MediaWiki-File-management, Beta-Cluster: Thumbnail generation should happen via the same setup in the beta cluster and in production (tracking) - https://phabricator.wikimedia.org/T84950#934455 (10greg) [00:20:31] 3Continuous-Integration: 'mediawiki-gate' seems to block …wmf12 branch merges because a master item is being merged, which isn't helpful - https://phabricator.wikimedia.org/T84951#934472 (10Jdforrester-WMF) 3NEW [00:20:48] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [00:21:26] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [00:22:40] 3Beta-Cluster: beta-scap-eqiad failing due to key error - https://phabricator.wikimedia.org/T84946#934479 (10bd808) a:3bd808 A local commit in deployment-salt:/var/lib/git/labs/private was lost due to a reset to origin/master followed by a fast-forward pull. I dug up the commit using reflog and have put it bac... [00:23:29] 3Multimedia, MediaWiki-File-management, Beta-Cluster: Thumbnail generation should happen via the same setup in the beta cluster and in production (tracking) - https://phabricator.wikimedia.org/T84950#934481 (10greg) [00:23:54] 3Multimedia, MediaWiki-File-management, Beta-Cluster: Thumbnail generation should happen via the same setup in the beta cluster and in production (tracking) - https://phabricator.wikimedia.org/T84950#934455 (10greg) [00:25:10] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:25:25] marxarelli: https://phabricator.wikimedia.org/T84946#934479 (not sure if you or not, and no worries if so, shit happens) [00:25:48] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [00:26:43] I just set the magic git flags on that repo to make `git pull` rebase always [00:26:45] greg-g: not me, afaik [00:27:51] marxarelli: kk [00:28:06] It could have been one of several folks, but not a huge deal. git-reflog and I are old buddies on deployment-salt [00:28:17] :) & :/ [00:28:42] bd808: dumb question, is that setting a local config setting, thus would be lost in a theoretical rebuild of the instance? [00:28:52] yeah [00:28:56] great [00:29:08] can that be puppetized? (not by you) [00:29:12] it really should be the default behavior of git IMO [00:30:14] it could be maybe yeah [00:30:24] it's just some lines in the .git/config file [00:30:55] [branch "master"] rebase = true [00:31:06] [branch] autosetuprebase = always [00:40:12] Yippee, build fixed! [00:40:12] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #341: FIXED in 51 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/341/ [00:41:16] 3Beta-Cluster: Puppetize rebase = true for all relevant hosts (all?) 
- https://phabricator.wikimedia.org/T84953#934522 (10greg) 3NEW [00:43:03] 3Beta-Cluster: beta-scap-eqiad failing due to key error - https://phabricator.wikimedia.org/T84946#934532 (10bd808) Ran these commands to help prevent this from accidentally happening again: - git config branch.master.rebase true - git config branch.autosetuprebase always I really wish that was the default... [00:44:09] bd808: can we restart the beta-scap job? [00:44:30] greg-g: soon. Runnign manual scap at the moment [00:44:52] ah, sorry, I'm impatient :) [00:45:03] 00:44:04 Finished scap: (no message) (duration: 18m 07s) [00:45:31] bd808: when you come to town in January, I'll just give you my credit card to take to the bars with you, k? [00:45:37] It takes a while to sync a whole day's changes [00:45:59] greg-g: Only if you come with me and protect me from Siebrand [00:46:12] I can do part 1, but I can't promise part 2 [00:46:18] He can smell an open tab from miles away [00:46:22] lol [00:46:54] I'll be staying at club quarters so not much room for you to crash on the floor after :( [00:47:07] There is a good bar in the alley there though [00:47:14] well at least they have good beer [00:47:17] Yippee, build fixed! [00:47:18] Project beta-scap-eqiad build #34504: FIXED in 1 min 57 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34504/ [00:47:21] ^^ [00:47:26] Yippee! [00:47:43] spagewmf: ^ [00:47:59] Yippee, build fixed! [00:48:00] Project browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce build #231: FIXED in 19 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce/231/ [00:49:11] 3Beta-Cluster: beta-scap-eqiad failing due to key error - https://phabricator.wikimedia.org/T84946#934545 (10bd808) > I know there is some Jenkins job that updates Beta cluster and there are alerts for it, but there are no links for them on https://www.mediawiki.org/wiki/Beta_cluster or https://wikitech.wikimedi... [00:49:58] 3Beta-Cluster: beta-scap-eqiad failing due to key error - https://phabricator.wikimedia.org/T84946#934546 (10bd808) 5Open>3Resolved Fixed: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34504/ [00:53:36] 3Beta-Cluster: Announce beta-scap-eqiad failures in -qa every time - https://phabricator.wikimedia.org/T84947#934551 (10bd808) I manually changed the irc notification strategy from "new failure and fixed" to "failure and fixed" at . Ne... [00:55:10] greg-g: On a related note, I disabled irc pings from wmf-insecte so somebody on your team should pay attention to that scap job. :) [00:56:14] bd808: good [00:57:18] whereis twentyafterfoud [01:04:12] alright, time to leave, later all [01:51:31] Yippee, build fixed! 
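The `git config` commands bd808 quotes above (branch.master.rebase and branch.autosetuprebase) are what keeps `git pull` from discarding local commits on deployment-salt, and T84953 asks for the same settings to be puppetized so they survive an instance rebuild. As a rough illustration only, and not the actual puppet change, an idempotent version of that fix could look like the following; the repo path is the one from T84946, everything else is hypothetical:

```
#!/usr/bin/env python
"""Hypothetical helper mirroring the manual fix above (see T84953).

The real change would be puppetized; this only illustrates applying
branch.master.rebase / branch.autosetuprebase idempotently.
"""
import subprocess


def ensure_git_config(repo, key, value):
    # Only write the value if it is missing or different (idempotent).
    try:
        current = subprocess.check_output(
            ['git', 'config', '--get', key], cwd=repo).decode().strip()
    except subprocess.CalledProcessError:
        current = None  # key not set yet
    if current != value:
        subprocess.check_call(['git', 'config', key, value], cwd=repo)


if __name__ == '__main__':
    repo = '/var/lib/git/labs/private'  # repo mentioned in T84946 above
    ensure_git_config(repo, 'branch.master.rebase', 'true')
    ensure_git_config(repo, 'branch.autosetuprebase', 'always')
```

In puppet the same effect would presumably come from an exec (or equivalent resource) guarded by an "is it already set" check, which is what the helper above does by hand.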
[01:51:32] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #197: FIXED in 47 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/197/ [02:07:50] Project browsertests-MultimediaViewer-mediawiki.org-linux-firefox-sauce build #351: FAILURE in 10 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-mediawiki.org-linux-firefox-sauce/351/ [02:31:03] Project browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #346: FAILURE in 22 min: https://integration.wikimedia.org/ci/job/browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/346/ [02:33:45] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #352: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/352/ [03:11:18] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:34:27] Yippee, build fixed! [03:34:27] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #145: FIXED in 32 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/145/ [03:59:40] Yippee, build fixed! [03:59:40] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #198: FIXED in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/198/ [04:54:03] Project beta-scap-eqiad build #34530: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34530/ [04:56:04] Yippee, build fixed! [04:56:05] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #424: FIXED in 1 hr 4 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/424/ [04:57:31] Project beta-scap-eqiad build #34531: STILL FAILING in 2 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34531/ [05:05:08] Project beta-scap-eqiad build #34532: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34532/ [05:13:24] Yippee, build fixed! [05:13:24] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #253: FIXED in 18 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/253/ [05:16:00] Project beta-scap-eqiad build #34533: STILL FAILING in 1 min 58 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34533/ [05:25:37] Project beta-scap-eqiad build #34534: STILL FAILING in 1 min 36 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34534/ [05:33:09] !log Rebasing integration-puppetmaster with latest operations/puppet upstream (5 patches) [05:33:11] Logged the message, Master [05:35:28] Yippee, build fixed! 
[05:35:29] Project beta-scap-eqiad build #34535: FIXED in 1 min 31 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34535/ [05:49:38] 3Continuous-Integration: /tmp/sites-1418******.json files are left behind on Jenkins slaves - https://phabricator.wikimedia.org/T84970#935648 (10Krinkle) 3NEW [05:56:04] 3Continuous-Integration: /tmp/sites-1418******.json files are left behind on Jenkins slaves - https://phabricator.wikimedia.org/T84970#935658 (10Krinkle) [Search](https://github.com/search?p=2&q=%22sites-%22+%40wikimedia&ref=searchresults&type=Code&utf8=%E2%9C%93) has led me to [tests/phpunit/includes/site/SiteL... [05:58:46] 3Continuous-Integration, MediaWiki-Unit-tests: /tmp/sites-******.json files are left behind on Jenkins slaves - https://phabricator.wikimedia.org/T84970#935659 (10Krinkle) p:5Triage>3Normal a:3aude [05:59:02] 3Continuous-Integration: Jenkins: Figure out long term solution for /tmp management - https://phabricator.wikimedia.org/T74011#935664 (10Krinkle) [06:17:42] 3Phabricator, Continuous-Integration: Create tag for zuul-cloner - https://phabricator.wikimedia.org/T84945#935666 (10Krinkle) 5Open>3declined Never mind for now. These should be filed upstream anyway. Tracking with #zuul, #uptream and a descriptive title should suffice. [06:17:52] 3Phabricator, Continuous-Integration: Create tag for zuul-cloner - https://phabricator.wikimedia.org/T84945#935668 (10Krinkle) [06:18:05] 3Phabricator, Continuous-Integration: Create tag for zuul-cloner - https://phabricator.wikimedia.org/T84945#934331 (10Krinkle) [06:19:11] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#935670 (10Krinkle) 5Open>3declined Instance has been deleted and re-created. Can't reproduce this issue. [06:20:35] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#935672 (10Krinkle) [06:31:24] Yippee, build fixed! [06:31:25] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #477: FIXED in 1 hr 18 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/477/ [06:31:29] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#935679 (10Krinkle) I'm not sure what that has to do with the actual error. The error seems to be because of a dependency on a version of a package that does not exist (`Could not find .. statsd>=... [06:48:49] 3Continuous-Integration: mediawiki-gate job blocking on different branches - https://phabricator.wikimedia.org/T74432#935681 (10Legoktm) [06:48:50] 3Continuous-Integration: 'mediawiki-gate' seems to block …wmf12 branch merges because a master item is being merged, which isn't helpful - https://phabricator.wikimedia.org/T84951#935680 (10Legoktm) [07:24:05] !log Attempt #5 at re-creating integration-slave1001. Completed provisioning per Setup instructions. Pooled. 
[07:24:08] Logged the message, Master [07:24:13] !log Restarting Gearman connection to Jenkins [07:24:15] Logged the message, Master [07:47:56] 3Continuous-Integration: /tmp/MWDocGen-* files are left behind on Jenkins slaves - https://phabricator.wikimedia.org/T84973#935718 (10Krinkle) 3NEW a:3hashar [07:48:54] 3Continuous-Integration: /tmp/MWDocGen-* files are left behind on Jenkins slaves - https://phabricator.wikimedia.org/T84973#935718 (10Krinkle) [07:50:05] 3Continuous-Integration: Jenkins: Figure out long term solution for /tmp management - https://phabricator.wikimedia.org/T74011#935727 (10Krinkle) [07:55:07] 3Continuous-Integration: /tmp/bundler* directories left behind on Jenkins slaves - https://phabricator.wikimedia.org/T84974#935729 (10Krinkle) 3NEW [07:56:30] 3Continuous-Integration: Jenkins: Figure out long term solution for /tmp management - https://phabricator.wikimedia.org/T74011#935745 (10Krinkle) [07:58:14] 3Continuous-Integration: Jenkins: Figure out long term solution for /tmp management - https://phabricator.wikimedia.org/T74011#759885 (10Krinkle) [07:58:15] 3Continuous-Integration: hhvm Jenkins job fill up /tmp with sess_* files - https://phabricator.wikimedia.org/T65611#935746 (10Krinkle) 5Invalid>3Open The integration slaves in labs running Trusty (integration-slave1006 and up) are now running MediaWiki installs in jobs with HHVM. These files are back and bei... [07:58:53] 3Continuous-Integration: /tmp/sess_* left behind on Jenkins slaves (hhvm php sessions) - https://phabricator.wikimedia.org/T65611#935749 (10Krinkle) [08:34:28] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #198: FAILURE in 41 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/198/ [08:42:11] Yippee, build fixed! [08:42:11] Project browsertests-MultimediaViewer-mediawiki.org-linux-firefox-sauce build #352: FIXED in 6 min 2 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-mediawiki.org-linux-firefox-sauce/352/ [08:55:13] 3Quality-Assurance: mediawiki/ruby/api repo should run unit tests after every patch set - https://phabricator.wikimedia.org/T65467#683668 (10zeljkofilipin) [09:04:55] 3Quality-Assurance: mediawiki/ruby/api repo should run unit tests after every patch set - https://phabricator.wikimedia.org/T65467#935823 (10zeljkofilipin) @hashar, @dduvall: The last commit for the repository has mediawiki-ruby-api-bundle-rspec job and it is green: https://gerrit.wikimedia.org/r/#/c/175870/ ht... 
[09:05:45] 3Quality-Assurance: mediawiki/ruby/api repo should run unit tests after every patch set - https://phabricator.wikimedia.org/T65467#935825 (10zeljkofilipin) 5Open>3Resolved [09:06:57] 3Quality-Assurance: mediawiki/ruby/api repo should run unit tests after every patch set - https://phabricator.wikimedia.org/T65467#683668 (10zeljkofilipin) 5Resolved>3Open [09:07:22] 3Quality-Assurance: mediawiki/ruby/api repo should run unit tests after every patch set - https://phabricator.wikimedia.org/T65467#683668 (10zeljkofilipin) 5Open>3Resolved [09:13:54] 3Release-Engineering: Make RuboCop job voting - https://phabricator.wikimedia.org/T84979#935849 (10zeljkofilipin) 3NEW [09:15:43] 3Continuous-Integration: test failure in site.random - https://phabricator.wikimedia.org/T84944#935858 (10Qgil) [09:20:42] 3Quality-Assurance: Use dotenv ruby gem for configuration management - https://phabricator.wikimedia.org/T71405#935863 (10zeljkofilipin) @dduvall: you have decided to use `environments.yml` instead of `.env` file? Example: https://gerrit.wikimedia.org/r/#/c/180324/2/tests/browser/environments.yml,cm [09:47:17] Yippee, build fixed! [09:47:18] Project browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #347: FIXED in 26 min: https://integration.wikimedia.org/ci/job/browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/347/ [10:00:15] 3Continuous-Integration, pywikibot-core: test failure in site.random - https://phabricator.wikimedia.org/T84944#934318 (10hashar) [10:08:49] 3MediaWiki-General-or-Unknown, Continuous-Integration, Mobile-Web: QUnit job unable to collect logs, test still has active http requests writing to them - https://phabricator.wikimedia.org/T78590#935949 (10Florian) I tried to figure out, what @Krinkle said. What i did: I added a QUnit.done() callback on top of... [10:10:16] 3Continuous-Integration: /tmp/MWDocGen-* files are left behind on Jenkins slaves - https://phabricator.wikimedia.org/T84973#935955 (10hashar) The maintenance script uses wfTempDir(), would be solved by creating a $WORKSPACE/tmp dir and point $TMP and $TEMP to it. That is the purpose of T70563 [10:11:22] 3Continuous-Integration: Jenkins: point TMP/TEMP to workspace and delete it after build completion - https://phabricator.wikimedia.org/T70563#935957 (10hashar) 5duplicate>3Open Reopening, this has not happen. We need to clear/create $WORKSPACE/tmp and point $TMP and $TEMP to it. [10:18:11] 3Release-Engineering: Make RuboCop job voting - https://phabricator.wikimedia.org/T84979#935966 (10hashar) For mediawiki-core, the test should not be voting for the old release branch unless we fix rubocop in them as well (unlikely). In Zuul that can be done with: jobs: - mediawiki-core-bundle-rubocop bran... [10:21:54] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution - https://phabricator.wikimedia.org/T76250#935970 (10hashar) 5declined>3Open >>! In T76250#935670, @Krinkle wrote: > Instance has been deleted and re-created. C... 
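hashar's comments above on T84973 and T70563 describe the intended fix for the stray /tmp files: create a per-build $WORKSPACE/tmp, point $TMP and $TEMP at it, and clear it when the build ends. A minimal sketch of that idea follows; it is only an illustration (the wrapper name and invocation are made up, the real change would live in the Jenkins job definitions):

```
#!/usr/bin/env python
"""Sketch of the $WORKSPACE/tmp idea from T70563 (illustrative only).

Creates a build-local temp dir, points the usual env vars at it, runs
the wrapped command, then removes the dir so nothing is left on the slave.
"""
import os
import shutil
import subprocess
import sys


def run_with_workspace_tmp(command):
    workspace = os.environ['WORKSPACE']        # set by Jenkins for every build
    tmpdir = os.path.join(workspace, 'tmp')
    shutil.rmtree(tmpdir, ignore_errors=True)  # clear leftovers from last run
    os.makedirs(tmpdir)
    env = dict(os.environ, TMPDIR=tmpdir, TMP=tmpdir, TEMP=tmpdir)
    try:
        return subprocess.call(command, env=env)
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)


if __name__ == '__main__':
    sys.exit(run_with_workspace_tmp(sys.argv[1:]))
```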
[10:22:46] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: CI labs instances can't start on reboot: tmpfs: Bad value 'jenkins-deploy' for mount option 'uid' - https://phabricator.wikimedia.org/T76250#935972 (10hashar) [10:23:47] 3Phabricator, Continuous-Integration: Create tag for zuul-cloner - https://phabricator.wikimedia.org/T84945#935973 (10hashar) Zuul cloner is indeed part of Zuul :-] [10:30:27] 3Beta-Cluster: beta-scap-eqiad failing due to key error - https://phabricator.wikimedia.org/T84946#936003 (10hashar) >>! In T84946#934479, @bd808 wrote: > A local commit in deployment-salt:/var/lib/git/labs/private was lost due to a reset to origin/master followed by a fast-forward pull. I dug up the commit usin... [10:38:49] 3Release-Engineering: scap.ssh.cluster_ssh() only returns the last line of error - https://phabricator.wikimedia.org/T84986#936017 (10hashar) 3NEW [10:39:26] 3Release-Engineering: scap.ssh.cluster_ssh() only returns the last line of error - https://phabricator.wikimedia.org/T84986#936017 (10hashar) Note the above error message comes from T84946 which was about mwdeploy private key not existing. A more obvious error message would have helped. [10:39:47] 3Beta-Cluster: beta-scap-eqiad failing due to key error - https://phabricator.wikimedia.org/T84946#934339 (10hashar) And I have filled T84986 about the misleading error message reported by scap. [10:48:17] 3operations, Continuous-Integration: Acquire old production API servers for use in CI - https://phabricator.wikimedia.org/T84940#936038 (10hashar) +Coren I talked with him about CI labs during the summer. That is following the meeting we had for the RFC "Extensions continuous integration" (T1350). With more an... [10:52:59] 3Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#936043 (10hashar) [11:15:16] Project beta-scap-eqiad build #34569: FAILURE in 1 min 5 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34569/ [11:20:16] 3Continuous-Integration: Create labs project for CI disposables instances + OpenStack API credentials - https://phabricator.wikimedia.org/T84988#936061 (10hashar) 3NEW [11:21:17] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: Create labs project for CI disposables instances + OpenStack API credentials - https://phabricator.wikimedia.org/T84988#936061 (10hashar) [11:24:35] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: Figure out how to dedicate baremetal to a specific labs project - https://phabricator.wikimedia.org/T84989#936078 (10hashar) 3NEW [11:25:00] 3operations, Continuous-Integration: Acquire old production API servers for use in CI - https://phabricator.wikimedia.org/T84940#934261 (10hashar) Filled {T84989} to see how we could dedicate that material to CI. 
[11:25:16] Project beta-scap-eqiad build #34570: STILL FAILING in 1 min 8 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34570/ [11:25:24] 3Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#936091 (10hashar) [11:35:56] Project beta-scap-eqiad build #34571: STILL FAILING in 1 min 45 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34571/ [11:45:44] Project beta-scap-eqiad build #34572: STILL FAILING in 1 min 32 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34572/ [11:51:17] 3operations, Continuous-Integration: [OPS] hhvm 3.3.0-20140925+wmf3 has some annoying build dependency - https://phabricator.wikimedia.org/T73413#936140 (10fgiunchedi) p:5Triage>3Low [11:56:08] Yippee, build fixed! [11:56:09] Project beta-scap-eqiad build #34573: FIXED in 1 min 54 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34573/ [14:20:14] YuviPanda: around? [14:20:53] kart_: ‘sup [14:20:59] YuviPanda: has notification about Beta down stopped or Beta cluster is all doing good? haven't received it for a while. [14:21:16] :) [14:21:24] kart_: I think it’s doing all good, but we migrated to shinken and perhaps didn’t include you in the alerts list :D [14:21:36] YuviPanda: please add me :) [14:21:54] kart_: wanna make a patch? ;) [14:22:04] YuviPanda: also, possible to get when instance is 503? [14:22:05] shinken/contacts.cfg and contactgroups.cfg [14:22:12] kart_: instance or everything? [14:22:17] YuviPanda: ok. will do. [14:22:27] YuviPanda: instance. [14:22:27] :) [14:22:33] kart_: not yet, but is on the way... [14:22:39] haven’t had time to work on ityet [14:23:16] YuviPanda: no hurry, but will be very useful. [14:23:34] kart_: it will email when an instance is unresponsive to ping [14:26:40] YuviPanda: thanks. [14:33:39] YuviPanda: https://gerrit.wikimedia.org/r/181067 [14:44:47] hi zeljkof_ [14:45:10] chrismcmahon: hi, in a meeting with manybubbles [14:45:22] hi! [14:46:49] zeljkof_: I don't have to much to go over, would you like to skip our pairing session today and start your holiday early? [14:46:59] hi manybubbles [14:47:26] chrismcmahon: sure, let's meet next year then :) [14:48:17] sounds good! [14:53:34] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:41:27] zeljkof_: so here is something that using vagrant would solve: I need to install a plugin on the elasticsearch instance that jenkins uses [15:41:55] zeljkof: it got further! [15:41:59] manybubbles: :) [15:45:02] zeljkof: can you make the rubocop job voting? I'll fix the failures now [15:45:13] manybubbles: sorry, not today :( [15:45:30] sure, yeah sorry, I forgot that it is the end of your day [15:45:32] wasn't thinking [15:45:37] have good vacation! [15:45:54] but it is on my list for next year :) https://phabricator.wikimedia.org/T84979 [15:51:59] greg-g, /cc cscott gwicke i lost track of your pings here y'day .. but got reminded just now. we do work with hashar helping maintaining parsoid-related test infrastructure, I would like to think of this as a collaborative endeavour since we'll never be as familiar with the beta labs and test infrastructure as you guys are. [15:52:03] 3Release-Engineering: Make RuboCop job voting - https://phabricator.wikimedia.org/T84979#936418 (10Manybubbles) +1. Can you do Cirrus first? I had it clean and then kept hacking a while back and didn't notice that I broke it again. 
I'm going to add a patch that should get it passing again but without it votin... [15:52:51] 15:51:09 /tmp/hudson417148877664820201.sh: line 2: /usr/local/bin/zuul-cloner: No such file or directory [15:52:56] So that pretty much can't be good, right? [15:53:15] marktraceur: there is only zuul. [15:53:22] no wait, there is never zuul [15:53:27] cscott: Apparently there is only *not* zuul. [15:53:28] there are lots of zuuls? [15:53:38] there *were* lots of zuuls, but now there is but one zuul? [15:53:58] do we really want to have zuul clones running around everywhere? [15:54:00] Regardless of how many zuuls there may or may not (only) be, we can no longer make more. [15:54:04] i think this is a good thing [15:54:20] he wasn't exactly a good guy, you know [15:54:47] anyway, it sounds like a job for HASHAR [15:54:51] our resident ghost buster [15:55:15] oh, wait, he's not on irc. who ya gonna call? [16:01:42] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #434: ABORTED in 12 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce/434/ [16:04:02] Project beta-scap-eqiad build #34595: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34595/ [16:05:38] Project beta-scap-eqiad build #34596: STILL FAILING in 3.2 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34596/ [16:07:24] cscott: Krinkle usually? [16:07:31] I guess someone called, ‘coz he’s here [16:09:11] subbu: /me nods [16:10:45] greg-g: yeah, i tend to do a lot of continuous integration work, and i've had it on my informal to-do list to get a handle on our beta-publishing system, because I want OCG to get published to beta too. [16:12:03] greg-g: the current system is not terribly secure (rt 8866) [16:12:52] greg-g: but it's also the case that every project currently seems to have a different way of publishing to beta. VE and parsoid have completely separate systems, OCG does it manually, I don't know what the mechanism is that mw-core uses, etc. [16:13:36] hmmm, rt numbers to phab.... [16:13:45] so i guess i'm saying that i'm willing to do more beta-maintenance work but i'd like it if there weren't quite so much to do. [16:14:02] 3Quality-Assurance, Release-Engineering: Review environment abstraction layer for mediawiki_selenium - https://phabricator.wikimedia.org/T78356#936448 (10zeljkofilipin) We had the review meeting. Is there something left to do, or could this be resolved? [16:14:02] cscott: mw+extensions use scap, ie: what we use in prod [16:14:22] VE is also scap [16:14:35] VE has it's own credentials and own scap process [16:14:37] parsoid and other services are the ones that are different (because, I don't know why) [16:15:03] parsoid and ocg use git-deploy, and i think the short reason is that git-deploy isn't automatable. [16:15:16] why not? [16:15:21] same as scap isn't? :) [16:15:22] parsoid's beta deploy actually uses an ad-hoc rsync, which is different from how parsoid is actually deployed in production [16:15:49] ah [16:15:54] (this: https://integration.wikimedia.org/ci/view/Beta/job/beta-parsoid-update-eqiad/ ) [16:16:06] I don't see the special VE one, but I'll take your word for it :) [16:16:14] greg-g: it's in the RT tickret [16:16:26] what's the phab number? [16:16:32] ah, there might be redirects... [16:16:51] wrt to git-deploy: https://github.com/cscott/trigger/commit/296d4128e779c789d461efb8ffe3371e1b7a6a3b [16:17:20] Yippee, build fixed! 
[16:17:21] Project beta-scap-eqiad build #34597: FIXED in 3 min 6 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34597/ [16:17:28] but trebuchet-deploy does not seem to currently be maintained, since ryan left, eg https://github.com/trebuchet-deploy/trigger/pull/28 [16:17:33] tl:dr right now [16:17:38] (speaking of security issues) [16:18:48] anywho, I need to get some things done, I have 40 minutes of non-scheduled time today [16:19:00] maybe gwicke or bd808 can describe why we use git-deploy instead of scap. i really don't know anything about scap under the hood. [16:19:09] please file a ticket with your thoughts on this if there is anything we can do to make things better [16:19:26] I understand why git-deploy vs scap, but, as I said, gotta run [16:19:28] greg-g: i trust that email to operations@ still works, post phab migration? [16:20:52] for what? [16:21:17] better yet: ask in -operations :) [16:24:13] isn't -qa for questions and answers? ;) [16:26:22] cscott: :P [16:26:36] 3Release-Engineering: scap.ssh.cluster_ssh() only returns the last line of error - https://phabricator.wikimedia.org/T84986#936459 (10bd808) I think the actual output of the failed syncs was something along the lines of: ``` 11:44:59 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', 'deployment-... [16:27:11] YuviPanda: cscott: Where did you see zool-cloner no file error? Which job/slave? [16:27:37] Krinkle: it was marktraceur who posted that [16:28:05] 10:52:51 AM) marktraceur: 15:51:09 /tmp/hudson417148877664820201.sh: line 2: /usr/local/bin/zuul-cloner: No such file or directory [16:28:05] (10:52:56 AM) marktraceur: So that pretty much can't be good, right? [16:28:16] and then i just contributed to a general loss of productivity after that [16:28:20] Krinkle: https://integration.wikimedia.org/ci/job/mwext-UploadWizard-qunit/688/console [16:28:27] I think every UW patch is getting it. [16:32:14] 3Release-Engineering: scap.ssh.cluster_ssh() only returns the last line of error - https://phabricator.wikimedia.org/T84986#936460 (10bd808) 5Open>3Invalid a:3bd808 > ``` > output = fds.pop(proc.stdout.fileno(), '') > yield host, status, output > ``` `fds` is a dictionary, so this `pop()` is not a list po... [16:34:36] marktraceur: thx. A faulty slave is in the pool. Fixing it now [16:34:54] Thanks! [16:35:37] Project beta-scap-eqiad build #34599: FAILURE in 1 min 32 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34599/ [16:45:26] Yippee, build fixed! [16:45:27] Project beta-scap-eqiad build #34600: FIXED in 1 min 27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34600/ [16:46:18] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#936478 (10Krinkle) Okay, so my suspicion is correct. integration-slave1001 has a clean puppet run but `install_zuul` was never finished and was actually causing jenkins-jobs that use zuul-cloner... [16:46:48] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#936481 (10Krinkle) p:5Unbreak!>3High a:3hashar [16:49:13] greg-g: just beat me to the explanation of the restricted task :p [16:49:24] JohnLewis: :) [16:49:40] 16:34:11 zuul-cloner: error: Can not mix change and refupdate parameters [16:49:43] https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/34262/console [16:50:52] not sure if related to the above, still reading... 
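bd808's second comment on T84986 above hinges on `fds` being a dictionary: `fds.pop(proc.stdout.fileno(), '')` hands back everything accumulated for that file descriptor (or the empty-string default), not just a single trailing line, which is why the ticket is marked invalid. A tiny illustration of the difference, with made-up values:

```
# dict.pop(key, default) removes and returns the value stored under `key`,
# so all of the accumulated output for that fd comes back in one piece.
fds = {7: 'line one\nline two\nline three\n'}
output = fds.pop(7, '')   # -> 'line one\nline two\nline three\n'

# list.pop() by contrast removes and returns a single (by default last) item,
# which is the behaviour the original bug report assumed.
lines = ['line one', 'line two', 'line three']
last = lines.pop()        # -> 'line three'
```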
[16:55:25] well, it passed now [17:06:10] 3Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#936503 (10hashar) [17:09:44] legoktm: Yeah, that slave was missing zuul-cloner. I installed it while it was pooled. [17:09:50] you caught it while installing [17:10:00] puppet is broken [17:18:22] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#936515 (10Krinkle) Documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup#Ubuntu_12_Precise [17:20:40] ok, thanks [17:35:04] Project beta-scap-eqiad build #34605: FAILURE in 1 min 1 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34605/ [17:42:53] greg-g: do you have a particular patch that took a really long time in Jenkins? [17:45:22] Project beta-scap-eqiad build #34606: STILL FAILING in 1 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34606/ [17:45:51] chrismcmahon: as I said, I couldn't find one, I was just spot checking [17:47:17] greg-g: I'm wondering if the context is merging multiple patch sets [17:47:58] chrismcmahon: there was complaints of a merge being blocked on unrelated changes (jenkins explicitly waiting until another unrelated change was tested/merged) [17:48:16] the zuul merge queue was pretty slow at several times yesterday. All the jobs I saw that were holding things up were disk i/o related on overloaded slaves (esp 1006) [17:48:56] I saw jobs that took 8+ minutes to setup the environment before running the tests [17:49:11] Load on 1006 was 20+ for quite a while [17:49:47] Which inspired me to file https://phabricator.wikimedia.org/T84911 [17:51:20] zuul intentionally enforces a ordering to merges so when lots of folks are merging in different repos they end up in a single funnel [17:51:36] This is based on the theory that all projects are interrelated [17:51:53] which holds at openstack (inveotrs of zuul) but not always here [17:52:03] *inventors [17:52:22] * bd808 wonders how he soaked all of this up from just email and irc [17:56:07] Project beta-scap-eqiad build #34607: STILL FAILING in 2 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34607/ [17:56:56] twentyafterfour: fyi ^ beta scap is failing again (it was yesterday, too): https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34607/console [17:57:16] twentyafterfour: looks to be the same error, maybe? https://phabricator.wikimedia.org/T84946 [17:57:19] fuck fuck fuck fuck [17:57:44] That host worked yesterday :/ [17:58:31] !log forcing puppet run on deploymnet-videoscaler01 [17:58:36] Logged the message, Master [17:58:42] greg-g: I'll look [17:59:01] "12:10 < Krinkle> puppet is broken" what does this mean? which puppet? puppet-master or a specific host? [17:59:12] integration's puppet stuff [17:59:28] hashar broked it [17:59:31] * greg-g goes into a 2 hour block of 30 minute meetings [18:00:04] Notice: /Stage[main]/Mediawiki::Users/File[/home/mwdeploy]/owner: owner changed 'mwdeploy' to 'mwdeploy' [18:00:04] Notice: /Stage[main]/Mediawiki::Users/File[/home/mwdeploy]/group: group changed 'mwdeploy' to 'mwdeploy [18:00:50] twentyafterfour: ^ That may have been the cause of the scap failures. 
I bet there is a local user/group there for mwdeploy [18:01:38] !log deployment-videoscaler01 has mysteriously aquired a local mwdeploy user instead of the ldap one [18:01:40] Logged the message, Master [18:02:56] !log removed local mwdeploy user & group from videoscaler01 [18:02:59] Logged the message, Master [18:04:52] twentyafterfour: forcing puppet run did not recreate a local mwdeploy user/group on videoscaler01 which is good [18:05:23] ok [18:05:35] so I wonder where the local user came from? [18:05:39] I think the scap that is running now should succeed [18:05:57] Yippee, build fixed! [18:05:57] Project beta-scap-eqiad build #34608: FIXED in 1 min 54 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34608/ [18:06:08] There is puppet code to create the user/group. If there was an ldap failure during the puppet run it would make a local one [18:06:25] and it works again now ^^ [18:06:48] Labs ldap and dns issues bork things occasionally [18:06:54] which sucks [18:07:38] It would be nice if there was a way to disable puppet's ability to create local accounts or at least make an ldap look a hard failure instead of a soft one [18:08:40] I don't no if setting User { forcelocal => false } would do that or not [18:09:09] Or maybe User {provider=>ldap} is really what we need in beta [18:09:47] creating an ldap user would always fail (which I think is really what we want) [18:10:24] It would be nice to catch it in the act. /me looks in puppet logs on video01 [18:10:53] provider=ldap might do it [18:11:11] I could look through the puppet code to see what it actually does [18:13:39] I don't see a log message showing the creation of the mwdeploy user locally, but I do see a lot of "changed 'mwdeploy' to 'mwdeploy'" messages so it may have been flapping back and forth for quite some time [18:14:16] * bd808 wanders back to looking at the logstash server [18:16:02] !log ran `apt-get dist-upgrade` on logstash01 [18:16:04] Logged the message, Master [18:23:50] !log redis input to logstash stuck; restarted service [18:23:52] Logged the message, Master [18:25:01] YuviPanda: I've got a monitoring question. How/where would I create a monitor on the length of a redis list? [18:25:27] bd808: ah, hmm. I wonder if we already have a check_ for that. [18:25:45] bd808: looks like we don't [18:26:06] bd808: should be trivial to write a check_redis that checks length and returns it and then we can check. [18:26:11] bd808: what are you looking to monitor? [18:26:22] the logstash redis input [18:26:37] to notice when logstash stops taking things in from it [18:26:47] * bd808 is writing a ticket in phab now [18:26:56] yeah, cc me? [18:27:10] bd808: I can write check_redis_length if you do not have the time :) [18:31:02] YuviPanda: https://phabricator.wikimedia.org/T85013 [18:31:13] Prod would benefit form this check too [18:31:16] *from [18:34:12] bd808: yeah [18:34:28] bd808: redis-tools sadly doesn’t exist in precise, so can’t just use redis-cli [18:34:34] so might have to setup python-redis [18:42:20] YuviPanda: It's easy from php too :) [18:42:35] bd808: are you drunk already? 
;) [18:42:55] * bd808 counts empties around desk [18:43:01] not yet :) [18:43:32] heh [18:51:46] !log Re-created and provisioning integration-slave1005 (UbuntuTrusty) [18:51:48] Logged the message, Master [19:34:11] Project beta-scap-eqiad build #34614: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34614/ [19:35:20] Project beta-scap-eqiad build #34615: STILL FAILING in 0.36 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34615/ [19:47:19] Yippee, build fixed! [19:47:20] Project beta-scap-eqiad build #34616: FIXED in 3 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34616/ [20:48:02] 3Continuous-Integration: Puppet broken on integration slaves: install_zuul - https://phabricator.wikimedia.org/T84917#937007 (10Krinkle) On a new Trusty instance (e.g. integration-slave1005), the latest master of `integration/zuul` (`wmf-deploy-20141208-2`) actually installs without problems. So I guess that pac... [20:51:19] !log integration-slave1005 (new Ubuntu Trusty instance) is now pooled [20:51:22] Logged the message, Master [21:03:21] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: CI labs instances can't start on reboot: tmpfs: Bad value 'jenkins-deploy' for mount option 'uid' - https://phabricator.wikimedia.org/T76250#937056 (10Krinkle) >>! In T76250#935970, @hashar wrote: >>>! In T76250#935670, @Krinkle wrote: >> Instance has been... [21:32:24] 3Wikimedia-Labs-Infrastructure, Continuous-Integration: CI labs instances can't start on reboot: tmpfs: Bad value 'jenkins-deploy' for mount option 'uid' - https://phabricator.wikimedia.org/T76250#937096 (10hashar) >>! In T76250#937056, @Krinkle wrote: > I didn't rebooted it. But it worked fine when provisioning... [21:55:33] Project beta-scap-eqiad build #34629: FAILURE in 1 min 36 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34629/ [21:58:46] 3Quality-Assurance: Use dotenv ruby gem for configuration management - https://phabricator.wikimedia.org/T71405#937156 (10dduvall) 5Open>3declined a:3dduvall Yes, I think the native functionality for an `environments.yml` file in the new environment abstraction layer for mw-selenium essentially does what w... [21:58:59] Yippee, build fixed! [21:58:59] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #403: FIXED in 35 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/403/ [22:01:51] well, THAT hasn't happened in a long time ^^ [22:01:55] 3Quality-Assurance, Release-Engineering: Review and merge environment abstraction layer for mediawiki_selenium - https://phabricator.wikimedia.org/T78356#937167 (10dduvall) [22:05:36] Yippee, build fixed! [22:05:36] Project beta-scap-eqiad build #34630: FIXED in 1 min 37 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34630/ [22:07:33] 3Quality-Assurance, Release-Engineering: Review and merge environment abstraction layer for mediawiki_selenium - https://phabricator.wikimedia.org/T78356#937175 (10dduvall) I've expanded this task to include actually merging the experimental branch. [22:23:07] Yippee, build fixed! [22:23:08] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #435: FIXED in 1 hr 26 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce/435/ [22:23:58] aand the VE-to-wikitext bug is fixed... 
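For the check discussed above around T85013 (alerting when the logstash redis list stops draining), bd808 and YuviPanda settle on python-redis since redis-tools is not packaged for precise. A minimal sketch of such a check_redis_length plugin under that assumption follows; the key name, thresholds and argument handling are placeholders, not the eventual implementation:

```
#!/usr/bin/env python
"""Minimal check_redis_length sketch for T85013 (assumes python-redis).

Exits with standard Nagios/Shinken codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
"""
import sys

import redis


def check(host, key, warn, crit):
    try:
        length = redis.StrictRedis(
            host=host, port=6379, socket_timeout=5).llen(key)
    except redis.RedisError as e:
        print('UNKNOWN: %s' % e)
        return 3
    if length >= crit:
        status, code = 'CRITICAL', 2
    elif length >= warn:
        status, code = 'WARNING', 1
    else:
        status, code = 'OK', 0
    print('%s: %s has %d items' % (status, key, length))
    return code


if __name__ == '__main__':
    # e.g. check_redis_length deployment-logstash1 logstash 1000 10000
    host, key = sys.argv[1], sys.argv[2]
    warn, crit = int(sys.argv[3]), int(sys.argv[4])
    sys.exit(check(host, key, warn, crit))
```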
^^ [22:30:17] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #343: FAILURE in 44 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/343/ [22:38:57] Zuul seems to be hanging. Anyone here able to fix it? [22:40:43] chrismcmahon: ^ can you diagnose please [22:41:34] if I knew where zuul was running I could at least look into it but I'm not sure [22:42:29] https://integration.wikimedia.org/zuul/ [22:42:54] https://www.mediawiki.org/wiki/Continuous_integration/Zuul [22:43:09] twentyafterfour: ^ server info there [22:43:40] greg-g: I don't have access to gallium [22:44:10] twentyafterfour@bast1001:~$ ssh gallium [22:44:11] Permission denied (publickey). [22:44:46] well that doesn't help [22:45:04] I am not at all sure that gallium is accurate [22:45:59] I'll see if I can get there [22:46:00] twentyafterfour: there are some things you can do through the web interface, see the known issues section of that page [22:46:30] (I don't have access either) [22:46:32] looks like gallium is correct to me (based on puppet configs) [22:46:49] yeah, it is [22:47:21] greg-g: there is a list though :) [22:47:44] (of those with access - those in contint-* groups have access) [22:48:36] I don't know how to diagnose the issue and be confident in the action to take [22:48:46] I know continuous integration isn't my direct responsibility, but I'd be glad to help out with such things when nobody is around, if you wanna give me access [22:49:12] greg-g: I don't either but I know how to be careful and only proceed when I know for sure ... [22:51:33] twentyafterfour: greg-g: thanks for your help. Not sure what you did or if at all, but it seems to be progressing now [22:52:14] bearND: spirits [22:54:04] greg-g: replied to the phab ticket [22:54:18] (about, hopefully twentyafterfour :p) [22:54:55] 17:53 < greg-g> Krinkle: did you do anytihng with Zuul recently? [22:54:55] 17:53 < Krinkle> greg-g: Nope. Zuul and Jenkins commit suicide every 10-40 hours as usual for the past few months. [22:54:58] 17:53 < greg-g> and just come back up? [22:55:01] 17:54 ~ dbrant is now dbrant|bbl [22:55:03] 17:54 < Krinkle> No, they escalate until Antoine or I wake up and push every button we see until it comes back [22:55:06] 17:54 < Krinkle> I've just disconnected and relaunched the Gearman manager on gallium [22:55:09] 17:54 < Krinkle> That usually brings it back [22:55:29] twentyafterfour: ^ [22:55:37] ISTR I once could get on gallium, but no luck today [22:56:26] I tried from the internet and from bastion.wmflabs.org [22:56:57] well, it's not in labs [22:57:07] try bast1001.wikimedia.org [22:57:13] chrismcmahon: I think ops are now stern with puppetized access and you're not listed down for gallium in puppet [22:57:40] might be something you want to prod about though maybe :) [22:57:44] JohnLewis: that makes sense. greg-g ^^ I haven't had need of gallium in quite some time and a lot has changed. [22:58:01] gallium is root and contint only. Pretty much everything is managed via Jenkins web UI though, or via Jenkins API from jenkins-job-builder. Both of which use wmf ldap group as access and most ppl are in. [22:58:29] I just don't know how to know if the current issue is actually: "The Gearman server sometime deadlock when a job is created in Jenkins. The Gearman process is still around but TCP connections time out completely and it does not process anything. 
The workaround is to disconnect Jenkins from the Gearman server:" [22:58:47] greg-g: It was exactly that [22:58:52] how did you know? [22:59:25] greg-g: Well, usually I don't bother figuring out whether it is that. Restarting Gearman is quick and harmless (running jobs keep running, queued jobs are kept by Zuul). [22:59:42] Whenever I see Zuul dashboard full of jobs all 'queue' with nothing running, that's pretty much it [22:59:57] twentyafterfour: ^ :) [22:59:58] This time though I determined it manually by trying to connect to gearman over ssh, and it timed out [23:00:13] By e.g. running $ /usr/local/bin/zuul-gearman.py status on gallium [23:00:38] greg-g: where did you paste that from btw? [23:00:51] https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock [23:01:05] * greg-g can read docs :) [23:01:10] Perfect [23:01:55] !log Krinkle restarted Gearman, which got the jobs to flow again [23:01:58] Logged the message, Master [23:02:15] oh yeah, I have done https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock before :-) [23:02:51] I was looking for it in the Jenkins docs on mw.o though [23:03:15] yeah, where one ends and the other begins/where the problem exists isn't always obvious [23:08:53] mail sent [23:09:58] greg-g: Jenkins is the most known piece of software in the stack (like anyone's heard of Zuul or Gearman in CI context). But it rarely has issues. It's always there in the background, just waiting to run jobs. [23:10:57] I still haven't been able to fully investigate what causes the deadlocks in the first place. I think Antoine gave it a few tries and mostly found upstream logic errors in Zuul or Gearman that apparently other customers aren't hitting as much. Could be related to our scale and/or complexity. [23:11:43] * greg-g nods [23:19:39] 3Continuous-Integration: Warn/alert on too many jobs queued - https://phabricator.wikimedia.org/T85034#937275 (10greg) 3NEW [23:22:42] greg-g: I assume you're the 'manager approver' here for the gallium access? [23:23:02] probably [23:23:28] greg-g: are you his manager basically :p [23:23:37] yep [23:24:08] Project beta-scap-eqiad build #34635: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34635/ [23:24:14] then yep. I'll patch it up pending a quick chat [23:28:45] Yippee, build fixed! [23:28:46] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #199: FIXED in 40 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/199/ [23:31:19] Yippee, build fixed! [23:31:19] Project beta-scap-eqiad build #34636: FIXED in 5 min 47 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34636/ [23:32:18] greg-g twentyafterfour: https://gerrit.wikimedia.org/r/#/c/181211/ [23:37:25] 3Continuous-Integration: Do not run HHVM tests on mediawiki-core < 1.24 branches - https://phabricator.wikimedia.org/T85036#937299 (10awight) 3NEW [23:38:34] 3Continuous-Integration: Do not run HHVM tests on mediawiki-core < 1.24 branches - https://phabricator.wikimedia.org/T85036#937307 (10bd808) Actually anything older then 1.25wmf12 should have hhvm either disabled entirely or at least non-voting. 
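Both the Gearman deadlock workaround quoted above and the new T85034 request (warn when too many jobs sit queued) can be probed with Gearman's plain-text admin protocol, which is what `/usr/local/bin/zuul-gearman.py status` speaks. A hedged sketch follows, assuming the Zuul Gearman server answers on the standard port 4730; the threshold and output format are placeholders:

```
#!/usr/bin/env python
"""Sketch of a queued-jobs probe for T85034 using Gearman's admin protocol.

Sends the plain-text 'status' command and sums the queued builds.
Host, port and threshold are assumptions, not the final check.
"""
import socket
import sys


def gearman_status(host='gallium.wikimedia.org', port=4730, timeout=10):
    # Returns {function_name: (total, running, available_workers)}.
    sock = socket.create_connection((host, port), timeout)
    sock.sendall(b'status\n')
    data = b''
    while not data.endswith(b'.\n'):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()
    status = {}
    for line in data.decode().splitlines():
        if line == '.':
            break
        name, total, running, workers = line.split('\t')
        status[name] = (int(total), int(running), int(workers))
    return status


if __name__ == '__main__':
    threshold = 500  # placeholder; tune to whatever "too many" means
    try:
        status = gearman_status()
    except (socket.error, socket.timeout) as e:
        # A hung Gearman (the deadlock described above) shows up as a timeout.
        print('CRITICAL: cannot talk to Gearman: %s' % e)
        sys.exit(2)
    queued = sum(total - running for total, running, _ in status.values())
    if queued > threshold:
        print('WARNING: %d builds queued in Gearman' % queued)
        sys.exit(1)
    print('OK: %d builds queued' % queued)
    sys.exit(0)
```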
[23:46:30] PROBLEM - Free space - all mounts on deployment-cache-upload02 is CRITICAL: CRITICAL: deployment-prep.deployment-cache-upload02.diskspace._srv_vdb.byte_percentfree.value (<100.00%) [23:48:40] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #298: FAILURE in 10 min: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/298/ [23:54:51] Project beta-scap-eqiad build #34639: FAILURE in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/34639/ [23:57:13] Project browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #348: FAILURE in 23 min: https://integration.wikimedia.org/ci/job/browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/348/