[00:50:35] PROBLEM - Puppet staleness on deployment-prometheus01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[00:52:46] 10Gerrit, 10Phabricator: Disable Tjlsangria Gerrit and Phabricator accounts - https://phabricator.wikimedia.org/T147165#2683347 (10Legoktm)
[04:06:03] Yippee, build fixed!
[04:06:03] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #160: 09FIXED in 10 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/160/
[09:25:20] hashar: o/
[09:25:32] do you have a minute to help me with Jenkins and the puppet compiler?
[09:25:55] the slave seems not reachable and I am a bit ignorant about the magic to restore it :)
[09:32:27] elukey: sure in half an hour or so
[09:32:54] thank you!
[09:42:24] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94)
[09:43:22] PROBLEM - Host deployment-db1 is DOWN: CRITICAL - Host Unreachable (10.68.16.193)
[09:47:01] elukey: still in audio. Can you meanwhile file a bug for the puppet compiler? Phabricator should have a #puppet-compiler tag
[09:47:08] + continuous-integration-infrastructure
[09:50:03] hashar: ah yes but I wanted to ask you if there was some daemon to restart on the host to make Jenkins work again, maybe it doesn't need a phab task
[09:50:13] it seems that Jenkins is not able to re-launch the slave
[09:54:34] well
[09:54:39] if in doubt file a task :D
[09:54:53] they are cheap and spam all interested people
[09:55:00] at least puppet passes on the instance \o/
[09:55:24] zeljkof: https://integration.wikimedia.org/ci/computer/compiler02.puppet3-diffs.eqiad.wmflabs/ is offline
[09:55:27] sounds familiar?
[09:55:40] we had a slave locked last week
[09:56:08] yeah, same issue
[09:56:13] ssh slave plugin being crazy
[09:56:27] I just provisioned deployment-poolcounter03 with jessie for T123734 though can't login yet, I suspect because puppet is broken?
[09:56:52] godog: labs instance creation is broken afaik
[09:57:13] hashar: ah! is there a task to follow?
[09:57:18] a weird condition of LDAP being badly provisioned, puppet master being off, SSL cert not up-to-date etc
[09:57:38] yeah there must be a task. Something like new jessie instance labs
[09:57:40] perhaps
[09:57:50] godog: maybe it is fixable via salt?
[09:58:30] hashar: perhaps, no idea what the problem is atm
[09:58:43] deployment-salt02.deployment-prep.eqiad.wmflabs
[09:58:45] might save you
[09:58:56] but no time to go into the rabbit hole now, I'll let it be for now
[10:02:39] elukey: ok the Jenkins slave is back :)
[10:02:42] it seems that setting the host offline and trying to bring it back in jenkins did something
[10:02:45] elukey: had to kill a bunch of Java threads
[10:02:51] ahhhh you did something!
[10:02:52] :P
[10:02:59] there is some weird java deadlock in the code somewhere
[10:03:09] but on the compiler or on gallium?
[10:03:11] and one can see the java threads in https://integration.wikimedia.org/ci/monitoring?part=threads
[10:03:18] then randomly kill threads that are BLOCKED and have a related name
[10:03:33] that is on the Jenkins master side
[10:03:53] it happened on another slave last week
[10:04:02] ah okok, thanks a lot!
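(What hashar describes at 10:02-10:03, marking the node offline, bringing it back, and killing blocked threads on the master, can also be driven over the Jenkins HTTP API. A rough sketch; the credentials and any CSRF-crumb requirement are assumptions, not taken from the log:)

    # mark the wedged slave offline, then ask Jenkins to relaunch its SSH agent
    JENKINS=https://integration.wikimedia.org/ci
    NODE=compiler02.puppet3-diffs.eqiad.wmflabs
    curl -u "$USER:$API_TOKEN" -X POST \
        "$JENKINS/computer/$NODE/toggleOffline" --data-urlencode 'offlineMessage=ssh slave plugin deadlocked'
    curl -u "$USER:$API_TOKEN" -X POST "$JENKINS/computer/$NODE/launchSlaveAgent"
    # threads stuck in BLOCKED on the master remain visible at $JENKINS/monitoring?part=threads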
[10:04:24] and I have zero ideas how to debug/write java :D
[10:04:54] godog: 2016-10-03T10:03:22.873729+00:00 deployment-poolcounter03 puppet-agent[523]: Could not request certificate: getaddrinfo: Name or service not known
[10:05:07] godog: got it by asking for the console output on Wikitech
[10:05:13] should be available in Horizon as well
[10:05:26] hashar: yeah, I got the same from horizon
[10:05:34] most probably salt works
[10:05:45] so one can sed /etc/puppet.conf and or /etc/resolv.conf to fix it up
[10:07:44] salt not available bah
[10:08:31] sure I can hammer my way through, I'll report it instead as I got something not working through no fault of mine I think
[10:11:48] hashar: a slave offline again (just saw that)
[10:11:50] ?
[10:12:49] zeljkof: yeah same issue as last week. Jenkins ssh plugin being deadlocked
[10:12:57] argh
[10:13:05] I have just killed threads :D
[10:13:10] we really gotta upgrade Jenkins
[10:13:17] jenkins is going crazy again
[10:13:19] anyway laundry duty
[10:13:24] err
[10:13:25] is it?
[10:13:33] no, I mean, locking slaves
[10:13:47] upgrade before or after migration?
[10:13:52] after
[10:13:56] PROBLEM - Host deployment-poolcounter03 is DOWN: CRITICAL - Host Unreachable (10.68.19.250)
[10:14:00] I don't want to mess with several things at the same time :D
[10:14:47] that's me btw, trying to recreate poolcounter04 now
[10:21:56] !log add role::prometheus::node_exporter to classes in hiera:deployment-prep T144502
[10:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:28:29] PROBLEM - Puppet run on deployment-poolcounter04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[13:33:47] 10Beta-Cluster-Infrastructure, 06Operations, 10Traffic, 13Patch-For-Review: Upload cache in beta is broken - https://phabricator.wikimedia.org/T147116#2684318 (10BBlack) It would probably be better to upgrade the deployment-prep upload cache to varnish4.
[13:37:54] !log marked integration-slave-trusty-1014 offline. Can't run jobs / gets stuck somehow
[13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:39:20] !log integration-slave-trusty-1014 upgrading packages, cleaning up and rebooting it
[13:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:41:00] !log Tip of the day: to reboot an instance and bypass molly-guard: /sbin/reboot
[13:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:43:07] poolcounter failures are me btw, fixing
[13:43:22] !log Added integration-slave-trusty-1014 back in the pool
[13:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:58:30] RECOVERY - Puppet run on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:00:53] 10Beta-Cluster-Infrastructure, 06Operations, 10Traffic, 13Patch-For-Review: Upload cache in beta is broken - https://phabricator.wikimedia.org/T147116#2684352 (10AlexMonk-WMF) >>! In T147116#2684318, @BBlack wrote: > It would probably be better to upgrade the deployment-prep upload cache to varnish4. Okay...
[14:06:07] 10Beta-Cluster-Infrastructure, 06Operations, 10Traffic, 13Patch-For-Review: Upload cache in beta is broken - https://phabricator.wikimedia.org/T147116#2684357 (10BBlack) I wish :) The basic flow we're using on prod nodes is here, but some of that's inapplicable to deployment-prep: https://wikitech.wikimed...
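(The 13:37-13:43 !log entries above amount to roughly the following routine on the slave itself, once it has been marked offline in Jenkins. A sketch; the exact upgrade and cleanup steps are an assumption, only the molly-guard tip comes from the log:)

    # refresh a misbehaving Trusty slave, then put it back in the pool from the Jenkins UI
    sudo apt-get update && sudo apt-get -y dist-upgrade
    sudo apt-get -y autoremove --purge && sudo apt-get clean
    # tip of the day: calling /sbin/reboot directly bypasses molly-guard's hostname prompt
    sudo /sbin/reboot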
[14:07:47] (03PS6) 10Hashar: Rename builders tox and doxygen [integration/config] - 10https://gerrit.wikimedia.org/r/313308 (owner: 10Paladox)
[14:15:18] (03CR) 10Hashar: [C: 032] "Rebased, did an additional replacement." [integration/config] - 10https://gerrit.wikimedia.org/r/313308 (owner: 10Paladox)
[14:16:06] (03Merged) 10jenkins-bot: Rename builders tox and doxygen [integration/config] - 10https://gerrit.wikimedia.org/r/313308 (owner: 10Paladox)
[14:27:12] (03CR) 10Hashar: "recheck" [integration/config] - 10https://gerrit.wikimedia.org/r/313306 (owner: 10Paladox)
[14:30:57] hashar thank you for merging :)
[14:41:22] (03PS3) 10Hashar: Replace deprecated logrotate with build-discarder [integration/config] - 10https://gerrit.wikimedia.org/r/313306 (owner: 10Paladox)
[14:41:52] (03CR) 10Hashar: [C: 04-1] "Some jobs lack the BuildDiscarderProperty :(" [integration/config] - 10https://gerrit.wikimedia.org/r/313306 (owner: 10Paladox)
[14:42:02] paladox: will look at the build discarder one :D
[14:42:05] it is not ready yet
[14:42:09] Oh
[14:42:27] I am looking at it
[14:44:23] paladox: yeah the fix is easy. Some jobs have "properties" which completely override the value from the default
[14:44:29] JJB does not inherit/merge properties
[14:44:33] Oh
[14:44:36] Thanks
[14:45:57] hashar guessing that is a bug in zuul?
[14:50:11] (03PS4) 10Hashar: Replace deprecated logrotate with build-discarder [integration/config] - 10https://gerrit.wikimedia.org/r/313306 (owner: 10Paladox)
[14:51:25] (03CR) 10Hashar: [C: 031] "I have added the discardbuild property on jobs that had a more specific properties section. JJB does not merge the values from the defaul" [integration/config] - 10https://gerrit.wikimedia.org/r/313306 (owner: 10Paladox)
[14:51:26] That seems to be a bug, if it doesn't inherit and merge
[14:51:40] JJB does not inherit/merge properties
[14:52:01] Oh
[14:52:03] the thing is sometimes you want to override, other times you want to merge :]
[14:52:08] Yep
[14:52:10] anyway
[14:52:15] that is updating most jobs so gotta babysit it
[14:52:24] Yep, updating most jobs
[15:08:42] 06Release-Engineering-Team, 10MediaWiki-General-or-Unknown, 06Operations, 10Traffic, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2471564 (10BBlack) Is there more to do here on the MW-Core side of things?
[15:14:21] Beta people: Can anyone clarify for me what determines which hosts have base::firewall applied and which don't?
[15:16:38] hashar hi, would you be able to review https://gerrit.wikimedia.org/r/#/c/313213/ and https://gerrit.wikimedia.org/r/#/c/313230/ please
[15:16:39] ?
[15:16:46] first link is adding a php7 pipeline
[15:17:04] andrewbogott, doesn't it depend on whether they have a role that includes it?
[15:17:08] second is for reusing php lint code
[15:17:25] Krenair: yes — my question is why some do and some don't
[15:17:44] (It might be a perfectly reasonable mirror of prod, or it might be haphazard, I can't tell.)
[15:17:45] (03PS3) 10Paladox: Reuse phplint code in job-template.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/313230
[15:17:52] (03PS2) 10Paladox: Add php7 pipeline for zuul [integration/config] - 10https://gerrit.wikimedia.org/r/313213 (https://phabricator.wikimedia.org/T144872)
[15:17:58] some don't have roles that include it?
[15:18:17] are you looking into a specific one?
[15:19:55] hm...
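(One way to answer andrewbogott's question above is to look for role classes that pull in base::firewall in an operations/puppet checkout; a sketch, with the checkout path as a placeholder. Hosts would get the class either via one of those roles or via a direct class import in their node definition:)

    # list role manifests that include base::firewall
    cd ~/src/operations/puppet
    grep -rlE 'include (::)?base::firewall' modules/role/manifests/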
[15:20:10] (03PS2) 10Paladox: Add composer-php70 as a experimental test to mediawiki/core [integration/config] - 10https://gerrit.wikimedia.org/r/309556 (https://phabricator.wikimedia.org/T144961)
[15:20:21] mostly I'm trying to figure out if it makes sense to enforce a "only roles at top-level node definition" and base::firewall is a prime offender
[15:20:37] but it's equally so in prod and in beta so I'll see about changing it in prod first and see who complains :)
[15:20:54] (03PS5) 10Paladox: Update the mediawiki core tests to also test against php7 [integration/config] - 10https://gerrit.wikimedia.org/r/313223 (https://phabricator.wikimedia.org/T144964)
[15:21:50] andrewbogott: Krenair: I think we used to have ferm rules applied on beta cluster due to the role classes
[15:21:54] and required base::firewall on them
[15:22:03] maybe try to drop it on an instance and see what happens? :(
[15:22:04] (03CR) 10Paladox: "@hashar https://phabricator.wikimedia.org/T137770 was resolved and I think arcanist is available for trusty now." [integration/config] - 10https://gerrit.wikimedia.org/r/295976 (owner: 1020after4)
[15:22:12] gotta rush out though sorry :(
[15:22:23] hasharAway: I don't think it's necessarily incorrect to have it applied; just trying to understand.
[15:23:44] PROBLEM - Puppet run on deployment-cache-upload04 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:25:17] 10Beta-Cluster-Infrastructure, 06Operations, 10Traffic, 13Patch-For-Review: Upload cache in beta is broken - https://phabricator.wikimedia.org/T147116#2684616 (10AlexMonk-WMF) 05Open>03Resolved a:03AlexMonk-WMF >>! In T147116#2684357, @BBlack wrote: > I wish :) Yeah I knew you were gonna say that....
[15:26:21] 10Beta-Cluster-Infrastructure, 06Operations, 10Traffic, 13Patch-For-Review: Upload cache in beta is broken - https://phabricator.wikimedia.org/T147116#2684621 (10BBlack) I think we can abandon the patch. We're assuming we're past the point of reverting to varnish3 for the upload caches at this point, just...
[15:28:34] (03PS5) 10Paladox: [DonationInterface] Switch jenkins tests to extension-unittests-composer-non-voting [integration/config] - 10https://gerrit.wikimedia.org/r/307543
[15:31:11] (03PS6) 10Paladox: [DonationInterface] Switch jenkins tests to extension-unittests-composer-non-voting [integration/config] - 10https://gerrit.wikimedia.org/r/307543
[15:31:53] PROBLEM - Puppet run on deployment-db03 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[15:34:13] (03CR) 10Paladox: [DonationInterface] Switch jenkins tests to extension-unittests-composer-non-voting (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/307543 (owner: 10Paladox)
[15:35:13] ejegg hi, I'm wondering if you could fix this test https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer/7054/console please?
[15:35:34] So that we can change it from non-voting to voting in https://gerrit.wikimedia.org/r/#/c/307543/
[15:35:35] please
[15:35:36] ?
[15:36:30] Hi! I can definitely take a look at it
[15:37:36] ejegg thank you :)
[15:38:43] RECOVERY - Puppet run on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:40:55] ejegg I think it is not loading this https://github.com/wikimedia/mediawiki-extensions-DonationInterface/blob/30c1339790649bed1410e4a0e4306298bae88a5b/tests/phpunit/TestConfiguration.php#L53 file
[15:41:12] so it doesn't know it is an alias to TestingGlobalCollectAdapter
[15:41:55] right, it must not be. I just need to deal with an icinga alert about one of the last old queues, then I'll take a look at that define
[15:42:43] Ok, thanks
[15:48:00] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Beta puppetmaster cherry-pick process - https://phabricator.wikimedia.org/T135427#2684681 (10thcipriani) Fixed up https://gerrit.wikimedia.org/r/310719 based on some review from @Volans. Additional review is welcome! I think this...
[15:49:32] ejegg maybe related https://phabricator.wikimedia.org/T142121 (when you've finished what you're doing :))
[15:53:12] ejegg I tried it on REL1_27 and it resulted in https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer/7056/console success
[15:53:24] so looks like a change either in the extension or MW 1.28 broke it
[15:53:45] I am going to try something by moving the test files into a subfolder to evade the internal change
[16:04:58] ejegg i found a fix https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer/7059/console
[16:04:59] :)
[16:05:04] https://gerrit.wikimedia.org/r/#/c/313822/1
[16:05:14] (need to update commit msg) But was testing a fix
[16:05:19] But other errors now show
[16:05:48] oh cool, thanks for looking into it!
[16:06:20] You're welcome
[16:06:33] ejegg it shows Undefined index: pageLanguage now
[16:11:52] RECOVERY - Puppet run on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:13:39] 03Scap3: scap version flag - https://phabricator.wikimedia.org/T147155#2684779 (10mmodell) gotta watch out for scap2 commands which use --version to refer to a mediawiki version.
[16:13:47] ejegg oh it passes https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer/7060/console now
[16:13:55] After another "check experimental" :)
[16:14:48] I updated https://gerrit.wikimedia.org/r/#/c/313822/ commit msg now :)
[16:14:51] ejegg ^^ :)
[16:18:41] (03CR) 10Paladox: [DonationInterface] Switch jenkins tests to extension-unittests-composer-non-voting (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/307543 (owner: 10Paladox)
[16:19:35] PROBLEM - Puppet run on deployment-prometheus01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[16:25:34] RECOVERY - Puppet staleness on deployment-prometheus01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[16:35:05] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Delete deployment-db1 and deployment-db2 - https://phabricator.wikimedia.org/T147110#2684845 (10dduvall) a:03dduvall
[16:38:45] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10DBA, 13Patch-For-Review, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2684858 (10dduvall)
[16:38:47] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Delete deployment-db1 and deployment-db2 - https://phabricator.wikimedia.org/T147110#2684857 (10dduvall) 05Open>03Resolved
[16:49:36] RECOVERY - Puppet run on deployment-prometheus01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:19:32] PROBLEM - Puppet run on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:21:24] 10Beta-Cluster-Infrastructure, 07Puppet: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2685129 (10AlexMonk-WMF)
[17:39:47] 10Beta-Cluster-Infrastructure, 07Puppet: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2685207 (10KartikMistry) This is happening due to cherry-pick of https://gerrit.wikimedia.org/r/#/c/308679/ which is for testing before deployment in Produ...
[18:24:21] PROBLEM - Host integration-puppetmaster is DOWN: CRITICAL - Host Unreachable (10.68.16.42)
[18:37:52] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:55:59] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[19:02:13] (03PS1) 10Aude: Update Wikidata branch - wmf/1.28.0-wmf.21 [tools/release] - 10https://gerrit.wikimedia.org/r/313861
[19:12:52] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:30:58] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:59:11] anyone here understand jenkins? i mean, in a really intimate way?
[19:59:11] jenkins thinks there's a conflict in some "cross-repo dependencies" in https://gerrit.wikimedia.org/r/313876 -- i didn't even know jenkins had cross-repo dependency management
[19:59:12] i'd love to know what repos got crossed with mine
[19:59:12] and how to uncross them
[20:02:09] hasharAway ^^
[20:03:26] yeah hmm ?
[20:03:31] cscott: looking
[20:04:25] cscott: somehow it can't merge https://gerrit.wikimedia.org/r/#/c/313876/ against the tip of the branch. Let me check the logs
[20:04:57] * hashar logs in to scandium.eqiad.wmnet and inspects /var/log/zuul/*.log
[20:05:01] it also happened with the previous patch to this repo, IIRC. i had to force-push to solve.
[20:05:52] GitCommandError: 'git clone -v ssh://jenkins-bot@gerrit.wikimedia.org:29418/mediawiki/extensions/Collection/OfflineContentGenerator /srv/ssd/zuul/git/mediawiki/extensions/Collection/OfflineContentGenerator' returned with exit code 128
[20:05:52] stderr: 'fatal: destination path '/srv/ssd/zuul/git/mediawiki/extensions/Collection/OfflineContentGenerator' already exists and is not an empty directory.
[20:05:52] :(
[20:07:05] paladox: there is no point in hitting recheck
[20:07:13] oh sorry
[20:07:39] so there's just some old files that need to be cleaned up?
[20:07:49] names conflict
[20:08:07] mediawiki/extensions/Collection having a directory named OfflineContentGenerator/
[20:08:28] which clashes when trying to clone mediawiki/extensions/Collection/OfflineContentGenerator
[20:09:38] hm, that was never a problem in the past
[20:09:55] but in fact extensions/Collection/OfflineContentGenerator appears to be completely unused, so that could be an easy workaround
[20:10:40] Is that another bug in zuul, that it doesn't support the same-named repo even if it does Repo-name/sub-repo
[20:10:41] ?
[20:11:36] awight hi, could you merge https://gerrit.wikimedia.org/r/#/c/313822/ please?
[20:11:53] Fixes the unit tests so i can make the composer job voting instead of non-voting?
[20:14:06] cscott: I am retrying
[20:14:36] cscott: basically I have created the repo manually using git init && git fetch && git symbolic-ref refs/remotes/origin/HEAD
[20:14:52] it is definitely an issue in Zuul
[20:14:59] but it happens once per quarter or so
[20:15:58] cscott: there are some jscs issues, not sure how much you care about them
[20:16:31] cscott: also I have overhauled the OCG Grafana board with a few more metrics https://grafana.wikimedia.org/dashboard/db/ocg
[20:16:55] i can fix the jscs issues
[20:17:15] the ocg grafana board is very nice! i was trying to build a kibana dashboard today.
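(The manual workaround hashar mentions at 20:14, "git init && git fetch && git symbolic-ref", would look roughly like this on the Zuul merger host. A sketch; the remote URL comes from the error quoted above, but the choice of master as the default branch is an assumption:)

    # recreate the clashing repo in place so Zuul no longer tries to clone into a non-empty directory
    cd /srv/ssd/zuul/git/mediawiki/extensions/Collection/OfflineContentGenerator
    git init
    git fetch ssh://jenkins-bot@gerrit.wikimedia.org:29418/mediawiki/extensions/Collection/OfflineContentGenerator \
        '+refs/heads/*:refs/remotes/origin/*'
    git symbolic-ref refs/remotes/origin/HEAD refs/remotes/origin/master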
[20:17:28] paladox: awesome, much appreciated.
[20:18:01] cscott: the queue size change is a copy-paste from the graph I built on one of the mw jobrunner boards
[20:18:18] and I have looked at OCG source to figure out some other potentially interesting metrics
[20:18:27] but really, I have no idea whether they make sense :D
[20:18:36] that looks nice though
[20:18:58] awight you're welcome :)
[20:20:35] (03PS7) 10Paladox: [DonationInterface] Switch jenkins tests to extension-unittests-composer [integration/config] - 10https://gerrit.wikimedia.org/r/307543
[20:20:47] awight i've updated ^^ now :)
[20:20:51] thanks for merging too
[20:20:52] :)
[20:22:45] paladox: btw, my availability is about to take a nose-dive from already unimpressive heights :) -- if you want my team's attention for any of the CI issues you've been helping with, try #wikimedia-fundraising
[20:28:51] (03CR) 10Awight: [C: 04-1] "I don't think we're quite ready--the mediawiki-extensions-hhvm test still fails:" [integration/config] - 10https://gerrit.wikimedia.org/r/307543 (owner: 10Paladox)
[20:29:18] I am reviewing this one
[20:31:42] PROBLEM - Puppet run on deployment-ms-be01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:33:25] (03CR) 10Paladox: "@Awight this switches the test to extension-unittests-composer" [integration/config] - 10https://gerrit.wikimedia.org/r/307543 (owner: 10Paladox)
[20:33:43] hashar which one?
[20:33:59] (03CR) 10Hashar: [C: 04-1] "I am not sure whether we should use composer to ship dependencies. For extensions deployed at wikimedia, we ship dependencies solely via " [integration/config] - 10https://gerrit.wikimedia.org/r/307543 (owner: 10Paladox)
[20:34:26] awight: paladox so for Donation Interface, I think we should remove the generic job. I don't think it has much use
[20:34:48] hashar oh, so we don't want to use the generic test
[20:34:51] I mean the job "mwext-testextension-hhvm-non-voting"
[20:34:56] Oh
[20:35:03] Yeh
[20:35:17] it uses mediawiki/vendor @ master
[20:35:20] Already done in https://gerrit.wikimedia.org/r/#/c/307543/7/zuul/layout.yaml
[20:35:22] which is missing dependencies for sure
[20:35:32] should use mediawiki/vendor@fundraising/REL1_27
[20:35:37] Yep,
[20:35:38] which is what the other job is doing
[20:35:53] So I should remove - name: extension-unittests-non-voting
[20:36:06] Since this mwext-donationinterfacecore-REL1_27-testextension-zend55 is already doing it?
[20:39:34] paladox: hashar: ah this brings up something else--we realized last week that we won't migrate any fundraising boxes to HHVM any time this year. We only need to test on Zend PHP 5.3 and 5.5
[20:40:03] Oh, yep
[20:40:24] (03PS1) 10Paladox: [DonationInterface] Remove test extension-unittests-non-voting [integration/config] - 10https://gerrit.wikimedia.org/r/313890
[20:40:26] hashar awight ^^
[20:40:49] (03CR) 10Hashar: "Second thought!" [integration/config] - 10https://gerrit.wikimedia.org/r/307543 (owner: 10Paladox)
[20:41:13] awight: :(
[20:41:23] (03CR) 10jenkins-bot: [V: 04-1] [DonationInterface] Remove test extension-unittests-non-voting [integration/config] - 10https://gerrit.wikimedia.org/r/313890 (owner: 10Paladox)
[20:41:36] definitely aim at phasing out Zend 5.3 though. Precise is gone in May or June 2017.
[20:42:00] :)
[20:42:19] hashar it seems better to do https://gerrit.wikimedia.org/r/307543
[20:42:27] awight: we should sit down with the FR team at some point and rethink the jobs running for your repos
[20:42:29] Since it tests against master too :0
[20:42:38] +1
[20:42:46] hashar: I would love to
[20:43:04] awight: I think Ejegg and Tyler paired about it
[20:43:11] same TZ, might be a good lead
[20:44:09] or bring it to a list :]
[20:44:12] I am going to bed
[20:53:30] PROBLEM - Puppet run on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[21:08:19] 10Beta-Cluster-Infrastructure, 06Labs: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2686048 (10Andrew)
[21:10:10] twentyafterfour: the beta::deployaccess class says up top 'remove this if https://phabricator.wikimedia.org/T121721 is fixed.'
[21:10:13] It is — can I really remove it?
[21:11:42] RECOVERY - Puppet run on deployment-ms-be01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:24:37] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:28:31] RECOVERY - Puppet run on integration-slave-trusty-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:30:50] andrewbogott: I think so
[21:31:16] twentyafterfour: if I pull all the references out of ldap, will you be around for a bit to notice if things go haywire?
[21:31:25] andrewbogott: one caveat, the deploy-service user is a local user on deployment-tin/mira
[21:31:33] and that's what we have the exception for
[21:31:48] does that matter for the fix?
[21:32:07] thcipriani: …understanding what you're saying would require me to understand that class a lot more than I understand it now. I'm just going by the fact that it says it can be removed.
[21:32:57] this is for mwdeploy
[21:33:02] oh hmm
[21:33:10] sure, I just don't know what the fix was, so I can't say. If the fix relies on users being known to ldap, then it won't work in this instance.
[21:33:21] (I'm pretty sure)
[21:33:36] can't we create service users in ldap via wikitech?
[21:34:09] andrewbogott: what references are in ldap that you mentioned above?
[21:34:57] twentyafterfour: 14 hosts in deployment-prep have that class included, via the wikitech UI (which is ldap)
[21:35:15] ohh
[21:35:39] So if I'm going to remove the class I need to remove the class from those node definitions first.
[21:35:47] else their puppet runs will fail
[21:35:49] it's a little late in my deploy window, but i'm about to deploy an updated OCG (which among other things will stop our en.wiktionary.org DoS)
[21:36:09] it looks like you can remove it since https://gerrit.wikimedia.org/r/#/c/286852/2/modules/scap/manifests/target.pp seems to implement the same thing
[21:36:38] ah, yup, that should work :)
[21:36:39] andrewbogott: ^ from my reading of that patch it should be ok to remove beta::deployaccess
[21:36:41] !log starting OCG deploy
[21:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[21:36:59] twentyafterfour: ok, we'll see what happens :)
[21:37:15] andrewbogott: cool
[21:37:26] um, did the ssh host key for deployment-tin.eqiad.wmflabs change?
[21:37:42] cscott: I think so, somewhat recently
[21:37:48] The fingerprint for the ECDSA key sent by the remote host is SHA256:2GreQk3ZeEOCHBzfQA0fzxSUrS8LOlNb7L0TZyU0pLY.
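(If the key change is expected, i.e. the instance really was rebuilt as suggested here, the stale entry can be dropped and the new fingerprint accepted on the next connection; a minimal sketch:)

    # remove the stale known_hosts entry for the recreated instance, then reconnect and verify the new fingerprint
    ssh-keygen -R deployment-tin.eqiad.wmflabs
    ssh deployment-tin.eqiad.wmflabs true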
[21:38:08] twentyafterfour: done
[21:38:08] it was recreated recently
[21:38:49] twentyafterfour: can I get a +1 on https://gerrit.wikimedia.org/r/#/c/313903/ ?
[21:39:22] andrewbogott: done
[21:39:28] thanks
[21:40:18] Project beta-scap-eqiad build #122775: 04FAILURE in 2 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122775/
[21:42:42] !log updated OCG to version 0bf27e3452dfdc770317f15793e93e6e89c7865a
[21:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[21:44:01] hrm, do the mediawiki servers have the scap::target class?
[21:45:39] it would appear not
[21:45:43] 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2686205 (10Andrew)
[21:45:46] > pam_access(sshd:account): access denied for user `mwdeploy' from `deployment-tin.deployment-prep.eqiad.wmflabs
[21:46:08] which is causing the beta-scap-eqiad failure
[21:46:49] Project beta-scap-eqiad build #122776: 04STILL FAILING in 1 min 58 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122776/
[21:47:08] andrewbogott: twentyafterfour ^ may need to revert and figure a slightly different solution.
[21:47:35] ok… let me know
[21:48:40] Project beta-scap-eqiad build #122777: 04STILL FAILING in 1 min 49 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122777/
[21:48:51] that bot is really upset
[21:49:00] heh, yeah
[21:49:17] andrewbogott: let's revert that patch for now, and we'll work on something that will make that less upset :)
[21:49:39] ok
[21:49:49] that fix only works for scap3 on beta, but now mediawiki update on beta is broken :\
[21:50:03] (since mediawiki is not on scap3 yet)
[21:50:05] so, do you happen to know the set of VMs that need the class applied?
[21:52:05] could figure it out, they're listed in /etc/dsh/group/{mediawiki-*,scap-*}
[21:53:13] but scap::target is a define that is used to setup specific services on a target so we can't just apply it to the broken nodes, seemingly
[21:53:14] 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2686255 (10Andrew)
[21:55:45] thcipriani: the class is back available in puppet now
[21:56:12] My agenda here is T147233, trying to not have instances apply non-role classes directly.
[21:56:33] So if you want to wrap that class up in a role, or rewrite it as a role, I can review and merge.
[21:56:50] Project beta-scap-eqiad build #122778: 04STILL FAILING in 1 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122778/
[21:57:13] oh! I have to reapply it to the machines, eh?
[21:58:00] can we define a scap::target for mediawiki?
[21:58:12] well, ideally we would be applying via the horizon UI, but that would require it to be a role
[21:58:48] eh, well, currently scap::target would try to install mediawiki using the scap3 provider...
[21:59:27] thcipriani: what if the /srv/deployment/mediawiki was just a dummy repo for now (we will need it soon enough once we start building the combined mediawiki repo)
[21:59:51] (03PS1) 10Paladox: [XenForoAuth] Add jenkins tests [integration/config] - 10https://gerrit.wikimedia.org/r/313924
[22:00:34] twentyafterfour: I don't know, probably would be fine.
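(The affected hosts thcipriani points to at 21:52 can be listed straight from the dsh group files on the deployment server; a sketch, where the exact file names matched by those globs are not given in the log:)

    # on deployment-tin: the beta hosts that scap currently targets, per the dsh group files
    cat /etc/dsh/group/mediawiki-* /etc/dsh/group/scap-* 2>/dev/null | sort -u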
[22:01:35] weeellll
[22:04:36] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:05:54] !log reapplied beta::deployaccess to mediawiki servers
[22:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[22:06:08] beta-scap-eqiad should recover shortly
[22:06:45] Project beta-scap-eqiad build #122779: 04STILL FAILING in 1 min 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122779/
[22:07:20] ahem
[22:16:39] thcipriani, twentyafterfour, do we need to re-open https://phabricator.wikimedia.org/T121721 or is something else happening now?
[22:16:42] Project beta-scap-eqiad build #122780: 04STILL FAILING in 1 min 51 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122780/
[22:17:07] andrewbogott: I have a patch in process that should allow the removal of beta::deployaccess
[22:17:16] ah, great, thank you.
[22:20:42] 06Release-Engineering-Team, 10Phabricator, 13Patch-For-Review, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2686313 (10jcrespo) I have done #3 of T146673#2675083. The stopwords handling requires a patch to ada...
[22:23:33] 06Release-Engineering-Team, 10Phabricator, 13Patch-For-Review, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2686315 (10Paladox) @jcrespo yeh,but https://gerrit.wikimedia.org/r/313235 should improve things more...
[22:26:51] Project beta-scap-eqiad build #122781: 04STILL FAILING in 1 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122781/
[22:28:40] Project beta-scap-eqiad build #122782: 04STILL FAILING in 1 min 47 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122782/
[22:30:51] andrewbogott: https://gerrit.wikimedia.org/r/#/c/313927/ would that work for what you're trying to do?
[22:31:09] (just wraps what's there in a role instead of a standalone thing)
[22:31:26] thcipriani: Sure.
[22:32:09] want me to go ahead and merge it? You'll need to change the references again :(
[22:32:35] yeah, I can do that, not that many hosts.
[22:32:56] merge at will.
[22:33:51] done — thanks!
[22:34:53] okie doke, I'll add it to the mw hosts in deployment-prep and try it out.
[22:35:36] 06Release-Engineering-Team, 10Phabricator, 13Patch-For-Review, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2686338 (10Paladox) @jcrespo Hi, I'm wondering would you be able to do the patch that create an table...
[22:36:40] 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2686339 (10Andrew)
[22:36:49] Project beta-scap-eqiad build #122783: 04STILL FAILING in 2 min 0 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122783/
[22:40:00] !log manual rebase on deployment-puppetmaster:/var/lib/git/operations/puppet
[22:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[22:42:28] PROBLEM - Puppet run on deployment-cache-text04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[22:46:46] Yippee, build fixed!
[22:46:47] Project beta-scap-eqiad build #122784: 09FIXED in 1 min 55 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122784/
[22:47:08] yippee
[22:47:27] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[22:48:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki05 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50638 bytes in 0.045 second response time
[22:48:57] PROBLEM - App Server Main HTTP Response on deployment-mediawiki04 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50638 bytes in 0.042 second response time
[22:50:07] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'https://en.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 929 bytes in 0.054 second response time
[22:50:18] is anyone exploring what's wrong with the beta cluster greg-g ? Or should I look into that?
[22:50:19] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string 'Wikipedia' not found on 'https://en.m.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 2236 bytes in 0.055 second response time
[22:51:48] I'm not sure what's wrong, there is one message in logstash that's blown up: Fatal error: Cls: Expected string or object in /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php on line 218
[22:52:05] PROBLEM - App Server Main HTTP Response on deployment-mediawiki06 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50638 bytes in 0.042 second response time
[22:53:29] Probably poke Aaron
[22:55:28] just did so in -operations
[22:57:28] RECOVERY - Puppet run on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:02:26] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:04:47] hey, sorry, was in a 3 hour long meeting/training
[23:20:04] Project beta-update-databases-eqiad build #11806: 04FAILURE in 3.8 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11806/
[23:34:49] 10Beta-Cluster-Infrastructure, 10MediaWiki-General-or-Unknown: LoadBalancer fatals on Beta cluster rendering pages inaccessible - https://phabricator.wikimedia.org/T147240#2686432 (10Jdlrobson)
[23:50:13] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2686487 (10Jdforrester-WMF)
[23:50:25] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2623702 (10Jdforrester-WMF)
[23:50:27] 10Beta-Cluster-Infrastructure, 10MediaWiki-General-or-Unknown: LoadBalancer fatals on Beta cluster rendering pages inaccessible - https://phabricator.wikimedia.org/T147240#2686492 (10Krenair) p:05Triage>03Unbreak! Might be something to do with https://gerrit.wikimedia.org/r/#/c/310757/ ? Not sure if this a...
[23:51:54] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2686499 (10Krenair)
[23:51:56] 10Beta-Cluster-Infrastructure, 10MediaWiki-General-or-Unknown: LoadBalancer fatals on Beta cluster rendering pages inaccessible - https://phabricator.wikimedia.org/T147240#2686498 (10Krenair)