[03:29:19] PROBLEM - Puppet staleness on deployment-redis01 is CRITICAL 100.00% of data above the critical threshold [43200.0] [06:20:30] (03PS1) 10Ragesoss: Allow generic ApiError without response, fix error with token_type [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 [06:38:48] RECOVERY - Free space - all mounts on deployment-bastion is OK All targets OK [08:08:06] zeljkof-meeting: lost ya :D [09:03:58] zeljkof-meeting: and I received a mail telling me my self review form ended up being empty :((( [09:11:27] hashar: ouch [09:11:57] zeljkof: and I spent a whole evening writing it :] Hopefully the second pass will take less time [09:21:29] hashar: good luck :) [09:40:44] 10Beta-Cluster, 6Labs, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1275656 (10hashar) >>! In T98289#1274035, @bd808 wrote: > I might as well take this at this point. :) Thanks for stepping in! I was not sure who might know about our logging processing and would b... [09:44:40] 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1275659 (10hashar) [10:11:16] 6Release-Engineering, 10MediaWiki-General-or-Unknown: mediawiki-ruby-api gem requires passing "token_type: false" for some queries - https://phabricator.wikimedia.org/T98719#1275724 (10zeljkofilipin) [10:54:23] (03CR) 10Hashar: [C: 032] Switch beta udp2log host to deployment-fluorine [tools/scap] - 10https://gerrit.wikimedia.org/r/209830 (https://phabricator.wikimedia.org/T98289) (owner: 10BryanDavis) [10:54:43] (03Merged) 10jenkins-bot: Switch beta udp2log host to deployment-fluorine [tools/scap] - 10https://gerrit.wikimedia.org/r/209830 (https://phabricator.wikimedia.org/T98289) (owner: 10BryanDavis) [11:55:27] PROBLEM - Puppet failure on deployment-bastion is CRITICAL 66.67% of data above the critical threshold [0.0] [12:20:27] RECOVERY - Puppet failure on deployment-bastion is OK 
Less than 1.00% above the threshold [0.0] [13:36:43] (03PS1) 10Hashar: Tie operations-dns-lint to productionSlaves [integration/config] - 10https://gerrit.wikimedia.org/r/210049 [13:36:55] (03CR) 10Hashar: [C: 032] Tie operations-dns-lint to productionSlaves [integration/config] - 10https://gerrit.wikimedia.org/r/210049 (owner: 10Hashar) [13:39:46] (03PS1) 10Hashar: php-{name}-build to labs slaves [integration/config] - 10https://gerrit.wikimedia.org/r/210050 [13:43:57] (03Merged) 10jenkins-bot: Tie operations-dns-lint to productionSlaves [integration/config] - 10https://gerrit.wikimedia.org/r/210049 (owner: 10Hashar) [13:44:14] (03CR) 10Hashar: [C: 032] php-{name}-build to labs slaves [integration/config] - 10https://gerrit.wikimedia.org/r/210050 (owner: 10Hashar) [13:46:02] (03Merged) 10jenkins-bot: php-{name}-build to labs slaves [integration/config] - 10https://gerrit.wikimedia.org/r/210050 (owner: 10Hashar) [13:47:11] (03PS1) 10Hashar: Delete unused pywikibot-core-tests [integration/config] - 10https://gerrit.wikimedia.org/r/210053 [13:47:22] (03CR) 10Hashar: [C: 032] Delete unused pywikibot-core-tests [integration/config] - 10https://gerrit.wikimedia.org/r/210053 (owner: 10Hashar) [13:48:56] (03CR) 10jenkins-bot: [V: 04-1] Delete unused pywikibot-core-tests [integration/config] - 10https://gerrit.wikimedia.org/r/210053 (owner: 10Hashar) [13:50:20] (03PS2) 10Hashar: Delete unused pywikibot-core-tests [integration/config] - 10https://gerrit.wikimedia.org/r/210053 [13:50:54] (03PS1) 10Hashar: Move pep8/pyflakes jobs to labs instances [integration/config] - 10https://gerrit.wikimedia.org/r/210056 [13:54:37] !log Jenkins: removing label hasContintPackages from production slaves, it is no more needed :) [13:54:39] Logged the message, Master [14:01:30] (03CR) 10Hashar: [C: 032] Delete unused pywikibot-core-tests [integration/config] - 10https://gerrit.wikimedia.org/r/210053 (owner: 10Hashar) [14:03:16] (03Merged) 10jenkins-bot: Delete unused pywikibot-core-tests 
[integration/config] - 10https://gerrit.wikimedia.org/r/210053 (owner: 10Hashar) [14:03:28] (03CR) 10Hashar: [C: 032] Move pep8/pyflakes jobs to labs instances [integration/config] - 10https://gerrit.wikimedia.org/r/210056 (owner: 10Hashar) [14:05:34] (03Merged) 10jenkins-bot: Move pep8/pyflakes jobs to labs instances [integration/config] - 10https://gerrit.wikimedia.org/r/210056 (owner: 10Hashar) [15:06:21] 10Continuous-Integration-Infrastructure, 10MonoBook, 10Vector: Add jenkins jobs for mediawiki/skins/CologneBlue, Nostalgia, Modern, Example, MonoBook, Vector - https://phabricator.wikimedia.org/T68926#1276232 (10Paladox) Should this be closed since they run jobs now. [15:24:05] 10Beta-Cluster: deployment-mediawiki02 enwiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused - https://phabricator.wikimedia.org/T71978#1276291 (10hashar) 5Open>3Resolved a:3hashar Was resolved somehow, the error is no more showing. [15:47:07] (03PS10) 10JanZerebecki: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (owner: 10Soeren.oldag) [15:49:47] (03CR) 10JanZerebecki: [C: 04-1] "The order of the apply scripts was corrrect. But the failure to load the Wikibase classes is fixed by first loading Wikibase then Wikidata" [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (owner: 10Soeren.oldag) [15:52:39] So recently there was a change merged into extension/gwtoolset (about half an hour ago). But beta cluster still shows the old version [15:53:13] According to jenkins, beta-scap-eqiad et al have run in the intervening time [15:53:23] So what's up with that? 
[15:54:41] relavent patch is https://gerrit.wikimedia.org/r/#/c/207329/ (note, if it matters, I manually +2 verified the patch, as unit tests are broken for that repo) [15:57:17] bawolff: hmm, I can see that change on the deployment bastion [15:58:03] I'm trying to verify if its live by looking at the git hash for gwtoolset at http://commons.wikimedia.beta.wmflabs.org/wiki/Special:Version [15:58:53] * thcipriani checks deployment-mediawiki01 [16:00:20] thcipriani: -bastion was having disk space warnings this weekend [16:00:44] greg-g: bastion has a problem, generally speaking, with /var [16:01:08] coulda stopped after the second comma [16:01:19] (er, before, to be grammatically correct, ish) [16:01:45] I feel like generally speaking would always be a non-essential appositive in that sentence, FWIW. [16:02:41] bawolff: I do see the new checkMaxPostSize function in includes/Helpers/FileChecks.php, I'm not sure why special:version hasn't updated... [16:02:52] this is on deployment-mediawiki01 [16:03:11] thcipriani: Ok, I was waiting for special:version to update before testing, I'll just test now [16:03:37] It works. Thank you for looking into it [16:03:49] thcipriani: way to just go all english major there with "a non-essential appositive" [16:03:56] bawolff: yw! [16:04:33] greg-g: I live with an English major. We have a lot of oxford-comma-related arguments :) [16:04:58] thcipriani: :) [16:06:15] 10Beta-Cluster, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7HHVM, 5Patch-For-Review: GWToolset XML upload fails with “The file that was uploaded exceeds the upload_max_filesize and/or the post_max_size directive in php.ini” on hhvm 3.6 - https://phabricator.wikimedia.org/T97415#1276399 (10Bawolff) 5... 
[16:22:27] 10Beta-Cluster, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7HHVM, 5Patch-For-Review: GWToolset XML upload fails with “The file that was uploaded exceeds the upload_max_filesize and/or the post_max_size directive in php.ini” on hhvm 3.6 - https://phabricator.wikimedia.org/T97415#1276468 (10JeanFred) C... [16:30:22] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string 'Wikipedia' not found on 'http://en.m.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 3102 bytes in 1.827 second response time [16:30:32] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50442 bytes in 0.355 second response time [16:31:22] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 3098 bytes in 1.662 second response time [16:31:52] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50442 bytes in 0.518 second response time [16:41:49] lots of this in beta: Fatal error: Object does not implement ArrayAccess in /srv/mediawiki/php-master/includes/filerepo/file/LocalFile.php on line 258 [16:44:11] hah! 
so not labs [16:44:16] * yuvipanda goes away [16:53:43] thcipriani: hmmmm [16:54:34] 6Release-Engineering, 10MediaWiki-General-or-Unknown, 5Patch-For-Review: mediawiki-ruby-api gem requires passing "token_type: false" for some queries - https://phabricator.wikimedia.org/T98719#1276525 (10Ragesoss) [16:55:13] thcipriani: could be [16:55:14] ttps://phabricator.wikimedia.org/rMW429a22cd088d9ca3d543b10b722f02c32979169f [16:55:18] https://phabricator.wikimedia.org/rMW429a22cd088d9ca3d543b10b722f02c32979169f [16:57:58] greg-g: well, this went out right before shinken exploded, but a lot of stuff went out around that time: https://gerrit.wikimedia.org/r/#/c/210073/1 [16:58:31] thcipriani: I just filed https://phabricator.wikimedia.org/T98754 [16:58:44] kk [17:23:56] 10Beta-Cluster, 10Deployment-Systems, 10Traffic, 6operations: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1276698 (10BBlack) 3NEW [17:33:32] 10Beta-Cluster, 10Deployment-Systems, 10Traffic, 6operations: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1276736 (10BBlack) Paste of induced 503s related to gzip: https://phabricator.wikimedia.org/P633 , where the fetching fails with: ``` 13 FetchError c Junk aft... [17:45:21] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30076 bytes in 0.879 second response time [17:45:31] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47735 bytes in 0.931 second response time [17:46:15] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 48030 bytes in 1.022 second response time [17:46:53] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 47735 bytes in 1.132 second response time [17:48:49] yay, recoveries! 
[18:01:56] yuvipanda: Started learning go over the weekend -- https://github.com/bd808/ggml [18:06:02] 6Release-Engineering, 10Wikidata, 7Composer: enable use of a composer created autoloader in extensions deployed to production - https://phabricator.wikimedia.org/T97560#1276909 (10JeroenDeDauw) This would indeed be useful for the Wikidata project, especially when we start getting extensions on top of Wikibas... [18:15:48] 6Release-Engineering, 10Continuous-Integration-Config, 10Wikidata, 7Composer: enable use of a composer created autoloader in extensions deployed to production - https://phabricator.wikimedia.org/T97560#1276955 (10greg) [18:26:41] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #221: ABORTED in 4 hr 0 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/221/ [18:34:34] 10Continuous-Integration-Infrastructure, 10Wikipedia-Android-App: Android app build: Gradle checkstyle + app build - https://phabricator.wikimedia.org/T88494#1276978 (10bearND) I just ran the Android Gradle build on a plain Ubuntu 14 system. The only thing I needed to install was: sudo apt-get update sudo... [19:00:25] PROBLEM - Free space - all mounts on deployment-eventlogging02 is CRITICAL deployment-prep.deployment-eventlogging02.diskspace._var.byte_percentfree (<100.00%) [19:15:33] bd808: mmmm nice :) [19:15:49] bd808: do you think we should start introducing go into the infrastructure? [19:15:55] right now I’m getting rid of all the perl [19:16:12] not sure honestly [19:16:38] shipping python libs is a pain [19:16:52] but making a proper package from go probably is too [19:17:01] (03PS11) 10JanZerebecki: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [19:17:19] mostly I wanted to play with the idea and I've been meaning to try go for something so ... 
profit [19:17:22] bd808: I’ve gotten fairly used to making python debs. [19:17:26] bd808: oh yeah, +1 on that [19:17:30] although the error handling bugs me [19:17:38] it’s like the worst of C and Java [19:17:58] it does seem a bit weird [19:18:13] (03CR) 10JanZerebecki: [C: 04-1] "Needs to be merged before this: https://gerrit.wikimedia.org/r/210117" [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [19:18:15] bubbling them up takes a bit of boilerplate [19:18:30] bd808: we’re introducing the first bits of Go in our infrastructure shortly (etcd) [19:24:32] bah. My monolog config for beta isn't right :/ [19:25:21] SiteConfiguration isn't doing what I want it to. Now I have to read that code again :/ [19:27:08] :( [19:28:05] Found the problem but not sure what the fix is [19:28:30] it's not SiteConfiguration's fault it's wmfLabsOverrideSettings() [19:29:13] I have a setting (wmgMonologChannels) that is an array [19:29:37] for beta I want to add things to the wmgMonologChannels[default] array [19:29:52] but wmfLabsOverrideSettings() doesn't support that [19:30:21] so I just end up with my new wmgMonologChannels[default] stuff replacing the global config [19:31:20] I wonder how many things in that file will be messed up if I fix the merge logic to work the way I want it to? [19:37:24] merging config settings is a hard problem >.> [19:47:10] 6Release-Engineering, 10MediaWiki-Debug-Logging, 15User-Bd808-Test: wmfLabsOverrideSettings() doesn't merge settings recursively - https://phabricator.wikimedia.org/T98772#1277224 (10bd808) 3NEW [19:47:29] legoktm: ^ there's the description of my problem [19:47:55] I wonder if there is an "easy" fix. Do we have a dblist-like collection for all beta hosts? [19:48:03] * bd808 looks [19:48:28] all-labs.dblist? [19:50:37] good morning [19:50:43] bd808: should it be +wmgMonologChannels ? 
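Editor's note on the merge problem bd808 describes above (new wmgMonologChannels[default] entries replacing the global config instead of being added to it). The following is a minimal Python sketch of the difference between a one-level merge and a recursive merge. It is a toy model built only from the description in the chat, not the actual wmfLabsOverrideSettings() code; the function names and the channel names and levels in the sample data are hypothetical.

```python
from copy import deepcopy


def one_level_merge(base, overrides):
    """Merge per-setting dicts one level deep.

    Each selector (e.g. "default") in the override replaces the
    base value wholesale, which is the behaviour complained about.
    """
    merged = deepcopy(base)
    for setting, per_selector in overrides.items():
        merged.setdefault(setting, {}).update(deepcopy(per_selector))
    return merged


def recursive_merge(base, overrides):
    """Merge nested dicts key by key; non-dict values are replaced."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = recursive_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical sample data: channel names and levels are made up.
global_config = {
    "wmgMonologChannels": {
        "default": {"api": "debug", "exception": "error"},
    },
}
beta_overrides = {
    "wmgMonologChannels": {
        "default": {"CentralAuth": "debug"},
    },
}

shallow = one_level_merge(global_config, beta_overrides)
deep = recursive_merge(global_config, beta_overrides)
```

In this sketch `shallow` keeps only the beta channel under "default", while `deep` keeps the global channels and adds the beta one. As noted in the channel, a recursive merge can change the meaning of other overrides that rely on replacement, so it is not automatically the safer choice.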
[19:50:54] I guess what I'd really want for this is a new "beta" tag in $wikiTags when we are executing the beta cluster [19:51:09] we lack puppet collection on beta so not easy to find out all instances :/ [19:51:33] legoktm: that won't fix it because we aren't using SiteConfiguration to do this merge [19:51:37] it's a local hack [19:51:45] oh ew [19:52:03] legoktm: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings-labs.php#L34-L43 [19:53:01] oh... does the siteParamsCallback block above mean that I just need to use '+beta' instead of 'default'? [19:58:01] (03PS5) 10Hashar: Add OOJsUIAjaxLogin extension [integration/config] - 10https://gerrit.wikimedia.org/r/207758 (owner: 10Florianschmidtwelzow) [20:03:25] greg-g: is our team in charge of Gerrit (maintenance/upgrade)? [20:04:00] hashar: "in charge of" is such a strong phrase [20:04:07] hashar: but, yeah [20:04:11] no one else is [20:04:19] that is the point :D [20:04:31] are we in charge of it or is it orphaned ? :} [20:04:41] "unfunded mandate" is a pretty good phrase for it :) [20:04:42] in charge being .. Chad unfortunately [20:04:45] right [20:04:47] ahah [20:05:02] poor "Gerrit tribe" [20:09:58] (03CR) 10Hashar: "Would you mind using npm to run jshint and jsonlint? You will also want to add the i18n linter (banana-lint). 
Then Jenkins will just run" [integration/config] - 10https://gerrit.wikimedia.org/r/207758 (owner: 10Florianschmidtwelzow) [20:12:36] (03CR) 10Hashar: Disable tests on deployment branches where we have removed them (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/209629 (https://phabricator.wikimedia.org/T94586) (owner: 10Awight) [20:14:27] (03PS3) 10Hashar: Disable tests on deployment branches where we have removed them [integration/config] - 10https://gerrit.wikimedia.org/r/209629 (https://phabricator.wikimedia.org/T94586) (owner: 10Awight) [20:16:00] (03CR) 10Hashar: [C: 032] "I have removed the tiny mistake and rebased the change. Congrats Awight for finding out the Zuul branch: filter system :}" [integration/config] - 10https://gerrit.wikimedia.org/r/209629 (https://phabricator.wikimedia.org/T94586) (owner: 10Awight) [20:17:47] (03Merged) 10jenkins-bot: Disable tests on deployment branches where we have removed them [integration/config] - 10https://gerrit.wikimedia.org/r/209629 (https://phabricator.wikimedia.org/T94586) (owner: 10Awight) [20:18:14] (03CR) 10Hashar: "Deployed" [integration/config] - 10https://gerrit.wikimedia.org/r/209629 (https://phabricator.wikimedia.org/T94586) (owner: 10Awight) [20:23:59] hashar: Thanks! [20:24:11] awight: hey! I haven't tested it though :/ [20:24:43] awight: I hope one day you will find out how to run them concurrently :D [20:25:27] hashar: Rats, looks like this patch doesn't apply to the civicrm job? 
[20:28:09] awight: I am not sure :} [20:28:58] awight: the filter is applied on the job mwext-DonationInterface-testextension-zend [20:29:13] regardless of the Gerrit repo/project that triggers the job [20:29:15] the filter being tied to the job [20:29:34] (03PS1) 10Awight: Prevent running phpunit tests on CiviCRM deployment branch [integration/config] - 10https://gerrit.wikimedia.org/r/210179 (https://phabricator.wikimedia.org/T94586) [20:33:17] :) [20:34:09] awight: so if you look at the Zuul diff on https://integration.wikimedia.org/ci/job/integration-zuul-layoutdiff/4092/console [20:34:25] you can see that wikimedia-fundraising-civicrm is triggered by a bunch of Gerrit projects [20:34:35] yes [20:35:04] hrm, right so the branch clause should only apply to the wikimedia-fundraising-crm repo [20:35:06] the branch filter apply to the change on a given repo [20:35:11] I guess it's harmless for the others? [20:35:21] I guess [20:35:24] :( [20:35:32] they probably don't have a deployment repo do they? [20:35:49] right, they're generally pulled in as submodules or composer libs [20:36:09] Is there a way to restrict my rule to only apply to the -crm repo? [20:36:18] Just for the sake of cleanliness... [20:36:31] nop :D [20:36:35] you would need a different job [20:36:43] but in theory [20:37:06] if one of the other repo has a 'deployment' branch, the job can be ignored [20:37:18] since the change will end up being tested as a submodule later on [20:37:24] yikes! That's fine with us, though [20:37:41] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL 22.22% of data above the critical threshold [0.0] [20:37:41] (03CR) 10Hashar: [C: 032] Prevent running phpunit tests on CiviCRM deployment branch [integration/config] - 10https://gerrit.wikimedia.org/r/210179 (https://phabricator.wikimedia.org/T94586) (owner: 10Awight) [20:37:46] thx! 
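Editor's note on the point hashar makes above, that the branch filter is tied to the job rather than to the Gerrit project that triggers it. The sketch below is a simplified Python model of that behaviour, not Zuul's real configuration format or API. The `Job` class, the `skip_branch` field, and the project names are hypothetical; only the job name and the 'deployment' branch come from the conversation.

```python
import re
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    # Hypothetical field: regex of branch names on which the job is skipped.
    skip_branch: str


def should_run(job, project, branch):
    """Decide whether a job runs for a change.

    The project argument is deliberately ignored: in this model the
    filter belongs to the job, so it applies to every project that
    triggers the job.
    """
    return re.match(job.skip_branch, branch) is None


civicrm_job = Job(name="wikimedia-fundraising-civicrm", skip_branch=r"^deployment$")

# Hypothetical project names standing in for the repos that trigger the job.
decisions = {
    (project, branch): should_run(civicrm_job, project, branch)
    for project in ("crm-repo", "some-library-repo")
    for branch in ("master", "deployment")
}
```

Under these assumptions a change on a 'deployment' branch is skipped for every triggering project, which matches the remark that a separate job would be needed to restrict the rule to a single repo.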
[20:37:54] I am quite happy you found out all that civicrm testing madness [20:38:42] 10Continuous-Integration-Infrastructure, 10Wikimedia-Fundraising-CiviCRM: CI for Civi: provision and run tests under Jenkins/Zuul - https://phabricator.wikimedia.org/T86103#1277390 (10awight) [20:39:14] Me too, it's paying off already [20:40:16] hopefully you find new creative tests to run :} [20:40:26] (03Merged) 10jenkins-bot: Prevent running phpunit tests on CiviCRM deployment branch [integration/config] - 10https://gerrit.wikimedia.org/r/210179 (https://phabricator.wikimedia.org/T94586) (owner: 10Awight) [20:41:24] awight: deployed [20:41:59] awesome, that will make fundraising deployment much less painful [20:42:37] 10Continuous-Integration-Infrastructure, 10Wikipedia-Android-App, 5Patch-For-Review: Android app build: Gradle checkstyle + app build - https://phabricator.wikimedia.org/T88494#1277402 (10hashar) >>! In T88494#1277341, @gerritbot wrote: > Change 210177 had a related patch set uploaded (by Hashar): > contint:... [20:46:37] hashar: your/do CI/jenkins stuff right? [20:47:08] JohnFLewis: nope, he doesn't [20:47:19] legoktm: bah, who does? :) [20:47:30] (03PS12) 10JanZerebecki: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [20:47:42] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK Less than 1.00% above the threshold [0.0] [20:48:17] (03CR) 10JanZerebecki: "PS7 is only a rebase." [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [20:48:18] JohnFLewis: heh, I was kidding [20:48:35] I don't know this region enough for the jokes :p [20:48:49] what do you need help with? [20:49:22] legoktm: the "operations-puppet-tox-py27" test keeps failing randomly on operations patches and it should probably be looked at/disabled or so [20:49:31] um [20:49:43] examples? 
[20:49:48] it's just a normal tox job [20:50:07] legoktm: it’s failing on jobs that don’t touch any python scripts at all [20:50:17] and randomly - a recheck fixes it [20:50:27] well that sounds like the test it's running is flaky? [20:50:36] I wonder who wrote that test... [20:50:40] https://gerrit.wikimedia.org/r/#/c/209874/ e.g. [20:50:58] oh [20:50:59] 15:44:26 ERROR: Problem fetching from origin / origin - could be unavailable. Continuing anyway [20:50:59] 15:44:26 ha:AAAAWB+LCAAAAAAAAP9b85aBtbiIQSmjNKU4P08vOT+vOD8nVc8DzHWtSE4tKMnMz/PLL0ldFVf2c+b/lb5MDAwVRQxSaBqcITRIIQMEMIIUFgAAckCEiWAAAAA=hudson.plugins.git.GitException: Command "fetch -t origin refs/zuul/production/Zfbb0a775522e4c358b618c02c91dd05e" returned status code 128: [20:50:59] 15:44:26 stdout: [20:50:59] 15:44:26 stderr: error: object file .git/objects/eb/2474568ba4663702ed2927a4ba715f61bedd1f is empty [20:51:01] 15:44:26 fatal: loose object eb2474568ba4663702ed2927a4ba715f61bedd1f (stored in .git/objects/eb/2474568ba4663702ed2927a4ba715f61bedd1f) is corrupt [20:51:03] that's a jenkins issue [20:51:17] JohnFLewis: I do as well as other folks in this channel. legoktm definitely has as much knowledge as me :} [20:51:29] probably one of the workspaces is corrupt [20:51:33] :( [20:51:40] hashar: I see you and assume it with you so :) [20:51:42] we should just convert it to use generic tox-py27 [20:51:43] maybe something got repacked [20:52:05] oh, but is cloning ops/puppet slow? 
[20:53:13] legoktm: the failing jobs were all on integration-slave-precise-1012/ [20:53:42] I'll delete that workspace then [20:54:21] legoktm@integration-slave-precise-1012:/mnt/jenkins-workspace/workspace/operations-puppet-tox-py27$ sudo git status [20:54:21] error: object file .git/objects/b4/8ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f is empty [20:54:21] fatal: loose object b48ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f (stored in .git/objects/b4/8ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f) is corrupt [20:55:16] time to git fsck [20:55:16] !log deleted operations-puppet-tox-py27 workspace on integration-slave-precise-1012, it was corrupt (fatal: loose object b48ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f (stored in .git/objects/b4/8ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f) is corrupt) [20:55:19] Logged the message, Master [20:55:20] or maybe just delete [20:55:21] :) [20:59:46] (03PS13) 10JanZerebecki: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [21:00:52] 6Release-Engineering, 10Wikimedia-Hackathon-2015, 10Wikipedia-Android-App: Create end-to-end test for Wikipedia Android app - https://phabricator.wikimedia.org/T90177#1277439 (10BGerstle-WMF) Any updates on this? Would love to hack on it at Lyon. [21:02:51] 6Release-Engineering, 10Wikimedia-Hackathon-2015, 10Wikipedia-Android-App: Create end-to-end test for Wikipedia Android app - https://phabricator.wikimedia.org/T90177#1277450 (10greg) >>! In T90177#1277439, @BGerstle-WMF wrote: > Any updates on this? Would love to hack on it at Lyon. No. @zeljkofilipin / @... 
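Editor's note on the corrupt workspace above: the git errors report loose object files that are empty. A minimal Python sketch of how one might scan a workspace for that specific symptom follows. It only looks for zero-byte files under `.git/objects/<two hex chars>/` and is no substitute for `git fsck`; the function name is hypothetical.

```python
import os
import re

# Loose objects live in directories named after the first two hex digits
# of the object id; "pack" and "info" are skipped by this pattern.
_FANOUT = re.compile(r"^[0-9a-f]{2}$")


def find_empty_loose_objects(git_dir):
    """Return the ids of zero-byte loose object files under git_dir/objects."""
    objects_dir = os.path.join(git_dir, "objects")
    empty = []
    if not os.path.isdir(objects_dir):
        return empty
    for fanout in sorted(os.listdir(objects_dir)):
        fanout_path = os.path.join(objects_dir, fanout)
        if not _FANOUT.match(fanout) or not os.path.isdir(fanout_path):
            continue
        for name in sorted(os.listdir(fanout_path)):
            path = os.path.join(fanout_path, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(fanout + name)
    return empty
```

Called with the path to a workspace's `.git` directory, it returns a list of object ids. A non-empty result would be one reason to wipe the workspace and let Jenkins clone it again, as was done here.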
[21:05:28] (03PS1) 10Hashar: Build Android mobile app with gradlew [integration/config] - 10https://gerrit.wikimedia.org/r/210197 (https://phabricator.wikimedia.org/T88494) [21:24:29] (03PS2) 10Hashar: Switch Android mobile app from maven to gradlew [integration/config] - 10https://gerrit.wikimedia.org/r/210197 (https://phabricator.wikimedia.org/T88494) [21:26:03] (03CR) 10Hashar: "Made it to run each step in sequence so:" [integration/config] - 10https://gerrit.wikimedia.org/r/210197 (https://phabricator.wikimedia.org/T88494) (owner: 10Hashar) [21:29:03] 10Continuous-Integration-Infrastructure, 10Wikipedia-Android-App, 5Patch-For-Review: Android app build: Gradle checkstyle + app build - https://phabricator.wikimedia.org/T88494#1277616 (10hashar) a:3bearND In short @bearND made it absolutely trivial to build the Android app. The Jenkins job proposed at htt... [21:30:26] (03CR) 10JanZerebecki: "Deployed to Jenkins: mwext-WikidataQuality-npm, mwext-WikidataQuality-qunit, mwext-WikidataQuality-repo-tests-mysql-hhvm, mwext-WikidataQu" [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [21:40:23] (03PS14) 10JanZerebecki: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [21:41:29] (03CR) 10JanZerebecki: "Made those two new HHVM phpunit jobs non-voting." [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [21:43:50] (03CR) 10JanZerebecki: [C: 031] "Please merge and deploy to zuul." 
[integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag) [22:04:56] 6Release-Engineering, 10MediaWiki-Debug-Logging, 5Patch-For-Review, 15User-Bd808-Test: wmfLabsOverrideSettings() doesn't merge settings recursively - https://phabricator.wikimedia.org/T98772#1277727 (10bd808) Merging config recursively properly is a non-trivial problem. The easier thing to do in this case... [22:05:16] 6Release-Engineering, 10MediaWiki-Debug-Logging, 5Patch-For-Review, 15User-Bd808-Test: wmfLabsOverrideSettings() doesn't merge settings recursively - https://phabricator.wikimedia.org/T98772#1277728 (10bd808) 5Open>3Resolved a:3bd808 [22:07:36] 10Beta-Cluster, 6Labs, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1277733 (10bd808) Sort of related, the Monolog config for beta was messed up making most things go to the default `wfDebug` udp2log file and not to Logstash at all. This was fixed with {T98772} by m... [22:10:19] bd808 <3 also what’s left to close that bug? [22:10:37] I was just wondering that myself [22:10:53] heh [22:10:55] I need to check and see if deployment-bastion is aggregating syslog [22:14:59] !log Removed role::logging::mediawiki from deployment-bastion [22:15:01] Logged the message, Master [22:16:16] 10Deployment-Systems, 7HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1277763 (10ori) >>! In T89912#1192271, @greg wrote: > Quoting the description from @ori: >> HHVM has been locking up in production, typically right after a big deployment which touches lots of file. Typically only one or two... [22:21:21] !log Stopped udplog; purged package and removed config files on depolyment-bastion [22:24:21] !log removed duplicate local group mwdeploy from deployment-bastion that was shadowing the ldap group of the same name [22:24:26] Logged the message, Master [22:24:40] thcipriani: ^ not sure when that showed up :/ [22:25:22] that's not good. 
We deleted that once, I thought. Checking SAL. [22:25:46] I just noticed it because a forced puppet run said "changing group 'mwdeploy' to 'mwdeploy'" [22:26:24] This looks better in a re-run: "group changed '992' to 'mwdeploy'" [22:26:46] I want to see if l10nupdate is still there [22:27:02] no, that one stayed gone. [22:27:09] there is a local l10nupdate group, yes [22:27:14] maybe not a user [22:27:38] I removed that user: it was causing scap ownership problems the same way mwdeploy was. [22:27:50] around the same time you removed mwdeploy [22:28:07] *nod* if puppet runs and ldap burps then it will create local users unfortunately [22:28:27] seems that way. [22:29:15] !log removed duplicate local group l10nupdate from deployment-bastion that was shadowing the ldap group of the same name [22:29:18] Logged the message, Master [22:30:35] It is at least in part because Puppet is not aware that we are using ldap for auth. it just does a system lookup for the user/group name. If it fails then it creates a local user/group as needed [22:30:55] and most of our puppet config doesn't specify a uid/gid for the resources [22:31:40] it's only a problem in beta because of nfs. prod wouldn't care if the uid/gid didn't match [22:32:14] beta hopefully can stop caring at some point [22:32:16] it seems un-puppet-like for it to create/groups rather than just die and vomit out an error. [22:32:43] like when it tries to create a file in a folder that doesn't exist [22:32:54] The manifests have user{} and group{} declarations [22:33:07] so it's just doing the right thing as far as it knows [22:33:08] ah, right. [22:36:54] 10Beta-Cluster, 6Labs, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1277870 (10bd808) I removed `role::logging::mediawiki` from deployment-bastion and tried to manually clean up the services and config that were orphaned by that role being removed. There are certain... 
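Editor's note on the shadowing problem discussed above, where a local group created by Puppet hides the LDAP group of the same name. The sketch below is a small Python check that compares the contents of a local group file against a set of names expected to come from the directory. It assumes the standard `name:password:GID:members` layout of `/etc/group`. The function names, the LDAP GIDs, and the local GID for l10nupdate are hypothetical; only the group names and the local GID 992 appear in the chat.

```python
def parse_group_file(text):
    """Parse /etc/group-style text into a {name: gid} mapping."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # ignore malformed lines
        groups[fields[0]] = int(fields[2])
    return groups


def shadowing_groups(local_group_text, directory_groups):
    """Return {name: (local_gid, directory_gid)} for names defined in both."""
    local = parse_group_file(local_group_text)
    return {
        name: (gid, directory_groups[name])
        for name, gid in local.items()
        if name in directory_groups
    }


# Hypothetical sample data; the directory GIDs and 993 are invented.
local_text = "root:x:0:\nmwdeploy:x:992:\nl10nupdate:x:993:\n"
ldap_groups = {"mwdeploy": 5000, "l10nupdate": 5001, "wikidev": 5002}

conflicts = shadowing_groups(local_text, ldap_groups)
```

Any entry in `conflicts` is a local group that would take precedence over the directory one, which, per the discussion, matters on beta because files shared over NFS depend on matching uid/gid values.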
[22:37:13] bd808: so I guess the actual deletion is what’s left... [22:41:13] It still has role::syslog::centralserver too [22:41:19] looking at that now [22:42:34] My first question is "do we need central syslog?" [22:42:41] (03PS1) 10Legoktm: Whitelist devunt - Josa extension maintainer [integration/config] - 10https://gerrit.wikimedia.org/r/210219 [22:42:51] thcipriani: thoughts? [22:43:27] Things are definitely being written to deployment-bastion:/data/project/logs/syslog [22:43:43] doesn’t logstash do that better? [22:43:50] * thcipriani looking [22:44:20] yuvipanda: It could if we created a role to tell the servers to log to logstash instead [22:44:35] (03CR) 10Legoktm: [C: 032] Whitelist devunt - Josa extension maintainer [integration/config] - 10https://gerrit.wikimedia.org/r/210219 (owner: 10Legoktm) [22:44:52] I'm actually not sure yet how the rsyslogs get told to log to the center. I should look that up [22:46:12] (03Merged) 10jenkins-bot: Whitelist devunt - Josa extension maintainer [integration/config] - 10https://gerrit.wikimedia.org/r/210219 (owner: 10Legoktm) [22:48:41] it looks like there is no reason to use syslog vs logstash in beta, at least from a quick glance through the puppet roles and phab code. 
[22:49:33] I found out what sends the logs -- base::remote-syslog [22:49:49] and it uses a realm switch [22:49:55] joy [22:50:14] that class is gross [22:50:32] had a dash in name, must be :) [22:50:51] !log deploying https://gerrit.wikimedia.org/r/210219 [22:50:54] Logged the message, Master [22:51:01] yuvipanda: read it and weep -- https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/remote-syslog.pp [22:51:13] oh dear [22:51:25] bd808: it’s not as bad as the parsoid role [22:51:34] heh [22:51:41] I’m sure you’ve looked at that one :) [22:52:26] thcipriani: we never did https://gerrit.wikimedia.org/r/#/c/193082/ [22:52:59] yeah, the -oid roles were pretty close, iirc [22:54:25] yuvipanda: the parsoid::beta role was what initial sent me down the rabbit hole of wanting to figure out how to automate running trebuchet. That's what ran me off from beta cluster work the last time I left ;) [22:54:44] bd808: :) it ran me off too :) But I made a patch to fix it. [22:55:00] it just needs someone to apply, babysit (I tested and it works, but no idea how to test jenkins) and update IPs [22:57:06] what tests the tester? Automating trebuchet would be a Good Thing™ seems unintuitive that it doesn't autorun when everything else does. [22:58:31] thcipriani: it would be a great thing. the magic to do it is dark and mysterious [22:58:52] legoktm, what is the purpose of that changeset? [22:59:24] thcipriani: fwiw, I've been using ansible with some success for restbase staging deployments [22:59:29] devunt: jenkins was only running php lint checks whenever you uploaded a new patch, and would run the full tests whenever it was +2'd. 
now it'll run the full tests whenever you upload a new patchset [22:59:39] thcipriani: The default operational mode of trebuchet is to assume that a human will read things and decide if the fetch and checkout steps worked [22:59:42] https://github.com/gwicke/ansible-playground/blob/master/roles/restbase/deploy.yml [23:00:04] I am definitely way more familiar with trebuchet than I'd like to be :\ [23:00:32] I actually like most of it. salt is the part that seems a bad fit to me [23:00:40] added a vagrant config upstream: https://github.com/trebuchet-deploy/trebuchet/pull/16 [23:00:52] I gave up on salt for remote command execution for tools. using pssh locally now [23:01:00] pretending that async things are sync causes no end of headaches [23:01:14] bd808: it's just that "trebuchet" is a very apt metaphor for its mode of operation. [23:01:21] legoktm, regardless of project? or just on Josa [23:02:02] which is to say, I feel like we live in an age of more elegant modes of launching code [23:02:18] HAHAHA [23:02:30] like, you launch it and then hope it works? [23:02:45] I guess we should write 'rail gun' or 'spacex' ;) [23:02:54] exactly :) [23:03:03] also you have to click ‘unpack’ before it attacks and then in the meantime your enemy cavalry / infantry has destroyed it because even though it has good range resistance it doesn’t have good melee resistance? [23:03:09] * yuvipanda misses Age of Empires [23:03:47] I got into some pretty heated "discussions" for undertaking to make scap less shitty rather than figuring out how to use trebuchet for MW deploys [23:04:14] good luck to the next person to tilt at these windmills [23:04:47] I REMEMBER THOSE!
[23:04:48] I think that scap actually works fairly well, certainly provides more feedback than trebuchet, plus flying-pig FTW [23:05:05] thcipriani: it used to be a bunch of bash scripts :) [23:05:10] and perl [23:05:18] it was a train wreck [23:05:21] very similar to the toollabs webservice trainwreck [23:05:37] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #668: FAILURE in 54 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/668/ [23:06:38] probably similar origins: need another feature; I'll just put it over here [23:06:42] I think it always takes a lot of work and heated discussions to not blindly shuffle into the default end-state of all projects: trainwreck :) [23:06:51] yeah [23:06:53] heh [23:07:18] devunt: for all projects. Basically we trust you not to run evil code on the test servers [23:07:57] scap had gotten to the point that nobody really knew how it worked. I found at least 3 non-trivial bugs that had been caused by trivial changes over time [23:08:16] the "just run scap twice" bug was the best [23:08:39] oh. thank you [23:08:42] new l10n never worked; Sam's fix was just to run the script again (which always fixed things) [23:09:01] :D [23:09:10] I finally found the place where Sam had broken things by moving one command up 3 lines in the file [23:09:36] And in the process of fixing it I took out the whole prod cluster for 6 minutes [23:09:49] 404's for all pages on all wikis [23:09:57] it was terrifying [23:10:19] * yuvipanda hopes his toollabs escapades are a lot less terrifying [23:10:39] well. That's probably why no one ever tried to fix it :) [23:13:05] yuvipanda: what's the "right" way to fix these host exclusions? https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/remote-syslog.pp#L2 [23:13:21] hiera set to true globally and false for the target hosts?
[23:13:55] bd808: basically, yeah [23:13:58] for now at least. [23:23:16] yuvipanda: heh. That role sends syslog from *everything* in the labs realm to deployment-bastion [23:23:26] how wonderful [23:23:28] only firewalls stop it [23:23:42] * bd808 will have a patch in a bit [23:31:21] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL 100.00% of data above the critical threshold [0.0] [23:31:25] PROBLEM - Puppet staleness on deployment-urldownloader is CRITICAL 100.00% of data above the critical threshold [43200.0] [23:31:43] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL 100.00% of data above the critical threshold [0.0] [23:35:48] yuvipanda: Empire Earth > Age of Empires [23:35:55] never played that [23:35:56] Sierra vs. MS [23:36:13] it's like AoE just even better:) [23:36:19] AoE is Ensemble Studios [23:36:22] :P [23:36:33] EE is Stainless Steel Studios