[00:12:50] (03PS1) 10Legoktm: Add flake8 jobs for operations/software/tools-webservice [integration/config] - 10https://gerrit.wikimedia.org/r/210251 (https://phabricator.wikimedia.org/T98805)
[00:13:25] (03CR) 10Legoktm: [C: 032] Add flake8 jobs for operations/software/tools-webservice [integration/config] - 10https://gerrit.wikimedia.org/r/210251 (https://phabricator.wikimedia.org/T98805) (owner: 10Legoktm)
[00:15:00] (03Merged) 10jenkins-bot: Add flake8 jobs for operations/software/tools-webservice [integration/config] - 10https://gerrit.wikimedia.org/r/210251 (https://phabricator.wikimedia.org/T98805) (owner: 10Legoktm)
[00:16:09] !log deploying https://gerrit.wikimedia.org/r/210251
[00:16:24] Logged the message, Master
[00:36:54] !log Cherry-picked https://gerrit.wikimedia.org/r/#/c/210253/
[00:36:57] Logged the message, Master
[00:40:03] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL 12.50% of data above the critical threshold [0.0]
[00:48:30] !log beta cluster central syslog going to logstash rather than deployment-bastion (see https://gerrit.wikimedia.org/r/#/c/210253)
[00:48:32] Logged the message, Master
[00:48:39] yuvipanda: ^
[00:48:48] bd808: \o/;
[00:48:58] bd808: want me to merge / babysit the patch today? or should I wait for tomorrow?
[00:49:40] I think it needs some eyes. I changed a bunch of little things there
[00:49:59] it at least needs some puppet compiler testing that I don't know how to do
[00:50:04] RECOVERY - Puppet failure on deployment-mediawiki01 is OK Less than 1.00% above the threshold [0.0]
[00:50:07] bd808: yeah, and base.
[00:50:21] bd808: https://wikitech.wikimedia.org/wiki/Puppet_Testing
[00:50:22] yeah. it touches all the things
[00:50:23] second section
[00:53:58] pcc puked on all the hosts I picked :( -- https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/767/console
[00:54:55] "Notice: hiera(): Cannot load backend nuyaml: cannot load such file -- hiera/backend/nuyaml_backend"
[00:55:06] pcc is teh broken :(
[00:55:57] 10Beta-Cluster, 6operations, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1278221 (10Krenair)
[00:57:44] bd808: yeah :(
[00:58:11] why am I still working at 19:00 local time :(
[01:04:23] bd808: yeah, you shouldn't
[01:04:28] * yuvipanda plans on going home at 7PM today
[01:04:36] average should shift towards 7PM and not midnight...
[01:04:41] * legoktm hugs bd808
[01:04:56] * yuvipanda hugs bd808 too
[01:05:08] * bd808 runs away from hug monsters
[01:05:46] > Can't run, you're still trapped
[01:05:59] > Now you get eaten by a grue
[01:07:26] 10Beta-Cluster, 6Labs, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1278255 (10bd808) Syslogs are now going to deployment-logstash1 instead of deployment-bastion via a cherry-pick of https://gerrit.wikimedia.org/r/210253. I removed `role::syslog::centralserver` fro...
[01:07:59] * bd808 xyzzy
[01:08:17] * bd808 teleports out of the maze of hugs and gues
[01:08:22] *grues
[01:08:25] :)
[02:25:08] Yippee, build fixed!
[02:25:09] Project browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce build #97: FIXED in 3 min 7 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce/97/
[02:35:40] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #569: FAILURE in 2 min 40 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/569/
[05:35:00] Yippee, build fixed!
[05:35:00] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #416: FIXED in 32 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/416/
[06:05:12] 6Release-Engineering, 6Phabricator: Next Phabricator upgrade on 2015-05-20 (tentative) - https://phabricator.wikimedia.org/T98451#1278520 (10Qgil)
[07:28:00] 10Deployment-Systems: Come up with an abstract deployment model that roughly addresses the needs of existing projects - https://phabricator.wikimedia.org/T97068#1278622 (10fgiunchedi) sure, context is https://gerrit.wikimedia.org/r/#/c/201006/ where a python virtualenv is shipped in git, also note that sentry is...
[07:59:05] (03PS3) 10Hashar: Switch Android mobile app from maven to gradlew [integration/config] - 10https://gerrit.wikimedia.org/r/210197 (https://phabricator.wikimedia.org/T88494)
[08:38:32] 10Deployment-Systems, 6Release-Engineering: Use subrepos instead of git submodules for deployed MediaWiki extensions - https://phabricator.wikimedia.org/T98834#1278678 (10mmodell) 3NEW a:3mmodell
[08:55:44] (03PS4) 10Hashar: Switch Android mobile app from maven to gradlew [integration/config] - 10https://gerrit.wikimedia.org/r/210197 (https://phabricator.wikimedia.org/T88494)
[08:56:29] PROBLEM - Apertium APY on deployment-sca02 is CRITICAL: Connection refused
[08:56:41] PROBLEM - Content Translation Server on deployment-sca02 is CRITICAL: Connection refused
[08:58:51] PROBLEM - Mathoid on deployment-sca02 is CRITICAL: Connection refused
[08:59:53] PROBLEM - Citoid on deployment-sca02 is CRITICAL: Connection refused
[09:01:04] (03CR) 10Hashar: [C: 032] "Good enough first step for now. Can be improved later on." [integration/config] - 10https://gerrit.wikimedia.org/r/210197 (https://phabricator.wikimedia.org/T88494) (owner: 10Hashar)
[09:02:53] (03Merged) 10jenkins-bot: Switch Android mobile app from maven to gradlew [integration/config] - 10https://gerrit.wikimedia.org/r/210197 (https://phabricator.wikimedia.org/T88494) (owner: 10Hashar)
[09:06:28] RECOVERY - Apertium APY on deployment-sca02 is OK: HTTP OK: HTTP/1.1 200 OK - 4063 bytes in 0.011 second response time
[09:08:18] 10Continuous-Integration-Infrastructure, 10Wikipedia-Android-App, 5Patch-For-Review: Android app build: Gradle checkstyle + app build - https://phabricator.wikimedia.org/T88494#1278712 (10hashar) 5Open>3Resolved Following up some discussions with @bearND, there is no two jobs running different goals: **...
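The maven-to-gradlew switch above means the CI job now builds through the Gradle wrapper checked into the app repository rather than a system Maven. A rough sketch of the kind of invocation involved — the exact goals the job runs are not shown in this log, so the `checkstyle` and `assembleDebug` task names here are assumptions based on the task title:
```
# From a checkout of the Android app repo: drive the build with the bundled
# Gradle wrapper so CI does not depend on a system-wide Gradle install.
# Task names are assumptions; the real goals live in integration/config.
./gradlew checkstyle assembleDebug
```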
[09:08:20] PROBLEM - Puppet failure on deployment-sca02 is CRITICAL 100.00% of data above the critical threshold [0.0]
[09:09:52] RECOVERY - Citoid on deployment-sca02 is OK: HTTP OK: HTTP/1.1 200 OK - 1326 bytes in 0.017 second response time
[09:23:52] RECOVERY - Mathoid on deployment-sca02 is OK: HTTP OK: HTTP/1.1 200 OK - 301 bytes in 0.012 second response time
[09:27:18] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused
[09:37:18] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.021 second response time
[09:44:50] Yippee, build fixed!
[09:44:51] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #482: FIXED in 7 min 50 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/482/
[09:47:12] (03PS1) 10Hashar: Fix fundraising deployment branch filters [integration/config] - 10https://gerrit.wikimedia.org/r/210291 (https://phabricator.wikimedia.org/T94586)
[09:47:22] (03CR) 10Hashar: [C: 032] Fix fundraising deployment branch filters [integration/config] - 10https://gerrit.wikimedia.org/r/210291 (https://phabricator.wikimedia.org/T94586) (owner: 10Hashar)
[09:49:01] (03Merged) 10jenkins-bot: Fix fundraising deployment branch filters [integration/config] - 10https://gerrit.wikimedia.org/r/210291 (https://phabricator.wikimedia.org/T94586) (owner: 10Hashar)
[10:45:48] PROBLEM - Puppet staleness on deployment-restbase02 is CRITICAL 100.00% of data above the critical threshold [43200.0]
[11:13:12] 6Release-Engineering, 6Phabricator: Adding users to CC on Phabricator security tasks doesn't add them to the view/edit policy - https://phabricator.wikimedia.org/T94565#1279007 (10mmodell) 5Open>3Invalid a:3mmodell I'm closing this one, Please reopen if you run into a legitimate instance of this bug. If...
[11:37:49] 10Deployment-Systems, 6Release-Engineering: Use subrepos instead of git submodules for deployed MediaWiki extensions - https://phabricator.wikimedia.org/T98834#1279062 (10mmodell) One thing I'm a little unsure of: sub-sub-modules - currently 3 of the deployed submodules have their own submodules: * ./vendor...
[11:47:55] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #222: STILL FAILING in 3 hr 30 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/222/
[12:00:27] (03PS15) 10Hashar: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag)
[12:00:36] (03CR) 10Hashar: [C: 032] "\o/" [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag)
[12:02:28] (03Merged) 10jenkins-bot: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag)
[12:04:02] (03CR) 10Hashar: "jzerebecki : change deployed :)" [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (https://phabricator.wikimedia.org/T97529) (owner: 10Soeren.oldag)
[12:33:37] 10Continuous-Integration-Infrastructure, 5Patch-For-Review, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1279163 (10hashar) I have manually upgraded zuul on all slaves. zuul-cloner has: --cache-no-hardlinks CACHE_NO_HARDLINKS...
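The zuul-cloner upgrade mentioned in T97106 adds a `--cache-no-hardlinks` flag: by default the cloner can now hard-link objects from a local bare-repo cache instead of copying them, which is what saves workspace disk. A hedged usage sketch (the cache path and project are illustrative, not taken from the actual job configuration):
```
# Clone mediawiki/core roughly the way a Jenkins job would, seeding from a
# local cache directory. With the upgraded zuul-cloner the cached objects are
# hard-linked unless --cache-no-hardlinks is passed. Paths are illustrative.
zuul-cloner --cache-dir /srv/git \
    https://gerrit.wikimedia.org/r/p mediawiki/core
```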
[12:34:58] 10Continuous-Integration-Infrastructure: reduce copies of mediawiki/core in workspaces - https://phabricator.wikimedia.org/T93703#1143683 (10hashar) On review of a zuul-cloner patch on [[ https://review.openstack.org/#/c/117626/ | upstream Gerrit ]]. Jeremy Stanley noticed the hard linking is skipped when the m...
[12:57:20] 10Beta-Cluster, 6operations, 5Patch-For-Review, 7Puppet: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1279199 (10ArielGlenn) If we can get this done in a day or two, then yes, let's go ahead and have the deployment user and group be per repo. Otherwise I'll...
[13:11:25] (03PS1) 10Hashar: Delete operations-puppet-tabs (done by puppet-lint) [integration/config] - 10https://gerrit.wikimedia.org/r/210341
[13:11:34] (03CR) 10Hashar: [C: 032] Delete operations-puppet-tabs (done by puppet-lint) [integration/config] - 10https://gerrit.wikimedia.org/r/210341 (owner: 10Hashar)
[13:14:13] (03Merged) 10jenkins-bot: Delete operations-puppet-tabs (done by puppet-lint) [integration/config] - 10https://gerrit.wikimedia.org/r/210341 (owner: 10Hashar)
[13:25:40] (03PS1) 10Hashar: Remove operations-puppet-spec [integration/config] - 10https://gerrit.wikimedia.org/r/210342
[13:25:58] (03CR) 10Hashar: [C: 032] Remove operations-puppet-spec [integration/config] - 10https://gerrit.wikimedia.org/r/210342 (owner: 10Hashar)
[13:27:44] 6Release-Engineering, 10Wikimedia-Hackathon-2015, 10Wikipedia-Android-App: Create end-to-end test for Wikipedia Android app - https://phabricator.wikimedia.org/T90177#1053692 (10zeljkofilipin) >>! In T90177#1277450, @greg wrote: > @zeljkofilipin / @dduvall / @etonkovidova: would this be something you'd like...
[13:27:45] (03Merged) 10jenkins-bot: Remove operations-puppet-spec [integration/config] - 10https://gerrit.wikimedia.org/r/210342 (owner: 10Hashar)
[13:28:20] 6Release-Engineering, 10Wikimedia-Hackathon-2015, 10Wikipedia-Android-App: Create end-to-end test for Wikipedia Android app - https://phabricator.wikimedia.org/T90177#1279262 (10zeljkofilipin) >>! In T90177#1277439, @BGerstle-WMF wrote: > Any updates on this? Would love to hack on it at Lyon. What kind of u...
[13:45:03] (03PS1) 10Hashar: Drop operations/software python jobs [integration/config] - 10https://gerrit.wikimedia.org/r/210347
[13:45:24] (03CR) 10Hashar: [C: 032] Drop operations/software python jobs [integration/config] - 10https://gerrit.wikimedia.org/r/210347 (owner: 10Hashar)
[13:51:48] (03PS1) 10Hashar: Drop redactatron python jobs [integration/config] - 10https://gerrit.wikimedia.org/r/210351
[13:54:20] (03Merged) 10jenkins-bot: Drop operations/software python jobs [integration/config] - 10https://gerrit.wikimedia.org/r/210347 (owner: 10Hashar)
[13:58:00] (03CR) 10Hashar: [C: 032] Drop redactatron python jobs [integration/config] - 10https://gerrit.wikimedia.org/r/210351 (owner: 10Hashar)
[13:59:51] (03Merged) 10jenkins-bot: Drop redactatron python jobs [integration/config] - 10https://gerrit.wikimedia.org/r/210351 (owner: 10Hashar)
[14:01:09] the continuous integration weekly meeting is starting now in #wikimedia-office . Short agenda is https://www.mediawiki.org/wiki/Continuous_integration/Meetings/2015-05-12
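Dropping job definitions from integration/config, as in the changes above, usually also means deleting the now-orphaned jobs from the Jenkins server itself; Jenkins Job Builder can do that directly. A sketch, assuming a local JJB config file with the Jenkins URL and credentials (the ini path is a placeholder):
```
# Remove a job that is no longer defined in the JJB YAML.
# jenkins_jobs.ini is a placeholder for a local config with credentials;
# the job name matches the change merged above.
jenkins-jobs --conf jenkins_jobs.ini delete operations-puppet-tabs
```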
[14:15:55] 10Continuous-Integration-Infrastructure: Grant Zuul deploy rights to Jan Zerebecki - https://phabricator.wikimedia.org/T98865#1279480 (10hashar) 3NEW
[14:30:32] (03Abandoned) 10Hashar: Add Job to run ResourcesTest for WikibaseRepo [integration/config] - 10https://gerrit.wikimedia.org/r/166024 (https://phabricator.wikimedia.org/T93404) (owner: 10Hashar)
[14:36:46] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<20.00%)
[14:39:09] Project browsertests-MobileFrontend-SmokeTests-linux-chrome-sauce build #116: FAILURE in 11 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-SmokeTests-linux-chrome-sauce/116/
[14:49:51] (03PS2) 10Zfilipin: Allow generic ApiError without response, fix error with token_type [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 (https://phabricator.wikimedia.org/T98719) (owner: 10Ragesoss)
[14:50:14] (03CR) 10jenkins-bot: [V: 04-1] Allow generic ApiError without response, fix error with token_type [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 (https://phabricator.wikimedia.org/T98719) (owner: 10Ragesoss)
[14:51:33] 10Deployment-Systems, 6Release-Engineering: Determine weekly triage meeting for Deployment Systems - https://phabricator.wikimedia.org/T98206#1279645 (10thcipriani) ## Proposal Rudimentary calendar stalking seems to suggest that the 30 min window immediately following deployment cabal is open for all of us:...
[14:51:45] RECOVERY - Free space - all mounts on deployment-bastion is OK All targets OK
[14:51:57] hi releng. mediawiki/core's master appears to be broken currently.
[14:52:40] example bogus failure: https://gerrit.wikimedia.org/r/#/c/205719/ https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm/7969/console
[14:53:46] MatmaRex: please fill a task! :}
[14:54:14] MatmaRex: it might be hhvm related
[14:54:38] where? you guys renamed a bunch of projects recently, i think
[14:54:55] #Continuous-Integration-Infrastructure would do
[14:55:01] in case of mistake we retriage them appropriately
[14:55:56] MatmaRex: only one rename :) CI-Infra vs CI-Config
[14:56:02] 10Continuous-Integration-Infrastructure: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core - https://phabricator.wikimedia.org/T98876#1279657 (10matmarex) 3NEW
[14:56:32] (03CR) 10Zfilipin: "Thanks for the patch! I am leaving this for Dan to review. He is on vacation this week, do not expect the review until next week." [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 (https://phabricator.wikimedia.org/T98719) (owner: 10Ragesoss)
[14:57:11] 10Browser-Tests, 5Patch-For-Review: mediawiki-ruby-api gem requires passing "token_type: false" for some queries - https://phabricator.wikimedia.org/T98719#1279666 (10zeljkofilipin) p:5Triage>3Normal
[15:08:03] 10Continuous-Integration-Infrastructure, 7HHVM: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core - https://phabricator.wikimedia.org/T98876#1279707 (10hashar) From the console output: ``` HHVM 3.6.1 is installed. There were 7 errors: 1) WfBaseConvertTest::testDigitToBase2 with data set #0 (...
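To reproduce the failure hashar quotes above outside of CI, the same test class can be run locally under HHVM through MediaWiki's PHPUnit wrapper. A sketch, assuming a mediawiki/core checkout of that era with the standard tests/phpunit/phpunit.php entry point:
```
# Run only the failing test class under HHVM instead of Zend PHP.
# --filter is passed through to PHPUnit; the entry point path is the usual
# mediawiki/core one and is assumed unchanged here.
hhvm tests/phpunit/phpunit.php --filter WfBaseConvertTest
```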
[15:09:27] 10Continuous-Integration-Infrastructure, 7HHVM: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core due to hhvm upgrade from 3.3.1+dfsg1-1+wm3.1 to 3.6.1+dfsg1-1+wm2 - https://phabricator.wikimedia.org/T98876#1279720 (10hashar)
[15:10:31] so
[15:10:55] !log mediawiki-phpunit-hhvm Jenkins job is broken due to an hhvm upgrade {{bug|T98876}}
[15:10:58] Logged the message, Master
[15:13:22] 10Continuous-Integration-Infrastructure, 7HHVM: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core due to hhvm upgrade from 3.3.1+dfsg1-1+wm3.1 to 3.6.1+dfsg1-1+wm2 - https://phabricator.wikimedia.org/T98876#1279733 (10hashar) a:3hashar So I probably screwed it up when running a mass apt-get...
[15:18:15] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 3098 bytes in 2.055 second response time
[15:18:20] !log downgrading hhvm on CI slaves
[15:18:23] Logged the message, Master
[15:18:32] 10Continuous-Integration-Infrastructure, 7HHVM: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core due to hhvm upgrade from 3.3.1+dfsg1-1+wm3.1 to 3.6.1+dfsg1-1+wm2 - https://phabricator.wikimedia.org/T98876#1279742 (10hashar) I found the old .deb in /var/cache/apt/archives: ``` archives/hhvm-l...
[15:18:51] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50489 bytes in 1.004 second response time
[15:21:22] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string 'Wikipedia' not found on 'http://en.m.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 3102 bytes in 3.051 second response time
[15:21:30] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50490 bytes in 0.588 second response time
[15:28:53] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #531: FAILURE in 53 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/531/
[15:31:35] whoa, dejavu: Fatal error: Object does not implement ArrayAccess in /srv/mediawiki/php-master/extensions/FeaturedFeeds/FeaturedFeeds.body.php on line 38
[15:33:06] thcipriani: another one like that?
[15:33:25] thcipriani: just in beta, right? (I mean, you didn't see this in prod during your swat)
[15:33:28] yeah, I _think_ that's why we're seeing problems in beta
[15:33:30] 10Continuous-Integration-Infrastructure, 7HHVM: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core due to hhvm upgrade from 3.3.1+dfsg1-1+wm3.1 to 3.6.1+dfsg1-1+wm2 - https://phabricator.wikimedia.org/T98876#1279816 (10hashar) p:5Unbreak!>3Normal The job is passing again so lowering the pri...
[15:33:46] no, swat was all small, different stuff than this
[15:33:48] thcipriani: report a task, cc AaronSchulz
[15:33:52] kk
[15:33:56] actually, assign to Aaron
[15:43:31] 10Beta-Cluster, 10Graphoid: Deploy Graphoid on Beta Cluster - https://phabricator.wikimedia.org/T97606#1279890 (10Yurik) Cool, identical to https://wikitech.wikimedia.org/wiki/Graphoid Thanks!
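The downgrade !log'd at 15:18 above amounts to reinstalling the cached older package and pinning it so apt does not immediately re-upgrade it; T98876#1279742 lists the commands actually used. A sketch, where the .deb filenames in /var/cache/apt/archives are assumptions built from the version string in the task title:
```
# Reinstall the previous hhvm build from apt's package cache and hold it.
# Filenames are assumed from the 3.3.1+dfsg1-1+wm3.1 version above; list the
# cache first to get the real names (hhvm plus any hhvm-* extension packages).
ls /var/cache/apt/archives/hhvm*_3.3.1+dfsg1-1+wm3.1_*.deb
sudo dpkg -i /var/cache/apt/archives/hhvm*_3.3.1+dfsg1-1+wm3.1_*.deb
sudo apt-mark hold hhvm   # keep unattended/mass upgrades from bumping it again
```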
[15:47:14] 10Continuous-Integration-Infrastructure: Unattended upgrade seems to only run daily instead of hourly - https://phabricator.wikimedia.org/T98885#1279919 (10hashar) 3NEW
[15:58:21] 10Beta-Cluster: deployment-bastion.eqiad.wmflabs `/var/` keeps filling up - https://phabricator.wikimedia.org/T98886#1279989 (10thcipriani) 3NEW
[16:10:34] 10Beta-Cluster, 6Release-Engineering: Determine weekly triage meeting for Beta Cluster - https://phabricator.wikimedia.org/T98204#1280048 (10mmodell) It's been proposed to do this right after the deployment cabal meeting on Mondays.
[16:26:31] 6Release-Engineering, 6Engineering-Community, 6Team-Practices, 3ECT-May-2015: RelEng team offsite - May 2015 - Pre Wikimedia Hackathon - https://phabricator.wikimedia.org/T89036#1280117 (10Rfarrand)
[16:29:06] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[16:31:20] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30118 bytes in 2.654 second response time
[16:31:32] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46824 bytes in 0.627 second response time
[16:33:12] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 47111 bytes in 0.686 second response time
[16:33:51] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 46815 bytes in 1.209 second response time
[16:39:27] (03PS3) 10Ragesoss: Allow generic ApiError without response, fix error with token_type [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 (https://phabricator.wikimedia.org/T98719)
[16:40:49] (03CR) 10Ragesoss: "Thanks! I think PS3 should satify rubocop." [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 (https://phabricator.wikimedia.org/T98719) (owner: 10Ragesoss)
[16:47:55] (03CR) 10Zfilipin: "recheck" [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 (https://phabricator.wikimedia.org/T98719) (owner: 10Ragesoss)
[16:48:16] (03CR) 10jenkins-bot: [V: 04-1] Allow generic ApiError without response, fix error with token_type [ruby/api] - 10https://gerrit.wikimedia.org/r/210005 (https://phabricator.wikimedia.org/T98719) (owner: 10Ragesoss)
[16:59:21] 10Continuous-Integration-Infrastructure: Unattended upgrade seems to only run daily instead of hourly - https://phabricator.wikimedia.org/T98885#1280214 (10hashar) The /etc/cron.daily/apt script is not meant to run hourly. When checking the upgrade timestamp, it compares against midnight and only run once per d...
[17:02:11] the HHVM magically upgrading should be fixed by https://gerrit.wikimedia.org/r/#/c/210391/
[17:05:16] 10Continuous-Integration-Infrastructure, 7HHVM, 5Patch-For-Review: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core due to hhvm upgrade from 3.3.1+dfsg1-1+wm3.1 to 3.6.1+dfsg1-1+wm2 - https://phabricator.wikimedia.org/T98876#1280224 (10hashar) I have cherry picked the gerrit change to disab...
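The gerrit change mentioned at 17:02 disables unattended upgrades through puppet; a hand-rolled equivalent on a single slave would be turning off the APT::Periodic switch that the /etc/cron.daily/apt script checks. A sketch (the config filename is arbitrary, and the actual change 210391 may implement this differently):
```
# Tell the daily apt cron job not to run unattended-upgrades at all.
# 99disable-unattended-upgrades is an arbitrary local filename.
echo 'APT::Periodic::Unattended-Upgrade "0";' \
  | sudo tee /etc/apt/apt.conf.d/99disable-unattended-upgrades
```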
[17:05:43] in case HHVM ends up being upgraded magically
[17:05:44] https://phabricator.wikimedia.org/T98876#1279742
[17:05:47] has all the commands
[17:06:26] !log Disabled Debian unattended upgrade via https://gerrit.wikimedia.org/r/210391 to prevent it from magically upgrading HHVM and causing build to fail ( {{bug:T98876}} )
[17:06:29] should be good
[17:06:33] have sweet dreams
[17:06:39] hashar: sounds like fun :)
[17:06:49] thx
[17:16:32] 10Continuous-Integration-Infrastructure, 7HHVM, 5Patch-For-Review: "mediawiki-phpunit-hhvm" failures on all changes in mediawiki/core due to hhvm upgrade from 3.3.1+dfsg1-1+wm3.1 to 3.6.1+dfsg1-1+wm2 - https://phabricator.wikimedia.org/T98876#1280248 (10hashar) So puppet deployed it and that would prevent HH...
[17:30:12] 10Beta-Cluster, 10Analytics-EventLogging, 10VisualEditor: Beta cluster is sending VisualEditor events to production bits.wikimedia.org/statsv - https://phabricator.wikimedia.org/T98196#1280307 (10ori) 5Open>3Resolved a:3ori No, this problem would predate that, and setting the URL to be relative actuall...
[17:30:35] 10Beta-Cluster, 10Analytics-EventLogging, 10VisualEditor: Beta cluster is sending VisualEditor events to production bits.wikimedia.org/statsv - https://phabricator.wikimedia.org/T98196#1280315 (10Jdforrester-WMF) Aha, thanks!
[17:31:45] 10Continuous-Integration-Infrastructure, 10Wikipedia-Android-App, 5Patch-For-Review: Android app build: Gradle checkstyle + app build - https://phabricator.wikimedia.org/T88494#1280318 (10bearND) @hashar Thank you very much for getting this done. I think we keep the lint task separate since it takes so lon...
[18:22:03] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #223: ABORTED in 4 hr 0 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/223/
[19:05:57] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #316: FAILURE in 12 min: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/316/
[19:29:36] 10Beta-Cluster: deployment-bastion.eqiad.wmflabs `/var/` keeps filling up - https://phabricator.wikimedia.org/T98886#1280597 (10hashar) Process accounting and /var being too small got filled as {T91354} already. Yuvi provided new labs images that have a unified and larger root partition. So we will need to recr...
[19:37:44] !log Applied cherry-picks for logstash config: https://gerrit.wikimedia.org/r/#/c/210277 & https://gerrit.wikimedia.org/r/#/c/210278
[20:23:04] 10Continuous-Integration-Infrastructure, 10Wikipedia-Android-App, 5Patch-For-Review: Android app build: Gradle checkstyle + app build - https://phabricator.wikimedia.org/T88494#1280759 (10hashar) I think the "long" build time is due to gradle downloading materials. Once cached it is fast and lint is faster t...
[21:06:00] 10Beta-Cluster: deployment-bastion.eqiad.wmflabs `/var/` keeps filling up - https://phabricator.wikimedia.org/T98886#1280954 (10thcipriani) 5Open>3Resolved a:3thcipriani Closing as duplicate of T91354
[21:07:18] bd808: any more merges needed to finish up the logging on beta stuff?
[21:07:56] yeah... the scary one -- https://gerrit.wikimedia.org/r/#/c/210253/
[21:08:46] bd808: let me look at it and do it now
[21:09:42] ok, confirmed that lithium == syslog
[21:10:16] bd808: should it have an ensure => absent for enable = false?
[21:10:22] bd808: because otherwise it’ll keep running
[21:15:46] yuvipanda: hmmm.. I think removal will be taken care of by other rules (recursive on the etc dir)
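One cheap way to verify a change like this when pcc is flaky (as noted earlier and again just below) is a no-op puppet run on an affected host with the patch cherry-picked on the puppetmaster; it reports what would change without applying anything. A minimal sketch, assuming root on the host:
```
# Dry-run puppet: report (but do not apply) any catalog differences.
# An empty "would have changed" set supports the "no changes in prod" claim.
sudo puppet agent --test --noop
```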
[21:15:54] let me verify that
[21:16:02] bd808: ok! if so, can you put a comment there?
[21:17:25] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL 100.00% of data above the critical threshold [43200.0]
[21:17:32] 10Continuous-Integration-Infrastructure: Zuul repositories have too many refs causing slow updates - https://phabricator.wikimedia.org/T70481#1280969 (10hashar) Someone has bring the topic on the openstack-infra mailling list. So I followed up on the reviews that were pending on https://review.openstack.org/#/c/...
[21:18:01] yuvipanda: {{done}} (and I was right)
[21:18:08] bd808: \o/ sweet
[21:18:19] bd808: so if I’m reading this right there should basically be no changes on $RANDOMHOSTINPROD
[21:18:24] puppet diff should be empty
[21:18:40] If I didn't mess up, yes
[21:18:55] when I tried pcc it puked
[21:19:00] yeah
[21:19:02] it pukes often
[21:24:53] thcipriani: and yeah deployment-bastion needs to be rebuild using an instance with the new partitioning scheme :/
[21:24:58] thcipriani: that is a bit tedious though
[21:25:16] hashar: partman recipes
[21:25:19] brandon black also filled a bug to get the beta cluster varnish to be switched from Trusty to Jessie to match prod
[21:25:47] mutante: I don't think we can resize partitions on labs instance though
[21:25:56] they are not managed by LVM
[21:26:40] hashar: i didnt mean resizing as in "parted", i meant that in production when installing a server and PXE boot it,
[21:27:02] we tell it to use a certain partman recipe and then it will end up with that partitioning scheme and be able to autoinstall without user interaction
[21:27:16] maybe that would be possible for instances too
[21:27:43] so having to repeatedly install the same thing would be much faster
[21:36:18] mutante: yeah that is doable as well :]
[21:36:38] the current partitioning scheme should be fine though
[21:36:41] and I am off
[21:36:44] night/bed etc
[21:36:54] g'night!
[21:41:33] well that was confusing because it was a reply to
[21:41:37] "needs to be rebuild using an instance with the new partitioning scheme :/"
[21:47:02] RECOVERY - Puppet staleness on integration-saltmaster is OK Less than 1.00% above the threshold [3600.0]
[21:52:42] dear releng, I'm having some admin issues with the Event logging instance on the beta cluster: deployment-eventlogging02
[21:53:21] namely, /var has filled up and I'd like to clean up the innodb data files there, but I don't have root access to the mysql instance
[21:55:31] milimetric: hmm, you can probably recover the password...
[21:55:58] yuvipanda: I didn't want to mess anything else up
[21:56:04] in case other people had set it to something, etc.
[21:56:07] milimetric: hmm, do you know who set up eventlogging?
[21:56:12] nope
[21:56:20] milimetric: you can create a backdoor root2 account :)
[21:57:01] true... but I think that's why I asked for help. If we don't all know what happens when I try to act as an ops person, that's a good thing. We should keep it that way :)
[21:57:36] milimetric: :) so in general I’ve no idea who’s responsible for eventlogging on beta, so hard to point to someone in particular and say ‘ask them'
[21:57:51] "analytics"?
[21:58:04] ^ +1
[21:58:05] but, can't we just give milimetric root on the project?
[21:58:12] he does have root
[21:58:22] oh, just the mysql "root" pw?
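The "backdoor root2 account" idea suggested above is the standard MySQL password-recovery trick: restart mysqld with the grant tables disabled, add a second superuser, then restart normally. A sketch for an Ubuntu/MySQL 5.5-era instance like this one (account name and password are placeholders; as it turns out below, the existing password was simply documented on wikitech):
```
# Recover access by creating a second all-privileges account.
# --skip-networking keeps the unauthenticated server off the network
# while the grant tables are disabled.
sudo service mysql stop
sudo mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root <<'SQL'
FLUSH PRIVILEGES;   -- reload grant tables so account changes are allowed
CREATE USER 'root2'@'localhost' IDENTIFIED BY 'CHANGE_ME';
GRANT ALL PRIVILEGES ON *.* TO 'root2'@'localhost' WITH GRANT OPTION;
FLUSH PRIVILEGES;
SQL
sudo mysqladmin shutdown   # stop the temporary instance
sudo service mysql start   # start mysqld normally again
```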
[21:58:24] whoever setup eventlogging02 set a mysql root password
[21:58:27] * greg-g nods
[21:58:28] and now we don’t know where it is
[21:58:33] quality
[21:58:39] lastlog ?
[21:58:40] it’s not in /root/.my.cnf
[21:58:42] .... is our middle name
[21:58:47] which is where I’ll expect it to be
[21:58:54] check who logged in on that instance
[21:59:04] then blame them
[21:59:43] 10Beta-Cluster, 6Labs, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1281086 (10bd808) 5Open>3Resolved I think all beta should be using NFS for now is images and homedirs.
[21:59:44] ok, /me backs away slowly :)
[21:59:47] :)
[21:59:56] this is getting into some blamey thing. No worries, I'll deal with it somehow
[22:00:02] delete some random files or something
[22:00:04] yeah
[22:00:06] blame is bad
[22:00:12] milimetric: so I suggest creating a root2 account
[22:00:21] wait, wait
[22:00:24] let me find ways to do that
[22:00:27] bd808++ rock on
[22:00:34] yeah, bd808 <3
[22:00:36] you just said "no idea who’s responsible for eventlogging on beta"
[22:00:43] and that was how to find out
[22:01:34] mutante: not really. it’s used by a lot of people, doesn’t mean they set it up
[22:01:53] that many people have root there?
[22:02:40] yeah, it's a rooty instance
[22:02:46] people use it to test EL changes
[22:02:47] and because we have so many roots, we are making root2?
[22:03:03] mutante: yes, this is part of beta cluster and if more opsen had even vague ideas of how it’s doing the world would be a better place :)
[22:03:09] I guess what I should do is move lib/mysql from /var where it's killing space to somewhere with more space
[22:03:16] but I completely lack the ops talent to do this
[22:03:22] wait, what?
[22:03:24] milimetric: yeah, moving it to ::srv would be useful
[22:03:49] milimetric: do you want me to do that?
[22:03:58] er, another prod/beta diff?
[22:04:07] hmm
[22:04:16] yuvipanda: /srv isn't even mounted here
[22:04:47] yuvipanda: if it's easier to point me at docs, I can do it, I don't want to be a pain
[22:05:08] yuvipanda: another "recreate the instance" instance?
[22:05:19] greg-g: yeah, that’s one solution but it’s not the full solution here.
[22:05:40] greg-g: because I think EL on prod has its dbs on a *huge* /var - because it’s a database, it’s on separate disks and what not
[22:06:08] milimetric: nah, let’s figure out a better solution for this so you guys can support it better :)
[22:06:27] milimetric: do you know which machines in prod closely mirror this instance?
[22:06:50] yuvipanda: why not role::labs::lvm::srv and a symlink?
[22:07:21] bd808: so that was what I was suggesting, and then greg-g brought up that that’s a prod / beta difference, and now I’m not sure what to do
[22:07:22] bd808's solution sounds good
[22:07:24] i saw somebody needing a password, i suggested a command to look up who set the password. apparently that is bad blaming and there is something about ops and responsibility .. ok
[22:07:25] maybe there's even a hiera param for mysql's storage dir?
[22:07:30] yuvipanda: it's different from prod anyway
[22:07:32] bd808: yeah, so that’s an ideal one.
[22:07:37] because the mysql cluster in prod is a separate machine
[22:07:43] aaah
[22:08:00] All I know about EL is "blame O.ri"
[22:08:05] so it’s an amalgam of several roles that have different hosts in prod moved into one in beta?
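bd808's "role::labs::lvm::srv and a symlink" suggestion boils down to relocating the InnoDB data out of the small /var onto the /srv mount and leaving a symlink so mysqld's configured datadir still resolves. A sketch of the manual steps, assuming /srv is already mounted with enough space and that AppArmor (if enforcing for mysqld on this instance) is adjusted to allow the new path:
```
# Move the MySQL datadir from /var to /srv and symlink it back.
# On Ubuntu the mysqld AppArmor profile may also need /srv/mysql/** added
# before mysql will start from the new location.
sudo service mysql stop
sudo mv /var/lib/mysql /srv/mysql
sudo ln -s /srv/mysql /var/lib/mysql
sudo chown -R mysql:mysql /srv/mysql
sudo service mysql start
```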
[22:08:35] * milimetric didn't mean to make mutante feel bad
[22:08:38] milimetric: if ^ is the case, then ideal long term solution, especially if you’re pointing people to beta to test, is to make that also be more accurately represented in beta, I think
[22:09:04] yuvipanda: yes, but EL is also changing pretty rapidly
[22:09:13] we're bringing it into the hadoop pipeline
[22:09:22] milimetric: hmm, I see.
[22:10:00] greg-g: see ^ - are they expected to repro that in beta too? or…? I don’t think the full analytics infrastructure is replicated in beta
[22:10:47] it's not, but we are actively trying to figure out how to allow people to keep testing their instrumentation
[22:11:20] milimetric: right. so longer term, I guess you guys will need to prioritize and set up EL on beta.
[22:11:25] it's just we don't know yet what event logging will look like even a month from now
[22:11:30] milimetric: super short term, I guess a symlink will work
[22:11:38] it's just we don't know yet what event logging will look like even a month from now
[22:11:53] right
[22:12:29] ok, works for me. I'll give the symlink thing a shot
[22:12:38] milimetric: I just enabled /srv partition on that host
[22:12:44] milimetric: it should be provisioned shortly
[22:12:52] milimetric: oh wait, puppet is disabled on that host
[22:13:08] I can run puppet, I'm already on it
[22:13:12] milimetric: https://phabricator.wikimedia.org/T96921
[22:13:25] milimetric: sweet. so you’ll get a /srv when you have puppet running
[22:13:38] milimetric: but I think this should be a bigger conversation you guys have within your team at some point
[22:14:23] i don't think "replicate current production EL in beta cluster" is a discussion we'll have.
[22:14:41] But we're definitely already having the "replicate the eventual EL setup in the beta cluster" discussion
[22:14:43] milimetric: sure, but ‘support people who need to test their EL on betacluster'
[22:14:49] in whatever form that is :)
[22:14:56] yes, definitely
[22:15:13] :)
[22:15:15] I’ll brb
[22:19:14] :( it seems this issue that prevents puppet from running is way above my head. I'm going to give up now and let real ops people do this. ottomata is sadly on vacation, and we don't have any other puppet knowledge on the team
[22:20:47] milimetric: :( ok
[22:23:01] bd808: heh, that’s 186G of logs
[22:23:33] So, did we give up trying to figure out who set the root password? Maybe an email to some analytics list?
[22:23:35] bd808: on NFS, that is.
[22:23:44] yuvipanda: yikes
[22:24:06] we are going to need more aggressive cleanup rules on deployment-fluorine
[22:24:09] bd808: not deleting right now, will do at some point.
[22:24:24] * yuvipanda bd808: better than the 1T single log file we found on tools >_>
[22:24:32] what root password are we looking for?
[22:24:37] mysql on deployment-eventlogging02?
[22:24:45] yeah
[22:24:46] Krenair: yeah
[22:24:48] isn't that 'secret'?
[22:24:56] lol really?
[22:25:08] haha
[22:25:09] it is
[22:25:10] milimetric: ^
[22:25:19] Krenair: where does this piece of info come from?
[22:25:23] wikitech
[22:25:26] :)
[22:25:30] thx
[22:25:31] https://wikitech.wikimedia.org/wiki/EventLogging/Testing/BetaLabs#Database
[22:25:34] now we have to change it to "toomanysecrets"
[22:26:14] * yuvipanda puts on hat of shame for not even thinking of looking at wikitech
[22:26:16] that was hard :p
[22:26:35] heh
[22:26:59] I didn't think to look 'cause why would a root password be there.
[22:27:05] it's beta
[22:27:16] yeah, but there's still PII in that db!
[22:27:46] "there's still PII in that db" other than IP addresses?
[22:27:49] we don't put anything particularly private in beta because you don't even need an NDA to get access
[22:28:53] remember kids: don't use your real production password in beta
[22:29:42] bd808: IP addresses are salted, but data captured by EL can be sensitive in combination
[22:29:44] even if you did it would still be a bad idea
[22:29:51] 10Beta-Cluster, 6Labs, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1281134 (10yuvipanda) Awesome, thanks bd808 :) about 186G of logs are in /data/project/logs/archive, I'll delete them in a week. Does that sound ok?
[22:30:28] *salted and hashed that is
[22:31:10] it's just there seems to be a pretty big discrepancy between how much care we take of production EL data and beta EL data when they're technically the same and can include real users just the same
[22:38:01] why would they be the same?
[22:38:30] beta is not production and is not to be trusted like production is
[22:43:55] I mean, the collection of the data is the same
[22:44:23] and even if we don't trust it as much as production, we should still limit our attack surface area within reason
[22:44:45] like, I wouldn't create a public websocket on the internet and start pushing all beta cluster EL events to it
[22:51:23] RECOVERY - Free space - all mounts on deployment-eventlogging02 is OK All targets OK
[22:52:36] milimetric, well, in theory it should still be accessible to members of the deployment-prep project only
[22:52:54] is it the password-on-wiki that you're unhappy about? because that can be solved
[22:53:26] Krenair: it's ok, this is all changing soon anyway.
[22:53:39] just thinking out loud. Thanks for the help, I freed up a bit of space
[22:53:41] the eventlogging setup?
[22:53:50] yeah, the setup doesn't scale
[22:54:00] we have the search team asking us to instrument monster amounts of data
[22:54:18] we need to put it in kafka and figure out the rest of the pipeline from there
[23:06:56] Yippee, build fixed!
[23:06:56] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #669: FIXED in 55 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/669/
[23:15:00] (03PS1) 10Thcipriani: Change beta-update-databases to python script [integration/config] - 10https://gerrit.wikimedia.org/r/210619 (https://phabricator.wikimedia.org/T96199)