[01:18:47] Yippee, build fixed!
[01:18:47] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #242: FIXED in 46 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/242/
[01:55:07] Yippee, build fixed!
[01:55:08] Project browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-os_x_10.10-iphone-sauce build #59: FIXED in 1 min 6 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-os_x_10.10-iphone-sauce/59/
[02:35:44] Yippee, build fixed!
[02:35:44] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #561: FIXED in 2 min 43 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/561/
[02:41:59] Yippee, build fixed!
[02:41:59] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #525: FIXED in 58 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/525/
[03:15:38] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<11.11%)
[05:06:47] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[05:34:52] Yippee, build fixed!
[05:34:53] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #408: FIXED in 32 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/408/
[07:05:38] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[07:23:40] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce build #430: FAILURE in 58 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce/430/
[09:06:44] zeljkof: thanks a ton!
[09:15:55] hasharOut: no problem :)
[10:17:04] (PS2) Zfilipin: Remove support for MEDIAWIKI_PASSWORD_VARIABLE [selenium] - https://gerrit.wikimedia.org/r/208119 (owner: Dduvall)
[10:17:10] (CR) Zfilipin: [C: 2] Remove support for MEDIAWIKI_PASSWORD_VARIABLE [selenium] - https://gerrit.wikimedia.org/r/208119 (owner: Dduvall)
[10:18:21] (Merged) jenkins-bot: Remove support for MEDIAWIKI_PASSWORD_VARIABLE [selenium] - https://gerrit.wikimedia.org/r/208119 (owner: Dduvall)
[10:19:19] (CR) Zfilipin: [C: 2] Log browser test results to Raita [integration/config] - https://gerrit.wikimedia.org/r/208403 (owner: Dduvall)
[10:21:19] (Merged) jenkins-bot: Log browser test results to Raita [integration/config] - https://gerrit.wikimedia.org/r/208403 (owner: Dduvall)
[10:27:15] (CR) Zfilipin: "All browsertests* jobs are updated." [integration/config] - https://gerrit.wikimedia.org/r/208403 (owner: Dduvall)
[10:45:49] PROBLEM - Puppet staleness on deployment-restbase02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[12:35:31] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[12:54:56] Release-Engineering, Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1256661 (Christopher) @mmodell, OK. Sprint is current with D12603 28.04.2015 upstream. Have you decided what upstream commit will be used?
[13:00:31] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]
[13:26:02] (PS1) Hashar: Change operations-puppet-typos to use shallow git [integration/config] - https://gerrit.wikimedia.org/r/208633
[13:28:35] (CR) Hashar: [C: 2] Change operations-puppet-typos to use shallow git [integration/config] - https://gerrit.wikimedia.org/r/208633 (owner: Hashar)
[13:30:22] (Merged) jenkins-bot: Change operations-puppet-typos to use shallow git [integration/config] - https://gerrit.wikimedia.org/r/208633 (owner: Hashar)
[13:33:57] Continuous-Integration-Infrastructure, Patch-For-Review: Migrate all jobs to labs slaves - https://phabricator.wikimedia.org/T86659#1256751 (hashar)
[13:34:12] Continuous-Integration-Infrastructure, Patch-For-Review: Migrate all jobs to labs slaves - https://phabricator.wikimedia.org/T86659#973540 (hashar)
[13:34:55] Continuous-Integration-Infrastructure, Patch-For-Review: Migrate all jobs to labs slaves - https://phabricator.wikimedia.org/T86659#973540 (hashar) I have added some jobs that do not have any `node:` and hence roam on any of our slaves (the Jenkins XML has: `<canRoam>true`).
[13:44:37] Release-Engineering, Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1256783 (mmodell) @christopher: https://github.com/phacility/phabricator/commit/1dfe1f49f78064e479284748b47d385e520f0a96 Unless you see a problem with that?
[13:52:01] (CR) 20after4: [C: 1] make-release: Don't re-list all bundled extensions for each release [tools/release] - https://gerrit.wikimedia.org/r/201246 (owner: Legoktm)
[13:52:16] (CR) 20after4: [C: 1] make-release: Add option to list all bundled extensions [tools/release] - https://gerrit.wikimedia.org/r/201247 (owner: Legoktm)
[14:02:07] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[14:04:40] Release-Engineering, Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1256838 (mmodell) @christopher: phab-01 is now running [[ https://git.wikimedia.org/commit/phabricator%2Fphabricator.git/a8abd75e978dbac6cc185ce23fda36c44fde218a | release/201...
[14:11:31] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:25:51] Project beta-scap-eqiad build #51559: FAILURE in 1 min 43 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/51559/
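For context on the operations-puppet-typos change merged at 13:30 above: a shallow clone is a standard way to speed up throwaway CI checkouts of a large repository like operations/puppet, since a lint-style job has no use for history. A minimal sketch of the idea; the actual job definition lives in integration/config:

    # Fetch only the tip commit instead of the full history:
    git clone --depth 1 \
        https://gerrit.wikimedia.org/r/operations/puppet src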
[14:35:18] Yippee, build fixed!
[14:35:19] Project beta-scap-eqiad build #51560: FIXED in 1 min 8 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/51560/
[14:46:07] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[14:46:23] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[14:47:30] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[14:47:48] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[14:48:10] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[14:52:38] Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1257035 (demon) >>! In T90683#1255823, @mmodell wrote: > > >>>! In T90683#125...
[15:07:30] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:09:08] morning zeljkof! How's it going? :)
[15:09:17] AndyRussG: good morning
[15:09:40] all good here
[15:09:46] Cool!
[15:09:50] how has your week started?
[15:10:25] Not too bad... this morning's rush to get the kids to school was pretty calm, so that's good!
[15:11:04] This afternoon is gonna be hectic tho since I have meetings, and my older daughter also needs help preparing a presentation she'll give at school tomorrow ;)
[15:11:21] She's mostly ready, though, so it should go OK
[15:12:43] (I'm actually thinking about seeing if I can take a couple of my vacation/sick days this week, to deal with a big backlog of non-work stuff like that... but yeah! all good here too :) )
[15:12:48] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[15:13:10] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:13:17] Wanna schedule a hangout later this week to debug the intermittent CentralNotice browser test failures? I really want to dig into it some and try to run the tests on Sauce Labs from my own machine...
[15:13:52] I set up cucumber finally on this machine and am able to run the tests locally...
[15:14:49] Also, I temporarily disabled the e-mails to fr-tech for browser tests (without going through Gerrit--just made the change and updated the jobs) because it's not useful to the rest of the team right now to get so many false negatives... Hope that's OK!
[15:15:15] zeljkof: ^
[15:16:08] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]
[15:16:13] AndyRussG: sure, my calendar is always up to date, just pick a free slot this week :)
[15:16:22] RECOVERY - Puppet failure on deployment-zookeeper01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:16:32] zeljkof: K, fantastic! thanks :)
[15:16:38] also, disabling something in the web interface is really a short-term fix
[15:17:05] I have updated all browsertests* jobs this morning (my time) with a fix, so your change is maybe already overwritten
[15:17:18] zeljkof: yes, agreed! actually I did it from the command line, just about 5 minutes ago
[15:17:46] AndyRussG: changes from the command line also get overwritten if they are not in the JJB repo
[15:18:05] I have deployed the latest master a few hours ago, so...
[15:18:10] yeah... it's OK if it gets overwritten, not a big deal
[15:19:11] but yes, my idea is just to make it temporary, for a few days, on the assumption that we can get things to stabilize. I'm imagining it may be a timeout/network/Sauce Labs performance issue, since it seems pretty random, though more frequent on the mobile tests (which I think may take a bit longer)
[15:22:49] ...also, just a heads-up, it might make sense for us to create a bunch more .feature tests in the coming months, since we're heading into some new functionality that will rely on more recent browser tech (localStorage, specifically)
[15:23:28] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:26:12] zeljkof: K I grabbed a slot... thanks!
[15:27:31] AndyRussG: see you on Thursday! :)
[15:39:48] Blocked-on-RelEng, Release-Engineering, Multimedia, Reading-Infrastructure-Team, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1257170 (faidon) >>! In T84956#1166762, @hashar wrote: > @Gilles and I had a quick conf call yesterday. Seems the Debian packa...
[15:46:23] Release-Engineering, Multimedia, Reading-Infrastructure-Team, Patch-For-Review, Puppet: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1257199 (bd808)
[15:46:41] PROBLEM - Host deployment-sca01 is DOWN: CRITICAL - Host Unreachable (10.68.17.54)
[15:47:39] RECOVERY - Host deployment-sca01 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[15:49:22] Release-Engineering, Multimedia, Reading-Infrastructure-Team, Patch-For-Review, Puppet: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1257228 (bd808) I removed the scrum of scrums and blocked on releng tags. I'll work with @tgr to come up with a basic plan on how t...
[15:51:26] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0]
[15:52:40] I'm about to poke the salt config on deployment-salt, which will require a restart. You'll hardly notice. It's only for testing something...
[15:56:18] Beta-Cluster, Graphoid: Deploy Graphoid on Beta Cluster - https://phabricator.wikimedia.org/T97606#1257255 (akosiaris) @mobrovac, deployment-prep's parsoidcache would be autoupdated but https://gerrit.wikimedia.org/r/#/c/208644/ was missing. Done and parsoidcache is updated. That being said, it is not re...
[15:57:00] already done
[16:01:25] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:29:07] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:29:07] (PS3) Florianschmidtwelzow: Add OOJsUIAjaxLogin extension [integration/config] - https://gerrit.wikimedia.org/r/207758
[17:36:39] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<22.22%)
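A sketch of the round trip zeljkof describes above: the browsertests* jobs are generated from YAML in the integration/config repo by Jenkins Job Builder, so any edit made in the web UI or from the command line lasts only until the next deployment. Job names here are illustrative:

    # Preview the XML that would be generated, without touching Jenkins:
    jenkins-jobs test jjb/ 'browsertests-CentralNotice-*'
    # Push the regenerated jobs to Jenkins, overwriting any manual
    # tweaks made outside the repo:
    jenkins-jobs update jjb/ 'browsertests-CentralNotice-*'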
[17:46:36] !log integration-slave-precise-1014 died trying to clone mediawiki/core.git with "fatal: destination path 'src' already exists and is not an empty directory."
[17:46:41] Logged the message, Master
[17:51:38] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[18:19:23] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused
[18:21:31] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:23:02] Beta-Cluster, Regression: All http://bits.beta.wmflabs.org/static/master/* file urls return HTTP 403 Forbidden - https://phabricator.wikimedia.org/T98046#1258077 (Krinkle) NEW
[18:24:07] Beta-Cluster: Occasionally getting 403 HTTP Method not allowed from bits - https://phabricator.wikimedia.org/T93021#1258095 (Krinkle) Open>Resolved a: Krinkle Works for me.
[18:24:15] Beta-Cluster, Regression: All http://bits.beta.wmflabs.org/static/master/* file urls return HTTP 403 Forbidden - https://phabricator.wikimedia.org/T98046#1258098 (Krinkle)
[18:24:27] Beta-Cluster, Deployment-Systems, Regression: All http://bits.beta.wmflabs.org/static/master/* file urls return HTTP 403 Forbidden - https://phabricator.wikimedia.org/T98046#1258077 (Krinkle)
[18:29:17] greg-g: hey, here?
[18:29:58] JohnLewis: I am, what's up?
[18:31:00] greg-g: who is best to talk with about scap?
[18:31:14] twentyafterfour: ^d
[18:31:47] greg-g: thanks :) /me waits for one to ack their available
[18:31:53] *they're
[18:35:19] <^d> hrm yo?
[18:37:34] ^d: I'm looking at the scap module and it has variables for tin.eqiad.wmnet and was wondering if changing those variables is all that is necessary for making scap work on a different server if it had the same environment.
[18:39:26] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.025 second response time
[18:39:44] <^d> JohnLewis: Halfway work. The scap destinations currently aren't puppetized well. I have a WIP patch for moving that into hiera.
[18:39:51] <^d> Then it should Just Work after configuring
[18:40:37] ^d: right. because I'm working on creating a few patches for setting up mira in codfw and scap needs to be on it so :)
[18:41:08] <^d> https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:dsh-cleanup,n,z is part of it
[18:45:25] ^d: I guess a scapless mira is fine for now :)
[18:47:40] (CR) Catrope: "This doesn't seem to have been deployed? https://gerrit.wikimedia.org/r/#/c/206013/ PS10 was still tested with the old set of jobs." [integration/config] - https://gerrit.wikimedia.org/r/208342 (owner: Catrope)
[18:56:53] qa-morebots: halp
[18:56:53] I am a logbot running on tools-exec-1213.
[18:56:54] Messages are logged to https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL.
[18:56:54] To log a message, type !log <msg>.
[18:58:34] (CR) Legoktm: "[11:57:55] Oh shit" [integration/config] - https://gerrit.wikimedia.org/r/208342 (owner: Catrope)
[18:58:57] (PS2) Legoktm: Run jsduck for Flow [integration/config] - https://gerrit.wikimedia.org/r/208339 (owner: Catrope)
[19:00:24] (CR) Legoktm: [C: 2] Run jsduck for Flow [integration/config] - https://gerrit.wikimedia.org/r/208339 (owner: Catrope)
[19:02:40] (Merged) jenkins-bot: Run jsduck for Flow [integration/config] - https://gerrit.wikimedia.org/r/208339 (owner: Catrope)
[19:03:41] !log deploying https://gerrit.wikimedia.org/r/208339
[19:03:46] Logged the message, Master
[19:33:37] Release-Engineering, Ops-Access-Requests, operations, Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1258368 (coren) @greg: Can I get approval language for this request, please?
[19:44:36] ^demon|lunch / greg-g: so far I have https://gerrit.wikimedia.org/r/#/c/208723/ for mira. while mergeable now, it'll have tin as the rsync host and so. (scap needs hiera :) )
[20:02:27] JohnLewis: FWIW, you could add a section to the scap.cfg file for codfw in the interim to override the rsync host
[20:02:49] not that that is desirable, just don't know your timeline
[20:03:12] thcipriani: timeline is infinite really. a good solution whenever :)
[20:05:09] thcipriani: I was thinking of moving the rsync_host variable only to hiera for now for this but I'm unsure
[20:05:22] hierization of scap stuffs is probably the best solution in the near term, there are a few patches out there.
[20:06:03] are there any for the scap::master variables? if not I'll patch one up
[20:07:54] so, one approach I had at one point, was to use /etc/scap.cfg
[20:08:33] which can be used to override the default scap config file. Honestly not sure why I went that direction rather than parameterize the _actual_ scap config file
[20:09:10] here is that patch if it's helpful at all: https://gerrit.wikimedia.org/r/#/c/198173/2
[20:11:55] thcipriani: hm.
[20:14:06] JohnLewis: that's a fairly rough solution, and a bit non-obvious at that
[20:15:02] thcipriani: I'm debating whether just putting all the config into hiera common and then overriding it in per-dc config is a solution that is nice
[20:18:36] I wish there was a configuration level somewhere between host and datacenter in there tbh: my patch's original intent was to make scap configurable for clusters of hosts in beta
[20:18:42] anyone know how to force sync beta-labs?
[20:18:46] it's borked :(
[20:18:54] http://en.wikipedia.beta.wmflabs.org/wiki/Wikipedia:Edit_filter/False_positives
[20:22:32] thcipriani: hiera can be configured on global, datacenter, group and host levels
[20:22:49] (iirc)
[20:23:00] JohnLewis: I meant scap.cfg
[20:23:12] oh :p
[20:23:23] re. that I don't know anymore :)
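A minimal sketch of the interim override thcipriani mentions above: scap reads an INI-style config, and a host-local /etc/scap.cfg can shadow the repo default. The section and key names below are illustrative, not the real schema; check scap's shipped scap.cfg for the actual ones:

    # Hypothetical /etc/scap.cfg overlay on mira, pointing the sync
    # master at a codfw host instead of tin.eqiad.wmnet:
    sudo tee /etc/scap.cfg <<'EOF' >/dev/null
    [global]
    master_rsync: mira.codfw.wmnet
    EOF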
[20:23:24] yurik: looks like deployment-bastion may be stuck...
[20:23:26] @seen andre__
[20:23:45] https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/
[20:23:57] @summon bugmeister
[20:24:06] thcipriani: I'll make a hiera patch though and let you look at it as a solution for now, then any work you do can go off it
[20:24:18] * yurik doesn't like bastion
[20:24:31] especially when it gets stuck
[20:24:39] JohnLewis: sounds good to me
[20:26:37] well, looks like according to https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Jenkins_execution_lock I just need to take that node offline and bring it back online
[20:55:28] !log marking node deployment-bastion offline due to stuck jenkins execution lock
[20:55:31] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:35] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:36] NFS is being switched over folks
[20:55:37] thcipriani: ^ why is mw still dying when NFS dies?
[20:55:37] I thought we made deployment-prep not depend on NFS
[20:55:38] PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:38] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:38] PROBLEM - App Server bits response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:38] PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:40] thcipriani: https://gerrit.wikimedia.org/r/#/c/208801/
[20:55:42] yuvipanda: I'm not sure what NFS changes you're referring to. Coren made some changes to idmapd recently on deployment-prep.
[20:55:42] thcipriani: no I mean, months ago we used to have the mediawiki code on NFS
[20:55:42] thcipriani: and then it'd die when NFS dies
[20:55:43] thcipriani: but right now nothing except the images should be on NFS
[20:55:43] thcipriani: so I'm not sure why it dies when NFS dies
[20:55:44] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:45] bd808: could you take a look at https://gerrit.wikimedia.org/r/#/c/208801/ when you get a few spare moments?
[20:55:54] JohnLewis: sure. I can take a look in a couple of hours
[20:56:41] RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 0.002 second response time
[20:56:50] bd808: okay :)
[21:00:07] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[21:01:27] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 48379 bytes in 0.940 second response time
[21:01:48] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29242 bytes in 0.744 second response time
[21:02:02] RECOVERY - App Server bits response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 0.002 second response time
[21:02:08] RECOVERY - App Server bits response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 0.002 second response time
[21:03:34] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 48100 bytes in 0.694 second response time
[21:04:51] hmm, looks like load on these boxes exploded without NFS
[21:05:00] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 48085 bytes in 0.863 second response time
[21:06:09] thcipriani: I wonder if we should disable NFS for homedirs on deployment-prep
[21:06:14] hmmm
[21:06:20] ideally we'd just make NFS more reliable
[21:06:39] I think NFS should be as needed and not default? :)
[21:06:45] chasemp: that too.
[21:07:01] the majority, not even just half of use cases seem like borderline abuse
[21:07:31] yes
[21:08:42] huh, what are the use cases of NFS /home dirs in deployment-prep?
[21:09:14] I think we mean there are none that we know of that are legit but maybe some are ...illegit?
[21:09:25] possibly too legit
[21:09:26] to legit to quit
[21:09:30] beat ya
[21:09:36] and I typo'd
[21:09:47] I wasn't going to throw it in your face or anything
[21:10:28] greg-g: chasemp thcipriani so killing NFS homedirs will be a good idea, methinks.
[21:10:31] and 'closer to prod'
[21:10:45] we can kill NFS in general after we setup swift, but I guess that's a bit farther off
[21:11:03] yuvipanda: right, just, why are they being used? what uses them?
[21:11:18] * greg-g may have missed that in the backscroll
[21:11:20] greg-g: thumbnails use NFS
[21:11:27] greg-g: outside of that, nothing.
[21:11:29] that's not home dirs though, right?
[21:12:25] greg-g: no
[21:12:29] so homedirs can / should go
[21:12:35] and should just be on the deployment-bastion
[21:12:39] (and backed up in some form)
[21:12:41] (maybe)
[21:12:52] right
[21:13:10] * greg-g is confused maybe, but it doesn't matter
[21:13:44] greg-g: why confused?
[21:13:56] 1. we aren't using homedirs on NFS on deployment-prep for anything useful (afaik)
[21:13:58] 2. NFS BAD
[21:14:05] 3. Let's not do (1)
[21:14:10] might need some work on the wikitech side tho
[21:15:51] * greg-g agrees
[21:16:06] Just..
[21:16:08] 21:07 < chasemp> the majority, not even just half of use cases seem like borderline abuse
[21:16:12] what are the use cases?
[21:16:24] (was my original question)
[21:16:33] I meant in labs as a whole
[21:16:42] there, which I think yuvi knew but was probably confusing
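One way to ground the "what actually uses NFS here?" question above is to check a deployment-prep instance directly with stock tools; nothing project-specific is assumed beyond the paths already mentioned:

    # List the NFS mounts the instance currently has:
    mount -t nfs,nfs4
    # Show which processes hold files open under a mount, e.g. the
    # home directories and the shared /data/project tree:
    sudo fuser -vm /home /data/project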
[21:16:54] looks like there is still some logging going to nfs: https://github.com/wikimedia/operations-puppet/blob/production/modules/mediawiki/files/apache/beta/logging.conf#L5
[21:17:15] chasemp: ahhhhhhhh
[21:17:24] I thought that was in reference to deployment-prep
[21:17:26] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[21:17:32] see what I get for coming in part way and asking questions?
[21:17:46] thcipriani: we should kill that - that might've also been the cause of the http servers not responding
[21:17:48] logging to nfs is like textbook bad behavior afaik :D
[21:17:54] don't we have logstash
[21:18:02] chasemp: guess where everything logs to in toollabs
[21:18:04] yuvipanda: there is a beta logstash
[21:18:39] logstash-beta.wmflabs.org not sure what all uses it :\
[21:19:04] still dead :(( formatjson, you came back to haunt me...
[21:19:27] but there are a ton of these in logstash, which is probably the problem: (116)Stale file handle: AH00646: Error writing to /data/project/logs/apache-access.log
[21:20:10] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:20:30] thcipriani: yup.
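The stale-file-handle flood above comes from the beta apache config writing its access log onto NFS. A sketch of the obvious remedy, retargeting the log to local disk; the installed file path is a guess, and the durable fix belongs in the puppet module rather than on-host:

    # Point the access log at local disk instead of /data/project (NFS):
    sudo sed -i \
        's|/data/project/logs/apache-access.log|/var/log/apache2/apache-access.log|' \
        /etc/apache2/conf-enabled/logging.conf   # hypothetical path
    sudo apache2ctl configtest && sudo service apache2 reload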
[21:31:52] hmm, deployment-bastion is still locked up
[21:35:17] !log disconnecting deployment-bastion and reconnecting, again
[21:35:21] Logged the message, Master
[21:40:19] dagburnd-it
[21:40:59] !log deployment-bastion still not accepting jobs from jenkins
[21:41:01] Logged the message, Master
[21:46:43] thcipriani: your only hope is to restart Jenkins
[21:47:00] thcipriani: there is a deadlock in some java code :/
[21:48:00] Project beta-update-databases-eqiad build #9377: FAILURE in 2 hr 27 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9377/
[21:49:25] hashar: ^ what was that?
[21:50:14] thcipriani: I have cancelled it
[21:50:23] ah, gotcha
[21:50:28] thcipriani: for some reason the executors on deployment-bastion are all stuck but one
[21:50:55] maybe we can just kill all beta-related jobs
[21:51:04] on the Zuul page https://integration.wikimedia.org/zuul/
[21:51:17] there are a few operations/mediawiki-config changes pending in 'postmerge'
[21:51:23] all stuck because of deployment-bastion
[21:53:52] I tried the: mark offline, disconnect, restart slave agent a few times, so I'm ready to try a new approach
[21:54:45] yeah, I've had successes after 3 tries of that, I haven't stuck with it for more :)
[21:55:04] thcipriani: on the main jenkins page, there is a build queue listed
[21:55:13] that shows that a beta-scap-eqiad is waiting
[21:55:15] after 3 it feels like if I did it a 4th time I'd have to be hit with a cluebat
[21:55:44] pointing the mouse over shows it is waiting for beta-mediawiki-config-update-eqiad build 2433
[21:55:49] but it has completed!
[21:56:51] I just see "waiting for next available executor" etc.
[21:57:52] everything else seems to be waiting on beta-scap-eqiad
[21:58:19] yes
[21:58:33] "waiting for next available executor" usually means the gearman plugin hung
[21:58:36] there is only one instance of scap that can run
[21:58:44] so all jobs have scap as a "blocker"
[21:58:51] but for some reason scap is actually blocked :/
[21:59:30] https://phabricator.wikimedia.org/T72597
[21:59:47] !log disconnected reconnected Jenkins Gearman client
[21:59:49] Logged the message, Master
[21:59:59] (that is in https://integration.wikimedia.org/ci/configure )
[22:00:24] disconnecting the gearman client unlocked beta-scap-eqiad!
[22:03:43] I'm still seeing jobs waiting in executor starvation
[22:03:55] yeah
[22:03:57] https://integration.wikimedia.org/ci/monitoring
[22:04:04] that gives a bunch of monitoring utilities
[22:04:12] at the bottom there is a list of the threads
[22:04:21] ex: Gearman worker deployment-bastion.eqiad_exec-0
[22:04:48] that shows the threads being blocked
[22:05:18] http://paste.debian.net/plain/171429
[22:08:43] so what's left to do at this point? Time to restart jenkins?
[22:11:40] yeah
[22:11:50] thcipriani: you can head to https://integration.wikimedia.org/ci/manage
[22:11:53] at the bottom
[22:11:59] [Prepare for Shutdown]
[22:12:06] then let the jobs finish
[22:12:23] !log preparing jenkins for shutdown
[22:12:25] Logged the message, Master
[22:12:37] the browser tests I usually kill, since they run once per day anyway
[22:13:08] so at this point
[22:13:12] Jenkins no longer triggers anything new
[22:13:21] and Zuul holds the jobs to be run
[22:13:43] I have killed the browser tests
[22:14:02] okie doke, safe restart time?
[22:14:14] yup
[22:14:37] though I don't know whether it can be done from the web interface
[22:15:13] so at this point I head to gallium.wikimedia.org and restart
[22:15:37] sudo -u jenkins /etc/init.d/jenkins stop
[22:15:47] I started a safe restart
[22:15:51] ah
[22:16:01] also, I don't have sudo on gallium
[22:16:09] OHHH
[22:16:12] we should fix that
[22:16:16] sudo which sudo prompts me for password :(
[22:17:59] ok it is processing again
[22:17:59] https://integration.wikimedia.org/zuul/
[22:18:12] the Jenkins backend is ready before the web interface :]
[22:18:50] !log jenkins restarted
[22:18:53] Logged the message, Master
[22:19:12] hashar: thanks!
[22:19:32] yeah
[22:19:35] it is still stuck though :/
[22:20:02] yeah... blerg.
[22:20:47] !log restarting Jenkins from gallium :/
[22:20:50] Logged the message, Master
[22:21:00] thcipriani: we need to get you added to the ci admins group
[22:21:14] some yaml file in operations/puppet
[22:21:25] ./modules/admin/data/data.yaml
[22:21:53] hmm wait
[22:22:15] 'ALL = (jenkins) NOPASSWD: ALL'
[22:22:27] you should have proper sudo access!
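For anyone else puzzled by the 'ALL = (jenkins) NOPASSWD: ALL' rule quoted above, the grant can be inspected directly on the host with stock sudo, which is exactly what the exchange that follows works out:

    # List the sudo rules that apply to the current user on this host:
    sudo -l
    # The contint-admins rule only allows running commands as the
    # jenkins user, so this succeeds without a password...
    sudo -u jenkins whoami
    # ...while anything targeting root (e.g. "sudo which sudo") still
    # prompts for a password and fails.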
[22:23:01] hmm, I'm in the contint-admins
[22:24:02] so, I guess I can restart jenkins but not do: sudo which [x] :P
[22:24:05] that's sudo as the jenkins user only
[22:27:34] so jobs are now only blocked on https://integration.wikimedia.org/ci/job/beta-scap-eqiad/51584/console
[22:27:41] which is busy running the l10n cache
[22:35:22] thcipriani: there is a lot of work to refactor our Jenkins / beta deployment stuff :/
[22:35:40] I will talk about it extensively in Annecy
[22:36:08] chase even proposed to rebuild gallium entirely :D
[22:36:16] sounds like a fun time :)
[22:36:34] yeah
[22:36:39] so much to be done
[22:36:55] if we end up with an exhaustive task list that would be good
[22:37:08] and I really want to have all the team know as much as I do
[22:37:52] * hashar meanwhile in Vagrant/puppet, it does not serve puppet:// files grrr
[22:38:24] hashar: in mediawiki-vagrant? It should
[22:38:36] oh
[22:38:41] that is another vagrant :]
[22:38:48] ==> default: Error: /Stage[main]/Passwords::Root/Ssh::Userkey[root]/File[/etc/ssh/userkeys/root]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///private/ssh/root-authorized-keys
[22:38:56] I am missing a fileserver.conf or something :]
[22:39:16] bd808: I had the crazy idea of bootstrapping an image for wmflabs out of operations/puppet and labs/private :D
[22:39:38] heh. I've had the same crazy idea but never gotten too far on it
[22:40:03] will share my result on wednesday I think
[22:40:49] To have multiple root repos you will probably need to mess with the puppet.options array in your Vagrantfile
[22:40:57] yup
[22:42:03] blerg. Is it stuck again? Seems like when you restart it runs one job on deployment-bastion then gets stuck again.
[22:42:18] hashar: https://ask.puppetlabs.com/question/2646/how-to-separate-3rd-party-modules-from-own-modules/
[22:43:37] thcipriani: yeah it is stuck
[22:43:50] thcipriani: there are several mediawiki/config jobs that are still in Zuul
[22:44:05] I guess they somehow cause the deadlock
[22:44:14] bd808: I will make sure you get the announce somehow :]
[22:44:32] thcipriani: so it is probably easier to just stop zuul to flush the queues
[22:46:02] hashar: cool. twentyafter.four even made a task wanting it -- https://phabricator.wikimedia.org/T76128
[22:49:05] !log restarted Zuul to clear out a bunch of operations/mediawiki-config.git jobs
[22:49:07] Logged the message, Master
[22:50:48] !log Manually retriggering last change of operations/mediawiki-config.git with: zuul enqueue --trigger gerrit --pipeline postmerge --project operations/mediawiki-config --change 208822,1
[22:50:50] Logged the message, Master
[22:57:45] bd808: '--fileserverconfig', '/vagrant/fileserver.conf', !!!
[22:58:03] ==> default: Info: mount[files]: allowing * access
[22:58:03] ==> default: Info: mount[private]: allowing * access
[22:58:05] ahah
[22:58:13] vagrant is all green!
[22:58:45] with ::base from operations/puppet.git
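Context for the --fileserverconfig discovery above: puppet only resolves puppet:///files/... and puppet:///private/... URLs when fileserver mounts with those names are defined. A minimal sketch of the file hashar points at, with illustrative paths:

    # Write /vagrant/fileserver.conf; the mount names must match the
    # puppet:///<mount>/... URLs the manifests request:
    cat > /vagrant/fileserver.conf <<'EOF'
    [files]
      path /vagrant/operations-puppet/files
      allow *
    [private]
      path /vagrant/labs-private
      allow *
    EOF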
[22:59:33] thcipriani: Jenkins/beta etc should be fine now :/
[23:00:03] hashar: \o/ when in doubt, restart everything :)
[23:00:16] thank you for walking me through that.
[23:00:33] in the realm of java, I am blind :(
[23:00:46] surely someone able to hook up a debugger would figure it out
[23:23:32] PROBLEM - Host integration-labsvagrant is DOWN: CRITICAL - Host Unreachable (10.68.16.4)
[23:24:32] RECOVERY - Host integration-labsvagrant is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[23:25:48] PROBLEM - Host integration-slave-trusty-1014 is DOWN: CRITICAL - Host Unreachable (10.68.18.29)
[23:49:13] bah
[23:49:18] deadlock again :(
[23:49:48] !log restarted Jenkins
[23:49:53] Logged the message, Master
[23:49:59] !log restarted Jenkins (deadlock with deployment-bastion)
[23:50:02] Logged the message, Master
[23:54:57] RECOVERY - Host integration-slave-trusty-1014 is UP: PING OK - Packet loss = 0%, RTA = 72.57 ms
[23:56:10] PROBLEM - Host integration-slave-trusty-1017 is DOWN: CRITICAL - Host Unreachable (10.68.17.136)
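On hashar's point above about needing a debugger: for a Java deadlock like this one, a JVM thread dump is usually enough. A sketch with standard JDK tooling, assuming a full JDK (not just a JRE) is installed on gallium and the contint-admins sudo rule applies:

    # Dump all Jenkins thread stacks; the JVM marks detected deadlocks
    # explicitly in the output:
    JENKINS_PID=$(pgrep -f jenkins.war | head -n1)
    sudo -u jenkins jstack "$JENKINS_PID" > /tmp/jenkins-threads.txt
    grep -B2 -A20 'Found one Java-level deadlock' /tmp/jenkins-threads.txt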