[00:30:25] 10Browser-Tests, 10Gather, 6Mobile-Web, 10MobileFrontend, 3Mobile-Web-Sprint-48-Voyage-of-the-Damned: Audit existing browser tests - https://phabricator.wikimedia.org/T101071#1332520 (10Jdlrobson) TODO: * Skipped tests shouldn't send e-mail notifications * @dduval and @jdlrobson to explore why this test... [00:49:03] 10Deployment-Systems, 6Release-Engineering: Use subrepos instead of git submodules for deployed MediaWiki extensions - https://phabricator.wikimedia.org/T98834#1332566 (10mmodell) @jdforrester-wmf: It wouldn't necessarily have to take place in the main VE repo, this could be done via an intermediate merge rep... [00:55:19] 10Deployment-Systems, 6Release-Engineering: Use subrepos instead of git submodules for deployed MediaWiki extensions - https://phabricator.wikimedia.org/T98834#1332586 (10mmodell) This is really a detail specific to the wmf release branches of mediawiki, it will really only come into play when applying commit... [01:03:09] 10Deployment-Systems: Come up with an abstract deployment model that roughly addresses the needs of existing projects - https://phabricator.wikimedia.org/T97068#1332623 (10mmodell) [01:03:14] 10Browser-Tests, 6Release-Engineering, 6Mobile-Web: Introduce @skip tag in mediawiki selenium - https://phabricator.wikimedia.org/T101062#1332626 (10kaldari) 5Open>3Resolved a:3kaldari This was implemented for Mobile-Web in https://gerrit.wikimedia.org/r/#/c/215542/ [01:44:51] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Generate code coverage reports for extensions - https://phabricator.wikimedia.org/T71685#1332687 (10Legoktm) >>! In T71685#1328430, @phuedx wrote: > @Legoktm: Can we get MobileFrontend and Gather added to the list of extensions that you're generating... [01:48:51] 10Beta-Cluster, 10MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#1332691 (10Mattflaschen) >>! In T100515#1318782, @hashar wrote: > I guess on prod you are using a dedicated one or another one. Nope, we're doing... [02:20:06] (03CR) 10Krinkle: [C: 04-1] "As outlined in the past with Antoine, avoid using 'make build' for this purpose. It incurs additional complexities that aren't worth the b" [integration/config] - 10https://gerrit.wikimedia.org/r/191046 (https://phabricator.wikimedia.org/T74794) (owner: 10Hashar) [02:21:51] (03CR) 10Krinkle: "If migration is too much work, it can be bypassed by specifying the relevant shell command (e.g. grunt docs) in package.json/scripts/doc a" [integration/config] - 10https://gerrit.wikimedia.org/r/191046 (https://phabricator.wikimedia.org/T74794) (owner: 10Hashar) [02:52:47] (03PS1) 10Krinkle: Enable npm job for CategoryTree [integration/config] - 10https://gerrit.wikimedia.org/r/215571 [02:53:03] (03CR) 10Krinkle: [C: 032] Enable npm job for CategoryTree [integration/config] - 10https://gerrit.wikimedia.org/r/215571 (owner: 10Krinkle) [02:58:34] (03Merged) 10jenkins-bot: Enable npm job for CategoryTree [integration/config] - 10https://gerrit.wikimedia.org/r/215571 (owner: 10Krinkle) [02:58:39] 6Release-Engineering, 6operations: Try out hack (>! In T91590#1332329, @Legoktm wrote: > Also HHVM's linter is significantly slower than PHP5: https://github.com/JakubOnderka/PHP-Parallel-Lint/issues/47 This is a general... [03:04:53] 6Release-Engineering, 6operations: Try out hack (>! In T91590#1332763, @bd808 wrote: >>>! 
In T91590#1332329, @Legoktm wrote: >> Also HHVM's linter is significantly slower than PHP5: https://github.com/JakubOnderka/PHP-Para... [03:07:16] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/215571 [03:07:21] Logged the message, Master [04:19:20] 10Browser-Tests, 6Release-Engineering, 6Mobile-Web: Introduce @skip tag in mediawiki selenium - https://phabricator.wikimedia.org/T101062#1332911 (10Jdlrobson) 5Resolved>3Open This is for a generic component [04:23:54] 10Browser-Tests, 10MediaWiki-extensions-GuidedTour: Add Cucumber browser tests for GuidedTour - https://phabricator.wikimedia.org/T92154#1332915 (10Mattflaschen) It's not a priority for me right now since GuidedTour isn't being actively developed at the moment. [05:08:18] 10Deployment-Systems, 6Release-Engineering: Use subrepos instead of git submodules for deployed MediaWiki extensions - https://phabricator.wikimedia.org/T98834#1332933 (10Mattflaschen) [05:24:21] Project beta-scap-eqiad build #55583: FAILURE in 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/55583/ [06:40:51] RECOVERY - Free space - all mounts on deployment-bastion is OK All targets OK [06:56:26] andre__: https://phabricator.wikimedia.org/T97642 ? [06:57:00] matanya: I fail to see the urgency and why people feel like they need to ping me all of the time, tbh [06:57:08] yes, I am aware of it as it is assigned to me. [06:57:17] yes, many other things are also assigned to me. Yes, I will get there. [06:57:45] andre__: people are eager to help :) [06:58:25] I'm not sure if approx. 1.2 people pinging me daily and making me switch to the IRC window and interrupt other stuff is a good trade-off. :P [06:58:42] So yes, I'll do that soon. And thanks for the ping :) [06:58:58] (/me not even ironic; thanks for the reminder) [06:59:39] at your service andre__ didn't know there was a backround there. [06:59:44] hehe [07:00:42] RECOVERY - Free space - all mounts on deployment-videoscaler01 is OK All targets OK [08:19:29] addshore: new release of mwcs came out yesterday :D [08:19:35] :D [08:20:11] About a minute after it came out, I used it in the composer dependencies of Extension:SmiteSpam, my GSoC project. [08:20:15] Such a fun moment. [08:20:29] :) [08:20:33] Good work! :) [08:21:19] thanks :) [08:21:39] I remember when I first took a look at it and added the tests :P [08:21:46] and always meant to come back and add more! ;p [08:44:09] (03PS1) 10Polybuildr: Update README.md code formatting [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/215580 [08:46:29] (03CR) 10Polybuildr: "Refer to https://github.com/polybuildr/mediawiki-tools-codesniffer/blob/master/README.md to see working example." 
[tools/codesniffer] - 10https://gerrit.wikimedia.org/r/215580 (owner: 10Polybuildr) [08:46:41] addshore: ^ [08:53:45] (03CR) 10Addshore: [C: 032] Update README.md code formatting [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/215580 (owner: 10Polybuildr) [08:59:05] (03PS3) 10Hashar: Update Git plugin configuration [integration/config] - 10https://gerrit.wikimedia.org/r/215335 (https://phabricator.wikimedia.org/T101105) [08:59:39] addshore: jzerebecki: you want to update your JJB copy :-} [08:59:46] I have pushed some changes yesterday [08:59:57] (03CR) 10Hashar: [C: 032] "Going to refresh them" [integration/config] - 10https://gerrit.wikimedia.org/r/215335 (https://phabricator.wikimedia.org/T101105) (owner: 10Hashar) [09:04:09] (03Merged) 10jenkins-bot: Update Git plugin configuration [integration/config] - 10https://gerrit.wikimedia.org/r/215335 (https://phabricator.wikimedia.org/T101105) (owner: 10Hashar) [09:10:26] !log Refershing almost all jenkins jobs to take in account the Jenkins Git plugin upgrade https://phabricator.wikimedia.org/T101105 [09:10:30] Logged the message, Master [09:22:25] PROBLEM - Puppet failure on deployment-mx is CRITICAL 100.00% of data above the critical threshold [0.0] [09:44:50] (03CR) 10Hashar: "I have refreshed all the jobs." [integration/config] - 10https://gerrit.wikimedia.org/r/215335 (https://phabricator.wikimedia.org/T101105) (owner: 10Hashar) [09:49:46] 10Continuous-Integration-Infrastructure, 6Editing-Department, 10VisualEditor, 5Patch-For-Review, 7Regression: Submodule not being updated in Jenkins jobs - https://phabricator.wikimedia.org/T101105#1333209 (10hashar) I have refreshed all the jobs. Should be fine now. I sent a mail to the qa list to have... [10:02:45] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #451: FAILURE in 41 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/451/ [10:03:48] !log Further updated JJB fork c7231fe..f966521 [10:03:53] Logged the message, Master [10:04:44] 10Continuous-Integration-Infrastructure, 6Editing-Department, 10VisualEditor, 5Patch-For-Review, 7Regression: Submodule not being updated in Jenkins jobs - https://phabricator.wikimedia.org/T101105#1333221 (10hashar) 5Open>3Resolved [10:07:35] 10Continuous-Integration-Infrastructure, 7Regression: ERROR: Failed to notify endpoint 'HTTP:http://127.0.0.1:8001/jenkins_endpoint' - https://phabricator.wikimedia.org/T93321#1333234 (10hashar) On Monday we bumped JJB. I am now bumping it to merge commit 4135e143 which is the patch I wrote https://review.ope... [10:08:23] !log Update JJB fork again f966521..4135e14 . Will remove the http notification to zuul {{bug:T93321}}. REFRESHING ALL JOBS! [10:08:28] Logged the message, Master [10:32:47] 10Continuous-Integration-Infrastructure, 7Regression: ERROR: Failed to notify endpoint 'HTTP:http://127.0.0.1:8001/jenkins_endpoint' - https://phabricator.wikimedia.org/T93321#1333276 (10hashar) 5Open>3Resolved Jobs are still being refreshed but I confirmed the http notification is gone and jobs are proper... 
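For reference, the JJB job refresh hashar logs above is normally driven from a local jenkins-job-builder checkout against integration/config. A minimal sketch, assuming a jenkins_jobs.ini with API credentials and job definitions under config/ (both paths are illustrative, not taken from this log):

    # render the affected jobs to XML locally first, as a dry run
    jenkins-jobs --conf jenkins_jobs.ini test -o /tmp/jjb-out config/ 'mediawiki-*'
    # then push the regenerated configuration to the Jenkins master ("refreshing" the jobs)
    jenkins-jobs --conf jenkins_jobs.ini update config/ 'mediawiki-*'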
[10:42:18] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused [10:57:16] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.027 second response time [11:33:11] 10Beta-Cluster, 10ContentTranslation-cxserver: CXServer on beta is writing Logs to NFS - https://phabricator.wikimedia.org/T101240#1333386 (10yuvipanda) 3NEW [11:38:16] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused [11:42:42] PROBLEM - Content Translation Server on deployment-sca02 is CRITICAL: Connection refused [11:57:57] duh [11:58:19] !log Cherry-picked 213840 to test logstash [11:58:24] Logged the message, Master [11:58:37] That's good Kartik as per documentation. [12:00:50] 10Beta-Cluster: beta-scap-eqiad broken since June 3rd 5:24am UTC - https://phabricator.wikimedia.org/T101252#1333501 (10hashar) 3NEW [12:05:48] 10Beta-Cluster: beta-scap-eqiad broken since June 3rd 5:24am UTC - https://phabricator.wikimedia.org/T101252#1333510 (10hashar) That is caused by https://gerrit.wikimedia.org/r/#/c/213469/ (Add PHP error logging to Sentry extension) for T85188. It introduces a PHP dependency for `raven/raven` in composer.json w... [12:06:42] PROBLEM - Free space - all mounts on deployment-videoscaler01 is CRITICAL deployment-prep.deployment-videoscaler01.diskspace._var.byte_percentfree (<10.00%) [12:14:50] Yippee, build fixed! [12:14:51] Project beta-scap-eqiad build #55630: FIXED in 1 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/55630/ [12:15:34] 10Beta-Cluster, 5Patch-For-Review: beta-scap-eqiad broken since June 3rd 5:24am UTC - https://phabricator.wikimedia.org/T101252#1333530 (10hashar) p:5Triage>3Unbreak! [12:15:39] 10Beta-Cluster, 5Patch-For-Review: beta-scap-eqiad broken since June 3rd 5:24am UTC - https://phabricator.wikimedia.org/T101252#1333532 (10hashar) 5Open>3Resolved a:3hashar I have reverted the Sentry patch and commented about it on T85188 Triggered a build of [[ https://integration.wikimedia.org/ci/job/... [12:21:04] 10Beta-Cluster, 10ContentTranslation-cxserver, 5Patch-For-Review: CXServer on beta is writing Logs to NFS - https://phabricator.wikimedia.org/T101240#1333540 (10hashar) a:3yuvipanda [13:08:34] and off [13:08:36] see you tomorrow [13:21:06] 10Continuous-Integration-Infrastructure: Create CI slaves using Debian Jessie (tracking) - https://phabricator.wikimedia.org/T94836#1333664 (10faidon) >>! In T98003#1320527, @hashar wrote: > I created a single Jessie slave to report on package/puppet/upstart errors. Tracking is T94836. It is not a priority thou... 
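The beta-scap-eqiad breakage hashar diagnoses above (T101252) is the usual Composer-versus-mediawiki/vendor mismatch: the Sentry patch declared raven/raven in the extension's composer.json, but beta and production install libraries from the pre-built mediawiki/vendor repo, so the package was never actually present there. A rough sketch of what the reverted change effectively did, with the version left unpinned as an assumption:

    # from the extension's directory, declare the new dependency
    composer require raven/raven
    # on a composer-managed wiki this pulls the library into vendor/, but on
    # beta/production nothing fetches it until mediawiki/vendor itself is
    # updated -- hence the fatals during scap until the patch was reverted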
[13:58:16] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.023 second response time [14:02:45] RECOVERY - Content Translation Server on deployment-sca02 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.024 second response time [14:19:17] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused [14:34:57] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #522: FAILURE in 8 min 56 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/522/ [14:59:19] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.036 second response time [16:21:21] 10Continuous-Integration-Infrastructure, 6operations: Build a new version of php-luasandbox and hhvm-luasandbox, and deploy to integration hosts - https://phabricator.wikimedia.org/T101275#1334197 (10Anomie) 3NEW [16:24:34] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 7I18n: LuaStandalone timeout is sometimes reported as read error - https://phabricator.wikimedia.org/T96912#1334222 (10Anomie) 5Open>3Resolved [16:24:38] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 7I18n: LuaStandalone timeout is sometimes reported as read error - https://phabricator.wikimedia.org/T96912#1229176 (10Anomie) This was probably fixed with [[https://gerrit.wikimedia.org/r/#/c/213586/|Gerrit change 213586]]. If this s... [17:03:18] bd808: Possibly easy idea: could we mount /var/www somewhere like we do the stuff in /srv? [17:03:28] That docroot is useful for debugging and experimentation [17:03:45] (somewhere available to the host OS, that is) [17:12:02] 10Browser-Tests, 6Release-Engineering: Introduce @skip tag in mediawiki selenium - https://phabricator.wikimedia.org/T101062#1334384 (10kaldari) [17:23:23] 10Browser-Tests, 6Release-Engineering: Introduce @skip tag in mediawiki selenium - https://phabricator.wikimedia.org/T101062#1334446 (10greg) a:5kaldari>3dduvall [17:29:12] marxarelli: I assume that is right^ ? [17:34:01] 10Browser-Tests, 6Release-Engineering: Introduce @skip tag in mediawiki selenium - https://phabricator.wikimedia.org/T101062#1334484 (10dduvall) a:5dduvall>3None [17:34:19] greg-g: probably shouldn't assign it until someone is working on it [17:36:09] * greg-g nods [17:43:15] ostriches: yeah I think we could do that easily. We could move it to $VAGRANT/srv/www in the host os and update the apache config to read from there. If we did it with a hiera var then we could let people put it wherever they wanted. [17:43:34] That sounds like a plan [17:44:28] * bd808 dreams of having things setup so that he can put everything in the VM [17:58:09] 6Release-Engineering, 7user-notice: Shorten/Simplify MW train deploy cadence to Tu->W->Th - https://phabricator.wikimedia.org/T97553#1334671 (10mmodell) In the announcement email, @Greg wrote: >== Transition == >Transitions from one cadence to another are hard. Here's how we'll be >doing this transition: > >We... [18:15:47] twentyafterfour: phab generally writes to apache's error log, right? [18:16:20] Ah nope, nvm [18:16:29] phabricator_error.log [18:16:30] 6Release-Engineering, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1334834 (10Aklapper) 5stalled>3Open Went on vacations. Got back. Found people asking for stuff. 
Created key. Signed L3. wikitech username: ak... [18:16:54] 6Release-Engineering, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1334838 (10Aklapper) a:5Aklapper>3None [18:20:04] 6Release-Engineering, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1334847 (10Matanya) a:3Matanya [18:21:31] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL 100.00% of data above the critical threshold [0.0] [18:24:52] !log updating deployment-salt puppet in prep for use_dnsmasq=false [18:24:57] Logged the message, Master [18:26:24] ostriches: yeah [18:27:04] * ostriches is poking git-http-backend a tad [18:27:13] Seeing if we can at least hosting our diffusion repos as mirrors [18:28:12] thcipriani: gotta go rescue tiny plants from thunderstorm, brb [18:28:20] kk [18:29:48] ostriches: as mirrors? [18:30:31] 6Release-Engineering, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1334910 (10Matanya) a:5Matanya>3Dzahn [18:31:20] twentyafterfour: So you can clone the repos from diffusion. [18:31:47] The problem is opening up ssh [18:31:56] we can set up https cloning I think [18:32:13] Yeah https cloning is all I'm doing now [18:32:18] R/O :) [18:32:33] Actually I'm not sure if phabricator supports that when it's mirroring remote repos, I'd ask in #phabricator they might know [18:32:52] I think it should [18:32:59] Config/UI seems to indicate it is [18:33:07] I'm trying to debug why it won't work yet tho :) [18:33:49] hmm, maybe you have to authenticate? are the repos set to fully public? [18:33:58] https://phabricator.wikimedia.org/diffusion/UINF/edit/serve/ [18:34:19] Visible To Public (No Login Required) [18:34:38] thcipriani: I’m back, at least partially :) Ping if you run into trouble. [18:34:46] Set 1 was getting git-http-backend into $PATH (done, will have a puppet patch shortly to shore it up) [18:34:55] *step [18:35:03] Now I'm still getting 500 and nothing in the log [18:35:48] andrewbogott: uh...puppetmaster restart doesn't seem to want to come back up [18:35:59] PROBLEM - Puppet failure on deployment-db2 is CRITICAL 20.00% of data above the critical threshold [0.0] [18:36:05] ok, on deployment-salt? [18:36:16] yeah [18:36:29] wonder if it's finishing a run, or it's just hung up. [18:36:37] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL 20.00% of data above the critical threshold [0.0] [18:36:38] twentyafterfour: I'm going to undo my testing and poke this later, I don't wanna leave it half-working [18:39:18] ok .. I think the problem is sudo settings [18:39:38] ostriches: you just triggered a bunch of sudo failure emails wrt git-http-backend [18:39:45] PROBLEM - Puppet failure on deployment-salt is CRITICAL 50.00% of data above the critical threshold [0.0] [18:40:15] PROBLEM - Puppet failure on deployment-db1 is CRITICAL 70.00% of data above the critical threshold [0.0] [18:40:25] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL 44.44% of data above the critical threshold [0.0] [18:40:41] andrewbogott: I'm guessing pid 1212 is just going to have to get killed [18:41:05] PROBLEM - Puppet failure on deployment-fluorine is CRITICAL 66.67% of data above the critical threshold [0.0] [18:41:28] thcipriani: yep, that seems to’ve done it. 
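When the puppetmaster service hangs on restart like this, the remedy thcipriani and andrewbogott settle on is simply killing the stale process before starting the service again; a small sketch (the pid is the one from the exchange above, the service name is the stock Debian one):

    pgrep -fl puppetmaster           # identify the stuck process (1212 in the log above)
    sudo kill 1212                   # escalate to kill -9 only if it ignores SIGTERM
    sudo service puppetmaster start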
[18:41:32] No idea why it got stuck :( [18:42:00] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL 60.00% of data above the critical threshold [0.0] [18:43:16] ok, rerunning on deployment-salt one time more for good measure, then we'll flip the switch [18:44:23] alright, seems to have run [18:44:24] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL 30.00% of data above the critical threshold [0.0] [18:46:10] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL 66.67% of data above the critical threshold [0.0] [18:46:18] PROBLEM - Puppet failure on deployment-sca02 is CRITICAL 60.00% of data above the critical threshold [0.0] [18:47:26] just looked into the deployment-elastic08 puppet failure, to make sure no updates to the puppet repo blew anything up. Seems like these failures are a result of restarting the puppetmaster [18:48:19] andrewbogott: [x] update puppet master; [ ] use_dnsmasq: false in hiera, is that the proper next step? [18:49:00] change the puppetmaster name in ldap and hiera — that’s done already? [18:49:44] well, puppetmaster is updated in heira, should override ldap [18:49:58] yep [18:50:04] alright, here goes [18:50:17] !log change use_dnsmasq: false for deployment-prep [18:50:22] Logged the message, Master [18:51:11] running puppet on d-salt [18:53:23] alright, updated master [18:54:06] I'd say we are in good shape for the new cadence, I've never seen the fatal errors so clean. Barely any OOMs, just a bunch of mysql failures which I don't fully understand but it's a known issue [18:54:47] RECOVERY - Puppet failure on deployment-salt is OK Less than 1.00% above the threshold [0.0] [18:58:38] andrewbogott: looks like I have to update resolv.conf manually before each agent recognizes the new server address [19:00:03] RECOVERY - Puppet failure on deployment-salt is OK Less than 1.00% above the threshold [0.0] [19:00:23] 6Release-Engineering, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1335010 (10Dzahn) 5Open>3Resolved user has been created on bast1001 and on iridium. linked andre__ to example for ProxyCommand setup. let us k... [19:00:47] PROBLEM - Puppet failure on deployment-salt is CRITICAL 20.00% of data above the critical threshold [0.0] [19:01:13] 6Release-Engineering, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1335012 (10Dzahn) ``` [iridium:/etc/sudoers.d] $ sudo cat phabricator-admin # This file is managed by Puppet! %phabricator-admin ALL = NOPASSWD:... [19:01:17] thcipriani: that makes sense. I’m not sure why that didn’t happen to me; I’ll set up a new test and investigate. [19:01:38] thcipriani: still manageable, right? [19:01:51] yeah, ndb really [19:03:11] cool. Deployment-elastic08: first one I updated, went just fine. [19:05:17] RECOVERY - Puppet failure on deployment-db1 is OK Less than 1.00% above the threshold [0.0] [19:05:44] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:05:49] andrewbogott: here's a thought, if a puppet run happens, the resolv.conf updates, then you can _just_ update puppet.conf, that might be why it didn't happen to you? 
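To make thcipriani's ordering point concrete: each agent names the puppetmaster in /etc/puppet/puppet.conf, but it can only resolve that name once /etc/resolv.conf stops pointing at the now-disabled dnsmasq. A per-instance fix-up therefore looks roughly like this (the recursor address and master FQDN are placeholders, not values from this log):

    # point the resolver at a working labs recursor so the new master name resolves
    sudo sed -i 's/^nameserver .*/nameserver <labs-recursor-ip>/' /etc/resolv.conf
    # make sure the agent is aimed at the new puppetmaster
    sudo sed -i 's/^server = .*/server = <new-puppetmaster-fqdn>/' /etc/puppet/puppet.conf
    # the next run then rewrites both files from the updated puppet/hiera config
    sudo puppet agent --test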
[19:06:00] RECOVERY - Puppet failure on deployment-db2 is OK Less than 1.00% above the threshold [0.0] [19:06:08] RECOVERY - Puppet failure on deployment-fluorine is OK Less than 1.00% above the threshold [0.0] [19:06:32] thcipriani: so you mean running puppet before changing puppet.conf works? [19:06:34] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:06:36] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL 20.00% of data above the critical threshold [0.0] [19:06:39] It [19:06:44] It’s possible that that’s what I did. [19:07:08] maybe? [19:07:25] hm, probably if you remove use_dnsmasq and then run puppet on the clients before running on the master it works. [19:07:36] PROBLEM - Puppet failure on deployment-test is CRITICAL 30.00% of data above the critical threshold [0.0] [19:07:38] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:08:03] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL 50.00% of data above the critical threshold [0.0] [19:08:06] PROBLEM - Puppet failure on deployment-upload is CRITICAL 44.44% of data above the critical threshold [0.0] [19:08:33] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:08:37] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:09:07] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:10:21] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL 44.44% of data above the critical threshold [0.0] [19:10:27] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:10:45] RECOVERY - Puppet failure on deployment-salt is OK Less than 1.00% above the threshold [0.0] [19:11:53] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:12:10] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:12:28] PROBLEM - Puppet failure on deployment-bastion is CRITICAL 33.33% of data above the critical threshold [0.0] [19:12:42] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:13:08] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL 44.44% of data above the critical threshold [0.0] [19:14:22] RECOVERY - Puppet failure on deployment-elastic08 is OK Less than 1.00% above the threshold [0.0] [19:14:24] PROBLEM - Puppet failure on deployment-stream is CRITICAL 30.00% of data above the critical threshold [0.0] [19:15:10] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL 66.67% of data above the critical threshold [0.0] [19:16:14] PROBLEM - Puppet failure on deployment-db1 is CRITICAL 22.22% of data above the critical threshold [0.0] [19:16:18] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:16:58] PROBLEM - Puppet failure on deployment-db2 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:17:00] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:17:06] PROBLEM - Puppet failure on deployment-fluorine is CRITICAL 22.22% of data above the critical threshold [0.0] [19:23:33] twentyafterfour: Sounds fixable, we'll figure it out later. 
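The sudo failure e-mails around git-http-backend point at the rule Phabricator's Diffusion hosting documentation asks for: the web-server user must be allowed to run git-http-backend as the daemon/VCS user. A sketch of installing such a rule, where the user names (www-data, phd) and the binary path are assumptions about this setup rather than facts from the log:

    echo 'www-data ALL=(phd) SETENV: NOPASSWD: /usr/lib/git-core/git-http-backend' \
        | sudo tee /etc/sudoers.d/diffusion-git-http
    sudo chmod 0440 /etc/sudoers.d/diffusion-git-http
    sudo visudo -c    # syntax-check before relying on it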
[19:24:41] ostriches: just needs sudoers adjustments, those are documented in the diffusion setup instructions [19:25:32] Ah yes [19:25:36] I should do that :) [19:26:21] RECOVERY - Puppet failure on deployment-sca02 is OK Less than 1.00% above the threshold [0.0] [19:27:27] RECOVERY - Puppet failure on deployment-bastion is OK Less than 1.00% above the threshold [0.0] [19:29:53] all the puppet fail is already known, right [19:30:06] mutante: yup, just changing labs dns stuffs [19:30:15] thcipriani: ok, cool [19:32:06] RECOVERY - Puppet failure on deployment-fluorine is OK Less than 1.00% above the threshold [0.0] [19:33:30] PROBLEM - Puppet failure on deployment-bastion is CRITICAL 33.33% of data above the critical threshold [0.0] [19:34:08] RECOVERY - Puppet failure on deployment-mediawiki01 is OK Less than 1.00% above the threshold [0.0] [19:38:03] RECOVERY - Puppet failure on deployment-mathoid is OK Less than 1.00% above the threshold [0.0] [19:40:41] RECOVERY - Puppet failure on deployment-zookeeper01 is OK Less than 1.00% above the threshold [0.0] [19:43:06] RECOVERY - Puppet failure on deployment-upload is OK Less than 1.00% above the threshold [0.0] [19:43:26] RECOVERY - Puppet failure on deployment-bastion is OK Less than 1.00% above the threshold [0.0] [19:43:30] RECOVERY - Puppet failure on deployment-memc03 is OK Less than 1.00% above the threshold [0.0] [19:46:40] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:47:08] RECOVERY - Puppet failure on deployment-mediawiki02 is OK Less than 1.00% above the threshold [0.0] [19:48:36] RECOVERY - Puppet failure on deployment-jobrunner01 is OK Less than 1.00% above the threshold [0.0] [19:51:22] andrewbogott: can you look at deployment-stream quickly? Complaining about something :\ [19:51:28] yep [19:51:55] RECOVERY - Puppet failure on deployment-sentry2 is OK Less than 1.00% above the threshold [0.0] [19:52:41] RECOVERY - Puppet failure on deployment-pdf02 is OK Less than 1.00% above the threshold [0.0] [19:54:09] oh wait, did puppetmaster crap out? [19:54:23] I restarted it [19:54:26] probably needlessly [19:54:47] ah [19:57:28] thcipriani: I generated a new cert for deployment-stream and it seems happy now. Clearly there are 101 races in this process. [19:57:40] heh, indeed. [19:57:55] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:58:05] RECOVERY - Puppet failure on deployment-zotero01 is OK Less than 1.00% above the threshold [0.0] [19:59:29] PROBLEM - Puppet failure on deployment-bastion is CRITICAL 44.44% of data above the critical threshold [0.0] [20:03:23] andrewbogott: huh, deployment-cache-bits01, seemingly there may be a bigger issue there. Could you poke that one to verify? [20:03:33] yep [20:08:03] thcipriani: other than some real errors in the puppet config it seems ok. [20:08:16] I ran it once, had to sign the cert on deployment-salt, ran again. 
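The cert regeneration andrewbogott describes for deployment-stream is the standard Puppet 3 dance between an agent and a self-hosted puppetmaster; roughly (the FQDN is a placeholder):

    # on the agent: discard the old certificate material
    sudo rm -rf /var/lib/puppet/ssl
    # on the puppetmaster: clear any stale cert for that host
    sudo puppet cert clean <instance-fqdn>
    # on the agent: request a new cert (the run fails until it is signed)
    sudo puppet agent --test
    # on the puppetmaster: sign it, then re-run the agent
    sudo puppet cert sign <instance-fqdn>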
[20:08:43] yeah, real errors is what I suspected :\ [20:09:26] RECOVERY - Puppet failure on deployment-stream is OK Less than 1.00% above the threshold [0.0] [20:10:07] thcipriani: init script stuff, could have to do with services configured for jessie but running on precise [20:10:08] RECOVERY - Puppet failure on deployment-apertium01 is OK Less than 1.00% above the threshold [0.0] [20:10:10] dunno [20:10:41] I'll check for phab tickets related to it once I'm done here [20:11:12] RECOVERY - Puppet failure on deployment-db1 is OK Less than 1.00% above the threshold [0.0] [20:14:50] (03PS1) 10Dduvall: Push commits to 0.4 by default [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 [20:15:16] (03CR) 10jenkins-bot: [V: 04-1] Push commits to 0.4 by default [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 (owner: 10Dduvall) [20:15:54] (03PS2) 10Dduvall: Push 0.4 commits to remote 0.4 branch by default [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 [20:16:38] (03CR) 10jenkins-bot: [V: 04-1] Push 0.4 commits to remote 0.4 branch by default [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 (owner: 10Dduvall) [20:16:58] RECOVERY - Puppet failure on deployment-db2 is OK Less than 1.00% above the threshold [0.0] [20:19:02] marxarelli: rspec-core is not part of the bundle. !!! [20:19:22] hashar: it's an old branch [20:20:26] RECOVERY - Puppet failure on deployment-memc02 is OK Less than 1.00% above the threshold [0.0] [20:22:25] !log deployment-bastion Jenkins slave is stalled again :-( No code update happening on beta cluster [20:22:31] Logged the message, Master [20:24:05] thcipriani: you in a good stopping place soon, or do you want to push back our 1:1? [20:24:37] greg-g: I'm close to wrapping up, I can pause for a wee bit too [20:24:45] kk [20:24:48] didn't want to interrupt [20:24:52] andrewbogott: everything seems to be fairly uneventful, thanks for your help :) [20:25:06] great! Sorry it was so much work. [20:25:14] I guess the only major project left is integration? [20:25:18] PROBLEM - Puppet staleness on deployment-restbase01 is CRITICAL 100.00% of data above the critical threshold [43200.0] [20:25:22] RECOVERY - Puppet failure on deployment-videoscaler01 is OK Less than 1.00% above the threshold [0.0] [20:25:49] yup, integration will be the next one. I wonder if there's just a salt state that could be written :) [20:26:36] RECOVERY - Puppet failure on deployment-logstash1 is OK Less than 1.00% above the threshold [0.0] [20:27:55] RECOVERY - Puppet failure on deployment-sentry2 is OK Less than 1.00% above the threshold [0.0] [20:28:48] !log Restarting Jenkins to release a deadlock [20:28:52] Logged the message, Master [20:29:07] thcipriani: thanks a ton for handling this ! :-} [20:29:09] PROBLEM - Puppet staleness on deployment-restbase02 is CRITICAL 100.00% of data above the critical threshold [43200.0] [20:29:13] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #658: ABORTED in 3 min 13 sec: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/658/ [20:29:32] hashar: what does the `files:` filter do in jjb, again? [20:29:44] there is no such thing :-} [20:29:47] it is in zuul [20:29:50] does it only execute the job if files matching the pattern exist? [20:29:52] hashar: np #breaking_down_silos :) [20:29:53] yeah [20:30:00] thcipriani: #together !!! [20:30:09] thcipriani: I’m sure it could be done via salt, even just with cmd.run and sed. 
[20:30:10] thcipriani: you got it by the dns issue with staging haven't you ? [20:30:32] marxarelli: so when a new patch notif is received by zuul, that contains the list of files that have been changed [20:30:45] marxarelli: so we can prevent running a job unless some specific file is changed [20:31:05] marxarelli: an example is we validate the composer json file only if the patch actually change it [20:31:33] RECOVERY - Puppet failure on deployment-redis01 is OK Less than 1.00% above the threshold [0.0] [20:31:55] hashar: ok. maybe i'm better off excluding the rspec job for just the 0.4 branch of mediawiki-selenium [20:31:59] RECOVERY - Puppet failure on deployment-memc04 is OK Less than 1.00% above the threshold [0.0] [20:32:27] marxarelli: yeah that is doable [20:32:32] marxarelli: in zuul something like: [20:32:36] - job: whatever-rspec [20:32:51] branch: (!:^0.4$) [20:33:05] the problem is the job is probably shared by multiple repos :-/ [20:35:14] (03PS1) 10Dduvall: Don't run rspec for pre-1.0 branches of mediawiki_selenium [integration/config] - 10https://gerrit.wikimedia.org/r/215770 [20:35:19] hashar: ^ [20:36:34] oh [20:37:19] (03CR) 10Hashar: [C: 031] "GO go go !!!" [integration/config] - 10https://gerrit.wikimedia.org/r/215770 (owner: 10Dduvall) [20:37:27] marxarelli: I will let you +2 and deploy it :-} [20:37:51] there is a fabfile.py at the root of the repo for convenience (thanks legoktm ) [20:38:50] hashar: what's a fabfile? :) [20:39:16] oooh, fabric [20:39:17] neat [20:39:22] yeah a python deployment tool [20:39:24] yet another one [20:39:36] I like fabric solely because ... python :-} [20:39:50] Bryan Davis looked at it before porting scap from bash to python [20:39:54] but eventually had to dismiss it [20:40:30] RECOVERY - Puppet failure on deployment-cxserver03 is OK Less than 1.00% above the threshold [0.0] [20:41:20] RECOVERY - Puppet failure on deployment-elastic05 is OK Less than 1.00% above the threshold [0.0] [20:45:11] Jenkis is back [20:45:28] but deployment-bastion is still deadlocked apparently (: [20:47:49] !log Reloading Zuul to deploy I96649bc92a387021a32d354c374ad844e1680db2 [20:47:53] Logged the message, Master [20:49:17] !log restarted zuul entirely to remove some stalled jobs [20:49:19] marxarelli: ^^^ [20:49:20] sorry [20:49:22] Logged the message, Master [20:49:32] PROBLEM - Puppet failure on integration-dev is CRITICAL 33.33% of data above the critical threshold [0.0] [20:50:03] hashar: doh. my changes to layout.yaml didn't take for some reason [20:50:28] maybe the regex is wrong ? [20:50:31] or you forgot to git pull [20:51:21] hmm no [20:51:26] marxarelli: I only +1ed https://gerrit.wikimedia.org/r/#/c/215770/ :} [20:51:46] note that you did a filter for mediawiki-selenium-bundle-rspec' [20:51:51] but that job is not triggered apparently [20:51:59] there are mediawiki-selenium-gembuild [20:52:04] and bundle-yard / bundle-rspec [20:53:37] the jobs have been unified [20:53:44] and are now shared by multiple repos :-( [20:54:26] marxarelli: if you plan to add rspec, the best is probably to force merge that dummy change that just update .gitreview [21:01:20] bed time sorry :/ [21:05:06] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL 44.44% of data above the critical threshold [0.0] [21:05:07] (03CR) 10Polybuildr: "Ping?" 
[tools/codesniffer] - 10https://gerrit.wikimedia.org/r/153399 (owner: 10Addshore) [21:05:22] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:07:15] PROBLEM - Puppet failure on deployment-db1 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:07:19] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:08:01] PROBLEM - Puppet failure on deployment-db2 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:08:11] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:09:15] PROBLEM - Puppet failure on integration-vmbuilder-trusty is CRITICAL 44.44% of data above the critical threshold [0.0] [21:09:45] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:10:23] PROBLEM - Puppet failure on deployment-stream is CRITICAL 60.00% of data above the critical threshold [0.0] [21:12:18] PROBLEM - Puppet failure on deployment-sca02 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:12:24] PROBLEM - Puppet failure on integration-raita is CRITICAL 20.00% of data above the critical threshold [0.0] [21:13:00] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:14:06] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL 22.22% of data above the critical threshold [0.0] [21:14:06] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:14:34] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:15:07] (03Abandoned) 10Dduvall: Don't run rspec for pre-1.0 branches of mediawiki_selenium [integration/config] - 10https://gerrit.wikimedia.org/r/215770 (owner: 10Dduvall) [21:16:29] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:16:45] PROBLEM - Puppet failure on deployment-salt is CRITICAL 60.00% of data above the critical threshold [0.0] [21:17:25] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL 100.00% of data above the critical threshold [43200.0] [21:17:37] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:18:39] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:18:53] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:19:38] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:20:35] PROBLEM - Puppet failure on deployment-bastion is CRITICAL 66.67% of data above the critical threshold [0.0] [21:21:09] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL 33.33% of data above the critical threshold [0.0] [21:21:29] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:21:31] PROBLEM - Puppet failure on integration-slave-trusty-1017 is CRITICAL 22.22% of data above the critical threshold [0.0] [21:21:45] (03PS3) 10Dduvall: Fixup the 0.4 branch to work with CI jobs [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 [21:22:39] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL 40.00% of data above the critical threshold 
[0.0] [21:23:05] PROBLEM - Puppet failure on deployment-fluorine is CRITICAL 44.44% of data above the critical threshold [0.0] [21:23:17] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:23:27] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:24:07] PROBLEM - Puppet failure on deployment-upload is CRITICAL 44.44% of data above the critical threshold [0.0] [21:24:41] PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:25:03] PROBLEM - Puppet failure on integration-publisher is CRITICAL 44.44% of data above the critical threshold [0.0] [21:26:03] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:26:05] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL 57.14% of data above the critical threshold [0.0] [21:26:10] RECOVERY - Puppet failure on deployment-elastic06 is OK Less than 1.00% above the threshold [0.0] [21:26:16] ummmmmmmm [21:29:46] (03CR) 10Dduvall: [C: 032] Fixup the 0.4 branch to work with CI jobs [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 (owner: 10Dduvall) [21:31:00] (03Merged) 10jenkins-bot: Fixup the 0.4 branch to work with CI jobs [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 (owner: 10Dduvall) [21:31:02] see -labs, shit is down [21:32:15] RECOVERY - Puppet failure on deployment-db1 is OK Less than 1.00% above the threshold [0.0] [21:34:10] wee [21:34:42] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK Less than 1.00% above the threshold [0.0] [21:35:22] RECOVERY - Puppet failure on deployment-elastic08 is OK Less than 1.00% above the threshold [0.0] [21:35:26] RECOVERY - Puppet failure on deployment-stream is OK Less than 1.00% above the threshold [0.0] [21:36:02] RECOVERY - Puppet failure on deployment-redis02 is OK Less than 1.00% above the threshold [0.0] [21:37:16] RECOVERY - Puppet failure on deployment-sca02 is OK Less than 1.00% above the threshold [0.0] [21:37:23] RECOVERY - Puppet failure on deployment-elastic05 is OK Less than 1.00% above the threshold [0.0] [21:37:59] RECOVERY - Puppet failure on deployment-db2 is OK Less than 1.00% above the threshold [0.0] [21:39:03] RECOVERY - Puppet failure on deployment-zotero01 is OK Less than 1.00% above the threshold [0.0] [21:39:33] RECOVERY - Puppet failure on deployment-memc03 is OK Less than 1.00% above the threshold [0.0] [21:39:36] (03PS1) 10Dduvall: Check for session ID before updating SauceLabs job [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215796 (https://phabricator.wikimedia.org/T101304) [21:41:27] RECOVERY - Puppet failure on deployment-cxserver03 is OK Less than 1.00% above the threshold [0.0] [21:41:49] RECOVERY - Puppet failure on deployment-salt is OK Less than 1.00% above the threshold [0.0] [21:42:20] RECOVERY - Puppet failure on integration-raita is OK Less than 1.00% above the threshold [0.0] [21:43:02] RECOVERY - Puppet failure on deployment-memc04 is OK Less than 1.00% above the threshold [0.0] [21:47:14] (03PS1) 10Dduvall: Releasing patch version 0.4.3 [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215799 [21:48:03] (03CR) 10Dduvall: [C: 032] Check for session ID before updating SauceLabs job [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215796 (https://phabricator.wikimedia.org/T101304) (owner: 10Dduvall) [21:48:12] (03CR) 10Dduvall: [C: 032] Releasing 
patch version 0.4.3 [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215799 (owner: 10Dduvall) [21:53:20] (03CR) 10Hashar: "Nice workaround :-}" [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215760 (owner: 10Dduvall) [21:59:19] (03Merged) 10jenkins-bot: Check for session ID before updating SauceLabs job [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215796 (https://phabricator.wikimedia.org/T101304) (owner: 10Dduvall) [21:59:21] (03Merged) 10jenkins-bot: Releasing patch version 0.4.3 [selenium] (0.4) - 10https://gerrit.wikimedia.org/r/215799 (owner: 10Dduvall) [22:00:02] So, labs is ‘fixed’ but I can’t browse to beta.wmflabs.org. Is that somehow… expected? [22:02:05] andrewbogott: redirects me to http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page [22:02:19] which loads? [22:03:00] yeah, seemingly works for me [22:03:10] andrewbogott: yes [22:03:27] then I will ignore the fact that it does not load for me [22:03:31] andrewbogott: unless you are trying https [22:03:34] probably https-everwhere messing with me [22:03:39] see the problem kaldari had [22:03:41] although it happens on chrome as well which should be http [22:03:45] andrewbogott: try http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [22:03:46] there is that cookie [22:03:49] anyway… I’m no longer curious [22:03:59] bd808: that works [22:04:05] cookies are shared between browsers? [22:04:13] 14:59 < bd808> kaldari: look at your cookies. There is some cookie that gets set for some people some times that tries to force https to beta cluster. Once it's there you have to delete it manually or the frontend nagios/varnish will keep rediring you to the non-existent https endpoint [22:04:52] grrr... why is the beta logo not on the beta sites? [22:36:10] anyone else seeing "invalid host name" here? http://en.m.wikipedia.beta.wmflabs.org/ [22:37:00] PROBLEM - Puppet failure on integration-slave-trusty-1013 is CRITICAL 40.00% of data above the critical threshold [0.0] [22:37:32] PROBLEM - Puppet failure on integration-slave-precise-1012 is CRITICAL 50.00% of data above the critical threshold [0.0] [22:38:05] i see actual content [22:38:37] hrm [22:46:14] marxarelli: wfm [22:46:28] check yo cookies? [22:48:28] blek. http://i.imgur.com/fHgPHDl.png [22:49:21] MF browser tests are also failing as a result [22:49:22] https://integration.wikimedia.org/ci/view/BrowserTests/view/-Dashboard/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce/704/console [22:49:26] headers already sent? [22:50:31] wtf. i'm a hella confused "(Caused by : [Errno -2] Name or service not known" [22:50:37] seems like a dns issue [22:51:23] i'm also seeing funning stuff when i try to resolve wmflabs.org hosts locally [22:51:51] ldap is/was/something fubar'd in labs right now [22:51:52] ( dig +short bastion.wmflabs.org gives me nothing) [22:52:16] thcipriani: still around? [22:52:30] yup [22:52:38] ^^ [22:52:52] looking [22:53:14] not sure if you're aware of the labs general stuff going on as well [22:53:28] not sure of the summary of that (other than andrew called in faidon for help) [22:53:42] yeah, I kicked off the whole trend of "ldap doesn't work" before it was cool [22:53:42] this works: dig +short @labs-ns1.wikimedia.org. bastion.wmflabs.org [22:53:43] there are multiple issues [22:53:53] this don't: dig +short @labs-ns0.wikimedia.org. bastion.wmflabs.org [22:53:53] since LDAP was fixed [22:53:55] now: [22:54:01] ns-0 forgot stuff [22:54:04] ns-1 did not [22:54:07] got it! [22:54:17] thanks, mutante. 
i'll chill out [22:54:28] i also just reported it like a user [22:54:34] no replies so far [22:54:51] cant connect to my instance [22:57:04] Project browsertests-Gather-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #141: FAILURE in 4.2 sec: https://integration.wikimedia.org/ci/job/browsertests-Gather-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/141/ [23:02:31] RECOVERY - Puppet failure on integration-slave-precise-1012 is OK Less than 1.00% above the threshold [0.0] [23:06:59] RECOVERY - Puppet failure on integration-slave-trusty-1013 is OK Less than 1.00% above the threshold [0.0] [23:11:14] PROBLEM - Host Generic Beta Cluster is DOWN: CRITICAL - Host not found (en.wikipedia.beta.wmflabs.org) [23:16:28] (03CR) 10Addshore: "Pong!" [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/153399 (owner: 10Addshore) [23:27:37] (03PS4) 10Krinkle: Switch Graph extension to use npm testing for lint checks [integration/config] - 10https://gerrit.wikimedia.org/r/209991 (owner: 10TheDJ) [23:28:10] (03PS5) 10Krinkle: Switch Graph extension to use npm testing for lint checks [integration/config] - 10https://gerrit.wikimedia.org/r/209991 (owner: 10TheDJ) [23:28:39] (03CR) 10Krinkle: [C: 032] Switch Graph extension to use npm testing for lint checks [integration/config] - 10https://gerrit.wikimedia.org/r/209991 (owner: 10TheDJ) [23:30:13] (03Merged) 10jenkins-bot: Switch Graph extension to use npm testing for lint checks [integration/config] - 10https://gerrit.wikimedia.org/r/209991 (owner: 10TheDJ) [23:31:14] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/209991 [23:31:18] Logged the message, Master [23:41:17] RECOVERY - Host Generic Beta Cluster is UPING OK - Packet loss = 0%, RTA = 0.83 ms
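A quick way to reproduce the labs-ns0/labs-ns1 split that marxarelli and mutante narrow down above is to query both authoritative servers for the same record (the record name is just the one used in the log):

    for ns in labs-ns0.wikimedia.org labs-ns1.wikimedia.org; do
        echo "== $ns"; dig +short @"$ns" bastion.wmflabs.org
    done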