[05:06:47] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused [05:06:47] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:07:04] as you can see [05:07:07] shinken’s back up :) [05:08:50] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:10:32] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:11:21] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:11:23] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:11:39] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%) [06:36:38] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [08:49:01] 10Continuous-Integration, 10Gather: Gather should be using its own Gruntfile in Jenkins - https://phabricator.wikimedia.org/T92589#1227033 (10hashar) [08:54:26] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1227041 (10phuedx) 5Open>3Resolved All yer patches are merged @Jdlrobson. [08:54:28] 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1227043 (10phuedx) [09:00:02] 10Continuous-Integration, 10Gather: Gather should be using its own Gruntfile in Jenkins - https://phabricator.wikimedia.org/T92589#1227050 (10hashar) @Jdlrobson wrote: > PS. @hashar we really need to make these jobs something that developers get for free when they setup an extension. Maybe this is something we... [09:11:28] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1227055 (10hashar) Logged in with my LDAP account, I have manually triggered runs for the three... [09:37:34] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations, and 3 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1227102 (10hashar) I have poked @fgiunchedi about the Trusty packages. Rebuild it out of the integration/zuul.git deb... 
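hashar's last comment above is about rebuilding the Zuul Debian package for Trusty out of integration/zuul.git. The exact packaging layout isn't shown in the log, so the following is only a generic sketch of such a rebuild on a Trusty build host; the clone URL and the `debian` branch name are assumptions.

```
# Rough sketch of rebuilding the Zuul .deb on an Ubuntu Trusty host.
git clone https://gerrit.wikimedia.org/r/integration/zuul   # packaging repo (URL assumed)
cd zuul
git checkout debian                                          # hypothetical packaging branch name
sudo apt-get install -y devscripts equivs build-essential    # build tooling
sudo mk-build-deps -i -r debian/control                      # install declared build-deps
dpkg-buildpackage -us -uc -b                                 # build an unsigned binary package
ls ../zuul_*.deb
```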
[10:53:57] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations, and 3 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1227205 (10fgiunchedi) 5Open>3Resolved {{done}} [10:53:59] 10Continuous-Integration: Zuul: upgrade to latest upstream version - https://phabricator.wikimedia.org/T48354#1227209 (10fgiunchedi) [10:54:01] 10Continuous-Integration: Upgrade Zuul server to latest upstream - https://phabricator.wikimedia.org/T94409#1227208 (10fgiunchedi) [10:54:03] 10Continuous-Integration, 7Zuul: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#1227207 (10fgiunchedi) [11:06:19] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [11:13:50] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [11:16:25] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:20:29] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [11:31:15] !log integration: Zuul package has been uploaded for Trusty! Deleting the .deb from /home/hashar/ [11:31:18] Logged the message, Master [11:34:06] !log integration: apt-get upgrade on integration-slave-trusty* instances [11:34:09] Logged the message, Master [11:40:16] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:50:14] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [12:48:04] !log beta: Andrew B. starting to migrate beta cluster instances on new virt servers [12:48:07] Logged the message, Master [12:54:56] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1227342 (10hashar) 5Resolved>3Open Reopening since the builds I triggered earlier have some... [12:54:58] 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1227344 (10hashar) [12:59:40] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227356 (10hashar) a:3hashar Created instance i-00000b91 m1.small with image "ubuntu-12.04-precise" and integration-saltmaster.eqiad.wmflabs. https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000... [13:12:33] (03PS3) 10Hashar: Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:18:24] PROBLEM - Host deployment-bastion is DOWN: PING CRITICAL - Packet loss = 100% [13:18:50] RECOVERY - Host deployment-bastion is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [13:23:38] (03PS4) 10Hashar: Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:25:22] (03CR) 10Hashar: "That needs DOC_SUBPATH to be set which is done only if the job name is suffixed with '-publish'." 
(031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:25:27] (03CR) 10Hashar: [C: 032] Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:27:26] (03Merged) 10jenkins-bot: Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:34:26] (03CR) 10Hashar: "So the publish job worked just fine https://integration.wikimedia.org/ci/job/mediawiki-vagrant-puppet-doc-publish/642/" [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:35:11] 10Continuous-Integration, 5Patch-For-Review: Migrate all jobs to labs slaves - https://phabricator.wikimedia.org/T86659#1227420 (10hashar) [13:38:24] (03CR) 10Hashar: [C: 04-1] "Suffix the job name with '-publish' and that will be good :)" (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/204982 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:38:28] PROBLEM - Puppet failure on integration-saltmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [13:51:32] !log integration-slave-trusty-1015:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-qunit/src/node_modules [13:51:34] Logged the message, Master [13:53:29] RECOVERY - Puppet failure on integration-saltmaster is OK: OK: Less than 1.00% above the threshold [0.0] [13:58:37] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1227501 (10Aklapper) >>! In T95469#1226330, @mmodell wrote: > @christopher: I'd be ok with having the dates on the default form but that isn't really my call. Same here: Would... [14:04:34] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1227540 (10chasemp) This blocking task has not be resolved for to facilitate upgrade. Pursuant to: >>! In T95469#1217373, @chasemp wrote: >>>! In T95469#1199733, @ksmith wrot... [14:10:06] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227545 (10hashar) I have applied `role::salt::masters::labs::project_master` and ran puppet. I took: * public key from `/etc/salt/pki/master/master.pub` * fingerprint via `salt-key -f /etc/salt/pki/... [14:15:32] 10Browser-Tests, 6Collaboration-Team, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1227570 (10SBisson) An error in ext/popups was preventing most our tests from running. After https://gerrit.wikimedia.org/r/#/c/202976/ is merged, we'll what... 
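T87819 above describes bringing up integration-saltmaster and collecting the master key fingerprint from /etc/salt/pki/master/master.pub. A minimal sketch of verifying that key exchange; the minion id used below is invented, not one of the real instances.

```
# On integration-saltmaster: list key fingerprints (local master keys and minions)
sudo salt-key -F

# Inspect and accept one pending minion key
sudo salt-key -f i-00000abc.eqiad.wmflabs    # hypothetical minion id
sudo salt-key -a i-00000abc.eqiad.wmflabs

# On the slave (minion): print the local key fingerprint to compare
sudo salt-call --local key.finger

# Back on the master: confirm minions respond
sudo salt '*' test.ping
```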
[14:19:16] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [14:19:26] 7Blocked-on-RelEng, 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, and 3 others: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1227614 (10coren) [14:19:29] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 2.073 second response time [14:19:30] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1791 bytes in 3.037 second response time [14:21:06] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1534 bytes in 6.801 second response time [14:21:34] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1534 bytes in 3.056 second response time [14:21:54] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1792 bytes in 6.183 second response time [14:23:10] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227647 (10hashar) 5Open>3Resolved The salt autosigner is part of puppet class `puppetmaster::autosigner`. I have applied it and that creates the cron: ``` * * * * * /usr/local/sbin/puppetsigner.py... [14:28:30] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [14:29:27] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 47129 bytes in 0.639 second response time [14:29:29] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47221 bytes in 0.656 second response time [14:31:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.560 second response time [14:31:31] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.534 second response time [14:31:49] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28266 bytes in 0.544 second response time [14:31:55] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227674 (10hashar) Announced on QA list https://lists.wikimedia.org/pipermail/qa/2015-April/002246.html [14:34:41] !log beta: failures on instances are due to them being moved on different openstack compute nodes (virt***) [14:34:44] Logged the message, Master [14:37:11] greg-g: (or someone else): I need one of you to give me a hand to make sure the deployment-prep is ready for idmap being turned off tomorrow. Shouldn't take very long (1-2h tops) [14:39:08] I have no idea what idmap is [14:42:00] greg-g: I've emailed regularily on the topic. :-) The short story: (a) precise instances that have not been rebooted since Apr 23 need to be rebooted, and (b) we need to make sure that anything that is written to /data/project is owned by a user that is either in LDAP or managed by puppet. [14:42:31] greg-g: In deployment-prep's case, (b) is already mostly done (maybe all) when we moved the users to LDAP a while ago. [14:43:33] greg-g: So it's really 99% just rebooting the instances in whichever order is least disruptive and making sure they come back up happy. 
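Coren's point (b) above is essentially an ownership audit of /data/project. A rough sketch of the kind of checks that implies; the path comes from the log, the account names are illustrative.

```
# Files whose owning uid/gid no longer resolves to any known user or group
sudo find /data/project -xdev \( -nouser -o -nogroup \) -ls | head

# Summarise which owner:group pairs hold files (slow on a large share)
sudo find /data/project -xdev -printf '%u(%U):%g(%G)\n' | sort | uniq -c | sort -rn | head

# Confirm a given owner resolves via LDAP/NSS
getent passwd udp2log www-data mwdeploy
```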
[14:43:55] Coren: to where? [14:43:58] (didyou email) [14:44:14] labs-announce, labs-l and, iirc, wikitech-l for most. [14:44:18] thcipriani: ^ [14:45:53] yeah, I can help with that [14:46:18] least disruptive order would be something I'd need to explore a bit [14:48:13] thcipriani: Sure. From my side, the order is really immaterial so it's really for the project's benefit. "pull-out-the-bandaid and reboot 'em all now" vs "ginglerly, in a precise order". :-) [14:49:29] Coren: right. I'll probably want to get advice from hashar on that. [14:50:19] thcipriani: Well, he sent me to #wikimedia-releng specifically to talk with you guys. :-) [14:51:30] * Coren brb shortly, off to get tea. [14:56:08] mmm tea [15:00:16] is for the weak, french press here :) [15:02:23] <^d> We used to have a french press but it's too time consuming most mornings :p [15:03:09] took 2nd place in the Longmont adult science fair for brewing coffee :) https://tylercipriani.com/coffee-extract/ [15:03:55] * ^d is new to coffee [15:04:05] thcipriani: way to just roll in a with a sledgehammer there. "BAM! MATH!" [15:04:07] <^d> I finally got over my childhood aversion to the taste just a few months ago [15:05:03] heh, the first place I went in SF, was Blue Bottle, pretty awesome. [15:05:53] :) [15:09:10] <^d> thcipriani: Have you tried Sightglass? Also very good [15:09:47] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Nodepool, and 2 others: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1227905 (10hashar) A status update: The initial Debian packaging is ready for review https://gerrit.wikimedia.... [15:10:04] Coren: so, looking at deployment prep, there aren't too terribly many precise instances, but there may be some ownership issues for the deployment-prep project disk, logs is owned by 996 and cxserver is owned by 995 [15:10:19] ^d: no, but their website does look promising [15:12:25] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1227916 (10hashar) 3NEW [15:15:59] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1227948 (10hashar) I have poked our internal ops list to get some tips and hints. [15:19:10] thcipriani: I think we can afford some downtime on beta cluster [15:19:20] if we announce it ahead of time, it is probably acceptable [15:20:24] hashar: kk, my initial take was take down everything at once save deployment-salt since it's the puppetmaster and everything will want to hit it when they come back up [15:21:09] thcipriani: That's only an issue if those are accessed cross-instance /and/ the uids differ. [15:21:11] after all other precise instances have rebooted, then reboot deployment-salt—seem reasonable? [15:21:26] ah for NFS idmap [15:21:53] the main offenders were mwdeploy / apache [15:22:14] hashar: Right, and those are known to be uniform now. [15:22:15] since they share files on /data/project . But the users have been created on the NFS server [15:22:27] and I think some work has been done recently to ensure they are created with a stable UID [15:22:33] also I think we renamed apache to www-data [15:22:46] hashar: That's prod standard and a good idea anyways. 
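The idmap concern above boils down to one question: does each shared account resolve to the same uid on every instance that writes to /data/project? A sketch of checking that; the 'deployment-*' glob is illustrative only, since (as noted later in the log) salt minion ids are the internal instance names, not the friendly ones.

```
# Via salt, compare the uid each instance assigns to the shared accounts
sudo salt 'deployment-*' cmd.run 'getent passwd www-data udp2log mwdeploy'

# Or, without salt, loop over a hand-written host list
for h in deployment-mediawiki01 deployment-mediawiki02 deployment-bastion; do
    echo "== $h =="
    ssh "$h" 'id -u www-data; id -u udp2log' 2>/dev/null
done
```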
[15:22:55] yup [15:23:03] apache was probably inherited from the Fedora era [15:23:22] (yeah we used to run Fedora) [15:23:41] then under /data/project there are a few more suspects [15:23:51] Mostly loggin afaict [15:24:12] /data/project/syslog/ files are written by syslog-ng on deployment-bastion but that is root:wikidev owned [15:24:32] /data/project/logs is the udp2log daemon on deployment-bastion [15:24:36] hashar: https://phabricator.wikimedia.org/T95554 has a recent list [15:25:08] ah great [15:25:47] /data/project/parsoid/ is written from the Parsoid instance -cant remember the name- [15:26:00] tl;dr: I see nothing especially worrisome in deployment-prep. Almost all logs. [15:26:03] anyway, I have no idea what NFS idmap is and what it can disrupt :/ [15:26:39] hashar: LOng story short: the only thing it can disrupt is if two instances access the same files currently with the same /username/ but different user ids. [15:26:39] the main trouble would be hosted files / thumbs in /data/project/upload7 which are written by multiple instances and owned by www-data:www-data [15:26:49] www-data is known to be fixed. [15:26:53] So not an issue. [15:27:39] It's uid 33 everywhere - and part of base so invariant. [15:27:42] so it is probably going to be fine :) [15:28:00] udp2log is quite needed [15:28:12] iirc that relays all mediawiki logs to logstash somehow [15:28:23] Right - I don't actually expect trouble. Doesn't mean I don't want to babysit the process. [15:28:27] looks like udp2log is only on flouride... [15:28:29] but maybe that relay doesn't even hit the disk [15:28:56] thcipriani: oh yeah, maybe ori migrated it out of deployment-bastion where I have set it up orginally [15:29:04] sorry that is a bit of a mess :( [15:29:28] we don't use udp2log to get logs into logstash anymore but it is used to get on disk logs in beta cluster and prod [15:29:40] * bd808 reads backscroll [15:29:40] er, sorry, udp2log is bastion-only [15:30:25] Things that are accessed by only one instance are guaranteed to be unaffected - you can't have an uid mismatch with yourself. :-) [15:32:00] the log path for things written to /data/project/logs is [host] > udp > [deployment-bastion] > udp2log > disk [15:32:27] [host] is any node running MediaWiki in the cluster [15:32:57] so from the looks of it udp2log should be fine. Now wondering about deployment-salt/log [15:36:23] Coren: I guess I'm not seeing the issues with deployment-salt/log from https://phabricator.wikimedia.org/T95554#1196735 [15:37:24] thcipriani: There may not be one - those are simply place where there are files owned by users which are simply not _guaranteed_ to have the same uid between instances. [15:37:39] Usually, those are owners managed by debian packages. [15:37:48] (As opposed to base, ldap, or puppet) [15:39:22] gotcha, yeah, instances seem happy and in agreement about uid==owner for those files, at least with a cursory check [15:42:56] hashar: where do we announce downtime for deployment-prep? Also, how do we prevent shinken spam while we're doing this? [15:46:49] thcipriani: Yuvi can hush shinken. I'd announce on labs-l, qa-l and engineering-l [15:47:07] and set the topic here when you are actually doing it [15:47:34] bd808: thanks! [15:57:18] Coren: this needs to be done by tomorrow for the idmap shut-off? Also, above, you said, "precise instances that have not been rebooted since Apr 23" need to be rebooted, which is tomorrow unless you meant last year or is a typo. [15:57:40] It's a typo. @^#%$. 
That was meant to be the 13th [15:58:12] thcipriani: Incidentally, reboot-if-idmap is a noop on boxes that don't need it. Might save you a reboot or two. :-) [16:01:44] thcipriani: I can try to hold off the idmap thing as long as I can if you need the extra time; I've got things that depend on it to progress but nothing will break if I delay. [16:02:23] Coren: kk, doing some salt spelunking now [16:28:45] twentyafterfour: scap deployment today? [16:32:24] marxarelli: yeah in about an hour [16:35:03] twentyafterfour: mind if i sit in? [16:35:19] no I don't mind at all [16:35:41] for a preview: this is the general outline: https://etherpad.wikimedia.org/p/mmodell [16:36:02] oh boy. that looks fun! [16:39:15] oh whoops, that's going to overlap with SoS [16:40:08] anyone want to attend SoS for me? greg-g? thcipriani? it's super fun. like the funnest [16:41:28] Coren: looks like all 19 precise boxes will need to be rebooted, after some deeper dives into files, I don't foresee any problems. I think we can reboot this afternoon, reboot all except deployment-salt then do deployment-salt, just to keep puppet happy. [16:43:26] marxarelli: I'm a chicken, not allowed [16:43:53] if we want to do a reboot this afternoon, I should announce a downtime now— YuviPanda: would you be able to shush shinken while this all happens if we decide to reboot this afternoon? [16:46:13] marxarelli: when and how long is SoS? And how painful :P [16:48:44] thcipriani: the deployment is a two hour window [16:48:59] 11:00 to 1:00 [16:49:06] er sory marktraceur [16:49:12] pinging wrong person [16:49:28] I'm in a meeting and somewhat distracted I'm sorry [16:50:00] twentyafterfour: no problem [16:50:48] thcipriani: not very painful [16:54:19] thcipriani: it's cake. you just fill out the etherpad with our team's updates, highlight any blockers, and answer questions should any other team be blocked by us [16:54:54] only 30 mins? not too bad, we're not blocked on anyone, seems like idmap may be one of the few things blocked on us. marxarelli I can jump in there if you wanna watch a deploy. [16:55:11] thcipriani: nice! thanks homey [16:56:55] marxarelli: np [17:01:06] marxarelli: was there anything that needed mentioning about isolated CI? [17:03:25] thcipriani: Sounds excellent. [17:04:10] thcipriani: you could mention the breaking up of the pool into smaller instances but unless there's a blocker i don't usually go into much detail [17:05:07] Coren: You around 2pm PDT to do that? I'm assuming you can stop the libvirt hosts so I don't have to go all clicky on wikitech, or open 19 shell sessions :) [17:05:26] marxarelli: kk, thanks [17:06:13] thcipriani: I'll be standing by for you and can reboot in batch - just give me a list of instances and I'll roll it off. [17:16:02] Coren: Here's the instance list and reboot groups: https://phabricator.wikimedia.org/P545 [17:16:42] Esckchellent. [17:17:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:10] thcipriani: I presume there is no issue if group 1 are rebooted in batches of 2-3? [17:17:40] * Coren doesn't want to bring the virt hosts to their knees with too many simultaneous reboots. [17:19:04] Coren: I don't _think_ so, but I moved the db servers to the end of the list, just to make me feel good. [17:19:16] fair 'nuf. :-) [17:20:31] Coren: how long will it take, roughly, for the whole reboot? Want to give some padding in the announce. [17:21:30] Hm. 
7 batches or so - count 15m to give some elbow room. Reboots proper tend to take only a minute or so. [17:21:40] So give a 30m window and we're golden. [17:21:55] 1h if you're feeling paranoid. [17:21:59] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.525 second response time [17:23:21] Coren: 1 hr it is :) [17:39:11] is there a good way to get more verbose output from the browsertests? specifically i would like to see what watir is actually telling the browser to do step by step. I've tried the --verbose option to cucumber, but that just reports on cucmber don't get anything from watir [17:43:48] ebernhardson: you can try setting `$DEBUG = true` in a before hook. i can't remember if that will echo the selenium commands or not, but it will at least spit out the xpath/css selectors after they're compiled [17:44:30] so add `Before { $DEBUG = true }` to your env.rb or step definition file [17:45:57] trying that now, thanks [17:46:10] had to google cucumber before hook, found that part :) [17:48:55] marxarelli: ahha, thanks that is giving me more info to work with at least [17:52:11] thcipriani: you are obviously a serious unix guru. That's an epic beard you have today. :) [17:52:59] well, ya know, I try :) [17:53:21] * greg-g 'll have to trim mine soon so I don't feel bad [17:54:01] also, to be fair, I do sport this epic beard every day [17:54:36] and also, to be fair, some days I feel a little outmatched by greg-g's beard. [17:54:49] I wear my beard on the inside -- https://www.youtube.com/watch?v=_yVPewAybZw [17:54:51] +1 on thcipriani’s beard [17:54:59] thcipriani: greg-g’s mane you mean [17:57:03] * greg-g is growing the hair out to match [17:57:18] * YuviPanda wishes he could do that [17:57:23] lately i've noticed that greg-g's head hair is encroaching on his beard's territory and wondered whether they will unite in harmony to become the Unimane or if there will be trouble [17:57:26] alas, genes and what not [17:58:21] marxarelli: there are pictures of me from my camp counselor days on FB that show what it looks like after 6 years of head hair growth and 6 months of beard growth [17:58:30] !log Creating integration-slave-trusty-1021 per T96629 (using ci1.medium type) [17:58:35] Logged the message, Master [17:58:56] let me guess, the beard grows as long in 6 months as the hair does in 6 years? [17:59:08] greg-g: oh man, i have some of those too [17:59:20] marxarelli: deploy time [17:59:21] lot's of Jesus and yeti comments [17:59:21] thcipriani: basically, at least relative to my shoulders. [17:59:28] twentyafterfour: booyah [18:00:15] lets see, does tin have tmux...or you wanna use screen sharing on hangouts? [18:00:15] twentyafterfour: hangout probably [18:00:16] + tmux if necessary [18:00:16] yeah no tmux [18:02:15] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [18:02:47] 5Continuous-Integration-Isolation, 6operations: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1228381 (10RobH) a:5RobH>3chasemp I'm going to assign this to chase, only while the discussion is pending about the networking. (Since he is discussing with @mark).... [18:07:02] <^d> twentyafterfour: fwiw, 1.26wmf2 was tracking master earlier instead of its proper branch. I fixed it and scapped. You shouldn't have any problems with it but just fyi in case it broke again. 
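The batched reboot plan discussed above (2-3 instances at a time, roughly 15 minutes overall) could be scripted along these lines. The host list below is a placeholder; the real groups live in the P545 paste, which isn't reproduced in the log.

```
# Sketch: reboot a few instances at a time, pausing between batches.
hosts=(deployment-cache-text02 deployment-cache-mobile03 deployment-mediawiki01)  # ...placeholder list
batch=3
for ((i = 0; i < ${#hosts[@]}; i += batch)); do
    for h in "${hosts[@]:i:batch}"; do
        echo "rebooting $h"
        ssh "$h" sudo reboot || true
    done
    sleep 120   # give the batch time to come back before starting the next one
done
```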
[18:07:12] <^d> (during swat this morning) [18:07:21] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 130.68 ms [18:09:14] thcipriani: ^d just wanted to say thanks for all the work done on staging :) [18:09:38] <^d> Lots of nice cleanup happened :) [18:11:32] YuviPanda: thank you—lots of mountains moved with your effort. Hopefully, we can rebuild momentum on that project within a couple of quarters. [18:11:41] +1 [18:11:52] thcipriani: ^d what’re we going to do with all the curren hosts? let ‘em lie idle? [18:12:31] <^d> While nobody's going to be working on it officially it might end up being something we work on while bored, so I wouldn't kill it all [18:13:11] fair enough [18:13:16] I don't know how much those virt resources are needed, but it is nice to have around for the sole fact that I'm not worried about breaking it :) [18:13:19] and we’re not crunched for space atm [18:13:32] but I guess we can do something *if* we get crunched [18:15:23] <^d> Yep, if we need extra resources the stuff in staging would be top of the list to kill imho [18:15:44] cool [18:17:26] <^d> We should probably finish pushing any open patches through that we still have [18:17:51] <^d> for me, that's dsh + unifying at least nonexistent.conf in web_sites [18:18:38] yeah, I think at least dsh [18:18:44] i’ve almost 0 overlap with _joe_ now tho [18:18:55] I’ll push it through over the next few days [18:19:11] <^d> I'm rolling back to your PS4 instead of my PS5 on that. [18:19:15] <^d> Your approach was much simpler [18:19:18] yeah [18:19:20] <^d> (and I could always follow up with my ideas) [18:19:27] long term I like yours tho [18:20:58] <^d> ugh, pushing an old patch says no new changes [18:21:00] <^d> stupid gerrit [18:21:28] <^d> I'll rebase then [18:43:26] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:00:43] 10Browser-Tests, 6Collaboration-Team, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1228692 (10DannyH) [19:01:49] 10Browser-Tests, 6Collaboration-Team, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1156329 (10DannyH) Elena will check this over. [19:08:03] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1228720 (10DannyH) [19:08:25] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [19:13:48] anyone know why the zuul queues are so backed up? [19:14:20] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2182.64 ms [19:16:41] 7Blocked-on-RelEng, 6Release-Engineering, 6Multimedia, 6Scrum-of-Scrums, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1228790 (10Tgr) [19:22:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [19:22:47] <^d> twentyafterfour: Was about to ask the same [19:22:58] <^d> jobs have returned as ok, zuul doesn't know [19:23:19] good evening [19:23:47] <^d> sounds like gearman? [19:23:58] what is going on ? 
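The exchange that follows walks through diagnosing Zuul's embedded Gearman server stalling on gallium. A sketch of the checks it mentions (the zuul log directory, the forked zuul-server process, port 4730, and the `zuul-gearman.py status` ping); the exact error log filename is an assumption.

```
tail -n 50 /var/log/zuul/error.log      # log directory from the conversation; filename assumed
ps axf | grep '[z]uul-server'           # scheduler plus the forked geard child
ss -tnlp | grep 4730                    # is the embedded gearman server listening?
zuul-gearman.py status | head           # the "lame way to ping it"
# If this hangs, the workaround described below is to disable the Jenkins Gearman
# client (freeing the blocked read()), then re-enable it from /ci/configure.
```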
[19:24:06] <^d> queues backed up [19:24:08] <^d> https://integration.wikimedia.org/zuul/ [19:24:10] damn [19:24:24] there are a bunch of graphs at the bottom [19:24:30] <^d> The jobs on top look like the finished in jenkins but zuul doesn't have their return status yet. [19:24:35] <^d> *they [19:24:35] the Zuul Geard job queue has empty / null values [19:24:41] so that would indicate the process is stalled [19:25:01] <^d> <^d> sounds like gearman? [19:25:03] <^d> :D [19:25:14] and at the top [19:25:17] Queue lengths: 57 events, 39 results. [19:25:25] so Zuul has a lot of pending events and is deadlocked somehow [19:25:35] * hashar looks at error log [19:25:40] on gallium in /var/log/zuul/ [19:26:10] NoConnectedServersError: No connected Gearman servers [19:26:12] grbmbmbm [19:26:25] 3705 ? Sl 35:06 /usr/share/python/zuul/bin/python /usr/bin/zuul-server -c /etc/zuul/zuul-server.conf [19:26:25] 3718 ? Sl 69:08 \_ /usr/share/python/zuul/bin/python /usr/bin/zuul-server -c /etc/zuul/zuul-server.conf [19:26:31] the first process is the Zuul scheduler/ server [19:26:39] on start it forks to spawn the gearman server [19:26:46] which listens on port 4730 [19:27:01] lame way to ping it: $ zuul-gearman.py status [19:27:41] might be stalled on some Gearman client connection. [19:27:55] !log Zuul gearman is stalled. Disabling Jenkins gearman client to free up connections [19:27:58] Logged the message, Master [19:28:43] maybe we should a different gearman daemon [19:29:14] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2411.58 ms [19:29:40] ok gearman no more stalled [19:29:56] reenabling the Gearman client [19:30:11] !log Gearman went back. Reenabling Jenkins as a Gearman client [19:30:14] Logged the message, Master [19:33:04] twentyafterfour: ^d: so there is a bug in Zuul gearman which cause it to stall completely :( [19:33:17] it is blocked trying to read on a socket that has no more data [19:33:22] and there is no timeout for the read call :/ [19:33:32] the only way to fix it is to kill the socket [19:33:37] <^d> So you disable the plugin, bounce gearman, then turn the plugin back on? [19:33:40] which is by disabling the Jenkins gearan client [19:33:50] this way that close the scoket, free up the blocking read() call [19:33:54] and resume operation [19:34:00] yup [19:34:05] all from https://integration.wikimedia.org/ci/configure [19:34:15] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [19:34:38] I havent found a way to reliably reproduce the bug :( [19:37:17] hashar: Hm.. too bad salt uses the internal hostname, not the custom one [19:37:25] '*slave*' matches nothing [19:37:44] Krinkle: yeah I have noticed that. I guess that is because the human friendly hostname has no guarantee to be unique [19:40:18] !log reenabling Jenkins gearman client [19:40:19] damn [19:40:20] Logged the message, Master [19:40:29] I flushed the jobs results per mistake [19:41:42] hashar: We also have two browsertest jobs stuck on IRC freenode [19:42:01] :(( [19:42:08] How did you kill that socket again? [19:42:11] Can you write on https://wikitech.wikimedia.org/wiki/Release_Engineering/Argh ? [19:42:12] and I swear I have downgraded the plugin [19:42:48] Ah, I added section last time. I forogt [19:43:12] so [19:43:16] lets kill Jenkins :( [19:43:18] Hm.. I don't see bash commands on https://phabricator.wikimedia.org/T96183 though [19:43:21] Yeah [19:43:29] I hope the Zuul queue can be preserved? [19:43:35] yeah [19:44:09] Ah, right. 
The $packit didn't fix it last time [19:44:47] yeah packit is just to terminate the CLOSE_WAIT socket [19:44:53] but the underlying java code does not react to it [19:46:21] java :-/ [19:46:29] well java is just fine [19:46:36] but the lame code is causing issues [19:46:38] :D [19:46:53] surely if it was written in python or php it would be easier to fix for us [19:48:06] well yeah, java is just so monilithic, when it fails it requires hard rebooting. it's like windows [19:49:15] twentyafterfour: in my experiences Windows requires re-install of the OS if it fails. [19:49:25] I guess we will want to get rid of the IRC plugin [19:49:30] and roll out our own notification system [19:49:34] Krinkle: yeah then there's that [19:49:52] I stopped using windows right around the time that vista came aout [19:49:57] haven't touched it since then [19:50:09] hashar: sounds like a good idea [19:50:10] Jenkins back up [19:50:18] !log zuul/jenkins are back up (blame Jenkins) [19:50:20] Logged the message, Master [19:50:27] I must've wasted countless hours between the ages of 8 and 14 rebooting Windows 95/98/ME computers only to encounter the same BSOD or DLL error again. [19:51:19] 95 ??? [19:51:49] I have stopped with XP which I nicknamed the Care Bears OS ( http://en.wikipedia.org/wiki/Care_Bears ) [19:52:02] I think I've only seen maybe 20 kernel panics since I switched to only using unix. Most of those were Mac OS X (and I fiddled with making a hackintosh, so kernel panics were to be expected ) [19:52:14] though I have stuck to Win 2K for quite a while (I had a multi proc machine and was playing games) [19:53:29] Ah yeah. OS X 10.2-10.4 I had kernal panics quite often. Not sure what I did wrong. [19:53:33] It never happened to my parents. [19:53:50] But then again, they weren't really "using" the computer the way I did. [19:54:12] 00:03:35.368 ERROR: '[earthquake] Url exceeds maximum length' [19:54:14] lovely message [19:54:16] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3199.34 ms [19:54:24] from https://integration.wikimedia.org/ci/job/mwext-Flow-qunit/5548/consoleFull [19:54:26] hashar: props for Ori to that EventLogging unit test. [19:54:40] It's an expected error being tested [19:55:12] and of course we can't capture/suppress output right? [19:55:35] oh [19:55:44] and Krinkle kudos for the new Zuul status page [19:55:53] the subway like pipelines are quite nice to see in the gate-and-submit [19:56:15] hashar: We can capture it actually. I implemented suppressWarnings()/restoreWarnings() in QUnit last year. [19:56:17] I'll add it [19:56:20] to EL test [19:56:24] \o/ [19:56:25] 10Continuous-Integration, 10Wikimedia-Hackathon-2015: All new extensions should be setup automatically with Zuul - https://phabricator.wikimedia.org/T92909#1228923 (10Jdlrobson) See : https://gerrit.wikimedia.org/r/#/c/205726/ this is wasting reviewers time unnecessarily. [19:57:08] 10Continuous-Integration, 10Gather: PHPUnit tests do not get run by Jenkins for Gather commits - https://phabricator.wikimedia.org/T96904#1228924 (10Jdlrobson) 3NEW [19:57:14] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 16%, RTA = 388.06 ms [19:59:06] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:00:39] hashar: since you're around for the moment. 
Do you see any problem in doing deployment-prep reboots divided up like this: https://phabricator.wikimedia.org/P545 [20:00:47] just wanted a quick sanity check [20:01:04] thcipriani: that is for the NFS idmap right? [20:01:12] yes [20:01:27] from a quick conversation I had with coren earlier today [20:01:43] we can probaqbly reboot everything however we want [20:01:56] any reason to have salt rebooted separately? [20:02:09] just becuase when instances come back up they'll want to do a puppet run [20:02:17] oh true [20:03:12] I couldn't think of any other reasons to group any reboots. [20:03:24] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2126.43 ms [20:03:33] yeah I think you can jusdt mass reboot everything [20:04:02] the deployment-cache* machines should not be hitting the NFS shares [20:04:05] might be good candidates [20:04:31] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1228987 (10DannyH) [20:05:46] hashar: cool, thanks for the reassurance. I scheduled the reboot for 2pm PDT. [20:06:32] I think the only potential screw up would be instances not able to write to /data/project due to some uid mismatch [20:08:29] I think the biggest concern is uid mismatches between machines. I spent 15 mins or so this morning reviewing owner/uid stuff—everything seemed ok [20:08:39] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<50.00%) [20:08:42] ^ "everything seemed ok" famous last words [20:13:09] 10Browser-Tests, 10MediaWiki-extensions-UploadWizard, 6Multimedia: Fix failed UploadWizard browsertests Jenkins job - https://phabricator.wikimedia.org/T94161#1229081 (10MarkTraceur) [20:14:25] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:16:25] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuos-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229102 (10Krinkle) 3NEW [20:16:46] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229109 (10Krinkle) [20:19:41] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229116 (10Krinkle) p:5Triage>3High [20:20:00] !log gzipped /var/log/pacct.0 on deployment-bastion [20:20:02] Logged the message, Master [20:25:32] hallo [20:25:48] does anybody know why is this failing? 
- https://gerrit.wikimedia.org/r/#/c/205260/ [20:26:03] I see nothing useful in the mwext-testextension-zend output [20:29:23] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [20:30:47] 10Continuous-Integration, 10MediaWiki-extensions-Scribunto, 7I18n: mwext-testextension-zend fails when changing namespace aliases in Scribunto - https://phabricator.wikimedia.org/T96912#1229176 (10Amire80) 3NEW [20:33:00] 10Beta-Cluster, 10VisualEditor: Cannot open any page with VE in Betalabs, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1229198 (10Ryasmeen) p:5Triage>3... [20:33:39] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [20:37:38] thcipriani: Ima go eat a quick bite then I'm standing by. [20:37:57] Coren: kk, whenever you're back I'm ready [20:58:26] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229330 (10hashar) Since new tasks land in our `Untriaged` column should we get a `Config` column to hold them? The advantage would be to still have... [21:01:46] Krinkle: definitely in favor of separating CI tasks. That is a great idea. I replied on https://phabricator.wikimedia.org/T96908#1229330 [21:02:08] Krinkle: merely suggesting to create an additional column but that is probably not that much of a good idea [21:02:34] thcipriani: I'm ready when you are. [21:03:26] 10Continuous-Integration: Run QUnit tests via SauceLabs - https://phabricator.wikimedia.org/T96919#1229336 (10Krinkle) 3NEW [21:04:09] Coren: Yup, I'm ready [21:04:12] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [21:04:18] hashar: The problem is triaging and backlog. Without a separate project we won't know until after we triage. [21:04:20] That's extra work [21:06:02] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229351 (10Krinkle) >>! In T96908#1229330, @hashar wrote: > Since new tasks land in our `Untriaged` column should we create a `Config` column to hold... [21:06:06] thcipriani: In progress. [21:06:16] Coren: watching [21:06:35] thcipriani: btw: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ :( [21:07:04] \o [21:07:07] jdlrobson: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [21:07:29] short answer: yes, beta *cluster* (see also: https://wikitech.wikimedia.org/wiki/Labs_labs_labs ;) ) isn't up to date right now [21:07:58] doh. any estimates on when it's likely to be fixed? Have a product owner asking to test some stuff :) [21:07:59] 10Continuous-Integration, 6Release-Engineering: Run qunit tests in IE8 (and possibly other Grade A browsers) - https://phabricator.wikimedia.org/T96432#1229355 (10Krinkle) [21:08:15] no eta at the moment [21:08:17] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229358 (10hashar) Perfect we are on the same line. I just wanted to make sure you had the same idea :-) [21:08:21] Krinkle: excellent thanks a ton. [21:08:35] Krinkle: then we can rename Continuous-Integration to Continuous-Integration-Infra ? 
[21:08:41] Yes [21:09:16] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2174.49 ms [21:09:28] twentyafterfour: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/lastBuild/console :( [21:10:02] greg-g: bug i can subscribe to? [21:10:05] Krinkle: nice. I am not there tomorrow but will back on friday. Will write the meeting minutes [21:11:20] jdlrobson: frmo thcipriani mail to engineering list a couple hours [21:11:30] 10Beta-Cluster: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1229359 (10greg) 3NEW a:3mmodell [21:11:38] jdlrobson: done ^ (already cc'd you) [21:12:12] hashar: also that [21:12:29] jdlrobson: unrelated, there will be a beta cluster outage starting -12 minutes ago ish [21:12:34] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [21:13:12] thcipriani: That one isn't expected, I think. Want to look into it? ^^ [21:13:17] looking now [21:13:24] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [21:13:32] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1536 bytes in 0.287 second response time [21:13:57] (Might just be a check for puppet freshness at the wrong time) [21:14:51] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1796 bytes in 2.359 second response time [21:16:15] Coren: re-ran puppet on deployment-bastion, seems fine now [21:16:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [21:16:42] Yeah, okay, that looks like the check is sensitive to reboots. [21:17:53] looking at deployment-restbase01 [21:18:01] deployment-eventlogging02.eqiad.wmflabs is problematic; it appears to be out of date w/ puppet master [21:18:30] Which means that the patch that is meant to be applied with a reboot probably isn't there. [21:18:33] * Coren checks. [21:18:48] The last Puppet run was at Wed Apr 1 01:44:58 UTC 2015 (31413 minutes ago). [21:19:13] * thcipriani looks [21:19:16] Indeed. Should I patch it manually or was puppet supposed to be running there but wasn't? [21:19:37] hashar: ^ ? [21:19:48] Coren: I'm pretty sure puppet was supposed to be running [21:19:52] * thcipriani looks at SAL [21:20:43] Coren: thcipriani no clue [21:21:02] for eventlogging your best chance is to ask ori probably [21:21:09] huh, looks like it's been stalled before, but not record of it being paused [21:21:25] Feb 03 09:15 hashar: Running puppet on deployment-eventlogging02 has been stalled for 3d15h. No log :-( [21:21:45] damn [21:22:03] ah [21:22:14] analytics folks show up in `last` [21:22:26] so maybe fill a ticket about it for Analytics [21:22:30] and hold reboot ? [21:22:31] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [21:22:45] hashar: I can apply the (trivial) fix and reboot manually. [21:23:31] yeah that would work [21:23:42] I guess when puppet is enabled again that would be a noop for that fix [21:23:49] filling the task meanwhile [21:25:26] hashar: It will; it's just a file addition. [21:25:35] sounds sane so :) [21:27:18] thcipriani: All rebooted except for deployment-salt. [21:28:06] all the puppet fails from above seem to have self-corrected. 
[21:28:16] go ahead and kick salt [21:31:44] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1229393 (10hashar) 3NEW [21:31:44] thcipriani: I'm all done, and the patch applied neatly. [21:32:46] thcipriani: Everything looking okay on your end too? [21:33:06] Coren: yup, everything looks ok for now, sorry, digging still [21:33:31] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1229402 (10hashar) I have set a message pointing to this task by using: puppet agent --enable; puppet agent --disable 'https://phabric... [21:33:40] Unable to read /srv/mediawiki-staging/php-1.26wmf3/extensions/CiteThisPage/CiteThisPage.php [21:33:46] thcipriani: bits.beta 500s though [21:34:08] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [21:34:35] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [21:34:40] as does mediawiki01 [21:34:41] twentyafterfour: That's not mine - the only paths that can possibly affected are on NFS: /home and /data/project (well, also /data/scratch in theory) [21:35:55] twentyafterfour: sometime files get lost on the staging area :/ [21:36:19] I don't think it got lost [21:36:22] twentyafterfour: though that should use /php-master/ [21:36:56] or was your CiteThisPage issue on prod? [21:37:29] mariadb didn't restart seemingly [21:38:32] mediawiki02 apache2 : (116)Stale file handle: AH00646: Error writing to /data/project/logs/apache-access.log [21:38:35] bah [21:38:48] that comes from https://logstash-beta.wmflabs.org/ [21:39:18] hashar: What's the instance behind that? [21:39:25] deployment-mediawiki02 [21:39:46] Hah! That one didn't reboot? [21:39:53] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28267 bytes in 5.753 second response time [21:40:04] Wait, it wasn't in my list. [21:40:13] /data/project/logs/ is owned by udp2log:udp2log [21:40:20] so that error probaqbly existed before [21:40:39] !log restarted mariadb on deployment-db{1,2} [21:40:42] Logged the message, Master [21:40:46] er, started, I gues [21:40:56] CiteThisPage.php exists and is readable ...weird [21:41:12] hashar: It wasn't in my list because it wasn't precise. :-) [21:41:14] yeah mediawiki02 is 14.04 [21:41:23] ah [21:41:27] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:41:41] * thcipriani looks at parsoid05 [21:41:46] be back in 5, bio [21:43:32] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.460 second response time [21:44:11] ^ parsoid05 looks ok, too [21:45:19] 10Continuous-Integration, 10MediaWiki-extensions-Scribunto, 7I18n: mwext-testextension-zend fails when changing namespace aliases in Scribunto - https://phabricator.wikimedia.org/T96912#1229456 (10Anomie) The problem seems to be that occasionally a timeout in LuaStandalone is reported as a read failure rathe... 
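hashar's `puppet agent --enable; puppet agent --disable '<task URL>'` trick above leaves a breadcrumb explaining why the agent is off. Assuming the Puppet 3.x agents in use at the time, checking that breadcrumb and the last-run time might look roughly like this.

```
# Disable the agent with a reason pointing at the task
sudo puppet agent --disable 'https://phabricator.wikimedia.org/T96921'

# The reason is stored in the agent disable lockfile (JSON on puppet 3.x)
lock=$(sudo puppet config print agent_disabled_lockfile)
sudo cat "$lock"

# When did the agent last actually run?
sudo stat -c '%y %n' "$(sudo puppet config print lastrunfile)"

sudo puppet agent --enable && sudo puppet agent --test
```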
[21:45:48] 10Continuous-Integration, 10MediaWiki-extensions-Scribunto, 7I18n: LuaStandalone timeout is sometimes reported as read error - https://phabricator.wikimedia.org/T96912#1229457 (10Anomie) [21:49:10] memc04 looks fine, even though shinken is upset about its puppet run [21:55:52] deployment-cache-text02 has some problem with the ssl key... [21:56:31] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:59:00] which, maybe, has been happening on all the cache servers for a week...? [21:59:07] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:59:33] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [22:01:29] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [22:04:11] Stupid dog seems to have an instinct "If I'm sick, I have to go to the most expensive fabric thing around first." [22:05:02] thcipriani: Are you ready to deliver a verdict? [22:05:34] yeah, everything is back to normal now. Normal being the same problems we had before the reboot :\ [22:05:47] I'll send out the email [22:06:07] re: dog instincts http://www.sheldrake.org/books-by-rupert-sheldrake/dogs-that-know-when-their-owners-are-coming-home [22:07:52] 10Beta-Cluster: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1229530 (10Jdlrobson) [22:07:55] hashar: Coren: thanks for your assistance :) [22:13:38] thcipriani: congratulations! [22:14:12] hashar.sleep() [22:14:26] have a good night hashar [22:14:31] thanks again [22:18:57] Have fun guys. Surface any oddities with NFS to me, but I shouldn't expect any. [22:18:58] o/ [22:19:46] legoktm: Ah, I guess you were missing the doc_subpath because the job name didn;t end in _publish [22:19:49] https://github.com/wikimedia/integration-config/commit/ac8bdf3a78995e434d089d1a48ceea0638328c0f [22:26:40] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1229604 (10chasemp) @KLans_WMF and @Awjrichards could you guys weigh in on where you want to go from here? I'm not sure if https://phabricator.wikimedia.org/T95469#1223742 ex... [22:29:32] twentyafterfour: whatever was happening with CiteThisPage during your deploy today may also be the cause of https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ saddness [22:30:07] Unable to read /mnt/srv/mediawiki-staging/php-master/extensions/CiteThisPage/CiteThisPage.php on mergemessagefilelist [22:30:27] thcipriani: yes [22:30:27] twentyafterfour: fyi hoping we can push this along https://phabricator.wikimedia.org/T95469#1229604 [22:30:31] that is exactly the cause [22:31:35] fix is https://gerrit.wikimedia.org/r/#/c/205988/ [22:31:53] awaiting +2 though I am about to submit a fix that avoids the issue for all extensions [22:32:50] cool beans. 
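The scap failure above comes down to a single unreadable file under mediawiki-staging. A sketch of confirming that from the shell; the log doesn't say which user the job runs as, so mwdeploy below is a guess, not the confirmed account.

```
f=/srv/mediawiki-staging/php-master/extensions/CiteThisPage/CiteThisPage.php
ls -l "$f"
namei -lo "$f"                       # show owner/perms on every path component
sudo -u mwdeploy -- test -r "$f" && echo readable || echo unreadable
```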
[22:38:00] https://gerrit.wikimedia.org/r/#/c/205999/ <-- thcipriani [22:49:59] 10Deployment-Systems, 6Community-Liaison, 6Multimedia: New Feature Notification - https://phabricator.wikimedia.org/T77347#827765 (10Quiddity) [22:50:54] twentyafterfour: hah I wrote the same patch [22:55:56] twentyafterfour: I can +2 that CiteThisPage patch if you need it [22:56:37] well I guess we can abandon it if the other change fixes the same problem [22:57:03] I already self-reviewed the patch for the release branch so I could get on with deploying [22:57:21] I just cherry picked it to master so I wouldn't have the same problem next time [22:59:03] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce build #29: FAILURE in 2.4 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce/29/ [23:21:05] Yippee, build fixed! [23:21:05] Project beta-update-databases-eqiad build #9105: FIXED in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9105/ [23:22:51] imma gonna mess with logstash in beta cluster a bit. testing changes for prod [23:24:36] twentyafterfour: Do you not join #wikimedia-dev as a general rule or are you just not there today? [23:25:37] bd808: I don't know, there are a lot of channels ;) [23:25:46] true dat [23:26:26] I thought -dev was just a lot of botspam [23:26:54] there is a lot of botspam but good discussion too. generally of "review this plz" nature [23:27:21] It's the wikitech-l of irc IMO [23:28:34] honestly the level of bot spam we have on our channels makes them almost unbearable for me. i wish we had bot channels and chat channels separated completely [23:28:58] !log deployment-salt:/var/lib/git/operations/puppet in detached HEAD state; looks to be for cherry pick of I46e422825af2cf6f972b64e6d50040220ab08995 ? [23:29:01] Logged the message, Master [23:29:37] I have my client tweaked out to make bots smaller and lighter font. Makes it easier to see the real people [23:30:11] https://github.com/bd808/Textual-Theme-bd808/blob/master/src/scripts/mute-senders.coffee [23:33:14] !log reset deployment-salt:/var/lib/git/operations/puppet HEAD to production; forced update with upstream; re-cherry-picked I46e422825af2cf6f972b64e6d50040220ab08995 [23:33:17] Logged the message, Master [23:33:28] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 3.048 second response time [23:35:50] bd808: that's pretty nice, I was actually working on something like that for glowingbear .. [23:36:05] I'm on linux so textual isn't an option [23:37:33] there are a lot of things not to love about textual, but the ui being safari and easy to tweak with js is pretty nice [23:39:12] gwicke: were all the "Set up /api/v1/ entry point for restbase" puppet things you? [23:40:19] bd808: bblack & I, yes [23:40:34] do you mean the notifications? [23:40:48] the puppet.git activity [23:41:05] It was in a detached head state. I fixed that [23:41:17] the last cherry-pick is back on there though [23:41:41] ah, on deployment-salt [23:41:56] we are just getting ready to deploy that to prod [23:42:13] so can drop it in labs [23:42:40] once you are done with yours [23:43:35] it will clean up automagically probably. I'll make sure it doesn't get stuck [23:43:58] k, thx! [23:44:09] could also do a rebase -i otherwise [23:47:22] how I make merge job that didn't make it through to re-initiate the merge? 
[23:47:38] specifically https://gerrit.wikimedia.org/r/#/c/203837/ [23:48:18] SMalyshev: re-review with a 0 then +2 again [23:48:29] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47221 bytes in 0.574 second response time [23:49:07] bd808: aha, thanks, that seems to wake it up [23:56:03] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce build #29: FAILURE in 2.5 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce/29/ [23:57:32] !log cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205968 (remove redis from logstash) [23:57:35] Logged the message, Master
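bd808's advice above, re-reviewing with a 0 and then a +2 to wake the gate up again, can also be done over the Gerrit SSH CLI instead of the web UI. Change 203837 is the one from the log; the patchset number and username are placeholders.

```
change=203837
ps=1          # placeholder patchset number
ssh -p 29418 USER@gerrit.wikimedia.org gerrit review --code-review=0  "$change,$ps"
ssh -p 29418 USER@gerrit.wikimedia.org gerrit review --code-review=+2 "$change,$ps"
```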