[05:01:30] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:06:22] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 46476 bytes in 1.740 second response time
[05:36:57] <wikibugs>	 10Gerrit, 06WMF-Legal, 07Privacy: Using Gerrit/git requires the email registered via wikitech and ends ups being voluntary disclosed (break of privacy?) - https://phabricator.wikimedia.org/T151529#3149746 (10Peachey88)
[06:14:05] <shinken-wm>	 PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:14:53] <shinken-wm>	 PROBLEM - Puppet run on saucelabs-03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:40:23] <wmf-insecte>	 Project selenium-Wikibase » chrome,beta,Linux,BrowserTests build #319: 04FAILURE in 2 hr 0 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/319/
[06:49:03] <shinken-wm>	 RECOVERY - Puppet run on integration-slave-trusty-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:49:52] <shinken-wm>	 RECOVERY - Puppet run on saucelabs-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:56:50] <hoo>	 hashar: re https://phabricator.wikimedia.org/T125050
[07:57:03] <hoo>	 Any chance to get this worked around in some way?
[08:12:49] <elukey>	 hashar: hello! After https://gerrit.wikimedia.org/r/#/c/345810 deployment-jobrunner02 seems broken
[08:13:30] <wikibugs_>	 10Beta-Cluster-Infrastructure: Puppet at deployment-tin is not running - https://phabricator.wikimedia.org/T162016#3149883 (10Luke081515)
[08:13:33] <elukey>	 afaics it is a bit unclear if the deployment-prep config is best to be put in Hiera puppet or https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep
[08:13:37] <wikibugs>	 10Beta-Cluster-Infrastructure: Puppet at deployment-tin is not running - https://phabricator.wikimedia.org/T162016#3149883 (10Luke081515) p:05Triage>03High
[08:14:19] <elukey>	 theoretically mediawiki_session_redis_servers should become "sessions" under "redis::shards"
[08:15:40] <elukey>	 (also "redis::shards" under deployment-prep/common is not up to date with https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep)
[08:15:51] <elukey>	 not sure how we want to amend this inconsistency
[08:15:57] <elukey>	 will wait for your expert opinion :)
[08:25:46] <wikibugs_>	 10Browser-Tests-Infrastructure, 07Documentation, 07Easy: audit/update headers in files - https://phabricator.wikimedia.org/T69141#713980 (10zeljkofilipin) This is a rough list of files that has string `qa-browsertests`  https://github.com/search?q=org%3Awikimedia+qa-browsertests&type=Code  File headers shoul...
[08:29:02] <hashar>	 elukey: Giuseppe filed a task about how beta+hiera is a mess
[08:29:39] <hashar>	 elukey: there are too many sources: wikitech Hiera namespace, Horizon,   /hieradata/labs/deployment-prep in puppet.git and the cherry picks 
[08:29:39] <hashar>	 etc
[08:29:53] <hashar>	 there are probably a dozen places where one can mess with hiera settings :(
[08:30:20] <hashar>	 the stance is more or less that:
[08:30:47] <hashar>	 - some dislike horizon because it lacks history of actions
[08:31:02] <hashar>	 - puppet.git lacks access to non opsen
[08:31:23] <hashar>	 - cherry picks are not easy to find out (have to look on the puppet master)
[08:31:32] <hashar>	 - people end up using wikitech :D
[08:31:35] <hashar>	 but yeah that is all inconsistent
[08:32:14] <hashar>	 hoo: re Scribunto in gate, beside the short analysis I did last week, no I am not looking into it :\
[08:32:49] <hoo>	 hashar: I see :/ Could we try switching away from LuaSandbox for now or something?
[08:33:01] <hoo>	 This makes all of our gate and submits fail
[08:33:05] <hoo>	 which is super annoying
[08:33:09] <elukey>	 hashar: any preferred way to fix "redis::shards" ?
[08:34:22] <hashar>	 elukey: looks like it is currently on https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep  so probably easier to fix it there?
[08:34:59] <elukey>	 hashar: looks like it.. Mind if I add redis::shards sessions in there?
[08:35:32] <hashar>	 elukey: that is probably the easiest for now :-}
[08:36:27] <hashar>	 hoo: has anyone at least tried to reproduce and track the memory usage?
[08:50:11] <hoo>	 hashar: Not as far as I know, I don't have Lua Sandbox myself
[08:54:09] <shinken-wm>	 RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:55:29] <shinken-wm>	 RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:56:45] <hashar_>	 elukey: looks good now? ^^
[08:57:12] <elukey>	 yep! :)
[08:57:44] <elukey>	 hashar: now second question :)
[08:58:24] <elukey>	 I'd like to live-hack jobrunner02 (/usr/src/hhvm/hphp/system/php/redis/Redis.php) to remove the QUIT command 
[08:58:37] <elukey>	 and see if the RST go away
[09:00:50] <hashar>	 elukey: well I guess that Redis.php class is compiled/embedded in hhvm itself
[09:01:00] <shinken-wm>	 RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[09:01:16] <shinken-wm>	 RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:01:35] <elukey>	 hashar: what I want to try is to change the file and then restart hhvm
[09:01:43] <hashar>	 potentially one can hack the jobrunner PHP service
[09:01:46] <elukey>	 theoretically that php file should end up in the jit cache
[09:01:51] <hashar>	 copy paste the HHVM Redis class to something like:  RedisDebug
[09:02:03] <hashar>	 and hack the src/RedisJobService.php to  do :   new RedisDebug();
[09:02:32] <hashar>	 or maybe we can extend the built-in Redis
[09:03:03] <elukey>	 hashar: do you think that /usr/src/hhvm/hphp/system/php/redis/Redis.php is not re-evaluated when hhvm starts?
[09:03:51] <hashar>	 let me try a monkey patch
[09:06:57] <hashar>	 elukey: what is the bug # already?
[09:07:21] <elukey>	 https://github.com/facebook/hhvm/issues/7757
[09:07:35] <elukey>	 ah you mean the phab task
[09:07:36] <elukey>	 ??
[09:07:52] <elukey>	 T125735
[09:07:53] <stashbot>	 T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735
[09:08:29] <shinken-wm>	 RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[09:09:52] <hashar>	 elukey: https://gerrit.wikimedia.org/r/#/c/346117/1/src/RedisJobService.php
[09:10:01] <hashar>	 is the absolutely most horrible code I can come up with
[09:10:19] <hashar>	 that switches the jobrunner service to use a new class RedisMonkeyPatched instead of Redis
[09:10:21] <shinken-wm>	 RECOVERY - Puppet run on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:10:25] <hashar>	 which implements a different close() method
[09:10:43] <hashar>	 so maybe we can cherry pick that on jobrunner02 under /srv/deployment/mediawiki/xxxxx/something/yyy
[09:10:46] <hashar>	 restart the jobrunner service
[09:10:49] <hashar>	 and see what happens
[09:10:58] <hashar>	 (my guess is:  my code will fatal out somehow)
[09:11:55] <hashar>	 zeljkof: ^^^ sorry been messing up with hhvm/redis etc :}
[09:12:55] <zeljkof>	 hashar: joining the hangout? or want to skip today?
[09:15:00] <elukey>	 hashar: I'll follow your lead, but didn't want to waste a lot of your time, just do a quick test :(
[09:15:53] <hashar>	 elukey: let me cherry pick the patch and restart the service
[09:16:17] <elukey>	 at some point I'll have to ship beers to you hashar 
[09:16:52] <hashar>	 I avoid real life events nowadays
[09:17:00] <hashar>	 I get flooded by beers the minute I arrive at the venue
[09:17:33] <hashar>	 !log deployment-jobrunner02 : cherry picked a monkey patch for Redis::close() to prevent it from sending QUIT command ( https://gerrit.wikimedia.org/r/#/c/346117/ ) - T125735
[09:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[09:17:37] <stashbot>	 T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735
[09:18:07] <hashar>	 the jobrunner is a mess really
[09:18:33] <paladox>	 hashar hi, (not important can be read any time) But upstream have a patch for using the soy template https://gerrit-review.googlesource.com/#/c/100962/ i've tested it with my follow up https://gerrit-review.googlesource.com/#/c/101733/
[09:18:45] <paladox>	 no configuration or changes needed for us either too :)
[09:18:46] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3150082 (10WMDE-leszek) The problem reported in T161698 blocks merging any of Wikibase patches, so this is becoming more pressing for Wikidat...
[09:18:55] <paladox>	 it uses the canonical config
[09:19:24] <hashar>	 elukey: so assuming some jobs are being run, the jobrunner should no longer send QUIT to redis
[09:19:33] * elukey runs tcpdump
[09:19:55] <paladox>	 also a lot of users (by lots I mean a few) agree we should backport it to stable-2.14
[09:20:05] <paladox>	 google employees too
[09:20:36] <hashar>	 paladox: the Soy template is to make polygerrit easier to hack ?
[09:20:43] <paladox>	 Yes
[09:20:47] <paladox>	 Well i think it is
[09:21:21] <paladox>	 This is on the issue page "Rather than serving the static file, serve index.html via a Soy template."
[09:21:24] <hashar>	 paladox: we can still cherry pick the patch when building our gerrit 2.14  package
[09:22:22] <shinken-wm>	 RECOVERY - Puppet run on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:22:33] <paladox>	 One problem though https://gerrit-review.googlesource.com/#/c/100962/ doesn't hook it up to any of the urls inside polygerrit's js yet. It only implements it so it can be done in a follow up. He is doing it step by step. But they are going to fix it on stable-2.14.
[09:22:50] <paladox>	 But with this patch https://gerrit-review.googlesource.com/#/c/101733/ that hooks it up.
[09:24:18] <elukey>	 hashar: I am running tcpdump -n -v 'tcp[tcpflags] & (tcp-rst) != 0 and (host 10.68.16.177 or host 10.68.16.231)' and I can definitely see a lot less RST going
[09:24:34] <elukey>	 (a lot less is horrible, just realized it)
[09:24:44] <hashar>	 a lot less hmm
[09:24:45] <hashar>	 so less ?
[09:24:47] <hashar>	 ;-}
[09:24:51] <elukey>	 hahaha yes 
[09:25:01] <paladox>	 hashar i got feedback on https://github.com/jenkinsci/trilead-ssh2/pull/9#issuecomment-285813590 :)
[09:25:15] <elukey>	 but still see them, so I am wondering if we are indeed not sending the quit
[09:25:28] <elukey>	 I'll try to check capturing some traffic
[09:29:49] <hashar>	 elukey: there is another service running
[09:29:52] <hashar>	 the ChronRunner or something
[09:30:17] <hashar>	 and the services hit  MediaWiki locally on   /rpc.php or something
[09:30:24] <elukey>	 jobchron?
[09:30:30] <hashar>	 yeah jobchron
[09:30:33] <hashar>	 and MediaWiki itself might well invoke redis for other things
[09:30:54] <hashar>	 class RedisJobChronService extends RedisJobService
[09:31:21] <hashar>	 so now the jobchron should use the same hack
[09:31:54] <hashar>	 elukey: I have restarted jobchron as well :D
[09:32:13] <hashar>	 so it should have the same code running now
[09:32:36] <elukey>	 hashar: confirmed that the RST on jobrunner02 were due to some QUIT commands
[09:32:47] <hashar>	 \O/
[09:34:06] <elukey>	 well  that means that we are still sending the QUITs :P
[09:34:25] <elukey>	 checking again
[09:36:43] <elukey>	 not seeing any RST :)
[09:42:41] <elukey>	 too soon, I can see the RST
[09:42:56] <hashar>	 that can be something else
[09:43:24] <elukey>	 I am re-running tcpdump to check traffic :)
[09:44:31] <hashar>	 looks like jobrunner uses deployment-redis01  10.68.16.177
[09:44:50] <hashar>	 maybe the RST are for the other instance  redis02 10.68.16.231
[09:47:34] <hashar>	 or they are RST from some other service
[09:48:50] <elukey>	 nono I can see RST from jobrunner to 10.68.16.177
[09:48:57] <elukey>	 and QUIT commands in wireshark
[09:49:08] <elukey>	 (the follow TCP stream option is life saving)
[09:50:46] <elukey>	 let's try to restart hhvm (doing it now)
[09:52:58] <hashar>	 :(
[09:56:27] <hashar>	 they have different caches though
[09:56:43] <hashar>	 jobrunner / jobchron are run from command line, I would expect the cache file to be /var/cache/hhvm/cli.hhbc.sq3
[09:56:56] <hashar>	 while the hhvm service would be  the fcgi.hhbc.sq3 file
[09:57:09] <hashar>	 we can always nuke them and restart all services
[09:57:58] <elukey>	 yep.. I brutally commented one line in /usr/src/hhvm/hphp/system/php/redis/Redis.php and restarted hhvm to see if it works
[10:00:25] <hashar>	 oh
[10:00:43] <hashar>	 I don't think that is used
[10:03:02] <elukey>	 yeah unsuccessful attempt
[10:03:07] <elukey>	 :)
[10:03:08] <wikibugs>	 10Browser-Tests-Infrastructure, 10MediaWiki-General-or-Unknown, 07JavaScript, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 4 others: Port Selenium tests from Ruby to Node.js - https://phabricator.wikimedia.org/T139740#3150118 (10zeljkofilipin)
[10:03:23] <elukey>	 hashar: next step brutally clear cache?
[10:03:39] <hashar>	 yeah
[10:03:48] <hashar>	 and restart everything :D
[10:03:57] <hashar>	 then the RST sent could be anything
[10:07:42] <wikibugs_>	 10Beta-Cluster-Infrastructure: Puppet at deployment-tin is not running - https://phabricator.wikimedia.org/T162016#3150123 (10hashar) 05Open>03Resolved a:03elukey Catalog failed with: ``` Error: Could not retrieve catalog from remote server: Error 400 on SERVER: redis_servers is not a hash or array when ac...
[10:23:47] <elukey>	 hashar: can still see the RSTs and the QUIT commands via wireshark/tcpdump
[10:24:07] <hashar>	 still to redis01 ?
[10:24:13] <hashar>	 maybe that is some other things
[10:24:24] <hashar>	 or my patch is plain wrong :-}
[10:26:06] <elukey>	 yeah..
[10:29:35] <hashar>	 elukey: on jobrunner02 I don't see RST packets via tshark -n -f 'host 10.68.16.177'
[10:30:42] <elukey>	 it takes a bit before getting them 
[10:32:13] <elukey>	 I am using sudo tcpdump -n -v 'tcp[tcpflags] & (tcp-rst) != 0 and (host 10.68.16.177 or host 10.68.16.231)'
[10:33:33] <elukey>	 really rare though
[10:34:09] <elukey>	 why am I seeing the QUIT command in the tcpdump logs though?
[10:34:15] <elukey>	 grrr difficult mondays
[10:34:37] <elukey>	 10.68.19.42.51200 > 10.68.16.177.6379: Flags [R], cksum 0xa30b (correct), seq 1225108097, win 0, length 0
[10:34:40] <elukey>	 this is an example
[10:38:27] <hashar>	 at least there are no more QUIT sent
[10:39:00] <hashar>	 though I don't know whether they are reported by redis-cli MONITOR
[10:40:13] <hashar>	 elukey: most of the RST spam is gone isn't it ?
[10:40:43] <hashar>	 what would be ideal is to capture the whole TCP sequence that eventually contains a RST
[10:40:52] <hashar>	 so we can analyze what redis command got emitted in that session
[10:41:24] <elukey>	 yes this is what I have been doing, using "Follow tcp stream" in wireshark to check the data exchanged by jobrunner and redis..
[10:41:30] <elukey>	 each time finding QUIT
[10:43:48] <hashar>	 ahhh
[10:43:53] <elukey>	 but now I checked random packets (following their stream) and QUIT is not there
[10:44:02] <elukey>	 so there might be a QUIT hiding somewhere :D
[10:44:50] <hashar>	 I haven't audited the whole code
[11:07:07] <paladox>	 hasharLaunch https://github.com/jenkinsci/trilead-ssh2/pull/13/files
[11:07:15] <paladox>	 support for edsa keys in trilead-ssh2
[11:07:16] <paladox>	 :)
[11:08:05] <shinken-wm>	 PROBLEM - Puppet run on integration-c1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[11:37:31] <wikibugs>	 10Continuous-Integration-Infrastructure, 07Jenkins, 07Upstream, 07WorkType-NewFunctionality: Jenkins trilead-ssh2 doesn't support our MAC/KEX algorithms - https://phabricator.wikimedia.org/T103351#3150299 (10Paladox) @hashar even better see this https://github.com/jenkinsci/trilead-ssh2/pull/13 and https:/...
[12:13:52] <wikibugs_>	 (03PS3) 10Hashar: Move fundraising dash from Node 0.10 to Node 6 [integration/config] - 10https://gerrit.wikimedia.org/r/345571 (https://phabricator.wikimedia.org/T99869)
[12:16:36] <wikibugs_>	 (03CR) 10Hashar: [C: 032] Drop node-0.10 support [integration/config] - 10https://gerrit.wikimedia.org/r/345577 (https://phabricator.wikimedia.org/T161884) (owner: 10Hashar)
[12:16:51] <wikibugs_>	 (03CR) 10Hashar: [C: 032] Move fundraising dash from Node 0.10 to Node 6 [integration/config] - 10https://gerrit.wikimedia.org/r/345571 (https://phabricator.wikimedia.org/T99869) (owner: 10Hashar)
[12:17:02] <wikibugs>	 (03PS3) 10Hashar: Drop node-0.10 support [integration/config] - 10https://gerrit.wikimedia.org/r/345577 (https://phabricator.wikimedia.org/T161884)
[12:17:09] <wikibugs_>	 (03CR) 10Hashar: Drop node-0.10 support [integration/config] - 10https://gerrit.wikimedia.org/r/345577 (https://phabricator.wikimedia.org/T161884) (owner: 10Hashar)
[12:18:20] <wikibugs>	 (03Merged) 10jenkins-bot: Move fundraising dash from Node 0.10 to Node 6 [integration/config] - 10https://gerrit.wikimedia.org/r/345571 (https://phabricator.wikimedia.org/T99869) (owner: 10Hashar)
[12:19:35] <wikibugs>	 (03CR) 10Hashar: [C: 032] Drop node-0.10 support [integration/config] - 10https://gerrit.wikimedia.org/r/345577 (https://phabricator.wikimedia.org/T161884) (owner: 10Hashar)
[12:20:26] <wikibugs_>	 10Continuous-Integration-Infrastructure (phase-out-trusty), 13Patch-For-Review: Migrate NodeJS Nodepool jobs from Trusty to Jessie - https://phabricator.wikimedia.org/T161884#3150434 (10hashar) 05Open>03Resolved a:03hashar
[12:21:03] <wikibugs>	 (03Merged) 10jenkins-bot: Drop node-0.10 support [integration/config] - 10https://gerrit.wikimedia.org/r/345577 (https://phabricator.wikimedia.org/T161884) (owner: 10Hashar)
[12:29:24] <wikibugs_>	 (03PS1) 10Hashar: Drop skins/chameleon moved to github [integration/config] - 10https://gerrit.wikimedia.org/r/346137
[12:33:11] <wikibugs>	 (03CR) 10Hashar: [C: 032] Drop skins/chameleon moved to github [integration/config] - 10https://gerrit.wikimedia.org/r/346137 (owner: 10Hashar)
[12:34:39] <wikibugs_>	 (03Merged) 10jenkins-bot: Drop skins/chameleon moved to github [integration/config] - 10https://gerrit.wikimedia.org/r/346137 (owner: 10Hashar)
[12:46:04] <wikibugs_>	 10Continuous-Integration-Config, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint: CI tests on wikimedia/portals repo: cache node_modules to save time - https://phabricator.wikimedia.org/T152386#3150486 (10hashar)
[12:46:06] <wikibugs>	 10Continuous-Integration-Config: Castor: mediawiki-core-qunit-jessie node_modules cache ineffective - https://phabricator.wikimedia.org/T159591#3150485 (10hashar)
[12:47:51] <wikibugs_>	 10Continuous-Integration-Config: Castor: mediawiki-core-qunit-jessie node_modules cache ineffective - https://phabricator.wikimedia.org/T159591#3072606 (10hashar) I found a good candidate: wikimedia/portals @Jdrewniak filled T152386 to get node_modules cached.  So I guess we can try to switch it to `npm prune &&...
[12:49:33] <wikibugs_>	 10Continuous-Integration-Config, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint: CI tests on wikimedia/portals repo: cache node_modules to save time - https://phabricator.wikimedia.org/T152386#3150492 (10hashar) That came around on T159591 which is more generic. In short instead of doing: ```...
[12:53:23] <wikibugs_>	 (03PS1) 10Hashar: jjb: rm unused {name}-npm-run-{script} [integration/config] - 10https://gerrit.wikimedia.org/r/346142
[12:57:54] <wikibugs>	 (03CR) 10Hashar: [C: 032] "Noop :}" [integration/config] - 10https://gerrit.wikimedia.org/r/346142 (owner: 10Hashar)
[12:59:07] <wikibugs>	 (03Merged) 10jenkins-bot: jjb: rm unused {name}-npm-run-{script} [integration/config] - 10https://gerrit.wikimedia.org/r/346142 (owner: 10Hashar)
[13:38:57] <paladox>	 hashar woohoo i fixed almost all cases of fixing the links in https://gerrit-review.googlesource.com/#/c/101733/ :) tested all links locally and found none breaking. I even fixed the footer link :)
[13:41:47] <hashar>	 paladox: \O/
[13:41:58] <paladox>	 Im deploying it to gerrit-new now :)
[13:44:33] <Zppix>	 is it sad that i had to quote word for word wmf mission statement to prove a point xD
[13:44:50] <Zppix>	 anyway  hows jenkins holding up with precise being out of the picture
[13:45:07] <shinken-wm>	 PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[13:49:10] <shinken-wm>	 PROBLEM - Puppet run on buildlog is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[13:50:32] <wikibugs_>	 06Release-Engineering-Team, 06Operations, 06Services, 05Goal, 07kubernetes: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3150656 (10akosiaris)
[14:01:53] <wikibugs_>	 (03PS1) 10Hashar: Cache node_modules [integration/config] - 10https://gerrit.wikimedia.org/r/346152 (https://phabricator.wikimedia.org/T159591)
[14:03:40] <wikibugs_>	 (03PS2) 10Hashar: Cache node_modules [integration/config] - 10https://gerrit.wikimedia.org/r/346152 (https://phabricator.wikimedia.org/T159591)
[14:05:15] <wikibugs>	 (03PS3) 10Hashar: Cache node_modules [integration/config] - 10https://gerrit.wikimedia.org/r/346152 (https://phabricator.wikimedia.org/T159591)
[14:08:41] <wikibugs>	 (03CR) 10Hashar: "We can probably give this a try by:" [integration/config] - 10https://gerrit.wikimedia.org/r/346152 (https://phabricator.wikimedia.org/T159591) (owner: 10Hashar)
[14:12:33] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3150790 (10hoo) p:05Normal>03High
[14:14:04] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3150798 (10Paladox) We should try and make scribuntu more performant i.e. performance wise. Or at least skip the tests that cause the tests t...
[14:20:05] <shinken-wm>	 RECOVERY - Puppet run on integration-slave-trusty-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:22:41] <wikibugs_>	 06Release-Engineering-Team, 06Operations, 05Goal, 06Services (designing), and 2 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3150814 (10mobrovac)
[14:27:50] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3150819 (10hoo) >>! In T125050#3150798, @Paladox wrote: > We should try and make scribuntu more performant i.e. performance wise. Or at least...
[14:39:46] <wikibugs>	 10Continuous-Integration-Infrastructure (phase-out-trusty): Migrate PHP5.5 jobs from Trusty to Jessie - https://phabricator.wikimedia.org/T161882#3150861 (10hashar)
[14:39:48] <wikibugs_>	 10Continuous-Integration-Config, 13Patch-For-Review: Combine composer-php55 and composer-hhvm jobs - https://phabricator.wikimedia.org/T142457#3150862 (10hashar)
[14:39:50] <wikibugs>	 10Continuous-Integration-Infrastructure (phase-out-trusty): Install PHP5.5 on jessie CI instances - https://phabricator.wikimedia.org/T144959#3150859 (10hashar) 05Open>03declined Unsurprisingly sury.org no more provides PHP 5.5 packages since it has reached end of life. https://www.patreon.com/posts/php-5-5-...
[14:42:06] <wikibugs_>	 10Continuous-Integration-Infrastructure (phase-out-trusty): Migrate PHP5.5 jobs from Trusty to Jessie - https://phabricator.wikimedia.org/T161882#3150866 (10hashar) sury.org no more provides PHP 5.5 packages since it has reached end of life and I closed the related sub task T144959.  I am tempted to phase out PH...
[15:00:50] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3150901 (10Anomie) >>! In T125050#3150798, @Paladox wrote: > We should try and make scribuntu more performant i.e. performance wise. Or at le...
[16:17:40] <paladox>	 RainbowSprinkles and mutante great news, upstream have created https://gerrit-review.googlesource.com/#/c/100962/ (using soy to create index.html for polygerrit) It only implements the backend. A follow up will fix the links. Anyways it should land on stable-2.14 hopefully soon as they agree it should be on there. I did a follow up here https://gerrit-review.googlesource.com/#/c/101733/ for me to test and it indeed works. I fixed all the
[16:17:40] <paladox>	  links i see no breakages. The footer link works in gwt and polygerrit now (for prefixed urls).
[16:18:23] <paladox>	 no rewrites needed. and no configuration changes needed too. Uses the canonical config in gerrit which we have already set.
[16:29:35] <wikibugs>	 10Browser-Tests-Infrastructure, 10MediaWiki-General-or-Unknown, 07JavaScript, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 4 others: Port Selenium tests from Ruby to Node.js - https://phabricator.wikimedia.org/T139740#3151313 (10zeljkofilipin)
[18:23:33] <wikibugs_>	 10Continuous-Integration-Config: Raise priority for operations-mw-config-composer-hhvm-jessie from the gate-and-submit pipeline - https://phabricator.wikimedia.org/T162076#3151697 (10Dereckson)
[18:33:07] <wikibugs_>	 10Continuous-Integration-Config: Raise priority for operations-mw-config-composer-hhvm-jessie from the gate-and-submit pipeline - https://phabricator.wikimedia.org/T162076#3151697 (10Paladox) It is already higher priority    - name: operations/mediawiki-config     check:       - operations-mw-config-php55lint...
[18:53:24] <shinken-wm>	 PROBLEM - Puppet run on integration-slave-docker-1000 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:12:35] <paladox>	 hashar the polygerrit changes were merged in master and stable-2.14 :), now all there needs to be is a follow up to hook up the implementation.
[19:21:34] <Zppix>	 paladox:  when do we start using polygerrit
[19:21:41] <Zppix>	 in prod
[19:25:39] <hashar>	 paladox: nice :)
[19:26:40] <hashar>	 Zppix: subscribe to the task https://phabricator.wikimedia.org/T156120 and eventually when we decide to plan the upgrade some activity will happen on that task
[19:27:29] <Zppix>	 is there a way to unsubscribe from things that were subscribed to months ago
[19:27:30] <Zppix>	 only
[19:33:23] <shinken-wm>	 RECOVERY - Puppet run on integration-slave-docker-1000 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:37:10] <paladox>	 yep also Zppix it will happen whenever we upgrade to gerrit 2.14 + whenever upstream finish doing the fixes.
[19:40:10] <wikibugs>	 10Continuous-Integration-Infrastructure (Little Steps Sprint): Raise priority for operations-mw-config-composer-hhvm-jessie from the gate-and-submit pipeline - https://phabricator.wikimedia.org/T162076#3152078 (10hashar)
[20:00:10] <wikibugs_>	 10Scap (Scap3-Adoption-Phase1), 10RESTBase, 06Services (doing), 15User-mobrovac: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3152129 (10mobrovac) a:03mobrovac
[20:00:25] <wikibugs>	 10Continuous-Integration-Infrastructure (Little Steps Sprint): Raise priority for operations-mw-config-composer-hhvm-jessie from the gate-and-submit pipeline - https://phabricator.wikimedia.org/T162076#3152132 (10hashar) Zuul keeps metrics for the various pipeline / project etc.  I created a graph that represent...
[20:03:28] <icinga-wm>	 PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:19:28] <icinga-wm>	 RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[20:37:02] <hashar>	 !log jenkins: disabled/reenabled gearman plugin to unlock the beta cluster related jobs
[20:37:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:39:03] <hashar>	 !log Nodepool: holding instance ci-trusty-wikimedia-597386  in an attempt to debug Wikibase/Scribunto memory usage exploding T125050
[20:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:39:06] <stashbot>	 T125050: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050
[20:43:16] <bearND>	 !log Update mobileapps to fdd4e31
[20:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:02:00] <wikibugs_>	 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3152295 (10hashar) I manually rebuild mwext-testextension-php55-composer-trusty for Wikibase.  Instructed nodepool to not delete the instance...
[21:07:03] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3152305 (10hashar) Next thing. On Jenkins we have a monitoring system named JavaMelody which can be used to inspect an instance, specially th...
[21:35:28] <paladox>	 hashar i guess we can go all the way up to eddsa for the ssh key :) https://github.com/jenkinsci/trilead-ssh2/pull/12
[21:35:31] <paladox>	 https://github.com/jenkinsci/trilead-ssh2/pull/13
[21:42:05] <paladox>	 who knew i will be calling my mail oath soon http://www.theverge.com/2017/4/3/15166872/aol-verizon-oath-announced-merger-rebranding-new-name-logo
[22:43:27] <shinken-wm>	 PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:15:53] <shinken-wm>	 PROBLEM - Puppet run on saucelabs-03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:16:06] <shinken-wm>	 PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[23:43:49] <Reedy>	 https://cloudplatform.googleblog.com/2017/03/how-release-canaries-can-save-your-bacon-CRE-life-lessons.html
[23:50:50] <shinken-wm>	 RECOVERY - Puppet run on saucelabs-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:52:53] <wikibugs_>	 (03CR) 10Krinkle: Cache node_modules (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/346152 (https://phabricator.wikimedia.org/T159591) (owner: 10Hashar)
[23:53:06] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Cache node_modules [integration/config] - 10https://gerrit.wikimedia.org/r/346152 (https://phabricator.wikimedia.org/T159591) (owner: 10Hashar)