[00:10:38] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2661034 (Krenair)
[00:22:37] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[00:36:50] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:41:14] Deployment-Systems, Scap3, Patch-For-Review: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#2661081 (thcipriani) Resolved→Open @fgiunchedi could I get you to upload scap_3.3.0-1 to carbon? The version contains fixes for {T134156} and {T145373}. (cc: @mobrovac)
[00:47:34] I got/caused deployment-mediawiki04's puppet problem; it should be fixed.
[00:54:58] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:01:53] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:02:37] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:30:25] MediaWiki-Codesniffer: Undefined index: parenthesis_closer in SpaceBeforeControlStructureBraceSniff.php - https://phabricator.wikimedia.org/T146439#2661144 (MaxSem)
[01:30:32] legoktm, ^
[01:34:17] thank you
[01:34:59] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:55:57] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[02:31:03] Continuous-Integration-Config: All code style CI tests should be run even if some fail - https://phabricator.wikimedia.org/T146445#2661267 (Tgr)
[02:35:59] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:37:18] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[04:12:18] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:15:24] PROBLEM - Puppet staleness on deployment-db1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[04:19:05] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #150: FAILURE in 23 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/150/
[04:27:03] PROBLEM - Puppet staleness on deployment-db2 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[04:44:40] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[05:24:41] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[05:45:43] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[06:03:54] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[06:43:54] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:50:45] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[07:17:00] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2661458 (jcrespo) There was a refactoring of mariadb's puppet code...
[07:18:42] MediaWiki-Codesniffer: Undefined index: parenthesis_closer in SpaceBeforeControlStructureBraceSniff.php - https://phabricator.wikimedia.org/T146439#2661462 (Paladox) Caused by https://phabricator.wikimedia.org/rMCSN42c7dca89f1ac314795802f8fd0bcc1ef62bba1b
[08:05:01] Gerrit, GitHub-Mirrors, Librarization, Project-Admins: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661511 (Paladox)
[08:12:00] Gerrit, GitHub-Mirrors, Librarization, Project-Admins: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661526 (Paladox) Oh, looks like the repo was already created at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/libs/Assert
[08:12:38] Gerrit, GitHub-Mirrors, Librarization, Project-Admins: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661527 (Legoktm) >>! In T114330#2661497, @thiemowmde wrote: > Yesterday I wasted a full hour trying to rebase open pull requests on htt...
[08:31:04] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661542 (Aklapper)
[09:04:53] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[09:44:51] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:49:31] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2661797 (hashar)
[09:50:29] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2606121 (hashar) As an after-effect, ORES jobs failed following the deployment: T146461. Which is entirely related to us having no monitoring/alarming about jobs, that is...
[09:51:00] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2661805 (Paladox)
[09:51:21] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2556665 (Paladox) Gerrit 2.12.5 has been released, so we can update to that release now.
[09:52:00] Gerrit, Upstream: Gerrit's new side-by-side diff screen sometimes cuts off the last few characters of a line - https://phabricator.wikimedia.org/T144565#2661810 (Paladox)
[09:52:03] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2661809 (Paladox)
[09:53:26] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2556665 (Paladox)
[10:01:06] Gerrit, Upstream: Gerrit's new side-by-side diff screen sometimes cuts off the last few characters of a line - https://phabricator.wikimedia.org/T144565#2661837 (Paladox) This is fixed in gerrit 2.12.5, which has been released now.
[10:10:17] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2661851 (hashar)
[10:10:32] Gerrit: Update gerrit to 2.13 - https://phabricator.wikimedia.org/T146350#2658303 (hashar)
[10:11:05] paladox: I have linked the "update gerrit to X" tasks to each other :]
[10:42:23] PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:06:54] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2661955 (Ladsgroup)
[11:22:25] RECOVERY - Puppet run on integration-slave-precise-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:24:42] Release-Engineering-Team, Operations, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2662046 (MoritzMuehlenhoff) mira is now running jessie. Please give it some more testing; for migrating tin, we could temporarily make mira the...
[12:35:53] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[12:53:15] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662100 (thiemowmde) To clarify: I can clone https://github.com/wikimedia/Assert and checkout all the branches from the open pull requests just fine. The on...
[13:15:53] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:20:42] Release-Engineering-Team, Operations, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2662117 (Krenair) >>! In T144578#2662046, @MoritzMuehlenhoff wrote: > mira is now running jessie. Please give it some more testing The next de...
[13:31:36] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662149 (Krenair) >>! In T114330#2661527, @Legoktm wrote: >> * Who does have admin access to this GitHub repo? > > I guess anyone in the wikimedia org? I'm...
[13:34:01] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662165 (Krenair) >>! In T114330#2662100, @thiemowmde wrote: > add people (preferably @daniel and myself, since we are among the main contributors) as admin...
[13:41:16] (CR) Hashar: [C: 2] Change tmpfs to /srv instead of /mnt [integration/jenkins] - https://gerrit.wikimedia.org/r/312330 (https://phabricator.wikimedia.org/T146381) (owner: Hashar)
[13:41:46] (Merged) jenkins-bot: Change tmpfs to /srv instead of /mnt [integration/jenkins] - https://gerrit.wikimedia.org/r/312330 (https://phabricator.wikimedia.org/T146381) (owner: Hashar)
[13:41:56] !log Switching tmpfs from /mnt to /srv https://gerrit.wikimedia.org/r/#/c/312330/ and running fab deploy_slave_scripts
[13:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:46:39] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #154: FAILURE in 2 min 38 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/154/
[13:56:04] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662261 (thiemowmde) @Krenair: https://github.com/thiemowmde
[13:57:20] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662262 (Krenair) Done
[14:04:51] !log remove the /mnt based tmpfs for T146381 / https://gerrit.wikimedia.org/r/#/c/312518/ via: salt -v '*' cmd.run 'umount /mnt/home/jenkins-deploy/tmpfs'
[14:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:15:22] Gerrit, Upstream: Gerrit's new side-by-side diff screen sometimes cuts off the last few characters of a line - https://phabricator.wikimedia.org/T144565#2662345 (Aklapper) p: Triage→Low
[14:56:53] Gerrit: Update gerrit to 2.13.1 - https://phabricator.wikimedia.org/T146350#2662462 (Paladox)
[15:02:15] !log rebooting integration-slave-jessie-1001
[15:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:07:06] Gerrit: Update gerrit to 2.13.1 - https://phabricator.wikimedia.org/T146350#2662494 (Paladox)
[15:16:45] Deployment-Systems, Scap3, Patch-For-Review: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#2662514 (fgiunchedi) Open→Resolved scap uploaded and deployed, resolving
[15:49:09] Project selenium-MobileFrontend » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #168: FAILURE in 27 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/168/
[16:50:39] i know it's friday ... but i'd like to deploy a patch to record some timing marks. the wmf.20 deploy increased p95 as seen from php (time spent in elasticsearch is unchanged) from 40ms to >200ms. the patch adds some timing marks to try and get an idea of where in php this time is being spent: https://gerrit.wikimedia.org/r/#/c/312529/1
[16:51:37] for autocomplete, that is. p95 means ~1M req/day are getting that degraded performance. p75 isn't quite so bad, at a regression from 11ms -> 26ms
[16:52:03] ebernhardson: I'll allow it
[16:52:16] :) thanks
[16:54:55] PROBLEM - Puppet run on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[16:55:09] PROBLEM - Puppet run on integration-slave-jessie-1005 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:12:09] Continuous-Integration-Config, RESTBase, Wikipedia-Android-App-Backlog: Kick off periodic Android CI tests when RB is updated on beta labs - https://phabricator.wikimedia.org/T146488#2662808 (Mholloway)
[17:12:33] Continuous-Integration-Config, RESTBase, Wikipedia-Android-App-Backlog: Kick off periodic Android CI tests when RESTBase is updated on beta labs - https://phabricator.wikimedia.org/T146488#2662822 (Mholloway)
[17:19:31] PROBLEM - Puppet run on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:25:32] scap sync-file is taking an amazingly long time for sync-masters; the rsync on mira is up to 2:27, and real time is much longer than that
[17:26:15] ouch. that's happened when there were ipv6 networking issues before
[17:26:40] Not sure what else would make tin->mira sync so slow
[17:27:07] deployment-mira was just reimaged to jessie today/last night
[17:27:13] ebernhardson: or is this in production?
[17:27:22] greg-g: prod, syncing out my timing patch
[17:27:30] thcipriani: ^ ideas?
[17:28:30] are there 2 procs running too?
[17:28:32] blerg. I wonder if it has to sync everything over to mira since it was just reimaged?
[17:28:40] on mira
[17:29:02] oh right, prod mira was reimaged last night
[17:29:11] ps axf on mira only shows one chain for the scap/rsync
[17:29:14] man, I'm tired
[17:29:24] ok, so it's just syncing the whole thing. will keep waiting :)
[17:30:08] RECOVERY - Puppet run on integration-slave-jessie-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:30:41] mediawiki-staging is only 700MB on mira
[17:30:58] except I can't read php-1.28.0-wmf.17
[17:31:20] 21 gigs on tin
[17:32:22] there are probably 10G of that that won't be transferred, i.e. the cdb files.
[17:32:30] this sync-file may take a while :((
[17:33:01] actually it just finished, but with somewhat scary looking output: https://phabricator.wikimedia.org/P4110
[17:34:05] doesn't look to be too important for my sync, but something is up with l10n permissions i guess
[17:34:55] RECOVERY - Puppet run on integration-slave-trusty-1013 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:34:59] hrm, looks like l10nupdate has a different uid
[17:35:08] similar/related discussion going on in -operations from mutante
[17:35:08] on mira vs tin, I think.
[17:36:11] user l10nupdate on mira: l10nupdate:x:1001:10002::/home/l10nupdate:/bin/bash
[17:37:03] on tin: l10nupdate:x:10002:10002::/home/l10nupdate:/bin/bash
[18:09:36] thcipriani: yeah prod mira got reimaged to Jessie. I acked that to moritz earlier today
[18:09:39] since beta looked fine
[18:11:13] thcipriani: mira.deployment-prep, we should dish it out, it is too small
[18:11:18] err twentyafterfour ^^^
[18:11:29] it should get migrated to deployment-mira or deployment-mira02 with a larger disk
[18:12:35] heh, wait, what? deployment-mira was too small? I thought it was based on some custom image?
[18:13:45] oh, wait, misread: mira.deployment-prep was too small. Got it.
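The uid mismatch spotted at [17:36:11]/[17:37:03] above is the kind of thing worth confirming in one pass from a host that can reach both deploy masters. A minimal shell sketch; the bare hostnames and the assumption of ssh access to both are illustrative, not taken from the log:

    # Print the l10nupdate passwd entry (name:x:uid:gid:gecos:home:shell)
    # from each deploy master, so the uid/gid columns can be compared.
    for host in tin mira; do
        printf '%s: ' "$host"
        ssh "$host" getent passwd l10nupdate
    done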
[18:18:41] ideally we should have deployment-tin / deployment-mira both on Jessie with the custom OS flavor
[18:18:50] something like c8.m8.d60
[18:19:00] the others can be dished
[18:19:08] and some puppet patches need to be adjusted in consequence
[18:20:48] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-WikibaseRepository, Wikidata, Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2641971
[18:24:21] PROBLEM - Host integration-puppetmaster is DOWN: CRITICAL - Host Unreachable (10.68.16.42)
[18:24:57] ^^ that is the old one, guess legoktm is deleting the old CI puppetmaster :]
[18:26:33] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2663057 (dduvall) LGTM. Thanks, @jcrespo!
[18:26:55] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2663062 (dduvall) Open→Resolved
[18:34:11] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2663094 (Krinkle)
[18:34:13] Release-Engineering-Team (Deployment-Blockers), Patch-For-Review, Release: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328#2663095 (Krinkle)
[19:23:37] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:28:59] Release-Engineering-Team, Developer-Relations, Wikimedia-Blog-Content: blog.wikimedia.org post on Phabricator improvements - https://phabricator.wikimedia.org/T141457#2663223 (Aklapper) So...... I created https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts/Recent_improvements_in_the_Phabricator_pro...
[19:41:50] Nodepool does fewer API queries to openstack https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=16&fullscreen&from=1474486854104&to=1474659654105&var-provider=All&var-task=CreateServerTask&var-task=DeleteKeypairTask&var-task=ListFloatingIPsTask&var-task=ListServersTask :D
[19:44:38] twentyafterfour: thcipriani: so for beta the proper instances are deployment-tin02 and deployment-mira, which have the flavor with the large disk
[19:45:07] hashar: I'm working on deployment-mira now
[19:47:13] and looks like puppet is all settled
[19:48:13] bah, and https://integration.wikimedia.org/ci/job/phabricator-jessie-diffs/ has some issue
[19:49:38] I don't think it's the job that has some issues
[19:49:51] looks like missing dependencies
[19:50:36] Error: Cannot find module 'ajv'
[19:51:52] yeah
[19:51:58] and the kafka server is never killed
[19:52:13] twentyafterfour: is that job created via JJB?
[19:53:11] apparently not :D
[19:53:21] !log added a 30 minute build timeout to https://integration.wikimedia.org/ci/job/phabricator-jessie-diffs/
[19:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:00:43] PROBLEM - Puppet run on integration-slave-jessie-1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:00:46] !log rebooting all CI permanent slaves. Making sure nothing is left on /mnt (which is no longer mounted)
[20:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:01:10] hashar: originally yes, but my jjb stuff never got merged
[20:01:27] and I modified the job a bit manually to make it work
[20:01:39] so what's the plan for -db1 and -db2?
[20:03:08] hashar: https://gerrit.wikimedia.org/r/#/c/295396/
[20:03:39] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:05:10] PROBLEM - Puppet run on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:05:44] PROBLEM - Puppet run on integration-slave-jessie-1003 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0]
[20:10:44] RECOVERY - Puppet run on integration-slave-jessie-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:10:44] RECOVERY - Puppet run on integration-slave-jessie-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:11:21] Krenair: Dan will delete them when their time has come
[20:11:29] twentyafterfour: neat :)
[20:12:04] twentyafterfour: looks like I reviewed that one back in June and never looked back :(
[20:14:18] yuvipanda: mind if I make the http port configurable on the aptly class? https://gerrit.wikimedia.org/r/#/c/312562/
[20:15:09] RECOVERY - Puppet run on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:19:29] W: GPG error: http://debian.saltstack.com jessie-saltstack InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B09E40B0F2AE6AB9
[20:19:31] oh man ..
[20:20:05] It seems some CI slaves are unable to access MySQL
[20:20:08] Two jobs failed in a row.
[20:20:20] https://gerrit.wikimedia.org/r/#/c/312467/
[20:20:35] and another two, and more.
[20:20:36] bah :(
[20:21:08] forgot to gracefully depool?
[20:21:11] !log integration: salt -v '*trusty*' cmd.run 'service mysql start'
[20:21:14] did
[20:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:21:22] but mysql does not come back properly on boot on Trusty :(
[20:21:35] due to some crazy hack
[20:21:51] it only comes back when puppet runs
[20:22:21] Krinkle: all good now. Sorry, I should have spawned mysql again
[20:24:59] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure: Build an apt repository on deployment-prep (for testing packages from jenkins) - https://phabricator.wikimedia.org/T146497#2663278 (mmodell)
[20:31:31] twentyafterfour: sure! we have a freeze until Monday tho, as ops are travelling
[20:37:35] RoanKattouw hi, gerrit 2.12.5 was released today :), I've updated https://gerrit-test.wmflabs.org/ to use the new version
[20:37:46] and it includes the fix for allowing line wrap
[20:37:49] in the preferences :)
[20:38:56] paladox: Hmm I don't see the orange thing you're talking about. Looking at http://gerrit-test.wmflabs.org/gerrit/#/c/17/49
[20:39:13] Oh, it won't be on there
[20:39:18] but it will be on gerrit-new
[20:39:19] Oh I see
[20:40:59] yep
[20:44:48] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, Patch-For-Review: Build an apt repository on deployment-prep (for testing packages from jenkins) - https://phabricator.wikimedia.org/T146497#2663333 (yuvipanda) No objections. Poke me again next week and I'll CR? We're in a puppet fr...
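An aside on the [20:19:29] GPG error above: the usual quick remedy is to import the key id that the NO_PUBKEY message names, although the durable fix is to have puppet manage the archive key. A sketch only; fetching from keyserver.ubuntu.com is an assumption, not something done in this log:

    # Import the missing saltstack archive key so apt can verify the
    # jessie-saltstack InRelease signature again (key id from the error).
    sudo apt-key adv --keyserver keyserver.ubuntu.com \
        --recv-keys B09E40B0F2AE6AB9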
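And picking up the MySQL-on-Trusty thread around [20:21:11]: a gentler variant of that one-off salt command would start mysql only where it is actually down. A sketch under the same salt targeting shown in the log; mysqladmin ping is the stock liveness check shipped with the MySQL client tools:

    # Start MySQL only on the Trusty slaves where it is not answering,
    # leaving already-running instances untouched.
    salt -v '*trusty*' cmd.run \
        'mysqladmin ping >/dev/null 2>&1 || service mysql start'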
[20:45:35] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:46:41] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:50:51] hashar, do you know why deployment-tin.eqiad.wmflabs says "do not use this server" (and "Connect to 'deployment.eqiad.wmnet' instead, it will route you to the correct server.", which is certainly wrong)?
[20:51:05] hashar, is there really a new Beta Cluster deployment server, or is it just a Puppet bug?
[20:52:13] matt_flaschen: use deployment-mira
[20:52:17] ah yeah
[20:52:20] the message is wrong
[20:54:12] PROBLEM - Puppet run on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:05:12] bah
[21:11:08] matt_flaschen: we're mid-migration to jessie for the deploy servers in both beta and prod, hence that motd that recommends using deployment.eqiad.wmnet (which will always point to the right place)
[21:12:09] greg-g, okay, but the message is wrong when you ssh to deployment-tin. It should point to deployment-mira.
[21:12:35] right, I think it's a prod'ism, as it were
[21:14:45] matt_flaschen: greg-g: yeah, the motd is hardcoded in a puppet file
[21:14:55] in theory one can convert the file to a puppet erb template
[21:15:07] and inject whatever variable represents the current scap master
[21:15:11] or the entry point
[21:15:48] hashar, yes, etonkovidova asked me about the message (probably not because of that per se, but it would be nice to make it accurate).
[21:15:52] hashar it looks like CI is very slow,
[21:15:58] one nodepool instance
[21:16:06] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663465 (Mattflaschen-WMF)
[21:16:10] T146505
[21:16:10] Oh never mind
[21:16:12] paladox: yeah, we have been out of quota / resources since July 4th
[21:16:13] only one patch
[21:16:18] Oh yep
[21:16:28] The American holiday
[21:16:38] and that is with most php jobs still on permanent slaves :(
[21:16:44] basically we are stuck
[21:16:55] Oh
[21:17:19] We should probably set up a labs-type thing for releng
[21:17:36] so that you can have more space than using labs, where everyone goes to test things
[21:17:38] :)
[21:17:44] at the bottom of the Zuul status page there are some links to CI / Nodepool / Zuul which are grafana boards
[21:17:50] the nodepool one has some extended details
[21:17:56] e.g. https://grafana.wikimedia.org/dashboard/db/nodepool
[21:17:58] Oh thanks
[21:18:01] notably it shows you the status of the pool
[21:18:05] 12 instances max
[21:18:09] oh
[21:18:09] green = ready to get a job
[21:18:18] blue = busy executing
[21:18:27] yellow = node is spawning / being provisioned
[21:18:32] around 8pm it started getting problems
[21:18:41] yeah
[21:18:44] typical busy hour
[21:19:07] Oh, I guess we should get dedicated hardware for releng :)
[21:19:12] RECOVERY - Puppet run on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:19:24] the challenge is to find some metrics that represent the wait time / annoyance to the devs
[21:19:28] so we can claim more instances
[21:19:43] but I am unable to find a meaningful one that would say: we need X instances because of that
[21:19:49] so we are stuck at 12 max for now
[21:21:06] Yep, but I mean dedicated hardware not provided by labs, since it seems they are going through a (resource) crunch :)
[21:21:17] hardware isn't free
[21:21:32] there's all kinds of "coulds" that could be done
[21:21:43] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[21:21:50] Oh
[21:21:53] ok
[21:22:39] PROBLEM - Keyholder status on deployment-mira is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:22:51] paladox: yeah, what greg said; we can come up with something that is lighter / takes fewer resources
[21:23:01] Oh ok
[21:23:26] we'll be talking about it a lot at our team offsite in October (Oct 17-21); more later :)
[21:23:37] Oh :)
[21:23:57] I should add a sleep 180 to all jobs and see whether anyone complains :D
[21:24:10] anyway, time for bed. Had a rather crazy week with the jobrunners and all
[21:24:18] +++
[21:24:18] Oh yep
[21:24:19] :)
[21:26:13] g'night hashar, have a good weekend
[21:26:51] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Easy, Puppet: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663483 (hashar)
[21:28:38] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Easy, Puppet: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663487 (hashar)
[21:28:54] ^^ that one is trivial :D
[22:58:12] Deployment-Systems, Operations: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2663619 (Mattflaschen-WMF)
[23:23:15] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
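On the motd fix hashar sketched at [21:14:55] (convert the hardcoded puppet file to a template that knows the current scap master), here is one possible shape, as a hedged sketch only: rather than baking the hostname in at puppet-compile time, a small motd fragment could resolve the deployment.eqiad.wmnet alias that greg-g mentioned at [21:11:08] and warn when the local host is not the active master. The /etc/update-motd.d path and the CNAME lookup are illustrative assumptions, not the actual change made for T146505:

    #!/bin/sh
    # Hypothetical motd fragment, e.g. /etc/update-motd.d/97-deploy-master.
    # deployment.eqiad.wmnet is said above to always point at the active
    # deployment server; warn when this host is not it.
    active=$(dig +short deployment.eqiad.wmnet CNAME | sed 's/\.$//')
    if [ -n "$active" ] && [ "$active" != "$(hostname --fqdn)" ]; then
        echo "NOTE: active deployment server is ${active}; use it instead."
    fi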