[01:57:35] <awight>	 Is zuul hung?  I'm expecting to see https://gerrit.wikimedia.org/r/#/c/311888/ in gate-and-submit...
[02:07:56] <shinken-wm>	 PROBLEM - Puppet staleness on deployment-jobrunner01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [43200.0]
[03:11:36] <wmf-insecte>	 Project mediawiki-core-code-coverage build #2275: STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2275/
[04:15:24] <shinken-wm>	 PROBLEM - Puppet staleness on deployment-db1 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0]
[04:27:01] <shinken-wm>	 PROBLEM - Puppet staleness on deployment-db2 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0]
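The Puppet staleness alerts report what share of recent samples exceeds the 43200-second (12-hour) threshold. Below is a minimal sketch of that arithmetic. It assumes the check simply counts non-missing samples above the threshold within a window. `percent_above` and the sample values are illustrative, not shinken's actual implementation.

```python
from typing import Optional, Sequence

CRITICAL_THRESHOLD = 43200.0  # seconds: 12 hours since the last Puppet run (assumed meaning)

def percent_above(datapoints: Sequence[Optional[float]], threshold: float) -> Optional[float]:
    """Percentage of non-missing datapoints strictly above `threshold`.

    Missing samples (None) are ignored; if none remain there is nothing to judge.
    """
    valid = [d for d in datapoints if d is not None]
    if not valid:
        return None
    above = sum(1 for d in valid if d > threshold)
    return 100.0 * above / len(valid)

samples = [3600.0] * 8 + [50000.0, 60000.0]  # made-up ages in seconds
print(f"{percent_above(samples, CRITICAL_THRESHOLD):.2f}% of data above the critical threshold")
# 20.00% of data above the critical threshold
```

With 4 of 9 samples above the threshold the same function yields 44.44%, matching the figure in the deployment-db1 alert.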
[04:58:51] <wikibugs>	 Gerrit: PHP libraries as Gerrit top-level projects - https://phabricator.wikimedia.org/T125031#1972587 (Legoktm) I created the mediawiki/libs placeholder repository today, so new libraries can be created under it. The first one is mediawiki/libs/WaitConditionLoop :)
[07:16:53] <grrrit-wm>	 (PS1) Legoktm: Add jobs for ScopedCallback and WaitConditionLoop libraries [integration/config] - https://gerrit.wikimedia.org/r/311927
[07:17:49] <grrrit-wm>	 (CR) Legoktm: [C: 2] Add jobs for ScopedCallback and WaitConditionLoop libraries [integration/config] - https://gerrit.wikimedia.org/r/311927 (owner: Legoktm)
[07:18:46] <grrrit-wm>	 (Merged) jenkins-bot: Add jobs for ScopedCallback and WaitConditionLoop libraries [integration/config] - https://gerrit.wikimedia.org/r/311927 (owner: Legoktm)
[07:19:25] <legoktm>	 !log deploying https://gerrit.wikimedia.org/r/311927
[07:19:28] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[07:24:39] <wikibugs>	 Gerrit, Zuul: Gerrit ssh command failing on new repositories, causing zuul to not run against a change - https://phabricator.wikimedia.org/T146260#2655194 (Legoktm)
[07:52:41] <hashar>	 moritzm: elukey: hello! I went crazy yesterday and built a new deployment-mira instance
[07:52:51] <hashar>	 with extended disk space.   Looks all good to us now 
[07:53:20] <moritzm>	 hashar: saw that that you did what I planned for this morning, much appreciated :-)
[07:57:55] <hashar>	 moritzm: yeah I wanted to try it myself to level up on deployment server provisioning
[07:58:00] <hashar>	 learned a few tricks
[07:58:18] <moritzm>	 hashar: shall we replace tin in deployment-prep today?
[07:58:37] <hashar>	 worth trying
[07:59:09] <hashar>	 going to be a bit funnier since deployment-tin is also a Jenkins slave
[07:59:57] <hashar>	 I think we can land https://gerrit.wikimedia.org/r/#/c/311760/  (which replaces deployment-mira02 with deployment-mira)
[08:01:44] <moritzm>	 ok, will make some time and look into merging it after that
[08:02:42] <elukey>	 hashar: hello! Can I nuke jobrunner01?
[08:02:53] <elukey>	 it has not been running anything since yesterday
[08:06:13] <hashar>	 elukey: yeah
[08:06:28] <hashar>	 elukey: I think the jessie one is good enough 
[08:09:36] <elukey>	 super
[08:11:26] <elukey>	 !log terminated jobrunner01 and removed it from deployment-prep's scap dsh list
[08:11:30] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[08:11:58] <wikibugs>	 Beta-Cluster-Infrastructure, Operations, HHVM, Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2655298 (elukey)
[08:12:30] <shinken-wm>	 PROBLEM - Host deployment-jobrunner01 is DOWN: CRITICAL - Host Unreachable (10.68.17.96)
[08:15:48] <elukey>	 yeah shinken sorry
[08:16:13] <wmf-insecte>	 Project beta-scap-eqiad build #120961: FAILURE in 1 min 41 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120961/
[08:18:39] <wikibugs>	 Browser-Tests-Infrastructure, Patch-For-Review, User-zeljkofilipin: mediawiki_selenium feature to show/capture Selenium WebDriver requests to remote browser. - https://phabricator.wikimedia.org/T94577#2655304 (Jhernandez) Nice progress!  I believe reading web's tests in jenkins mw-selenium job are ru...
[08:26:19] <wmf-insecte>	 Project beta-scap-eqiad build #120962: STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120962/
[08:29:05] <wmf-insecte>	 Project beta-scap-eqiad build #120963: STILL FAILING in 1 min 43 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120963/
[08:34:04] <hashar>	 scap still tries to reach deployment-jobrunner01.deployment-prep.eqiad.wmflabs :/
[08:34:30] <hashar>	  deployment-tin  /etc/dsh/group/mediawiki-installation:deployment-jobrunner01.deployment-prep.eqiad.wmflabs
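As the grep output shows, scap was still reading deployment-jobrunner01 from the dsh group file, so a terminated host must also be removed from /etc/dsh/group/mediawiki-installation. Below is a minimal sketch of that filtering. It assumes one hostname per line. `remove_host` is a hypothetical helper; a durable fix belongs in whatever generates that file on beta.

```python
def remove_host(group_text: str, host: str) -> str:
    """Return dsh group contents without `host`.

    Assumes one hostname per line; comments and blank lines are preserved.
    """
    kept = [line for line in group_text.splitlines() if line.strip() != host]
    return "\n".join(kept) + "\n" if kept else ""


group = (
    "# mediawiki-installation (example contents)\n"
    "deployment-mediawiki06.deployment-prep.eqiad.wmflabs\n"
    "deployment-jobrunner01.deployment-prep.eqiad.wmflabs\n"
)
print(remove_host(group, "deployment-jobrunner01.deployment-prep.eqiad.wmflabs"), end="")
```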
[08:36:14] <wmf-insecte>	 Project beta-scap-eqiad build #120964: STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120964/
[08:37:46] <hashar>	 !log beta: manually rebased puppetmaster
[08:37:51] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[08:38:04] <hashar>	 the puppet autorebaser does not work anymore; the last run was on 20160920T1900
[08:38:05] <hashar>	 bah
[08:39:06] <hashar>	 local diff detected, fixed
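The "20160920T1900" stamp is a compact date-time. Below is a small sketch of how old it made the last autorebase when the remark was made. It assumes the stamp and the IRC timestamps share a timezone. `rebase_age` is illustrative only.

```python
from datetime import datetime, timedelta

STALENESS_THRESHOLD = timedelta(seconds=43200)  # 12 hours, as in the shinken alerts

def rebase_age(stamp: str, now: datetime) -> timedelta:
    """Age of a 'YYYYMMDDTHHMM' stamp relative to `now` (same timezone assumed)."""
    return now - datetime.strptime(stamp, "%Y%m%dT%H%M")

# The remark was made at 08:38 on 2016-09-21.
age = rebase_age("20160920T1900", datetime(2016, 9, 21, 8, 38))
print(age, age > STALENESS_THRESHOLD)  # 13:38:00 True
```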
[08:46:13] <wmf-insecte>	 Yippee, build fixed!
[08:46:14] <wmf-insecte>	 Project beta-scap-eqiad build #120965: FIXED in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120965/
[08:47:21] <hashar>	 elukey: moritz: lets drop the Trusty mira ! https://gerrit.wikimedia.org/r/311939   :)
[08:49:13] <moritzm>	 having a look
[08:50:07] <hashar>	 should free up some quota to create a new tin
[08:53:04] <hashar>	 rebasing puppet master  / running puppet on tin
[08:54:55] <wmf-insecte>	 Project beta-scap-eqiad build #120966: FAILURE in 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120966/
[08:54:57] <moritzm>	 hashar: I'll remove mira now, ok?
[08:55:04] <hashar>	 yeah
[08:55:10] <hashar>	 got puppet to update the conf on deployment-tin
[08:55:16] <hashar>	 finally ! \O/
[08:55:34] <moritzm>	 !log remove mira from deployment-prep (replaced by deployment-mira)
[08:55:38] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[08:55:53] <hashar>	 so I think we can migrate the production mira  or add a jessie deployment server in prod
[08:56:00] <hashar>	 there is high confidence it is going to just work
[08:56:59] <shinken-wm>	 PROBLEM - Host mira is DOWN: CRITICAL - Host Unreachable (10.68.17.215)
[08:57:27] <moritzm>	 Tyler sounded as if he wanted to run more tests?
[08:57:51] <wmf-insecte>	 Yippee, build fixed!
[08:57:51] <wmf-insecte>	 Project beta-scap-eqiad build #120967: FIXED in 1 min 39 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120967/
[08:58:02] <hashar>	 moritzm: maybe. Cant remember the task
[08:58:15] <hashar>	 from a discussion I had with him yesterday one blocker was /srv being too small and that is solved
[08:58:34] <hashar>	 we had a quick chat about trying to run an actual scap/deploy from mira (instead of tin)  so maybe that is what he wants to try
[08:58:49] <moritzm>	 https://phabricator.wikimedia.org/T144578#2650020
[08:59:29] <moritzm>	 with that blocker now resolved, I'd say lets have this evening/US daytime for testing and then I'll reimage mira in production tomorrow
[09:00:04] <hashar>	 yeah seems he is willing to test a deploy from mira
[09:00:14] <hashar>	 which I know how to test actually hehe
[09:00:46] <hashar>	 the Jenkins jobs use deployment-tin right now, I can migrate them to use deployment-mira instead
[09:00:50] <moritzm>	 we'll need to add the jessie tin as deployment-tin2?
[09:00:52] <hashar>	 then we can assert that scap works fine
[09:01:10] <hashar>	 thus  deployment-mira becomes the master deployment server
[09:01:20] <hashar>	 we can then add deployment-tin2 and delete deployment-tin
[09:01:32] <hashar>	 then switch the jobs from deployment-mira to deployment-tin2
[09:02:10] <hashar>	 if we want to keep the deployment-tin name, I guess we can switch to mira and delete/recreate deployment-tin 
[09:10:14] <moritzm>	 I'm fine either way, no preference at all
[09:24:30] <hashar>	 gonna switch
[09:34:14] <hashar>	 !log From [[Hiera:deployment-prep]] remove bit already in puppet:  "scap::deployment_server": deployment-tin.deployment-prep.eqiad.wmflabs
[09:34:18] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:43:28] <hashar>	 !log beta: switching master deployment server from deployment-tin to deployment-mira
[09:43:31] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:56:30] <moritzm>	 hashar: shall I merge https://gerrit.wikimedia.org/r/#/c/311946/1 ?
[09:57:34] <shinken-wm>	 PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[10:03:37] <hashar>	 moritzm: yeah that one is safe
[10:03:46] <hashar>	 there is a follow up change that I am currently trying out
[10:05:46] <hashar>	 !log Arming keyholder on deployment-mira
[10:05:50] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[10:07:33] <shinken-wm>	 RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[10:07:51] <hashar>	 !log Making deployment-mira a Jenkins slave by applying puppet class role::ci::slave::labs::common  T144578 
[10:07:54] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[10:10:53] <hashar>	 !log deployment-mira removing "role::labs::lvm::srv"  duplicate with role::ci::slave::labs::common
[10:10:57] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[10:11:22] <hashar>	 what a mess
[10:13:31] <shinken-wm>	 PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[10:21:18] <hashar>	 so now
[10:21:33] <hashar>	 a CI slave expects the labs extended disk to be mounted on /mnt
[10:21:49] <hashar>	 on deployment-mira we have it mounted on /srv
[10:22:03] <hashar>	 on deployment-tin it is mounted on /mnt  and there is a symlink  /srv --> /mnt/srv
[10:22:06] <hashar>	 rather messy :D
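On deployment-tin, /srv is a symlink into the /mnt disk, so anything written to /srv actually lands on the /mnt volume. Below is a sketch that recreates that layout in a scratch directory and resolves it with `os.path.realpath`. `tin_style_layout` is a hypothetical helper.

```python
import os
import tempfile

def tin_style_layout(root: str) -> str:
    """Create <root>/mnt/srv and a <root>/srv symlink to it; return what srv resolves to."""
    real_srv = os.path.join(root, "mnt", "srv")
    os.makedirs(real_srv)
    os.symlink(real_srv, os.path.join(root, "srv"))
    return os.path.realpath(os.path.join(root, "srv"))

with tempfile.TemporaryDirectory() as scratch:
    resolved = tin_style_layout(scratch)
    print(os.path.relpath(resolved, os.path.realpath(scratch)))  # mnt/srv
```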
[10:23:04] <moritzm>	 hashar: https://gerrit.wikimedia.org/r/#/c/311946/ merged
[10:23:16] <hashar>	 great
[10:23:29] <hashar>	 I am going to get rid of a huge tech debt CI has
[10:23:46] <hashar>	 which is that extended disk is on /mnt  when our best practice is /srv
[10:24:07] <hashar>	 that will please faidon :}
[10:25:30] <hashar>	 moritzm: I need some heavy refactoring and a migration. Will send a bunch of puppet patches and sprint that
[10:25:53] <moritzm>	 ok
[10:26:17] <hashar>	 there is a conflict between the deployment role and the  labs slave role :(
[10:26:21] <hashar>	 due to /mnt vs /srv
[10:26:34] <hashar>	 I am going to make the CI slave role to use /srv  which is long overdue
* hashar grabs an axe and slashes through the puppet manifests
[10:32:10] <wikibugs>	 Continuous-Integration-Config, MediaWiki-extensions-JsonConfig, MediaWiki-extensions-ZeroBanner, Reading-Web-Backlog, and 3 others: Zero phpunit test failure (blocks merges to MobileFrontend) - https://phabricator.wikimedia.org/T145227#2655549 (phuedx) I've scheduled {a8733ff46c611f97b569db1c7981...
[10:50:33] <hashar>	 wish me luck
[10:57:31] <shinken-wm>	 PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[10:57:46] <hashar>	 !log Changing Jenkins slaves home dir for deployment-tin and deployment-mira from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
[10:57:49] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[10:57:59] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-tin is CRITICAL: CRITICAL: deployment-prep.deployment-tin.diskspace.root.byte_percentfree (<10.00%)
[10:58:46] <hashar>	 !log Changing Jenkins slaves home dir for deployment-sca01 and deployment-sca02  from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
[10:58:50] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
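After moving the agents' home directory, anything that still hardcodes the old prefix (workspace paths, ssh config, scripts) needs the same rewrite. Below is a minimal sketch of a safe prefix rewrite. `migrate_path` is hypothetical and says nothing about how Jenkins itself stores the node settings.

```python
OLD_HOME = "/mnt/home/jenkins-deploy"
NEW_HOME = "/srv/jenkins/home/jenkins-deploy"

def migrate_path(path: str, old: str = OLD_HOME, new: str = NEW_HOME) -> str:
    """Map a path under the old agent home to the new one; other paths are unchanged.

    Matching on `old + "/"` avoids rewriting siblings such as /mnt/home/jenkins-deployer.
    """
    if path == old or path.startswith(old + "/"):
        return new + path[len(old):]
    return path

print(migrate_path("/mnt/home/jenkins-deploy/.ssh/known_hosts"))
# /srv/jenkins/home/jenkins-deploy/.ssh/known_hosts
```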
[11:00:15] <shinken-wm>	 PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[11:05:08] <shinken-wm>	 PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[11:08:32] <shinken-wm>	 RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[11:19:16] <wmf-insecte>	 Project beta-cxserver-update-eqiad build #317: FAILURE in 6.4 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/317/
[11:19:37] <hashar>	 I have broken it :(
[11:20:54] <wmf-insecte>	 Project beta-cxserver-update-eqiad build #318: STILL FAILING in 0.88 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/318/
[11:21:39] <wmf-insecte>	 Project beta-cxserver-update-eqiad build #319: STILL FAILING in 0.84 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/319/
[11:22:23] <wmf-insecte>	 Yippee, build fixed!
[11:22:24] <wmf-insecte>	 Project beta-cxserver-update-eqiad build #320: FIXED in 7.9 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/320/
[11:22:59] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-tin is OK: OK: deployment-prep.deployment-tin.diskspace._mnt.byte_percentfree (No valid datapoints found)
[11:24:59] <hashar>	 !log removing Jenkins slave deployment-tin , deployment-mira is the new deployment master  T144578
[11:25:03] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:26:32] <icinga-wm>	 PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:27:22] <hashar>	 doh
[11:29:03] <icinga-wm>	 RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:33:16] <hashar>	 mobrovac: I have broken various services on beta cluster sorry :(
[11:33:29] <mobrovac>	 sigh
[11:33:31] <mobrovac>	 how?
[11:33:38] <mobrovac>	 what's up?
[11:33:38] <hashar>	 did a crazy migration 
[11:33:46] <hashar>	 so that the jenkins slaves use /srv  instead of /mnt
[11:33:56] <hashar>	 and I guess I screwed deployment-sca01 /  deployment-sca02 :((((((((((((
[11:40:08] <shinken-wm>	 RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[11:43:56] <hashar>	 bah
[11:46:21] <hashar>	 somehow the deploy-service group does not exist on mira :(
[11:46:39] <moritzm>	 let me check, I think I know
[11:46:58] <hashar>	 maybe it was created manually
[11:47:02] <icinga-wm>	 PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:47:27] <moritzm>	 ah, no. not what I thought
[11:47:46] <hashar>	 maybe it is not provisioned via puppet
[11:47:56] <hashar>	 on deployment-tin it has deploy-service:x:52946:thcipriani,ladsgroup,mobrovac,twentyafterfour,elukey,joal,akosiaris,halfak,nuria,milimetric,arlolra,bd808,ssastry,cscott,krenair
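The group line pasted from deployment-tin uses the standard /etc/group layout (name:password:gid:members). `grp.getgrnam` reads the local group database. To compare entries pasted from two different hosts, a small parser is enough. Below is a sketch with an illustrative `parse_group_line` and a shortened member list.

```python
from typing import List, NamedTuple

class GroupEntry(NamedTuple):
    name: str
    gid: int
    members: List[str]

def parse_group_line(line: str) -> GroupEntry:
    """Parse an /etc/group line of the form name:password:gid:member1,member2."""
    name, _password, gid, members = line.rstrip("\n").split(":", 3)
    return GroupEntry(name, int(gid), [m for m in members.split(",") if m])

# Member list shortened from the deployment-tin line above.
tin = parse_group_line("deploy-service:x:52946:thcipriani,ladsgroup,mobrovac,elukey")
print(tin.gid, len(tin.members))  # 52946 4
```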
[11:48:02] <hashar>	 brb
[11:49:05] <mobrovac>	 yeah, thcipriani|afk did some magic on deployment-tin to make it work
[11:51:12] <moritzm>	 hashar, mobrovac: modules/beta/manifests/deployaccess.pp hardcodes it to the IP address of deployment-tin
[11:51:19] <hashar>	 ohhh
[11:52:09] <hashar>	 I have a puppet patch to adjust it to mira https://gerrit.wikimedia.org/r/#/c/311947/1/modules/beta/manifests/deployaccess.pp
[11:53:27] <hashar>	 I have cherry picked on beta puppet master
[11:53:38] <shinken-wm>	 PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:55:49] <hashar>	 Sep 21 11:55:28 deployment-mediawiki06 sshd[6935]: reverse mapping checking getaddrinfo for ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs [10.68.20.135] failed - POSSIBLE BREAK-IN ATTEMPT!
[11:55:53] <hashar>	 that is from a mw host
[11:55:57] <hashar>	  error: AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1
[11:56:02] <hashar>	  Failed publickey for mwdeploy from 10.68.20.135
[11:56:26] <shinken-wm>	 PROBLEM - Citoid on deployment-sca02 is CRITICAL: Connection refused
[11:56:34] <hashar>	 # dig +short -x 10.68.20.135
[11:56:34] <hashar>	 ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs.
[11:56:34] <hashar>	 deployment-mira.deployment-prep.eqiad.wmflabs.
[11:56:35] <hashar>	 magic
[11:57:12] <icinga-wm>	 RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[11:57:22] <hashar>	 that is, deployment-mira got an IP address that has two PTR entries :(
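One quick way to spot the duplicate-PTR situation is to count the names that `dig +short -x` returns. Below is a sketch that parses the output exactly as pasted above. `ptr_names` is illustrative.

```python
from typing import List

def ptr_names(dig_output: str) -> List[str]:
    """Split `dig +short -x` output into hostnames, dropping the trailing dot."""
    return [line.strip().rstrip(".") for line in dig_output.splitlines() if line.strip()]

out = """ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs.
deployment-mira.deployment-prep.eqiad.wmflabs.
"""
names = ptr_names(out)
if len(names) > 1:
    print(f"10.68.20.135 has {len(names)} PTR records: {names}")
```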
[11:59:42] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-sca01 is CRITICAL: CRITICAL: deployment-prep.deployment-sca01.diskspace._var.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._mnt.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._var_log.byte_percentfree (No valid datapoints found)deployment-prep.deployment-sca01.diskspace._srv.byte_percentfree (<40
[11:59:46] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-sca02 is CRITICAL: CRITICAL: deployment-prep.deployment-sca02.diskspace._mnt.byte_percentfree (No valid datapoints found)deployment-prep.deployment-sca02.diskspace._srv.byte_percentfree (<22.22%)
[12:04:18] <hashar>	 $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki06.deployment-prep.eqiad.wmflabs
[12:04:18] <hashar>	 Agent admitted failure to sign using the key.
[12:04:20] <hashar>	 yeah bah
[12:04:44] <moritzm>	 hashar: hmm, Andrew Bogott should look into that when he's up
[12:04:54] <hashar>	 I think it is just a notice
[12:05:10] <hashar>	 the dupe dns entries are a known issue,  I dont think it prevents the connection
[12:08:36] <hashar>	 $ ssh-key-ldap-lookup mwdeploy
[12:08:36] <hashar>	 KeyError: 'sshPublicKey'
[12:08:38] <hashar>	 ...
[12:08:46] <hashar>	 the user has no key in ldap bah
[12:09:26] <moritzm>	 ok
[12:09:50] <hashar>	 there is /etc/ssh/userkeys/mwdeploy  though
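The KeyError suggests the lookup indexes an `sshPublicKey` attribute that the mwdeploy LDAP entry simply lacks. sshd still went on to check /etc/ssh/userkeys/mwdeploy, as its debug output that follows shows. Below is a hedged sketch of a lookup that treats a missing attribute as "no LDAP keys" and also reads the local file. It assumes the LDAP entry arrives as a dict mapping attribute names to lists of values, and it is not the real ssh-key-ldap-lookup.

```python
import os
import tempfile
from typing import Dict, List

def ldap_keys(entry: Dict[str, List[str]]) -> List[str]:
    """Public keys from an LDAP entry; a missing attribute means no keys, not an error."""
    return list(entry.get("sshPublicKey", []))

def file_keys(path: str) -> List[str]:
    """Non-comment lines of an authorized-keys style file, or [] if it is absent."""
    try:
        with open(path) as fh:
            return [ln.strip() for ln in fh if ln.strip() and not ln.lstrip().startswith("#")]
    except FileNotFoundError:
        return []

def candidate_keys(entry: Dict[str, List[str]], userkeys_path: str) -> List[str]:
    return ldap_keys(entry) + file_keys(userkeys_path)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "mwdeploy")
    with open(path, "w") as fh:
        fh.write("# example comment\nssh-rsa AAAAB3Nza example\n")
    keys = candidate_keys({"uid": ["mwdeploy"]}, path)
print(keys)  # ['ssh-rsa AAAAB3Nza example']
```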
[12:17:49] <hashar>	 Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: debug1: trying public key file /etc/ssh/userkeys/mwdeploy
[12:17:49] <hashar>	 Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: debug1: fd 4 clearing O_NONBLOCK
[12:17:50] <hashar>	 Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: debug1: restore_uid: 0/0
[12:17:50] <hashar>	 Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: Failed publickey for mwdeploy from 10.68.20.135 port 46514 ssh2: RSA fa:e3:91:7e:86:e6:b3:9c:c5:63:df:44:71:75:cf:2f
[12:21:16] <hashar>	 on deployment-mediawiki06  the fingerprint of /etc/ssh/userkeys/mwdeploy  match the fingerprint of a key in the keyholder
[12:21:56] <hashar>	 but somehow it is not presented
[12:22:33] <hashar>	 hmm it is 
[12:22:37] <hashar>	 but sshd logs:  Postponed publickey for mwdeploy from 10.68.20.135 port 46570 ssh2 [preauth]
[12:33:38] <shinken-wm>	 RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:35:46] <hashar>	 moritzm: so I have no idea what is going on. I don't know whether the ssh issue is due to something missing in our puppet manifests or is related to Jessie :(
[12:41:02] <moritzm>	 mhh, so what in particular fails?
[12:43:34] <hashar>	 moritzm: from deployment-mira (Jessie) I can't use the keyholder agent-proxy
[12:43:39] <hashar>	 eg: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs
[12:44:05] <hashar>	 that yields:  Agent admitted failure to sign using the key.
[12:44:27] <hashar>	 with the remote sshd  apparently accepting the key but postponing it 
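The test above sets SSH_AUTH_SOCK so that ssh signs through the keyholder proxy socket instead of a personal agent. Below is a sketch that builds the same call from Python, assuming it is then passed to `subprocess.run`. `keyholder_ssh` is a hypothetical helper. `BatchMode=yes` is a standard ssh option that makes a refused key fail immediately instead of prompting.

```python
import os
from typing import Dict, List, Tuple

PROXY_SOCK = "/run/keyholder/proxy.sock"

def keyholder_ssh(user: str, host: str, *remote_cmd: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for an ssh call that signs through the keyholder proxy socket."""
    env = dict(os.environ, SSH_AUTH_SOCK=PROXY_SOCK)
    argv = ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", *remote_cmd]
    return argv, env

argv, env = keyholder_ssh("mwdeploy", "deployment-mediawiki06.deployment-prep.eqiad.wmflabs", "true")
# On a host with an armed keyholder: subprocess.run(argv, env=env, check=True)
```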
[12:47:02] <thcipriani|afk>	 so the agent admitted failure to sign using the key message is a result of you not being in the group authorized to use the key
[12:47:02] <moritzm>	 but in that case it's from deployment-mira to deployment-mira, so on the same host?
[12:47:20] <hashar>	 luckily magic tyler woke up early :}
[12:47:31] <moritzm>	 heh
[12:48:27] <thcipriani|afk>	 this is something I do manually on beta :((
[12:48:34] <thcipriani|afk>	 it's captured in point 1 here https://phabricator.wikimedia.org/T144647
[12:49:37] <hashar>	 yeah I noticed that part
[12:49:48] <hashar>	 and added myself to that group
[12:50:06] <thcipriani|afk>	 oh
[12:50:30] <hashar>	 SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs  fails still :D
[12:50:51] <hashar>	 I should download the /etc of both deployment-mira and deployment-tin and do a diff :}
[12:52:25] <thcipriani|afk>	 hrm I wonder if this has something to do with the new fingerprint algorithm for ssh agent?
[12:53:22] <hashar>	 looks like the fingerprint values match, but who knows really
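Comparing fingerprints only works when both sides use the same hash. The sshd log prints the legacy colon-separated MD5 form, while newer OpenSSH releases default to `SHA256:` followed by unpadded base64. Both are computed over the base64-decoded key blob. Below is a sketch of both formats, assuming a standard one-line OpenSSH public key. The function names are illustrative.

```python
import base64
import hashlib

def key_blob(pubkey_line: str) -> bytes:
    """Decode the base64 blob from an OpenSSH public key line ('type blob [comment]')."""
    return base64.b64decode(pubkey_line.split()[1])

def md5_fingerprint(pubkey_line: str) -> str:
    """Legacy format, as in the sshd log: colon-separated hex MD5 of the blob."""
    digest = hashlib.md5(key_blob(pubkey_line)).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def sha256_fingerprint(pubkey_line: str) -> str:
    """Newer format: 'SHA256:' plus the unpadded base64 SHA-256 digest of the blob."""
    digest = hashlib.sha256(key_blob(pubkey_line)).digest()
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
```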
[12:57:44] <thcipriani|afk>	 blerg, trying to rearm keyholder...says "keyholder: command not found"
[12:57:46] <hashar>	 Only in mira/etc: gss
[12:57:46] <hashar>	 Only in tin/etc: gssapi_mech.conf
[12:57:52] <hashar>	 yeah /usr/local/sbin
[12:59:00] <thcipriani|afk>	 hrm...may have broken something good :)
[13:04:35] <wmf-insecte>	 Yippee, build fixed!
[13:04:36] <wmf-insecte>	 Project selenium-Math » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #151: FIXED in 35 sec: https://integration.wikimedia.org/ci/job/selenium-Math/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/151/
[13:06:18] <hashar>	 I am handling the production swat
[13:07:25] <thcipriani|afk>	 kk still poking at keyholder
[13:13:58] <thcipriani|afk>	 huh. I have no idea what happened, but it seems to be working now :((
[13:14:47] <thcipriani|afk>	 The only thing I did was have ssh-agent-proxy print out the perms hash on line 231
[13:15:54] <thcipriani|afk>	 although, maybe I didn't try to use /run/keyholder/proxy.sock after reloading keyholder-proxy to reload the perms. It's possible the perms on disk changed since the proxy was last restarted? Only explanation I can think of...
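thcipriani's hypothesis is that the proxy read its permissions once at startup, so later edits on disk were invisible until a reload. The sketch below illustrates that failure mode. It does not reflect keyholder's real config format or reload mechanism: it uses JSON, a hypothetical `PermsCache` class, and an illustrative group name, and borrows the fingerprint from the sshd log only as sample data.

```python
import json
import os
import tempfile
from typing import Dict, List

class PermsCache:
    """Hypothetical proxy-side cache: permissions are read once, then only on reload()."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.perms: Dict[str, List[str]] = {}
        self.reload()

    def reload(self) -> None:
        with open(self.path) as fh:
            self.perms = json.load(fh)  # group name -> allowed key fingerprints

    def allowed(self, groups: List[str], fingerprint: str) -> bool:
        return any(fingerprint in self.perms.get(g, []) for g in groups)

FP = "fa:e3:91:7e:86:e6:b3:9c:c5:63:df:44:71:75:cf:2f"
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "perms.json")
    with open(path, "w") as fh:
        json.dump({"example-group": []}, fh)
    cache = PermsCache(path)                     # proxy starts with the old perms
    with open(path, "w") as fh:
        json.dump({"example-group": [FP]}, fh)   # perms change on disk
    before = cache.allowed(["example-group"], FP)
    cache.reload()                               # e.g. after restarting the proxy
    after = cache.allowed(["example-group"], FP)
print(before, after)  # False True
```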
[13:16:08] <thcipriani|afk>	 I'm going to finish morning things bbiab.
[13:46:24] <hashar>	 thcipriani|afk: or I have screwed up when arming the keyholder
[13:46:31] <wmf-insecte>	 Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #152: FAILURE in 2 min 31 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/152/
[13:48:53] <wikibugs>	 Browser-Tests-Infrastructure, Patch-For-Review, User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2655927 (zeljkofilipin) [[ https://github.com/mozilla/geckodriver#firefox-capabilities | `firefoxOptions` ]] is the way to go, but is a feature o...
[13:49:39] <hashar>	 ah and sudo is scary
[13:49:45] <hashar>	  sudo -u mwdeploy -n -- ....
[13:49:53] <hashar>	 failed to add keys to /mnt/home/jenkins-deploy/.ssh/  :D
[13:50:03] <hashar>	 sudo does not reset home, should have -H maybe
[13:51:43] <Krenair>	 what about sudo -iu mwdeploy -n -- ....?
[13:51:45] <Krenair>	 (note the -i)
[13:52:00] <hashar>	 and -i is?
[13:52:45] <hashar>	 and it seems rsync or ssh  does not look in /etc/ssh/ssh_known_hosts
[13:53:01] <Krenair>	 it does some things that result in the home being set
[13:53:07] <Krenair>	 -H probably does the same thing
[13:53:57] <hashar>	 yeah might
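The keys landed under /mnt/home/jenkins-deploy/.ssh because the child process still saw the caller's HOME, and `~` is expanded from HOME. Whether sudo resets HOME depends on sudoers settings such as `always_set_home`. Passing `-H` explicitly requests the target user's home, and `-i` additionally runs a login shell. Below is a sketch that demonstrates the HOME dependency with a child Python process, plus a hypothetical `sudo_argv` builder.

```python
import os
import subprocess
import sys
from typing import List

def home_seen_by_child(env_home: str) -> str:
    """Run a child Python with HOME set and report where it expands '~/.ssh'."""
    out = subprocess.run(
        [sys.executable, "-c", "import os; print(os.path.expanduser('~/.ssh'))"],
        env=dict(os.environ, HOME=env_home),
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def sudo_argv(user: str, *cmd: str, reset_home: bool = True) -> List[str]:
    """Build a non-interactive sudo call; -H sets HOME to the target user's home."""
    argv = ["sudo", "-u", user, "-n"]
    if reset_home:
        argv.append("-H")
    return argv + ["--", *cmd]

print(home_seen_by_child("/mnt/home/jenkins-deploy"))  # /mnt/home/jenkins-deploy/.ssh
```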
[13:54:04] <thcipriani>	 blerg. ssh does not look at ssh_known_hosts
[13:54:20] <thcipriani>	 wonder when/why that changed?
[13:55:14] <wikibugs>	 Browser-Tests-Infrastructure, Patch-For-Review, User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2655941 (zeljkofilipin) Firefox 47.0.1 mediawiki_selenium 1.7.2 watir-webdriver 0.9.3 selenium-webdriver 2.53.4 firefox driver  ``` $ bundle exec...
[13:55:30] <Krenair>	 it's not reading /etc/ssh/ssh_known_hosts, just the user's known_hosts thcipriani?
[13:56:11] <thcipriani>	 Krenair: just going off of hashar 's comment above...double-checking now
[13:56:28] <Krenair>	 ah
[13:56:44] <hashar>	 thcipriani: take your time for our chat,  need to head to restroom/grab coffee
[13:56:55] <thcipriani>	 okie doke
[13:58:28] <hashar>	 ok back
[13:58:44] <thcipriani>	 hrm, /etc/ssh/ssh_known_hosts seems to be working for me on deployment-mira
[13:58:59] <thcipriani>	 SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l mwdeploy deployment-mediawiki06.deployment-prep.eqiad.wmflabs
[13:59:15] <hashar>	 so
[13:59:16] <thcipriani>	 worked with a blank ~/.ssh/known_hosts
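By default ssh consults both the per-user ~/.ssh/known_hosts and the global /etc/ssh/ssh_known_hosts (the `UserKnownHostsFile` and `GlobalKnownHostsFile` options in ssh_config). Below is a rough diagnostic sketch that checks whether a host appears, unhashed, in either file. `host_is_known` is hypothetical and skips hashed and marker lines that ssh itself handles.

```python
import os
import tempfile
from typing import Iterable

def host_is_known(host: str, known_hosts_files: Iterable[str]) -> bool:
    """Quick diagnostic: is `host` listed (unhashed) in any of these known_hosts files?

    Hashed entries ('|1|...'), @cert-authority/@revoked markers and comments are skipped.
    """
    for path in known_hosts_files:
        try:
            with open(path) as fh:
                for line in fh:
                    fields = line.split()
                    if len(fields) < 3 or fields[0][0] in "#|@":
                        continue
                    if host in fields[0].split(","):
                        return True
        except FileNotFoundError:
            continue
    return False

host = "deployment-mediawiki06.deployment-prep.eqiad.wmflabs"
with tempfile.TemporaryDirectory() as d:
    user_file = os.path.join(d, "known_hosts")        # left empty, like a blank ~/.ssh/known_hosts
    global_file = os.path.join(d, "ssh_known_hosts")  # stands in for /etc/ssh/ssh_known_hosts
    open(user_file, "w").close()
    with open(global_file, "w") as fh:
        fh.write(f"{host} ssh-rsa AAAAB3Nza\n")
    found = host_is_known(host, [user_file, global_file])
print(found)  # True
```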
[13:59:25] <hashar>	 on deployment-mira I have used ssh-keyscan to populate it
[13:59:31] <thcipriani>	 nice :)
[13:59:33] <hashar>	 did that hours ago
[13:59:38] <hashar>	 based on your comment on some task
[13:59:39] <hashar>	 then
[13:59:44] <hashar>	 13:58:57 pull-master failed: <CalledProcessError> Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs']' returned non-zero exit status 10
[13:59:56] <hashar>	 deployment-mira has an IP address with two PTR entries in dns
[14:00:03] <Krenair>	 oh lovely
[14:00:06] <thcipriani>	 oh good.
[14:00:08] <hashar>	 and somehow it seems to use the IP
[14:00:11] <Krenair>	 this is T115194
[14:00:12] <hashar>	 does a PTR lookup
[14:00:17] <hashar>	 and use whatever is returned by dns
[14:00:26] <hashar>	 despite us having specific ip / hostnames listed
[14:00:44] <hashar>	 I guess it is a "helper" to show a hostname when IPs are given
[14:00:46] <hashar>	 side effects
[14:00:47] <hashar>	 so
[14:01:08] <hashar>	 since dupe PTR entries are not cleaned, we might have to .... recreate an instance :)
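The scap-master-sync failure was invoked with the ci-jessie name. That fits the theory that the address was mapped back to a name through reverse DNS. Below is a hedged sketch of a safer policy: prefer an explicitly configured name, and refuse to guess when the PTR answers are ambiguous. `target_name` is hypothetical and is not scap's actual code.

```python
from typing import Dict, Optional, Sequence

def target_name(ip: str, configured: Dict[str, str], ptr_records: Sequence[str]) -> Optional[str]:
    """Hostname to use for `ip`, trusting explicit configuration over reverse DNS.

    Reverse DNS is only used when exactly one PTR record exists; with duplicates
    there is no reliable way to pick one, so return None rather than guess.
    """
    if ip in configured:
        return configured[ip]
    if len(ptr_records) == 1:
        return ptr_records[0].rstrip(".")
    return None

ptrs = [
    "ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs.",
    "deployment-mira.deployment-prep.eqiad.wmflabs.",
]
print(target_name("10.68.20.135", {}, ptrs))  # None: ambiguous, refuse to guess
print(target_name("10.68.20.135", {"10.68.20.135": "deployment-mira.deployment-prep.eqiad.wmflabs"}, ptrs))
```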
[14:01:29] <thcipriani>	 heh. I think at this point you're re-re-re-recreating :)
[14:01:34] <Krenair>	 or I can do some magic to vanish the bad PTR entries
[14:01:48] <Krenair>	 unless you have other reasons to want to recreate
[14:02:09] <hashar>	 ohh
[14:02:33] <hashar>	 Krenair: if you get access to Designate yeah dropping the wrong ptr on  10.68.20.135 would be much appreciated
[14:05:56] <hashar>	 !log deployment-mira seems ready for action and is the primary deployment server.  Enabling jenkins to it
[14:06:00] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:06:34] <hashar>	 !log Enabling Jenkins slave deployment-mira
[14:06:38] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:06:41] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5705: FAILURE in 1 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5705/
[14:06:41] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5706: STILL FAILING in 82 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5706/
[14:06:42] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5707: STILL FAILING in 80 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5707/
[14:06:42] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5708: STILL FAILING in 89 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5708/
[14:06:43] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5709: STILL FAILING in 0.14 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5709/
[14:06:43] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5710: STILL FAILING in 74 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5710/
[14:06:44] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5711: STILL FAILING in 67 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5711/
[14:06:44] <wmf-insecte>	 Project beta-code-update-eqiad build #122416: FAILURE in 0.28 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/122416/
[14:06:45] <wmf-insecte>	 Project beta-update-databases-eqiad build #11501: FAILURE in 0.14 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11501/
[14:08:19] <hashar>	 all that is me ^^^^
[14:08:26] <hashar>	 !log deployment-mira adding puppet class beta::autoupdater
[14:08:30] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:08:35] <Krenair>	 Er
[14:08:41] <Krenair>	 I may have just fixed more than I was intending
[14:09:32] <shinken-wm>	 PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:09:42] <Krenair>	 or maybe not
[14:11:44] <wmf-insecte>	 Yippee, build fixed!
[14:11:45] <wmf-insecte>	 Project beta-mediawiki-config-update-eqiad build #5712: FIXED in 1.7 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5712/
[14:12:21] <wmf-insecte>	 Project beta-scap-eqiad build #120971: FAILURE in 35 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120971/
[14:12:40] <Krenair>	 Dammit
[14:12:51] <Krenair>	 {u'message': u'Managed records may not be updated', u'code': 400, u'type': u'bad_request'
[14:13:00] <wmf-insecte>	 Project beta-scap-eqiad build #120972: STILL FAILING in 18 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120972/
[14:13:38] <hashar>	 scap dies with: 14:13:00 PHP Fatal error:  Class 'Memcached' not found in /srv/mediawiki-staging/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 63
[14:14:55] <wmf-insecte>	 Yippee, build fixed!
[14:14:55] <wmf-insecte>	 Project beta-code-update-eqiad build #122417: FIXED in 1 min 54 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/122417/
[14:15:17] <wmf-insecte>	 Project beta-scap-eqiad build #120973: STILL FAILING in 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120973/
[14:16:24] <hashar>	 Krenair: bad luck :/
[14:16:34] <Krenair>	 Did it
[14:16:47] <Krenair>	 Went through the designate source code and found an undocumented HTTP header that allowed what I wanted
[14:16:50] <hashar>	 Krenair: awesome
[14:17:38] <Krenair>	 At least, it wasn't mentioned in the docs I was reading :)
[14:17:47] <hashar>	 ahahah
[14:17:59] <Krenair>	 ah, in the old docs page
[14:18:05] <hashar>	 restarted nscd on deployment-mira
[14:18:08] <Krenair>	 I should fix the new designate rest api docs to show these things
[14:19:33] <shinken-wm>	 RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[14:20:04] <wmf-insecte>	 Project beta-update-databases-eqiad build #11502: STILL FAILING in 3.8 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11502/
[14:20:30] <Krenair>	 http://git.openstack.org/cgit/openstack/designate/tree/designate/api/middleware.py#n100
[14:24:53] <wmf-insecte>	 Project beta-scap-eqiad build #120974: STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120974/
[14:27:49] <hashar>	 !log deployment-sca01 and deployment-sca02 are now broken. The CI puppet class mounts /srv, which ends up being only 500 MBytes
[14:27:54] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:28:25] <hashar>	 mobrovac: found the issue with deployment-sca01 / sca02 :D
[14:28:32] <hashar>	 not enough disk space eeg
[14:28:53] <mobrovac>	 huh
[14:29:37] <mobrovac>	 hashar: /srv has got only 480MB???
[14:29:39] <mobrovac>	 wth?
[14:30:28] <mobrovac>	 hashar: jenkins and jenkins-workspace take more than 50% of that space
[14:30:34] <mobrovac>	 hashar: why are they even there?
[14:33:58] <hashar>	 mobrovac: I think they are small instances with just 20GB disk
[14:34:16] <hashar>	 and I refactored the CI class to mount /srv on the extended disk
[14:34:20] <hashar>	 which has only 500MB
[14:34:33] <hashar>	 a bit less than 20G is for /   (the system)
[14:34:59] <wmf-insecte>	 Project beta-scap-eqiad build #120975: STILL FAILING in 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120975/
[14:35:03] <mobrovac>	 that's not good though
[14:35:38] <mobrovac>	 and, what is jenkins doing on deployment-sca anyway?
[14:35:46] <mobrovac>	 all of the build jobs should be disabled
[14:35:56] <mobrovac>	 since all of the services are now using scap3 for deployments
[14:36:29] <wikibugs>	 Continuous-Integration-Config, MediaWiki-extensions-JsonConfig, MediaWiki-extensions-ZeroBanner, Reading-Web-Backlog, and 3 others: Zero phpunit test failure (blocks merges to MobileFrontend) - https://phabricator.wikimedia.org/T145227#2656042 (phuedx) Open>Resolved I'm happy to mark this...
[14:38:16] <hashar>	 mobrovac: ahhh
[14:38:22] <wikibugs>	 Browser-Tests-Infrastructure, Patch-For-Review, User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2656047 (zeljkofilipin) Firefox 48.0.2 mediawiki_selenium 1.7.2 + [[ https://gerrit.wikimedia.org/r/#/c/310286/ | 310286 ]] watir 6.0.0.beta4 sel...
[14:38:27] <hashar>	 mobrovac: GOODNESS!!!  going to drop the class and unmount /srv
[14:38:48] <mobrovac>	 hashar: no, don't umount /srv, all of the deploy code is there!
[14:39:01] <hashar>	 na it got mounted empty
[14:39:16] <grrrit-wm>	 (PS3) Zfilipin: WIP Marionette [selenium] - https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540)
[14:39:19] <hashar>	 the deploy code is hidden in the /srv directory of the /  partition
[14:39:44] <mobrovac>	 k
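The /srv confusion above comes down to mount shadowing: mounting the small extended-disk volume on /srv did not delete the deploy code already sitting in /srv on the root partition, it only hid it until the volume was unmounted. A minimal Python sketch, assuming a POSIX host, for checking whether a path is its own mount point and how large it is (the helper name `describe_mount` is hypothetical, not part of any WMF tooling):

```python
import os
import shutil


def describe_mount(path: str) -> dict:
    """Report whether `path` is a separate mount point and its size in MB.

    Mounting a volume over a non-empty directory does not delete the old
    files; they stay on the parent filesystem, hidden until unmount.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "path": path,
        "is_mount": os.path.ismount(path),
        "total_mb": usage.total // (1024 * 1024),
        "free_mb": usage.free // (1024 * 1024),
    }


if __name__ == "__main__":
    # "/srv" is the path discussed in the log; it may not exist on every host.
    for candidate in ("/", "/srv"):
        if os.path.exists(candidate):
            print(describe_mount(candidate))
```

Running it against `/srv` before and after `umount /srv` would show the switch from the small volume back to the root filesystem, where the hidden deploy code lives.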
[14:40:04] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] WIP Marionette [selenium] - https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540) (owner: Zfilipin)
[14:44:59] <wmf-insecte>	 Project beta-scap-eqiad build #120976: STILL FAILING in 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120976/
[14:52:51] <wikibugs>	 Browser-Tests-Infrastructure, Patch-For-Review, User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2656073 (zeljkofilipin) I don't think Marionette/geckodriver is ready yet. Setting the profile has changed. I have disabled profiles for this tes...
[14:52:55] <hashar>	 ok fixing deployment-sca01 02
[14:54:13] <shinken-wm>	 PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
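The Shinken Puppet alerts report the share of recent datapoints above a threshold, e.g. `33.33% of data above the critical threshold [0.0]`. The percentages in this log are consistent with small samples (33.33% is 1 of 3, 22.22% is 2 of 9). The sketch below assumes that model; it is an illustration, not Shinken's or Graphite's actual code. The 1% cut-off mirrors the `Less than 1.00% above the threshold` OK message, and skipping undefined (`None`) datapoints is also an assumption:

```python
from typing import Optional, Sequence, Tuple


def percent_above(datapoints: Sequence[Optional[float]], threshold: float) -> float:
    """Percentage of defined datapoints strictly above `threshold`.

    None marks an undefined datapoint and is ignored (assumption).
    """
    defined = [p for p in datapoints if p is not None]
    if not defined:
        raise ValueError("No valid datapoints found")
    above = sum(1 for p in defined if p > threshold)
    return 100.0 * above / len(defined)


def check(datapoints: Sequence[Optional[float]], threshold: float,
          critical_percent: float = 1.0) -> Tuple[str, float]:
    """Return ("OK" or "CRITICAL", percentage rounded to 2 decimals)."""
    pct = percent_above(datapoints, threshold)
    state = "CRITICAL" if pct >= critical_percent else "OK"
    return state, round(pct, 2)


# One non-zero sample out of three gives the 33.33% seen for deployment-ores-redis.
print(check([0, 1, 0], threshold=0.0))  # ('CRITICAL', 33.33)
```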
[14:54:55] <wmf-insecte>	 Project beta-scap-eqiad build #120977: STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120977/
[14:54:56] <thcipriani>	 well this is why it's having a problem finding memcached https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/mwscript#L22
[14:55:03] <hashar>	 !log removed the CI puppet class from deployment-sca01 and deployment-sca02 .  Stopped services using /srv  ,  unmounted /srv, removed it from /etc/fstab
[14:55:06] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:55:14] <hashar>	 puppet is happy on sca01
[14:55:55] <thcipriani>	 should mwscript still be using php5 explicitly? Is that correct?
[14:56:12] <hashar>	 oh my god
[14:56:16] <hashar>	 yeah that is quite old
[14:56:23] <hashar>	 we have a task for it iirc
[14:56:27] <shinken-wm>	 RECOVERY - Citoid on deployment-sca02 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.055 second response time
[14:56:32] <hashar>	 the reason is some job exploding on terbium.eqiad.wmnet when using hhvm
[14:56:38] <hashar>	 which was some corner case / weird issue we had
[14:56:47] <hashar>	 so the hack was to get mwscript to use php5/zend
[14:56:55] <thcipriani>	 ah, ok.
[14:56:57] <hashar>	 it is probably no more needed
[14:57:08] <ebernhardson>	 question about long running tasks, we have roughly 3 of them in search. We have a daily completion suggester build that runs about 7 hours, a weekly dump from hadoop to elasticsearch that runs ~35 hours, and a weekly export to dumps.wikimedia.org that runs for ~48 hours, do they all go on the deployments calendar?
[14:57:14] <hashar>	 but I can't remember off hand what is preventing us from moving terbium and/or mwscript back to zend
[14:57:34] <hashar>	 ebernhardson: in doubt yes?  :)
[14:57:40] <thcipriani>	 well a quick install of php5-memcached will probably fix the explosions in beta-scap-eqiad /me does
[14:58:05] <hashar>	 ebernhardson: thanks for taking care of that.   The idea came from Jynus (the DBA) since having long term running jobs can clash with database upgrades
[14:58:33] <hashar>	 ebernhardson: so the idea behind adding long running tasks to the deployment  calendar  is for people to eventually notice that two things are going to overlap somehow
[14:58:54] <hashar>	 so I guess your mileage may vary
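The point of listing long-running jobs on the deployments calendar is overlap detection: a job is worth listing because it may collide with something else, such as a database upgrade. A minimal Python sketch of that check, using the durations ebernhardson gives (about 7 h for the completion suggester build, about 35 h for the Hadoop to Elasticsearch dump); every start time and the database maintenance window are hypothetical:

```python
from datetime import datetime, timedelta
from itertools import combinations
from typing import Dict, List, Tuple

Window = Tuple[datetime, datetime]


def window(start: datetime, hours: float) -> Window:
    """Half-open interval [start, start + hours)."""
    return start, start + timedelta(hours=hours)


def overlaps(a: Window, b: Window) -> bool:
    """Two half-open intervals intersect iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def conflicts(calendar: Dict[str, Window]) -> List[Tuple[str, str]]:
    """All pairs of calendar entries whose windows overlap."""
    return [
        (name_a, name_b)
        for (name_a, win_a), (name_b, win_b) in combinations(calendar.items(), 2)
        if overlaps(win_a, win_b)
    ]


# Durations from the discussion; every start time here is hypothetical.
calendar = {
    "hadoop_to_elasticsearch_dump": window(datetime(2016, 9, 26, 0, 0), 35),
    "completion_suggester_build": window(datetime(2016, 9, 27, 2, 0), 7),
    "db_maintenance": window(datetime(2016, 9, 27, 9, 0), 1),
}
print(conflicts(calendar))
```

Because the intervals are half-open, the suggester build ending exactly at 09:00 does not conflict with a maintenance window starting at 09:00, while the 35-hour dump overlaps both.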
[14:59:09] <hashar>	 mobrovac: 	RECOVERY - Citoid on deployment-sca02 
[14:59:19] <ebernhardson>	 hashar: hmm, ok. I wasn't sure because these are scheduled things, the notice kinda read like it was for one-off maintenance actions but wasn't sure
[14:59:42] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-sca01 is OK: OK: deployment-prep.deployment-sca01.diskspace._var.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._srv.byte_percentfree (More than half of the datapoints are undefined) deployment-prep.deployment-sca01.diskspace._mnt.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._var_log.byte_perce
[15:00:05] <hashar>	 ebernhardson: I guess be bold and enquire about it on whatever task/mail that proposed to add long running task
[15:00:12] <hashar>	 ebernhardson: maybe in your corner case it is not needed at all
[15:00:15] <mobrovac>	 hashar: thnx
[15:00:44] <hashar>	 mobrovac: and puppet is all happy.  How do you get services deployed on those sca hosts? 
[15:01:21] <hashar>	 mobrovac: I mean: do you just git pull on the deployment server then  scap deploy?
[15:02:07] <mobrovac>	 yes
[15:03:02] <Krenair>	 so where are we with deployments on beta?
[15:03:11] <Krenair>	 is deployment-mira the current master?
[15:03:25] <hashar>	 yeah 
[15:03:40] <hashar>	 after a long day of madness
[15:03:45] <Krenair>	 you're working on it?
[15:04:03] <hashar>	 thcipriani: I can't find the task about mwscript still using Zend php :(
[15:04:44] <hashar>	 thcipriani: but I found the root cause which is at https://phabricator.wikimedia.org/T132751
[15:04:46] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-sca02 is OK: OK: deployment-prep.deployment-sca02.diskspace._mnt.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca02.diskspace._srv.byte_percentfree (No valid datapoints found)
[15:04:53] <hashar>	 and the hack https://gerrit.wikimedia.org/r/#/c/267816/
[15:04:54] <wmf-insecte>	 Project beta-scap-eqiad build #120978: STILL FAILING in 18 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120978/
[15:05:14] <shinken-wm>	 RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:06:13] <hashar>	 Krenair: yeah with ops we got deployment-mira setup with jessie with keyholder/scap etc
[15:06:23] <hashar>	 made it a master to validate that all the stack works fine on Jessie
[15:06:29] <Krenair>	 I just ask because I notice mediawiki-config is 6 commits behind
[15:06:32] <thcipriani>	 Krenair: working on it. beta-scap-eqiad keeps failing since we're using php5 explicitly in mwscript so a bunch of libraries that hhvm has built-in need to be installed. Got php5-memcached and php5-redis installed just now
[15:06:33] <hashar>	 then we are going to get rid of deployment-tin (trusty)
[15:06:38] <ebernhardson>	 thcipriani: the relevant patch that made it use php5 was from me, https://phabricator.wikimedia.org/rOPUP8f8e7dbdd834066504e59edfc4881bb98f76072a
[15:06:38] <Krenair>	 ah, yep
[15:06:39] <hashar>	 and rebuild one based on Jessie
[15:06:47] <ebernhardson>	 thcipriani: it was to fix a deployment error. not sure if it still exists
[15:06:59] <thcipriani>	 oh, gotcha.
[15:07:01] <Krenair>	 well, if there's anything you need my help with, I'm here
[15:07:28] <shinken-wm>	 RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:07:45] <thcipriani>	 interesting, so scap failure was the cause of the switch to php5 :)
[15:08:28] <hashar>	 I can't report it to phabricator :( 
[15:08:49] <hashar>	 Manifest error #666:  out of quota, you have created too many tasks
[15:09:23] <thcipriani>	 hahaha, what‽
[15:09:37] <thcipriani>	 hashar: phabricator thinks you work too much :)
[15:10:06] <hashar>	 flling
[15:10:51] <wmf-insecte>	 Project mediawiki-core-code-coverage build #2276: STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2276/
[15:14:22] <hashar>	 thcipriani: did you get a task for lack of Memcached class on deployment server?
[15:15:10] <thcipriani>	 no, will file when I get a full list of needed libraries
[15:16:31] <thcipriani>	 although I think those two may have been it: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120979/console
[15:16:41] * thcipriani files task
[15:17:17] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible, HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (hashar)
[15:17:55] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible, HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656124 (hashar)
[15:18:29] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible, HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (hashar)
[15:18:50] <ebernhardson>	 hashar: that's backwards, curl_init_pooled only exists in hhvm
[15:18:59] <hashar>	 thcipriani: you can add it as a subtask of https://phabricator.wikimedia.org/T146285 which I have just created.   Its description at the end as a placeholder "task to be filled" for you to add your task :)
[15:19:06] <hashar>	 ebernhardson: oh my I am very bad ://
[15:19:45] <hashar>	 ebernhardson: so your use case is definitely fixed apparently :} 
[15:20:02] <hashar>	 ebernhardson: feel free to comment about it and remove gehel/dcausse and yourself from the list of subscribers. 
[15:20:03] <wmf-insecte>	 Project beta-update-databases-eqiad build #11503: STILL FAILING in 2.9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11503/
[15:20:43] <thcipriani>	 I think that this is why we haven't run into the problem of missing libraries for mwscript with deployment-tin/tin: https://github.com/wikimedia/operations-puppet/blob/production/modules/mediawiki/manifests/packages.pp#L6-L10
[15:20:44] <hashar>	 thcipriani: isn't the deployment-server supposed to include mediawiki::packages or something like that ?
[15:20:51] <hashar>	 thcipriani: but maybe we have dropped the php-redis / php-memcached from it
[15:20:59] * gehel reading back...
[15:21:01] <thcipriani>	 hashar: heh, see last comment
[15:21:07] <gehel>	 hashar: which task are you talking about?
[15:21:12] <ebernhardson>	 hashar: well, the way we fixed it for curl_init_pooled is to appropriately fall back to curl_init when it's not available. i wish i could remember (or had documented) what exactly was failing when it was switched the other way...
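The fix ebernhardson describes is feature detection: call `curl_init_pooled()` (HHVM only) when the runtime provides it and fall back to `curl_init()` otherwise; in PHP that check would typically use `function_exists('curl_init_pooled')`. The same pattern sketched in Python, where the two `SimpleNamespace` objects are hypothetical stand-ins for an HHVM-like and a Zend-like runtime, not real bindings:

```python
from types import SimpleNamespace
from typing import Any, Callable


def pick_curl_init(runtime: Any) -> Callable[[], str]:
    """Prefer the pooled constructor if the runtime provides it, else fall back."""
    pooled = getattr(runtime, "curl_init_pooled", None)
    if callable(pooled):
        return pooled
    return runtime.curl_init


# Hypothetical stand-ins: only the presence or absence of the attribute matters.
hhvm_like = SimpleNamespace(curl_init=lambda: "curl_init",
                            curl_init_pooled=lambda: "curl_init_pooled")
zend_like = SimpleNamespace(curl_init=lambda: "curl_init")

print(pick_curl_init(hhvm_like)())  # curl_init_pooled
print(pick_curl_init(zend_like)())  # curl_init
```

Checking at call time keeps the caller working under either interpreter, which is why pinning mwscript to php5 may no longer be needed for this particular case.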
[15:21:16] <hashar>	 gehel: https://phabricator.wikimedia.org/T146285#2656106
[15:21:24] <gehel>	 hashar: thanks!
[15:21:26] <hashar>	 gehel: related to PHP lacking curl_init_pooled()
[15:21:45] <hashar>	 or the other way around, I am all confused
[15:21:56] <hashar>	 the idea behind is me suddenly feeling that mwscript should use hhvm instead of zend :}
[15:22:30] <hashar>	 thcipriani: yeah so the role::deployment::server would have to include mediawiki::packages::legacy apparently
[15:22:52] <hashar>	 I am very happy we manage to catch that on beta instead of prod :D
[15:23:11] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible, HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656140 (EBernhardson)
[15:24:26] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656143 (thcipriani)
[15:24:55] <wikibugs>	 Browser-Tests-Infrastructure, Patch-For-Review, User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2656155 (zeljkofilipin) a:zeljkofilipin>None
[15:25:08] <wmf-insecte>	 Yippee, build fixed!
[15:25:08] <wmf-insecte>	 Project beta-scap-eqiad build #120979: FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120979/
[15:25:27] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656160 (thcipriani)
[15:25:29] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible, HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656159 (thcipriani)
[15:25:56] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible, HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (thcipriani)
[15:26:38] <hashar>	 !sal
[15:26:38] <wm-bot>	 https://tools.wmflabs.org/sal/releng
[15:29:08] <wikibugs>	 Release-Engineering-Team, Operations, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2656179 (hashar) **status for beta cluster**  deployment-mira is the new master running Jessie.  The Jenkins jobs are running on it.  There are...
[15:29:51] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656143 (hashar)
[15:29:53] <wikibugs>	 Release-Engineering-Team, Operations, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2656183 (hashar)
[15:31:20] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2656188 (thcipriani)
[15:32:33] <hashar>	 !log spawned deployment-tin02
[15:32:38] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:34:13] <shinken-wm>	 RECOVERY - Puppet run on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [0.0]
[15:35:34] <hashar>	 elukey: if still around. deployment-mira is the new master and https://gerrit.wikimedia.org/r/#/c/311947/  reflect that change :}
[15:35:45] <hashar>	 picked it on beta cluster
[15:35:57] <hashar>	 there is a long tail of other, messier changes that go after it, but that specific change is ok
[15:36:14] <elukey>	 nice!
[15:36:25] <hashar>	 though no
[15:36:26] <hashar>	 ah
[15:36:30] <hashar>	 forgot about one more setting
[15:36:39] <hashar>	 copy-pasting from ./hieradata/labs/deployment-prep/host/deployment-tin.yaml :D
[15:41:03] <hashar>	 elukey: sorry spoke too fast earlier.  https://gerrit.wikimedia.org/r/311947 is good to go and includes a hack from https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-tin :)
[15:52:20] <grrrit-wm>	 (Abandoned) Zfilipin: WIP Marionette [selenium] - https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540) (owner: Zfilipin)
[15:56:09] <shinken-wm>	 PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:58:28] <hashar>	 ^^e
[15:58:30] <hashar>	 me
[16:01:24] <hashar>	 !log deployment-tin02 applied puppet classes beta::autoupdater, beta::deployaccess, role::deployment::server, role::labs::lvm::srv
[16:01:27] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:04:58] <shinken-wm>	 PROBLEM - Puppet run on deployment-tin02 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[16:05:18] <hashar>	 being provisioned
[16:10:00] <shinken-wm>	 RECOVERY - Puppet run on deployment-tin02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:11:09] <shinken-wm>	 RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[16:11:57] <wikibugs>	 Browser-Tests-Infrastructure, Patch-For-Review, User-zeljkofilipin: mediawiki_selenium feature to show/capture Selenium WebDriver requests to remote browser. - https://phabricator.wikimedia.org/T94577#2656344 (zeljkofilipin) I have forgot to mention, a simple change in the script displays webdriver i...
[16:20:04] <wmf-insecte>	 Project beta-update-databases-eqiad build #11504: STILL FAILING in 3.9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11504/
[16:23:44] <hashar>	 Call to undefined function curl_multi_init()
[16:23:46] <hashar>	 bah
[16:23:54] <Krenair>	 php5-curl?
[16:24:03] <hashar>	 using zend php5 on deployment-mira
[16:24:07] <hashar>	 I guess yeah
[16:24:14] <hashar>	 tyler has a task about missing packages
[16:24:27] <Krenair>	 we use that php function in VE
[16:24:28] <hashar>	 we are in a meeting though
[16:24:34] <Krenair>	 ok
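The failures on deployment-mira share one cause: mwscript pins Zend `php5`, so extensions that HHVM bundles have to be installed as packages. The log names `php5-memcached` and `php5-redis` (T146286), and the `curl_multi_init()` error points at `php5-curl`. A minimal sketch that parses `php -m` output and lists which of those packages still appear to be missing; the module-to-package mapping assumes these Debian jessie package names, and `missing_packages` is a hypothetical helper:

```python
import shutil
import subprocess
from typing import Dict, List, Set

# Module name as printed by `php -m` -> Debian jessie package providing it.
# php5-memcached and php5-redis come from T146286; php5-curl is inferred
# from the curl_multi_init() error. Adjust for other distributions.
REQUIRED: Dict[str, str] = {
    "memcached": "php5-memcached",
    "redis": "php5-redis",
    "curl": "php5-curl",
}


def loaded_modules(php_m_output: str) -> Set[str]:
    """Parse `php -m` output: one module per line, plus [Section] headers."""
    modules = set()
    for line in php_m_output.splitlines():
        line = line.strip()
        if line and not line.startswith("["):
            modules.add(line.lower())
    return modules


def missing_packages(php_m_output: str, required: Dict[str, str] = REQUIRED) -> List[str]:
    """Packages whose module does not appear in the `php -m` listing."""
    have = loaded_modules(php_m_output)
    return sorted(pkg for module, pkg in required.items() if module not in have)


if __name__ == "__main__":
    php = shutil.which("php5") or shutil.which("php")
    if php:
        out = subprocess.run([php, "-m"], capture_output=True, text=True, check=True).stdout
        print(missing_packages(out))
```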
[16:25:54] <wikibugs>	 Release-Engineering-Team, Operations, Beta-Cluster-reproducible: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2656384 (hashar) https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11504/ fails due to Flow eventually invoking `curl_multi_init()`  Looks...
[16:31:10] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656397 (Jdlrobson)
[16:31:19] <wikibugs>	 Release-Engineering-Team: Remove .gitreview from MediaWiki and Extensions - https://phabricator.wikimedia.org/T146293#2656409 (thcipriani)
[16:32:08] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656424 (Jdlrobson) https://gerrit.wikimedia.org/r/#/c/310458/6 is a first stab at this in MobileFrontend....
[16:33:00] <wikibugs>	 Release-Engineering-Team: Remove .gitreview from MediaWiki and Extensions - https://phabricator.wikimedia.org/T146293#2656409 (Paladox) This will break git-review, which I see some ops users use it.
[16:33:43] <shinken-wm>	 PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[16:38:12] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656445 (ovasileva) p:Triage>Normal
[16:41:34] <hashar>	 !log deployment-tin02 initial provisioning is complete. Gotta add it as a deployment server via a puppet.git patch
[16:41:38] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:41:39] <hashar>	 thcipriani: ^^^ \O/
[16:59:07] <hashar>	 so that has been a long day
[16:59:10] <hashar>	 and I am disappearing
[16:59:15] <hashar>	 :D
[17:07:24] <yuvipanda>	 legoktm: can I shut down integration-puppetmaster, so it doesn't confuse our precise puppetmaster stats?
[17:12:55] <Krenair>	 krenair@deployment-mira:/srv/mediawiki-staging/wmf-config$ git log HEAD..origin/master --oneline | wc -l
[17:12:56] <Krenair>	 6
[17:12:59] <Krenair>	 :(
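Krenair's check counts commits that exist on `origin/master` but not in the staging checkout's `HEAD`; `git rev-list --count HEAD..origin/master` returns the same number directly. A minimal Python wrapper, assuming `git` is on `PATH` (the function names are illustrative). It compares against the last fetched `origin/master`, so a `git fetch` is needed first for an up-to-date answer:

```python
import os
import subprocess


def parse_count(output: str) -> int:
    """`git rev-list --count` prints a single integer followed by a newline."""
    return int(output.strip())


def commits_behind(repo: str, upstream: str = "origin/master") -> int:
    """Number of commits reachable from `upstream` but not from HEAD in `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--count", f"HEAD..{upstream}"],
        capture_output=True, text=True, check=True,
    )
    return parse_count(result.stdout)


if __name__ == "__main__":
    # Path from the log; this only does anything on a host with that checkout.
    staging = "/srv/mediawiki-staging/wmf-config"
    if os.path.isdir(staging):
        print(commits_behind(staging))
```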
[17:13:12] <Krenair>	 thcipriani, you still working on this?
[17:13:43] <shinken-wm>	 RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:14:40] <thcipriani>	 Krenair: hrm, I was focusing on database update, I didn't realize that mw-config was still broken since the job was green again https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/
[17:14:49] <thcipriani>	 although I see now it hasn't run in quite a while.
[17:19:32] <shinken-wm>	 PROBLEM - Puppet run on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:19:48] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests, User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656634 (zeljkofilipin)
[17:20:15] <wmf-insecte>	 Project beta-update-databases-eqiad build #11505: STILL FAILING in 14 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11505/
[17:24:40] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests, User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656397 (zeljkofilipin) a:zeljkofilipin
[17:26:09] <yuvipanda>	 !log cherry-pick https://gerrit.wikimedia.org/r/#/c/312044/ on deployment-puppetmaster
[17:26:12] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests, User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656653 (zeljkofilipin) The rule of thumb so far was that if a feature is needed i...
[17:26:13] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[17:31:00] <wmf-insecte>	 Project beta-update-databases-eqiad build #11506: STILL FAILING in 14 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11506/
[17:35:19] <shinken-wm>	 PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[17:37:32] <wmf-insecte>	 Project beta-update-databases-eqiad build #11507: STILL FAILING in 20 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11507/
[17:37:53] <shinken-wm>	 PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[17:39:01] <grrrit-wm>	 (PS1) Zfilipin: WIP Add helper to Selenium that allows you to query whether JavaScript module has loaded [selenium] - https://gerrit.wikimedia.org/r/312047 (https://phabricator.wikimedia.org/T146292)
[17:39:07] <wmf-insecte>	 Project beta-update-databases-eqiad build #11508: STILL FAILING in 24 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11508/
[17:39:25] <thcipriani>	 sigh.
[17:39:58] <thcipriani>	 When I run update.php manually for these failures they work fine, but the full script continues to fail.
[17:41:40] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] WIP Add helper to Selenium that allows you to query whether JavaScript module has loaded [selenium] - https://gerrit.wikimedia.org/r/312047 (https://phabricator.wikimedia.org/T146292) (owner: Zfilipin)
[17:42:52] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests, Patch-For-Review, User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656739 (zeljkofilipin) a:zeljkofilipin>None Will co...
[17:59:20] <wmf-insecte>	 Yippee, build fixed!
[17:59:20] <wmf-insecte>	 Project beta-update-databases-eqiad build #11509: FIXED in 1 min 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11509/
[17:59:57] <thcipriani>	 so that gets the beta dashboard back to green: https://integration.wikimedia.org/ci/view/Beta/ (aside from the job that has been failing for 4 months)
[18:12:52] <shinken-wm>	 RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:15:19] <shinken-wm>	 RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:20:42] <legoktm>	 yuvipanda: I thought I already shut it down?
[18:21:32] <yuvipanda>	 legoktm: nope is up
[18:21:38] <legoktm>	 wat
[18:21:48] <legoktm>	 then please, shut it down
[18:21:53] <legoktm>	 I'll delete it next week
[18:22:57] <yuvipanda>	 !log shutting down integration-puppetmaster
[18:22:59] <yuvipanda>	 legoktm: done
[18:23:01] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:24:21] <shinken-wm>	 PROBLEM - Host integration-puppetmaster is DOWN: CRITICAL - Host Unreachable (10.68.16.42)
[18:28:18] <wikibugs>	 Browser-Tests-Infrastructure, Reading-Web-Backlog, Reading-Web-Tech-Debt, Browser-Tests, and 2 others: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656938 (bmansurov)
[18:32:04] <shinken-wm>	 PROBLEM - Puppet run on deployment-ms-be01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:38:10] <marktraceur>	 Looks like Jenkins is throwing a fit
[18:38:45] <marktraceur>	 Maybe just on a couple patches
[19:00:20] <paladox>	 hashar you're now known as idoine :)
[19:07:03] <shinken-wm>	 RECOVERY - Puppet run on deployment-ms-be01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:24:37] <shinken-wm>	 PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:59:40] <shinken-wm>	 RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:09:41] <shinken-wm>	 PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:27:37] <shinken-wm>	 PROBLEM - Keyholder status on deployment-mira is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:35:59] <shinken-wm>	 PROBLEM - Puppet run on deployment-tin02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[22:16:02] <shinken-wm>	 RECOVERY - Puppet run on deployment-tin02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:19:56] <wikibugs>	 Release-Engineering-Team, User-greg: Create agenda outline for 2016 RelEng team offsite - https://phabricator.wikimedia.org/T138437#2657686 (greg) Drafting over in https://etherpad.wikimedia.org/p/releng-offsite201610-planning for now (probably will migrate to a gdoc later, as we'll have other assets to...
[22:20:32] <wikibugs>	 Scap3: Local config deploys should use the target's current version - https://phabricator.wikimedia.org/T145373#2657689 (thcipriani)