[01:57:35] Is zuul hung? I'm expecting to see https://gerrit.wikimedia.org/r/#/c/311888/ in gate-and-submit... [02:07:56] PROBLEM - Puppet staleness on deployment-jobrunner01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [43200.0] [03:11:36] Project mediawiki-core-code-coverage build #2275: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2275/ [04:15:24] PROBLEM - Puppet staleness on deployment-db1 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [04:27:01] PROBLEM - Puppet staleness on deployment-db2 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0] [04:58:51] 10Gerrit: PHP libraries as Gerrit top-level projects - https://phabricator.wikimedia.org/T125031#1972587 (10Legoktm) I created the mediawiki/libs placeholder repository today, so new libraries can be created under it. The first one is mediawiki/libs/WaitConditionLoop :) [07:16:53] (03PS1) 10Legoktm: Add jobs for ScopedCallback and WaitConditionLoop libraries [integration/config] - 10https://gerrit.wikimedia.org/r/311927 [07:17:49] (03CR) 10Legoktm: [C: 032] Add jobs for ScopedCallback and WaitConditionLoop libraries [integration/config] - 10https://gerrit.wikimedia.org/r/311927 (owner: 10Legoktm) [07:18:46] (03Merged) 10jenkins-bot: Add jobs for ScopedCallback and WaitConditionLoop libraries [integration/config] - 10https://gerrit.wikimedia.org/r/311927 (owner: 10Legoktm) [07:19:25] !log deploying https://gerrit.wikimedia.org/r/311927 [07:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [07:24:39] 10Gerrit, 07Zuul: Gerrit ssh command failing on new repositories, causing zuul to not run against a change - https://phabricator.wikimedia.org/T146260#2655194 (10Legoktm) [07:52:41] moritzm: elukey: hello! I went crazy yesterday and rebuild a new deployment-mira instance [07:52:51] with extended disk space. 
Looks all good to us now [07:53:15] hashar: saw that that did what I planned for this morning, much appreciated :-) [07:53:20] hashar: saw that that you did what I planned for this morning, much appreciated :-) [07:57:55] moritzm: yeah I wanted to try it myself to level up on deployment server provisionning [07:58:00] learned a few tricks [07:58:18] hashar: shall we replace tin in deployment-prep today? [07:58:37] worth trying [07:59:09] going to be a bit funnier since deployment-tin is also a Jenkins slave [07:59:57] I think we can land https://gerrit.wikimedia.org/r/#/c/311760/ (which replace deployment-mira02 with deployment-mira) [08:01:44] ok, will make some time and look into merging it after that [08:02:42] hashar: hello! Can I nuke jobrunner01? [08:02:53] it has not been running anything since yesterday [08:06:13] elukey: yeah [08:06:28] elukey: I think the jessie one is good enough [08:09:36] super [08:11:26] !log terminated jobrunner01 and removed from deployment-prep's sacp dsh list [08:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [08:11:58] 10Beta-Cluster-Infrastructure, 06Operations, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2655298 (10elukey) [08:12:30] PROBLEM - Host deployment-jobrunner01 is DOWN: CRITICAL - Host Unreachable (10.68.17.96) [08:15:48] yeah shinken sorry [08:16:13] Project beta-scap-eqiad build #120961: 04FAILURE in 1 min 41 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120961/ [08:18:39] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: mediawiki_selenium feature to show/capture Selenium WebDriver requests to remote browser. - https://phabricator.wikimedia.org/T94577#2655304 (10Jhernandez) Nice progress! I believe reading web's tests in jenkins mw-selenium job are ru... 
[08:26:19] Project beta-scap-eqiad build #120962: 04STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120962/ [08:29:05] Project beta-scap-eqiad build #120963: 04STILL FAILING in 1 min 43 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120963/ [08:34:04] scap still tries to reach deployment-jobrunner01.deployment-prep.eqiad.wmflabs :/ [08:34:30] deployment-tin /etc/dsh/group/mediawiki-installation:deployment-jobrunner01.deployment-prep.eqiad.wmflabs [08:36:14] Project beta-scap-eqiad build #120964: 04STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120964/ [08:37:46] !log beta: manually rebased puppetmaster [08:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [08:38:04] the puppet autorebaser does not work anymore last was on 20160920T1900 [08:38:05] bah [08:39:06] local diff detected, fixed [08:46:13] Yippee, build fixed! [08:46:14] Project beta-scap-eqiad build #120965: 09FIXED in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120965/ [08:47:21] elukey: moritz: lets drop the Trusty mira ! https://gerrit.wikimedia.org/r/311939 :) [08:49:13] having a look [08:50:07] should free up some quota to create a new tin [08:53:04] rebasing puppet master / running puppet on tin [08:54:55] Project beta-scap-eqiad build #120966: 04FAILURE in 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120966/ [08:54:57] hashar: I'll remove mira now, ok? [08:55:04] yeah [08:55:10] got puppet to update the conf on deployment-tin [08:55:16] finally ! 
\O/ [08:55:34] !log remove mira from deployment-prep (replaced by deployment-mira) [08:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [08:55:53] so I think we can migrate the production mira or add a jessie deployment server in prod [08:56:00] there is high confidence it is going to just work [08:56:59] PROBLEM - Host mira is DOWN: CRITICAL - Host Unreachable (10.68.17.215) [08:57:27] Tyler sounded as if he wanted to run more tests? [08:57:51] Yippee, build fixed! [08:57:51] Project beta-scap-eqiad build #120967: FIXED in 1 min 39 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120967/ [08:58:02] moritzm: maybe. Can't remember the task [08:58:15] from a discussion I had with him yesterday one blocker was /srv being too small and that is solved [08:58:34] we had a quick chat about trying to run an actual scap/deploy from mira (instead of tin) so maybe that is what he wants to try [08:58:49] https://phabricator.wikimedia.org/T144578#2650020 [08:59:29] with that blocker now resolved, I'd say let's have this evening/US daytime for testing and then I'll reimage mira in production tomorrow [09:00:04] yeah seems he is willing to test a deploy from mira [09:00:14] which I know how to test actually hehe [09:00:46] the Jenkins jobs use deployment-tin right now, I can migrate them to use deployment-mira instead [09:00:50] we'll need to add the jessie tin as deployment-tin2? 
[09:00:52] then we can assert that scap works fine [09:01:10] thus deployment-mira becomes the master deployment server [09:01:20] we can then add deployment-tin2 and delete deployment-tin [09:01:32] then switch the jobs from deployment-mira to deployment-tin2 [09:02:10] if we want to keep the deployment-tin name, I guess we can switch to mira and delete/recreate deployment-tin [09:10:14] I'm fine either way, not preference at all [09:24:30] gonna switch [09:34:14] !log From [[Hiera:deployment-prep]] remove bit already in puppet: "scap::deployment_server": deployment-tin.deployment-prep.eqiad.wmflabs [09:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [09:43:28] !log beta: switching master deployment server from deployment-tin to deployment-mira [09:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [09:56:30] hashar: shall I merge https://gerrit.wikimedia.org/r/#/c/311946/1 ? [09:57:34] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [10:03:37] moritzm: yeah that one is safe [10:03:46] there is a follow up change that I am currently trying out [10:05:46] !log Arming keyholder on deployment-mira [10:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:07:33] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [10:07:51] !log Making deployment-mira a Jenkins slave by applying puppet class role::ci::slave::labs::common T144578 [10:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:10:53] !log deployment-mira removing "role::labs::lvm::srv" duplicate with role::ci::slave::labs::common [10:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:11:22] what a mess [10:13:31] PROBLEM - Puppet run on deployment-mira is CRITICAL: 
CRITICAL: 55.56% of data above the critical threshold [0.0] [10:21:18] so now [10:21:33] a CI slave expect the labs extended disk to be mounted on /mnt [10:21:49] on deployment-mira we have it mounted on /srv [10:22:03] on deployment-tin it is mounted on /mnt and there is a symlink /srv --> /mnt/srv [10:22:06] rather messy :D [10:23:04] hashar: https://gerrit.wikimedia.org/r/#/c/311946/ merged [10:23:16] great [10:23:29] I am going to get rid of a huge tech debt CI has [10:23:46] which is that extended disk is on /mnt when our best practice is /srv [10:24:07] that will please faidon :} [10:25:30] moritzm: I need some heavy refactoring and a migration. Will send a bunch of puppet patches and sprint that [10:25:53] ok [10:26:17] there is a conflict between the deployment role and the labs slave role :( [10:26:21] due to /mnt vs /srv [10:26:34] I am going to make the CI slave role to use /srv which is long overdue [10:26:51] * hashar grab an axe and slash in puppet manifests [10:32:10] 10Continuous-Integration-Config, 10MediaWiki-extensions-JsonConfig, 10MediaWiki-extensions-ZeroBanner, 06Reading-Web-Backlog, and 3 others: Zero phpunit test failure (blocks merges to MobileFrontend) - https://phabricator.wikimedia.org/T145227#2655549 (10phuedx) I've scheduled {a8733ff46c611f97b569db1c7981... 
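A minimal sketch of how one might check which filesystem actually backs a path such as /srv or /mnt, given text in the `/proc/mounts` format. The sample mount table and the device names in it are made up for illustration; they are not taken from deployment-mira or deployment-tin.

```python
def mount_point_for(path, mounts_text):
    """Return (mount_point, device) for the longest mount point containing path.

    mounts_text is expected to look like /proc/mounts:
    "<device> <mount point> <fstype> <options> <dump> <pass>" per line.
    """
    best = None
    wanted = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        device, mount_point = fields[0], fields[1]
        prefix = mount_point.rstrip("/") + "/"
        if wanted.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, device)
    return best


# Hypothetical mount table: root on one device, extended disk on /srv.
SAMPLE_MOUNTS = """\
/dev/vda1 / ext4 rw 0 0
/dev/vdb /srv ext4 rw 0 0
"""

print(mount_point_for("/srv/jenkins/home/jenkins-deploy", SAMPLE_MOUNTS))
print(mount_point_for("/mnt/home/jenkins-deploy", SAMPLE_MOUNTS))
```

On a live host the same function could be fed the contents of `/proc/mounts`; a path that resolves to `/` rather than to its own mount point is living on the root partition.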
[10:50:33] wish me luck [10:57:31] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [10:57:46] !log Changing Jenkins slaves home dir for deployment-tin and deployment-mira from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy [10:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:57:59] PROBLEM - Free space - all mounts on deployment-tin is CRITICAL: CRITICAL: deployment-prep.deployment-tin.diskspace.root.byte_percentfree (<10.00%) [10:58:46] !log Changing Jenkins slaves home dir for deployment-sca01 and deployment-sca02 from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy [10:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [11:00:15] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:05:08] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:08:32] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [11:19:16] Project beta-cxserver-update-eqiad build #317: FAILURE in 6.4 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/317/ [11:19:37] I have broken it :( [11:20:54] Project beta-cxserver-update-eqiad build #318: STILL FAILING in 0.88 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/318/ [11:21:39] Project beta-cxserver-update-eqiad build #319: STILL FAILING in 0.84 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/319/ [11:22:23] Yippee, build fixed! 
[11:22:24] Project beta-cxserver-update-eqiad build #320: 09FIXED in 7.9 sec: https://integration.wikimedia.org/ci/job/beta-cxserver-update-eqiad/320/ [11:22:59] RECOVERY - Free space - all mounts on deployment-tin is OK: OK: deployment-prep.deployment-tin.diskspace._mnt.byte_percentfree (No valid datapoints found) [11:24:59] !log removing Jenkins slave deployment-tin , deployment-mira is the new deployment master T144578 [11:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [11:26:32] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:27:22] doh [11:29:03] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:33:16] mobrovac: I have broken various services on beta cluster sorry :( [11:33:29] sigh [11:33:31] how? [11:33:38] what's up? [11:33:38] did a crazy migration [11:33:46] so that the jenkins slaves use /srv instead of /mnt [11:33:56] and I guess I screwed deployment-sca01 / deployment-sca02 :(((((((((((( [11:40:08] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [11:43:56] bah [11:46:21] somehow the deploy-service group does not exist on mira :( [11:46:39] let me check, I think I know [11:46:58] maybe it created manually [11:47:02] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:47:26] ah, no. not what I though [11:47:27] ah, no. 
not what I thought [11:47:46] maybe it is not provisioned via puppet [11:47:56] on deployment-tin it has deploy-service:x:52946:thcipriani,ladsgroup,mobrovac,twentyafterfour,elukey,joal,akosiaris,halfak,nuria,milimetric,arlolra,bd808,ssastry,cscott,krenair [11:48:02] brb [11:49:05] yeah, thcipriani|afk did some magic on deployment-tin to make it work [11:51:12] hashar, mobrovac: modules/beta/manifests/deployaccess.pp hardcodes it to the IP address of deployment-tin [11:51:19] ohhh [11:52:09] I have a puppet patch to adjust it to mira https://gerrit.wikimedia.org/r/#/c/311947/1/modules/beta/manifests/deployaccess.pp [11:53:27] I have cherry picked on beta puppet master [11:53:38] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:55:49] Sep 21 11:55:28 deployment-mediawiki06 sshd[6935]: reverse mapping checking getaddrinfo for ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs [10.68.20.135] failed - POSSIBLE BREAK-IN ATTEMPT! [11:55:53] that is from a mw host [11:55:57] error: AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1 [11:56:02] Failed publickey for mwdeploy from 10.68.20.135 [11:56:26] PROBLEM - Citoid on deployment-sca02 is CRITICAL: Connection refused [11:56:34] # dig +short -x 10.68.20.135 [11:56:34] ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs. [11:56:34] deployment-mira.deployment-prep.eqiad.wmflabs. 
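The `dig +short -x` output just above returns two names for 10.68.20.135. A small sketch, assuming the output has already been captured as text, of how that condition could be detected automatically; the function names are illustrative only.

```python
def ptr_names(dig_short_output):
    """Parse `dig +short -x <ip>` output into a list of distinct host names."""
    names = []
    for line in dig_short_output.splitlines():
        name = line.strip().rstrip(".")  # drop the trailing root dot
        if name and name not in names:
            names.append(name)
    return names


def has_ambiguous_ptr(dig_short_output):
    """True when an address resolves back to more than one name."""
    return len(ptr_names(dig_short_output)) > 1


# Output quoted from the log for 10.68.20.135.
SAMPLE = """\
ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs.
deployment-mira.deployment-prep.eqiad.wmflabs.
"""

print(ptr_names(SAMPLE))
print(has_ambiguous_ptr(SAMPLE))
```

Any tool that does a reverse lookup and takes the first answer may see either name, so an address with more than one PTR record is worth flagging before debugging further.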
[11:56:35] magic [11:57:12] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:57:22] that is deployment-mira got an IP address that has two PTR entries :( [11:59:42] PROBLEM - Free space - all mounts on deployment-sca01 is CRITICAL: CRITICAL: deployment-prep.deployment-sca01.diskspace._var.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._mnt.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._var_log.byte_percentfree (No valid datapoints found)deployment-prep.deployment-sca01.diskspace._srv.byte_percentfree (<40 [11:59:46] PROBLEM - Free space - all mounts on deployment-sca02 is CRITICAL: CRITICAL: deployment-prep.deployment-sca02.diskspace._mnt.byte_percentfree (No valid datapoints found)deployment-prep.deployment-sca02.diskspace._srv.byte_percentfree (<22.22%) [12:04:18] $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki06.deployment-prep.eqiad.wmflabs [12:04:18] Agent admitted failure to sign using the key. [12:04:20] yeah bah [12:04:39] hashar: hmm, Andrews should look into that when he's up [12:04:44] hashar: hmm, Andrew Bogott should look into that when he's up [12:04:54] I think it is just a notice [12:05:10] the dupe dns entries are a known issue, I dont think it prevents the connection [12:08:36] $ ssh-key-ldap-lookup mwdeploy [12:08:36] KeyError: 'sshPublicKey' [12:08:38] ... 
[12:08:46] the user has no key in ldap bah [12:09:26] ok [12:09:50] there is /etc/ssh/userkeys/mwdeploy though [12:17:49] Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: debug1: trying public key file /etc/ssh/userkeys/mwdeploy [12:17:49] Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: debug1: fd 4 clearing O_NONBLOCK [12:17:50] Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: debug1: restore_uid: 0/0 [12:17:50] Sep 21 12:17:19 deployment-mediawiki06 sshd[8514]: Failed publickey for mwdeploy from 10.68.20.135 port 46514 ssh2: RSA fa:e3:91:7e:86:e6:b3:9c:c5:63:df:44:71:75:cf:2f [12:21:16] on deployment-mediawiki06 the fingerprint of /etc/ssh/userkeys/mwdeploy match the fingerprint of a key in the keyholder [12:21:56] but somehow it is not presented [12:22:33] hmm it is [12:22:37] but sshd logs: Postponed publickey for mwdeploy from 10.68.20.135 port 46570 ssh2 [preauth] [12:33:38] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [12:35:46] moritzm: so I have no idea what is going on. I dont know the ssh issue is due to something missing in our puppet manifest or if it is an issue related to Jessie :( [12:41:02] mhh, so what in particular fails? [12:43:34] moritzm: from deployment-mira (Jessie) I can't use the keyholder agent-proxy [12:43:39] eg: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs [12:44:05] that yields: Agent admitted failure to sign using the key. [12:44:27] with the remote sshd apparently accepting the key but postponing it [12:47:02] so the agent admitted failure to sign using the key message is a result of you not being in the group authorized to use the key [12:47:02] but in that case it's from deployment-mira to deployment-mira, so on the same host? 
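Since the "Agent admitted failure to sign" message is attributed above to group membership, here is a rough sketch of checking whether a user appears in a group entry in the `/etc/group` / `getent group` format. The log does not say which group guards the mwdeploy key in keyholder, so the deploy-service line quoted earlier in the log is used purely as sample data.

```python
def group_members(group_line):
    """Split a 'name:password:gid:member,member,...' line into (name, members)."""
    name, _password, _gid, members = group_line.strip().split(":", 3)
    return name, [m for m in members.split(",") if m]


def is_member(user, group_line):
    """True if user is listed as a supplementary member of the group."""
    return user in group_members(group_line)[1]


# Sample data: the deploy-service entry pasted from deployment-tin in the log.
SAMPLE_LINE = (
    "deploy-service:x:52946:thcipriani,ladsgroup,mobrovac,twentyafterfour,"
    "elukey,joal,akosiaris,halfak,nuria,milimetric,arlolra,bd808,ssastry,"
    "cscott,krenair"
)

print(group_members(SAMPLE_LINE)[0])
print(is_member("thcipriani", SAMPLE_LINE))
print(is_member("hashar", SAMPLE_LINE))
```

This only inspects the supplementary member list; a user whose primary group is the one in question would not show up here. As noted later in the log, a proxy that caches permissions may also need a reload after group changes.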
[12:47:20] luckily magic tyler woke up early :} [12:47:31] heh [12:48:27] this is something I do manually on beta :(( [12:48:34] it's captured in point 1 here https://phabricator.wikimedia.org/T144647 [12:49:37] yeah I noticed that part [12:49:48] and added myself to that group [12:50:06] oh [12:50:30] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs fails still :D [12:50:51] I should download the /etc of both deployment-mira and deployment-tin and do a diff :} [12:52:25] hrm I wonder if this has something to do with the new fingerprint algorithm for ssh agent? [12:53:22] looks like the fingerprint value match but who know really [12:57:44] blerg, trying to rearm keyholder...says "keyholder: command not found" [12:57:46] Only in mira/etc: gss [12:57:46] Only in tin/etc: gssapi_mech.conf [12:57:52] yeah /usr/local/sbin [12:59:00] hrm...may have broken something good :) [13:04:35] Yippee, build fixed! [13:04:36] Project selenium-Math » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #151: 09FIXED in 35 sec: https://integration.wikimedia.org/ci/job/selenium-Math/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/151/ [13:06:18] I am handling the production swat [13:07:25] kk still poking at keyholder [13:13:58] huh. I have no idea what happened, but it seems to be working now :(( [13:14:47] The only thing I did was have ssh-agent-proxy print out the perms hash on line 231 [13:15:54] although, maybe I didn't try to use /run/keyholder/proxy.sock after reloading keyholder-proxy to reload the perms. It's possible the perms on disk changed since the proxy was last restarted? Only explanation I can think of... [13:16:08] I'm going to finish morning things bbiab. 
[13:46:24] thcipriani|afk: or I have screwed up when arming the keyholder [13:46:31] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #152: 04FAILURE in 2 min 31 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/152/ [13:48:53] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2655927 (10zeljkofilipin) [[ https://github.com/mozilla/geckodriver#firefox-capabilities | `firefoxOptions` ]] is the way to go, but is a feature o... [13:49:39] ah and sudo is scary [13:49:45] sudo -u mwdeploy -n -- .... [13:49:53] failed to add keys to /mnt/home/jenkins-deploy/.ssh/ :D [13:50:03] sudo does not reset home, should have -H maybe [13:51:43] what about sudo -iu mwdeploy -n -- ....? [13:51:45] (note the -i) [13:52:00] and -i is? [13:52:45] and it seems rsync or ssh does not look in /etc/ssh/ssh_known_hosts [13:53:01] it does some things that result in the home being set [13:53:07] -H probably does the same thing [13:53:57] yeah might [13:54:04] blerg. ssh does not look at ssh_known_hosts [13:54:20] wonder when/why that changed? [13:55:14] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2655941 (10zeljkofilipin) Firefox 47.0.1 mediawiki_selenium 1.7.2 watir-webdriver 0.9.3 selenium-webdriver 2.53.4 firefox driver ``` $ bundle exec... [13:55:30] it's not reading /etc/ssh/ssh_known_hosts, just the user's known_hosts thcipriani? 
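On the sudo point: `-H` asks sudo to set HOME to the target user's home directory, and `-i` runs a login shell, which also ends up with the target user's HOME. A minimal sketch of building such a command line from a script; the helper name and the wrapped command are hypothetical, only the `-n`, `-u` and `-H` flags come from sudo itself.

```python
def sudo_argv(user, command, reset_home=True):
    """Build an argv list for running `command` as `user` without prompting.

    -n  never prompt for a password (fail instead)
    -u  target user
    -H  set HOME to the target user's home directory
    """
    argv = ["sudo", "-n", "-u", user]
    if reset_home:
        argv.append("-H")
    return argv + ["--"] + list(command)


# Hypothetical wrapped command; `env` would simply print the environment,
# which makes it easy to see what HOME ended up as.
print(sudo_argv("mwdeploy", ["env"]))
print(sudo_argv("mwdeploy", ["env"], reset_home=False))
```

The resulting list can be passed to `subprocess.run`; comparing the HOME value printed with and without `-H` shows whether files such as `~/.ssh/known_hosts` would land in the caller's home or the target user's.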
[13:56:11] Krenair: just going off of hashar 's comment above...double-checking now [13:56:28] ah [13:56:44] thcipriani: take your time for our chat, need to head to restroom/grab coffee [13:56:55] okie doke [13:58:28] ok back [13:58:44] hrm, /etc/ssh/ssh_known_hosts seems to be working for me on deployment-mira [13:58:59] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l mwdeploy deployment-mediawiki06.deployment-prep.eqiad.wmflabs [13:59:15] so [13:59:16] worked with a blank ~/.ssh/known_hosts [13:59:25] on deployment-mira I have used ssh-keyscan to populate it [13:59:31] nice :) [13:59:33] did that hours ago [13:59:38] based on your comment on some task [13:59:39] then [13:59:44] 13:58:57 pull-master failed: Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'ci-jessie-wikimedia-199745.contintcloud.eqiad.wmflabs']' returned non-zero exit status 10 [13:59:56] deployment-mira has an IP address with two PTR entries in dns [14:00:03] oh lovely [14:00:06] oh good. [14:00:08] and somehow it seems to use the IP [14:00:11] this is T115194 [14:00:12] does a PTR lookup [14:00:17] and uses whatever is returned by dns [14:00:26] despite us having specific ip / hostnames listed [14:00:44] I guess it is a "helper" to show a hostname when IPs are given [14:00:46] side effects [14:00:47] so [14:01:08] since dupe PTR entries are not cleaned, we might have to .... recreate an instance :) [14:01:29] heh. I think at this point you're re-re-re-recreating :) [14:01:34] or I can do some magic to vanish the bad PTR entries [14:01:48] unless you have other reasons to want to recreate [14:02:09] ohh [14:02:33] Krenair: if you get access to Designate yeah dropping the wrong ptr on 10.68.20.135 would be much appreciated [14:05:56] !log deployment-mira seems ready for action and is the primary deployment server. 
Enabling jenkins to it [14:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:06:34] !log Enabling Jenkins slave deployment-mira [14:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:06:41] Project beta-mediawiki-config-update-eqiad build #5705: 04FAILURE in 1 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5705/ [14:06:41] Project beta-mediawiki-config-update-eqiad build #5706: 04STILL FAILING in 82 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5706/ [14:06:42] Project beta-mediawiki-config-update-eqiad build #5707: 04STILL FAILING in 80 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5707/ [14:06:42] Project beta-mediawiki-config-update-eqiad build #5708: 04STILL FAILING in 89 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5708/ [14:06:43] Project beta-mediawiki-config-update-eqiad build #5709: 04STILL FAILING in 0.14 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5709/ [14:06:43] Project beta-mediawiki-config-update-eqiad build #5710: 04STILL FAILING in 74 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5710/ [14:06:44] Project beta-mediawiki-config-update-eqiad build #5711: 04STILL FAILING in 67 ms: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5711/ [14:06:44] Project beta-code-update-eqiad build #122416: 04FAILURE in 0.28 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/122416/ [14:06:45] Project beta-update-databases-eqiad build #11501: 04FAILURE in 0.14 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11501/ [14:08:19] all that is me ^^^^ [14:08:26] !log deployment-mira adding puppet class beta::autoupdater [14:08:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:08:35] Er [14:08:41] I may have just fixed more than I was intending [14:09:32] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:09:42] or maybe not [14:11:44] Yippee, build fixed! [14:11:45] Project beta-mediawiki-config-update-eqiad build #5712: FIXED in 1.7 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5712/ [14:12:21] Project beta-scap-eqiad build #120971: FAILURE in 35 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120971/ [14:12:40] Dammit [14:12:51] {u'message': u'Managed records may not be updated', u'code': 400, u'type': u'bad_request' [14:13:00] Project beta-scap-eqiad build #120972: STILL FAILING in 18 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120972/ [14:13:38] scap dies with: 14:13:00 PHP Fatal error: Class 'Memcached' not found in /srv/mediawiki-staging/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 63 [14:14:55] Yippee, build fixed! 
[14:14:55] Project beta-code-update-eqiad build #122417: 09FIXED in 1 min 54 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/122417/ [14:15:17] Project beta-scap-eqiad build #120973: 04STILL FAILING in 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120973/ [14:16:24] Krenair: bad luck :/ [14:16:34] Did it [14:16:47] Went through the designate source code and found an undocumented HTTP header that allowed what I wanted [14:16:50] Krenair: awesome [14:17:38] At least, it wasn't mentioned in the docs I was reading :) [14:17:47] ahahah [14:17:59] ah, in the old docs page [14:18:05] restarted nscd on deployment-mira [14:18:08] I should fix the new designate rest api does to show these things [14:19:01] api docs* [14:19:33] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [14:20:04] Project beta-update-databases-eqiad build #11502: 04STILL FAILING in 3.8 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11502/ [14:20:30] http://git.openstack.org/cgit/openstack/designate/tree/designate/api/middleware.py#n100 [14:24:53] Project beta-scap-eqiad build #120974: 04STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120974/ [14:27:49] !log deployment-sca01 and deployment-sca02 are now broken. The CI puppet class mount /srv which ends up being only 500 MBytes [14:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:28:25] mobrovac: found the issue with deployment-sca01 / sca02 :D [14:28:32] not enough disk space eeg [14:28:53] huh [14:29:37] hashar: /srv has got only 480MB??? [14:29:39] wth? [14:30:28] hashar: jenkins and jenkins-workspace take for than 50% of that space [14:30:34] hashar: why are they even there? 
[14:33:58] mobrovac: I think they are small instances with just 20GB disk [14:34:16] and I refactored the CI class to mount /srv on the extended disk [14:34:20] which has only 500MB [14:34:33] a bit less than 20G is for / (the system) [14:34:59] Project beta-scap-eqiad build #120975: 04STILL FAILING in 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120975/ [14:35:03] that's not good though [14:35:38] and, what is jenkins doing on deplyoment-sca any way? [14:35:46] all of the build jobs should be disabled [14:35:56] since all of the services are now using scap3 for deployments [14:36:29] 10Continuous-Integration-Config, 10MediaWiki-extensions-JsonConfig, 10MediaWiki-extensions-ZeroBanner, 06Reading-Web-Backlog, and 3 others: Zero phpunit test failure (blocks merges to MobileFrontend) - https://phabricator.wikimedia.org/T145227#2656042 (10phuedx) 05Open>03Resolved I'm happy to mark this... [14:38:16] mobrovac: ahhh [14:38:22] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2656047 (10zeljkofilipin) Firefox 48.0.2 mediawiki_selenium 1.7.2 + [[ https://gerrit.wikimedia.org/r/#/c/310286/ | 310286 ]] watir 6.0.0.beta4 sel... [14:38:27] mobrovac: GOODNESS!!! going to drop the class and unmount /srv [14:38:48] hashar: no, don't umount /srv, all of the deploy code is there! 
[14:39:01] na it got mounted empty [14:39:16] (03PS3) 10Zfilipin: WIP Marionette [selenium] - 10https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540) [14:39:19] the deploy code is hidden in the /srv directory of the / partition [14:39:44] k [14:40:04] (03CR) 10jenkins-bot: [V: 04-1] WIP Marionette [selenium] - 10https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540) (owner: 10Zfilipin) [14:44:59] Project beta-scap-eqiad build #120976: 04STILL FAILING in 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120976/ [14:52:51] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2656073 (10zeljkofilipin) I don't think Marionette/geckodriver is ready yet. Setting the profile has changed. I have disabled profiles for this tes... [14:52:55] ok fixing deployment-sca01 02 [14:54:13] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:54:55] Project beta-scap-eqiad build #120977: 04STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120977/ [14:54:56] well this is why it's having a problem finding memcached https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/mwscript#L22 [14:55:03] !log removed the CI puppet class from deployment-sca01 and deployment-sca02 . Stopped services using /srv , unmounted /srv, removed it from /etc/fstab [14:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:55:14] puppet is happy on sca01 [14:55:55] should mwscript still be using php5 explicitly? Is that correct? 
[14:56:12] oh my god [14:56:16] yeah that is quite old [14:56:23] we have a task for it iirc [14:56:27] RECOVERY - Citoid on deployment-sca02 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.055 second response time [14:56:32] the reason is some job exploding on terbium.eqiad.wmnet when using hhvm [14:56:38] which was some corner case / weird issue we had [14:56:47] so the hack was to get mwscript to use php5/zend [14:56:55] ah, ok. [14:56:57] it is probably no longer needed [14:57:08] question about long running tasks, we have roughly 3 of them in search. We have a daily completion suggester build that runs about 7 hours, a weekly dump from hadoop to elasticsearch that runs ~35 hours, and a weekly export to dumps.wikimedia.org that runs for ~48 hours, do they all go on the deployments calendar? [14:57:14] but I can't remember off hand what is preventing us from moving terbium and/or mwscript back to zend [14:57:34] ebernhardson: when in doubt, yes? :) [14:57:40] well a quick install of php5-memcached will probably fix the explosions in beta-scap-eqiad /me does [14:58:05] ebernhardson: thanks for taking care of that. The idea came from Jynus (the DBA) since having long term running jobs can clash with database upgrades [14:58:33] ebernhardson: so the idea behind adding long running tasks to the deployment calendar is for people to eventually notice that two things are going to overlap somehow [14:58:54] so I guess your mileage may vary [14:59:09] mobrovac: RECOVERY - Citoid on deployment-sca02 [14:59:19] hashar: hmm, ok. 
I wasn't sure because these are scheduled things, the notice kinda read like it was for one-off maintenance actions but wasn't sure [14:59:42] RECOVERY - Free space - all mounts on deployment-sca01 is OK: OK: deployment-prep.deployment-sca01.diskspace._var.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._srv.byte_percentfree (More than half of the datapoints are undefined) deployment-prep.deployment-sca01.diskspace._mnt.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca01.diskspace._var_log.byte_perce [15:00:05] ebernhardson: I guess be bold and enquire about it on whatever task/mail that proposed to add long running tasks [15:00:12] ebernhardson: maybe in your corner case it is not needed at all [15:00:15] hashar: thnx [15:00:44] mobrovac: and puppet is all happy. How do you get services deployed on those sca hosts? [15:01:21] mobrovac: I mean: do you just git pull on the deployment server then scap deploy? [15:02:07] yes [15:03:02] so where are we with deployments on beta? [15:03:11] is deployment-mira the current master? [15:03:25] yeah [15:03:40] after a long day of madness [15:03:45] you're working on it? 
[15:04:03] thcipriani: I can't find the task about mwscript still using Zend php :( [15:04:44] thcipriani: but I found the root cause which is at https://phabricator.wikimedia.org/T132751 [15:04:46] RECOVERY - Free space - all mounts on deployment-sca02 is OK: OK: deployment-prep.deployment-sca02.diskspace._mnt.byte_percentfree (No valid datapoints found) deployment-prep.deployment-sca02.diskspace._srv.byte_percentfree (No valid datapoints found) [15:04:53] and the hack https://gerrit.wikimedia.org/r/#/c/267816/ [15:04:54] Project beta-scap-eqiad build #120978: 04STILL FAILING in 18 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120978/ [15:05:14] RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:06:13] Krenair: yeah with ops we got deployment-mira setup with jessie with keyholder/scap etc [15:06:23] made it a master to validate that all the stack works fine on Jessie [15:06:29] I just ask because I notice mediawiki-config is 6 commits behind [15:06:32] Krenair: working on it. beta-scap-eqiad keeps failing since we're using php5 explicitly in mwscript so a bunch of libraries that hhvm has built-in need to be installed. Got php5-memcached and php5-redis installed just now [15:06:33] then we are going to get rid of deployment-tin (trusty) [15:06:38] thcipriani: the relevant patch that made it use php5 was from me, https://phabricator.wikimedia.org/rOPUP8f8e7dbdd834066504e59edfc4881bb98f76072a [15:06:38] ah, yep [15:06:39] and rebuild one based on Jessie [15:06:47] thcipriani: it was to fix a deployment error. not sure if it still exists [15:06:59] oh, gotcha. 
[15:07:01] well, if there's anything you need my help with, I'm here [15:07:28] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:07:45] interesting, so scap failure was the cause of the switch to php5 :) [15:08:28] I can't report it to phabricator :( [15:08:49] Manifest error #666: out of quota, you have created too many tasks [15:09:23] hahaha, what‽ [15:09:37] hashar: phabricator thinks you work too much :) [15:10:06] filing [15:10:51] Project mediawiki-core-code-coverage build #2276: 04STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2276/ [15:14:22] thcipriani: did you get a task for lack of Memcached class on deployment server? [15:15:10] no, will file when I get a full list of needed libraries [15:16:31] although I think those two may have been it: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120979/console [15:16:41] * thcipriani files task [15:17:17] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (10hashar) [15:17:55] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656124 (10hashar) [15:18:29] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (10hashar) [15:18:50] hashar: that's backwards, curl_init_pooled only exists in hhvm [15:18:59] thcipriani: you can add it as a subtask of https://phabricator.wikimedia.org/T146285 which I have just created. 
Its description has a placeholder at the end, "task to be filled", for you to add your task :) [15:19:06] ebernhardson: oh my I am very bad :// [15:19:45] ebernhardson: so your use case is definitely fixed apparently :} [15:20:02] ebernhardson: feel free to comment about it and remove gehel/dcausse and yourself from the list of subscribers. [15:20:03] Project beta-update-databases-eqiad build #11503: 04STILL FAILING in 2.9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11503/ [15:20:43] I think that this is why we haven't run into the problem of missing libraries for mwscript with deployment-tin/tin: https://github.com/wikimedia/operations-puppet/blob/production/modules/mediawiki/manifests/packages.pp#L6-L10 [15:20:44] thcipriani: isn't the deployment-server supposed to include mediawiki::packages or something like that? [15:20:51] thcipriani: but maybe we have dropped the php-redis / php-memcached from it [15:20:59] * gehel reading back... [15:21:01] hashar: heh, see last comment [15:21:07] hashar: which task are you talking about? [15:21:12] hashar: well, the way we fixed it for curl_init_pooled is to appropriately fall back to curl_init when it's not available. I wish I could remember (or had documented) what exactly was failing when it was switched the other way... [15:21:16] gehel: https://phabricator.wikimedia.org/T146285#2656106 [15:21:24] hashar: thanks! 
[15:21:26] gehel: related to PHP lacking curl_init_pooled() [15:21:45] or the other way around, I am all confused [15:21:56] the idea behind is me suddenly feeling that mwscript should use hhvm instead of zend :} [15:22:30] thcipriani: yeah so the role::deployment::server would have to include mediawiki::packages::legacy apparently [15:22:52] I am very happy we manage to catch that on beta instead of prod :D [15:23:11] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656140 (10EBernhardson) [15:24:26] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656143 (10thcipriani) [15:24:55] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2656155 (10zeljkofilipin) a:05zeljkofilipin>03None [15:25:08] Yippee, build fixed! 
[15:25:08] Project beta-scap-eqiad build #120979: 09FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120979/ [15:25:27] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656160 (10thcipriani) [15:25:29] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656159 (10thcipriani) [15:25:56] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (10thcipriani) [15:26:38] !sal [15:26:38] https://tools.wmflabs.org/sal/releng [15:29:08] 06Release-Engineering-Team, 06Operations, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2656179 (10hashar) **status for beta cluster** deployment-mira is the new master running Jessie. The Jenkins jobs are running on it. There are... 
[15:29:51] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656143 (10hashar) [15:29:53] 06Release-Engineering-Team, 06Operations, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2656183 (10hashar) [15:31:20] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2656188 (10thcipriani) [15:32:33] !log spawned deployment-tin02 [15:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [15:34:13] RECOVERY - Puppet run on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [0.0] [15:35:34] elukey: if still around. deployment-mira is the new master and https://gerrit.wikimedia.org/r/#/c/311947/ reflect that change :} [15:35:45] picked it on beta cluster [15:35:57] there is a long tail of other messies change that goes after but that specific change is ok [15:36:14] nice! [15:36:25] though no [15:36:26] ah [15:36:30] forgot about one more setting [15:36:39] copy pasting form ./hieradata/labs/deployment-prep/host/deployment-tin.yaml :D [15:41:03] elukey: sorry spoke too fast earlier. 
https://gerrit.wikimedia.org/r/311947 is good to go and includes a hack from https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-tin :) [15:52:20] (03Abandoned) 10Zfilipin: WIP Marionette [selenium] - 10https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540) (owner: 10Zfilipin) [15:56:09] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:58:28] ^^e [15:58:30] me [16:01:24] !log deployment-tin02 applied puppet classes beta::autoupdater, beta::deployaccess, role::deployment::server, role::labs::lvm::srv [16:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:04:58] PROBLEM - Puppet run on deployment-tin02 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [16:05:18] being provisionned [16:10:00] RECOVERY - Puppet run on deployment-tin02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:09] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:57] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: mediawiki_selenium feature to show/capture Selenium WebDriver requests to remote browser. - https://phabricator.wikimedia.org/T94577#2656344 (10zeljkofilipin) I have forgot to mention, a simple change in the script displays webdriver i... [16:20:04] Project beta-update-databases-eqiad build #11504: 04STILL FAILING in 3.9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11504/ [16:23:44] Call to undefined function curl_multi_init() [16:23:46] bah [16:23:54] php5-curl? 
[16:24:03] using zend php5 on deployment-mira [16:24:07] I guess yeah [16:24:14] tyler has a task about missing packages [16:24:27] we use that php function in VE [16:24:28] we are in a meeting though [16:24:34] ok [16:25:54] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2656384 (10hashar) https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11504/ fails due to Flow eventually invoking `curl_multi_init()` Looks... [16:31:10] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Browser-Tests: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656397 (10Jdlrobson) [16:31:19] 06Release-Engineering-Team: Remove .gitreview from MediaWiki and Extensions - https://phabricator.wikimedia.org/T146293#2656409 (10thcipriani) [16:32:08] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Browser-Tests: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656424 (10Jdlrobson) https://gerrit.wikimedia.org/r/#/c/310458/6 is a first stab at this in MobileFrontend.... [16:33:00] 06Release-Engineering-Team: Remove .gitreview from MediaWiki and Extensions - https://phabricator.wikimedia.org/T146293#2656409 (10Paladox) This will break git-review, which I see some ops users use it. [16:33:43] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:38:12] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Browser-Tests: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656445 (10ovasileva) p:05Triage>03Normal [16:41:34] !log deployment-tin02 initiale provisioning is complete. 
Gotta add it as a deployment server via a puppet.git patch [16:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:41:39] thcipriani: ^^^ \O/ [16:59:07] so that has been a long day [16:59:10] and I am disappearing [16:59:15] :D [17:07:24] legoktm: can I shut down integration-puppetmaster, so it doesn't confuse our precise puppetmaster stats? [17:12:55] krenair@deployment-mira:/srv/mediawiki-staging/wmf-config$ git log HEAD..origin/master --oneline | wc -l [17:12:56] 6 [17:12:59] :( [17:13:12] thcipriani, you still working on this? [17:13:43] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:14:40] Krenair: hrm, I was focusing on database update, I didn't realize that mw-config was still broken since the job was green again https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/ [17:14:49] although I see now it hasn't run in quite a while. [17:19:32] PROBLEM - Puppet run on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [17:19:48] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Browser-Tests, 15User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656634 (10zeljkofilipin) [17:20:15] Project beta-update-databases-eqiad build #11505: 04STILL FAILING in 14 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11505/ [17:24:40] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Browser-Tests, 15User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656397 (10zeljkofilipin) a:03zeljkofilipin [17:26:09] !log cherry-pick https://gerrit.wikimedia.org/r/#/c/312044/ on deployment-puppetmaster [17:26:12] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Browser-Tests, 
15User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656653 (10zeljkofilipin) The rule of thumb so far was that if a feature is needed i... [17:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:31:00] Project beta-update-databases-eqiad build #11506: 04STILL FAILING in 14 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11506/ [17:35:19] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:37:32] Project beta-update-databases-eqiad build #11507: 04STILL FAILING in 20 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11507/ [17:37:53] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:39:01] (03PS1) 10Zfilipin: WIP Add helper to Selenium that allows you to query whether JavaScript module has loaded [selenium] - 10https://gerrit.wikimedia.org/r/312047 (https://phabricator.wikimedia.org/T146292) [17:39:07] Project beta-update-databases-eqiad build #11508: 04STILL FAILING in 24 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11508/ [17:39:25] sigh. [17:39:58] When I run update.php manually for these failures they work fine, but the full script continues to fail. 
[17:41:40] (03CR) 10jenkins-bot: [V: 04-1] WIP Add helper to Selenium that allows you to query whether JavaScript module has loaded [selenium] - 10https://gerrit.wikimedia.org/r/312047 (https://phabricator.wikimedia.org/T146292) (owner: 10Zfilipin) [17:42:52] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Browser-Tests, 13Patch-For-Review, 15User-zeljkofilipin: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656739 (10zeljkofilipin) a:05zeljkofilipin>03None Will co... [17:59:20] Yippee, build fixed! [17:59:20] Project beta-update-databases-eqiad build #11509: 09FIXED in 1 min 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11509/ [17:59:57] so that gets the beta dashboard back to green: https://integration.wikimedia.org/ci/view/Beta/ (aside from the job that has been failing for 4 months) [18:12:52] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:15:19] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [18:20:42] yuvipanda: I thought I already shut it down? 
[18:21:32] legoktm: nope is up [18:21:38] wat [18:21:48] then please, shut it down [18:21:53] I'll delete it next week [18:22:57] !log shutting down integration-puppetmaster [18:22:59] legoktm: done [18:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [18:24:21] PROBLEM - Host integration-puppetmaster is DOWN: CRITICAL - Host Unreachable (10.68.16.42) [18:28:18] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 10Reading-Web-Tech-Debt, 07Browser-Tests, and 2 others: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2656938 (10bmansurov) [18:32:04] PROBLEM - Puppet run on deployment-ms-be01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:38:10] Looks like Jenkins is throwing a fit [18:38:45] Maybe just on a couple patches [19:00:20] hashar you now known as idoine :) [19:07:03] RECOVERY - Puppet run on deployment-ms-be01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:24:37] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:59:40] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:41] PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:27:37] PROBLEM - Keyholder status on deployment-mira is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:35:59] PROBLEM - Puppet run on deployment-tin02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [22:16:02] RECOVERY - Puppet run on deployment-tin02 is OK: OK: Less than 1.00% above the threshold [0.0] [22:19:56] 06Release-Engineering-Team, 15User-greg: Create agenda outline for 2016 RelEng team offsite - https://phabricator.wikimedia.org/T138437#2657686 (10greg) Drafting over in 
https://etherpad.wikimedia.org/p/releng-offsite201610-planning for now (probably will migrate to a gdoc later, as we'll have other assets to... [22:20:32] 03Scap3: Local config deploys should use the target's current version - https://phabricator.wikimedia.org/T145373#2657689 (10thcipriani)