[00:08:04] twentyafterfour: +1 to all the things you said about the train deploy process sucking. It needs to be rebuilt from the ground up, with much slaying of strange sacred ideas. [00:08:37] My biggest contribution was making the giant checklist visible rather than buried in Sam's head [00:09:42] bd808: indeed... I figured out how to semi-automate most of it today but it was taking too long and I had to finish the deploy so I didn't have time to tie it all together [00:10:07] can deployments be simulated on beta cluster? [00:10:21] I think they could [00:10:36] we run scap there and it has the whole multiwiki setup [00:10:57] today it only runs a single branch [00:11:19] but that was something that greg-g wanted to change this quarter at one point [00:11:50] one big improvement would be simply to have the different branches cloned from a local copy instead of cloning fresh from the origin every time [00:12:02] git-new-workdir would probably do the trick [00:12:30] yeah. ^d started to look at that ~1 year ago and then never got to finish it [00:12:56] <^d> It gets a little weird with the submodules [00:13:01] <^d> But it should be possible in theory [00:13:20] to test the waters I added a remote pointing from the new branch (wmf18) to /srv/mediawiki-staging/php-1.25wmf17/ ... 
then I was able to merge in the security patches using git instead of reapplying .patch files from csteipp [00:14:19] the submodules are a separate issue ...but at least the patches to core could be handled without making a fresh clone of the entire (large) repo for each branch deploy [00:14:41] In Beta today, each "deploy" runs the beta-code-update-eqiad jenkins job which is an in-place update of all the staging things and then the beta-scap-eqiad job that just runs scap to send the staged copy out across the cluster [00:14:47] I mean, git-new-workdir just symlinks a bunch of stuff to a shared .git [00:15:14] So you could practice things by stopping the update job and doing whatever you wanted before running scap [00:15:46] I want to make a simple dashboard that collects all the details about the current state of mediawiki-staging and summarizes it so that it's not so difficult to tell what's going on [00:16:30] that would be awesome [00:17:24] basically a conglomeration of git status / git log, git submodule summary, active mediawiki versions, and some kind of representation of the wikiversions.json pointers [00:48:56] marxarelli: I filed https://phabricator.wikimedia.org/T89917 as a placeholder for mw-vagrant install parties at both hackathons [00:49:25] feel free to reference it on your application to attend either or both [00:49:35] bd808: nice! [00:49:46] * bd808 is not sure about this "community buddy" idea [02:05:57] bd808: Can it be Reedy? [02:06:12] I was wondering the same thing :) [02:07:02] * James_F grins. [03:30:06] ^d, have you seen those unexpected N4HPHP13DataBlockFullE fatals? 
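The clone-from-a-local-copy idea discussed above can be sketched with stock git: `git clone --reference` borrows objects from an existing local checkout (recorded in `objects/info/alternates`, similar in spirit to git-new-workdir's shared `.git`), so only objects missing locally would come from the origin. Everything below runs in a throwaway directory; the `origin.git`/`php-old`/`php-new` names are invented stand-ins for the real staging layout.

```shell
# Sketch: prepare a new branch directory by borrowing objects from an
# existing local clone instead of re-fetching the whole repo from origin.
# All names (origin.git, php-old, php-new) are illustrative stand-ins.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the remote origin.
git -c init.defaultBranch=master init -q --bare origin.git
git -c init.defaultBranch=master clone -q origin.git seed
git -C seed -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "base commit"
git -C seed push -q origin master

# First branch dir: a full clone (the slow path for a large repo).
git clone -q origin.git php-old

# New branch dir: share objects with the local copy via --reference;
# git records the borrowed object store in objects/info/alternates.
git clone -q --reference "$tmp/php-old" origin.git php-new
cat php-new/.git/objects/info/alternates
git -C php-new log --oneline -1
```

This is only the non-submodule half of the problem, as ^d notes above; submodules would need the same treatment per submodule.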
[03:30:38] <^d> I have not [03:32:25] looks like they stopped just over 5 minutes ago [03:33:15] <^d> Lob it in Phab if you'd like so we don't forget to follow up [03:33:16] * ^d is walking out the door to dinner [03:56:47] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #378: FAILURE in 7 min 45 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/378/ [04:54:40] ^d: I've added you as co-deployer for CX/cxserver (specifically, need first-time assistance for CX) :) [05:02:25] Yippee, build fixed! [05:02:25] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #525: FIXED in 17 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/525/ [08:57:06] 3Deployment-Systems: l10nupdate user can't access scap shared ssh key causing nightly l10nupdate sync process to fail - https://phabricator.wikimedia.org/T76061#1049229 (10Nikerabbit) Still same error in the log for last run. [09:03:12] (03PS1) 10Adrian Lang: Ignore HEADLESS and KEEP_BROWSER_OPEN for phantomjs [selenium] - 10https://gerrit.wikimedia.org/r/191552 [09:05:24] (03CR) 10Adrian Lang: "Btw, how come https://rubygems.org/gems/mediawiki_selenium/versions/0.4.2 is not in this repo?" [selenium] - 10https://gerrit.wikimedia.org/r/191552 (owner: 10Adrian Lang) [09:07:17] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [09:18:55] 3Beta-Cluster: Don't throttle WMF office IP(s) for account creation - https://phabricator.wikimedia.org/T87841#1049247 (10hashar) 5Open>3Resolved Should be good now. Thanks @JohnLewis ! 
[09:18:56] 3Beta-Cluster: Account creation throttling too restrictive on Beta Cluster - https://phabricator.wikimedia.org/T87704#1049249 (10hashar) [09:25:34] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » en,contintLabsSlave && UbuntuTrusty build #15: FAILURE in 18 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=en,label=contintLabsSlave%20&&%20UbuntuTrusty/15/ [09:41:48] 3Release-Engineering: Rethinking our deployment process - https://phabricator.wikimedia.org/T89945#1049275 (10mmodell) 3NEW a:3mmodell [09:53:40] 3Release-Engineering: Rethinking our deployment process - https://phabricator.wikimedia.org/T89945#1049302 (10Aklapper) p:5Triage>3Normal [09:53:56] 3Deployment-Systems, Release-Engineering: Rethinking our deployment process - https://phabricator.wikimedia.org/T89945#1049275 (10Aklapper) [10:45:48] (03PS1) 10Amire80: Remove failing ULS jobs: [integration/config] - 10https://gerrit.wikimedia.org/r/191566 [10:59:27] zeljkof: Great success! [10:59:28] https://integration.wikimedia.org/ci/view/BrowserTests/view/VisualEditor/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=ast,label=contintLabsSlave%20&&%20UbuntuTrusty/16/console [10:59:29] All the uploads worked. [10:59:31] I checked https://commons.wikimedia.org/wiki/Special:Contributions/LanguageScreenshotBot [10:59:32] uploads for ast appear to be correct. [11:00:36] aharoni: yeah! :) [11:00:52] you were worried about passwords but everything just worked! [11:01:03] I can now start doing the same for ContentTranslation! [11:01:05] Excitement! [11:02:41] aharoni: I think we can create a command for upload ? something like bundle exec commons_upload ? 
[11:03:04] aharoni: This will eliminate upload.rb [11:05:31] 3Deployment-Systems: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1049432 (10Aklapper) p:5Triage>3High [11:22:58] vikasyaligar: patches welcome :) [11:23:04] will it be in the gem? [11:23:22] aharoni: Yes ! [11:24:39] cool, pull request is welcome here: https://github.com/amire80/commons_upload [11:25:49] vikasyaligar: I made you a collaborator in GitHub and rubygems for this gem, too. [11:27:34] aharoni: yup ! thank you :) [11:29:59] aharoni: can that be used to send any screenshot to Commons? [11:44:47] (03CR) 10Hashar: [C: 032] "Already deleted apparently \O/" [integration/config] - 10https://gerrit.wikimedia.org/r/191566 (owner: 10Amire80) [11:51:31] (03Merged) 10jenkins-bot: Remove failing ULS jobs: [integration/config] - 10https://gerrit.wikimedia.org/r/191566 (owner: 10Amire80) [11:54:30] (03PS4) 10Hashar: VectorBeta depends on EventLogging [integration/config] - 10https://gerrit.wikimedia.org/r/191262 (owner: 10Mattflaschen) [12:01:25] (03CR) 10Hashar: [C: 032] "Jobs updated" [integration/config] - 10https://gerrit.wikimedia.org/r/191262 (owner: 10Mattflaschen) [12:03:29] kart_: yes, pretty much. [12:04:08] that's what I did with vikasyaligar and zeljko in the last GSoC, and since then the three of us are slowly maintaining and developing it. [12:04:57] kart_: until now it was all coupled to VisualEditor, but now we decoupled the generic screenshot capturing and uploading functionality, so it can be used by any MediaWiki component. [12:05:02] My guess is that CX will be next. [12:08:28] (03Merged) 10jenkins-bot: VectorBeta depends on EventLogging [integration/config] - 10https://gerrit.wikimedia.org/r/191262 (owner: 10Mattflaschen) [12:14:33] aharoni: nice! 
[12:37:35] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused [12:41:17] (03PS4) 10Hashar: zuul: test 'recheck' behavior [integration/config] - 10https://gerrit.wikimedia.org/r/184967 [12:41:23] (03PS3) 10Hashar: zuul: test check/test behavior [integration/config] - 10https://gerrit.wikimedia.org/r/184968 [14:39:56] (03CR) 10Hashar: [C: 031] Rebuild composer autoloader to support classmap-authoritative setting [integration/phpunit] - 10https://gerrit.wikimedia.org/r/188398 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [14:55:02] aharoni: around? [15:14:22] aharoni: sent you e-mail [15:14:38] zeljkof: got it :) [15:14:42] zeljkof: he is in meeting. [15:14:48] oh. here he is :) [15:14:50] kart_, aharoni :) [15:23:18] <^d> Krenair: I'm still seeing those errors you mentioned last night [15:23:49] <^d> I'm going to take this to #-core. Looks like an HHVM thing [15:23:54] ok [15:23:54] (03CR) 10Hashar: "Hello Kartik, from my comment on T87607 we have to figure out how to get jenkins-debian-glue to build with a Trusty. We probably want to r" [integration/config] - 10https://gerrit.wikimedia.org/r/190708 (https://phabricator.wikimedia.org/T87607) (owner: 10KartikMistry) [15:28:39] 3Continuous-Integration: debian-glue need multiple distributions support (add Ubuntu Trusty and Debian Jessie) - https://phabricator.wikimedia.org/T89959#1049819 (10hashar) 3NEW a:3hashar [15:41:25] (03PS1) 10Hashar: Remove '{name}-debbuild' (unused) [integration/config] - 10https://gerrit.wikimedia.org/r/191623 [15:41:37] (03CR) 10Hashar: [C: 032] Remove '{name}-debbuild' (unused) [integration/config] - 10https://gerrit.wikimedia.org/r/191623 (owner: 10Hashar) [15:49:00] (03Merged) 10jenkins-bot: Remove '{name}-debbuild' (unused) [integration/config] - 10https://gerrit.wikimedia.org/r/191623 (owner: 10Hashar) [15:55:33] hi hashar! how would I go about deploying? https://gerrit.wikimedia.org/r/188398 (updating phpunit's composer autoloader)? 
will merging it update it everywhere? [15:58:22] I think a merge will update in labs slaves and then prod will need a trebuchet deploy [16:03:01] 3Continuous-Integration, MediaWiki-extensions-WikibaseRepository, Wikidata: generate patch code coverage on gerrit patch-set upload for wikibase.git - https://phabricator.wikimedia.org/T88435#1049926 (10Lydia_Pintscher) p:5Triage>3Normal [16:04:25] legoktm: what bd808 said :-] [16:04:41] the integration/phpunit is maintained by puppet git::clone on labs instance [16:04:44] and via git-deploy on prod [16:04:47] are there docs somewhere on how to do the trebuchet deploy? [16:05:00] if trebuchet has doc yeah [16:05:02] ssh tin.eqiad.wmnet [16:05:13] cd /srv/deployment/integration/slave-scripts [16:05:13] git deploy start [16:05:13] git pull [16:05:14] [16:05:20] git deploy sync [16:05:20] r [16:05:20] r [16:05:20] r [16:05:21] r [16:05:21] y [16:05:22] r [16:05:22] r [16:05:23] r [16:05:23] r [16:05:24] y [16:05:34] r == retry [16:05:37] y = yes / continue [16:06:28] a sad but true representation of a typical trebuchet install [16:07:28] legoktm: you want to do that in integration/phpunit rather than slave-scripts for this one though [16:07:38] right :P [16:07:50] (03CR) 10Legoktm: [C: 032] Rebuild composer autoloader to support classmap-authoritative setting [integration/phpunit] - 10https://gerrit.wikimedia.org/r/188398 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [16:08:43] re [16:09:08] I typically do `git fetch; git log --stat HEAD..origin/master; git rebase origin/master` rather than `git pull` [16:09:08] bd808: with bit torrent we would know when all leechers finished fetching [16:09:13] then salt the switch to the new ver [16:09:51] Or we could get rid of salt, switch to mcollective and have a process that is actually scriptable [16:10:16] not sure whether ops will like mcollective [16:10:29] does ansible provide such orchestration system ? 
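bd808's fetch-then-inspect-then-rebase habit from above can be demoed end to end with throwaway clones; all repo and user names here are invented for the demo.

```shell
# Demo of fetch-then-inspect-then-rebase, instead of a blind `git pull`:
# you can peek at the incoming commits before applying them.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git -c init.defaultBranch=master init -q --bare origin.git

# Seed the shared history.
git -c init.defaultBranch=master clone -q origin.git committer
git -C committer -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "base"
git -C committer push -q origin master

# "tin" stands in for the deploy host's working copy.
git clone -q origin.git tin
git -C tin config user.name demo
git -C tin config user.email demo@example.org

# Someone lands a new change upstream.
git -C committer -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "change 1"
git -C committer push -q origin master

cd tin
git fetch -q                               # bring remote refs up to date
git log --oneline HEAD..origin/master      # peek: exactly what will land
git rebase -q origin/master                # then apply it deliberately
git log --oneline -1
```

The `git log HEAD..origin/master` step is the point: it shows exactly which commits will land before anything is applied, which a bare `git pull` skips.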
[16:10:34] some would, some wouldn't [16:11:43] (03CR) 10QEDK: [C: 031] Rebuild composer autoloader to support classmap-authoritative setting [integration/phpunit] - 10https://gerrit.wikimedia.org/r/188398 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [16:12:21] <^d> bd808: My crazy idea was to torrent the .git directory about [16:12:53] <^d> Then git-deploy could just do the checkout on all hosts that are up to date [16:13:54] who's QEDK? [16:13:57] The big trick is ensuring that all hosts are either up-to-date or excluded from participating in whatever the repo is providing. Running mixed version networks is a huge pain for most things [16:14:33] <^d> With torrenting we'd know when clients were done [16:14:44] <^d> (plus they'd only have to ever fetch the delta) [16:15:06] <^d> Long as we don't do destructive repacks ;-) [16:15:15] (03Merged) 10jenkins-bot: Rebuild composer autoloader to support classmap-authoritative setting [integration/phpunit] - 10https://gerrit.wikimedia.org/r/188398 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [16:15:18] legoktm: no idea. random looking gmail address and no patches anywhere [16:16:36] Missing the following configuration item: user.name [16:16:36] Missing the following configuration item: user.email [16:16:36] Please add the missing configuration items via git config or in the .trigger file [16:17:29] <^d> bd808: Anyway, it may all be very crazy [16:17:56] legoktm: that sounds weird. getting that on tin? [16:18:06] (03PS1) 10Hashar: Switch debian-glue to Trusty instance [integration/config] - 10https://gerrit.wikimedia.org/r/191626 (https://phabricator.wikimedia.org/T89959) [16:18:12] yeah, I created a ~/.gitconfig and it went away [16:18:23] ^d: crazy ideas are often the best. I want twentyafterfour to just tell us the right way to do it. 
:) [16:18:27] (03CR) 10Hashar: [C: 032] Switch debian-glue to Trusty instance [integration/config] - 10https://gerrit.wikimedia.org/r/191626 (https://phabricator.wikimedia.org/T89959) (owner: 10Hashar) [16:18:37] 2/2 minions completed fetch [16:18:37] Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [16:18:43] is it r or y? [16:18:50] 2/2 == y [16:19:13] <^d> Anything less than a full fetch you should at least check [d] first before continuing [16:19:21] *nod* [16:19:27] <^d> (that's the biggest gripe I have with it, that stale shit doesn't get purged) [16:19:35] checkout was 2/2 so I said y, and then "Deployment finished." [16:19:53] it often takes several trebuchet timeout cycles for all the minions to get the command from the salt master [16:20:07] <^d> *several hundred thousand [16:20:14] legoktm: perfect [16:20:45] * bd808 gives legoktm an "I survived using Trebuchet" sticker [16:20:45] !log updated phpunit for https://gerrit.wikimedia.org/r/188398 [16:20:48] Logged the message, Master [16:20:49] :> [16:21:50] <^d> bd808: I used trebuchet and all I got was a goddamn sticker [16:22:03] legoktm: so how many other composer installs do we need to touch to get the classmap authoritative ClassLoader.php everywhere? [16:22:21] theoretically that one should be good enough for mediawiki-config... [16:22:44] https://gerrit.wikimedia.org/r/#/c/188393/ [16:23:04] there are a bunch of 18hr old beta-mediawiki-config-update-eqiad jobs queued btw [16:23:12] blah [16:23:19] stuck I bet [16:23:34] ah [16:23:42] yeah deployment-bastion is deadlocked [16:23:46] gotta restart Jenkins :-( [16:23:53] stupid damn database update job locked it all again [16:24:11] can I try to shake it loose without a restart? 
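Going back to ^d's only-fetch-the-delta idea a bit earlier: the torrent transport itself is out of scope here, but the delta part can be sketched with `git bundle`, which packs just the commits a host is missing into one file that any transport (torrent included) could distribute. A minimal sketch with invented names:

```shell
# Sketch: ship only the missing commits to a replica as one bundle file,
# rather than having every host fetch from the origin server directly.
# origin.git, master-copy, and minion are illustrative stand-ins.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git -c init.defaultBranch=master init -q --bare origin.git

git -c init.defaultBranch=master clone -q origin.git master-copy
git -C master-copy -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "base"
git -C master-copy push -q origin master

# A minion that already has the base state.
git clone -q origin.git minion
git -C minion config user.name demo
git -C minion config user.email demo@example.org

# New work lands upstream.
git -C master-copy -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "delta"
git -C master-copy push -q origin master

# Pack only what the minion lacks: everything after its current HEAD.
old=$(git -C minion rev-parse HEAD)
git -C origin.git bundle create "$tmp/delta.bundle" "$old"..master

# The minion applies the bundle like any other remote; no origin contact.
git -C minion fetch -q "$tmp/delta.bundle" master:refs/remotes/bundle/master
git -C minion merge -q --ff-only bundle/master
git -C minion log --oneline -1
```

The bundle records its prerequisite commits, so a minion that is not actually at the expected base would refuse the fetch, which lines up with bd808's point about keeping mixed-version hosts out of the pool.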
[16:24:19] sure [16:24:26] sometimes it works to disable the slave and kill all the jobs [16:24:32] sometimes [16:24:44] from some trace I took, it seems it is the Gearman plugin that considers the executors on deployment-bastion slaves are unavailable [16:24:55] right [16:25:16] toggling the slave off and on can shake that loose most of the time [16:25:34] but you also have to kill all the stacked up jobs [16:25:45] * bd808 tries [16:26:39] !log disconnected deployment-bastion.eqiad from jenkins [16:26:44] Logged the message, Master [16:27:23] !log killed all pending jobs for deployment-bastion.eqiad [16:27:24] <^d> Running out of log space again or something else? [16:27:27] Logged the message, Master [16:28:15] !log disconnected deployment-bastion.eqiad from jenkins [16:28:18] Logged the message, Master [16:28:47] !log reconnected deployment-bastion.eqiad to jenkins [16:28:49] Logged the message, Master [16:29:01] oh [16:29:53] bah. still getting the waiting for executor message [16:29:56] one more round [16:30:46] the plugin has a bunch of logs at https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/ [16:30:51] !log disconnected and reconnected deployment-bastion.eqiad again [16:30:52] Feb 19, 2015 4:28:14 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor unlock [16:30:53] AvailabilityMonitor unlock request: null [16:30:54] Logged the message, Master [16:30:58] the last 'null' should be a hostname [16:31:21] AvailabilityMonitor lock request: deployment-bastion.eqiad_exec-2 [16:31:22] oh [16:31:33] maybe that fixed it [16:31:51] nope :( [16:32:10] #45142 (pending—Waiting for next available executor on deployment-bastion.eqiad) [16:32:15] :-( [16:32:16] https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ [16:32:27] gotta grab my daughter back home [16:32:32] it works sometimes but I've had to toggle it a bunch of times [16:32:34] only fix is to restart Jenkins :-/ [16:32:58] * hashar_ waves [16:33:08] zeljkof: aharoni: What do you 
think about https://github.com/amire80/commons_upload/pull/2 ? [16:33:34] vikasyaligar: sorry, in a meeting [16:33:41] Another thing we could try is toggling the gearman plugin off and back on [16:33:44] zeljkof: OK :) [16:34:00] * bd808 does that [16:34:23] bah that requires a restart too [16:35:16] ^d: have you done a full jenkins restart before? I haven't here. Not sure I know all the right bits [16:36:05] <^d> You just restart the service right? [16:36:32] <^d> Ah, https://integration.wikimedia.org/ci/manage [16:36:33] yeah I think that's it [16:36:42] <^d> "Prepare for shutdown" [16:36:47] <^d> Probably good to do that first [16:36:51] *nod* [16:37:11] <^d> "Stops executing new builds, so that the system can be eventually shut down safely." [16:37:20] <^d> Yeah, I'd do that, then kick the service [16:37:24] +1 [16:41:27] https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Restart_all_of_Jenkins [16:41:48] (g'morning btw) [16:42:23] <^d> Oh that's a nice shiny button [16:42:26] <^d> I like shiny buttons [16:52:20] (we need more nice shiny buttons) [16:58:39] <^d> bd808: You know the default pull strategy on deployment repos is rebase, right? Your fancy rebase commands are just extra keystrokes ;-) [16:59:08] <^d> Too many people just do `git pull` without thinking, so I made sure that does the right thing! [16:59:53] explicit is better than implicit and I can peek before I apply :P [17:01:09] <^d> Oh fetching and checking the log is always good :) [17:02:23] <^d> pulling without rebase leads to messy ugly merge commits you don't see in the gerrit-hosted version of the repo [17:02:29] <^d> And hides security patches, which is bad bad bad [17:18:05] 3Continuous-Integration: Why are the language screenshot tests stalled by so long? - https://phabricator.wikimedia.org/T89178#1050121 (10greg) p:5Triage>3Low [17:18:18] (03CR) 10Phuedx: "Are there any requirements for the folder structure inside of src/docs? 
Note well that running `make docs` generates docs in the js, php, " [integration/config] - 10https://gerrit.wikimedia.org/r/191046 (https://phabricator.wikimedia.org/T74794) (owner: 10Hashar) [17:25:32] (03Merged) 10jenkins-bot: Switch debian-glue to Trusty instance [integration/config] - 10https://gerrit.wikimedia.org/r/191626 (https://phabricator.wikimedia.org/T89959) (owner: 10Hashar) [17:55:57] 3Continuous-Integration, Wikimedia-Fundraising-CiviCRM: CI for Civi: provision and run tests under Jenkins/Zuul - https://phabricator.wikimedia.org/T86103#1050236 (10awight) [17:58:31] cscott: Can you kick Jenkins (or tell me who to ask; Krinkle|detached is |detached and hashar is absent)? It's got 328 items in the queue and counting, with nothing executing. [17:58:40] <^d> I already did kick it [17:58:48] Ah. Darn. [17:59:05] <^d> And I saw at least one job go through post-restart [17:59:14] * James_F sighs. [17:59:33] <^d> All the slaves are connected [18:18:12] chrismcmahon: Here you go: https://gerrit.wikimedia.org/r/#/c/191655 [18:19:18] thanks vikasyaligar, I merged it [18:19:30] chrismcmahon: thank you :) [18:22:13] <^d> greg-g: Jenkins is really hurting, need more help. [18:23:04] integration-slave1007 (offline) [18:23:04] integration-slave1008 (offline) [18:23:04] integration-slave1009 (offline) [18:23:06] why? [18:23:54] <^d> 7 and 8 were because low disk space, automatic [18:23:55] <^d> That's ok [18:24:06] <^d> 9 says d/c by Krinkle|detached for debugging [18:24:25] <^d> I'm not worried about that. I'm worried because no jobs except beta-scap-update seem to be making it into an executor [18:24:32] <^d> zuul's backed up with like >300 jobs [18:25:20] blerg [18:25:32] how did that break? [18:26:06] is it worth kicking jenkins the "hardcore way"? 
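An aside on ^d's point that the deployment repos default to pull-with-rebase: that behaviour is a single config knob. The throwaway-repo demo below (all names invented) shows a plain `git pull` rebasing a local patch on top of upstream instead of creating a merge commit.

```shell
# Demo: with pull.rebase=true, `git pull` rebases local work onto the
# upstream branch instead of creating a merge commit.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git -c init.defaultBranch=master init -q --bare origin.git

git -c init.defaultBranch=master clone -q origin.git alice
git -C alice config user.name demo
git -C alice config user.email demo@example.org
echo base > alice/app.txt
git -C alice add app.txt
git -C alice commit -q -m "base"
git -C alice push -q origin master

# "deployhost" stands in for a checkout configured like the deploy repos.
git clone -q origin.git deployhost
git -C deployhost config pull.rebase true
git -C deployhost config user.name demo
git -C deployhost config user.email demo@example.org

# Upstream moves ahead...
echo more >> alice/app.txt
git -C alice commit -qam "upstream change"
git -C alice push -q origin master

# ...while the deploy host carries a local (e.g. security) patch.
echo patch > deployhost/local-patch.txt
git -C deployhost add local-patch.txt
git -C deployhost commit -q -m "local security patch"

# A plain pull now rebases the local patch on top: linear history.
git -C deployhost pull -q
git -C deployhost log --oneline
```

Without `pull.rebase`, the same pull would produce a merge commit that does not exist in the gerrit-hosted repo, which is exactly the mess ^d describes.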
[18:26:17] https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Restart_all_of_Jenkins [18:26:38] let's check zuul's gearman [18:26:53] <^d> Ah yes, let's check gearman [18:26:58] https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Debugging [18:29:03] ffs I did it [18:29:11] gearman plugin is disabled in jenkins [18:29:29] jenkins will need to restart again [18:29:44] * ^d gives bd808 another of his shirts [18:29:47] <^d> :) [18:29:53] !log restarting jenkins because I messed up and disabled gearman plugin earlier [18:29:58] Logged the message, Master [18:30:10] less work for me [18:30:28] can I nuke the running sauce tests? [18:30:38] bd808: nuke away [18:31:06] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #324: ABORTED in 24 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/324/ [18:31:13] kill 'em all [18:31:16] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #478: ABORTED in 21 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/478/ [18:32:02] Please wait while Jenkins is getting ready to work... [18:32:18] * greg-g brews him some coffee [18:32:45] somebody should take away my rights to do things while I'm pretending to be a PM [18:32:56] bam! [18:33:03] things are flowing in from gearman now [18:33:13] all of the things [18:33:28] * bd808 expects some slaves to fall on their faces [18:34:57] bd808: actually, for future reference, you can kill browser test builds any time you want. the daytime builds are less valuable than the overnight ones, and even those aren't sacred [18:35:25] bd808: Aha, thanks. 
[18:35:40] and Thu and Fri builds less valuable than Mon/Tue/Wed [18:36:13] just don't stop them forever :-) [18:36:32] chrismcmahon: you could go on vacation though! [18:37:03] heh [18:37:35] !log cleaned up mess in /tmp on integration-slave1007 [18:37:39] Logged the message, Master [18:38:53] !log brought integration-slave1007 back online [18:38:55] Logged the message, Master [18:41:56] !log cleaned up mess in /tmp on integration-slave1008 [18:41:59] Logged the message, Master [18:44:03] ^d: you're not going to believe this crap. (pending—Waiting for next available executor on deployment-bastion.eqiad) [18:44:13] fffffffffuuuuuuuuu [18:44:18] <^d> ... [18:45:11] bd808: Again? [18:45:23] yeah. [18:45:29] * James_F sighs. [18:45:52] https://crayfisher.files.wordpress.com/2012/07/double_facepalm_tng1.jpg [18:47:11] <^d> Why does it keep disconnecting from dp-bastion? [18:47:56] There is some gearman lockup that only seems to strike that box. It's a bug in the jenkins gearman plugin that deadlocks [18:48:16] <^d> And we have to kick the master? [18:48:20] hashar has stacktraces in a phab task somewhere [18:48:54] <^d> In the meantime, can we bring 7 and 8 back up? 
Some jobs are half-stuck on them [18:49:00] fixes are shake dp-bastion violently or restart jenkins yet again [18:49:14] I brought them up and they toggled right back down [18:49:20] <^d> boo [18:49:22] <^d> silly jenkins [18:49:33] df looks good but they haven't told jenkins apparently [18:49:48] "go home jerkins you're drunk" [18:50:22] <^d> Should've !logged that [18:50:56] <^d> Well, we can kick -bastion again [18:51:05] the problem isn't there [18:51:08] <^d> I'd rather not kick jenkins until the zuul queue goes down [18:51:11] yeah [18:51:25] I just did the jenkins detach/reattach dance [18:51:50] * greg-g sighs [18:53:12] looks like 07 and 08 are staying alive now [18:54:44] <^d> queue's down to 100 [18:55:18] https://graphite.wikimedia.org/render/?from=-8hours&height=180&width=400&target=alias(color(zuul.geard.queue.running.value,%27blue%27),%27Running%27)&target=alias(color(zuul.geard.queue.waiting.value,%27red%27),%27Waiting%27)&target=alias(color(zuul.geard.queue.total.value,%27888888%27),%27Total%27)&title=Zuul%20Geard%20job%20queue%20(8%20hours)&_=1424372094169 [18:55:23] fun graph [18:57:20] other than active tests the build queue is just sauce labs stuff again [18:58:58] !log took deployment-bastion jenkins connection offline and online 5 times; gearman plugin still stuck [18:59:01] Logged the message, Master [19:00:31] So there's this -- https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock [19:01:24] !log toggling gearman plugin in jenkins admin console [19:01:28] Logged the message, Master [19:02:48] !log VICTORY! deployment-bastion jenkins slave unstuck [19:02:54] Logged the message, Master [19:03:40] .... for one f'ing job? [19:03:50] no there it goes again [19:04:15] <^d> https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/45147/console wfm? [19:04:52] <^d> And now https://integration.wikimedia.org/ci/job/beta-scap-eqiad/42319/console [19:04:54] yeah. 
I killed one that had been waiting for 5+ minutes then the queue started to move again [19:04:55] "there it goes again" I think was a positive not negative statement [19:05:38] now we just wait for the single-file line for gate-and-submit [19:06:10] no cutting please [19:06:38] I was never here [19:06:42] * bd808 slinks away [19:32:44] 3MediaWiki-extensions-GWToolset, Multimedia, Beta-Cluster: Creating directory with special characters - https://phabricator.wikimedia.org/T75725#1050759 (10Bawolff) I can't reproduce this (locally using the username Léna, and on beta commons using username Léna2). For example, I successfully uploaded http://com... [19:45:30] !log Destroying integration-slave1009 and re-imaging [19:45:35] Logged the message, Master [19:53:23] (03PS1) 10Hashar: debian-glue can now use a different distribution [integration/config] - 10https://gerrit.wikimedia.org/r/191676 (https://phabricator.wikimedia.org/T89959) [19:53:41] (03CR) 10Hashar: [C: 032] debian-glue can now use a different distribution [integration/config] - 10https://gerrit.wikimedia.org/r/191676 (https://phabricator.wikimedia.org/T89959) (owner: 10Hashar) [19:55:57] 3Continuous-Integration: debian-glue need multiple distributions support (add Ubuntu Trusty and Debian Jessie) - https://phabricator.wikimedia.org/T89959#1050959 (10hashar) [19:56:59] 3Continuous-Integration: debian-glue need multiple distributions support (add Ubuntu Trusty and Debian Jessie) - https://phabricator.wikimedia.org/T89959#1049819 (10hashar) To change the distribution, we just have to `export distribution=precise` which build-and-provide-package recognize. '{name}-debian-glue' l... 
[20:01:08] (03Merged) 10jenkins-bot: debian-glue can now use a different distribution [integration/config] - 10https://gerrit.wikimedia.org/r/191676 (https://phabricator.wikimedia.org/T89959) (owner: 10Hashar) [20:04:55] (03PS4) 10Hashar: Enable jenkins for operations/debs/contenttranslation [integration/config] - 10https://gerrit.wikimedia.org/r/190708 (https://phabricator.wikimedia.org/T87607) (owner: 10KartikMistry) [20:06:02] 3Continuous-Integration: debian-glue need multiple distributions support (add Ubuntu Trusty and Debian Jessie) - https://phabricator.wikimedia.org/T89959#1050984 (10hashar) 5Open>3Resolved Haven't tried Jessie, but Precise/Trusty should work. All debian-glue jobs are using `$distribution=trusty`. [20:06:03] 3Continuous-Integration, MediaWiki-extensions-ContentTranslation, ContentTranslation-Deployments: Enable Debian CI tests on all Apertium packages - https://phabricator.wikimedia.org/T87607#1050986 (10hashar) [20:06:59] (03CR) 10Hashar: [C: 032] "Rebased and integrated changes made for T89959: debian-glue need multiple distributions support (add Ubuntu Trusty and Debian Jessie)" [integration/config] - 10https://gerrit.wikimedia.org/r/190708 (https://phabricator.wikimedia.org/T87607) (owner: 10KartikMistry) [20:13:42] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #478: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/478/ [20:13:56] (03Merged) 10jenkins-bot: Enable jenkins for operations/debs/contenttranslation [integration/config] - 10https://gerrit.wikimedia.org/r/190708 (https://phabricator.wikimedia.org/T87607) (owner: 10KartikMistry) [20:17:28] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #477: FAILURE in 21 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/477/ [20:26:20] 
3Quality-Assurance, VisualEditor, VisualEditor-MediaWiki: Update language_screenshot test - https://phabricator.wikimedia.org/T89370#1051039 (10Cmcmahon) 5Open>3Resolved [20:37:22] 3Continuous-Integration, MediaWiki-extensions-ContentTranslation, ContentTranslation-Deployments: Enable Debian CI tests on all Apertium packages - https://phabricator.wikimedia.org/T87607#1051076 (10hashar) I have migrated the debian-glue jobs to Trusty instances and additionally made them to explicitly `export... [20:54:54] hey, is there a way for me to trigger full V+2 tests for a commit created by a non-whitelisted contributor? [20:55:18] MatmaRex: comment "recheck" [20:55:31] legoktm: will that run full tests? not just the V+1 ones? [20:55:40] yup [20:58:19] thanks legoktm [20:58:24] :) [21:03:12] (03PS1) 10Hashar: Experimental integration-zuul-debian-glue job [integration/config] - 10https://gerrit.wikimedia.org/r/191701 (https://phabricator.wikimedia.org/T48552) [21:03:25] (03CR) 10Hashar: [C: 032] Experimental integration-zuul-debian-glue job [integration/config] - 10https://gerrit.wikimedia.org/r/191701 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [21:10:20] (03Merged) 10jenkins-bot: Experimental integration-zuul-debian-glue job [integration/config] - 10https://gerrit.wikimedia.org/r/191701 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [21:15:57] (03PS1) 10Hashar: git-buildpackage config [integration/zuul] - 10https://gerrit.wikimedia.org/r/191765 [21:16:48] (03CR) 10Hashar: "check experimental" [integration/zuul] - 10https://gerrit.wikimedia.org/r/191765 (owner: 10Hashar) [21:21:30] Are phabricator and jenkins inaccessible for anyone else, or is it just me? [21:22:07] phabricator is fine [21:22:19] jenkins seems ok as well [21:22:25] gerrit isn't loading for me [21:22:31] phab is fine though [21:22:43] gerrit is fine for me, but 208.80.154.241 isn't reachable... [21:23:21] (right when chad loses his ssh keys...) 
[21:23:48] I'm guessing it's a comcast issue [21:24:21] I've been having issues with other sites as well [21:24:24] so probably yeah [21:25:38] * greg-g loads sonic.net [21:26:13] (Abandoned) Hashar: git-buildpackage config [integration/zuul] - https://gerrit.wikimedia.org/r/191765 (owner: Hashar) [21:27:24] (bah, only up to 20 Mbps for sonic, but I can only hope they'll bring fiber soon) [21:28:17] (we're on the list: https://www.sonic.com/gigabit-fiber-internet ) [21:28:39] re: fiber, hear hear. i'm stuck at 5 Mbps :/ [21:28:48] eek! [21:28:49] I hate comcast sometimes [21:29:00] s/ sometimes// [21:29:27] :) [21:29:29] So true [21:30:27] marxarelli: sonic.net support good though? [21:32:21] greg-g: it's incredible, but i haven't had to call since they set it up [21:32:37] awesome [21:33:14] * greg-g might just do that, he has 25 Mbps with comcast, no big deal going to 20 [21:33:50] bah, now smtp.google.com isn't responding.... [21:34:12] (smtp.gmail.com I mean) [21:34:14] csteipp: I'm on comcast and been having issues with about half the internet today :/ [21:34:22] including phab [21:34:26] gerrit works though [21:34:38] random routing issues suck [21:34:49] legoktm: Yep, gerrit was working for me, but tons of other stuff started dropping off. 
[21:36:18] * greg-g vpns into the office [21:36:29] Continuous-Integration, VisualEditor, Flow: Flow tests fails to run with VisualEditor installed - https://phabricator.wikimedia.org/T86920#1051322 (Jdforrester-WMF) [21:36:40] VisualEditor, Beta-Cluster: Beta Cluster: API PrefixSearch is taking a very long time to return, and returns nothing when it does - https://phabricator.wikimedia.org/T74332#1051329 (Jdforrester-WMF) [21:38:28] (PS1) Hashar: Merge branch 'upstream-debian-sid' into debian [integration/zuul] (debian) - https://gerrit.wikimedia.org/r/191770 [21:38:48] (CR) Hashar: "check experimental" [integration/zuul] (debian) - https://gerrit.wikimedia.org/r/191770 (owner: Hashar) [21:41:26] (PS2) Hashar: Merge branch 'upstream-debian-sid' into debian [integration/zuul] (debian) - https://gerrit.wikimedia.org/r/191770 [21:41:47] (CR) Hashar: "check experimental" [integration/zuul] (debian) - https://gerrit.wikimedia.org/r/191770 (owner: Hashar) [21:44:25] (PS3) Hashar: Merge branch 'upstream-debian-sid' into debian [integration/zuul] (debian) - https://gerrit.wikimedia.org/r/191770 [21:44:41] (CR) Hashar: "recheck" [integration/zuul] (debian) - https://gerrit.wikimedia.org/r/191770 (owner: Hashar) [21:45:20] (CR) Hashar: "check experimental" [integration/zuul] (debian) - https://gerrit.wikimedia.org/r/191770 (owner: Hashar) [22:04:49] (CR) Hashar: "Whatever is under /src/docs/ (which is /docs/ relatively to the source repo working copy) will be rsynced as it to https://doc.wikimedia." [integration/config] - https://gerrit.wikimedia.org/r/191046 (https://phabricator.wikimedia.org/T74794) (owner: Hashar) [22:06:58] Project beta-scap-eqiad build #42335: FAILURE in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/42335/ [22:24:36] !next [22:24:53] hello deployers, could you do me a favor and sync one appserver? 
[22:25:09] it has been reinstalled, so it needs to be synced [22:25:27] i don't wanna hack dsh groups [22:29:12] twentyafterfour: ^ [22:30:28] Yippee, build fixed! [22:30:29] Project beta-scap-eqiad build #42336: FIXED in 22 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/42336/ [22:32:27] greg-g, mutante: I don't think I have dsh access [22:32:56] i was wondering how you do it without changing them [22:33:12] but if you ever need to, they are in puppet [22:33:26] so i just went to the server and ran sync-common [22:33:31] after Krenair helped me find it [22:33:44] 22:31:49 Copying to mw1062.eqiad.wmnet from tin.eqiad.wmnet [22:34:00] i hope that makes it so i can put it back in dsh now [22:34:03] well scap has its own ssh keys [22:34:09] and you guys don't run into issues on next deploy [22:34:29] twentyafterfour: you have dsh access [22:34:30] we always have this problem when hardware breaks, a server gets removed and fixed later [22:35:09] i reinstalled it, added back to puppet and then needed the initial sync [22:35:44] puppet will refresh if the /srv/mediawiki dir is completely absent but it doesn't update otherwise I don't think [22:35:55] bd808: I only said that because the one dsh command on https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys ..failed for me with a permission error [22:36:22] bd808: does "sync-common" sound like all there is to it? [22:36:28] yup [22:36:31] rsync finished fine.. cool [22:36:42] i'm re-adding it to dsh and pybal then [22:36:55] mutante: https://github.com/wikimedia/operations-puppet/blob/5bd92dcb68d945fd807515fc00a42249c58c9115/modules/mediawiki/manifests/scap.pp#L48-L55 [22:37:20] is there a way to have puppet run sync-common when it rejoins the flock? [22:37:25] and sync-common is what scap runs on each host [22:37:43] oh, it already does that?? 
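[Editor's note: the puppet behavior described above — sync-common runs only when /srv/mediawiki is entirely absent, so a reinstalled host bootstraps itself but a stale tree is never refreshed — can be sketched as follows. The function name, paths, and command are illustrative, not the actual puppet code linked above.]

```python
import os
import subprocess

def ensure_synced(mediawiki_dir="/srv/mediawiki", sync_cmd=("sync-common",)):
    """Run the sync command only when the deploy directory is entirely absent.

    Mirrors the behavior described in the channel: puppet bootstraps a
    freshly reinstalled appserver, but never refreshes an existing (possibly
    stale) tree. Returns True if a sync was triggered.
    """
    if os.path.isdir(mediawiki_dir):
        return False  # tree exists, even if out of date: puppet does nothing
    subprocess.check_call(sync_cmd)
    return True
```

This is why mutante had to log in and run sync-common by hand: the directory survived in some state, so the bootstrap condition never fired.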
nice [22:37:49] greg-g: we'd need a custom resource to decide if it was needed [22:38:20] ah [22:38:32] if we had something on each host that told what "revision" was there then we could have puppet check that against some canonical source [22:38:46] we have a bug for part of that [22:38:59] is that checking for directory php-1.25wmfFOO ? [22:39:25] ^^ might be something to add to https://phabricator.wikimedia.org/T89945 twentyafterfour :) [22:39:33] maybe like this: [22:39:45] let mediawiki puppet role add a salt grain with the mediawiki version [22:39:50] then check the salt grain [22:39:50] it would need to check for patches too, not just the presence of the directory [22:42:13] *nod* we need a .version file or something that we push to all hosts on each sync-* or scap [22:42:31] and then a way to check the "current" version from tin [22:42:56] or just kill all this shit and use git for real on all the hosts [22:43:06] then `git pull` whenever you want [22:43:31] bd808: that's how I'd like to do it ;) [22:43:41] oooorrrrr, push out the hhvm binary [22:43:46] ^ [22:43:49] adding a grain, i think i can do the patch, but it would still mean a puppet change on each MW version bump [22:43:57] that somebody needs to do and merge [22:44:10] nooooo [22:44:24] puppet change per week deploy? no [22:44:54] greg-g: that would also have to be updated for each swat [22:45:03] yeah, hellz no [22:45:05] the last thing we need is more manual stuff [22:45:47] how about asking Special:Version of en.wp [22:45:50] twentyafterfour: have I told you about all the cool things in the Facebook deploy process? [22:45:58] and comparing that to what the server has [22:46:01] mutante: not good enough at all [22:46:07] mutante: but what about worst case when we're deploying during an outage? :) [22:46:10] can we make it good enough? 
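[Editor's note: bd808's ".version file pushed on each sync-* or scap" idea could look something like the sketch below. Everything here is hypothetical — no such file existed at the time; the file name, JSON format, and function names are invented for illustration.]

```python
import json
import time
from pathlib import Path

def write_version_stamp(stage_dir, serial):
    """Drop a .version stamp into the tree before each sync-* / scap.

    Every deploy action (full scap, swat, security patch) would bump the
    serial, giving hosts something cheap to compare against the canonical
    copy on tin. Format and location are purely illustrative.
    """
    stamp = {"serial": serial, "written": int(time.time())}
    Path(stage_dir, ".version").write_text(json.dumps(stamp))
    return stamp

def host_in_sync(host_stamp, canonical_stamp):
    # a host is current only if its serial matches the canonical one;
    # this also catches one-off patches, which a bare directory check misses
    return host_stamp["serial"] == canonical_stamp["serial"]
```

A serial rather than a git hash is the key point from the discussion: the git hash only changes on a full scap, while a serial can be bumped on every change, patches included.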
[22:46:10] bd808: no [22:46:14] the git hash only updates on a full scap [22:46:19] make Special:Version show more, i mean [22:46:23] I'm familiar with facebook, a little bit [22:46:45] Their hhvm deploy process is basically: [22:46:50] greg-g: ok, version of wikitech-static ... /me hides [22:47:01] make a squashfs filesystem [22:47:22] fill it with hhvm binary, hhbc cache, other assets [22:47:28] torrent it to the cluster [22:47:38] mutante: the problem is we do so much one-off patching that there would really just need to be a centrally published serial number that increments with each change [22:47:39] touch a file to depool a server [22:47:47] wait for it to drain [22:47:55] stop hhvm [22:48:01] unmount current version [22:48:05] mount new version [22:48:10] prime cache [22:48:11] twentyafterfour: i wasn't aware that doesn't exist :p [22:48:14] rm stop file [22:48:36] (PS1) Dduvall: Run CentralAuth browser tests at en.m.wikipedia.beta.wmflabs.org [integration/config] - https://gerrit.wikimedia.org/r/191798 [22:48:45] mutante: it should exist. I'm happy to build it [22:49:04] bd808: that seems complex but honestly less scary than what we do now [22:49:24] twentyafterfour: so maybe it could be a post-commit hook in the mw repo, that automatically increments it [22:49:38] on merge ++ [22:50:37] all deployment changes go through tin so any scap action could increment the value ...just need a place to publish it (I'd vote for directly publishing it via http on tin) [22:51:26] or put that new version number in special:version as well? 
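[Editor's note: the Facebook-style flow bd808 lists above, gathered into one ordered command list. The commands, paths, and tool names (torrent-fetch, wait-for-drain, prime-cache) are guesses at the shape of the process, not a real deploy script.]

```python
def hhvm_image_deploy_steps(version):
    """The squashfs-image deploy flow described above, as an ordered list.

    Returning commands as data (rather than running them) keeps the sketch
    inspectable; a real runner would execute each step and bail on failure.
    """
    img = f"/srv/images/hhvm-{version}.squashfs"
    return [
        f"mksquashfs build/ {img}",           # hhvm binary + hhbc cache + assets
        f"torrent-fetch {img}",               # distribute image to the cluster
        "touch /var/run/depool",              # depool this server
        "wait-for-drain",                     # let in-flight requests finish
        "service hhvm stop",
        "umount /srv/hhvm/current",           # unmount current version
        f"mount -o loop {img} /srv/hhvm/current",  # mount new version
        "prime-cache",
        "rm /var/run/depool",                 # repool
    ]
```

The appeal, as noted in the chat, is that each host swaps an immutable image atomically instead of rsyncing thousands of files into a live tree.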
[22:51:34] then you can even check via API [22:52:10] yeah that would probably work [22:53:31] -1 for more crap in special:version [22:54:00] version number = crap? :) [22:54:01] we run http on tin for trebuchet already [22:54:20] well it would be custom something or other just for the wmf cluster [22:54:46] something simple that you can curl with no extra processing would be nice [22:54:50] we could try to jam it in the version number I guess 1.25wmf.37 [22:54:58] we could try to jam it in the version number I guess 1.25wmf18.37 [22:55:00] shouldn't have to parse html or run a full api client [22:55:06] right [22:55:13] there is also http://config-master.wikimedia.org/ [22:55:21] which is used for pybal currently [22:55:24] just a .txt with a hash/timestamp/whatever in it [22:55:40] what's config-master [22:56:00] root only stuff for pybal I think [22:56:19] a webserver made to store config (for which appserver is in the cluster and which isn't) [22:57:21] it could be super smart if it knew that one appserver has a wrong serial [22:57:25] and deactivate it [22:59:08] even though that list of which appserver is in the cluster or not is hand maintained out of vcs... :P [23:00:24] yea, that's why i say it would be nice if it could automatically do that.. 
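[Editor's note: a sketch of the "just a .txt you can curl, no html parsing, no api client" idea from the exchange above. The one-line "serial plus git hash" format and both function names are invented for illustration; nothing like this was actually published on tin or config-master.]

```python
def parse_serial(text):
    """Parse a one-line serial file fetched over http.

    Hypothetical format: '<serial> <git-hash>', hash optional.
    """
    serial, _, rest = text.strip().partition(" ")
    return int(serial), rest or None

def hosts_out_of_sync(canonical_text, host_texts):
    """Return hosts whose serial differs from the canonical copy.

    These are the hosts a smarter config-master could automatically
    depool, per mutante's suggestion above.
    """
    canonical, _ = parse_serial(canonical_text)
    return [host for host, text in host_texts.items()
            if parse_serial(text)[0] != canonical]
```

A plain text file keeps the check curl-able from anywhere, which matters in the worst case greg-g raises: deploying during an outage, when the API may be down.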
cycle closed [23:00:37] * greg-g nods [23:01:18] so much here to grapple with, it's almost like we need a cross-team (RelEng+Ops) group to drive and maintain these kinds of changes [23:01:20] also just because you said http and this is already an http server that could be used [23:01:47] greg-g: yea, just merge engineering teams into one and get rid of the overhead [23:01:51] (which, btw, is what I'm proposing during the budgeting process conversation that is kicking off in earnest tomorrow for engineering) [23:02:00] mutante: ish [23:02:03] :) [23:02:20] I want a small nimble group to work on this that has buy-in from ops and releng [23:03:19] uhmm yea, but on the other hand if you put every ops into a special project you have no regular ops left [23:04:00] budget meeting sounds "fun" :p [23:05:36] mutante: there is the possibility of hiring :) [23:07:14] <^d> I never realized how much I relied on my bouncer until today. [23:07:20] <^d> spof? [23:07:39] hah, i guess that depends on the budget :) [23:07:57] ^d: do you want the OIT bouncer? *g* [23:08:09] there was that project wasn't there [23:08:21] <^d> Yes but hours/day got in the way [23:08:34] ^d: just have two bouncers running :P [23:08:43] ^d: thcipriani needs sudo privs (he's not in the nda group in ldap?) [23:08:47] <^d> Then I'd need a second VM [23:08:57] <^d> twentyafterfour: I know, working on it. [23:09:07] <^d> Something something, legal hoops to jump through [23:09:09] I have a shared host and a VM [23:09:12] how are sudo privs related to that ldap group? [23:09:14] ^d: ok cool, figured you might have missed it [23:09:20] heh, thanks twentyafterfour and ^d [23:09:28] thcipriani: you signed all the forms they put in front of you at HR, right? 
[23:09:47] yup [23:10:09] <^d> greg-g: I feel like HR would've made a point to tell you if he had refused to sign some :) [23:10:23] if you say yes, I'm willing to take the fall and say "just freaking add him already, he's an employee who signed all his paperwork, just because we don't have a good HR -> Legal -> ops/us workflow for NDAs doesn't matter" [23:10:32] ^d: you'd think :) [23:10:59] but that LDAP group is just for logins on icinga and graphite [23:11:02] <^d> "So Greg, this new hire of yours...he won't sign the non-discrimination policy" [23:11:13] * bd808 got in based on "works for Rob" [23:11:14] if it turns out you sneakily wrote "Not Tyler" in cursive everywhere, well, good on you [23:11:19] <^d> mutante: He's already in `wmf` so no big deal there [23:11:26] <^d> It's the nda groups in beta cluster he needs [23:11:31] ^d: then i don't get what it has to do with sudo [23:11:46] <^d> sudo on beta requires nda [23:11:51] <^d> because $reasons [23:11:58] hah, there is a group in beta cluster called nda that is unlike the other LDAP group called nda? [23:12:05] oh. that's a dumb thing to worry about [23:12:08] <^d> Which is also not the Phab nda, right. 
[23:12:09] * bd808 will add him [23:12:12] lol [23:12:24] but it's even the same LDAP server that is wikitech and labs :p [23:12:45] <^d> We have 3 nda groups, all of which are managed separately :) [23:12:54] hahaha [23:13:05] if it has the keyword NDA in it, run [23:13:46] !log added Thcipriani to under_NDA sudoers group; WMF staff [23:13:51] Logged the message, Master [23:13:52] (it will never change though if we just keep doing it manually when needed instead of having it on the onboarding workflow docs) [23:14:37] give HR access to LDAP already so they can set it :) just needs some nice web UI [23:15:07] <^d> Different LDAP [23:15:10] <^d> That's OIT ldap [23:15:14] <^d> They don't add to wikitech ldap [23:15:18] * ^d cries a little [23:15:29] neat, I have sudo access, but I'm not really at liberty to talk about it: I signed an NDA, probably. [23:16:11] <^d> Also, if you disclose stuff from beta you'd be disclosing some of the most boring data we have :p [23:16:17] ^d: the OIT ldap needs to die then [23:16:20] <^d> Which is part of why nda-for-beta makes me lol. [23:16:48] <^d> mutante: Suggest it to techsupport@ ;-) [23:17:05] how is it even possible.. wikitech uses ldap to determine project memberships [23:17:14] being admin in a project gives you sudo [23:17:17] beta is a labs project [23:17:28] <^d> Sudo policies for beta aren't default [23:17:38] so why isn't this just adding them in wikitech to the project as admins [23:21:24] Can't project admins add people to the nda group though? [23:26:20] <^d> Yes, so you can escalate if you get added to the former :p [23:32:33] "escalate"