[00:08:49] awight: AndyRussG is this OK? https://integration.wikimedia.org/ci/view/BrowserTests/view/CentralNotice/
[00:09:26] chrismcmahon: wooo, looks very pretty indeed! thanks much :)
[00:09:44] yay!
[00:09:58] awight: AndyRussG it is also available via https://integration.wikimedia.org/ci/view/BrowserTests/view/-All/
[00:10:24] chrismcmahon: nice!
[00:11:25] AndyRussG: I'm kicking the mobile test, thinking it should pass now.
[00:11:59] awight: cool... yea I was looking for the kick button a little while ago...
[00:12:30] AndyRussG: ah, u have to log in (LDAP password)
[00:12:50] awight: ah heh that makes sense (/me facepalms)
[00:13:25] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[00:13:44] awight: also, not sure you saw this, I was not sure it was correct but it seems to be https://gerrit.wikimedia.org/r/#/c/174839
[00:14:29] chrismcmahon: sure, thanks for noticing that! Hopefully the world doesn't turn upside-down too many more times :)
[00:14:38] or if it does, hopefully it's an even number.
[00:15:16] awight: as long as the tests keep pace with reality I don't care what turns upside down
[00:17:37] chrismcmahon: I'm curious about all the idle on integration slaves...
[00:18:04] maybe they are not in the browsertest role?
[00:19:27] awight: hang on a sec...
[00:21:51] awight: sorry, other conversation going on. we have limited the number of executors on Jenkins because performance issues were causing false failures and such
[00:22:09] i see. no worries!
[00:22:20] I'm in no rush, just nosy.
[00:23:25] awight: yep. I think we have made some improvements to horsepower for the Jenkins hosts, and I think we have more to come, but we are doing an enormous amount with Jenkins and managing all the load can be tricky
[00:25:41] awight: it is actually something I need to re-visit with hashar/Antoine.
we have 10 simultaneous VMs available with our SauceLabs account, but load on our Jenkins means we can only use 2 or maybe 3 at a time last I looked.
[00:28:11] jenkins is a dog.
[00:28:29] I'm pretty sad about how it needs to load all the logs into memory before doing anything.
[00:29:18] i have no clue how server assignments are done, but it seems the application cluster has recently gone from 60% to 20% cpu usage :) i wonder if it's possible to repurpose any machines
[00:42:41] ebernhardson: we're getting some of them, yes
[00:43:25] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:45:01] greg-g: that's awesome.
[01:55:07] Project beta-scap-eqiad build #33207: FAILURE in 1 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33207/
[02:05:24] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #180: FAILURE in 44 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/180/
[02:15:31] Yippee, build fixed!
[02:15:32] Project beta-scap-eqiad build #33210: FIXED in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33210/
[03:15:25] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree.value (<11.11%)
[03:17:16] Yippee, build fixed!
[03:17:17] Project browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce build #353: FIXED in 1 hr 30 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce/353/
[03:37:57] Yippee, build fixed!
[03:37:57] Project browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce build #216: FIXED in 19 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce/216/
[03:38:11] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #126: FAILURE in 36 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/126/
[04:10:27] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree.value (<11.11%)
[04:50:26] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree.value (<55.56%)
[06:26:23] Yippee, build fixed!
[06:26:24] Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #352: FIXED in 10 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/352/
[06:40:25] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[06:48:12] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[07:03:36] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:18:16] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:23:36] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:15:37] Project beta-scap-eqiad build #33246: FAILURE in 1 min 14 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33246/
[08:35:37] Yippee, build fixed!
[08:35:38] Project beta-scap-eqiad build #33248: FIXED in 1 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33248/
[08:40:54] Beta-Cluster: Renumber apache user/group to uid=48 on Trusty beta hosts - https://phabricator.wikimedia.org/T78076#836076 (hashar) On deployment-jobrunner01 , you might want to stop the jobrunner /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService It seems to have upstart support, just lac...
[09:27:26] Project browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce build #215: FAILURE in 20 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce/215/
[10:30:03] !log Adding hhvm on Trusty slaves, using depooled integration-slave1009 as the main work area
[10:30:08] Logged the message, Master
[10:49:07] Yippee, build fixed!
[10:49:08] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #181: FIXED in 40 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/181/
[11:29:43] Yippee, build fixed!
[11:29:43] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #336: FIXED in 33 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/336/
[13:50:08] Continuous-Integration: All repositories should pass jshint test (tracking) - https://phabricator.wikimedia.org/T62619#836468 (Krinkle)
[14:35:08] Project beta-scap-eqiad build #33284: FAILURE in 1 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33284/
[14:55:14] Yippee, build fixed!
[14:55:14] Project beta-scap-eqiad build #33286: FIXED in 1 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33286/
[16:26:35] (PS1) Hashar: Support MediaWiki core under HHVM [integration/config] - https://gerrit.wikimedia.org/r/178862
[16:41:22] Wikimedia-Logstash, Beta-Cluster, Release-Engineering: Make logstash in beta public - https://phabricator.wikimedia.org/T76784#837009 (coren) Please remember to add the disclaimer from [[ https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use#If_my_tools_collect_Private_Information... | the Labs Terms...
[16:43:01] Yippee, build fixed!
[16:43:01] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #127: FIXED in 35 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/127/
[16:43:48] Do we have a policy/procedure for scheduling beta outages? Changing the uid on the hhvm servers is going to require stopping apache+hhvm for "a while".
[16:44:21] I can do a dry run with find to guess about how long
[16:45:33] bd808: nope, no policy. "with least reasonable impact" where "reasonable" is a bigger loop-hole than in production.
[16:46:11] heh. So "schedule as to not piss off chrismcmahon" and I'm probably good?
[16:46:13] chrismcmahon: generally, what are the hours that browser tests *aren't* running on beta?
[16:46:17] :)
[16:47:03] sometime over the weekend might actually be easiest
[16:47:33] like Saturday morning while I watch Buffy reruns or something
[16:49:26] "or something"
[16:49:50] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[16:52:14] bd808: that's cool, if you're up for it
[16:52:49] bd808: beta labs is usually not in use before about 11AM Pacific time. we do a day run and a night run
[16:53:25] do you know when they finish running for each of those?
[16:53:44] bd808: at your convenience, though. I'm easy
[16:54:03] greg-g: end times vary pretty widely
[16:54:13] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[16:54:13] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[16:54:15] by how much?
[16:55:09] greg-g: I've never checked, actually
[16:56:00] greg-g: since we limited the number of executors, it just takes A Long Time.
[16:56:10] as in?
[16:56:17] I'm looking for a ballpark figure here
[16:58:52] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #68: FAILURE in 2 hr 36 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/68/
[16:59:11] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[17:00:07] greg-g: last night's complete run seems to have been on the order of 7 hours total
[17:00:34] * greg-g nods
[17:00:35] greg-g: but that can vary widely depending on what is failing and how, response times, all sorts of things.
[17:00:40] right
[17:01:24] so, wildly extrapolating, we have two 5 hours blocks of non-browser test running time, ish, kinda, generally, not really, but a thumbnail sketch, if you squint
[17:11:16] greg-g: yeah, the way it works in practice is that I check the results of the overnight run each morning, so I can repro any issues using a mostly-idle beta labs. The day run kicks off about lunchtime and I keep half an eye on IRC notifications for unexpected failures.
[17:14:10] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:14:10] RECOVERY - Puppet failure on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:14:12] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:14:51] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:32:00] chrismcmahonbrb: /me nods
[17:42:23] (PS2) Hashar: Support MediaWiki core under HHVM [integration/config] - https://gerrit.wikimedia.org/r/178862
[17:44:01] greg-g: and others are checking builds too. Robson is usually pretty good about monitoring the MF builds, S looks after Echo/Flow pretty well, and I went over Jenkins with Rummana yesterday.
[17:45:15] greg-g: although Robson is traveling or something now. Some MF tests started failing late Friday and I only got around to reporting the bugs today.
[17:47:26] (PS3) Hashar: Support MediaWiki core under HHVM [integration/config] - https://gerrit.wikimedia.org/r/178862
[17:49:04] (CR) Hashar: "Still have to inject HHVM env on the jobs:" [integration/config] - https://gerrit.wikimedia.org/r/178862 (owner: Hashar)
[17:54:15] (CR) Cmcmahon: [C: 1] "We can support these builds in Jenkins" [integration/config] - https://gerrit.wikimedia.org/r/178714 (owner: AndyRussG)
[17:55:55] Hi chrismcmahon... Thanks for the CR... Can u merge the patch?
[17:56:12] AndyRussG: I don't have privileges to merge that
[17:57:27] chrismcmahon: Ah OK thanks! Lemme see who can, I guess...
[18:01:57] AndyRussG: I'd like hashar and/or zeljko to give it a +1 also, I am not the most expert at jjb, those guys know it much better. hashar can probably merge it.
[18:02:22] chrismcmahon: OK fantastic, thanks much!
[18:12:25] greg-g: any objections if I mark this closed since Brandon's patch got merged?
https://phabricator.wikimedia.org/T67683
[18:16:33] chrismcmahon: I'd want JeanFred or someone to confirm the fix works
[18:17:15] OK
[18:26:04] Project browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce build #217: FAILURE in 20 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-chrome-sauce/217/
[20:11:38] PROBLEM - Puppet failure on deployment-mx is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:13:04] PROBLEM - Puppet failure on deployment-sca-cache01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[20:26:16] Project beta-scap-eqiad build #33320: FAILURE in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33320/
[20:35:36] Yippee, build fixed!
[20:35:36] Project beta-scap-eqiad build #33321: FIXED in 1 min 30 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33321/
[20:36:38] RECOVERY - Puppet failure on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0]
[20:38:03] RECOVERY - Puppet failure on deployment-sca-cache01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:38:11] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[20:40:34] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:42:15] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:45:10] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #166: FAILURE in 56 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/166/
[20:46:03] Continuous-Integration: Jenkins: Implement mediawiki-phpunit-hhvm job for mediawiki-core repo (and voting) - https://phabricator.wikimedia.org/T75521#838958 (hashar)
[20:59:13] Project beta-scap-eqiad build
#33325: FAILURE in 4 min 55 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33325/
[21:02:14] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:03:13] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:05:33] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]
[21:10:38] marxarelli: thanks to have looked at the puppet rspec patch :-D you need to clone labs/private as /private under your puppet checkout . I have commented about it on https://gerrit.wikimedia.org/r/#/c/178810/ :]
[21:13:02] Beta-Cluster, Wikimedia-Logstash: Make beta logstash server based on a Trusty base image - https://phabricator.wikimedia.org/T78195#839073 (bd808) NEW
[21:18:58] Yippee, build fixed!
[21:18:58] Project beta-scap-eqiad build #33327: FIXED in 4 min 45 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/33327/
[21:21:00] Beta-Cluster, Wikimedia-Logstash: Make beta logstash server based on a Trusty base image - https://phabricator.wikimedia.org/T78195#839095 (bd808) @faidon has added the logstash debs to reprepro so the puppet classes should "just work" to provision everything. The safest transition would be to: # bring up...
[21:25:26] Wikimedia-Logstash, Release-Engineering: Upgrade ElasticSearch on logstash boxen to 1.3.6 - https://phabricator.wikimedia.org/T76089#839105 (bd808)
[21:26:53] ryasmeen: I gave it +2, thanks!
[21:27:12] Beta-Cluster, Wikimedia-Logstash: Make beta logstash server based on a Trusty base image - https://phabricator.wikimedia.org/T78195#839110 (bd808) Elasticsearch on deployment-logstash1 should be updated to the latest version in apt before starting this so we don't end up with a mixed version cluster.
[21:27:20] Thanks chrismcmahon
[21:42:10] Mobile-Web, Continuous-Integration: Publish our JS Documentation - https://phabricator.wikimedia.org/T74794#839147 (hashar)
[21:55:22] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #375: FAILURE in 46 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/375/
[21:57:15] vikasyaligar: having connection issues?
[22:03:43] bd808 greg-g : Flow (at least) just now is all 503 on beta labs. Known?
[22:04:05] * chrismcmahon wonders if it's the parsoid host with the 503
[22:05:29] [proxy_fcgi:error] [pid 19052] (70014)End of file found: [client 10.68.16.16:4302] AH01075: Error dispatching request to :, referer: http://en.wikipedia.beta.wmflabs.org/wiki/Talk:Flow_QA
[22:06:06] wheee
[22:06:57] I can't even read that.
[22:13:29] bd808: so is that Flow trying to send a message to nowhere, or did proxy_fcgi lose its mind? thinking it's the former.
[22:13:39] chrismcmahon: seems to just be flow? I can make ve edits
[22:14:07] OK, I'll file an issue for it.
[22:15:00] I'm not sure what's happening under the hood. I see the 503 response and can't find log messages in logstash or on the host (deployment-mediawiki01 for my request) nor any hhvm cores
[22:17:05] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #218: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/218/
[22:17:45] !log restarted logstash on logstash1001. redis event queue not being processed
[22:17:47] spagewmf: ^^
[22:17:49] Logged the message, Master
[22:17:57] spagewmf: re the [proxy_fcgi:error] [pid 19052] (70014)End of file found: [client 10.68.16.16:4302] AH01075: Error dispatching request to :, referer:
[22:18:22] There may be error messages coming.
logstash was stuck and I just unstuck it
[22:18:31] ah
[22:18:37] chrismcmahon, bd808 looking into it
[22:18:53] whee, my first "Unbreak now" Phab ticket :-)
[22:20:15] greg-g: On a related note, restarting logstash brought in all 21K log messages that were stuck in redis. I think that's pretty cool.
[22:21:08] bd808: that is, in fact
[22:21:21] yay for them not being dropped on the floor
[22:22:18] greg-g: you know how earlier I mentioned surfing Jenkins output through the afternoon, that was an example. when I get an email that looks like this something has gone haywire http://imgur.com/RJn6Azt
[22:24:05] * greg-g nods
[22:31:20] 7.5 hours ago I git cloned the UploadWizard repo to do some refactoring. I have yet to touch a line of UW code. :-/
[22:33:55] bd808 FYI the Flow 503 is because the signature of shouldCheck method in ConfirmEdit just changed, so Fatal error in hhvm.log. EBernhardson is figuring out what to do.
[22:34:27] spagewmf: sweet. Glad you found it
[22:35:15] this is exactly why we have beta labs. and browser tests. go us.
[22:58:27] PROBLEM - Free space - all mounts on deployment-mediawiki02 is CRITICAL: CRITICAL: deployment-prep.deployment-mediawiki02.diskspace._var.byte_percentfree.value (<11.11%)
[23:39:54] PROBLEM - Free space - all mounts on deployment-mediawiki01 is CRITICAL: CRITICAL: deployment-prep.deployment-mediawiki01.diskspace._var.byte_percentfree.value (<30.00%)