[02:06:21] 06Release-Engineering-Team, 10MediaWiki-Vagrant, 06Operations, 07Epic, 13Patch-For-Review: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3136099 (10Krinkle) [06:42:05] Project selenium-Wikibase » chrome,beta,Linux,BrowserTests build #313: 04FAILURE in 2 hr 2 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/313/ [06:42:59] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:40:49] RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK [07:42:45] !log nodepool cleared a couple alien instances [07:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:34:51] (03CR) 10Hashar: "As a result the Jenkins job mediawiki-selenium-integration is gone and replaced by mediawiki-selenium-integration-jessie which was already" [selenium] - 10https://gerrit.wikimedia.org/r/344941 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [09:05:53] 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10media-storage, 13Patch-For-Review: deployment-ms-be01.deployment-prep and deployment-ms-be02.deployment-prep have high load / system CPU - https://phabricator.wikimedia.org/T160990#3136430 (10fgiunchedi) Another... 
[09:13:22] 06Release-Engineering-Team, 06Operations, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3136437 (10fgiunchedi) [10:25:22] 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10Android-app-Bugs, 06Wikipedia-Android-App-Backlog: Merge apps/android/wikipedia Jenkins jobs lint and test - https://phabricator.wikimedia.org/T161305#3136547 (10hashar) We can merge the jobs and improve the readability of the test report at th... [10:28:52] !log Jenkins: installing Android Lint plugin 2.4 - T161305 [10:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:28:56] T161305: Merge apps/android/wikipedia Jenkins jobs lint and test - https://phabricator.wikimedia.org/T161305 [10:49:44] 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10Android-app-Bugs, 06Wikipedia-Android-App-Backlog: Merge apps/android/wikipedia Jenkins jobs lint and test - https://phabricator.wikimedia.org/T161305#3136584 (10hashar) Gave it a try on a test job https://integration.wikimedia.org/ci/job/hasha... 
[11:05:03] 10Browser-Tests-Infrastructure, 10MediaWiki-General-or-Unknown, 07JavaScript, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 4 others: Port Selenium tests from Ruby to Node.js - https://phabricator.wikimedia.org/T139740#3136624 (10zeljkofilipin) [11:05:38] (03PS1) 10Hashar: Android: remove lint job [integration/config] - 10https://gerrit.wikimedia.org/r/345122 (https://phabricator.wikimedia.org/T161305) [11:05:48] 10Browser-Tests-Infrastructure, 10MediaWiki-General-or-Unknown, 07JavaScript, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 4 others: Port Selenium tests from Ruby to Node.js - https://phabricator.wikimedia.org/T139740#2441243 (10zeljkofilipin) [11:08:04] PROBLEM - Puppet run on integration-c1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:08:52] 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10Android-app-Bugs, 06Wikipedia-Android-App-Backlog, 13Patch-For-Review: Merge apps/android/wikipedia Jenkins jobs lint and test - https://phabricator.wikimedia.org/T161305#3128129 (10hashar) a:03hashar [11:36:42] (03PS1) 10Hashar: Android: generate lint report on build page [integration/config] - 10https://gerrit.wikimedia.org/r/345127 (https://phabricator.wikimedia.org/T161305) [11:37:49] (03CR) 10Hashar: "I have refreshed the android-apps-wikipedia-test job already. The lint report should show up, giving it a try with the change https://ger" [integration/config] - 10https://gerrit.wikimedia.org/r/345127 (https://phabricator.wikimedia.org/T161305) (owner: 10Hashar) [11:46:33] 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10Android-app-Bugs, 06Wikipedia-Android-App-Backlog, 13Patch-For-Review: Merge apps/android/wikipedia Jenkins jobs lint and test - https://phabricator.wikimedia.org/T161305#3136681 (10hashar) Summary ====== * https://gerrit.wikimedia.org/r/345... 
[11:54:18] 10Continuous-Integration-Config, 05Continuous-Integration-Scaling, 10releng-201516-q3, 13Patch-For-Review, 07WorkType-NewFunctionality: [keyresult] Migrate as many misc CI jobs as possible to Nodepool - https://phabricator.wikimedia.org/T119140#3136683 (10EddieGP) [13:22:04] !log restarting elasticsearch on deployment-elastic05 to reload log4j configuration [13:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:46:19] Project selenium-VisualEditor » firefox,beta,Linux,BrowserTests build #350: 04FAILURE in 2 min 18 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/350/ [13:49:09] PROBLEM - Puppet run on buildlog is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [14:32:24] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [14:34:06] 10Continuous-Integration-Infrastructure (Little Steps Sprint): Create "High Priority" gate-and-submit pipeline - https://phabricator.wikimedia.org/T160668#3136975 (10hashar) Gave it a quick try by duplicating gate-and-submit on all MediaWiki repositories but that does not work quite well. We have Zuul templates... [14:52:40] Hello RelEng. Could someone help me with logstash a bit? I've enabled some logging in ChangeProp so that the logs would end up in logstash, but they don't show up. Perhaps it's a typing conflict again, and I don't have access to the kibana/elasticsearch logs. Could someone have a look if there's something useful there? [15:39:55] Pchelolo: are you a member of the deployment-prep project? The ELK cluster is all on deployment-logstash2.deployment-prep.eqiad.wmflabs [15:40:36] Oh, didn't know that. Thank you bd808. Lemme try to have a look [15:40:47] production logstash is there too, right?
[15:41:29] no, prod is on logstash100*.eqiad.wmnet [15:42:20] I'm actually not sure if any of the releng folks have access to those hosts [15:42:41] (they should... if they want access) [15:44:14] Pchelolo: I'm about to head into some meetings, but I could try and help you look at prod in an hour and a half or so. Erik B and any root also have access to the logstash100* hosts [15:44:48] cool, thank you. I will try to force a problematic log in beta and test on deployment-logstash [15:45:01] +1 [15:46:44] (03PS1) 10Hashar: Make some skin tests voting [integration/config] - 10https://gerrit.wikimedia.org/r/345167 [15:55:50] (03PS2) 10Hashar: Make some skin tests voting [integration/config] - 10https://gerrit.wikimedia.org/r/345167 [15:55:52] (03PS1) 10Hashar: Zuul layoutdiff did not recognize gate-and-pipeline [integration/config] - 10https://gerrit.wikimedia.org/r/345169 [15:57:58] (03CR) 10Hashar: [C: 032] "Verified output in child change https://gerrit.wikimedia.org/r/#/c/345167/ and the gate and submit shows up properly now." [integration/config] - 10https://gerrit.wikimedia.org/r/345169 (owner: 10Hashar) [15:58:00] (03CR) 10Hashar: [C: 032] Make some skin tests voting [integration/config] - 10https://gerrit.wikimedia.org/r/345167 (owner: 10Hashar) [15:59:15] (03Merged) 10jenkins-bot: Zuul layoutdiff did not recognize gate-and-pipeline [integration/config] - 10https://gerrit.wikimedia.org/r/345169 (owner: 10Hashar) [15:59:17] (03Merged) 10jenkins-bot: Make some skin tests voting [integration/config] - 10https://gerrit.wikimedia.org/r/345167 (owner: 10Hashar) [16:00:12] bd808: unfortunately in logstash-beta the logs do show up, so I can't debug there. Please ping me when you have time to grep the logs from prod [16:01:06] Pchelolo: I'll try to remember. If you haven't heard from me by 17:30Z, poke me :) [16:01:21] kk, cool.
Thank you [16:05:13] !log deployed ores:18beebf (T160638) [16:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:05:18] T160638: Deploy ORES late march - https://phabricator.wikimedia.org/T160638 [16:05:50] It's weird that scap didn't automatically log this. Does scap auto-log for beta deployments? [16:06:26] I ran `scap deploy -v "T160638"` [17:02:32] 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10Android-app-Bugs, 06Wikipedia-Android-App-Backlog, 13Patch-For-Review: Merge apps/android/wikipedia Jenkins jobs lint and test - https://phabricator.wikimedia.org/T161305#3137737 (10Niedzielski) {icon thumbs-up} @hashar, thank you! These all... [17:13:32] halfak: scap !log doesn't work in beta cluster. There is a bug about it somewhere. The TL;DR is that beta cluster is missing the tcp->irc bridge that makes that happen in prod [17:14:22] Pchelolo: 0/ do you know which logstash server I need to look at? It would be in the ChangeProp config somewhere [17:19:50] bd808: logstash1001.eqiad.wmnet [17:19:58] And the problematic log is WARN level with a message "Retry count exceeded" [17:23:01] ok. 
logstash says ":message=>"failed action with response of 400, dropping action" so now to see what is making Elasticsearch unhappy [17:23:34] java [17:23:49] bd808: probably the type conflict again :( [17:28:33] Pchelolo: "java.lang.IllegalArgumentException: [event] is defined as a field in mapping [changeprop] but this name is already used for an object in other types" [17:28:49] so yes the mapping nightmare [17:29:18] bd808: I'm starting to think more and more that we should prefix the property names with a service name [17:29:21] perhaps another vote for auto-hungarian notation [17:29:32] I haven't ranted about how much I hate this ES change for a while [17:29:49] for extra fun, iiuc in es5 dotted notation becomes auto-objects [17:30:11] Thanks a lot bd808, I'll fix this after lunch now that I know exactly what the problem is [17:30:13] but we de-dot, so that shouldn't affect us (but the de_dot logstash filter is basically unmaintained now) [17:30:33] hope ElasticSearch will be fine for another couple of hours with this error? [17:30:35] I really don't get how they can claim that this is actually a usable product anymore [17:30:59] their repeated suggestion is basically to run a bunch of ES clusters, and use tribe nodes to link them [17:31:30] right, buy more hardware because we made an arbitrary decision [17:31:52] indeed. Although s/buy more hardware/pay AWS more/ is their suggestion :P [17:31:56] or ...
go back to using elasticsearch 1.x and stop upgrading if you want ELK to actually work [17:32:21] they are even releasing software to simplify creating/managing multiple clusters in a cloud environment [17:32:36] well if I was in AWS I would just use a logging service :) [17:32:55] and I would not run their software myself at all [17:33:55] for WMF I guess we either hack the auto-hungarian notation thing or start making an index per log event type [17:34:25] i'm not sure either :S [17:35:28] I took care of the scalar mismatch problem I think, so it's just objects and collections that can blow up now [17:35:45] I don't think collections are actually solvable [17:36:02] args[] could be anything and it will puke [17:38:19] mildly amusing: https://discuss.elastic.co/t/elasticsearch-stopped-accepting-specific-documents/43417/8 moderator responds almost instantly, multiple times, about what is wrong and suggested fixes. When asked why it works this way, the thread dies [18:37:22] (03CR) 10Mholloway: [C: 031] Android: remove lint job [integration/config] - 10https://gerrit.wikimedia.org/r/345122 (https://phabricator.wikimedia.org/T161305) (owner: 10Hashar) [18:37:43] (03CR) 10Mholloway: [C: 031] Android: generate lint report on build page [integration/config] - 10https://gerrit.wikimedia.org/r/345127 (https://phabricator.wikimedia.org/T161305) (owner: 10Hashar) [19:12:28] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Depool precise jenkins instances - https://phabricator.wikimedia.org/T158652#3138186 (10hashar) 05Open>03Resolved The other puppet clean up patch https://gerrit.wikimedia.org/r/#/c/343309/ is on our radars. Will be m...
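Editor's note on the 17:28 to 17:36 mapping discussion: the "auto-hungarian notation" idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the filter WMF runs or planned to run. The function names (`type_suffix`, `hungarianize`) and the suffixes are hypothetical. The idea is that a field such as `event` ends up under a different key depending on whether a service logs it as a string or as an object, so the two uses no longer collide in one Elasticsearch mapping. As noted in the chat, lists with mixed element types are not solved by this.

```python
from typing import Any, Dict


def type_suffix(value: Any) -> str:
    """Return a hypothetical type suffix for a JSON-like value."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "obj"
    if isinstance(value, list):
        # Element types are not inspected; mixed lists can still conflict.
        return "list"
    return "str"


def hungarianize(event: Dict[str, Any]) -> Dict[str, Any]:
    """Rename every key to '<key>_<type suffix>', recursing into objects."""
    renamed: Dict[str, Any] = {}
    for key, value in event.items():
        if isinstance(value, dict):
            renamed[f"{key}_obj"] = hungarianize(value)
        else:
            renamed[f"{key}_{type_suffix(value)}"] = value
    return renamed


# Two services logging "event" with different shapes no longer share a key.
scalar_log = hungarianize({"event": "retry", "count": 3})
object_log = hungarianize({"event": {"name": "retry"}})
```

Here `scalar_log` carries the key `event_str` and `object_log` carries `event_obj`, so neither forces a type onto the other. The alternative raised at 17:29:18, prefixing property names with the service name, trades this per-type split for a per-service split.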
[19:15:09] 10Scap: scap clean not removing staging dirs - https://phabricator.wikimedia.org/T161643#3138191 (10thcipriani) [19:15:11] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10OOjs-UI, 13Patch-For-Review: Speed up oojs/ui Jenkins jobs - https://phabricator.wikimedia.org/T155483#3138203 (10hashar) The repository now has a 'jenkins' npm script and the test task now depends on demo. S... [19:21:00] (03PS1) 10Hashar: Merge 3 oojs/ui jobs in a single one [integration/config] - 10https://gerrit.wikimedia.org/r/345203 (https://phabricator.wikimedia.org/T155483) [19:22:25] (03CR) 10Jforrester: [C: 031] Merge 3 oojs/ui jobs in a single one [integration/config] - 10https://gerrit.wikimedia.org/r/345203 (https://phabricator.wikimedia.org/T155483) (owner: 10Hashar) [19:22:54] (03CR) 10Hashar: "I have created the new job in Jenkins already : oojs-ui-npm-run-jenkins-node-6-jessie" [integration/config] - 10https://gerrit.wikimedia.org/r/345203 (https://phabricator.wikimedia.org/T155483) (owner: 10Hashar) [19:32:32] (03CR) 10Hashar: [C: 032] Merge 3 oojs/ui jobs in a single one [integration/config] - 10https://gerrit.wikimedia.org/r/345203 (https://phabricator.wikimedia.org/T155483) (owner: 10Hashar) [19:33:33] hashar: I got eddsa ssh keys supported in gerrit 2.14+ https://gerrit-review.googlesource.com/#/c/100998/ :) [19:33:37] (03Merged) 10jenkins-bot: Merge 3 oojs/ui jobs in a single one [integration/config] - 10https://gerrit.wikimedia.org/r/345203 (https://phabricator.wikimedia.org/T155483) (owner: 10Hashar) [19:34:09] !log Migrate oojs/ui to just run 'npm jenkins' https://gerrit.wikimedia.org/r/345203 / T155483 [19:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [19:34:12] T155483: Speed up oojs/ui Jenkins jobs - https://phabricator.wikimedia.org/T155483 [19:52:32] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10OOjs-UI,
13Patch-For-Review: Speed up oojs/ui Jenkins jobs - https://phabricator.wikimedia.org/T155483#3138269 (10hashar) a:03Prtksxna The first run of oojs-ui-npm-run-jenkins-node-6-jessie passed. Thank yo... [19:53:14] !log Populating package manager cache of oojs-ui-npm-run-jenkins-node-6-jessie by manually triggering a build with ZUUL_PIPELINE=postmerge T155483 [19:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [19:53:18] T155483: Speed up oojs/ui Jenkins jobs - https://phabricator.wikimedia.org/T155483 [19:56:51] hashar: Looks like it worked. [20:00:21] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure (Little Steps Sprint), 10OOjs-UI, 13Patch-For-Review: Speed up oojs/ui Jenkins jobs - https://phabricator.wikimedia.org/T155483#3138284 (10hashar) The next optimization is that the npm test / grunt test commands use composer. So most... [20:00:32] James_F: yeah which saves a lot of instances [20:01:10] the build can most probably be made faster [20:01:22] the npm install step can definitely benefit from caching somehow [20:02:06] 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Update gerrit to 2.14 - https://phabricator.wikimedia.org/T156120#3138287 (10Paladox) eddsa keys are now supported by default https://gerrit-review.googlesource.com/#/c/100998/ @demon i managed to fix one of the bugs you wrote to asking for this witho... [20:02:21] but one step at a time [20:02:27] hashar: Yeah. [20:15:06] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Builds from mwext-testextension jobs sometimes pick up tests from unrelated skins - https://phabricator.wikimedia.org/T117710#3138323 (10hashar) 05Open>03Resolved >>! In T117710#3048702, @Krinkle wrote: > @hashar I thought we no longer preserve a... 
[20:26:33] 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Update gerrit to 2.14 - https://phabricator.wikimedia.org/T156120#3138349 (10Paladox) [20:36:13] 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Update gerrit to 2.14 - https://phabricator.wikimedia.org/T156120#3138361 (10Paladox) [20:43:40] Project selenium-Echo » chrome,beta,Linux,BrowserTests build #347: 04FAILURE in 2 min 39 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/347/ [20:43:40] Project selenium-Echo » firefox,beta,Linux,BrowserTests build #347: 04FAILURE in 2 min 40 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/347/ [23:01:51] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10Release Pipeline: Spike: Evaluate containerized CI builds using Kubernetes - https://phabricator.wikimedia.org/T153363#3138755 (10dduvall) 05Open>03declined This idea is essentially superseded by the #release_pipeline which will undou... 
[23:21:35] 10Continuous-Integration-Config: Prevent the addition of files with names that aren't supported on Windows - https://phabricator.wikimedia.org/T67140#3138845 (10Krinkle) [23:27:25] 10Continuous-Integration-Config, 07Technical-Debt: Migrate "analytics-*" jobs to Jenkins Job Builder - https://phabricator.wikimedia.org/T97514#3138861 (10Krinkle) [23:27:55] 10Continuous-Integration-Config: Write a test to ensure all jobs in Zuul are defined in JJB - https://phabricator.wikimedia.org/T103847#3138864 (10Krinkle) [23:27:57] 10Continuous-Integration-Config, 07Technical-Debt: Migrate "analytics-*" jobs to Jenkins Job Builder - https://phabricator.wikimedia.org/T97514#1244558 (10Krinkle) [23:38:43] 10Continuous-Integration-Config, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050#3138911 (10Krinkle) [23:38:44] 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling, 13Patch-For-Review: MediaWiki gate takes 20 minutes for extensions tests and 1.5 hour for at least a patch - https://phabricator.wikimedia.org/T126274#3138910 (10Krinkle) [23:38:46] 10Continuous-Integration-Config, 13Patch-For-Review: Have CI set `$wgScribuntoDefaultEngine = 'luasandbox` to speed up parser tests - https://phabricator.wikimedia.org/T126670#3138907 (10Krinkle) 05Open>03Resolved >>! In T126670#2072047, @Nikerabbit wrote: >>>! In T126670#2065433, @gerritbot wrote: >> Chan...