[00:35:16] !log git clone failure in https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/131/console blocking merge of core patch
[00:35:18] Logged the message, Master
[00:48:54] !log deleted integration-slave1006:/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/ZeroBanner to try and clear the git clone problem there
[00:48:56] Logged the message, Master
[00:54:11] blerg!
[00:55:36] !log zuul is plugged up because a gate-and-submit job failed on integration-slave1006 (ZeroBanner clone problem) and then the patch was force merged
[00:55:39] Logged the message, Master
[00:57:45] marxarelli: can you help with that ^ if it's still a problem?
[00:57:46] now zuul is retesting the patch that is already merged?
[00:58:03] ci is gray-cray yo
[00:58:10] *cray-cray
[00:58:19] i like gray-cray
[00:58:24] it's old and crazy
[00:58:33] true dat
[00:59:22] looks like it may unplug soon. nuking the bad partial checkout on integration-slave1006 seemed to make things happier
[01:00:14] the gate-and-submit has quite a pile of things to work through though
[01:04:34] blek. i need a serious education on integration servers
[01:04:51] are they in eqiad?
[01:05:10] yeah in the integration labs project
[01:05:34] integration-slave1006.eqiad.wmflabs was the one I was slapping around
[01:05:58] each time i try to ssh i get a connection closed
[01:06:13] maybe it's my proxy through bastion that's borked
[01:06:27] I don't think you are in the project actually ...
[01:06:35] * bd808 looks again
[01:06:42] oh. ha! that would be funny
[01:07:27] nope. no dduvall in the list
[01:07:41] that's your wikitech user right?
[01:08:05] bd808: yes sir
[01:08:54] !log Added Dduvall as an admin in the integration project
[01:08:57] Logged the message, Master
[01:09:19] marxarelli: you can add twentyafterfour now too :)
[01:09:57] bd808: \o/ gracias
[01:12:13] !log Added twentyafterfour as an admin to the integration project
[01:12:15] Logged the message, Master
[01:12:26] hurray for leaning
[01:15:01] *learning*
[01:15:24] learning to lean
[01:16:00] marxarelli: you could merge https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/vagrant+branch:master+topic:iso-dist,n,z if you were bored
[01:18:42] bd808: done-zo
[01:19:00] kewl. i'll update the build server
[01:28:28] fwict, gate-and-submit jobs are flowing again
[01:30:25] yeah it healed up right after I !log-ed the whine about it
[01:34:18] right on. thanks!
[01:35:02] marxarelli: Here's the mw-v iso image -- https://wikimania-vagrant.wmflabs.org/mediawiki-vagrant/
[01:35:13] I haven't downloaded it to test it out yet
[01:35:27] cool. i'll test it out tomorrow
[01:35:32] sweet
[01:35:46] * bd808 needs to go watch football now
[01:37:07] oh boy, this download is going to take a while
[01:37:25] * marxarelli loves and hates his sonic dsl
[01:38:48] btw, greg-g, do they have sonic fiber in petaluma?
[01:44:38] marxarelli: parts of it
[01:47:14] greg-g: that's cool. looks like they've expanded their operation quite a bit of the past few years
[01:47:36] it used to just be in sebastopol
[01:49:05] signing off. see y'all!
[02:02:07] Continuous-Integration: [jsduck] Various custom tags should be easily shareable between projects - https://phabricator.wikimedia.org/T86587#972692 (Jdforrester-WMF)
[02:06:37] Continuous-Integration, Mobile-Web: [jsduck] Various custom tags should be easily shareable between projects - https://phabricator.wikimedia.org/T86587#972696 (Jdlrobson) Seems related to #Mobile-Web and probably VE James since the code currently lives in VE... Would be great to break it out into its own exte...
[02:33:05] Continuous-Integration, Mobile-Web: [jsduck] Various custom tags should be easily shareable between projects - https://phabricator.wikimedia.org/T86587#972707 (Jdforrester-WMF) >>! In T86587#972696, @Jdlrobson wrote: > Seems related to #Mobile-Web and probably VE James since the code currently lives in VE......
[03:07:00] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #347: FAILURE in 59 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/347/
[03:18:15] This is the new -qa?
[03:40:06] Yippee, build fixed!
[03:40:06] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #249: FIXED in 33 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/249/
[03:44:27] Wikibugs, Phabricator: Set up dumping Phabricator's project taxonomy to a wiki - https://phabricator.wikimedia.org/T85096#972737 (scfc) @valhallasw: Thanks!
[03:57:15] Project browsertests-VisualEditor-test2.wikipedia.org-linux-chrome-sauce build #425: FAILURE in 27 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-chrome-sauce/425/
[04:10:38] !log Jenkins/zuul/whatever not working, investigating.
[04:10:42] Logged the message, Master
[04:15:37] > whatever
[04:15:39] Uhuh
[04:15:45] James_F: What's happening now?
[04:15:48] !log Flagged and unflagged Jenkins for restart, no effect.
[04:15:50] Logged the message, Master
[04:16:35] Well, some effect
[04:16:51] Poor wmf-insecte
[04:17:05] > queued 2 hr 11 min ago
[04:17:10] Hmm. Is that related?
[04:17:17] Yeah, as I said, no effect.
[04:17:41] https://integration.wikimedia.org/zuul/
[04:17:53] There's an older one though
[04:18:09] ...that got merged successfully
[04:18:10] beta-mediawiki-config-update-eqiad frequently gets stuck.
[04:18:22] Yeah, it's a postmerge task.
[04:19:03] !log Disabled and re-enabled Gearman, no effect.
[04:19:05] Logged the message, Master
[04:19:12] None of the tasks are still running...so...
[04:19:19] Restart zuul, I guess
[04:19:23] RIP zuul
[04:19:32] Except I don't have Zuul access.
[04:19:36] And neither do you, I imagine.
[04:19:43] Hm, that sounds like a thing I might have
[04:19:55] Integration cluster access is rare.
[04:20:01] Bitch, I'm rare
[04:20:43] But, there's a scary warning label about restarting zuul.
[04:20:57] marktraceur: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
[04:21:00] Maybe a reload would do it
[04:21:21] Almost certainly not.
[04:21:54] Well.
[04:22:02] Wait for one last thing.
[04:23:01] > 2015-01-13 03:34:05,631 ERROR zuul.GerritEventConnector: Received unrecognized event type 'topic-changed' from Gerrit.
[04:23:05] That might be benign
[04:23:13] Hmm. Should be.
[04:23:51] Other stuff seems mostly fine
[04:23:57] Some failure logs
[04:24:27] !log Took the gallium Jenkins slave offline, disconnected and relaunched; no effect.
[04:24:29] Logged the message, Master
[04:24:32] 2015-01-13 02:11:17,694 INFO zuul.Gearman: Build complete, result SUCCESS
[04:24:32] * James_F sighs.
[04:24:39] That was the last successful build
[04:24:49] 2 hours ago?
[04:24:51] Oy.
[04:24:52] Yup
[04:24:59] OK, try to restart it then?
[04:25:05] There's one other build log
[04:25:06] 2015-01-13 04:16:01,706 INFO zuul.Gearman: Build complete, result None
[04:25:15] Restart is graceful.
[04:25:17] I imagine involving you doing something, not sure
[04:25:26] I'm going to try reload as a hail mary first
[04:26:05] !log Reloaded zuul to see if it will help
[04:26:07] Logged the message, Master
[04:27:30] … nope.
[04:27:36] > If you're in a position where Zuul is unresponsive, restarting will be futile as that will leave it no less stuck then it already is.
[04:27:51] Try it anyway.
[04:28:01] If we can…
[04:28:11] !log Attempting graceful zuul restart
[04:28:13] Logged the message, Master
[04:28:23] That shit ain't gon' work.
[04:28:50] ... waiting for jobs to complete .............................................
[04:29:02] Ha.
[04:29:08] OK, kill it hard then.
[04:29:08] * marktraceur aborts
[04:29:29] !log FORCE RESTART ZUUL (James_F told me to)
[04:29:31] Logged the message, Master
[04:29:32] Done
[04:29:34] :-P
[04:29:45] That had an immediate effect.
[04:30:00] SOMETHING happened.
[04:30:04] Are you bringing it back up?
[04:30:10] It's coming
[04:30:15] Aha, yes.
[04:30:17] Log just flew by
[04:30:37] Aha.
[04:30:38] Yay.
[04:30:42] It's back to working.
[04:30:44] Go marktraceur.
[04:30:53] Yay being up late and rare
[04:30:56] And in other news, I need gallium ssh access if I'm going to fix this.
[04:31:05] * James_F bows before almighty marktraceur.
[04:31:07] James_F: You have me! I'm the next best thing.
[04:31:13] James_F: What went wrong?
[04:31:14] !log Zuul now appears fixed.
[04:31:17] Logged the message, Master
[04:31:24] marktraceur: In future, I meant.
[04:31:33] Ah.
[04:32:06] Well, I have it, Krinkle has it, and hashar has it. I guess all we need is someone in the third part of the globe.
[04:32:10] Or I could just stay up late
[04:32:19] Or first part.
[04:34:31] Still seems mighty sluggish
[04:34:59] Same as before.
[04:35:12] The VE test stuff takes forever now.
[04:35:29] All of MW-core, Mobile, VE, Echo and a dozen other things need to run in Zend, HHVM and npm.
[04:35:51] Yeah
[04:38:04] We really need to throw more hardware at the problem.
[04:38:27] (PS5) Jforrester: Clean up phpcs macros and jobs (remove strict/lenient split) [integration/config] - https://gerrit.wikimedia.org/r/166071 (https://phabricator.wikimedia.org/T50420)
[04:40:44] * James_F heads home.
[04:46:30] Yippee, build fixed!
[04:46:30] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #486: FIXED in 49 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce/486/
[04:51:42] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #402: FAILURE in 36 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/402/
[04:52:05] 'kay, if zuul crashes again after about an hour from now, I might be asleep
[05:24:40] more channel renames? egads.
[05:34:54] Project beta-scap-eqiad build #37978: FAILURE in 50 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37978/
[05:45:04] Project beta-scap-eqiad build #37979: STILL FAILING in 53 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37979/
[05:55:07] Yippee, build fixed!
[05:55:07] Project beta-scap-eqiad build #37980: FIXED in 1 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37980/
[06:14:53] (PS2) KartikMistry: Enable npm check on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184369
[06:24:38] (CR) KartikMistry: "https://integration.wikimedia.org/ci/job/cxserver-npm/440/console already checking jshint." [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[07:11:50] greg-g: uh, not sure if the !log worked, but whatever I did in the log did work and sentry2 got unfailed
[07:22:14] Phabricator, § Phabricator-Sprint-Extension, Phabricator.org: Restricting modification of tasks when they enter sprints - https://phabricator.wikimedia.org/T819#972812 (Qgil) I think it is worth trying this restriction in the context of Sprint projects.
[07:26:51] Code-Review, Engineering-Community: How to prioritize code review of patches submitted by volunteers - https://phabricator.wikimedia.org/T78768#972819 (Qgil) @Awjrichards, at least the #Engineering-Community team thinks that "How to prioritize code review of patches submitted by volunteers" should be a topic...
[07:34:18] Project beta-scap-eqiad build #37987: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37987/
[07:34:27] Fundraising-Backlog, Phabricator: Migration of Fundraising Tech team to Phabricator - https://phabricator.wikimedia.org/T831#972831 (Eloquence) @atgo Thanks for taking the plunge, in spite of the remaining issues with the burndown extension.
[07:39:37] Yippee, build fixed!
[07:39:37] Project beta-scap-eqiad build #37988: FIXED in 4 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37988/
[08:11:34] Changed default instance of #wikimedia-releng to wm-bot
[08:11:34] @instance wm-bot #wikimedia-releng
[08:18:24] Continuous-Integration: Jenkins: Use node-jscs as checkstyle for javascript coding style - https://phabricator.wikimedia.org/T56218#972923 (adrianheine) Is there work being done to get this run on ci? I'd like to push this for Wikibase as well :)
[08:29:55] (PS1) Gergő Tisza: Add Unicodesnowman to the CI whitelist [integration/config] - https://gerrit.wikimedia.org/r/184578
[08:30:29] LabsDB-Auditor, Continuous-Integration: Setup jenkins jobs for labsdb-auditor - https://phabricator.wikimedia.org/T86622#972953 (yuvipanda) NEW a:Legoktm
[08:39:42] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[08:43:45] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[08:51:39] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[09:18:11] (PS3) Zfilipin: Run VisualEditor screenshot job on a Mac [integration/config] - https://gerrit.wikimedia.org/r/183480 (https://phabricator.wikimedia.org/T78648)
[09:18:37] LabsDB-Auditor, Continuous-Integration: Setup jenkins jobs for labsdb-auditor - https://phabricator.wikimedia.org/T86622#973002 (hashar) The tox based convention for python and related CI configuration are described at https://www.mediawiki.org/wiki/Continuous_integration/Tutorials/Test_your_python should be...
[09:18:45] LabsDB-Auditor, Continuous-Integration: Setup jenkins jobs for labsdb-auditor - https://phabricator.wikimedia.org/T86622#973004 (hashar) p:Triage>Normal
[09:22:00] Continuous-Integration: Jenkins: Use node-jscs as checkstyle for javascript coding style - https://phabricator.wikimedia.org/T56218#973009 (hashar) >>! In T56218#972923, @adrianheine wrote: > Is there work being done to get this run on ci? I'd like to push this for Wikibase as well :) You will need a package...
[09:22:42] (PS11) Adrian Lang: Make mwext-WikibaseJavaScriptApi-qunit voting [integration/config] - https://gerrit.wikimedia.org/r/180418 (https://phabricator.wikimedia.org/T86176)
[09:35:25] Release-Engineering: Amir does not have +2 on integration/config - https://phabricator.wikimedia.org/T86629#973036 (zeljkofilipin) NEW
[09:36:49] (CR) Zfilipin: [C: 2] Run VisualEditor screenshot job on a Mac [integration/config] - https://gerrit.wikimedia.org/r/183480 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin)
[09:38:19] Phabricator, Analytics-Tech-community-metrics, Engineering-Community: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#973051 (Qgil) I would like to propose some changes in the data ranges used in this report, but before let's agree on {T86630}. You opinions are wel...
[09:40:32] Continuous-Integration, Mobile-Web: [jsduck] Various custom tags should be easily shareable between projects - https://phabricator.wikimedia.org/T86587#973055 (hashar) Could the CustomTags we use be made a ruby gem? This way we could have it added to Gemfile and have bundler install it for us. The invocation...
[09:42:59] Release-Engineering: Amir does not have +2 on integration/config - https://phabricator.wikimedia.org/T86629#973060 (Amire80) {F28326}
[09:44:24] (Merged) jenkins-bot: Run VisualEditor screenshot job on a Mac [integration/config] - https://gerrit.wikimedia.org/r/183480 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin)
[09:48:29] (CR) Hashar: "> I just realized we cannot IP-whitelist, because IPs are not available in tool labs." [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[09:50:28] (CR) Hashar: "YuviPanda confirmed the tools labs reverse proxy does not emit X-Forwarded-For header :-/ So the lame rate limiting is probably fine." [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[09:51:39] (PS1) Zfilipin: Deleted Jenkins job that was running VisualEditor screenshots using a local browser [integration/config] - https://gerrit.wikimedia.org/r/184587 (https://phabricator.wikimedia.org/T78648)
[09:51:58] (PS3) KartikMistry: Enable npm check on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184369
[09:52:20] (CR) Amire80: [C: 1] Deleted Jenkins job that was running VisualEditor screenshots using a local browser [integration/config] - https://gerrit.wikimedia.org/r/184587 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin)
[09:53:22] (CR) Zfilipin: [C: 2] Deleted Jenkins job that was running VisualEditor screenshots using a local browser [integration/config] - https://gerrit.wikimedia.org/r/184587 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin)
[09:53:29] (CR) Hashar: [C: 2] "{{be bold}} +2 it :]" [integration/config] - https://gerrit.wikimedia.org/r/184587 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin)
[09:53:48] (CR) Hashar: "https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-linux-firefox has been deleted already." [integration/config] - https://gerrit.wikimedia.org/r/184587 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin)
[09:55:38] Release-Engineering, Continuous-Integration: Amir does not have +2 on integration/config - https://phabricator.wikimedia.org/T86629#973088 (hashar)
[09:56:56] hashar: the IP is not exposed on purpose; I think because the tool labs privacy rules are stricter than the general labs ones
[09:57:20] hashar: but I suppose a X-Is-WMF-Internal: True header would be fun :-D
[09:57:47] Release-Engineering, Continuous-Integration: Amir does not have +2 on integration/config - https://phabricator.wikimedia.org/T86629#973089 (hashar) Timo any thoughts? Zeljkof is on par for JJB and often pairs with Amir. Since both review each others, I think it would make sense to grant amir +2 right on the...
[09:58:04] valhallasw`cloud: :-D
[09:58:14] valhallasw`cloud: the lame rate limiter is probably enough
[09:58:44] hashar: yeah. it should prevent tool labs/gerrit from being overwhelmed, and if we run into issues with merges that are too close together, I can change it to 1 second or so
[09:58:50] hashar: https://gerrit.wikimedia.org/r/184369 at your service :)
[09:59:31] operations, Beta-Cluster: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#973092 (yuvipanda) NEW
[10:00:29] (PS4) Hashar: + wikibugs2 auto-pull after merge [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:01:30] (Merged) jenkins-bot: Deleted Jenkins job that was running VisualEditor screenshots using a local browser [integration/config] - https://gerrit.wikimedia.org/r/184587 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin)
[10:02:41] valhallasw`cloud: I am going to deploy the labs-tools-wikibugs2-autopull tool
[10:02:46] hashar: <3
[10:03:21] hashar: oh, I was wondering. Does the command return text show up in gerrit, or does it just show a ' - succesful' ?
[10:05:28] (PS5) Hashar: Add wikibugs2 auto-pull after merge [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:06:02] curl: option --max-time=10: is unknown
[10:06:03] wtf! :D
[10:06:05] oh, concurrent: False. That's awesome!
[10:06:36] hashar: should be --max-time 10
[10:06:37] :<
[10:07:03] yeah amending
[10:07:03] sorry
[10:07:16] https://integration.wikimedia.org/ci/job/labs-tools-wikibugs2-autopull/3/console !
[10:07:16] hashar: it's not you, it's curl! ;-)
[10:07:34] I'll add an extra '\n' in the php script :-)
[10:07:44] and maybe silence curl
[10:07:45] lemmesee
[10:07:47] (PS6) Hashar: Add wikibugs2 auto-pull after merge [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:08:14] hashar: -s silences the progress bar
[10:08:22] ah sure
[10:08:38] hashar: but also doesn't emit errors anymore :/
[10:09:18] https://integration.wikimedia.org/ci/job/labs-tools-wikibugs2-autopull/4/console \O/
[10:09:23] ah that is not ideal
[10:09:40] operations, Beta-Cluster: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#973108 (yuvipanda) Note that hiera changes for deployment-prep must be made before any of these get merged.
[10:09:50] -sS !
[10:10:01] -S, --show-error
[10:10:01] When used with -s it makes curl show an error message if it fails.
[10:10:41] added some \n's
[10:11:00] still exit 0 though :(
[10:11:08] (on a 404)
[10:11:22] :/
[10:12:13] -f helps
[10:12:49] welcome to #curl
[10:15:00] (PS1) Adrian Lang: Add npm job to wikibase [integration/config] - https://gerrit.wikimedia.org/r/184592
[10:15:24] valhallasw`cloud: so: curl --fail --max-time 10 https://tools.wmflabs.org/wikibugs/pull.php
[10:15:25] ?
[10:15:42] Continuous-Integration: Jenkins: Use node-jscs as checkstyle for javascript coding style - https://phabricator.wikimedia.org/T56218#973118 (adrianheine) Cool, thanks. I added `npm test`, although I didn't go through grunt since we are not running any other grunt jobs, yet: * Wikibase: https://gerrit.wikimedi...
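The curl options worked out in the exchange above can be collected into one small wrapper. This is a minimal sketch, not the Jenkins job's actual configuration: the variable `CURL_OPTS` and the function name `fetch_or_fail` are made up for illustration. The flags themselves are standard curl options, and the URL in the usage comment is the one quoted in the channel.

```shell
#!/bin/sh
# Hypothetical wrapper; the names are illustrative and not taken from the job.
# --fail        exit non-zero (22) on HTTP errors such as a 404, instead of 0
# --silent      hide the progress meter
# --show-error  still print an error message when --silent is in effect
# --max-time 10 give up after 10 seconds; the value is a separate argument
#               (the log shows "--max-time=10" being rejected)
CURL_OPTS="--fail --silent --show-error --max-time 10"

fetch_or_fail() {
    # $CURL_OPTS is left unquoted on purpose so it splits into separate flags.
    # shellcheck disable=SC2086
    curl $CURL_OPTS "$1"
}

# Usage (needs network access):
#   fetch_or_fail https://tools.wmflabs.org/wikibugs/pull.php || echo "pull failed"
```

With this combination a failed request both returns a non-zero status, which lets the calling job fail, and prints the reason, so the console shows more than a bare return code.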
[10:15:53] curl --fail --silent --show-error --max-time 10 https://tools.wmflabs.org/wikibugs/pull.php
[10:16:32] https://integration.wikimedia.org/ci/job/labs-tools-wikibugs2-autopull/6/console :D
[10:17:00] hashar: \o/
[10:17:11] hashar: er, --show-error is useful
[10:17:22] otherwise it just shows return code 22, not that it was a 404 (or 426, or whatever)
[10:19:27] (PS7) Hashar: Add wikibugs2 auto-pull after merge [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:19:31] (CR) Hashar: Add wikibugs2 auto-pull after merge (3 comments) [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:19:35] (PS8) Hashar: Add wikibugs2 auto-pull after merge [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:20:50] (CR) Hashar: "What a mess. So we end up with:" [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:21:10] hashar: *grin* I was basically cargo-culting the settings. Zuul only pulls the repo? I thought it was what linked gerrit and jenkins together
[10:21:28] Zuul receives event from Gerrit
[10:21:31] (CR) Merlijn van Deen: [C: 1] "Perfect!" [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:21:39] takes some decisions and trigger jobs in Jenkins, passing them a bunch of parameters
[10:21:48] the parameters are passed over Gearman though
[10:21:59] (CR) Hashar: [C: 2] Add wikibugs2 auto-pull after merge [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:22:07] let's see if it works
[10:22:21] valhallasw`cloud: in a few minutes the change will merge then I will deploy it and reload zuul
[10:22:29] ok :-)
[10:22:43] kart_: will look at cxserver / npm now :-] ( https://gerrit.wikimedia.org/r/#/c/184369/ )
[10:28:30] kart_: the repo does not seem to work though http://paste.openstack.org/show/156920/
[10:29:43] (Merged) jenkins-bot: Add wikibugs2 auto-pull after merge [integration/config] - https://gerrit.wikimedia.org/r/183636 (owner: Merlijn van Deen)
[10:31:16] (CR) Hashar: [C: -1] "You can remove the Jenkins job so. In jjb/mediawiki-services.yaml in project 'cxserver' drop the job '{name}-jslint' :-)" [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[10:31:44] (CR) Hashar: "Oh and mediawiki/services/cxserver/deploy probably needs .jshintrc / .jshintignore files as well!" [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[10:32:26] hashar: thanks.
[10:32:30] valhallasw`cloud: change deployed. Happy merging!
[10:35:02] Yippee, build fixed!
[10:35:02] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » en,contintLabsSlave && UbuntuTrusty build #4: FIXED in 16 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=en,label=contintLabsSlave%20&&%20UbuntuTrusty/4/
[10:42:03] hashar: https://integration.wikimedia.org/ci/job/labs-tools-wikibugs2-autopull/7/console woot :D
[10:48:54] (PS4) KartikMistry: Enable npm check on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184369
[10:52:46] valhallasw`cloud: great!
[10:57:53] (PS5) Hashar: Enable npm check on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[10:58:10] (CR) Hashar: [C: 2] Enable npm check on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[10:58:19] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » fa,contintLabsSlave && UbuntuTrusty build #5: FAILURE in 20 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=fa,label=contintLabsSlave%20&&%20UbuntuTrusty/5/
[10:58:23] kart_: lets see what happens :D
[10:58:44] (CR) jenkins-bot: [V: -1] Enable npm check on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[11:05:52] RECOVERY - Puppet failure on deployment-sca-cache01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:06:14] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » nb,contintLabsSlave && UbuntuTrusty build #5: ABORTED in 28 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=nb,label=contintLabsSlave%20&&%20UbuntuTrusty/5/
[11:06:14] (Merged) jenkins-bot: Enable npm check on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[11:08:10] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:11:54] Continuous-Integration, MediaWiki-extensions-ConfirmAccount: Setup jenkins job for ConfirmAccount extension - https://phabricator.wikimedia.org/T86637#973167 (Nemo_bis) NEW
[11:16:57] zeljkof: hi
[11:17:02] hi aharoni
[11:17:57] Continuous-Integration, MediaWiki-extensions-ConfirmAccount: Setup jenkins job for ConfirmAccount extension - https://phabricator.wikimedia.org/T86637#973174 (hashar) Open>Resolved a:hashar Seems it has the PHPUnit tests running, example https://gerrit.wikimedia.org/r/#/c/182579/
[11:19:33] zeljkof: rerunning for three languages:
[11:19:37] fa, gl, he
[11:19:41] fa is the last one that failed,
[11:19:48] gl was requested by elitre
[11:19:52] and he is mine ;)
[11:19:57] aharoni: :)
[11:20:02] https://integration.wikimedia.org/ci/view/BrowserTests/view/VisualEditor/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/
[11:20:16] Beta-Cluster: deployment-mx is its own puppetmaster - https://phabricator.wikimedia.org/T86575#973177 (hashar)
[11:22:17] Beta-Cluster: deployment-mx is its own puppetmaster - https://phabricator.wikimedia.org/T86575#971745 (hashar) Ccing Jeff Green It should be possible to point it back to the deployment-salt puppet master. Never tried though :( Note: all patches applied on puppetmaster must be sent to Gerrit first then cherr...
[11:26:41] (CR) Hashar: "Change deployed. The deploy repo fails though:" [integration/config] - https://gerrit.wikimedia.org/r/184369 (owner: KartikMistry)
[11:26:51] kart_: cxserver/deploy change has been deployed. The npm job fails though https://integration.wikimedia.org/ci/job/cxserver-npm/442/console
[11:28:35] zeljkof: the test that failed for Persian earlier, now passes.
[11:28:37] https://integration.wikimedia.org/ci/view/BrowserTests/view/VisualEditor/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=fa,label=contintLabsSlave%20&&%20UbuntuTrusty/lastBuild/console
[11:28:38] hashar: it looks, it doesn't need npm check as no real code will be there?
[11:28:47] src/ is gitsubmodule.
[11:28:51] * aharoni is making stuff green \o/
[11:29:01] kart_: just remembered that we do for parsoid
[11:29:14] aharoni: great :)
[11:29:14] kart_: there is a job template which is shared by both the parsoid and parsoid/deploy repo
[11:29:25] I see.
[11:29:28] kart_: for parsoid, we use npm install , but for parsoid/deploy we point the npm module path to /deploy
[11:29:51] kart_: thus on the /deploy repo, we run the test using whatever module is shipped by the deploy repo and don't rely on npm install. Exactly like in production
[11:30:01] hashar: can you merge https://gerrit.wikimedia.org/r/#/c/184600/
[11:30:03] kart_: might want to mimic that behavior for cxserver. Sorry figured that after the deployment
[11:30:05] zeljkof: It would be great to document somewhere as a good practice not to use text to find elements without a very good reason, because it's not internationalized by definition.
[11:30:10] the only good reason
[11:30:18] hashar: thanks :)
[11:30:44] aharoni: yes, T2001 :)
[11:30:46] hashar: where should I look for it?
[11:30:48] the only good reason to use text that I can think of is when you are typing particular and looking for that text, but that should be rare.
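The difference between the source repo and the deploy repo described above can be sketched in shell. This is only an illustration under assumptions: the function name `node_path_for`, the `*/deploy` suffix test and the `node_modules` directory name are hypothetical, and the real logic lives in the job templates, which are not shown in this log. `NODE_PATH` is the standard Node.js variable for extra module search directories.

```shell
#!/bin/sh
# Hypothetical sketch: decide where Node.js modules come from for a repo.
# For a */deploy repo, use the modules shipped in the checkout (as in
# production). For a source repo, print nothing so the caller can fall
# back to `npm install`.
node_path_for() {
    repo="$1"        # e.g. mediawiki/services/parsoid/deploy
    checkout="$2"    # path to the checked-out repository
    case "$repo" in
        */deploy) printf '%s\n' "$checkout/node_modules" ;;
        *)        printf '%s\n' "" ;;
    esac
}

# Usage (REPO and WORKSPACE are assumed to be set by the caller):
#   NODE_PATH=$(node_path_for "$REPO" "$WORKSPACE"); export NODE_PATH
#   [ -n "$NODE_PATH" ] || npm install
```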
[11:31:23] kart_: jjb/parsoidsvc.yaml specially the macro parsoid-set-env [11:31:39] kart_: it is used to figure out which NODE_PATH to use, generate a env file that can be sourced by bash [11:31:48] kart_: then each shell command starts by sourcing that crafted env [11:32:13] kart_: ideally we would make it more generic so it can be reused by all the mediawiki/services/* repos [11:32:34] kart_: OR we could create a new git repository that would hold the code/logic used to setup and run tests for such repos [11:33:33] hashar: I'm open for anything that works (good if works everywhere) [11:34:03] kart_: though I have exactly Zero time to work on that this week [11:34:11] so the npm job is probably good enough for now [11:35:24] Yippee, build fixed! [11:35:25] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » fa,contintLabsSlave && UbuntuTrusty build #6: FIXED in 16 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=fa,label=contintLabsSlave%20&&%20UbuntuTrusty/6/ [11:35:28] hashar: doing it. [11:37:09] 3Phabricator, Wikimedia-Git-or-Gerrit: Migrate Gerrit project ownership request system to Phabricator - https://phabricator.wikimedia.org/T86639#973194 (10TTO) 3NEW [11:37:48] ... I don't remember joining this channel, did it used to be something else? [11:38:09] wikimedia-qa [11:38:27] Ah. Thanks Nikerabbit :) [11:39:22] Lcawte: yeah we have migrated yesterday and greg-g has set a redirect [11:39:29] vikasyaligar: Hi! Did you notice that your works is gradually being improved? [11:40:01] hashar: no. he infact kicked me :( ;) [11:40:30] kart_: yeah I recommended to kick everyone from -qa . 
Most clients would attempt to reconnect automatically and thus end up joining this -releng :-] [11:40:57] zeljkof: Persian: GREAT SUCCESS * https://integration.wikimedia.org/ci/view/BrowserTests/view/VisualEditor/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=fa,label=contintLabsSlave%20&&%20UbuntuTrusty/lastBuild/console [11:41:12] aharoni: :) [11:42:07] zeljkof: and the fonts work very well as far as I can see. [11:42:39] aharoni: it should, it's a mac ;) [11:42:41] * aharoni is dancing around as the language screenshots are coming back to life [11:42:48] * zeljkof is out of lunch [11:42:50] aharoni: Hello ! wow ! awesome :) [11:43:00] vikasyaligar: we'd love your help ;) [11:43:15] aharoni: what do you mean by back to life ? was it down ? [11:43:38] the test jobs were failing for a few months because of various bugs, [11:43:51] and I didn't have much time to work on it because I just became a father ;) [11:44:09] but I'm gradually fixing the bugs and the jobs are going back to green. [11:44:20] aharoni: Congratulations ! [11:44:43] aharoni: yup ! I am ending one of my intern on 15th Jan,after which I will back to contribute on this :) [11:45:13] I work on it no more than a couple of hours a week because it's not my main work, but I'm trying to keep this alive. [11:45:26] I would really love to push it forward - complete all the screenshots for VisualEditor, [11:45:42] refactor this to a separate gem (N.B.: zeljkof), [11:45:51] and start applying it to other extensions. [11:46:00] so vikasyaligar , we'd really love your help. [11:46:20] aharoni: awesome ! yup; I will start doing that, from this Friday :) [11:46:40] we are fixing little bugs in VisualEditor itself as we are going along ;) [11:47:51] yup ! 
I remember seeing a mail regarding changes on UI of VisualEditor [11:49:29] (PS1) KartikMistry: Add cxserver-set-env to fix npm on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184609 [11:49:35] hashar: ^ [11:54:51] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » gl,contintLabsSlave && UbuntuTrusty build #6: SUCCESS in 35 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=gl,label=contintLabsSlave%20&&%20UbuntuTrusty/6/ [11:55:27] Phabricator, Wikimedia-Git-or-Gerrit: Migrate Gerrit project ownership request system to Phabricator - https://phabricator.wikimedia.org/T86639#973212 (Aklapper) Dup of T38269 ? [11:58:55] Phabricator, Wikimedia-Git-or-Gerrit: Migrate Gerrit project ownership request system to Phabricator - https://phabricator.wikimedia.org/T86639#973218 (TTO) Not a dupe; that task is about requesting new Gerrit repositories; this one is about +2 rights. But the two tasks are similar in scope. T38269 is also as... [12:11:32] kart_: :-] [12:12:45] kart_: ahhh code duplication [12:13:00] kart_: I am not sure what the PARSOID_PATH is for [12:14:28] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » he,contintLabsSlave && UbuntuTrusty build #6: FAILURE in 55 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=he,label=contintLabsSlave%20&&%20UbuntuTrusty/6/ [12:14:29] Let me see how I can improve it. [12:15:37] anyway. copy-paste found. [12:15:43] (CR) Hashar: [C: -1] "Yeah that is the idea.
I am wondering whether we could make the Parsoid macro more generic so it can be shared by both parsoid and cxserve" (2 comments) [integration/config] - https://gerrit.wikimedia.org/r/184609 (owner: KartikMistry) [12:15:45] (PS2) KartikMistry: Add cxserver-set-env to fix npm on cxserver/deploy [integration/config] - https://gerrit.wikimedia.org/r/184609 [12:16:57] hashar: trying to fix this. [12:17:02] ie fixing. [12:19:37] Phabricator, Wikimedia-Git-or-Gerrit: Migrate Gerrit project ownership request system to Phabricator - https://phabricator.wikimedia.org/T86639#973256 (Qgil) Can we just reuse #Wikimedia-Git-or-Gerrit for this process, asking people to file the same information that i now requested in the wiki page? [12:21:33] kart_: and I think Gabriel Wicked filed a task about it [12:22:41] Continuous-Integration: Design the Jenkins isolation architecture - https://phabricator.wikimedia.org/T86171#973257 (hashar) I started writing the document at https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isolation . Will poke once ready. [12:24:20] kart_: https://phabricator.wikimedia.org/T78410 might be relevant [12:25:23] (PS3) KartikMistry: WIP: Add generic npm-set-env to fix npm on */deploy repos [integration/config] - https://gerrit.wikimedia.org/r/184609 [12:25:25] hashar: first try now. [12:25:44] hashar: what is the value of {repository} ? [12:25:51] (should have asked earlier :D) [12:27:06] RESTBase, Continuous-Integration, Parsoid, Services: Move Parsoid and RESTBase testing from Travis CI to our Jenkins - https://phabricator.wikimedia.org/T78410#973258 (hashar) Possibly we could get cassandra added to the current CI slaves but the setup/teardown is going to be a bit cumbersome. CiviCRM has a s...
[12:27:24] kart_: source or deploy [12:27:37] kart_: the job template has in its name {repository} [12:27:51] so for the parsoid project we release the job template with the values: repository: [ 'source', 'deploy' ] [12:28:04] in the shell script, you can then switch between the values to set the NODE_PATH to different paths [12:28:11] i.e. for 'source', point to '/node_modules' [12:28:18] for 'deploy', point to '/deploy/node_modules' [12:29:36] (PS4) KartikMistry: WIP: Add generic npm-set-env to fix npm on */deploy repos [integration/config] - https://gerrit.wikimedia.org/r/184609 [12:29:56] bah. [12:30:05] Beta-Cluster: Kill role::logstash::beta - https://phabricator.wikimedia.org/T86642#973263 (yuvipanda) NEW [12:30:23] I think I'll stop here, not really time to fix new thing :) [12:30:38] hashar: I'll work on this later after meeting. [12:31:39] PROBLEM - Puppet failure on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [12:33:43] (CR) Hashar: WIP: Add generic npm-set-env to fix npm on */deploy repos (4 comments) [integration/config] - https://gerrit.wikimedia.org/r/184609 (owner: KartikMistry) [12:33:53] kart_: yeah and sync with cscott / gwicke [12:41:42] YuviPanda: how do you refer to the zone / network segment that holds labsmonXXX ? is that just "labs infra" ? [12:41:56] YuviPanda: I mean how it is different from the virtXXX machines [12:42:00] hashar: hmm, labs subnet, I think?
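The NODE_PATH switching described at 12:27-12:28 can be sketched roughly as follows. This is only an illustration, not the actual parsoid-set-env macro from jjb/parsoidsvc.yaml: the function names, the `workspace` argument and the use of Python instead of the job's shell script are all assumptions. Only the two sub-paths (node_modules for 'source', deploy/node_modules for 'deploy') and the idea of an env file that bash can source come from the discussion.

```python
import posixpath
import shlex

# Sub-directory to use for each {repository} value, as named in the chat.
NODE_MODULES_BY_REPOSITORY = {
    "source": "node_modules",
    "deploy": posixpath.join("deploy", "node_modules"),
}


def node_path_for(repository, workspace):
    """Return the NODE_PATH for a repository kind (hypothetical helper).

    `workspace` is an assumed base directory; the chat only gives the
    sub-paths, not where they are rooted.
    """
    if repository not in NODE_MODULES_BY_REPOSITORY:
        raise ValueError("unknown repository kind: %r" % (repository,))
    return posixpath.join(workspace, NODE_MODULES_BY_REPOSITORY[repository])


def render_env_file(repository, workspace):
    """Return the text of an env file that a bash step could source."""
    path = node_path_for(repository, workspace)
    # shlex.quote keeps the line safe to source even if the path has spaces.
    return "export NODE_PATH=%s\n" % shlex.quote(path)


if __name__ == "__main__":
    for kind in ("source", "deploy"):
        print(render_env_file(kind, "/tmp/workspace"), end="")
```

Each later shell step would then start by sourcing the generated file, which is the pattern hashar describes for the existing Parsoid jobs.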
[12:42:05] they’re all in the same subnet, IIRC [12:42:06] unsure [12:42:12] I will go for that thanks [12:44:56] Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973295 (yuvipanda) NEW [12:46:14] Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973295 (yuvipanda) [12:47:40] Beta-Cluster: Kill role::syslog::centralserver::beta - https://phabricator.wikimedia.org/T86645#973309 (yuvipanda) NEW [12:49:45] Beta-Cluster: Kill role::syslog::centralserver::beta - https://phabricator.wikimedia.org/T86645#973321 (yuvipanda) Hiera change in https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=140717&oldid=140084 [12:59:33] YuviPanda: s/kill/phase out/ [12:59:38] YuviPanda: kill is offending :] [13:00:02] YuviPanda: can we get the hiera stuff in puppet yaml files? [13:00:13] hashar: instead of wikitech? [13:00:18] yup [13:00:22] just wondering [13:00:26] hmm [13:00:29] theoretically, sure [13:00:29] in case we want to have the hiera changes to be reviewed [13:00:34] don't waste your time on it :] [13:00:35] yeah [13:07:54] Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973348 (yuvipanda) [13:07:55] Beta-Cluster: Kill role::syslog::centralserver::beta - https://phabricator.wikimedia.org/T86645#973346 (yuvipanda) Open>Resolved a:yuvipanda [13:33:21] greg-g: hashar lots of puppet failures because someone again forgot to keep the local cherry picks on betalabs puppet master [13:34:40] fixed [13:37:13] Phabricator: Phabricator project names and hashtags inconsistently use spaces, hyphens, and underscores - https://phabricator.wikimedia.org/T75994#973389 (Aklapper) [13:46:23] YuviPanda: :-( [13:46:31] YuviPanda: we need a better system [13:46:40] hashar: yeah.
need to get rid of the local hacks [13:46:49] YuviPanda: maybe we can list in hiera a series of patches to cherry pick [13:47:04] nooooo [13:47:08] hmm [13:47:14] hashar: we could write a helper script [13:47:18] that does the cherry-picking for you [13:47:23] I thought about having a branch that would be merged locally [13:47:26] so you don’t accidentally end up doing something else [13:47:31] but branch rebasing would not work nicely with gerrit [13:49:45] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [13:51:33] !log created user wikiadmin on deployment-db1 [13:51:38] Logged the message, Master [13:55:23] Beta-Cluster: shell wrapper to connect to databases - https://phabricator.wikimedia.org/T47706#973426 (yuvipanda) I just created the wikiadmin user with same password as mw user and dropped that LOCAL HACK from deployment-salt. Seems to work fine on deployment-bastion. [13:55:42] hashar: do you know where to change ^ for mediawiki? [13:56:25] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:56:40] YuviPanda: PrivateSettings.php wherever operations/mediawiki-config.git is cloned at [13:56:49] hashar: yeah, looking through that now [13:56:49] probably /srv/mediawiki-staging on deployment-bastion [13:56:59] would need to adjust / create the relevant db user on db1 and db2 [13:57:24] hashar: yeah, have done [13:57:49] hashar: do I need to run something after modifying this?
[13:58:01] !log modified PrivateSettings.php to make it use wikiadmin user rather than mw user [13:58:03] Logged the message, Master [13:58:10] no clue :] [13:58:19] heh [13:58:33] !log running scap, because why not [13:58:34] Logged the message, Master [13:58:35] there is a `sql` wrapper script that queries mediawiki config to get the user/pass [13:58:43] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:58:46] !log scap failed [13:58:48] Logged the message, Master [13:58:49] bah [13:58:56] you can also run scap from jenkins :] [13:58:58] hashar: yeah, I fixed that as well [13:59:12] by running https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [13:59:23] well done yuvi! [13:59:52] !log running scap via jenkins, hitting buttons on https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [13:59:54] Logged the message, Master [14:01:26] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:01:40] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [14:03:30] Release-Engineering, Beta-Cluster: Reduce [LOCAL HACK] changes on Beta Cluster to zero - https://phabricator.wikimedia.org/T76392#973439 (yuvipanda) [14:03:34] Beta-Cluster: shell wrapper to connect to databases - https://phabricator.wikimedia.org/T47706#973436 (yuvipanda) Open>Resolved a:yuvipanda And switched MW to run from that, by editing PrivateSettings.php on deployment-bastion and running scap.
\o/ [14:06:35] Project browsertests-Wikidata-PerformanceTests-linux-firefox-sauce build #118: FAILURE in 34 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-PerformanceTests-linux-firefox-sauce/118/ [14:08:46] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [14:15:44] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:15:53] YuviPanda: the future mess I am working on https://commons.wikimedia.org/wiki/File:Integrationwikimediaci-architecture-isolation.svg :-D [14:16:42] hashar: nice! [14:16:50] but rather messy [14:19:14] puff phabricator does not support svg [14:21:03] Continuous-Integration: Design the Jenkins isolation architecture - https://phabricator.wikimedia.org/T86171#973521 (hashar) I have drawn a general overview of the envisioned architecture https://www.mediawiki.org/wiki/File:Integrationwikimediaci-architecture-isolation.svg {F28403} [14:23:04] Continuous-Integration: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#973532 (hashar) NEW [14:25:06] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973540 (hashar) NEW [14:25:23] Continuous-Integration: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#973548 (hashar) [14:25:48] Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#514898 (hashar) [14:27:39] Continuous-Integration: Disable Gerrit replication to production slaves - https://phabricator.wikimedia.org/T86661#973557 (hashar) NEW [14:27:51] Continuous-Integration: Disable Gerrit replication to production slaves - https://phabricator.wikimedia.org/T86661#973557 (hashar) [14:28:10] Continuous-Integration: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#973564 (hashar) [14:29:09]
(PS5) KartikMistry: WIP: Add generic npm-set-env to fix npm on */deploy repos [integration/config] - https://gerrit.wikimedia.org/r/184609 [14:30:25] Phabricator: Create a green goal project 'contint-isolation' - https://phabricator.wikimedia.org/T86662#973567 (hashar) NEW [14:32:29] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973586 (hashar) [14:33:00] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973540 (hashar) [14:37:41] (PS1) Hashar: Move mediawiki-gate to lab slaves [integration/config] - https://gerrit.wikimedia.org/r/184634 [14:40:25] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973606 (hashar) [14:41:07] (CR) Hashar: [C: 2] Move mediawiki-gate to lab slaves [integration/config] - https://gerrit.wikimedia.org/r/184634 (owner: Hashar) [14:44:46] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973625 (hashar) [14:45:40] Phabricator, Wikimedia-Git-or-Gerrit: Migrate Gerrit project ownership request system (+2 rights) to Phabricator - https://phabricator.wikimedia.org/T86639#973626 (Aklapper) [14:47:28] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [14:47:48] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:49:38] (Merged) jenkins-bot: Move mediawiki-gate to lab slaves [integration/config] - https://gerrit.wikimedia.org/r/184634 (owner: Hashar) [14:49:39] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:49:44] PROBLEM - Puppet
failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:50:34] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:50:40] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:51:34] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:51:44] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:51:48] PROBLEM - Puppet failure on deployment-sca-cache01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [14:52:16] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:52:25] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:52:25] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:52:27] PROBLEM - Puppet failure on deployment-sentry2 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:52:45] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:52:59] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:53:05] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:54:47] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [14:54:50] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973669 (10yuvipanda) 3NEW 
[14:55:16] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [14:56:24] PROBLEM - Puppet failure on deployment-stream is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:56:43] greg-g: hashar ^ all caused by coren messing around in nfs, I’m told [14:57:47] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:57:53] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » ar,contintLabsSlave && UbuntuTrusty build #7: FAILURE in 12 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=ar,label=contintLabsSlave%20&&%20UbuntuTrusty/7/ [14:57:58] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [14:58:51] 3Beta-Cluster, operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973691 (10yuvipanda) New users created by wikitech should also be set to /bin/bash rather than sillyshell [14:59:12] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:59:42] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [14:59:46] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:00:04] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:00:40] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:00:57] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:01:37] PROBLEM - Puppet 
failure on deployment-db2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:01:41] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:01:45] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:01:50] YuviPanda: well done [15:02:43] hashar: hopefully I can kill the second local hack tomorrow [15:03:35] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:03:55] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:04:41] PROBLEM - Puppet failure on deployment-restbase03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:04:55] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [15:05:50] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:06:18] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:07:00] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:07:15] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:07:31] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:07:33] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » it,contintLabsSlave && UbuntuTrusty build #7: FAILURE in 21 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=it,label=contintLabsSlave%20&&%20UbuntuTrusty/7/ 
[15:08:11] hashar: no reason for sillyshell right now. I also emailed ops@ [15:09:07] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:09:42] Beta-Cluster, operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973726 (Chad) Sillyshell, for those unaware, was a dead-simple shell that wrapped svnserve & co. and we used to run it on the Subversion box. We committed using svn+ssh but didn't want to giv... [15:10:41] YuviPanda: yeah I am not sure what it does exactly. Brion or Tim wrote it ages ago [15:10:44] (pre git era for sure [15:11:01] hashar: yeah, ^d just commented on the phab ticket [15:11:36] <^d> hashar: That'd be Brion, fwiw :) [15:22:16] hashar: can you remove your -1? :) [15:22:23] ^d: also can you +1? :) [15:22:37] https://gerrit.wikimedia.org/r/#/c/184635/ [15:23:20] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973762 (hashar) [15:24:15] YuviPanda: done [15:24:22] thank you [15:25:03] Release-Engineering, Beta-Cluster: Reduce [LOCAL HACK] changes on Beta Cluster to zero - https://phabricator.wikimedia.org/T76392#973766 (yuvipanda) Really close now, just needs T86668 which should happen in a day or two :D [15:36:09] Is zuul still up? [15:36:22] hashar: Did you see our !log entries from last night? I got to help James_F|Away fix zuul. [15:36:40] marktraceur: haven't seen them [15:36:49] Had to do a hard restart. [15:36:51] marktraceur: I was probably asleep already [15:36:53] :-( [15:36:55] Figured you should be aware [15:36:57] what happened? [15:37:09] if you had a bunch of changes blocked on zuul status page with job completed [15:37:19] the root cause is Zuul deadlocking waiting for Gerrit [15:37:35] in this case one has to restart Zuul entirely. It started occurring 2 weeks ago [15:37:45] Not really sure!
There was a postmerge build that finished about two hours prior, but showed up on zuul-status as pending, and a few VE tasks that showed up as mostly successful but weren't going through [15:38:16] Like, I think it was in the log as a success, but the +2 from zuul was not forthcoming [15:38:35] It was 22:30 here, 20:30 in SF, I hardly think there was a deadlock, not a lot of traffic. But I could be wrong. [15:39:21] (PS1) Hashar: Move mediawiki regression jobs to lab slaves [integration/config] - https://gerrit.wikimedia.org/r/184641 [15:40:04] hashar: I gotta say, I agree with James_F|Away, we should throw more hardware at the problem. :) [15:40:18] marktraceur: that looks like the Zuul deadlocking on Gerrit not replying :-/ [15:40:27] in this case, that is a bug in Zuul [15:40:37] when it attempts to comment back in Gerrit, there is no timeout [15:41:09] so if the Gerrit comment fails for some reason (such as not being able to reach the database), the ssh connection is stalled and Zuul deadlocks waiting for a reply :-/ [15:41:33] marktraceur: and more hardware is being worked on https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isolation [15:42:20] (CR) Hashar: [C: 2] "Updated jobs:" [integration/config] - https://gerrit.wikimedia.org/r/184641 (owner: Hashar) [15:43:04] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973811 (hashar) [15:44:05] hashar: Maybe the answer is to write a more graceful failure status. [15:44:24] Like...shunt the gerrit messages to a redis queue and have a different service try posting the queued messages to gerrit. [15:44:40] I know a certain YuviPanda who loves doing shit with redis...
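The missing timeout hashar describes at 15:40-15:41 can be illustrated with a small generic sketch. This is not Zuul's code and makes no claim about how Zuul talks to Gerrit internally: `run_with_timeout` is a hypothetical helper, and a child process that just sleeps stands in for a stalled Gerrit connection. The point is only that a bounded wait turns a silent deadlock into a failure the caller can log, retry or queue.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run `command` and return (ok, detail) instead of blocking forever.

    Hypothetical helper: a reporter could call this around whatever
    command posts a review comment, then retry or queue on failure.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, "timed out after %s s" % timeout_seconds
    if completed.returncode != 0:
        return False, completed.stderr.strip()
    return True, completed.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a stalled remote call: sleeps longer than the timeout.
    stalled = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(stalled, 0.5))
```

A queue like the one marktraceur suggests would sit on top of this: messages that come back with `ok` set to False stay queued for a later attempt rather than blocking the scheduler.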
[15:46:09] I DIDN’T DO IT [15:46:37] marktraceur: that’s mostly an acquired habit from spending time looking at ori’s code [15:47:17] also not it :) [15:47:20] * YuviPanda goes away to a bus [15:47:29] COWAAAAAAAAAARD [15:48:46] * YuviPanda hits marktraceur on the face with a glove [15:48:51] (Merged) jenkins-bot: Move mediawiki regression jobs to lab slaves [integration/config] - https://gerrit.wikimedia.org/r/184641 (owner: Hashar) [15:48:55] hmm, much more effective with a metal gauntlet, I suppose [15:49:15] Yes, but funnier with a dainty glove. [15:49:22] And if I precede it by throwing wine in your face. [15:49:29] git pull && git rebase --preserve-merges && git lg [15:49:31] marktraceur: until my hand freezes from the cold? [15:49:33] pff [15:49:37] sorry [15:49:48] * YuviPanda pokes bd808 with https://gerrit.wikimedia.org/r/#/c/184618/ [15:51:59] YuviPanda: sure. If you really want to clean it up though the config should probably become params that can be set with hiera [15:52:17] bd808: yup, yup. one step at a time, etc [15:52:25] right now you just took one hard coded role and made it two. :) [15:52:39] bd808: composing them around :) [15:52:46] bd808: just trying to kill all ::beta roles [15:53:01] *nod* a noble goal [15:53:25] bd808: also do we use the IRC log at all? [15:53:31] we should either switch all SAL to that [15:53:32] or not [15:53:57] It's active and I use it occasionally, but its more of a POC than anything today [15:54:28] There is an old feature request to kill SAL which kind of got me started on it [15:54:41] yeah [15:54:52] Extension:SAL!
[15:54:53] :) [15:55:03] also being able to show SAL events on the log message timeline would be handy for forensics [15:55:26] bd808: yup, we need an ‘operations event log’ showing SAL, deploys, puppet merges [15:55:37] uh, and icinga alerts [15:55:59] bd808: and it should probably be in prod rather than labs [15:56:06] agreed [15:56:13] logstash seems like a good thing to use for all of this [15:56:45] once you get the data in elasticsearch you can do lots of things with it. [15:56:46] bd808: also local hacks on deployment-prep should all be gone by tomorrow :) [15:56:48] yeah [15:56:52] all? [15:57:03] I don't believe you [15:57:20] bd808: I mean, all the [LOCAL HACKS] that can’t be merged in prod :) [15:57:39] bd808: I fixed the db related one already, and have patches in place to fix the underlying cause of the mwdeploy one [15:57:58] (there’s an ops@ thread) [15:58:09] oh. found ways around them all? That's cool. The latest one that got a -2 in prod was the apache used uid cleanup [15:58:18] 3Continuous-Integration: Design the Jenkins isolation architecture - https://phabricator.wikimedia.org/T86171#973835 (10hashar) I have poked the internal ops list to get some early feedbacks. [15:58:36] YuviPanda: https://gerrit.wikimedia.org/r/#/c/178690/ [15:58:53] bd808: bah, not *that*. 
Haven’t understood that one properly yet [15:58:59] bd808: should prefix with [LOCAL HACK] :) [15:59:23] Well I hoped that folks would actually fix the mess when I wrote it [15:59:51] but I didn't catch it until after the trusty reimage :( [16:02:35] (PS1) Hashar: Move *puppet-validate jobs to labs [integration/config] - https://gerrit.wikimedia.org/r/184644 [16:03:14] (PS2) Hashar: Move *puppet-validate jobs to labs [integration/config] - https://gerrit.wikimedia.org/r/184644 [16:06:24] hashar: just woke up & saw your mail re VM isolation in CI ;) [16:08:19] it's also proposed for discussion in https://phabricator.wikimedia.org/T86372, but maybe it'll be moot by then [16:10:38] (PS3) Hashar: Move *puppet-validate jobs to labs [integration/config] - https://gerrit.wikimedia.org/r/184644 (https://phabricator.wikimedia.org/T62164) [16:11:07] Continuous-Integration: Jenkins: Use web proxy to let git access repositories on from GitHub (e.g. submodules) - https://phabricator.wikimedia.org/T62164#973856 (hashar) I have made translatewiki-puppet-validate (which has github repos as submodule) to run on labs slaves. It fetches the submodules and pass. [16:11:24] Continuous-Integration: Jenkins: Use web proxy to let git access repositories on from GitHub (e.g.
submodules) - https://phabricator.wikimedia.org/T62164#973857 (hashar) Open>Resolved a:hashar [16:11:47] Multimedia, Beta-Cluster, MediaWiki-extensions-Sentry: Deploy Sentry on beta cluster - https://phabricator.wikimedia.org/T78807#973862 (greg) [16:12:18] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973865 (hashar) [16:12:48] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#973540 (hashar) [16:13:20] gwicke: damn the mw summit [16:13:31] gwicke: I forgot to schedule sessions or to write slides :-: [16:14:24] Jenkins is deadlocked again :( [16:14:37] the beta-mediawiki-config-update-eqiad have no executors available [16:14:49] hashar: it's early days for slides [16:15:13] gwicke: my week is full already :-/ Got board meetings + tenant visiting the apartment [16:15:21] most of the sessions that are currently scheduled are discussions [16:15:27] so not many slides needed [16:15:32] sounds good :] [16:15:52] ah beta Jenkins jobs are just waiting for scap to finish [16:16:18] gotta rush out for groceries, be back for the weekly checkin [16:16:24] gwicke: feel free to reply to my ops mail :] [16:16:47] gwicke: also kartik started refactoring some parsoid job so it can be used by other mediawiki/services repos (ex: cxserver ) [16:17:24] gwicke: there https://gerrit.wikimedia.org/r/#/c/184609/ could in theory let us use the same job template for all the node js repos having a /deploy repo :] [16:17:27] hashar: kk, could be great to support .jenkins.yml or something very similar [16:17:47] .travis.yml [16:17:50] well we have entry points such as https://www.mediawiki.org/wiki/Continuous_integration/Test_entry_points [16:17:54] ie npm test [16:18:05] but you can't apt-get install cassandra this way hehe [16:18:20] yeah, for that you need more things in .travis.yml [16:18:35]
https://github.com/wikimedia/restbase/blob/master/.travis.yml#L6 [16:19:11] the node_js stanza is also interesting, as it determines the node versions to test [16:19:20] Multimedia, Beta-Cluster, MediaWiki-extensions-Sentry: Deploy Sentry on beta cluster - https://phabricator.wikimedia.org/T78807#973883 (greg) @tgr tells me that this extension will change a lot between now and deployment to production, but I'd like @csteipp to do a cursory glance at the idea of the extension... [16:20:04] bd808: btw, you were cc'd on that task initially, I removed you assuming it was just "add bd808 because he does beta things", if it was an otherwise legitimate cc, sorry! :) [16:20:20] bd808: where "that task" == https://phabricator.wikimedia.org/T78807 (sentry) [16:21:06] (CR) Hashar: [C: 2] "Jobs updated and passing \O/" [integration/config] - https://gerrit.wikimedia.org/r/184644 (https://phabricator.wikimedia.org/T62164) (owner: Hashar) [16:21:08] I had the distinction of hand installing the first version of Sentry in beta but at this point I think tgr probably knows a lot more about it than I do. [16:21:36] gwicke: maybe one day we will be able to reuse the Travis client :] [16:21:59] off for quick grocery shopping. be back for the weekly checkin [16:23:48] greg-g: I have large doubts that Sentry will easily scale to handle actual production traffic. Emphasis on "easily" in that statement. I'm sure there are ways to scale it, but my limited Django production experience was that it had challenges growing past certain levels of traffic.
[16:24:28] In the case of the app I was trying to scale it was due to the choices inherent in the Django ORM [16:25:04] I will happily stay the hell out of the way and let tgr work with ops however :) [16:28:38] (Merged) jenkins-bot: Move *puppet-validate jobs to labs [integration/config] - https://gerrit.wikimedia.org/r/184644 (https://phabricator.wikimedia.org/T62164) (owner: Hashar) [16:29:02] Beta-Cluster, operations: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#973943 (greg) p:Triage>Normal [16:29:22] (PS6) KartikMistry: WIP: Add generic npm-set-env to fix npm on */deploy repos [integration/config] - https://gerrit.wikimedia.org/r/184609 [16:29:37] greg-g: hola [16:34:40] Project beta-scap-eqiad build #38041: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38041/ [16:37:32] kart_: hola senor [16:38:07] bd808: noted on staying the hell out of the way :) [16:38:51] Services, Continuous-Integration, RESTBase, Parsoid: Move Parsoid and RESTBase testing from Travis CI to our Jenkins - https://phabricator.wikimedia.org/T78410#973956 (Jdouglas) It's worth noting that Travis CI slowness hasn't been an issue recently. It only occasionally crops up when the team is working on... [16:39:41] Eurgh. Has Zuul died again? [16:40:05] Last item queued 10 minutes ago; first, 14. No activity in the graphs. [16:40:11] Yippee, build fixed! [16:40:11] Project beta-scap-eqiad build #38042: FIXED in 2 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38042/ [16:40:39] Is it stuck on writing responses into gerrit again? [16:42:59] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [16:44:11] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [16:44:20] James_F: again?
is there a bug or something about that where someone like twentyafterfour or chrismcmahon or anyone could see what the issue was? [16:44:39] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:44:41] (if not, someone on my team should do that :) ) [16:45:02] greg-g: I thought hashar was dealing with that earlier [16:45:13] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [16:45:14] greg-g: See scrollback for last night. This is a recurring issue. [16:45:33] chrismcmahon: [16:45:33] 11:21 <+ hashar> off for quick grocery shopping. be back for the weekly checkin [16:45:37] 11:22 < hashar!~sempitern@mediawiki/hashar [] [16:45:41] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [16:46:21] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:46:23] in that case, redirecting: chrismcmahon < James_F> greg-g: See scrollback for last night. This is a recurring issue. :) [16:46:37] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [16:46:42] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:46:49] Last night marktraceur and I fixed it by hard-restarting Zuul (after trying all the alternatives). [16:47:04] I already told hashar [16:47:21] James_F: It was a problem with zuul getting deadlocked because Gerrit wasn't responding [16:47:29] Yup. [16:47:41] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:47:55] The task in Phabricator is https://phabricator.wikimedia.org/T65760 [16:48:10] Oh, wait, no. Different thing? [16:48:10] James_F: what are the down sides of that fix? do we lose all the builds in-progress waiting to report back? [16:48:18] Yeah. 
[16:48:20] :( [16:48:24] greg-g: If we have to hard-restart, yes, loss of all in-flight stuff. [16:48:26] greg-g: A bunch of rechecks happened last night. [16:48:39] greg-g: But it does actually come back up, so people can get on with their work. :-) [16:48:43] right [16:48:54] marktraceur: can you tell chrismcmahon what you did? [16:48:58] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [16:49:19] chrismcmahon: Logged into Gallium, sudo -su zuul, service zuul restart [16:49:22] greg-g: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart [16:49:28] * greg-g nods [16:49:30] I think a grand total of five people have the rights for that [16:49:44] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:49:47] Maybe we should give more people rights, but yes. [16:50:02] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:50:10] marktraceur: see also: https://phabricator.wikimedia.org/T85936 [16:50:31] chrismcmahon: can you do that before our team meeting, please [16:50:54] greg-g: nope. I can't restart Zuul [16:50:56] RECOVERY - Puppet failure on deployment-cache-mobile03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:51:12] chrismcmahon: no access or why? [16:51:32] I haven't been able to log in to gallium for some time. I need to figure out why my ssh is not working off bastion. 
[16:52:18] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [16:52:24] chrismcmahon: please figure that out as soon as possible, we need you to be able to help in these situations and an ssh config issue shouldn't be the limiting part [16:53:33] RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:53:36] chrismcmahon: if you can get it resolved shortly, file a bug for it and get hashar/ops/someome's help to get it resolved [16:54:10] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:54:35] are these shinken "recovery" notifications at all helpful? I find it incredibly distracting when every-other-message is a bot [16:54:44] RECOVERY - Puppet failure on deployment-restbase03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:54:58] RECOVERY - Puppet failure on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:55:10] <^d> They'd be nice if puppet stopped flapping and the message was useful. [16:55:22] that [16:55:48] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [16:56:11] but they don't convey any information at all [16:56:16] RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:56:24] ah, as in, the exact error/whatever? [16:56:35] anything [16:56:42] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:56:50] sure, consider that part of your wheel house for beta cluster q3 work :) [16:56:51] RECOVERY - Puppet failure on deployment-sca-cache01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:56:55] puppet failure is ok: ok? less than 1.00% above threshold? 
[16:58:38] it's particularly bad right now [16:58:38] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [16:58:39] RECOVERY - Puppet failure on deployment-apertium01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:58:39] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:58:39] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:58:39] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:58:39] shinken doesn't ever tell you anything. it's worthless information [16:58:39] even if I log into the web interface it's got nothing [16:58:39] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:58:39] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:58:39] twentyafterfour: 11:56 <+ greg-g> sure, consider that part of your wheel house for beta cluster q3 work :) [16:59:20] (you can probably con yuvi to help you with it) [16:59:45] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:59:45] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:59:45] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:00:03] aloha [17:00:31] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [17:00:41] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [17:00:46] aharoni: going into our team meeting now, will bbiab [17:01:23] RECOVERY - Puppet failure on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [17:01:34] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less 
than 1.00% above the threshold [0.0] [17:01:43] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [17:02:29] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:02:29] RECOVERY - Puppet failure on deployment-sentry2 is OK: OK: Less than 1.00% above the threshold [0.0] [17:02:39] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [17:03:09] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [17:03:59] I wonder why is this a total failure: https://integration.wikimedia.org/ci/view/BrowserTests/view/VisualEditor/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=he,label=contintLabsSlave%20&&%20UbuntuTrusty/ [17:04:14] It worked earlier today, [17:04:23] and VE seems to work on http://en.wikipedia.beta.wmflabs.org/ [17:05:05] but now the language screenshots tests fail completely. [17:07:50] aharoni: Zuul is currently not processing, so… [17:13:24] zuul stuck? [17:13:27] 3Release-Engineering: Fix browsertests-VisualEditor-language-screenshot-windows_8.1-firefox Jenkins job - https://phabricator.wikimedia.org/T76133#973978 (10greg) a:3zeljkofilipin [17:14:41] 3Release-Engineering: Fix browsertests-VisualEditor-language-screenshot-windows_8.1-firefox Jenkins job - https://phabricator.wikimedia.org/T76133#973981 (10hashar) That job has been deleted by https://gerrit.wikimedia.org/r/#/c/183480/ which replaced it with a mac os version. 
[17:14:56] (CR) Hashar: "The firefox job was failing apparently ( https://phabricator.wikimedia.org/T76133 )" [integration/config] - https://gerrit.wikimedia.org/r/183480 (https://phabricator.wikimedia.org/T78648) (owner: Zfilipin) [17:22:28] legoktm: what we just typed in our meeting etherpad: [17:22:29] Zuul deadlocks from time to time due to Gerrit not answering when Zuul reports a comment and Zuul has no timeout [17:22:32] * https://wikitech.wikimedia.org/wiki/Incident_documentation/20150106-Zuul [17:22:37] that's ^ the issue [17:22:59] antoine/someone will restart zuul soon [17:23:55] Didn't hashar say he was going somewhere? [17:24:15] He's offline. [17:24:18] Want me to do it? [17:24:37] Timo just arrived here at the office and urged me to do the earlier steps first. [17:24:47] Well, let me know [17:25:00] It looks the same as it did last night. [17:25:06] That's what I said. [17:25:13] And now he's gone. Oy. [17:25:36] I have a reload queued up, I can change it to a restart right quick [17:26:07] marktraceur: Should we at least try a Gearman restart? [17:26:09] I'mma put some water on to boil. [17:26:16] James_F: Can you do that or do I have to? [17:26:19] I can. [17:26:23] It's something I'm not familiar with, but yeah, go for it [17:26:58] !log Trying a shutdown/re-enable of Jenkins. [17:27:00] Logged the message, Master [17:30:35] !log No effect. Restarting Gearman. [17:30:37] Logged the message, Master [17:31:15] RIP wmf-insecte. [17:31:21] it's zuul [17:31:25] not the other things [17:31:35] greg-g: Timo insists it probably isn't. [17:31:37] need a restart of either zuul or gerrit [17:31:40] huh [17:31:42] kk [17:32:31] !log No effect from restarting Gearman. Getting Timo to restart Zuul. [17:32:33] Logged the message, Master [17:35:03] Well fine. [17:35:17] I KNOW WHEN I'M NOT WANTED. [17:38:06] Welcome back hashar [17:42:25] hashar: Did you restart Zuul?
[17:42:59] James_F: yes [17:43:03] James_F: got deadlocked [17:43:05] hashar: Without a !log? [17:43:12] in ops [17:43:25] hashar: RelEng restarts are meant to be logged in here. [17:43:25] !log Restarted deadlocked Zuul, which drops ALL events. Reason is Gerrit lost connection with its database which is not handled by Zuul. See https://wikitech.wikimedia.org/wiki/Incident_documentation/20150106-Zuul [17:43:27] Logged the message, Master [17:43:31] mid-meeting zuul restart :) [17:43:31] !sal [17:43:31] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [17:43:36] I don't think I ever logged them here [17:43:39] but can :] [17:43:44] hashar: The rest of us do. [17:43:46] :-) [17:44:01] thanks all for trying to help [17:44:04] James_F: will do from now on! [17:44:12] marktraceur: James_F thank you [17:44:24] James_F: the root cause is the Gerrit MySQL database being flappy [17:44:26] hashar: Thanks. [17:44:40] James_F: and Zuul not handling the case properly, there is no timeout and a stalled connection ends up deadlocking the poor zuul [17:44:46] there is some stacktrace at https://wikitech.wikimedia.org/wiki/Incident_documentation/20150106-Zuul [17:44:48] hashar: Should we move gerrit to be more performant or something? [17:44:58] well [17:45:09] Release-Engineering: Create an outline of QA/Browser test workshops to give - https://phabricator.wikimedia.org/T425#974018 (Cmcmahon) Open>Resolved a:Cmcmahon Done: Advanced Topics proposed session: https://phabricator.wikimedia.org/T86070 - https://www.mediawiki.org/wiki/MediaWiki_Developer_Summit... [17:45:15] could get someone to investigate whether there is an issue with whatever DB host hosts the Gerrit DB [17:45:24] or whether Gerrit has some issue on its own [17:45:30] + have to teach Zuul to timeout :-] [17:46:06] Fixing Zuul to be less fragile would be good, yes.
:-) [17:46:38] yup [17:46:49] but I am reconsidering my python skill level [17:46:57] seems each time I send a patch my tests fail horribly :-( [17:47:07] I have a patch which is "just" swapping two lines [17:47:21] it is definitely right, but the test keeps failing for random / different reasons [17:48:51] Helpful. [17:48:58] !log If Zuul status page ( https://integration.wikimedia.org/zuul/ ) shows a lot of changes with completed jobs and the number of results growing, Zuul is deadlocked waiting for Gerrit. Have to restart it on gallium.wikimedia.org with /etc/init.d/zuul restart [17:49:01] Logged the message, Master [17:49:10] ok I am out of here. See you tomorrow! [17:50:17] hashar: Hm.. why would a deadlock need a full zuul restart? [17:50:29] shouldn't gearman relaunch have fixed it? [17:50:33] The same happened yesterday [17:51:10] that is a deadlock in the process queue manager :-/ [17:51:15] while sending report of result back to Gerrit [17:51:22] there is some stacktrace at https://wikitech.wikimedia.org/wiki/Incident_documentation/20150106-Zuul#Diagnostic [17:51:44] surely the reporting should be async / retry on Gerrit error [17:52:26] hashar: did restart actually work? Or did you have to stop/start? [17:52:33] I find that with deadlock, restart doesn't work [17:53:11] yup [17:53:23] the restart is sending an event to the queue asking for reconfiguration [17:53:36] but since the deadlock is in the queue processing loop, the reconfigure event is never handled [17:53:43] s/reconfiguration/restart/ [17:54:04] one way to fix it is to kill the stalled ssh connection [17:54:09] but I haven't found how to do it in Gerrit [17:54:31] one sure thing, whenever Gerrit is unreachable, Zuul misbehaves and raises exceptions for a bunch of actions :-] [17:55:10] And gerrit is getting unreachable more often? [17:55:27] Or it's just bad luck that this has happened 3 (more?) times in a week?
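[editor's note] The "no timeout" and "async / retry on Gerrit error" points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Zuul's actual code: `call_with_timeout`, `report_with_retry`, `ReportTimeout`, and the `report` callable are hypothetical names invented for the example. The idea is that the blocking report call runs in a daemon thread, so a connection that never answers costs the caller at most `timeout` seconds instead of stalling the queue-processing loop.

```python
import queue
import threading


class ReportTimeout(Exception):
    """Raised when the reporter does not answer within the deadline."""


def call_with_timeout(func, timeout, *args):
    """Run func(*args) in a daemon thread; give up after `timeout` seconds."""
    outcome = queue.Queue(maxsize=1)

    def worker():
        try:
            outcome.put((True, func(*args)))
        except Exception as exc:  # hand the failure back to the caller
            outcome.put((False, exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        ok, value = outcome.get(timeout=timeout)
    except queue.Empty:
        raise ReportTimeout(f"no answer after {timeout}s") from None
    if not ok:
        raise value
    return value


def report_with_retry(report, change, timeout=5.0, attempts=3):
    """Hypothetical helper: True once a report succeeds, False if all attempts fail."""
    for _ in range(attempts):
        try:
            call_with_timeout(report, timeout, change)
            return True
        except Exception:
            continue  # timed out or errored: try again, then give up
    return False
```

Note the limitation: an abandoned worker thread stays blocked on the stalled connection, so this keeps the loop moving but does not clean up the connection itself, which matches the remark above that killing the stalled connection is a separate problem.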
[17:55:34] probably more often [17:55:43] I have seen a few people complaining about the reviewDB not being reacheable [17:55:48] and got hit by it as well [17:55:54] never happened before ™ :-D [17:56:06] I got hit by the reviewDB thing this morning [17:59:36] I am off *wave* [18:05:49] Project browsertests-ZeroBanner-en.m.wikipedia.org-linux-phantomjs build #384: FAILURE in 46 sec: https://integration.wikimedia.org/ci/job/browsertests-ZeroBanner-en.m.wikipedia.org-linux-phantomjs/384/ [18:07:01] Yippee, build fixed! [18:07:02] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #348: FIXED in 58 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/348/ [18:11:11] (03PS1) 10Legoktm: Configure flake8 job for operations/software/labsdb-auditor [integration/config] - 10https://gerrit.wikimedia.org/r/184693 (https://phabricator.wikimedia.org/T86622) [18:12:54] bd808: I don't want to take you away from anything, but if you have a moment, I am stumped as to why I can't ssh off of bastion.wmflabs.org. I set up ProxyCommand per https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29 and still can't get off bastion to anywhere [18:13:12] probably a character flaw on my part [18:13:24] heh. [18:13:39] tried -vvv or whatever the super verbose flag is? [18:14:08] lemme check... [18:14:59] With the proxy command you would be sshing from your laptop directly to the VMs behind the bastion, not ssh to bastion and then ssh from there to a VM. [18:15:51] But this all started with you not being able to use `ssh -A` to hop to the bastion and then beyond right? [18:16:38] bd808: https://gist.github.com/chrismcmahon/03a1d9e0cbfa4dde83ea [18:17:24] OK, so I should be able to just e.g. "ssh cmcmahon@gallium.wmflabs.org" directly from my local machine? [18:17:59] yeah, except gallium isn't in labs. 
[18:18:20] ssh gallium.wikimedia.org -- it's in prod [18:18:33] ssh deployment-bastion.eqiad.wmflabs [18:18:38] it's in labs [18:19:40] that ssh trace shows that when you are on bastion1 you don't have any local ssh keys or an active agent (which is right with the proxycommand setup) [18:20:10] which would be true, I have the ProxyCommand in place and got a fresh shell just in case [18:20:50] So try `ssh -vvv deployment-bastion.eqiad.wmflabs` from your laptop and see what happens [18:21:45] bd808: OK cool, that worked. [18:21:51] w00t [18:22:20] cmcmahon@deployment-bastion:~$ pwd [18:22:20] /home/cmcmahon [18:22:27] progress [18:22:45] bd808: OK, so now how do I get to gallium? [18:23:27] You need to setup a proxy command for production and then `ssh gallium.wikimedia.org` [18:23:37] * bd808 looks for the instructions [18:24:55] I'm not finding a great wiki page for a non-ops user [18:25:10] quelle surprise [18:25:12] My .ssh/config has "ProxyCommand ssh -a -W %h:%p bast1001.wikimedia.org" for Host gallium.wikimedia.org [18:26:12] Actually since it is wikimedia.org you may be able to ssh directly to it? [18:27:03] There is this page -- https://wikitech.wikimedia.org/wiki/SSH_configuration_notes -- but its from an ops POV and a bit outdated too I think [18:28:49] Ah "Users without root should simply replace iron.wikimedia.org with bastion1001.wikimedia.org, and bastion-restricted.wmflabs.org with bastion.wmflabs.org." [18:30:53] bd808: so am I just not allowed on bastion1001 right now? https://gist.github.com/chrismcmahon/29f9f454025a4ddfc099 [18:31:56] Looks like it. Let me see if you are in puppet. [18:32:27] bd808: the ultimate point of this exercise is for me to be able to restart Zuul from time to time [18:32:40] *nod* [18:33:01] (and read logs and stuff on beta labs too, which I haven't done in some time) [18:33:16] I think zuul is stuck again... 
[18:33:17] `git grep chrismcmahon` returns nothing in operations/puppet.git [18:34:03] chrismcmahon: You can read beta logs from deployment-bastion (or any other bet host) in /data/project/logs [18:34:08] yeah, I don't remember ever having prod access. I asked for it a long time ago but never got it and it didn't interfere with anything. [18:34:18] Yippee, build fixed! [18:34:18] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #414: FIXED in 13 min: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/414/ [18:34:24] yep, I know about the beta logs from before [18:34:38] or I was just being impatient :P [18:35:05] chrismcmahon: for access to gallium you are going to have to submit a production access request. greg-g can help you with that since he'll have to sign off on it anyway. [18:35:10] so how would request access to gallium with the privileges to "sudo -su zuul"? [18:36:11] Honestly I'd just submit that and make ops figure it out :) [18:36:57] bd808: know off the top of your head where to submit the request? [18:37:03] You need prod access in general and then "contint-admins" rights specifically [18:37:45] It's done in phabricator now. 
Let me see if I can find the right project [18:38:05] chrismcmahon: https://wikitech.wikimedia.org/wiki/Requesting_shell_access [18:38:12] and https://phabricator.wikimedia.org/tag/ops-access-requests/ [18:40:19] chrismcmahon: You want to be added to this group -- https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L119-L129 [18:40:28] OK [18:46:24] thanks for the help bd808 [18:46:34] np [19:00:09] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974165 (10chasemp) [19:00:39] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:01:36] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:01:52] Yippee, build fixed! 
[19:01:52] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #304: FIXED in 10 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/304/ [19:02:40] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974175 (10chasemp) [19:04:33] 3Phabricator.org, Phabricator: Users CCed in private tasks should be able to access them - https://phabricator.wikimedia.org/T518#974183 (10chasemp) [19:04:36] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:05:22] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974187 (10chasemp) Merging in T518 as we are settled on this functionality: Security-bug issues (I think meant o be named Mediawiki Related Security Bugs) will... [19:05:46] Yippee, build fixed! [19:05:47] Project browsertests-VisualEditor-test2.wikipedia.org-linux-chrome-sauce build #426: FIXED in 22 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-chrome-sauce/426/ [19:07:12] Yippee, build fixed! 
[19:07:13] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #341: FIXED in 1 min 24 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/341/ [19:10:20] 3VisualEditor, VisualEditor-MediaWiki, Beta-Cluster: On Beta Labs, switching from VisualEditor to edit source mode intermittently loads the wikitext editor without any CSS - https://phabricator.wikimedia.org/T86624#974208 (10Jdforrester-WMF) [19:12:52] 3Code-Review, Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#974221 (10RobH) There are two issues here that I can see: 1) gerrit.wikimedia.org certificate is sha1 2) gerrit.wikimedia.org is rapidssl certificate, a... [19:16:35] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974225 (10chasemp) [19:17:40] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:19:24] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974244 (10chasemp) [19:21:45] Yippee, build fixed! [19:21:45] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #234: FIXED in 36 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/234/ [19:22:09] 3Fundraising-Backlog, Phabricator: Migration of Fundraising Tech team to Phabricator - https://phabricator.wikimedia.org/T831#974255 (10atgo) @eloquence already liking it way more than Mingle, and looking forward to Sprint/Burndown getting cleaned up. :) [19:23:20] Yippee, build fixed! 
[19:23:21] Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #420: FIXED in 15 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/420/ [19:26:43] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974277 (10chasemp) [19:27:31] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974281 (10scfc) See also T67591 (maybe duplicate). IIRC the two questions there were: # Don't accidentally unlock the Subversion server for anyone with shell access. # Don't accidentally lock... [19:29:27] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974310 (10chasemp) [19:33:12] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974325 (10chasemp) [19:33:30] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:34:03] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:35:17] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974331 (10chasemp) [19:35:25] Project beta-scap-eqiad build #38064: FAILURE in 49 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38064/ [19:35:29] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974333 (10Chad) >>! 
In T86668#974281, @scfc wrote: > See also T67591 (maybe duplicate). IIRC the two questions there were: > > # Don't accidentally unlock the Subversion server for anyone wit... [19:35:37] Yippee, build fixed! [19:35:38] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #442: FIXED in 36 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/442/ [19:38:13] 3Phabricator: Nonexistent change in custom policy logged when mentioning a security task - https://phabricator.wikimedia.org/T76008#974343 (10chasemp) [19:38:16] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974344 (10chasemp) [19:38:49] 3Wikimedia-Git-or-Gerrit, Phabricator: Migrate Gerrit project ownership request system (+2 rights) to Phabricator - https://phabricator.wikimedia.org/T86639#974345 (10chasemp) p:5Triage>3Normal [19:39:53] Yippee, build fixed! [19:39:54] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #377: FIXED in 3 min 41 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/377/ [19:45:05] Project beta-scap-eqiad build #38065: STILL FAILING in 55 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38065/ [19:46:23] Yippee, build fixed! [19:46:24] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #285: FIXED in 8 min 55 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/285/ [19:55:40] Yippee, build fixed! 
[19:55:40] Project beta-scap-eqiad build #38066: FIXED in 1 min 19 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38066/ [20:19:32] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:20:14] Yippee, build fixed! [20:20:14] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #453: FIXED in 23 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/453/ [20:49:38] RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:13] (03PS1) 10Legoktm: Whitelist Jeff Janes [integration/config] - 10https://gerrit.wikimedia.org/r/184747 [21:07:24] (03CR) 10Hashar: [C: 032] "Job updated:" [integration/config] - 10https://gerrit.wikimedia.org/r/184693 (https://phabricator.wikimedia.org/T86622) (owner: 10Legoktm) [21:09:08] Yippee, build fixed! [21:09:08] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #403: FIXED in 33 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/403/ [21:13:36] Yippee, build fixed! 
[21:13:37] Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #276: FIXED in 1 min 17 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/276/ [21:15:15] (Merged) jenkins-bot: Configure flake8 job for operations/software/labsdb-auditor [integration/config] - https://gerrit.wikimedia.org/r/184693 (https://phabricator.wikimedia.org/T86622) (owner: Legoktm) [21:19:27] (PS1) Hashar: labsdb-auditor flake8 should be in check-voter [integration/config] - https://gerrit.wikimedia.org/r/184778 [21:19:38] (CR) Hashar: [C: 2] labsdb-auditor flake8 should be in check-voter [integration/config] - https://gerrit.wikimedia.org/r/184778 (owner: Hashar) [21:21:38] (Merged) jenkins-bot: labsdb-auditor flake8 should be in check-voter [integration/config] - https://gerrit.wikimedia.org/r/184778 (owner: Hashar) [21:25:35] operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974590 (scfc) >>! In T86668#974333, @Chad wrote: >>>! In T86668#974281, @scfc wrote: >> See also T67591 (maybe duplicate). IIRC the two questions there were: >> >> # Don't accidentally unlo... [21:26:22] (CR) Legoktm: "check-voter means the user has to be whitelisted right? Is there a reason for that? flake8 shouldn't execute any arbitrary code?" [integration/config] - https://gerrit.wikimedia.org/r/184778 (owner: Hashar) [21:28:04] (CR) XZise: "Yeah had rerun and with another script even more problems appeared.
Fix is in I95ed0de30d2993d41a70c9b47b76c6461f40e8a0 which would make i" [integration/config] - 10https://gerrit.wikimedia.org/r/182067 (owner: 10XZise) [21:28:26] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974594 (10Chad) If we do {T86674}/{T86655} this won't be a problem [21:31:25] (03PS2) 10Hashar: Whitelist Jeff Janes [integration/config] - 10https://gerrit.wikimedia.org/r/184747 (owner: 10Legoktm) [21:31:32] (03CR) 10Hashar: [C: 032] Whitelist Jeff Janes [integration/config] - 10https://gerrit.wikimedia.org/r/184747 (owner: 10Legoktm) [21:32:03] (03CR) 10Hashar: "Arharhghghhhh" [integration/config] - 10https://gerrit.wikimedia.org/r/184778 (owner: 10Hashar) [21:32:30] (03Merged) 10jenkins-bot: Whitelist Jeff Janes [integration/config] - 10https://gerrit.wikimedia.org/r/184747 (owner: 10Legoktm) [21:32:53] heh [21:33:51] too fast for me to put the +1 :p [21:37:37] chrismcmahon: when you file the access request, have it block this one as well: https://phabricator.wikimedia.org/T85936 [21:38:07] !log deployment-prep upgraded nutcracker on mw1/mw2 to 0.4.0+dfsg-1+wm1 [21:38:10] Logged the message, Master [21:39:36] Yippee, build fixed! 
[21:39:37] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #348: FIXED in 1 min 2 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/348/ [21:43:15] 3Ops-Access-Requests, Continuous-Integration: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#974614 (10Cmcmahon) Need this merged: https://phabricator.wikimedia.org/T86685 [21:44:03] 3Ops-Access-Requests, Continuous-Integration: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#974615 (10Cmcmahon) [21:49:59] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974626 (10chasemp) [21:56:39] 3Quality-Assurance: Sauce Labs screencast for "No JavaScript" Flow tests shows empty browser window. - https://phabricator.wikimedia.org/T86707#974634 (10Spage) 3NEW [22:00:44] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974657 (10chasemp) [22:05:34] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974669 (10chasemp) [22:11:09] 3Quality-Assurance: Sauce Labs screencast for "No JavaScript" Flow tests shows empty browser window. - https://phabricator.wikimedia.org/T86707#974678 (10Spage) [22:13:18] Yippee, build fixed! 
[22:13:18] Project browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce build #422: FIXED in 1 hr 4 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce/422/
[22:15:41] LabsDB-Auditor, Continuous-Integration: Setup jenkins jobs for labsdb-auditor - https://phabricator.wikimedia.org/T86622#974703 (Legoktm) Open>Resolved
[22:22:10] Wikimedia-CiviCRM, Continuous-Integration, § Fundraising Sprint Abba, § Fundraising Tech Backlog: Deploy CiviCRM integration job to WMF integration server - https://phabricator.wikimedia.org/T86374#974705 (greg)
[22:35:46] greg-g: I have largely ignored Adam and his civicrm job :(
[22:36:00] greg-g: had to make choices; I hope to be able to poke him while in SF
[22:36:16] sleeping time *wave*
[22:37:51] !log Restarted Zuul, deadlocked waiting for Gerrit
[22:37:54] Logged the message, Master
[22:38:37] hashar: g'night! no worries!
[22:38:46] gah, another deadlock :/
[22:38:48] I will have to dig in Zuul python code to fix that one
[22:38:54] yeah Gerrit DB is in bad shape apparently
[22:38:58] :(
[22:38:58] :(
[22:39:16] I have !log that at least a couple times today
[22:39:22] and it started fairly recently
[22:39:23] yeah :(
[22:39:29] go sleep
[22:39:33] yeah thanks
[22:39:55] a volunteer proposed to code a bot that would remember me to head to bed
[22:40:23] :) good idea
[22:41:41] chrismcmahon, any idea what might have caused https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/275/testReport/junit/%28root%29/Modal%20on%20editable%20returnto%20page/User_sees_modal_call_to_action_after_registration/ ?
[22:41:46] Something about the log out link not appearing.
[22:41:54] Failed two consecutive days.
[22:42:08] It looks like that selector is still there, when I inspect the page manually.
[22:44:35] superm401: the test never logs on. in the screencast, it never even enters a username or password
[22:44:56] superm401: I'm in a pairing session right now, will look at it in about 20 minutes
[22:45:04] chrismcmahon, okay, no rush.
[22:45:32] Project beta-scap-eqiad build #38090: FAILURE in 1 min 35 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38090/
[22:51:56] Project beta-scap-eqiad build #38091: STILL FAILING in 1 min 45 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38091/
[22:53:38] operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974750 (chasemp)
[22:55:37] Project beta-scap-eqiad build #38092: STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38092/
[22:58:02] operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974765 (chasemp)
[22:58:20] PROBLEM - Puppet failure on deployment-apertium01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[22:58:46] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:01:37] superm401: the screencast of that test shows nothing happening for a long time, maybe Sauce or beta labs was slow. waiting/looking for pt-logout is usually a check for being logged in on a wiki page. But modal_on_editable_returnto.feature seems to just login as the Selenium user, which isn't the same as "I have just registered".
[23:02:28] spagewmf, yeah, I know. There is a bug to make it use an actual registration, but it is not a high priority for me right now.
[23:03:59] Nothing has changed in GettingStarted recently, so maybe it's a fluke, and will go away tomorrow.
[23:05:07] Yippee, build fixed!
[23:05:08] Project beta-scap-eqiad build #38093: FIXED in 1 min 1 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/38093/
[23:05:41] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[23:05:43] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[23:08:30] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:09:22] Krinkle: https://gerrit.wikimedia.org/r/#/c/184003/
[23:10:30] superm401: latest GettingStarted build is green and passes for me against beta labs locally. I'd bet on a temporary problem in beta labs.
[23:10:49] Great, thanks for checking.
[23:12:14] (CR) Krinkle: Load extensions using wfLoadExtensions() if possible (1 comment) [integration/jenkins] - https://gerrit.wikimedia.org/r/184003 (https://phabricator.wikimedia.org/T86359) (owner: Legoktm)
[23:18:20] RECOVERY - Puppet failure on deployment-apertium01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:18:46] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:19:17] Phabricator: Next Phabricator upgrade on 2015-01-14 - https://phabricator.wikimedia.org/T78243#974803 (chasemp)
[23:25:40] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:30:43] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:32:04] (PS3) Legoktm: Load extensions using wfLoadExtensions() if possible [integration/jenkins] - https://gerrit.wikimedia.org/r/184003 (https://phabricator.wikimedia.org/T86359)
[23:32:18] (CR) Legoktm: Load extensions using wfLoadExtensions() if possible (1 comment) [integration/jenkins] - https://gerrit.wikimedia.org/r/184003 (https://phabricator.wikimedia.org/T86359) (owner: Legoktm)
[23:33:29] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:33:55] Continuous-Integration: yamllint on integration/jenkins.git broken due to running on example files from vendor/node_modules - https://phabricator.wikimedia.org/T86719#974818 (Krinkle) NEW
[23:34:30] (CR) Krinkle: "Filed yamllint failure: https://phabricator.wikimedia.org/T86719" [integration/jenkins] - https://gerrit.wikimedia.org/r/184003 (https://phabricator.wikimedia.org/T86359) (owner: Legoktm)
[23:37:16] Continuous-Integration: yamllint on integration/jenkins.git broken due to running on example files from vendor/node_modules - https://phabricator.wikimedia.org/T86719#974835 (Krinkle) Fixed by https://gerrit.wikimedia.org/r/#/c/184324/
[23:37:29] Continuous-Integration: yamllint on integration/jenkins.git broken due to running on example files from vendor/node_modules - https://phabricator.wikimedia.org/T86719#974836 (Krinkle) Open>Resolved p:Triage>Unbreak! a:hashar
[23:39:43] (PS2) XZise: Activate flake8 on Python 3 for pywikibot [integration/config] - https://gerrit.wikimedia.org/r/182067
[23:40:32] (CR) XZise: [C: 1] Activate flake8 on Python 3 for pywikibot [integration/config] - https://gerrit.wikimedia.org/r/182067 (owner: XZise)