[00:00:19] <AndyRussG>	 Hi! In preparation for tomorrow's launch of our year-end FR campaign, there is a core patch that we'd like to delay, as it's a medium-complex change to a part of core that banners rely on (MessageCache)... Is the correct approach to revert it in master and then re-apply before branch cut for the week we'd like it to go out (probably 1 or 2 weeks)? Here's the patch BTW
[00:00:21] <AndyRussG>	 https://gerrit.wikimedia.org/r/#/c/318488/
[00:01:43] <AndyRussG>	 ...also, is there any extra tagging or stuff to do to indicate that this was the intention of the revert?
[00:02:46] <greg-g>	 AndyRussG: reverting out and then merging back in when ready is sensible yes. And no, no other tagging tht I know of.
[00:06:45] <AndyRussG>	 greg-g: cool beans, thanks much!
[00:27:34] <mutante>	 paladox: on iridium, there is a group "vcs" and a user "vcs".. BUT
[00:27:42] <mutante>	 vcs is _not_ member of vcs
[00:27:55] <mutante>	 which seems unusual
[00:28:10] <mutante>	 vcs is member of "phd" group
[00:28:21] <mutante>	 twentyafterfour: ^
[00:28:42] <mutante>	 yet in labs puppet run fails because it is missing the vcs group
[00:28:53] <mutante>	 so.. not sure yet about https://gerrit.wikimedia.org/r/#/c/323972/2
[01:00:08] <shinken-wm>	 PROBLEM - Puppet run on integration-publisher is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:02:16] <wikibugs>	 10Deployment-Systems, 03Scap3, 13Patch-For-Review: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#2829547 (10fgiunchedi) 05Open>03Resolved Package built and deployed.
[01:40:10] <shinken-wm>	 RECOVERY - Puppet run on integration-publisher is OK: OK: Less than 1.00% above the threshold [0.0]
[01:40:33] <wikibugs_>	 06Release-Engineering-Team, 03Scap3, 13Patch-For-Review: scap3 repos permission errors after cloning by puppet in production. - https://phabricator.wikimedia.org/T151231#2829591 (10thcipriani) ^ patch I uploaded above is meant to address group writable `./scap` directories in repos.  The `.git` directory sho...
[01:50:30] <twentyafterfour>	 mutante: that's why I submitted a patch to remove the vcs group - it's unused
[01:50:57] <mutante>	 twentyafterfour: aha! gotcha
[01:51:17] <twentyafterfour>	 https://gerrit.wikimedia.org/r/#/c/323996/
[01:55:42] <mutante>	 bookmarked, will do. just in the middle of upgrading like "all" appservers
[01:55:50] <mutante>	 because library upgrade
[02:06:02] <wikibugs_>	 03Scap3 (Scap3-Adoption-Phase1), 10Wikimedia-Wikimania-Scholarships, 15User-bd808: Deploy scholarships with scap3 - https://phabricator.wikimedia.org/T129134#2829611 (10thcipriani) hrm, not too many php app examples outside of phabricator which has some weird edgecases, IIRC.  The migration guide: https://wi...
[02:13:47] <wmf-insecte>	 Project selenium-QuickSurveys » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #232: 04FAILURE in 46 sec: https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/232/
[02:26:16] <AndyRussG>	 greg-g: another suggestion (from the author of the patch we want to temporarily revert) is just that we revert on the branch that will go out tomorrow, once it's cut, rather than reverting on master and re-applying. What would be the time window for doing that tomorrow? Do you recommend for or against that? (AFIK, disadvantage is, no testing of the actual result on the beta cluster, advantage is,
[02:26:18] <AndyRussG>	 cleaner overall history going forward...). Thx in advance!
[02:42:56] <AndyRussG>	 twentyafterfour: ostriches: Reedy: ^ ?
[02:43:45] <ostriches>	 AndyRussG: Timeline is "when I start the branching"
[02:43:59] <ostriches>	 Officially 12pst, but I usually start earlier about 10 with the cutting
[02:45:14] <AndyRussG>	 ostriches: any thoughts on whether to revert in master or just on the branch? It's a medium-complex change in MessageCache, on which CentralNotice depends to retrieve banners. And tomorrow is day 1 (e.g. a quite important day) of the year-end fundraiser
[02:45:46] <AndyRussG>	 It's already been decided to temporarily revert so we're not trying this new code at scale for the first time on that very day
[02:45:58] <ostriches>	 Eh, probably easiest to revert from the branch
[02:46:19] <AndyRussG>	 However I'm not sure whether to go along to the branch-revert or insist on reverting on master, since that'd give us some beta cluster testing time with the actual code itself
[02:46:54] <ostriches>	 I think revert in branch is cleanest, but I'm not picky...whatever's easiest for you
[02:47:21] <AndyRussG>	 The patch did introduce a new public function on MessageCache, though I don't see anywhere else in core that calls it (other than the call in the patch being reverted)
[02:48:11] <AndyRussG>	 cleanest from git history, but now I'm looking at most reliable as the most important cretirion
[02:48:28] <AndyRussG>	 tho maybe I'm exaggerating
[02:48:36] <ostriches>	 What's the change in question?
[02:49:10] <AndyRussG>	 https://gerrit.wikimedia.org/r/#/c/318488/
[02:49:31] <AndyRussG>	 I could grep thru all the extension code to see if anyone has called it. I guess it's pretty unlikely
[02:49:51] <ostriches>	 updateMessageOverride() should be pretty easy to grep for tomorrow :)
[02:50:13] <AndyRussG>	 yeah
[02:50:23] <ostriches>	 Question though: how long does this need to be held back?
[02:50:29] <ostriches>	 Like, just the one week?
[02:50:29] <AndyRussG>	 At least 1 week
[02:50:38] <AndyRussG>	 two might be preferable, not sure
[02:51:17] <ostriches>	 Oh, dur.
[02:51:22] <ostriches>	 We can automates this.
[02:51:48] <ostriches>	 Well, if we revert from branch, we can automate future branches
[02:52:36] <AndyRussG>	 I don't think it'd be more than two weeks
[02:52:51] <AndyRussG>	 Basically after that there should be the typical lull in the campaign
[03:18:30] <wikibugs_>	 03Scap3 (Scap3-Adoption-Phase1), 10Wikimedia-Wikimania-Scholarships, 15User-bd808: Deploy scholarships with scap3 - https://phabricator.wikimedia.org/T129134#2829649 (10bd808) >>! In T129134#2829611, @thcipriani wrote: > The fact that scholarships uses a `.env` file for configuration in the root directory of...
[03:29:25] <AndyRussG>	 ostriches: reverted on master.... thx much 4 weighing in!!! :)
[03:29:58] <ostriches>	 No worries. Reverting on master means we don't have to coordinate tomorrow during branch time :)
[03:40:18] <AndyRussG>	 yep!
[04:12:19] <wikibugs>	 10Gerrit, 06Operations, 07Beta-Cluster-reproducible, 13Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2829687 (10demon) p:05High>03Normal So, pretty sure this only affects core and **maybe** extensions that get wmf branches. Best we can...
[04:18:39] <wmf-insecte>	 Yippee, build fixed!
[04:18:40] <wmf-insecte>	 Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #218: 09FIXED in 22 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/218/
[05:52:31] <grrrit-wm>	 (03PS2) 10Hashar: test_zuul_coverage replace extdistribution with gerrit [integration/config] - 10https://gerrit.wikimedia.org/r/323840 
[05:54:25] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] test_zuul_coverage replace extdistribution with gerrit [integration/config] - 10https://gerrit.wikimedia.org/r/323840 (owner: 10Hashar) 
[05:55:12] <grrrit-wm>	 (03Merged) 10jenkins-bot: test_zuul_coverage replace extdistribution with gerrit [integration/config] - 10https://gerrit.wikimedia.org/r/323840 (owner: 10Hashar) 
[05:57:34] <grrrit-wm>	 (03PS2) 10Hashar: test: support state of Gerrit repositories [integration/config] - 10https://gerrit.wikimedia.org/r/323856 
[05:57:43] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] test: support state of Gerrit repositories [integration/config] - 10https://gerrit.wikimedia.org/r/323856 (owner: 10Hashar) 
[05:58:43] <grrrit-wm>	 (03Merged) 10jenkins-bot: test: support state of Gerrit repositories [integration/config] - 10https://gerrit.wikimedia.org/r/323856 (owner: 10Hashar) 
[06:01:25] <grrrit-wm>	 (03PS2) 10Hashar: test: simplify zuul_project_in_gerrit [integration/config] - 10https://gerrit.wikimedia.org/r/323845 
[06:03:18] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] test: simplify zuul_project_in_gerrit [integration/config] - 10https://gerrit.wikimedia.org/r/323845 (owner: 10Hashar) 
[06:04:03] <grrrit-wm>	 (03Merged) 10jenkins-bot: test: simplify zuul_project_in_gerrit [integration/config] - 10https://gerrit.wikimedia.org/r/323845 (owner: 10Hashar) 
[06:10:49] <bd808>	 hashar: if you get really bored you could review some python for me :) -- https://gerrit.wikimedia.org/r/#/q/topic:create-accounts+is:open
[06:49:55] <wmf-insecte>	 Yippee, build fixed!
[06:49:55] <wmf-insecte>	 Project selenium-Wikibase » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #192: 09FIXED in 2 hr 9 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/192/
[06:54:01] <grrrit-wm>	 (03CR) 10Legoktm: "Late +1 from me :)" [integration/config] - 10https://gerrit.wikimedia.org/r/323840 (owner: 10Hashar) 
[06:54:20] <hashar>	 legoktm: \o/
[06:54:34] <hashar>	 test should run at https://integration.wikimedia.org/ci/job/integration-config-qa/ :D
[06:54:42] <legoktm>	 I'm now wondering whether ExtensionDistributor should hide read-only repos too
[06:55:24] <hashar>	 legoktm: we lack a document/process about extension lifecycle
[06:55:30] <hashar>	 which we support
[06:55:40] <hashar>	 which ones are in CI 
[06:55:41] <legoktm>	 for CI or just in general?
[06:55:45] <hashar>	 how we move them to readoly etc
[06:55:49] <hashar>	 in general I guess
[06:55:49] <legoktm>	 hmm
[06:56:13] <hashar>	 and CI in particular.  I have found out there are 73 extensions not configured in CI
[06:56:14] <legoktm>	 well there's https://www.mediawiki.org/wiki/Gerrit/Inactive_projects
[06:56:25] <hashar>	 most are probably abandoned  ones
[06:56:42] <hashar>	 ah that page is nice
[06:56:51] <hashar>	 we also have ExtensionDistributor usage statistics
[06:57:38] <grrrit-wm>	 (03PS1) 10Hashar: (WIP) register moar extensions (WIP) [integration/config] - 10https://gerrit.wikimedia.org/r/324027 
[06:57:45] <hashar>	 anyway breakfast / shower etc
[06:57:46] <hashar>	 bbl
[07:21:07] <Dereckson>	 Hi. Could someone with create reference permission create a branch REL1_28 from master in mediawiki/skins/Metrolook project?
[07:21:48] <Dereckson>	 This is to solve https://phabricator.wikimedia.org/T151842 the skin currently being shipped with a version 2 manifest
[07:35:03] <Dereckson>	 thanks paladox
[08:53:12] <wikibugs_>	 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T150465#2829923 (10Tgr)
[08:54:34] <wikibugs>	 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T150465#2786724 (10Tgr) https://gerrit.wikimedia.org/r/#/c/323099/ should probably be merged first, it fixes a very spammy warning.
[09:35:10] <grrrit-wm>	 (03PS2) 10Hashar: Register more extensions [integration/config] - 10https://gerrit.wikimedia.org/r/324027 
[09:41:10] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] Register more extensions [integration/config] - 10https://gerrit.wikimedia.org/r/324027 (owner: 10Hashar) 
[09:41:58] <grrrit-wm>	 (03Merged) 10jenkins-bot: Register more extensions [integration/config] - 10https://gerrit.wikimedia.org/r/324027 (owner: 10Hashar) 
[11:04:13] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-Other, 07Mobile: CommentStreams: The module 'ext.CommentStreams' must not have target 'mobile' because its dependency 'jquery.ui.dialog' does not have it - https://phabricator.wikimedia.org/T151863#2830242 (10hashar)
[11:22:34] <hashar>	 ah good rush
[11:22:41] <hashar>	 I have added an hundred of extension
[11:22:59] <hashar>	 and only 15 are failling https://gerrit.wikimedia.org/r/#/q/topic:qavalidate+is:open :D
[12:01:48] <wmf-insecte>	 Project selenium-RelatedArticles » chrome,beta-mobile,Linux,contintLabsSlave && UbuntuTrusty build #225: 04FAILURE in 48 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/225/
[12:46:53] <grrrit-wm>	 (03PS1) 10Hashar: [reCaptcha] switch to composer [integration/config] - 10https://gerrit.wikimedia.org/r/324157 
[12:54:36] <grrrit-wm>	 (03PS1) 10Hashar: [SpellingDictionary] depends on ULS [integration/config] - 10https://gerrit.wikimedia.org/r/324159 
[13:00:21] <Amir1>	 Hey, Can someone give me admin and 'crat rights in beta enwiki? https://en.wikipedia.beta.wmflabs.org/wiki/Special:Contributions/Ladsgroup 
[13:00:23] <Amir1>	 thanks
[13:19:12] <grrrit-wm>	 (03PS1) 10Hashar: [NumberOfComments] depends on Comments [integration/config] - 10https://gerrit.wikimedia.org/r/324170 
[13:41:55] <grrrit-wm>	 (03PS1) 10Zfilipin: WIP mediawiki-core-qunit-jessie Jenkins job needs Vector skin [integration/config] - 10https://gerrit.wikimedia.org/r/324178 (https://phabricator.wikimedia.org/T139740) 
[13:46:12] <wmf-insecte>	 Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #228: 04FAILURE in 2 min 11 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/228/
[14:08:00] <Nikerabbit>	 hashar: I am seeing  more build failures due to timeouts recently, is that expected?
[14:15:24] <Reedy>	 ApiRevThankIntegrationTest is broken
[14:15:27] <Reedy>	 https://phabricator.wikimedia.org/T151878
[14:15:27] <Reedy>	 ffs
[14:15:31] <Reedy>	 Blocking core merges, it seems
[14:24:27] <hashar>	 !log Refreshing Nodepool Trusty snapshot to get php5-xsl installed T151879
[14:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:24:51] <hashar>	 Nikerabbit: Reedy: please fill bugs as needed!
[14:28:53] <grrrit-wm>	 (03CR) 10Hashar: [C: 04-1] "Yeah that would work. Can you please add the job mediawiki-core-qunit-jessie so it get triggers by mediawiki/skins/Vector ? This way ch" [integration/config] - 10https://gerrit.wikimedia.org/r/324178 (https://phabricator.wikimedia.org/T139740) (owner: 10Zfilipin) 
[14:30:01] <hashar>	 !log Image ci-trusty-wikimedia-1480429423 in wmflabs-eqiad is ready  T151879
[14:30:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:34:39] <Reedy>	 I did :P
[14:35:15] <wikibugs_>	 10Browser-Tests-Infrastructure, 15User-zeljkofilipin: Ensure chromedriver is installed (for Selenium) - https://phabricator.wikimedia.org/T117418#2830833 (10zeljkofilipin) My puppet fu is not strong. I have found this on creating symlinks:  https://www.puppetcookbook.com/posts/creating-a-symlink.html  ``` clas...
[14:36:33] <wikibugs_>	 10Continuous-Integration-Config, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-Other, 07Mobile: CommentStreams: The module 'ext.CommentStreams' must not have target 'mobile' because its dependency 'jquery.ui.dialog' does not have it - https://phabricator.wikimedia.org/T151863#2830840 (10cicalese) Than...
[14:38:23] <wikibugs_>	 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T150465#2786724 (10Reedy) As T151702 is ongoing, I've turned the bandaid into https://gerrit.wikimedia.org/r/#/c/324182 and it's been reapplied to .3  It may be fortuitous to che...
[14:48:15] <grrrit-wm>	 (03PS2) 10Zfilipin: WIP mediawiki-core-qunit-jessie Jenkins job needs Vector skin [integration/config] - 10https://gerrit.wikimedia.org/r/324178 (https://phabricator.wikimedia.org/T139740) 
[14:49:07] <grrrit-wm>	 (03CR) 10Zfilipin: "Patch set 2 runs mediawiki-core-qunit-jessie for mediawiki/skins/Vector in test and gate-and-submit pipelines." [integration/config] - 10https://gerrit.wikimedia.org/r/324178 (https://phabricator.wikimedia.org/T139740) (owner: 10Zfilipin) 
[15:07:06] <hashar>	 andrewbogott: thank you :]
[15:07:27] <andrewbogott>	 hashar: those vms are all size 'small' right?  (wondering if the RAM quota is big enough)
[15:07:38] <hashar>	 there is a slight bump https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen&from=now-3h&to=now :]
[15:07:53] <andrewbogott>	 ok, must be working then :)
[15:07:56] <hashar>	 the task has the quota recommendations
[15:07:58] <hashar>	 let me check
[15:08:28] <hashar>	 andrewbogott: gotta bump the vCPU quota from 40 to 44
[15:08:42] <hashar>	 we have 20 instances +  2 slots for refreshing the snapshots
[15:08:52] <hashar>	 so at max we can have up to 22 instances each having 2 cpu
[15:08:56] <hashar>	 or 44 cpu :]
[15:09:14] <hashar>	 the extra 2 instances are only used  temporarly when the snapshot are refreshed on a daily basis
[15:09:44] <hashar>	 oh it is has vcpu 44
[15:09:56] <hashar>	 ram is ok
[15:10:14] <hashar>	 and instances quota is at 23 
[15:10:16] <chasemp>	 do we need a restart of nodepool service now to actually enable?
[15:10:26] <hashar>	 the 20 instances + 2 snapshot  + 1 overhead in the quota in case something goes wrong
[15:10:34] <hashar>	 nodepool read the configuration file automatically
[15:10:40] <hashar>	 so it already took in account the change
[15:10:45] <hashar>	 well whenever puppet ran
[15:10:57] <hashar>	 hello chase :D
[15:11:10] <chasemp>	 morning / afternoon
[15:11:47] <hashar>	 if you notice any issue, the easiest is to bring back max-server down to 12
[15:11:58] <chasemp>	 I still only see 10 slots used
[15:12:06] <chasemp>	 10 / 9 / 10 etc
[15:12:35] <chasemp>	 we should consider bumping up the ready nodes as well it seems?
[15:13:36] <chasemp>	 hashar: would you expect nodes in use to be >10 atm?
[15:13:43] <hashar>	 definitely
[15:13:46] <wikibugs_>	 10Continuous-Integration-Config, 13Patch-For-Review: Create composer-php70 job - https://phabricator.wikimedia.org/T144961#2830993 (10hashar)
[15:13:50] <wikibugs_>	 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Migrate mediawiki-core-phpcs job to Nodepool - https://phabricator.wikimedia.org/T133976#2830995 (10hashar)
[15:13:52] <wikibugs_>	 10Continuous-Integration-Config, 05Continuous-Integration-Scaling, 10releng-201516-q3, 07WorkType-NewFunctionality: [keyresult] Migrate php (Zend and HHVM) CI jobs to Nodepool - https://phabricator.wikimedia.org/T119139#2830996 (10hashar)
[15:13:58] <wikibugs>	 10Continuous-Integration-Config, 13Patch-For-Review: Run MediaWiki tests on PHP 7 - https://phabricator.wikimedia.org/T144962#2830992 (10hashar)
[15:13:59] <wikibugs_>	 05Continuous-Integration-Scaling, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Bump quota of Nodepool instances (contintcloud tenant) - https://phabricator.wikimedia.org/T133911#2830989 (10hashar) 05Open>03Resolved a:03hashar Nodepool loaded the new configuration and the OpenStack quota have be...
[15:14:08] <chasemp>	 no dice
[15:14:21] <hashar>	 typically during european evening / SF morning
[15:14:28] <chasemp>	 no dice
[15:14:35] <chasemp>	 oops accidental dupe
[15:14:35] <hashar>	 which is the patches traffic jam hours
[15:14:45] <chasemp>	 I'm going to restart nodepool for fun
[15:14:48] <chasemp>	 any objection?
[15:15:12] <hashar>	 🎲 🎲 🎲 
[15:15:18] <hashar>	 hmazeoir
[15:15:25] <hashar>	 why do you want to restart it ?
[15:15:42] <chasemp>	 see if it starts using more than 10 slots
[15:15:47] <hashar>	 it does :)
[15:15:56] <hashar>	 https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen&from=now-6h&to=now
[15:16:04] <wikibugs_>	 10Browser-Tests-Infrastructure, 15User-zeljkofilipin: Ensure chromedriver is installed (for Selenium) - https://phabricator.wikimedia.org/T117418#2831012 (10zeljkofilipin) Looks like this is the way to go:  https://docs.puppet.com/puppet/latest/reference/types/file.html   ``` file { '/etc/inetd.conf':   ensure...
[15:16:08] <chasemp>	 ah nice
[15:16:12] <hashar>	 in its main loop, nodepool reload the configuration file 
[15:16:16] <hashar>	 and self update
[15:16:19] <hashar>	 which is quite handy
[15:16:22] <chasemp>	 sure, I just wasn't seeing it
[15:17:09] <hashar>	 it has done some initial spike to refill the number of min servers idling
[15:22:45] <grrrit-wm>	 (03CR) 10Hashar: [C: 04-1] "The Qunit job do not need any skin. It is possible that Vector has a few qunit tests but quite unlikely really." [integration/config] - 10https://gerrit.wikimedia.org/r/324178 (https://phabricator.wikimedia.org/T139740) (owner: 10Zfilipin) 
[15:23:01] <hashar>	 zeljkof: guess I will have to fix up my mw local setup :]
[15:23:11] <hashar>	 !log Image ci-jessie-wikimedia-1480432368 in wmflabs-eqiad is ready
[15:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:23:37] <chasemp>	 hashar: does it make sense now to reaise the min-ready proportionally?
[15:24:12] <hashar>	 I am not sure :]
[15:24:28] <hashar>	 the idea is of min-ready is to get instances immediately available in case of a spike
[15:24:44] <hashar>	 the drawback is less free slots to automatically balance the load between jessie and trusty images
[15:24:45] <hashar>	 eg 
[15:24:49] <chasemp>	 sure, and there is churn issues where it holds ready nodes for a flavor that isn't needed
[15:24:50] <chasemp>	 right
[15:25:02] <hashar>	 if one has  min-ready:  4 trusty + 4 jessie (= 8) and a quota of 20 instances
[15:25:03] <chasemp>	 ok let's sit as-is for a bit and see what's in use
[15:25:09] <hashar>	 that is 12 slots that could be all assigned to jessie
[15:25:14] <hashar>	 but maybe I am just overthinking
[15:25:42] <hashar>	 the jessie vs trusty is at https://grafana.wikimedia.org/dashboard/db/continuous-integration?panelId=8&fullscreen
[15:25:49] <chasemp>	 it's 6 jessie vs 3 trusty now
[15:26:04] <hashar>	 yeah  2/3  vs 1/3
[15:26:20] <hashar>	 the aim is to eventually phase out Trusty entirely
[15:26:51] <chasemp>	 I'm ok w/ sitting as-is for a bit and seeing how it shakes out
[15:27:03] <chasemp>	 but I do think bumping up min-ready for both is going to be valuable
[15:27:09] <hashar>	 will be busier in a couple hours from now
[15:27:16] <chasemp>	 I'm just not sure exatly to what :)
[15:27:19] <chasemp>	 k
[15:27:21] <hashar>	 when our mighty west coasters start sending random stuff all around gerrit
[15:27:44] <chasemp>	 what remains not on nodepool w/ this change?
[15:27:45] <hashar>	 if we bump min,ready
[15:27:50] <hashar>	 surely nodepool will refil while it is idling
[15:28:02] <hashar>	 so patches might end up being process slightly faster
[15:28:16] <hashar>	 have to migrate all the PHP jobs now :D
[15:28:33] <chasemp>	 when do you plan to do that?
[15:28:55] <hashar>	 I thought about leaving this evening pass
[15:29:00] <hashar>	 and do the switch tomorrow
[15:29:07] <wikibugs>	 10Browser-Tests-Infrastructure, 15User-zeljkofilipin: Ensure ChromeDriver is installed for jobs that run Selenium tests - https://phabricator.wikimedia.org/T117418#2831063 (10zeljkofilipin)
[15:30:44] <chasemp>	 sure
[15:30:54] <chasemp>	 sounds good, drop a note if you would when you are in flight
[15:31:03] <hashar>	 sure thing!
[15:31:38] <hashar>	 and I get the https://grafana.wikimedia.org/dashboard/db/zuul-job board to check the amount of builds per day that will be shifted to nodepool
[15:31:44] <hashar>	 then
[15:31:52] <hashar>	 we can phase out most of the rest of the permanent slaves
[15:32:15] <hashar>	 and start figuring out a more robust replacement for the whole CI
[15:35:04] <wmf-insecte>	 Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #244: 04FAILURE in 13 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/244/
[15:38:23] <grrrit-wm>	 (03PS1) 10Hashar: [CustomPage] switch to non voting [integration/config] - 10https://gerrit.wikimedia.org/r/324204 
[15:38:25] <grrrit-wm>	 (03PS1) 10Hashar: [GooglePlaces] switch to non voting [integration/config] - 10https://gerrit.wikimedia.org/r/324205 
[15:39:48] <shinken-wm>	 PROBLEM - Puppet run on deployment-phab02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:41:45] <wmf-insecte>	 Project selenium-MobileFrontend » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #244: 04FAILURE in 19 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/244/
[15:49:54] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] [GooglePlaces] switch to non voting [integration/config] - 10https://gerrit.wikimedia.org/r/324205 (owner: 10Hashar) 
[15:49:57] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] [CustomPage] switch to non voting [integration/config] - 10https://gerrit.wikimedia.org/r/324204 (owner: 10Hashar) 
[15:50:57] <grrrit-wm>	 (03Merged) 10jenkins-bot: [CustomPage] switch to non voting [integration/config] - 10https://gerrit.wikimedia.org/r/324204 (owner: 10Hashar) 
[15:50:59] <grrrit-wm>	 (03Merged) 10jenkins-bot: [GooglePlaces] switch to non voting [integration/config] - 10https://gerrit.wikimedia.org/r/324205 (owner: 10Hashar) 
[15:51:04] <shinken-wm>	 PROBLEM - Puppet run on deployment-phab01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:11:12] <wikibugs>	 10Deployment-Systems, 03Scap3 (Scap3-Adoption-Phase1), 10scap, 10MediaWiki-JobRunner: Deploy jobrunner with scap3 - https://phabricator.wikimedia.org/T129148#2831243 (10hashar) a:03hashar From a conversation with @thcipriani, I will read the scap doc, craft some patch and we will polish/review it together.
[16:45:48] <greg-g>	 AndyRussG|bassoo: I dislike that idea for the reason you state (no testing on beta cluster) and the time between branch cut and deploy is relatively small
[16:49:43] <hashar>	 so going to try to migrate jobrunner to scap per chat with thcipriani
[16:49:44] <hashar>	 and
[16:49:56] <hashar>	 for zeljkof forge a jenkins job for selenium + JS 
[16:50:01] <hashar>	 busy day tomorrow
[16:50:09] <thcipriani>	 \o/
[17:12:54] <shinken-wm>	 PROBLEM - Puppet run on repository is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:48:35] * paladox is upgrading to a Apple Mac pro mac on christmas :), but will still use windows on my old pc :)
[17:50:05] <Amir1>	 I repeat my request, please give me admin and 'crat rights in enwiki in beta. I want to test deferred changes and It'll be needed in future. Thanks :)
[17:50:39] <paladox>	 Amir1 you should create a task for requesting rights.
[17:51:31] <Amir1>	 Is it that complicated? I already have root access in beta cluster. I add change the db and give the rights to myself but rather not :D
[17:51:49] <Amir1>	 s/add/can
[17:51:51] <Reedy>	 Amir1: What's what the rest of us would do
[17:52:15] <Reedy>	 That's what
[17:52:57] <shinken-wm>	 RECOVERY - Puppet run on repository is OK: OK: Less than 1.00% above the threshold [0.0]
[17:53:21] <Reedy>	 Amir1: what username?
[17:53:31] <Amir1>	 Ladsgroup
[17:53:44] <Reedy>	 Done
[17:53:51] <Amir1>	 Thanks Reedy 
[18:16:49] <wikibugs_>	 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T150465#2831828 (10demon)
[19:06:23] <wikibugs>	 06Release-Engineering-Team, 10Elasticsearch, 10Phabricator: Determine if we can run both elasticsearch and myisam fulltext backends in parallel - https://phabricator.wikimedia.org/T150223#2832109 (10mmodell) 05Open>03Resolved Ok so this is possibly we just have to use config environments to provide the e...
[19:08:25] <wikibugs>	 06Release-Engineering-Team, 10Elasticsearch, 10Phabricator: Determine if we can run both elasticsearch and myisam fulltext backends in parallel - https://phabricator.wikimedia.org/T150223#2832116 (10mmodell)
[20:26:17] <grrrit-wm>	 (03CR) 10Krinkle: "todo: We should probably replace this with 'npm run doc'. We've standardised on that entry point and is already used elsewhere for pre/pos" [integration/config] - 10https://gerrit.wikimedia.org/r/323872 (owner: 10Jforrester) 
[20:30:38] <nuria>	 hola! after creating a new repo (node module) in gerrit. Do i need to do anything so it gets published to github?
[20:31:26] <hasharAWay>	 nuria: gotta create the repository manually under github.com/wikimedia/
[20:31:29] <hasharAWay>	 with the proper name
[20:31:33] <hasharAWay>	 then Gerrit magically replicate it
[20:31:54] <hashar>	 nuria: what is the repo in Gerrit?
[20:32:12] <nuria>	 hashar: git fetch https://gerrit.wikimedia.org/r/node-rdkafka-statsd refs/changes/71/319671/22 && git checkout FETCH_HEAD
[20:32:17] <nuria>	 hashar: sorry 
[20:32:22] <hashar>	 node-rdkafka-statsd
[20:32:23] <hashar>	 pff
[20:32:24] <nuria>	 hashar: https://gerrit.wikimedia.org/r/node-rdkafka-statsd 
[20:32:30] <nuria>	 hashar: yes!
[20:32:38] * hashar shakes fists at whoever uses - instead of / in Gerrit :]
[20:32:49] <hashar>	 are you admin on github / wikimedia ?
[20:33:46] <nuria>	 hashar: yes please
[20:33:55] <hashar>	 https://github.com/wikimedia/node-rdkafka-statsd :D
[20:34:03] <nuria>	 hashar: wow
[20:34:05] <hashar>	 gerrit should then replicate there at some point
[20:34:12] <hashar>	 basically create the repo
[20:34:17] <nuria>	 hashar: i think "-" are node convention for module names
[20:34:21] <hashar>	 and in description add some boiler plate stating it is a mirror
[20:34:25] <nuria>	 hashar: they all live at top level
[20:34:35] <hashar>	 yeah yeah
[20:34:36] <nuria>	 hashar: but hey node dep chain is such a mess 
[20:34:36] <hashar>	 that is dumb
[20:34:39] <wmf-insecte>	 Project selenium-RelatedArticles » chrome,beta-mobile,Linux,contintLabsSlave && UbuntuTrusty build #226: 15ABORTED in 7.8 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/226/
[20:34:39] <hashar>	 but I can survive :]
[20:34:40] <wmf-insecte>	 Project selenium-RelatedArticles » chrome,beta-desktop,Linux,contintLabsSlave && UbuntuTrusty build #226: 15ABORTED in 7.8 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-desktop,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/226/
[20:34:49] <hashar>	 I am being too pedantic as I get older
[20:34:59] <hashar>	 oh
[20:35:04] <hashar>	 replicated!
[20:35:06] <nuria>	 hashar: no, really, node dep chain is a mess. facebook just rewrote it
[20:35:08] <hashar>	 that was fast
[20:35:18] <hashar>	 yeah I have seen their "yarn" initiative
[20:35:19] <wmf-insecte>	 Project selenium-RelatedArticles » chrome,beta-desktop,Linux,contintLabsSlave && UbuntuTrusty build #227: 09SUCCESS in 37 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-desktop,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/227/
[20:35:21] <nuria>	 hashar: npm to be able to use node packages in a reliable manner
[20:35:25] <hashar>	 it has a lot of nicer feature compared to npm
[20:35:34] <wmf-insecte>	 Project selenium-RelatedArticles » chrome,beta-mobile,Linux,contintLabsSlave && UbuntuTrusty build #227: 04STILL FAILING in 53 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/227/
[20:35:58] <hashar>	 would you like a dessert with the github mirror repo ?
[20:36:23] <hashar>	 🍪 🍪 🍪
[20:38:15] <wikibugs>	 10Continuous-Integration-Config, 06Release-Engineering-Team, 10QuickSurveys, 10RelatedArticles: QuickSurveys and RelatedArticles configured with wrong url - https://phabricator.wikimedia.org/T151937#2832641 (10Jdlrobson)
[20:40:16] <grrrit-wm>	 (03PS1) 10Hashar: node-rdkafka-statsd add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/324265 
[20:40:42] <hashar>	 nuria: I am adding a npm test job for node-rdkafka-statsd in gerrit
[20:40:55] <nuria>	 hashar: thnak you, will CR 
[20:40:59] <nuria>	 *thank you
[20:41:13] <hashar>	 super straightforward https://gerrit.wikimedia.org/r/#/c/324265/1/zuul/layout.yaml 
[20:41:26] <hashar>	 hopefull there are no weird lib**-dev dependencies and it is just going to work
[20:42:20] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] node-rdkafka-statsd add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/324265 (owner: 10Hashar) 
[20:43:17] <grrrit-wm>	 (03PS2) 10Hashar: node-rdkafka-statsd add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/324265 
[20:43:33] <Krinkle>	 nuria: for the record, the convention is to use / in Gerrit repos, but also note that it's super easy to have a different name on GitHub. An admin can rename the repo on GitHub without any puppet or gerrit changes. For example: https://github.com/wikimedia/mediawiki (mediawiki/core.git) oojs (oojs/core.git) https://github.com/wikimedia/restbase
[20:43:33] <Krinkle>	 (mediawiki/services/restbase.git) etc.
[20:44:04] <nuria>	 Krinkle: but none of our node modules are done that way, right?
[20:44:04] <Krinkle>	 Looks like there is also an empty 'node-rdkafka-stats' repo
[20:44:21] <hashar>	 I am not sure whether we have any node modules in Gerrit 
[20:44:22] <Krinkle>	 nuria: We have various node modules, none of them are node-* in Gerrit until now.
[20:44:26] <nuria>	 Krinkle: ah! sorry, that should have been deleted but none of us had permits cc hashar 
[20:44:42] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] node-rdkafka-statsd add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/324265 (owner: 10Hashar) 
[20:45:12] <Krinkle>	 they are operations/software/*, VisualEditor/*, oojs/ui, mediawiki/services/parsoid, mediawiki/services/restbase,  etc.
[20:45:30] <Krinkle>	 analytics/node-rdkafka-statsd or something like that would've been natural I think.
[20:45:39] <grrrit-wm>	 (03Merged) 10jenkins-bot: node-rdkafka-statsd add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/324265 (owner: 10Hashar) 
[20:45:56] <nuria>	 Krinkle: ya, the node- is  a little strange  eh?
[20:46:43] <nuria>	 Krinkle: but it is not analytics is kafka though
[20:46:54] <hashar>	 maybe we could have come with node/modules/rdkafka-statsd
[20:47:27] <nuria>	 hashar: I guess it segues this one: https://github.com/wikimedia/node-rdkafka cc Krinkle 
[20:47:29] <hashar>	 nuria: Jenkins managed to build something and pass ( did a recheck on https://gerrit.wikimedia.org/r/#/c/319671/22 )
[20:47:49] <Krinkle>	 nuria: That one isn't on Gerrit though, that's a plain gh repo
[20:47:56] <hashar>	 then the Services team has most of their repositories on Github afaik
[20:48:09] <Krinkle>	 nope, renamed mirrors 
[20:48:09] <hashar>	 the Gerrit ones being mirrors for deployment on wmf infra
[20:48:40] <Krinkle>	 the only repos I know on github pure are mostly mobile apps afaik
[20:48:46] <nuria>	 Krinkle: i see, and that is named that way cause other clients are php-rdkafka and dotnet-rdkafka.. 
[20:48:58] <Krinkle>	 Yeah, and using that name on github is fine
[20:49:19] <nuria>	 Krinkle: ok, we know that for next time
[20:49:26] <Krinkle>	 anyway, there's no real good reason beyond convention to have it different in gerrit. Consistency for ACL as well.
[20:49:56] <nuria>	 hashar: could you be so kind to delete the node-rdkafka-stats depot ?
[20:50:46] <hashar>	 nuria: from Gerrit?
[20:51:36] <hashar>	 ah without a d
[20:51:37] <hashar>	 bah
[20:51:45] <nuria>	 hashar: note this is a different one , name has a typo
[20:51:50] <nuria>	 hashar: right!
[20:51:59] <nuria>	 hashar: ya  ahem .. do not delete the other plis
[20:52:47] <hashar>	 Your repository "wikimedia/node-rdkafka-stats" was successfully deleted. 
[20:53:24] <hashar>	 so we have https://github.com/wikimedia/node-rdkafka
[20:53:30] <hashar>	 "Direct node.js wrappers over C++ librdkafka API "
[20:53:56] <hashar>	 and https://github.com/wikimedia/node-rdkafka-statsd  a mirror from gerrit
[20:55:14] <hashar>	 chasemp: andrewbogott: Nodepool reach the vcpu quota for some reason :(    Quota exceeded for cores: Requested 2, but already used 44 of 44 cores
[20:55:36] <hashar>	 though it has 20 servers
[20:55:42] <chasemp>	 hashar: scale the max servers back a bit for now?
[20:55:58] <hashar>	 I would rather bump the quota
[20:56:04] <hashar>	 but something is off 
[20:56:35] <hashar>	 that happens when the pool is full.  I guess Nodepool attempt to spawn an instance when others havent been deleted yet
[20:56:48] <chasemp>	 yes I understand but runaway usage is a good reason to up the quota further
[20:56:54] <chasemp>	 isn't that just an arms race we can't win 
[20:57:04] <chasemp>	 esp when we can't explain usage beyond expected now?
[20:57:11] <chasemp>	 is not a good reason I mean :)
[20:57:18] <hashar>	 oh I just noticed it literally a minute ago!
[20:58:08] <hashar>	 so I dont quite now what is going on yet
[20:59:14] <hashar>	 chasemp: anything of interest on the OpenStack side?
[20:59:32] <hashar>	 not related to the quota issue
[20:59:50] <hashar>	 but overall I mean. Like a labvirt suddenyl dieing, rabbit mq overloaded, random new stacktrace  of doom
[21:00:15] <chasemp>	 not that I've seen today
[21:00:22] <chasemp>	 atm I only see 16 instances also
[21:00:31] <chasemp>	 but 3 are in build
[21:01:34] <hashar>	 ohhh
[21:01:39] <hashar>	 there are some stuck instances
[21:02:14] <chasemp>	 stuck as in failed to delete?
[21:02:25] <chasemp>	 it would make sense if it's 'leaking instances'
[21:02:26] <hashar>	 as in spawned in openstack and running
[21:02:32] <hashar>	 but not taken in account by nodepool itself
[21:02:35] <hashar>	 and I guess it is confused
[21:02:54] <hashar>	 will dig in logs
[21:04:06] <chasemp>	 if you figure out UUID's of suspected dead VMs taking up a slot let me know
[21:04:23] <chasemp>	 curious which side of the equation is confused
[21:04:30] <hashar>	 | 96604607-8bdc-4cd0-bc95-8c0a74504c9f | ci-trusty-wikimedia-431179 | BUILD  | scheduling | NOSTATE     |                     |
[21:04:35] <hashar>	 apparently stuck
[21:06:03] <hashar>	 there are two others but which are spawned instance
[21:06:07] <hashar>	 and reacheable
[21:06:12] <hashar>	 but somehow nodepool lost track of them
[21:07:28] <chasemp>	 stuck in scheduling 96604607-8bdc-4cd0-bc95-8c0a74504c9f, which may be legit or not I'm not sure
[21:08:11] <hashar>	 all three got requested at 19:10utc
[21:08:15] <hashar>	 or around that
[21:08:25] <chasemp>	 what are the other two?
[21:09:18] <hashar>	 | wmflabs-eqiad | ci-trusty-wikimedia-431181 | 72f0add1-d740-4eb9-bcaf-1e8d562d0a87 | 10.68.18.121 |
[21:09:18] <hashar>	 | wmflabs-eqiad | ci-jessie-wikimedia-431180 | ff9cf317-4222-4367-8b1d-c8cc87ac82b3 | 10.68.20.210 |
[21:10:15] <hashar>	 for 431181  nodepool reports a read timeout at 60 seconds
[21:10:23] <hashar>	 while trying to launch the instance
[21:10:44] <hashar>	 so it assumes the instance never spawned, delete it from it is internal tracking base
[21:10:54] <hashar>	 but eventually wmflabs manage to spawn it at some point
[21:11:16] <hashar>	 that instance is thus unknown to nodepool   but known / idling in the tenant
[21:11:36] <chasemp>	 yes I'd say this qualifies as a problem
[21:11:40] <chasemp>	 I removed those 3 manually
[21:12:20] <hashar>	 the nodepool client as a command to compare what nodepool knows versus what is in the tenant
[21:12:33] <hashar>	 leftover are named aliens.  Can get list with 'nodepool alien-list'
[21:12:40] <hashar>	 but it should really garbage collect them
[21:12:52] <chasemp>	 yeah but I think there is a hole where
[21:12:58] <chasemp>	 if out for scheduling it's both reserved and not reported
[21:13:02] <chasemp>	 at least that's my theory atm
[21:13:17] <hashar>	 (oh and the instance apparently got spawned on 11/28 at 19:10,  or yesterday)
[21:13:29] <hashar>	 so I guess I have fail to notice the issue up until now
[21:14:24] <chasemp>	 possibly it was surfaced w/ the increase in instance churn but I can't recall this being a thing for hte last few months
[21:20:54] <hashar>	 chasemp: can confirm our Nodepool does not deal with leaked instances :/
[21:21:06] <chasemp>	 how did you confirm it?
[21:21:15] <hashar>	 they wrote a patch to deal with the leakage
[21:21:18] <hashar>	 and we dont have the patch
[21:21:25] <hashar>	 so I am assuming our current version leaks :]
[21:22:05] <hashar>	 https://review.openstack.org/#/c/190827/
[21:22:10] <hashar>	 Delete leaked instances
[21:22:27] <hashar>	 which has a bunch of follow up such as "Do not delete non-leaked instances".  Scary
[21:24:08] <chasemp>	 that would make sense
[21:24:30] <chasemp>	 you should back down / revert to previous until that's deployed
[21:24:35] <chasemp>	 we'll leave the quota as-is openstack side
[21:24:50] <hashar>	 well
[21:25:00] <hashar>	 that issue occured before the quota bump
[21:25:30] <chasemp>	 sure but now it's going to be more pronounced
[21:25:39] <chasemp>	 espcially as indicated by it being more pronounced
[21:27:01] <hashar>	 ahh yeah
[21:27:10] <hashar>	 so previosuly it would shoke on the 15/15 quota
[21:27:15] <hashar>	 but we had 12 max servers
[21:27:23] <hashar>	 so we could get 3 leaked instances and not notice
[21:27:36] <hashar>	 now we can get 40 CPU and the quota is 44 cpu
[21:27:47] <hashar>	 or up to 2 leaked instances (from 3)
[21:27:53] <hashar>	 so yeah more pronounced indeed
[21:28:24] <hashar>	 getting a tea and thinking about it. But probably we can lower max-server from 20 to 19
[21:28:33] <chasemp>	 kk
[21:28:46] <hashar>	 and I definitely need a  monitoring probe to notify those leakages
[21:28:52] <hashar>	 + grab upstream patches and package them
[21:28:58] <chasemp>	 andrewbogott: ^ nodepool is leaking instances and can't compensate, a patch exists we don't have.  hashar is considering options
[21:29:31] <hashar>	 and the bug has always been there, but get found due to less quota allowance
[21:29:32] <hashar>	 neat!
[21:30:38] <andrewbogott>	 I don't understand why this wasn't /more/ of a problem when the quota was lower
[21:32:13] <chasemp>	 I'm not sure exactly
[21:32:55] <hashar>	 https://gerrit.wikimedia.org/r/324328 nodepool: lower max-server from 20 to 19  
[21:33:09] <hashar>	 andrewbogott: chasemp: the puppet patch above would lower the max-server by one
[21:33:17] <hashar>	 allowing for up to 3 leaked instances
[21:33:22] <wmf-insecte>	 Project beta-scap-eqiad build #131036: 04FAILURE in 24 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/131036/
[21:33:29] <andrewbogott>	 do the leaks persist forever, or just hang around for extra long?
[21:33:36] <hashar>	 the trick is previously we had  12 instances of 15 max 
[21:33:56] <hashar>	 now we have 40 CPU out of 44 CPU,  wich translate to  20 instance out of 22 max
[21:34:07] <hashar>	 15 - 12  =>  3 instances leakable
[21:34:14] <hashar>	 22 - 20 =>  only 2 instances leakable
[21:34:20] <andrewbogott>	 yeah, I understand the math.  But why is 3 the magic number of leaks?
[21:34:24] <hashar>	 I am not sure whether the leak persist
[21:34:33] <chasemp>	 we have no reason to think they don't
[21:34:35] <hashar>	 maybe I just randomly notice them and delete them 
[21:34:40] <andrewbogott>	 Isn't it just a question of 'things break in 90 minutes' vs 'things break in 60 minutes'?
[21:34:50] <hashar>	 or nodepool manage to garbage collect them after a while (unlikely)
[21:35:02] <hashar>	 OHH soryr
[21:35:03] <hashar>	 (:(
[21:35:04] <andrewbogott>	 Oh, so the problem now is that these leaked VMs have been around for quite a while?
[21:35:09] <chasemp>	 it's more likely the more aggressive scheduling w/ the greater count led to a scheduler issue
[21:35:13] <chasemp>	 which is the thing w/ the one I saw
[21:35:27] <hashar>	 yeah the leaked one got spawned yesterday at 19:10 UTC  or roughly 27 hours ago
[21:35:28] <chasemp>	 and that it was use patterns that let it live so long at previous thresholds
[21:35:32] <chasemp>	 not the thresholds themselves
[21:35:49] <wmf-insecte>	 Project beta-scap-eqiad build #131037: 04STILL FAILING in 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/131037/
[21:36:08] <hashar>	 a potential concern on the openstack side is the timeout when spawning builds
[21:36:14] <chasemp>	 hashar: I'm advocating a full revert on nodepool side, but you should let the US releng folks know to watch out if you are ok w/ a bit of mitigation in teh short term
[21:36:14] <hashar>	 then nodepool should cleanly handle it
[21:37:46] <hashar>	 pretty sure it is unrelated
[21:37:53] <hashar>	 something happened yesterday that caused the leak
[21:38:11] <hashar>	 nodepool did alarm before the quota bump  (but complaiend about 15/15 instances quota reached)
[21:38:31] <hashar>	 and when we bumped the instance quota and max-server, the message simply changed to complain about vCPU
[21:38:45] <hashar>	 but essentially that is the same root cause. Leaked instances from yesterday which Nodepool does not handle
[21:45:17] <wmf-insecte>	 Project beta-scap-eqiad build #131038: 04STILL FAILING in 24 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/131038/
[21:46:32] <andrewbogott>	 hashar: I'm going to merge that 20->19 patch, you can clean up the leaked VMs
[21:46:39] <hashar>	 neat
[21:46:43] <andrewbogott>	 and then we'll see if the problem is more common than it was before
[21:46:50] <hashar>	 two instances leaked but spawned properly
[21:46:53] <hashar>	 one got stall somehow
[21:46:58] <hashar>	 they are deleted
[21:47:05] <hashar>	 and the patch bring us with the same allowance for leaked instances
[21:47:22] <hashar>	 I will dig in logs / watch tomorrow
[21:47:31] <hashar>	 and move a very small batches of jobs
[21:48:49] <wikibugs_>	 10Beta-Cluster-Infrastructure: Redirecting to a page on beta labs doesn't work - https://phabricator.wikimedia.org/T151894#2833000 (10yuvipanda) No such thing as beta labs. Please see https://wikitech.wikimedia.org/wiki/Labs_node_setup
[21:51:36] <wikibugs_>	 10Continuous-Integration-Infrastructure, 07Nodepool: Nodepool leaks instances and does not gabage collect them - https://phabricator.wikimedia.org/T151949#2833005 (10hashar)
[21:51:45] <hashar>	 nodepool leak filled ^ 
[21:53:23] <wikibugs_>	 10Continuous-Integration-Infrastructure, 07Nodepool: Add monitoring and capacity planning for Nodepool - https://phabricator.wikimedia.org/T113806#2833027 (10hashar) 05Resolved>03Open Reopening. Would need some notifications when pool is exhausted, server side errors, and leaked instances (or alien instanc...
[21:55:18] <wmf-insecte>	 Project beta-scap-eqiad build #131039: 04STILL FAILING in 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/131039/
[21:55:25] <wikibugs>	 10Beta-Cluster-Infrastructure: Redirecting to a page on beta cluster doesn't work - https://phabricator.wikimedia.org/T151894#2833037 (10bd808)
[21:56:16] <hashar>	 andrewbogott: I ran puppet on nodepool server and it is down to 19 servers max. Thx!
[21:56:27] <andrewbogott>	 cool
[21:59:53] <hashar>	 enough for today. Will dig logs tomrorwo again :]
[22:01:51] <paladox>	 This is the diff between nodepool 0.1 and 0.2 https://github.com/openstack-infra/nodepool/compare/0.1.0...0.2.0
[22:02:04] <paladox>	 Is it possible to update to 0.2.0?
[22:02:14] <paladox>	 or does it include packages that need backporting?
[22:02:32] <wmf-insecte>	 Yippee, build fixed!
[22:02:33] <wmf-insecte>	 Project beta-scap-eqiad build #131040: 09FIXED in 5 min 37 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/131040/
[22:03:10] <paladox>	 hashar ^^
[22:04:50] <paladox>	 It seems that https://review.openstack.org/#/c/190827/ would require https://github.com/openstack-infra/nodepool/commit/00e39427c248725b21e56508cd223ba83734422d ?
[22:07:17] <paladox>	 hashar according to this https://github.com/openstack-infra/nodepool/commit/da80841ed8a32e8d05a64491a0c82122c4d06e0c that will allow you to detect leaked instances
[22:07:35] <hashar>	 paladox: we already can see the leaked ones
[22:07:41] <hashar>	 and I got the serie of patches on my comptuer
[22:07:46] <paladox>	 oh.
[22:07:48] <hashar>	 cant bump nodepool further
[22:07:56] <hashar>	 it introduced a dependency on python-shade
[22:08:00] <paladox>	 Oh
[22:08:03] <hashar>	 and basically changed the way they talk to the openstack api
[22:08:12] <hashar>	 could not figure out how to run the deb packaging for that lib
[22:08:18] <hashar>	 and much less how to figure out whether it will work
[22:08:27] <paladox>	 Oh, i guess we could request a repo in gerrit and build the dep for jessie / trusty?
[22:08:35] <hashar>	 that is what we do
[22:08:40] <hashar>	 operations/debs/nodepool
[22:08:51] <paladox>	 oh i mean for the python shade
[22:09:04] <hashar>	 then the trouble is finding which version of shade to use
[22:09:08] <hashar>	 but maybe any will work
[22:09:12] <paladox>	 oh
[22:09:14] <hashar>	 hard to know without testing it :
[22:09:15] <hashar>	 (
[22:09:18] <paladox>	 yep
[22:09:24] <paladox>	 We could always ask openstack
[22:09:32] <paladox>	 who may know which versions work and which doint?
[22:09:40] <hashar>	 yeah maybe
[22:09:46] <hashar>	 but unlikely
[22:09:51] <hashar>	 shade is a spinoff of nodepool
[22:10:08] <paladox>	 Oh, but coulden we use pip or does it have to be installed as python-shade?
[22:10:09] <hashar>	 as I understood it they took the logic to serializes API requests out of Nodepool to a standalone lib: shade
[22:10:14] <paladox>	 oh
[22:10:19] <hashar>	 so maybe based on date we can find out a version that works
[22:10:29] <paladox>	 yep
[22:10:35] <hashar>	 and incrementally update nodepool  with the shade version that was available at that time
[22:10:38] <hashar>	 then hope ;]
[22:10:46] <paladox>	  around march / june time
[22:10:48] <paladox>	 2015
[22:11:20] <wikibugs>	 10Beta-Cluster-Infrastructure: Redirecting to a page on beta cluster doesn't work - https://phabricator.wikimedia.org/T151894#2831435 (10Krenair) I imagine this would be something in the mobile_redirect function in modules/varnish/templates/text-frontend.inc.vcl.erb
[22:12:53] <hashar>	 paladox: then I am not sure I will go with the struggle of upgrading nodepool
[22:13:04] <hashar>	 we might well phase it out entirely
[22:13:16] <paladox>	 ok
[22:16:19] <paladox>	 hashar it dosen't show here http://metadata.ftp-master.debian.org/changelogs/main/p/python-shade/python-shade_1.7.0-1_changelog that it requires a newer nodepool version
[22:16:31] <paladox>	 We could use version 0.6.1-1
[22:17:19] <hashar>	 maybe
[22:18:11] <hashar>	 I am off for sleep *wave*
[22:20:12] <paladox>	 hashar http://snapshot.debian.org/package/python-shade/
[22:20:30] <paladox>	 we could use http://snapshot.debian.org/package/python-shade/0.6.1-1/
[22:21:10] <paladox>	 It should probaly install find on jessie.
[22:21:57] <wikibugs_>	 05Continuous-Integration-Scaling, 06Operations, 07Nodepool, 07WorkType-NewFunctionality: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#2833236 (10Paladox) We could upgrade to nodepool 0.2 and use this http://snapshot.debian.org/package/python-shad...
[22:23:20] <andre__>	 Hmm, which Phab project / CC would be appropraite "loss of session data" bugs?
[22:23:21] <andre__>	 https://phabricator.wikimedia.org/T151770
[22:23:50] <paladox>	 Would that be auth?
[22:23:59] <paladox>	 session auth / mediawiki auth?
[22:26:12] <andre__>	 yeah, good point. I'll add that one. Thanks!
[22:26:52] <paladox>	 Your welcome :)
[22:27:43] <wikibugs>	 10Beta-Cluster-Infrastructure: Redirecting to a page on beta cluster doesn't work - https://phabricator.wikimedia.org/T151894#2833266 (10Krenair) Oh, no, maybe not, I misunderstood the hackery going on here: ```krenair@deployment-mediawiki04:~$ curl -vvv -H 'X-Subdomain: M' -H 'Host: en.wikipedia.beta.wmflabs.or...
[22:35:47] <wikibugs_>	 10Beta-Cluster-Infrastructure: Redirecting to a page on beta cluster doesn't work - https://phabricator.wikimedia.org/T151894#2833323 (10Krenair) a:03Krenair ```=$wgMobileUrlTemplate "%h0.m.%h1.%h2"```
[22:47:55] <wikibugs>	 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Redirecting to a page on beta cluster doesn't work - https://phabricator.wikimedia.org/T151894#2833376 (10Krinkle)
[22:48:48] <wikibugs_>	 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Redirecting to a page on beta cluster doesn't work - https://phabricator.wikimedia.org/T151894#2831435 (10Krinkle)
[22:48:51] <wikibugs>	 10Continuous-Integration-Config, 06Release-Engineering-Team, 10QuickSurveys, 10RelatedArticles: QuickSurveys and RelatedArticles configured with wrong url - https://phabricator.wikimedia.org/T151937#2833383 (10Krinkle)
[22:49:46] <wikibugs>	 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10QuickSurveys: QuickSurveys and RelatedArticles configured with wrong url - https://phabricator.wikimedia.org/T151937#2832641 (10Krinkle)
[22:51:13] <wikibugs_>	 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Mobile view url broken on beta cluster (redirect, mobile view, etc.) - https://phabricator.wikimedia.org/T151894#2833389 (10Krinkle)
[22:51:40] <wikibugs>	 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 2 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2833394 (10fgiunchedi)
[22:51:56] <Krenair>	 ty Krinkle 
[23:01:05] <wikibugs_>	 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T150465#2833420 (10demon)
[23:14:35] <wikibugs_>	 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 2 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2833467 (10ssastry) Thanks @tstarling for @fgiunchedi for fixing the visibility.  >>! In T151702#2831201, @Joe wrote: > In the long chain changeprop => restb...
[23:56:08] <wikibugs>	 10Beta-Cluster-Infrastructure, 06Operations, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2833652 (10fgiunchedi)
[23:56:13] <wikibugs_>	 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery, 06Operations, 13Patch-For-Review: Puppet sslcert::ca does not refresh the certificate symlinks when a .crt is updated - https://phabricator.wikimedia.org/T145609#2635904 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving this in fav...