[00:16:47] Project selenium-Flow » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #114: 04FAILURE in 46 sec: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/114/ [00:19:36] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 13Patch-For-Review, 15User-zeljkofilipin: Various browser tests failing due to login error - https://phabricator.wikimedia.org/T142600#2563084 (10Tgr) I re-ran the failing test but nothing was logged on beta (see [[https://logstash-beta.wmflabs.org/app... [00:23:43] RECOVERY - Puppet run on integration-slave-jessie-1004 is OK: OK: Less than 1.00% above the threshold [0.0] [00:53:10] PROBLEM - Puppet run on integration-slave-jessie-1005 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [04:19:10] Yippee, build fixed! [04:19:11] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #112: 09FIXED in 23 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/112/ [04:32:05] RECOVERY - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is OK: OK: Less than 100.00% above the threshold [0.0] [06:37:19] 05Gitblit-Deprecate, 10Diffusion, 10Internet-Archive: Some months-old commits are still marked as "importing" - https://phabricator.wikimedia.org/T143298#2563402 (10Nemo_bis) [06:37:31] 05Gitblit-Deprecate, 10Diffusion: Some months-old commits are still marked as "importing" - https://phabricator.wikimedia.org/T143298#2563416 (10Nemo_bis) [08:05:09] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 TLS Redirect - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 588 bytes in 0.002 second response time [08:06:17] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 TLS Redirect - string 'Wikipedia' not found on 'http://en.m.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 590 bytes in 0.002 second response time [10:21:19] (03CR) 10Zfilipin: [C: 04-1] "From phab task: "@hashar: Since I left WMDE, both addresses will be bouncing. Just remove me from the list."" [integration/config] - 10https://gerrit.wikimedia.org/r/304740 (https://phabricator.wikimedia.org/T85913) (owner: 10Hashar) [10:42:48] PROBLEM - Puppet run on deployment-imagescaler01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [10:52:24] (03CR) 10Zfilipin: Update Adrian Heine email (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/304740 (https://phabricator.wikimedia.org/T85913) (owner: 10Hashar) [10:56:44] (03PS2) 10Zfilipin: Remove Adrian Heine as owner of Wikibase and Wikidata Selenium tests. [integration/config] - 10https://gerrit.wikimedia.org/r/304740 (https://phabricator.wikimedia.org/T85913) (owner: 10Hashar) [10:58:03] (03CR) 10Zfilipin: Remove Adrian Heine as owner of Wikibase and Wikidata Selenium tests. (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/304740 (https://phabricator.wikimedia.org/T85913) (owner: 10Hashar) [10:58:49] (03PS3) 10Zfilipin: Remove Adrian Heine as owner of Wikibase and Wikidata Selenium tests. 
[integration/config] - 10https://gerrit.wikimedia.org/r/304740 (https://phabricator.wikimedia.org/T85913) (owner: 10Hashar) [10:58:58] (03CR) 10Zfilipin: [C: 032] Remove Adrian Heine as owner of Wikibase and Wikidata Selenium tests. [integration/config] - 10https://gerrit.wikimedia.org/r/304740 (https://phabricator.wikimedia.org/T85913) (owner: 10Hashar) [11:00:05] (03Merged) 10jenkins-bot: Remove Adrian Heine as owner of Wikibase and Wikidata Selenium tests. [integration/config] - 10https://gerrit.wikimedia.org/r/304740 (https://phabricator.wikimedia.org/T85913) (owner: 10Hashar) [11:12:49] RECOVERY - Puppet run on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:49:29] PROBLEM - Puppet run on deployment-cache-text04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:55:34] PROBLEM - Puppet run on deployment-mediawiki02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:56:58] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [12:15:30] Yippee, build fixed! [12:15:30] Project selenium-Echo » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #120: 09FIXED in 59 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/120/ [12:15:53] Yippee, build fixed! [12:15:53] Project selenium-Flow » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #115: 09FIXED in 1 min 4 sec: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/115/ [12:30:34] RECOVERY - Puppet run on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:31:31] Yippee, build fixed! [12:31:32] Project selenium-WikiLove » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #116: 09FIXED in 1 min 51 sec: https://integration.wikimedia.org/ci/job/selenium-WikiLove/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/116/ [12:31:58] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0] [13:38:01] PROBLEM - Puppet run on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:23:02] RECOVERY - Puppet run on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [14:55:48] hrmmm, there are 4 nodepool instances just hanging out in the jenkins ui: https://integration.wikimedia.org/ci/ that's probably no good (I suspect) [14:57:43] greg-g: should be fine. 
They're all listed as ready according to nodepool list [14:58:07] RECOVERY - Puppet run on integration-slave-jessie-1005 is OK: OK: Less than 1.00% above the threshold [0.0] [14:58:16] this is just the backlog of instances it's supposed to keep at the ready so that large influxes of patches don't overwhelm (although you rarely see it in practice :)) [14:58:44] ahhhhh [14:58:51] yeah, I do rarely see it in practice :) [14:59:06] Deficit: ci-trusty-wikimedia: 0 (start: 0 min-ready: 2 ready: 2 capacity: 5); Deficit: ci-jessie-wikimedia: 0 (start: 0 min-ready: 2 ready: 2 capacity: 5) [14:59:42] word, ty [15:17:01] non-ping-to-tyler: and yeah, now that tons of the jobs are moved back on to permanent machines I'm more likely to see that ready-state now when it's slow [15:17:21] * greg-g just saw the very busy permanent machines and relatively unbusy nodepool ones [16:00:00] yeah greg-g I think that's as it should be it just never really had much standby room before :) [16:00:42] exactly :) [16:01:02] (I honestly just forgot about the stand-by feature) [16:15:11] anyone know where to start digging into why the post-merge test for ops/puppet started failing all the time? -- https://integration.wikimedia.org/ci/job/operations-puppet-doc/25672/console [16:15:20] "(Errno::ENOENT) No such file or directory - dummy.rb" [16:18:36] cf: https://phabricator.wikimedia.org/T143233 [16:18:56] that tab has been open in my browser since rob reported it, and no, not me [16:21:48] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:29:44] thcipriani: https://gerrit.wikimedia.org/r/#/c/305529/ [16:30:03] it hates something about that modules/puppetdbquery/bin/find-nodes file that _joe_ added [16:30:34] I wonder if it's related to the rubocop exclusion for that whole module? 
[16:30:51] looks like it comes from an upstream [16:31:46] * thcipriani looks [16:35:59] thcipriani: trying to ditch that trailing newline in init.pp there and gerrit rejects as no change bah [16:37:08] thcipriani: thanks [16:37:29] * thcipriani tips hat [16:38:53] legoktm: thank you for jjb wrangling, I get lost in there :) [16:39:36] :D [16:40:31] 10Continuous-Integration-Config: Move npm-node-4 jobs off of nodepool - https://phabricator.wikimedia.org/T142892#2564850 (10Legoktm) a:03Legoktm [16:46:39] greg-g: https://phabricator.wikimedia.org/T143233#2564858 -- maybe we are just using an ancient ruby in the doc build step [16:47:17] 1.9.1 is pretty old school [16:48:44] (03PS1) 10Legoktm: Move npm-node-4 off of nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/305532 (https://phabricator.wikimedia.org/T142892) [16:48:49] (03PS2) 10Legoktm: Move npm-node-4 off of nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/305532 (https://phabricator.wikimedia.org/T142892) [16:51:47] (03CR) 10Legoktm: [C: 032] Move npm-node-4 off of nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/305532 (https://phabricator.wikimedia.org/T142892) (owner: 10Legoktm) [16:52:54] (03Merged) 10jenkins-bot: Move npm-node-4 off of nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/305532 (https://phabricator.wikimedia.org/T142892) (owner: 10Legoktm) [16:53:03] !log deploying https://gerrit.wikimedia.org/r/#/c/305532/ [16:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:59:46] bd808: /me shrugs [17:01:35] 05Gitblit-Deprecate, 13Patch-For-Review: Fix references to git.wikimedia.org in all repos - https://phabricator.wikimedia.org/T139089#2419069 (10greg) >>! In T139089#2564318, @Paladox wrote: > I uploaded https://github.com/wmde/Wikiba.se/pull/50 for #wikidata team. I am not sure where the task for updating tha... [17:02:52] thcipriani: https://gerrit.wikimedia.org/r/#/c/305537/ :) [17:04:27] chasemp: looks good, I really think the change in rate helped a ton. [17:04:51] I think so too, my guess is we can up the ready count [17:04:51] I just want to settle here for a bit [17:04:53] at least 24 hours [17:05:06] sure, makes sense. [17:05:49] watched it for a while last night after you made that change, and for a while this morning: nothing looks crazy in the debug log :) [17:06:15] 10Continuous-Integration-Config, 13Patch-For-Review: Move npm-node-4 jobs off of nodepool - https://phabricator.wikimedia.org/T142892#2565030 (10Legoktm) 05Open>03Resolved [17:06:46] the main thing w/ rate is does it affect outcome timeline? but I think no way it may even be uselessly aggressive now honestly [17:06:50] but 10s seems to be benign even where it isn't particularly impactful [17:06:53] so far [17:08:25] eh, I'll watch the graphs, I'd be surprised if that made anything worse. I think it's likely more disruptive to add a server to your database, hit a 403, delete, repeat super rapidly. At least that's my theory. [17:08:49] same [17:08:58] probably know more at CI peak times today :) [17:09:14] when is that? [17:09:42] eh, varies, but seemingly early afternoon [17:13:29] thcipriani: so there was that issue w/ instance stuck in delete (I think possibly from some race condition you noted above), but been thinking about it [17:13:33] nodepool reports instance age [17:13:48] is there ever a time when a disposable VM should live greater than say 10m in a healthy way?
[17:13:53] we should possibly monitor on that [17:14:20] isn't a security release supposed to be happening soon? those are the best CI stress tests because like 20-30 MW core patches get uploaded simultaneously [17:15:14] ostriches: did mention something about that yesterday. Dunno how much work yet remains there, though. [17:21:05] also, fwiw, 10 minutes is probably pushing it for some of the jobs that are running on nodepool. [17:23:23] 10 minutes for...? sorry, missing context [17:24:05] just replying to chasemp 's statement: is there ever a time when a disposable VM should live greater than say 10m in a healthy way? [17:24:37] that's my point :) let's pick an upper limit and alert on anything that hangs out too long [17:24:46] catches a handful of bad cases I can think of [17:24:54] catching the failure to delete itself is pretty edge but falls in this too [17:25:15] idk if 10m, whatever makes sense for "should never happen" [17:25:24] I meant pushing it the other direction, too tight a timeline in some instances, unfortunately [17:25:50] what takes 10m (just curious)? and what's a reasonable upper bound? [17:26:25] eh, mediawiki jobs can take some time, latest one took 8:41 https://integration.wikimedia.org/ci/job/mediawiki-phpunit-php55-trusty/1356/ [17:26:35] 15m would probably be a good cut off [17:30:03] if nothing else a test that takes 15m to run is probably doing something wrong [17:30:58] hopefully :P [17:31:27] was reading a bit about the reference implementation of openstack upstream [17:31:36] which...is very different now and they have ditched jenkins entirely [17:31:41] so that puts us in a weird spot [17:32:01] but also a lot (most of? all of?) their providers are basically ppl w/ capacity who said "yeah point your stuff at us and consume x count VMs" [17:32:18] that's why all the scaffolding around providers and divvying up test run locations [17:32:25] they're load balancing test pools from donors [17:32:40] shed some light on nodepool internals for me [17:36:08] interesting. certainly explains some of their design decisions. [17:37:19] like nodepool not really taking no for an answer for any one provider if the config says something different. [17:37:36] right [17:37:44] that and rate is /per provider/ [17:37:53] that makes a ton of sense in that perspective [17:37:57] yarp [17:37:58] looking at http://status.openstack.org/zuul/ is always scary [17:38:38] heh, no kidding [17:39:20] I just caught it consuming 11 instance slots even at the current 10 max [17:39:26] :) [17:40:31] what it's not doing is: 1) hey delete this 2) wait reclamation period 3) is deleted? [17:40:49] what it is doing: 1) hey delete this 2) (5s later at this point) yeah so I need another one thanks [17:44:03] that would make sense, I've seen it doing similar things in times where it's 40 instances behind it'll request all the things at once, although I'm still not clear on what was causing the underallocation of 2 nodes that we were seeing. Requests hitting openstack at the same time and so it rejects both? This is also probably something that the openstack folks don't look at given their size and [17:44:05] provider setup. [17:45:30] we should just move all of our CI to Travis [17:45:31] * bd808 ducks [17:45:37] yeah on the underallocation the really interesting part I think is that it's not deterministic w/ upper bound for quota [17:45:40] it's not always offset 4 or 2 or 3 [17:46:11] bd808: Or just write less tests.
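
A minimal sketch of the "pick an upper limit and alert on anything that hangs out too long" idea from the exchange above, shaped as a Nagios/Shinken-style check (the same kind of alerting that produces the PROBLEM/RECOVERY lines in this log). Everything concrete in it is an assumption rather than something confirmed here: that the openstack CLI is available and credentialed on the monitoring host, that Nodepool-built instances can be matched by a name prefix such as ci-, and that the 15-minute cutoff floated above is the right one. It would also catch the "failure to delete itself" edge case, since an instance stuck in delete just keeps aging.

#!/usr/bin/env python
# Sketch: flag CI instances that have lived longer than a cutoff.
# Assumptions: the `openstack` CLI is installed and credentialed, instances
# are matchable by NAME_PREFIX, and 15 minutes is a sane "should never
# happen" threshold -- all hypothetical, not confirmed in this log.
import datetime
import json
import subprocess
import sys

MAX_AGE = datetime.timedelta(minutes=15)  # hypothetical cutoff from the discussion above
NAME_PREFIX = 'ci-'                       # hypothetical naming convention

def server_ids():
    out = subprocess.check_output(
        ['openstack', 'server', 'list', '-f', 'json', '-c', 'ID', '-c', 'Name'])
    return [s['ID'] for s in json.loads(out) if s['Name'].startswith(NAME_PREFIX)]

def created_at(server_id):
    out = subprocess.check_output(
        ['openstack', 'server', 'show', server_id, '-f', 'json'])
    # Nova reports creation time as ISO 8601 UTC, e.g. 2016-08-22T17:13:48Z
    return datetime.datetime.strptime(json.loads(out)['created'], '%Y-%m-%dT%H:%M:%SZ')

def main():
    now = datetime.datetime.utcnow()
    stale = [sid for sid in server_ids() if now - created_at(sid) > MAX_AGE]
    if stale:
        print('CRITICAL: %d instance(s) older than %s: %s'
              % (len(stale), MAX_AGE, ', '.join(stale)))
        return 2  # Nagios/Shinken CRITICAL
    print('OK: no long-lived CI instances')
    return 0

if __name__ == '__main__':
    sys.exit(main())
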
[17:46:17] I believe it's based on rate of create/test/delete and load on the systems ability to keep up w/ reclamation [17:46:17] I spent some time looking at whether it was already 4 under etc [17:46:17] nope [17:46:49] so currently nodepool is aggressive enough that at a quota of 15 we can afford max 10 concurrent from nodepools perspective [17:47:02] and usually it's less than 10 atm where it was always pegged at 6, now we see dips sub 5 (or based on anecdotal) [17:47:15] and it will catch BUILD on 2 and then delete etc [17:47:28] but it often says delete 2 and create 2 immediately after in the same way as 1 [17:47:36] yarp, hmm, so openstack is responding to the rate of requests as well as to the quota? [17:48:29] more like, at quota of 15 we can afford to say max 10 in nodepools state table w/ a rate 10s due to churn variance for delete/create processing [17:48:41] it won't be sane to say 10 is quota in openstack and 10 max in nodepool like we've been doing [17:48:46] unless all operations are perfectly real time [17:48:53] it's funny we have been doing that in hindsight I think [17:49:11] funny head smacking I mean [17:49:49] partially theory partially bears out under testing [17:51:51] (because nodepool doesn't confirm remote side state changes) [17:51:56] huh, seems like a problem that would be easier to solve by nodepool checking to see if openstack deleted an instance before deleting it from its instance table :\ [17:52:09] it's fast and loose I guess? yeah [17:52:26] like a rocket powered go-cart [17:52:49] a snail duct taped to a bottle rocket ;) [17:53:15] hehe, that one made me sad. [17:54:54] come to think of it, how does it surface when a node is stuck in delete state? [18:36:49] 06Release-Engineering-Team, 06Operations, 15User-greg, 07Wikimedia-Incident: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2565426 (10greg) >>! In T141287#2522931, @greg wrote: > * ACTION: Greg will follow up with Faidon and Kevin via email in 2... [19:03:29] !log Upgrading hhvm-wikidiff2 in beta cluster [19:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:06:11] I ran into a bit of a CI hurdle today, and would love some help... https://integration.wikimedia.org/ci/job/php55lint/21408/console [19:06:36] One of our fundraising repos requires the Symfony http-foundation library, which then requires a php 5.4 polyfill library [19:07:07] This passes a php5.3 lint job, but fails the php5.5 one because it redefines built-in PHP classes. [19:08:17] if php -l is failing [19:08:34] They shouldn't be saying it's compat with that version of PHP? [19:09:08] The idea is that the polyfill module has conditionals that prevent this class from being loaded unless the PHP version is correct, I believe [19:09:27] But calling php -l on the file directly circumvents that [19:11:34] probably [19:38:49] PROBLEM - Puppet run on deployment-fluorine02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:07:38] in beta cluster, is there a way to execute commands across the cluster, in absence of dsh and salt? [20:14:32] eh, it actually has dsh [20:19:44] and it has salt: deployment-salt2.deployment-prep.eqiad.wmflabs [20:20:06] er, deployment-salt02, rather [20:22:24] ah, salt has to be run from a dedicated host...
[20:22:38] I obviously know nothing about it because not ops [20:29:52] 10MediaWiki-Releasing, 10MediaWiki-Containers: Ready-to-use Docker package for MediaWiki - https://phabricator.wikimedia.org/T92826#1121273 (10richard.fanning) @TheGleep I reckon the MYSQL version you are using is > 5.7.5 http://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_get-lock... [20:30:47] !log Restarted hhvm on appservers for wikidiff2 upgrades [20:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:31:04] *in labs :P [20:40:43] * greg-g looks at the channel [20:40:44] yep [20:41:47] Project selenium-Echo » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #121: 04FAILURE in 46 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/121/ [21:02:06] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-ORES, 10ORES, 06Revision-Scoring-As-A-Service, 07Spike: [Spike] Should we make a model for ores in beta? - https://phabricator.wikimedia.org/T141980#2565966 (10Halfak) OK. I think that this is done then. I don't think we need custom models. Let's j... [21:03:43] Yippee, build fixed! [21:03:43] Project selenium-Wikidata » firefox,test,Linux,contintLabsSlave && UbuntuTrusty build #90: 09FIXED in 2 hr 13 min: https://integration.wikimedia.org/ci/job/selenium-Wikidata/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/90/ [21:09:47] 10Beta-Cluster-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Setup poolcounter daemon in Beta Cluster - https://phabricator.wikimedia.org/T38891#418990 (10AlexMonk-WMF) This instance is running precise (T143349) - it should be replaced with a trusty instance to match production codfw (we don't ha... [21:34:18] greg-g, MaxSem, Reedy, thcipriani [21:34:21] looking for some input [21:34:47] for https://phabricator.wikimedia.org/T143349 and https://phabricator.wikimedia.org/T142288 I've made deployment-fluorine02 [21:35:11] it runs jessie but there are no udp2log packages for jessie, and so this happened: https://phabricator.wikimedia.org/P3851 [21:35:47] technically this means it's not completely puppetised [21:36:18] two things need to happen: [21:36:26] 10Continuous-Integration-Infrastructure, 06Labs, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2554147 (10chasemp) yesterday we had some issues with CI and @thcipriani and I poked at it for a bit making some change... [21:36:41] * change everything to send to this, kill -fluorine [21:37:15] * work with ops to get jessie packages [21:37:16] https://phabricator.wikimedia.org/T123728 is also relevant [21:38:35] which should I do first? 
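
A quick way to sanity-check the "change everything to send to this" option before repointing anything: throw one test line at the new udp2log endpoint and see whether it shows up on disk. This is a sketch only; the hostname, the port (8420 is the usual MediaWiki udp2log port, but verify it against the puppet config), the log directory and the channel name are all assumptions, not values confirmed anywhere in this log.

#!/usr/bin/env python
# Sketch: send one test line to a udp2log endpoint to confirm it is listening.
# Assumptions: host and port below are guesses (check the puppet role for the
# real values); in the usual mwlog setup the first whitespace-delimited token
# is treated as the channel and demuxed into <channel>.log under the log dir.
import socket
import time

UDP2LOG_HOST = 'deployment-fluorine02.deployment-prep.eqiad.wmflabs'  # assumed
UDP2LOG_PORT = 8420  # assumed; verify against the udp2log config

line = 'testchannel udp2log smoke test %s\n' % time.strftime('%Y-%m-%dT%H:%M:%S')
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(line.encode('utf-8'), (UDP2LOG_HOST, UDP2LOG_PORT))
print('sent %r -- now grep for "testchannel" in the log directory on the target host' % line)
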
[21:40:46] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10DBA, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2566110 (10AlexMonk-WMF) [21:40:50] 10Beta-Cluster-Infrastructure, 05Goal: Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] - https://phabricator.wikimedia.org/T142288#2566109 (10AlexMonk-WMF) [21:40:56] 10Beta-Cluster-Infrastructure, 05Goal: Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] - https://phabricator.wikimedia.org/T142288#2530145 (10AlexMonk-WMF) [21:40:57] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10DBA, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2409864 (10AlexMonk-WMF) [21:48:33] Krenair: hrm, looks like udp2log is running after you did some fancy forcing. I'd guess the easiest course would be (if everything on that machine seems normal) to start pointing beta things over to it. [21:49:02] looks like there are some other puppet failures though...might just be ldap-related things. [21:49:27] the puppet failure is a long-term known thing [21:49:36] https://phabricator.wikimedia.org/T117028 [21:49:50] it used to fail with that on deployment-fluorine, don't know what made it stop [21:52:04] ah, I see. FWIW, I think the repointing of things will be the most fruitful course of the two you proposed. Ops probably has this in their queue and I don't think that an update to jessie versions of packages will break things (/me finds wood to knock on) [22:03:44] !log deployment-fluorine02: Hack 'datasets:x:10003:997::/home/datasets:/bin/bash' into /etc/passwd for T117028 [22:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [22:11:08] bd808, aha thanks [22:11:43] I'm not sure if I did that before or someone else did [22:11:51] it at least shuts puppet up [22:13:48] RECOVERY - Puppet run on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:18:04] (03CR) 10EBernhardson: [C: 032] Add detection for calling global functions in target classes. [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/301335 (owner: 10Lethexie) [23:18:48] (03Merged) 10jenkins-bot: Add detection for calling global functions in target classes. [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/301335 (owner: 10Lethexie) [23:53:44] I'm copying deployment-fluorine:/srv/mw-log to deployment-fluorine02:/srv/mw-log-old [23:55:27] will try to move things to the correct positions later [23:57:21] btw, one thing I keep meaning to bring up thcipriani: [23:57:28] krenair@deployment-changeprop:~$ ps auxwf | grep -v grep | grep -A 2 puppet [23:57:28] root 12799 0.0 0.5 84872 11676 pts/4 S+ 23:10 0:00 \_ sudo puppet agent -tv [23:57:28] root 12800 0.3 3.7 253096 77612 pts/4 Sl+ 23:10 0:10 \_ /usr/bin/ruby /usr/bin/puppet agent -tv [23:57:28] root 13945 0.0 0.1 22488 2496 ? Ss 23:13 0:00 \_ /bin/systemctl start salt-minion [23:57:53] root 2368 0.0 0.0 257304 2012 ? Ssl Aug01 0:48 /usr/sbin/rsyslogd -n [23:57:56] puppet on that host gets stuck running salt-minion [23:58:05] I'm guessing for.. trebuchet? [23:58:21] hrm, possibly? [23:58:34] I can take a look [23:58:51] or we can file a ticket, if you don't know off the top of your head :)
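
On the last question (puppet wedged on "systemctl start salt-minion"): before filing a ticket it can help to record how long the agent and the child it is blocked on have been stuck. A small sketch using the procps elapsed-time field; it only makes the hang easy to spot and says nothing about why the unit start never returns (the trebuchet guess above is just that, a guess).

#!/usr/bin/env python
# Sketch: show how long any `puppet agent` run -- and anything it is blocked
# on, like the `systemctl start salt-minion` seen above -- has been running.
import subprocess

out = subprocess.check_output(
    ['ps', '-eo', 'pid,ppid,etimes,args', '--no-headers'])
rows = []
for line in out.decode('utf-8', 'replace').splitlines():
    pid, ppid, etimes, args = line.split(None, 3)
    if 'puppet agent' in args or 'systemctl start' in args:
        rows.append((int(etimes), pid, ppid, args))

# Longest-running (most suspicious) entries first.
for etimes, pid, ppid, args in sorted(rows, reverse=True):
    print('%6ds  pid=%s ppid=%s  %s' % (etimes, pid, ppid, args))
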