[07:12:56] marking train-idwiki-dbn-20180215-query_explorer-pruned_mrmr as SUCCESS, seems like the model was generated, skein consider the first attempt to have failed but I don't see what's the error, subsequent runs all failed because the model already exists
[07:17:22] something similar happened to train-viwiki-dbn-20180215-query_explorer-pruned_mrmr, I see a weird failure when spark shuts down, "shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory"
[07:17:31] marking as success to unblock
[07:28:01] quick errand
[07:36:14] Do we actually use Kibana on relforge? Context: https://phabricator.wikimedia.org/T392620
[08:01:34] gehel: I never really used it, we stopped pushing query logs there so I'm not sure there's a strong use-case there but better ask Erik if he has plans
[08:04:38] ebernhardson: ^ for when you're around
[08:04:46] dcausse: thanks
[09:42:11] pfischer / dcausse : do we need to care about T395402 ?
[09:42:13] T395402: MediaWikiCronJobFailed - https://phabricator.wikimedia.org/T395402
[09:43:58] dcausse / pfischer: are we good on T395425 or do we need anything else?
[09:43:58] T395425: Updating weighed tags via EventBus in beta does not work - https://phabricator.wikimedia.org/T395425
[10:05:46] gehel: re T395402 commented, T395425: closed
[10:05:47] T395402: MediaWikiCronJobFailed - https://phabricator.wikimedia.org/T395402
[10:05:47] T395425: Updating weighed tags via EventBus in beta does not work - https://phabricator.wikimedia.org/T395425
[10:05:58] lunch
[10:27:28] Lunch
[12:55:49] o/
[12:57:20] seems like we're hit by https://github.com/opensearch-project/OpenSearch/issues/7860 :(
[12:57:39] seeing such errors in the logs illegal_argument_exception: field value function must not produce negative scores, but got: [-0.5083467960357666] for field value: [-0.5083467960357666]
[13:09:53] filed T395677, 0.5 qps affected, pretty significant :(
[13:09:54] T395677: Search backend error: illegal_argument_exception: field value function must not produce negative scores - https://phabricator.wikimedia.org/T395677
[13:24:25] it's actually glent m1run causing this
[13:34:34] we don't seem to store the source there, can't actually check but probably the glent is pushing negative scores there
[13:35:18] ebernhardson: probably means that current a/b test might have to be restarted after fixing this
[13:38:26] first error is on May 22, 18:37
[13:39:24] the a/b test started on May 29...
[13:41:06] or perhaps due to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1144631/ deployed as part of the train
[13:52:29] :S ok
[14:04:52] I guess they don't plan on fixing that in OpenSearch 1.x?
[14:13:46] inflatador: I don't think so but turns out it might not be what's causing the problem for us
[14:58:36] heading in to my office, back in 30
[15:15:02] ok, time to shutdown this computer, have a nice week-end!
[15:15:09] enjoy!
[15:21:16] wonder if glent should save its aggregated suggestions somewhere...it's a bit tedious but currently loading the last dump out of swift to find where we give negative scores
[15:33:50] .o/
[15:46:37] well there are 100% negative scores in there. The lazy answer is to add a big constant to all the scores and call it a day :P Pondering how they actually get negative...probably have to step through the algo
[15:47:52] the obvious guess would be that the score is `logGeoMean - dist`, clearly dist is larger than logGeoMean. Maybe we do just add a constant
[16:11:08] hey, do we actually do anything with curator on our production hosts? I see we install it but I don't see any timers a la apifeatureusage
[16:13:35] inflatador: not for cirrus at least
[16:14:59] Ref https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/opensearch/cirrus/server.pp#31 ...the version is hard-coded and that's causing puppet failures. If we don't need it I'd just as soon remove the class
[16:16:02] Otherwise it's pretty trivial to repeat what I did here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151775/6/modules/profile/manifests/apifeatureusage/logstash.pp
[16:17:08] looks like at one time curator was used to do things like update shard allocation states, but not seeing anything current
[16:22:16] I made a CR to remove it ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152297 ) , but if you prefer to be more conservative I can make it install without the hard-coded version number
[16:23:19] +1, seems fine to remove
[16:24:10] Cool, will merge shortly
[16:25:14] dammit, that patch was supposed to have an ensure=>absent
[16:25:38] oh well, let's see if puppet is happy
[19:40:44] heads up ryankemper , I'm gonna start the decom cookbook for relforge100[3-4]
[19:41:04] inflatador: ack!
[19:46:50] ryankemper I forgot, there's still references to relforge1003 in deployment-charts. I wanna say we already pushed a patch for that?
[19:47:05] yup https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1147893
[19:47:39] Oh yeah we forgot to circle back on that, we need to figure out the calico dst stuff
[19:48:42] I'm gonna go ahead and merge and deploy the staging updater. I'll keep an eye out but it's not a huge deal if the staging updater is broken
[19:53:10] ryankemper just deployed, here's the dashboard for staging cirrus https://grafana.wikimedia.org/goto/wv-dGfBHR?orgId=1
[19:57:38] looks like the consumer is unhappy, `.UnknownTopicOrPartitionException: This server does not host this topic-partition.\n"}`
[19:57:41] I'm gonna try a rollback
[19:59:19] * inflatador is almost positive this has nothing to do with the changes and everything to do with it pulling in the new flink-app chart
[20:01:54] I stand corrected. Rollback was successful
[20:02:10] err...nope, wrong again! I'm getting the same errors
[20:04:36] I'm giving up on the deploy and the decom for relforge stuff now
[20:43:43] I'll talk to d-causse_off , about the streaming updater on Monday, assuming he's not out for awhile
[20:46:04] in the meantime I'll decom some of the servers from T394350
[20:46:04] T394350: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350