[07:56:12] joal: o/
[08:06:44] Hi elukey
[08:12:12] elukey: When your brain is on enough caffeine and your schedule allows, let's try to fix deployment-aqs if you wish
[08:13:56] I am working on it now :)
[08:14:16] I thought it was missing grants and I also tried to switch to the 'cassandra' user
[08:14:20] but same error
[08:18:03] I keep seeing "All host(s) tried for query failed.","levelPath":"error/cassandra/table_creation","name":"aqs"},"req":{"uri":"/analytics.wikimedia.org/sys/table/pageviews.per.article.flat","method":"put","
[08:19:03] have we made changes to the keyspaces in prod that we didn't do in beta maybe?
[08:19:30] (so the thing is now complaining because it would like to create a keyspace that is already there?)
[08:20:35] elukey: We've not changed keyspaces
[08:20:48] elukey: I checked permissions as well yesterday ...
[08:21:06] elukey: I wonder if it wouldn't be the localDc config param
[08:21:12] but I can't be sure
[08:21:47] localDc?
[08:23:15] In config file, there is a localDc param that doesn't exist in example config files
[08:23:35] we have it in prod.. any idea what that means?
[08:24:01] I think it's to tell cassandra about DCs (since clusters can be cross-DC)
[08:24:12] elukey@deployment-aqs03:~$ nodetool status
[08:24:12] Datacenter: datacenter1
[08:24:12] =======================
[08:24:17] * elukey cries
[08:24:58] I'm very sorry about that
[08:25:23] * joal pats elukey
[08:26:03] "Error: localDc not in configured datacenters\n at validateAndNormalizeDcConf
[08:28:25] ahhh
[08:28:32] there is also a "datacenters" variable
[08:28:49] tcp6 0 0 :::7232 :::* LISTEN 23200/nodejs
[08:28:52] joal: --^
[08:29:12] checking the puppet config to fix this mess in labs
[08:30:32] Thanks elukey
[08:30:38] cassandra_local_dc => $::site,
[08:30:50] this is the variable in the aqs profile
[08:31:23] atm we don't grab it from hiera
[08:32:25] so I can fix puppet to allow the config to be checked in hiera (will require a hopefully no-op change in prod)
[08:32:39] or we can nuke the cassandra cluster and create one called eqiad
[08:32:52] (in labs)
[08:35:18] I'd be inclined to start from a clean state so we are sure that everything is ok
[08:36:35] did we create the cassandra cluster with 'datacenter1'?
[08:37:18] in aqs/init.pp:
[08:37:18] # [*cassandra_local_dc*]
[08:37:19] # Which DC should be considered local. Default: 'datacenter1'.
[08:37:53] elukey: I second you in thinking starting afresh would be good
[08:38:14] elukey: It would also mean us modifying scripts for aqs and dataloader users, no?
[08:39:12] it is also the default dc for cassandra/init.pp
[08:39:46] joal: for labs do you mean?
[08:40:00] elukey: indeed
[08:40:16] elukey: this is my wonder: Do we have automated user creation for prod?
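Note on the "localDc not in configured datacenters" error quoted above: a minimal sketch, in Python, of the kind of check that message suggests. This is an illustration only, not the code behind validateAndNormalizeDcConf; the function name validate_local_dc and its return shape are hypothetical, and the values used below are examples taken from the discussion (the beta cluster reports 'datacenter1', while the profile passes $::site).

```python
def validate_local_dc(local_dc, datacenters):
    """Check that the local DC is one of the configured datacenters.

    Hypothetical re-creation of the check implied by the error message;
    the real validateAndNormalizeDcConf may behave differently.
    """
    if local_dc not in datacenters:
        # Same wording as the error seen in the logs.
        raise ValueError("localDc not in configured datacenters")
    return {"localDc": local_dc, "datacenters": list(datacenters)}


# A config whose localDc matches what the cluster reports passes the check.
ok = validate_local_dc("datacenter1", ["datacenter1"])

# A localDc taken from the site name fails when the cluster only knows
# 'datacenter1' (example values, assumed for illustration).
try:
    validate_local_dc("eqiad", ["datacenter1"])
    failed = False
except ValueError:
    failed = True
```

Under these assumptions, either renaming the cluster's datacenter or making localDc configurable per environment would satisfy the check, which are the two options weighed in the messages above.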
[08:40:26] If so, should be easily replicable for labs
[08:40:46] if not, we're at risk in case of cluster nuke (which would be bad for many other reasons, ok :)
[08:44:16] so there is an adduser.cql file on every host on which cassandra runs, containing the steps to re-create the aqs user
[08:44:23] not sure about the dataloader
[08:44:28] but the user creation is not automatic
[08:44:35] (via puppet I mean)
[08:44:36] ok elukey
[08:44:51] If at least we have the CQL file, it's not that bad :)
[08:45:18] elukey: is it pushed by puppet, or has it been manually added (I suspect the latter)
[08:46:05] the latter :)
[08:46:16] so https://gerrit.wikimedia.org/r/385332 should be fine to allow us to specify
[08:46:22] 'datacenter1' in labs
[08:46:34] restbase does it
[08:46:47] (this is why I hate having so many things split apart from them in puppet)
[08:47:11] and the change seems a no-op https://puppet-compiler.wmflabs.org/compiler02/8391/aqs1004.eqiad.wmnet/
[08:47:43] profile::restbase::cassandra_local_dc: "datacenter1"
[08:48:25] Looks good elukey - I'm assuming the change for labs will be done with horizon?
[08:48:37] all right joal, if you are ok I'll merge the patch with puppet disabled on aqs prod as uber precaution
[08:48:45] deploy it in prod, make sure it is a no-op
[08:48:56] elukey: sounds good
[08:48:57] and then change labs via horizon
[08:49:29] elukey: hi!
fyi we started to document the pipeline we use for search: https://wikitech.wikimedia.org/wiki/Search/MLR_Pipeline
[08:51:43] \o/
[09:00:56] joal: no op in prod but it doesn't work in labs hahaha
[09:01:02] trying to check what is wrong
[09:01:06] :(
[09:01:09] Crap
[09:03:26] hiera weirdness
[09:03:36] so if I apply the change in prefix-puppet it doesn't work
[09:03:49] but if I apply at host level hiera config it does
[09:04:18] O_O
[09:05:01] that is weird since I didn't expect it from https://wikitech.wikimedia.org/w/index.php?title=Puppet_Hiera#In_Labs
[09:05:50] but this is a role lookup
[09:06:20] so it might be that since we have ::site in role/common/aqs it gets applied after prefix-puppet in horizon
[09:12:12] joal: cluster up, all good
[09:12:20] elukey: you're the man
[09:12:21] let's try to deploy and see if all works fine
[09:12:37] elukey: deployment-aqs03 has it
[09:12:44] elukey: trying to restart it
[09:13:27] elukey: looks positive :)
[09:13:51] \o/
[09:13:54] elukey: trying to redeploy
[09:14:00] currently the druid host is set as aqs01 IIRC
[09:14:15] I just put it as placeholder
[09:14:20] so let me know if you need another one
[09:14:48] elukey: we should remove druid config (comment it with a comment) altogether - No druid in deployment
[09:15:02] elukey: reenabling puppet agent on aqs03
[09:24:29] elukey: deployment still failed
[09:24:32] so atm the druid config is mandatory since it is taken via hiera
[09:24:33] mwarf
[09:24:42] all right let's discuss this later on :)
[09:24:44] elukey: ok no prob for druid conf
[09:26:14] elukey: How do we proceed - Shall I disable puppet agent and check for logs on aqs03?
[09:27:32] Analytics, Research, WMDE-Analytics-Engineering, User-Addshore: dbstore1002 (analytics store) enwiki lag due to blocking query - https://phabricator.wikimedia.org/T175790#3699000 (jcrespo) Open>stalled
[09:27:39] so the logs should be available in https://logstash-beta.wmflabs.org
[09:27:45] but maybe the config does not point to it
[09:27:47] checking
[09:28:08] it points to logstash2.deployment-prep.eqiad.wmflabs
[09:28:47] that doesn't exist
[09:28:48] ahahah
[09:28:50] elukey: I didn't know we had logstash for beta :)
[09:29:00] actually it seems to work
[09:29:09] not for aqs
[09:29:14] yes yes
[09:29:20] Just looked for
[09:29:21] fixing puppet in horizon
[09:29:26] and we have logs
[09:29:32] elukey: there are logs
[09:30:05] elukey: disabling puppet and adding info logs on aqs03
[09:30:13] joal: wait a sec
[09:30:17] sure
[09:32:03] there you go, aqs03 is logging to logstash \o/
[09:32:10] elukey: it was already !
[09:32:23] elukey: I tried to tell you :)
[09:32:26] there were logs
[09:32:41] elukey: But we need more detailed ones
[09:33:11] it is not possible, the logstash host was wrong
[09:33:21] elukey: I let you look into logstash
[09:33:31] were you able to see "Error: Invalid modules definition {"/":[{"path":"projects/aqs_default.yaml" etc.. ?
[09:34:25] joal: --^
[09:34:28] Nope
[09:34:35] ahhh okok
[09:35:28] all right all aqs nodes configured to log properly on logstash
[09:36:09] last message I get in logstash is: message not supplied
[09:37:25] but there is an err_stack no?
[09:38:33] Ah elukey !
I understand now :)
[09:38:44] Thanks for helping me - I'm slow this morning :)
[09:39:58] elukey: I think the refactor implies some puppet refactor - double checking
[09:40:30] * elukey waits :D
[09:40:48] elukey: example: https://gerrit.wikimedia.org/r/#/c/384590/3/config.example.wikimedia.yaml
[09:41:33] elukey: in the config file, we'll need to remove the '/:', decrement indentation, and add backend = cassandra in table module
[09:42:02] but not in prod :P
[09:42:12] so I can file a code review and apply it on the labs puppet master
[09:42:24] and then we'll need to do it in prod before the next deployment
[09:42:29] otherwise everything will fall apart
[09:42:31] ahahahha
[09:46:27] elukey: indeed !!!!
[09:46:57] joal: do you want to modify the puppet config code and file a code review?
[09:47:00] * joal feels bad about not having thought of that
[09:47:10] elukey: I'll do that yes
[09:47:25] otherwise I can do it
[09:47:32] elukey: I can do
[09:47:41] super
[09:47:42] elukey: rolling back deployment-aqs03
[09:48:01] this time it was really good that we tried the deployment in labs fiest
[09:48:04] *first
[09:48:17] elukey: yes, it was ! the patch was too big
[09:52:55] By the way elukey - I see that druid config in .erb is only rendered if druid_host is set - For labs, couldn't we set druid_host to empty?
[09:55:48] elukey: Path submitted
[09:56:02] *patch
[09:56:22] joal: yes that one is the aqs module, that is then configured in the profile.. the profile requires hiera druid parameters :(
[09:56:53] elukey: We could fill in fake druid params, and them not being rendered because of the if in template?
[09:57:42] joal: is there a problem if we leave fake druid config in there?
[09:57:54] nope, it'll just fail trying to connect to it in case
[09:58:00] okok
[10:03:48] joal: config updated in labs, you can retry the deploy
[10:04:17] elukey: I'm so glad you're helping me :)
[10:04:58] deploying again in beta
[10:04:58] joal: :)
[10:06:01] elukey: SUCCESS on aqs03!
[10:06:09] elukey: conf is changed on all 3 nodes of beta?
[10:10:24] yep!
[10:10:31] ok, continuing deployment
[10:11:29] deployment successful on beta :)
[10:12:55] \o/
[10:13:13] * joal bows to elukey for his patience and support :)
[10:14:25] * elukey was basically the cause of aqs not working so he should not be cheered :D
[10:14:33] elukey: when requesting a druid-backed endpoint, it just fails with no-route
[10:15:20] This is good :)
[10:15:41] elukey: I suggest we wait for monday morning to deploy that onto prod cluster )
[10:15:47]
[10:16:04] +2
[10:16:14] great
[10:17:04] I think I'm gonna stay away from important code today - I'm too tired and make too many mistakes :)
[10:18:46] joal: question for you if you have time - I am exploring the possibility to use graphite rather than prometheus for Druid (at least in the beginning to have metrics) but it'd require http://druid.io/docs/0.9.2/development/extensions-contrib/graphite.html
[10:19:45] I'd really love to have metrics *now* rather than when the prometheus poller will be ready
[10:20:19] or even http://druid.io/docs/0.9.2/development/extensions-contrib/statsd.html
[10:33:18] Analytics, DBA: Access to x1 broken on stat1006 - https://phabricator.wikimedia.org/T178237#3699150 (jcrespo) duplicate>Resolved a:jcrespo This was resolved on T175970.
[10:51:05] elukey: graphite extension seems fine, no ?
[10:51:40] joal: we use statsd to aggregate metrics before graphite afaik, but I'll investigate
[10:51:58] are those extensions easily deployable ?
[10:52:13] I guess those are simply a new jar to add but I might be wrong
[11:04:30] * elukey lunch!
[11:37:55] joal heya, I deployed refinery yesterday, but did not restart the uniques job (on purpose) I wondered if some druid cleaning needs to be done before restart?
[11:38:08] I saw the dataset name changed to underscore
[11:40:07] (PS1) GoranSMilovanovic: October 20 2017 - naming conventions [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/385354
[11:40:27] (CR) GoranSMilovanovic: [V: 2 C: 2] October 20 2017 - naming conventions [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/385354 (owner: GoranSMilovanovic)
[11:40:33] (Merged) jenkins-bot: October 20 2017 - naming conventions [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/385354 (owner: GoranSMilovanovic)
[12:36:23] heya mforns
[12:36:38] joal, hellooo
[12:37:40] mforns: we indeed need to clean the old datasource, as well as restarting the new jobs from the beginning of time
[12:38:48] mforns: Crap ! I thought the mediawiki-history-reduced job was merged, it was not :(
[12:39:05] merging now
[12:39:19] joal, I did not deploy source
[12:39:32] mforns: It's not on source
[12:39:51] oh
[12:41:04] no worries, we didn't want to start production job anyway
[12:41:20] mforns: About uniques, 3 datasources are involved
[12:42:13] joal, should we redeploy today? it's friday..
[12:46:59] mforns: no need to redeploy
[12:47:18] joal, should I rollback refinery?
[12:59:33] mforns: no no really
[13:00:24] mforns: uniques code is ready to go, mw-h-reduced is not yet to be started so no big deal :)
[13:13:23] mforns: taking a break, will be back soon
[13:13:34] ok ok
[14:34:06] mforns: I'm back !
[14:43:58] hey joal :]
[14:44:39] mforns: anything you need help with?
[14:45:02] eeemmmmmmmm, not for now, but I will
[14:49:42] Analytics, Patch-For-Review, User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3699743 (elukey) Today I had a great chat with @fgiunchedi and I came up with the following simple script to handle HTTP POSTs with custom code and regular H...
[16:55:05] * elukey off!
[20:42:52] yo peeps !
[20:43:10] question.
anyone got a clue how we count action=raw requests ?
[20:43:21] as in, are those pageviews ?
[20:43:56] thedj: action=raw for edit requests?
[20:44:09] thedj: or can you give an example of a url?
[20:44:28] https://en.wikipedia.org/wiki/User:TheDJ?action=raw
[20:45:40] that's how previews done with navigation popups happen (it has its own parser and renderer.. cause party like it's 2003), and i'm switching them over to api endpoint
[20:45:46] thedj: i see, there is no distinction between that and https://en.wikipedia.org/wiki/User:TheDJ
[20:46:01] then you might see pageviews dip :)
[20:46:48] thedj: sure but that has little to do with the "raw" on the url does it?
[20:48:55] thedj: the popup requests get sent via restbase
[20:49:05] thedj: like "https://es.wikipedia.org/api/rest_v1/page/summary/SomePage"
[20:49:08] navigation popups, not page preview popups.
[20:49:28] like the gadget that predates it and still has 40000 users
[20:49:34] thedj: ah sorry, what are navigation popups
[20:49:36] minimum
[20:49:48] https://en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popups
[20:50:25] thedj: is that extension configured on enwiki/eswiki/dewiki?
[20:50:39] it's not an extension. it's a gadget
[20:51:12] thedj: sorry, is the gadget configured in any of those wikis or is it configured client side?
[20:51:26] it's configured on all of those yes
[20:51:49] thedj: so how do users enable it?
[20:51:51] https://en.wikipedia.org/wiki/MediaWiki:Gadget-popups.js
[20:51:57] https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-gadgets
[20:52:16] thedj: ok, so only available to users that know about preferences
[20:52:23] English wiki alone: 47,618 users
[20:52:34] thedj: ya, that is quite small
[20:52:49] thedj: over what timep eriod?
[20:52:55] thedj: over what time period?
[20:53:08] thedj: or users absolute number?
[20:53:25] that ever turned on the gadget?
[20:53:44] currently turned on, but not known how many of those users are active.
[20:54:00] thedj: ok then pageviews will see no effect, you can be reassured
[20:54:38] thedj: the variation in pageviews on a daily basis is higher than what those 40,000 users will produce even if they all used the extension that same day
[20:54:59] thedj: makes sense?
[20:55:02] sure
[20:55:05] good to know.
[20:55:33] thedj: take a look at pageviews daily: https://analytics.wikimedia.org/dashboards/vital-signs/#projects=eswiki,itwiki,enwiki,jawiki,dewiki,ruwiki,frwiki/metrics=Pageviews
[20:55:47] thedj: for the wikis with most traffic
[20:56:30] thedj: variations of eswiki for example from 1 day to next are in the millions: https://analytics.wikimedia.org/dashboards/vital-signs/#projects=eswiki/metrics=Pageviews
[22:46:00] (PS1) GoranSMilovanovic: Minor Oct 21 2017 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/385485
[22:46:18] (CR) GoranSMilovanovic: [V: 2 C: 2] Minor Oct 21 2017 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/385485 (owner: GoranSMilovanovic)
[22:58:40] (PS1) GoranSMilovanovic: WDCM_SemanticsDashboard - Skeleton 21 Oct 2017 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/385486
[22:58:57] (CR) GoranSMilovanovic: [V: 2 C: 2] WDCM_SemanticsDashboard - Skeleton 21 Oct 2017 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/385486 (owner: GoranSMilovanovic)