[00:14:04] <wikibugs>	 Analytics: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2521327 (BBlack) It doesn't make much implementation difference whether we scope this for one, several, or all of our second-level domains.  If anything, doing it for all of them at once is p...
[06:08:12] <wikibugs>	 Analytics: 2016-06-02 hour 14 file missing? - https://phabricator.wikimedia.org/T142052#2521047 (JAllemandou) webrequest data get's deleted in a rolling fashion after 2 month of existence, so it's normal that at the beginning of August, beginning of June starts being deleted.
[06:08:27] <wikibugs>	 Analytics: 2016-06-02 hour 14 file missing? - https://phabricator.wikimedia.org/T142052#2521530 (JAllemandou) Open>Invalid
[06:08:37] <joal>	 hi elukey
[06:45:41] <grrrit-wm>	 (CR) Joal: "backfilling has been tested and works: Since the jobs generates segments for periods, generated segments override the existing ones." (9 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/298131 (https://phabricator.wikimedia.org/T138264) (owner: Joal)
[06:46:41] <grrrit-wm>	 (PS9) Joal: Add Druid loading oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/298131 (https://phabricator.wikimedia.org/T138264)
[07:24:27] <elukey>	 joal: o/
[07:25:10] <elukey>	 argh 1004-b is still working :/
[07:25:24] <joal>	 elukey: aqs1004-a is done compacting, but b is still doing
[07:25:54] <elukey>	 yes I just seen, less than 200 files left so it might complete this morning
[07:30:10] <elukey>	 good news is that I have https://phabricator.wikimedia.org/P3633 to create proper users for cassandra
[07:30:18] <elukey>	 I am going to open a phab task
[07:30:40] <elukey>	 I think that I should have opened one for the re-image
[07:30:50] <elukey>	 so I'll probably create sub-tasks to the main one
[07:30:55] <elukey>	 so we can track this work
[07:31:23] <elukey>	 with user's check to local one we might get even more perf with the actual cluster
[07:32:30] <joal>	 elukey: I think that's right :)
[08:05:33] <wikibugs>	 Analytics-Kanban: Improve user management for AQS - https://phabricator.wikimedia.org/T142073#2521675 (elukey)
[08:13:16] <wikibugs>	 Analytics-Kanban: Replace RAID0 arrays with RAID10 on aqs100[456] - https://phabricator.wikimedia.org/T142075#2521730 (elukey) Already started with aqs1004, instance a has completed meanwhile instance b is still getting data from other ones. @JAllemandou checked on instance a that data schemas were correct a...
[08:37:37] <grrrit-wm>	 (Abandoned) Mforns: [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) (owner: Mforns)
[09:31:34] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: Connection refused
[09:31:45] <elukey>	 niceeeeeee
[09:31:50] <elukey>	 alarms in here too
[09:31:53] <elukey>	 it works
[09:31:56] <joal>	 :)
[09:32:08] <joal>	 elukey: but alarm possibly means wrong, no?
[09:34:41] <joal>	 elukey: Ah, because of bootstrapping, nothing wrong in fact, correct?
[09:37:24] <elukey>	 yeah yesterday I've set 24hours of downtime
[09:37:32] <elukey>	 and forgot to renew
[09:37:34] <elukey>	 :/
[09:44:10] <joal>	 arf
[10:20:43] <grrrit-wm>	 (CR) Mforns: [WIP] Refactor Mediawiki History scala code (40 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301837 (https://phabricator.wikimedia.org/T141548) (owner: Joal)
[10:21:45] <elukey>	 comeeee onnn cassandraaaaa
[10:57:36] <joal>	 taking a break ateam
[11:06:53] <elukey>	 Receiving 1947 files, 275009995979 bytes total. Already received 1926 files
[11:06:56] <elukey>	 ALMOST :)
[11:12:26] * elukey lunch!
[11:34:04] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.008 second response time on port 9042
[11:42:31] <elukey>	 woooo
[11:43:19] <elukey>	 all right aqs1004-b up and running
[11:44:08] <elukey>	 joal: let me know if you want to check aqs1004-b or if I can proceed with 1005
[12:07:30] <mforns>	 joal, yt?
[13:09:20] <ottomata>	 moornnninnn!
[13:09:33] <ottomata>	 elukey:  refinery scap in a bit?  (lemme  make some coffee...)
[13:10:01] <elukey>	 ottomata: morningggg
[13:10:10] <elukey>	 I already backupped the refinery dir on tin
[13:10:14] <elukey>	 in my home dir
[13:10:22] <elukey>	 so we should be good :)
[13:10:42] <elukey>	 ping me when you are ready
[13:17:45] <elukey>	 also backupped the dir on stat1002 as extra precation
[13:18:08] <elukey>	 I'd say that we could disable puppet on stat1003 and analytics1027
[13:18:34] <elukey>	 (stat1004 sorry)
[13:19:08] <elukey>	 and then merge https://gerrit.wikimedia.org/r/#/c/299719
[13:24:52] <ottomata>	 lookin
[13:25:43] <elukey>	 need to amend the commit message
[13:26:48] <elukey>	 done
[13:27:02] <ottomata>	 elukey:  having some thoughts about the deploy_user name...i think its not too late, is it?
[13:27:13] <ottomata>	 that user hasn't been created anywhere, and its not tied to the ssh key
[13:27:16] <ottomata>	 right?
[13:27:52] <ottomata>	 am thikning that this 'analytics-deploy'  might be more useful than for just owning refinery files
[13:28:09] <ottomata>	 right now analytics prod jobs are running everything as hdfs, whcih is hadoop super user
[13:28:28] <ottomata>	 maybe this deploy_user could also eventually be used for other things?
[13:28:34] <ottomata>	 like launching hadoop jobs, etc.
[13:28:48] <ottomata>	 if so, then we should think of a slightly better name than 'analytics-deploy', eh?
[13:28:53] <ottomata>	 that is...if it isn't too late already :p
[13:29:27] <elukey>	 well I'd prefer to keep things separated, but if you think that it would be best we can change it.. Problem is, the keyholder ssh private key name is based on that username :D
[13:30:00] <elukey>	 because if you don't specify it the keyholder will check a priv key named analytics_deploy in the private repo
[13:30:23] <elukey>	 (do you remember the discussion around dumpsdeploy etc..)
[13:31:12] <joal>	 Hey guys, I'm back
[13:31:17] <ottomata>	 hm, i see but key_name is configurable! :)
[13:31:25] <ottomata>	 and analytics-deploy is a fine key name
[13:31:47] <ottomata>	 elukey:  not strongly opinionated about this, just thought i'd mention it
[13:31:51] <elukey>	 I know but you suggested to me the removal of the key_name :D
[13:32:05] <ottomata>	 ja to make it match...:p
[13:32:09] <elukey>	 :P
[13:32:14] <joal>	 elukey: still many compaction ongoing on aqs1004-b
[13:32:33] <ottomata>	 elukey:  and that key is already in keyholder, right?
[13:32:49] <elukey>	 joal: yeah :/
[13:32:56] <elukey>	 ottomata: yep and also in the pwstore
[13:33:04] <elukey>	 (the passwordzzz)
[13:33:26] <elukey>	 already in the keyholder == in the private repo dir of the keyholder
[13:33:35] <ottomata>	 aye
[13:33:44] <elukey>	 Tyler told me that we'll able to arm the keyholder only after the merge
[13:33:44] <ottomata>	 ok, i think analytics-deploy is a fine key name, and it is an ok user name, if we don't plan on using it for other things too.
[13:33:56] <ottomata>	 elukey:  you think we shouldn't use this user for other things?
[13:34:09] <ottomata>	 like, owning files and running jobs in hdfs?
[13:34:17] <ottomata>	 s/hdfs/hadoop
[13:34:40] <ottomata>	 (btw, sorry I didn't think of this earlier)
[13:34:51] <elukey>	 nono we are not in a hurry, no problem
[13:35:19] <elukey>	 I was trying to think if another user like analytics-runner or something more creative would be better
[13:35:45] <elukey>	 to run jobs and own files in hdfs
[13:35:46] <ottomata>	 well, we have a couple of users l ike this already for other teams
[13:35:52] <ottomata>	 analytics-search, analytics-wmde
[13:35:57] <ottomata>	 analytics-analytics :p jk
[13:36:10] <ottomata>	 could just be 'analytics'
[13:36:11] <ottomata>	 hm.
[13:37:07] <elukey>	 keeping the same user for many things could be handy but also a problem if things start to overlap
[13:38:00] <ottomata>	 ja but this seems like the type of thing you'd want to overlap..hmmm or is it?
[13:38:56] <ottomata>	 elukey:  i'm about 60% in favor of changing this name.  that's not much...maybe you are right.  we can always add another user later if /when we wanted to do something like that.
[13:38:59] <ottomata>	 hm
[13:39:05] <ottomata>	 joal:  what do you think?
[13:40:08] <elukey>	 we'll be able to use group permissions to use analytics-deploy's owned files (as we do not with root), so it might be good not to remember that a user has multiple functions through puppet
[13:40:13] <ottomata>	 elukey:  things can get confusing the more system users you have though.  so sometimes its nice to reuse them if the purposes are similiar
[13:40:22] <joal>	 ottomata, elukey : If it's just about moving hdfs user to something else more relevant, +1 :)
[13:40:23] <elukey>	 sure this is true
[13:40:43] <ottomata>	 joal:  it is sorta about that, but it is about if we should use the deploy user we are making to deploy refinery that user
[13:40:55] <ottomata>	 which informs the choice of name for that user
[13:41:07] <joal>	 ok I get it now
[13:41:08] <ottomata>	 not saying we switch to not using hdfs now, but if we choose a good username now, it will make more sense later.
[13:41:14] <ottomata>	 or, we could just make another user later
[13:41:50] <ottomata>	 hmmmmmmmmm
[13:41:56] <ottomata>	 thinking about puppet...
[13:42:06] <ottomata>	 mixing user purposes in puppet is def always a pain
[13:42:10] <ottomata>	 that is a good point elukey
[13:42:17] <joal>	 My view on it is: "super-power" users (like deployers etc) can be reusable by group of things they own (like analytics stuff for instance)
[13:42:49] <joal>	 I think having one user per task is aiming for perfect speration of concern, but very realistic :)
[13:42:55] <joal>	 ottomata, elukey --^
[13:43:12] <ottomata>	 joal:  so you are in favor of analytics-deploy for deployment, and something else later for running jobs, etc.
[13:43:27] <ottomata>	 elukey:  there's no reason why analytics-deploy can't own files in hdfs too, if we want that.
[13:43:42] <ottomata>	 we'll have to manage the user outside of scap and include it on namenodes, etc.
[13:43:45] <ottomata>	 but we can do that later
[13:43:45] <joal>	 I'd be for a single user for managing analytics operations, from deploy to jobs
[13:43:48] <ottomata>	 ah ok
[13:43:55] <ottomata>	 i'm slightly in favor of that too.
[13:44:12] <joal>	 BUT, if it's easier for puppet and so, what's best for you guys (you manage it)
[13:44:44] <ottomata>	 it takes just a little more puppet brainpower to make things work, but not really.  we'd have to create a new user to do that anyway...might as well be the one we use for deploy too.
[13:45:57] <ottomata>	 elukey:  it sounds like joal and I are in favor or reusing it, but only slightly opinionated.  how strongly opinionated are you? :p
[13:52:14] <elukey>	 ottomata: I think that we are a bit overthinking this, but I have nothing against creating a single username.. My puppet patch shouldn't change much, it is only a matter of not having manage_user = true in the scap config.. I'd need to check how the ssh keys will be provisioned, especially in relationship with the keyholder
[13:52:46] <elukey>	 if the change is minimal without making us crazy, I can work on it
[13:54:56] <ottomata>	 elukey:  it should be easy i think
[13:54:57] <ottomata>	 just do
[13:54:59] <ottomata>	 deploy_user =>
[13:55:02] <ottomata>	 'analytics'
[13:55:03] <ottomata>	 and
[13:55:19] <ottomata>	 key_name => 'analytics-deploy'
[13:55:29] <ottomata>	 you can keep manager_user true for now
[13:56:03] <elukey>	 ahhh ok it might work indeed
[13:56:51] <elukey>	 so manage_user will make sure that 'analytics' will allow ssh access with 'analytics-deploy'
[13:57:01] <elukey>	 creating the user and the ssh config
[13:57:05] <elukey>	 on the scap targets
[13:59:21] <ottomata>	 yup, it will put the analytics-deploy public key for the analytics user on the targets
[13:59:22] <ottomata>	 exactly
[13:59:51] <ottomata>	 folks in the analytics-admins group can access the private key in the armed keyholder, and then ssh + depploy as the analytics user
[13:59:51] <elukey>	 all right looks good
[14:06:39] <elukey>	 ottomata: https://gerrit.wikimedia.org/r/#/c/299719
[14:08:12] <ottomata>	 elukey:  i think you need key_name on the scap::target
[14:08:19] <ottomata>	 key_name => 'analytics-deploy'
[14:11:44] <elukey>	 ah right since the username is different
[14:11:46] <elukey>	 forgot that
[14:11:56] <elukey>	 should be analytics_deploy though
[14:11:57] <elukey>	 right?
[14:12:17] <ottomata>	 did you name it that?  in keyholder_agents you have it as analytics-deploy
[14:12:21] <ottomata>	 in scap/server.yaml
[14:12:22] <ottomata>	 right?
[14:12:41] <ottomata>	 oh but key_name_safe?
[14:12:44] <ottomata>	 is that what you are looking at?
[14:12:49] <ottomata>	 oh no
[14:12:50] <ottomata>	 it sfine
[14:12:53] <ottomata>	             $key_name_safe = regsubst($key_name, '\W', '_', 'G')
[14:13:11] <elukey>	 well we might just add it with _ for clarity
[14:13:13] <elukey>	 doesn't hurt
[14:13:24] <ottomata>	 well, how does that interact with what you have in keyholder_agents?
[14:13:43] <ottomata>	 ah its made into key_name_safe ther etoo
[14:13:55] <ottomata>	 elukey:  either way is fine, but let's make it the same in both places
[14:14:04] <elukey>	 okok I'll use analytics-deploy
[14:14:18] <ottomata>	 non, analytics_deploy is fine!  you just have to change it in scap/serer.yaml too
[14:14:19] <ottomata>	 so they match
[14:14:21] <ottomata>	 up to you :)
[14:14:45] <ottomata>	 might be less confusing to just explicitly name it with _, so that if someone is looking for it on the server they can grep it with the name they see in puppet
[14:15:14] <ottomata>	 elukey:  ^
[14:15:21] <elukey>	 +1
[14:16:39] <elukey>	 https://gerrit.wikimedia.org/r/#/c/299719/7
[14:16:43] <elukey>	 going to check with pcc
[14:18:44] <ottomata>	 elukey:  +2 looks good!
[14:20:19] <joal>	 ottomata: would you have a minute for me ?
[14:20:21] <ottomata>	 yup!
[14:20:23] <ottomata>	 wasssup?
[14:20:26] <joal>	 great, batcave?
[14:20:28] <ottomata>	 k
[14:27:41] <ottomata>	 elukey:  how's it look?
[14:29:47] <wikibugs>	 Analytics-Kanban, Patch-For-Review: Change or upgrade eventlogging kafka client used for consumption - https://phabricator.wikimedia.org/T133779#2522307 (Ottomata) kafka-python 1.2.5 has been running on deployment-eventlogging03 for the client side processor, and the all-events and mysql consumers for se...
[14:30:08] <elukey>	 Tyler added some comments, and the pcc failed because of the private labs repo (I just updated it :)
[14:31:37] <ottomata>	 ah
[14:31:42] <ottomata>	 elukey:  ja i think we can drop service_name
[14:32:06] <elukey>	 ottomata: https://puppet-compiler.wmflabs.org/3605/
[14:32:48] <elukey>	 looks good afaics
[14:32:57] <elukey>	 (modulo service name changes)
[14:33:31] <ottomata>	 nice
[14:34:18] <elukey>	 going to update the scap config repo
[14:34:24] <ottomata>	 k
[14:36:16] <joal>	 elukey: we should ask urandom, but I wonder if you couldn't start re-bootstrapping another host even if compaction is not finished on aqs104
[14:36:41] <grrrit-wm>	 (PS1) Elukey: Replace the git and ssh users with 'analytics' [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/302926
[14:37:26] <elukey>	 ottomata: --^
[14:37:50] <elukey>	 joal: since we will drop two instances it might be not the best thing to do..
[14:37:55] <elukey>	 I was wondering the same
[14:38:01] <grrrit-wm>	 (CR) Ottomata: [C: 2 V: 2] Replace the git and ssh users with 'analytics' [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/302926 (owner: Elukey)
[14:38:37] <joal>	 elukey: I don't get the 2 instances thing
[14:39:21] <elukey>	 I am re-imaging the whole host
[14:39:35] <elukey>	 because I'll let partman re-do the arrays
[14:39:42] <elukey>	 otherwise I'll need to do it manually
[14:39:50] <joal>	 right, why would having 2 bootstrapping hosts at the same time be more difficult for 4-b compaction?
[14:40:27] <elukey>	 not sure if 4b would serve anyway to the two bootstrapping instance or not
[14:40:34] <elukey>	 I am not saying that it would be bad
[14:40:40] <elukey>	 I was just reasoning out loud
[14:40:47] <elukey>	 I have no idea what cassandra does in these cases
[14:41:03] <joal>	 right: this actually my question: aqs1004-b is now said to be ready for usage
[14:41:20] <joal>	 So my guess is that it can handle it's compaction thing while doing other stuff as well
[14:41:28] <elukey>	 yeah this would make sense
[14:41:33] <joal>	 But having urandom's approval is better :)
[14:44:17] <elukey>	 I think we should ask if we have budget for urandom's beers
[14:44:34] <joal>	 ottomata: I have a doubt here - Shall I remove the rsync, or keep it in case of rebuild-after-issue?
[14:44:45] <elukey>	 we owe him already too many favors :D
[14:45:04] <joal>	 elukey: correct, I can pay from my pocket ;)
[14:45:07] <ottomata>	 naw let's remove it
[14:45:08] <ottomata>	 joal:
[14:45:11] <joal>	 ok ottomata
[14:45:24] <ottomata>	 the rsync is only going to sync new data anyway
[14:46:04] <joal>	 ottomata: That was my point: Since no data gets synced, very cheap to have it - and in case of failure and full-rebuild needed, no manual copy
[14:46:53] <joal>	 but no problem for me to remove it ottomata
[14:47:11] <ottomata>	 haha, i guess, but not sure how we could ever fail joal, together we are invicible
[14:47:30] <joal>	 huhu ottomata :)
[14:47:31] <ottomata>	 i think this is pretty simiple, no?  we are really just rremoving cron jobs, and editing an html template, ja?
[14:47:42] <ottomata>	 there's no danger of deleting data
[14:47:43] <joal>	 ottomata: yes, simple enough
[14:47:56] <joal>	 ottomata: was more thinking of host lost or something
[14:48:14] <joal>	 ottomata: But with Raid10, no issue ;)
[14:48:19] <ottomata>	 oh you mean leaving it there for good?
[14:48:26] <ottomata>	 the cron?
[14:48:27] <joal>	 ottomata: correct
[14:48:29] <ottomata>	 oh
[14:48:29] <ottomata>	 hm.
[14:48:43] <ottomata>	 i dunno, i think we shoudl get rid of it.  are we planning on keeping the data on stat1002?
[14:49:01] <joal>	 ottomata: not sure ... On dumps for sure
[14:49:04] <ottomata>	 ja
[14:49:18] <ottomata>	 joal:  i think we should get rid of it, less confusing useless stuff on dumps
[14:49:28] <joal>	 ottomata: Okey sir, doing !
[14:50:56] <ottomata>	 k!
[14:52:16] <ottomata>	 elukey:  this does not matter, just an fyi.  if you set scap_repository: true
[14:52:21] <ottomata>	 analytics/refinery/scap would be the default
[14:52:41] <ottomata>	 on scap::source
[14:52:51] <ottomata>	 explicit is cool too
[14:53:48] <elukey>	 ah okok!
[14:54:30] <elukey>	 ottomata: checking pcc and then I'd say we can merge
[14:54:43] <elukey>	 Tyler is also around so it will be easier if we have questions
[14:54:56] <wikibugs>	 Analytics: 2016-06-02 hour 14 file missing? - https://phabricator.wikimedia.org/T142052#2522451 (dr0ptp4kt) Ah, right. Thanks, @JAllemandou, I was thinking about the 90 day window, but not the 60 day one. @Jdlrobson, note well when querying wmf.webrequests. I suppose one can do a `select min(day) from webreq...
[14:55:18] <elukey>	 https://puppet-compiler.wmflabs.org/3607/
[14:55:22] <elukey>	 all good
[14:55:33] <ottomata>	 k cool
[14:55:36] <ottomata>	 ok so.
[14:55:42] <ottomata>	 what's the plan?
[14:56:36] <elukey>	 just disabled puppet on stat1004 and analytics1027
[14:56:49] <elukey>	 I have backups of /srv/../refinery on both tin and stat1002
[14:56:54] <elukey>	 so the next step is to merge https://gerrit.wikimedia.org/r/#/c/299719/
[14:56:58] <elukey>	 run puppet
[14:57:01] <elukey>	 arm the keyholder
[14:57:04] <elukey>	 and test a deploy
[14:57:24] <elukey>	 does it sound good?
[14:59:19] <joal>	 ottomata: can we use a dash in a puppet class name?
[15:00:26] <ottomata>	 joal:  nawww
[15:00:32] <ottomata>	 you can use underscores but we try to avoid it
[15:00:33] <joal>	 underscore ?
[15:00:47] <ottomata>	 might be ok in this case, since you are doing pageview_all_sites?
[15:00:49] <ottomata>	 whatcha doing?
[15:00:52] <joal>	 ottomata: renaming pagecounts to pagecounts_ez for clarity
[15:01:00] <ottomata>	 joal:  i think that's fine
[15:01:07] <joal>	 ok, thanks :)
[15:01:12] <ottomata>	 or, will there also be a pagecounts.pp file/ class?
[15:01:22] <ottomata>	 you could also subdir it
[15:01:29] <ottomata>	 pagecounts/ez.pp pagecounts::ez
[15:01:30] <ottomata>	 but
[15:01:34] <ottomata>	 pagecounts_ez is fine
[15:01:40] <ottomata>	 elukey:  sounds good!
[15:01:48] * elukey merges
[15:02:51] <ottomata>	 elukey:  i'm going to remove some old tmp dirs in /srv/deployment/analytics on an27
[15:02:57] <ottomata>	 refinery.bonkers
[15:02:57] <ottomata>	 refinery.git-fat.dumb
[15:02:58] <ottomata>	 :p
[15:03:08] <elukey>	 hahahahahahaha
[15:03:42] <ottomata>	 done
[15:19:52] <ottomata>	 joal:  looks good, one nit, and i wrote down post merge steps for mein the review
[15:26:36] <wikibugs>	 Analytics: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2522554 (Nuria) >It doesn't make much implementation difference whether we scope this for one, several, or all of our second-level domains. If anything, doing it for all of them at once is pr...
[15:31:35] <elukey>	 ottomata: tin + keyholder deploy went fine, but stat1002 failed since there is not scap package anymore via apt.. Tyler is going to update the repo with the new version (and puppet) in today's swat
[15:32:05] <elukey>	 so atm we are a bit broken
[15:32:12] <elukey>	 buuut we'll be back soon
[15:32:27] <ottomata>	 oh ha, ok!
[15:34:28] <grrrit-wm>	 (CR) Nuria: "We are thinking of adding unit tests to this code at some point, correct?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301837 (https://phabricator.wikimedia.org/T141548) (owner: Joal)
[15:34:57] <elukey>	 a-team: the refinery deployments are blocked for now! Will be back soon later on today
[15:35:37] <elukey>	 (hopefully with a brand new scap3)
[15:41:12] <wikibugs>	 Analytics-Kanban: Stop generating pagecounts-raw and pagecounts-all-sites - https://phabricator.wikimedia.org/T130656#2522571 (Nuria) Just to write this down somewhere.  After merge we need to do some manual steps: - remove files from dumps box(es?):   /etc/rsyncd.d/30-rsync-pagecounts.conf   /usr/local/bin/...
[15:50:20] <milimetric>	 woo scap3 :)
[15:59:16] <ottomata>	 be right there, running upstairs for power...
[16:00:02] <wikibugs>	 Analytics-Kanban: User History: Solve the fixed-point algorithm's long tail problem - https://phabricator.wikimedia.org/T139745#2522655 (Nuria) Open>Resolved
[16:02:32] <elukey>	 hangouts hates me
[16:02:37] <elukey>	 I am joining
[16:08:36] <wikibugs>	 Analytics, Pageviews-API: Improve pageviews error messages on invalid project - https://phabricator.wikimedia.org/T129899#2522682 (Nuria)
[16:08:38] <wikibugs>	 Analytics-Kanban, Patch-For-Review: lowercase project parameter - https://phabricator.wikimedia.org/T136016#2522681 (Nuria) Open>Resolved
[16:08:57] <wikibugs>	 Analytics-Kanban: User history: rewrite the user history script to use the new algorithm - https://phabricator.wikimedia.org/T141468#2522683 (Nuria) Open>Resolved
[16:08:59] <wikibugs>	 Analytics-Kanban: Scale MySQL edit history reconstruction data extraction - https://phabricator.wikimedia.org/T134791#2522684 (Nuria)
[16:09:49] <wikibugs>	 Analytics-Kanban, Patch-For-Review: Extract edit oriented data from MySQL for simplewiki - https://phabricator.wikimedia.org/T134790#2522687 (Nuria)
[16:09:51] <wikibugs>	 Analytics-Kanban: Page History: write scala for page history reconstruction algorithm - https://phabricator.wikimedia.org/T138853#2522686 (Nuria) Open>Resolved
[16:10:04] <wikibugs>	 Analytics-Kanban: Scale MySQL edit history reconstruction data extraction - https://phabricator.wikimedia.org/T134791#2277164 (Nuria)
[16:10:06] <wikibugs>	 Analytics-Kanban: Use scalable algo on enwiki - https://phabricator.wikimedia.org/T141778#2522691 (Nuria) Open>Resolved
[16:15:31] <wikibugs>	 Analytics: 2016-06-02 hour 14 file missing? - https://phabricator.wikimedia.org/T142052#2522704 (JAllemandou) @dr0ptp4kt You can indeed look for the last day available for a given month. It's also a good idea indeed to keep some margin not be using data that'll be deleted during query.
[16:27:05] <wikibugs>	 Analytics, Operations, Ops-Access-Requests: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2522756 (Nuria)
[16:32:44] <elukey>	 urandom: I am re-imagining one host at the time in aqs100[456] to deploy RAID10 for the cassandra instances partitions, and I just finished aqs1004. One instance (aqs1004-b) is up in the ring but heavily compacting. Should I proceed with 1005 (so bringing down two instances) or should I wait?
[16:34:28] <urandom>	 elukey: so, what is the full process here?
[16:34:47] <elukey>	 we are trying to keep the 4 months of data
[16:34:52] <urandom>	 you decomm'd 1004-{a,b}?
[16:34:58] <elukey>	 while we replace the raid arrays
[16:34:59] <urandom>	 nodetool-a decommission, etc?
[16:35:34] <urandom>	 elukey: how did you reimage?
[16:37:08] <elukey>	 ah no I just bring down the instances, then start them with -Dcassandra.replace_address
[16:37:23] <elukey>	 and the reimage is brutal restart with PXE boot
[16:37:34] * urandom is confused
[16:43:30] <elukey>	 I know
[16:43:38] <elukey>	 it is my fault
[16:43:46] <elukey>	 so, I'll try to be a bit more verbose
[16:45:30] <elukey>	 aqs100[456] have only raid0 arrays for all the cassandra instances, and we decided to move to raid10 recently. Since it would be great to use partman without doing mdadm operations, I wanted to re-install the os via PXE triggering a brand new fresh install on each host. The original thought was to just wipe the cluster and re-install.
[16:45:50] <elukey>	 but we thought to try to keep the current data loaded (4 months of page views)
[16:46:01] <elukey>	 so I followed this procedure
[16:47:08] <elukey>	 1) re-install the os on one host, therefore bringing down two instances from the cluster. When the machine is up and running, I start each instance with -Dcassandra.replace_address (adding it manually to the related /etc/cassandra.. file)
[16:47:28] <elukey>	 2) I wait until the instances are back to normal service rather than JOINING
[16:47:35] <elukey>	 3) I proceed with another host from 1)
[16:47:48] <elukey>	 urandom: hope that this is clearer
[16:49:18] <urandom>	 so the existing data is wiped?
[16:50:01] <elukey>	 yeah and it gets re-streamed from the cluster
[16:50:08] <elukey>	 to the new raid10 arrays
[16:50:16] <elukey>	 taking a bit of course
[16:50:50] <elukey>	 not sure if this procedure is correct but I was wondering if an instance heavily compacting should prevent me to proceed
[16:50:56] <urandom>	 ok, just so you know, it's possible that that could violate consistency constraints
[16:51:09] <elukey>	 ah snap
[16:51:14] <elukey>	 why??
[16:51:19] <urandom>	 the probability in your environment might be low though
[16:51:41] <urandom>	 you guys aren't overwriting and/or deleting
[16:52:00] <urandom>	 you guys write at consistency level one?
[16:52:29] <elukey>	 nope quorum
[16:52:34] <elukey>	 we read at one
[16:53:44] <urandom>	 ok, so the TL;DR is, that with replication factor of 3, and quorum, your consistency guarantees are based on the notion that 2 of a potential 3 copies are sane at the time of a write
[16:54:01] <urandom>	 wiping means the loss of one replica for everything those nodes stored
[16:54:14] <urandom>	 reducing that to (potentially) 1 in all cases
[16:54:20] <elukey>	 ahh okok
[16:54:27] <wikibugs>	 Analytics, Pageviews-API: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2522888 (Nuria) Let's start with outreachwiki, nl.wikimedia, ru.wikimedia  - Refactor the code on pageview definition that restricts counting  to certain urls   - Add wikis  to the w...
[16:54:33] <elukey>	 but we are not doing anything else at the moment
[16:54:37] <urandom>	 i gather you guys don't ever do overwrites or deletes
[16:54:40] <elukey>	 so no read/write traffic
[16:54:46] <elukey>	 yeah
[16:55:06] <elukey>	 not sure about overwrites but not deletes for sure
[16:55:08] <urandom>	 and providing there are no other failures before one of these operations succeeds, you'll probably be OK
[16:55:38] <elukey>	 I didn't decommission first the instances because I didn't want data re-shuffling
[16:55:45] <urandom>	 yup
[16:55:57] <elukey>	 ah ok so sometimes I get one thing right :P
[16:56:12] <wikibugs>	 Analytics, Pageviews-API: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2522894 (Milimetric)
[16:56:14] <wikibugs>	 Analytics: stats.grok.se doesn't offer mediawiki.org page view stats - https://phabricator.wikimedia.org/T111662#2522892 (Milimetric) duplicate>declined no longer valid, covered by the pageview API
[16:56:55] <urandom>	 elukey: so anyway, if one node is busy compacting it should still be OK to move on
[16:57:01] <urandom>	 so long as it's done streaming
[16:57:17] <elukey>	 oh yes yes
[16:58:24] <urandom>	 moving data around the cluster like this is fraught with so many corner cases and whatnot, that i generally avoid being creative
[16:59:19] <urandom>	 you might come up with a scenario where something makes sense in the current situation, and then later on down the road, the same thing might cause you dataloss
[16:59:40] <urandom>	 because something in AQS/RESTbase changed, or something in Cassandra changed
[16:59:42] <elukey>	 yes yes I wouldn't do it on a live use case (with read/writes going on)
[17:02:52] <elukey>	 ottomata: on stat1002.eqiad.wmnet returned [255]: Agent admitted failure to sign using the key.
[17:03:26] <wikibugs>	 Analytics, Analytics-Dashiki: Timeseries on browser reports broken when going back 18 months - https://phabricator.wikimedia.org/T141166#2522925 (Milimetric) two fixes:  * display "no data" when there's no data * fix calendars to not go back before a date configured in the wiki
[17:06:35] <wikibugs>	 Analytics, Analytics-Dashiki: Default date selection to currently applied date for browser reports - https://phabricator.wikimedia.org/T141165#2522938 (Milimetric)
[17:06:37] <wikibugs>	 Analytics, Analytics-Dashiki: Timeseries on browser reports broken when going back 18 months - https://phabricator.wikimedia.org/T141166#2522940 (Milimetric)
[17:14:02] <wikibugs>	 Analytics-Kanban: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2523004 (Milimetric)
[17:14:10] <wikibugs>	 Analytics-Dashiki, Analytics-Kanban: Default date selection to currently applied date for browser reports - https://phabricator.wikimedia.org/T141165#2489086 (Milimetric)
[17:15:59] <wikibugs>	 Analytics-Dashiki, Analytics-Kanban: Default date selection to currently applied date for browser reports - https://phabricator.wikimedia.org/T141165#2523021 (Milimetric) a few simple fixes done together:  * display "no data" when there's no data * fix calendars to not go back before a date configured in...
[17:26:20] <wikibugs>	 Analytics-Kanban: Replace RAID0 arrays with RAID10 on aqs100[456] - https://phabricator.wikimedia.org/T142075#2521712 (Nuria) - Reimage the hosts   - Restart the cassandra instances, they will not have the data but they will ask other instances for data (this process is happening on each host separately)  It...
[17:28:59] <elukey>	 !log deploying the refinery with scap3 for the first time on all nodes
[17:29:02] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[17:36:24] <elukey>	 !log added the analytics-deploy key to the Keyholder for the Analytics Refinery scap3 migration (also updated https://wikitech.wikimedia.org/wiki/Keyholder)
[17:36:25] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[17:40:55] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2523151 (Nuria) @Nemo_bis , @Tbayer  Theory: Looking at data the spike is present on Chrome requests but not on IE, there was a chrome update july 20th.  Spike o...
[17:42:31] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2523165 (Nuria) {F4336854}  {F4336856}
[17:54:08] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2523206 (Milimetric) To follow up on Nuria's theory, we broke down the version of Chrome and see that Chrome 41 is almost solely responsible for the increase:  {...
[17:59:03] <joal>	 dr0ptp4kt: Hi
[17:59:16] <dr0ptp4kt>	 jdlrobson: hi joal !
[17:59:50] <joal>	 dr0ptp4kt: quick question: There are three long running hive queries on the cluster launched by you
[17:59:50] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2523218 (Milimetric) If I filter just Chrome 41 on desktop, and break down by countries, I see something interesting.  All countries appear to spike after July 2...
[18:00:03] <joal>	 dr0ptp4kt: Are you actually waiting for results of those?
[18:01:02] <elukey>	 a-team: going afk but me and andrew discovered an interesting behavior of the new refinery. We got some space problems on analytics1027, since scap keeps multiple revisions of the repo and switches the symlinks to them.
[18:01:15] <elukey>	 Andrew cleaned up some /tmp files freeing space
[18:01:32] <elukey>	 but please be careful before deploying :)
[18:01:51] <joal>	 dr0ptp4kt: From what I see, you are scanning almost the 2 months of webrequest data (every source) - That represent about 70Tb of data scanned
[18:01:51] <elukey>	 I am going afk but will double check later
[18:01:55] <joal>	 Bye elukey
[18:02:08] <elukey>	 o/
[18:02:17] <elukey>	 joal: I'll start the re-image first thing tomorrow
[18:02:23] <joal>	 awesopme :)
[18:02:46] <dr0ptp4kt>	 joal: yeah, i need those to run
[18:02:55] <dr0ptp4kt>	 joal: that said, am i blocking you on something?
[18:03:13] <dr0ptp4kt>	 joal: i'm in a meeting, sorry if delaying when responding, will ask person if i can pause if needed
[18:03:25] <joal>	 dr0ptp4kt: you drain user resources from the cluster, but that's ok, production job work
[18:04:09] <dr0ptp4kt>	 joal: thanks, lmk know if i need to back off. i can be a little more careful in the future
[18:04:37] <joal>	 dr0ptp4kt: I'd just suggest a few improvements in query: smaller dates range, not using upload webcache, or trying to bundle all queries result in one pass, to prevent scanning data multiple times
[18:07:16] <ottomata>	 joal:  for you, not urgent:
[18:07:17] <ottomata>	 https://gerrit.wikimedia.org/r/#/c/302954/
[18:07:38] <joal>	 ottomata: will do first thing tomorrow morning :)
[18:07:55] <ottomata>	 joal:  ok, let's do your pagecounts patch then too
[18:07:58] <ottomata>	 when i get online
[18:08:00] <joal>	 sou
[18:08:08] <joal>	 ok ottomata (even if friday?)
[18:08:13] <ottomata>	 ja it aint no thing
[18:08:20] <joal>	 k great
[18:08:32] <joal>	 moving off for today a-team
[18:08:46] <mforns>	 me, too, team
[18:08:52] <mforns>	 have a nice weekend, see you on monday!
[18:08:56] <ottomata>	 laters!
[18:15:07] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2501321 (mforns) Also, a quick note: I couldn't find any article among the top 10 other than Main Page where this happens. I searched for a short while, but enwi...
[18:31:46] <nuria_>	 a-team: back, sorry
[18:39:05] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2523394 (Nuria) >One possibility is that a bot running as Chrome 41 all of a sudden started being active from a lot of countries. This is unlikely a bot, it's pr...
[18:50:47] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2501321 (BBlack) @legoktm just pointed me here.  I've been investigating something almost certainly-related, but I wasn't considering that changes to our content...
[19:03:31] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2523548 (BBlack) All the anomalous stuff I'm looking at definitely points at Chrome/41.0.2272.76 on Windows (10, 8, 7), which is an old release.  The timing of M...
[19:06:26] <wikibugs>	 Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2523555 (BBlack) Copying in a couple of relevant links from the TLS ticket, re the bad MS update that seems to have at least partially broken TLS for some 3rd pa...
[19:25:51] <ottomata>	 nuria_:  no
[19:25:58] <ottomata>	 3 is a good default for pretty much all use cases
[19:26:02] <ottomata>	 that's why i'm setting here instead of in puppet
[19:26:06] <ottomata>	 so we don't have to set it for every single use
[19:26:08] <ottomata>	 in puppet
[19:26:14] <nuria_>	 ottomata: k
[19:30:01] <grrrit-wm>	 (PS2) Nuria: [WIP] Service Worker to cache locally AQS data [analytics/dashiki] - https://gerrit.wikimedia.org/r/302755 (https://phabricator.wikimedia.org/T138647)
[19:37:18] <grrrit-wm>	 (CR) Nuria: [C: 2 V: 2] "I think this is ready to be merged." [analytics/refinery] - https://gerrit.wikimedia.org/r/298131 (https://phabricator.wikimedia.org/T138264) (owner: Joal)
[19:50:31] <ottomata>	 !log now running kafka-python 1.2.5 for eventlogging-service-eventbus in codfw, removed downtime for kafka200[12]
[19:50:33] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[19:50:36] <ottomata>	 elukey:  fyi ^^
[19:51:01] <wikibugs>	 Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2523701 (Nuria) Does this select below seem ok? Note that:  1) there is no time information 2) is selecting for all projects for all content types   SELECT  uri_host, HASH(uri_path, uri_qu...
[20:31:33] <nuria_>	 ottomata or milimetric 1 fast question if you are there
[20:31:41] <milimetric>	 what's up nuria_
[20:31:55] <nuria_>	 milimetric: I am compiling teh datset for the cache folks,
[20:32:24] <nuria_>	 milimetric: I am going to just give them a file with 1 hour of data just to make sure it has what they want... etc
[20:32:39] <nuria_>	 milimetric: where is a good place for me to put this file on 1001/1003?
[20:32:43] <ottomata>	 ja
[20:32:59] <milimetric>	 nuria_: one sec lemme read up how we wanted to clean those folders
[20:32:59] <ottomata>	 nuria_:  they can log in or do you want to put it public?
[20:33:07] <milimetric>	 (public)
[20:33:43] <nuria_>	 milimetric: public is fine, is all caching data with hashed urls no timestamps
[20:33:47] <nuria_>	 milimetric: so no pii
[20:33:56] <nuria_>	 like:
[20:34:11] <nuria_>	 https://www.irccloud.com/pastebin/CJ7iQbmj/
[20:35:51] <milimetric>	 nuria_: if you put a new folder here: stat1003:/a/limn-public-data  it will end up here: https://datasets.wikimedia.org/limn-public-data/
[20:36:05] <nuria_>	 ok, i will do that
[20:36:11] <milimetric>	 it'll get moved properly when we do the cleanup, too
[20:36:34] <milimetric>	 so we might leave symlinks behind or just update people on where we moved it.  Ideally, put a README in that folder explaining what the data is.
[20:36:40] <milimetric>	 thanks nuria_
[20:36:45] <nuria_>	 milimetric: k
[20:43:25] <milimetric>	 nuria_: I know you'd probably get to this at some point, but it sounds like you need to approve the ISI people in their pageview research: https://phabricator.wikimedia.org/T141634
[20:43:32] <milimetric>	 https://phabricator.wikimedia.org/T141634#2518222
[20:45:14] <wikibugs>	 Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2523997 (Nuria) Actually select above is going to give much too much data, I am going to limit by content_type to html docs. Let me know if that is OK. Will give you a sample file for an h...
[20:57:09] <wikibugs>	 Analytics: Top API user agents stat - https://phabricator.wikimedia.org/T142139#2524050 (GWicke)
[20:57:18] <wikibugs>	 Analytics: Top API user agents stats - https://phabricator.wikimedia.org/T142139#2524063 (GWicke)
[21:04:18] <wikibugs>	 Analytics, MediaWiki-API, RESTBase-API, Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#2524099 (GWicke)
[21:07:39] <wikibugs>	 Analytics, MediaWiki-API, RESTBase-API, Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#2524121 (Nuria) That information exists for the php api  in the api tables : https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction  Data from api in webrequest is partial...
[21:12:36] <wikibugs>	 Analytics, MediaWiki-API, RESTBase-API, Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#2524130 (GWicke) @Nuria, we are interested in all requests, including cache hits. Requests recorded by backends like the PHP API would not include those.
[21:14:31] <wikibugs>	 Analytics, MediaWiki-API, RESTBase-API, Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#2524133 (Nuria) >We do have user agents in web request logs, but as far as I know we currently only expose UA stats aggregated across all requests.  A small clarification: we ex...
[23:44:14] <grrrit-wm>	 (PS1) Dereckson: Count tcy.wikipedia page views [analytics/refinery] - https://gerrit.wikimedia.org/r/303107 (https://phabricator.wikimedia.org/T140898)
[23:47:33] <grrrit-wm>	 (PS1) Dereckson: git-review configuration [analytics/refinery] - https://gerrit.wikimedia.org/r/303108