[08:36:38] <elukey>	 !log re-executed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2016-7-16 (failed oozie job)
[08:36:41] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:36:46] <elukey>	 joal: --^
[08:36:48] <elukey>	 o/
[08:38:02] <elukey>	 there was an error in one mapper afaiu
[08:38:05] <elukey>	 so I re-launched it
[08:38:52] <elukey>	 also, no more emails from oozie \o/
[08:41:08] <wikibugs>	 Analytics: Varnishkafka should auto-reconnect to abandoned VSM - https://phabricator.wikimedia.org/T138747#2469958 (elukey) p:Normal>Low
[08:42:55] <wikibugs>	 Analytics-Cluster, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2469960 (elukey) @Cmjohnson sorry for the late answer! Can we schedule maintenance for a couple of servers to see if it fixes the issue? These are part of the Hadoop cl...
[08:52:51] <joal>	 elukey: o/
[08:52:59] <joal>	 elukey: Thanks for relaunching the oozie job :)
[08:53:32] <elukey>	 joal: helloooo
[08:54:00] <joal>	 elukey: good weekend?
[08:54:02] <elukey>	 enjoyed time oof?
[08:54:05] <elukey>	 :)
[08:54:12] <joal>	 elukey: I did !
[08:54:35] <elukey>	 yes yes all good, lovely weather in here.. Hot but you can live and walk around without drinking liters of water
[08:54:59] <joal>	 :D
[08:56:04] <elukey>	 also, oozie errors are gone!
[08:56:14] <elukey>	 we can lower down the limit to 5%
[08:56:45] <joal>	 elukey: That is really great :)
[08:57:04] <elukey>	 for upload we'll probably need to tune a bit the parameters
[08:57:15] <joal>	 elukey: right, that makes sense
[08:57:16] <elukey>	 but I am sure that vk will not be a blocker
[08:57:38] <elukey>	 joal: can we move to done https://phabricator.wikimedia.org/T139493 ?
[08:58:13] <joal>	 elukey: woooooow, I completely forgot to deploy the other machines
[08:58:21] <joal>	 elukey: Deploying now, you can move the ticket
[08:58:28] <elukey>	 ahhh okok! nice!
[08:59:00] <joal>	 elukey: sorry, middle of last week have been a bit overwhelming for me :)
[08:59:48] <joal>	 !log deploy restabase on aqs100[23]
[08:59:50] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[09:29:41] <grrrit-wm>	 (CR) Joal: "Looks correct, has the query been tested in hive or spark-shell?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298723 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[09:32:41] <grrrit-wm>	 (CR) Joal: "Same question as in https://gerrit.wikimedia.org/r/#/c/298723/: Has the query been tested in hive or Spark-shell?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298724 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[09:33:07] <grrrit-wm>	 (CR) Addshore: "Query has been tested in hive" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298723 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[09:33:44] <addshore>	 morning joal!
[09:34:10] <grrrit-wm>	 (CR) Joal: "I wonder about the need for 3 CRs over the same piece of code. If you prepare future releases, adding reviewers in a timely fashion would " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298725 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[09:34:12] <joal>	 Hi addshore
[09:37:59] <addshore>	 So, the 3 patches are 1) the code that can be run to fill in the back dated stuff, and 2 + 3) the code to use the new stuff in the x_analytics header that is now merged and dpeloyed!
[09:39:27] <joal>	 right addshore
[09:40:24] <joal>	 so, patch should be annotated with that info, and possibly -1ed by you to make sure it's not merged :)
[09:41:01] <grrrit-wm>	 (CR) Addshore: [C: -1] "Not to be merged until data is backfilled" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298724 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[09:41:06] <addshore>	 yes! that totally makes sense...
[09:41:50] <joal>	 for the other 2, possibly bundling them into only one (if it makes sense) would be better, and then we'll wait for nuria_ if she has other comments (as she started commenting)
[09:42:33] <addshore>	 Yup, the reason for the 2 patches there were due to the things being added to the header seperatly, but they ended up getting merged very quickly one after the other!
[09:43:09] <joal>	 actually addshore, shouldn't your comment on the first patch say something like: Not to be merged, used as one-off for backfilling
[09:43:27] <joal>	 addshore: I hear you on the 2 patches for header
[09:43:57] <addshore>	 Well, I was going to ask you what would be the best process, if it should be merged etc, or just run in the way we were running things during testing
[09:44:10] <joal>	 addshore: That would make sense then to abandon the first of the 2 change and have only one patch using the second
[09:44:39] <joal>	 addshore: For one-offs we usually don't merge-deploy
[09:44:52] <addshore>	 okay!
[09:45:07] <joal>	 addshore: Cool :)
[09:45:33] <grrrit-wm>	 (Abandoned) Addshore: Ignore namespace when matching Special:AboutTopic [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298723 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[09:45:40] <joal>	 addshore: thanks for being nice with the comments we have ;)
[09:46:11] <addshore>	 hehe, no worries, I an new to the analytics / refinery repos :) I don't know the processes yet :)
[09:46:40] <joal>	 addshore: processes are not too tight, but we're kinda used to some ways of working ;)
[09:48:03] <joal>	 addshore: I suggest we wait for nuria_'s comments later on today, then when your code is merged/productionized, we'll defined the dates for backfilling and we'll launch the off-off jobs
[09:50:27] <grrrit-wm>	 (PS5) Addshore: Use x_analytics header to match title & ns for Special:AboutTopic [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298724 (https://phabricator.wikimedia.org/T138500)
[09:50:42] <grrrit-wm>	 (Abandoned) Addshore: Use x_analytics header to match special page name [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298725 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[09:51:10] <addshore>	 okay, patches are all in order now :)
[09:51:17] <joal>	 Thanks man :)
[10:03:58] <grrrit-wm>	 (CR) Addshore: [C: 2] Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[10:04:05] <grrrit-wm>	 (PS1) Addshore: Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299516 (https://phabricator.wikimedia.org/T140229)
[10:04:14] <grrrit-wm>	 (CR) Addshore: [C: 2] Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299516 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[10:04:18] <grrrit-wm>	 (Merged) jenkins-bot: Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[10:04:53] <grrrit-wm>	 (Merged) jenkins-bot: Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299516 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[10:16:33] <elukey>	 joal: today ema noticed that puppet was restarting varnishkafka for config file changes once every two runs.. It turned out to be a problem with puppet, it does not guarantee the order of evaluation of hash keys in a foreach loop
[10:16:38] <elukey>	  /o\
[10:17:33] <joal>	 hm, elukey, how does it impact us? data lost?
[10:18:49] <elukey>	 vk gets restarted every 40 mins
[10:18:57] <elukey>	 I am fixing it
[10:19:05] <elukey>	 we need to use a sort
[10:19:07] <elukey>	 that's it
[10:19:10] <elukey>	 -.-
[10:20:00] <elukey>	 joal: https://gerrit.wikimedia.org/r/#/c/299518/
[10:21:09] <joal>	 elukey: on our side it'll just prevent us to have a good visibility over ordering problems (we remove restarted hosts), but nothing else I think
[10:22:48] <elukey>	 yep but it is very annoying
[10:28:24] <joal>	 elukey: I can imagine
[10:28:40] <joal>	 elukey: From a cache perspective, this is really bad
[10:53:08] <elukey>	 a-team: running errand for a couple of hours max. Ping me on hangouts if you need me :)
[10:53:34] <elukey>	 varnishakafka restarts are almost completed (only misc and maps)
[11:07:35] <grrrit-wm>	 (PS3) Joal: [WIP] Add casssandra bulk loading classes [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295663
[11:13:23] <joal>	 elukey: o/
[11:14:17] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] Add casssandra bulk loading classes [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295663 (owner: Joal)
[12:55:24] * elukey back!
[12:55:33] <elukey>	 varnishkafka is finally behaving correctly :)
[12:55:36] <elukey>	 no more restarts
[12:57:57] <mforns>	 milimetric, I'm here
[12:58:17] <milimetric>	 I'll join the cave
[13:01:49] <elukey>	 joal: o/
[13:01:52] <elukey>	 do you have 5 mins?
[13:07:40] <joal>	 Hi elukey
[13:07:43] <joal>	 I have time yes
[13:07:47] <joal>	 what's up?
[13:12:15] <elukey>	 o/
[13:12:32] <elukey>	 so a couple of questions that are probably trivial and dumb
[13:12:49] <elukey>	 1) do we store 1 year of data in AQS (100[123]) atm right?
[13:13:04] <elukey>	 2) did you load one month of data only to the new cluster up to now?
[13:13:10] <elukey>	 (I am checking SSTable size :)
[13:13:28] <elukey>	 joal: --^
[13:24:32] <joal>	 elukey: there almsot a year in aqs100[123] currenty, yes
[13:25:06] <joal>	 elukey: About loading, I am currently loading a day, rechecking everythiong is smooth after some tests early on today, then starting back for a month load
[13:25:22] <joal>	 elukey: we need to have a chat about data size in cassandra
[13:25:37] <joal>	 elukey: I think I made a mistake when sizing
[13:34:30] <elukey>	 joal: I had a chat with urandom and we were puzzled (as we were a while ago IIRC) about SSTable sizes..
[13:35:16] <elukey>	 we have ~2TB for one year of data and 3 nodes but we are accounting 2TB for each of the six instances of the new cluster right? (6TB for 3 years)
[13:35:36] <elukey>	 but I can see 200GB atm on aqs1004-a.. so I tought it was a month ..
[13:37:59] <joal>	 elukey: I loaded a month, but lob failed on aqs100[456]
[13:38:57] <joal>	 elukey: That means there is a almost month of data loaded (in term of size), but since it's not correct, load is to be re-done
[13:41:06] <elukey>	 joal: ahhh okok! So assuming linear increase we should get to ~200GB * 12 ~ 2.4TB per instance for the new cluster
[13:41:14] <elukey>	 but it doesn't make any sense in my head
[13:41:24] <elukey>	 maybe my assumption is too strong
[13:44:59] <joal>	 elukey: We should spend some time on this
[13:45:13] <elukey>	 joal: yeah..
[13:58:30] <elukey>	 https://doc.wikimedia.org/mw-tools-scap/scap3/repo_config.html --> niceeeee
[14:15:25] <elukey>	 a-team: dbstore1002 needs maintenance tomorrow around this time, I suggested to Jaime to write an email to analytics@ and research@ as heads up but nothing more. Does it sound ok for you?
[14:15:39] <elukey>	 I have not a lot of experience with what people do on db1002
[14:16:38] <elukey>	 addendum: eventlogging and s1 and s2 will continue to be available on analytics-slave
[14:17:30] <milimetric>	 elukey: sounds good, but some people might want more heads up than that if possible
[14:17:32] <milimetric>	 but it's ok
[14:17:41] <milimetric>	 (the addendum will be useful)
[14:18:14] <elukey>	 milimetric: 2/3 days?
[14:36:58] <elukey>	 the downtime should be super long, but Chris and Jaime would need to do it asap
[14:45:35] <nuria_>	 hole , redaing backlog cc joal
[14:45:43] <milimetric>	 elukey: yeah, like 3 days seems good
[14:47:30] <elukey>	 milimetric: Jaime and Chris proposed Thursday at 14 UTC, they both have time
[14:47:38] <elukey>	 should take few time
[14:47:46] <milimetric>	 sweet, thanks
[14:47:57] <joal>	 hey nuria_, I don't get it :)
[14:48:19] <nuria_>	 joal: sorry, i saw you ping-ed me and i was just catching up
[14:48:27] <joal>	 ah okokok :)
[14:49:01] <joal>	 nuria_: discussion with addshore around the changes for his spark job
[14:51:18] <grrrit-wm>	 (CR) Nuria: "Can we update commit message with more specifics and a link to the documentation of the new settings in x-analytics?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298724 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[14:54:07] <addshore>	 nuria_: the patch links to the 2 patches that introduced the fields in the x-anaytics header in the Depends-On section!
[14:55:39] <nuria_>	 addshore: ok, read it , joal : why do we need to backfill?
[14:55:59] <joal>	 nuria_: addshore wants historical data
[14:56:57] <nuria_>	 joal: that seems a bit of overhead at this time addshore , backfilling requires special monitoring and takes time away from doing other things
[14:57:09] <nuria_>	 addshore: we do not normally backfil unless we are correcting a bug
[14:57:14] <nuria_>	 addshore: and this is not the case
[14:59:22] <addshore>	 well, the back filled data would be very valuable, the extension this is monitoring pageview for was only deployed a short while ago *checks the exact date*
[15:00:26] <addshore>	 From the 9th of June, also it could be seen as correcting a bug, as the initial job has already run from the 9th June to 21st of June for which bad data is now stored in graphite https://grafana.wikimedia.org/dashboard/db/article-placeholder
[15:01:49] <addshore>	 I'm more than happy to throw time at backfilling / monitoring (as thats the data that has been reuquested of me)
[15:11:47] <nuria_>	 addshore: all data is valuable, but we have to administer our resources and backfilling takes time away from other tasks. We stop and do it when there are bugs in the data that need correction but this is not the case as far as i can see, graphite just reports metrics , it is not a consumable dataset.
[15:12:21] <nuria_>	 addshore: we appreciate your offer to help but it requires a bunch of permits, plus have in mind that in order to backfill data we need to reprocess it all again
[15:12:31] <nuria_>	 addshore: at this time we cannot reprocess data selectively
[15:13:14] <nuria_>	 addshore: so we will need to reprocess again all data for the last 60 days, it is quite an ask
[15:17:59] <addshore>	 nuria_: is there any chance that you could comment on the ticket https://phabricator.wikimedia.org/T138500 ? Hopfully noone will complain and we can just move forward from today then (and I might have to find someone to remove the bad data from graphite / try and overwrite it myself)
[15:18:24] <nuria_>	 addshore: ok, let me see
[15:21:08] <nuria_>	 addshore: ok, i see , what you want is not backfilling
[15:21:20] <nuria_>	 addshore: sorry, there were two different things
[15:21:51] <nuria_>	 addshore: we normally call backfilling the process of reprocessing raw data which is time and computation consuming
[15:21:55] <milimetric>	 joal: I'm short-circuiting with this man, I need help
[15:22:05] <nuria_>	 addshore: if I understand what you want to do
[15:22:08] <addshore>	 aaah, in that case nuria_ this is not bacfilling!
[15:22:30] <nuria_>	 addshore: it's just use your spark script to publish metrics from teh very 1st record on webrequest
[15:22:32] <addshore>	 What I would want to do is run a spark job over 60 days work of the webrequest table!
[15:22:38] <addshore>	 yes :)
[15:22:41] <nuria_>	 addshore: ok, yes, that is very different
[15:22:51] * addshore will stop using the term backfilling ;)
[15:22:52] <joal>	 milimetric: Joining !
[15:23:07] <nuria_>	 addshore: cause that runs on processed data, not raw data and it doesn't require recomputing
[15:23:12] <nuria_>	 addshore: very well
[15:23:21] <nuria_>	 addshore: you need no additional permits to do that
[15:23:28] <nuria_>	 addshore: but I will do it in steps
[15:24:14] <nuria_>	 addshore: that is, 1st try for the 1st week we have in webrequest table and see that it publishes correctly
[15:24:51] <addshore>	 Okay! The job for the old webrequest entries is https://gerrit.wikimedia.org/r/#/c/298723/ (as they do not have the headers used in the new version of the job)
[15:25:03] <nuria_>	 addshore: actually first try for an hour
[15:25:25] <nuria_>	 addshore: let me look
[15:25:27] <wikibugs>	 Analytics, Analytics-Cluster, Analytics-Kanban, Deployment-Systems, and 3 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2471189 (thcipriani) >>! In T129151#2466150, @elukey wrote: > Suggestion: might be good to have a reference in https://doc.wikimedia.o...
[15:26:36] <nuria_>	 addshore: that select uses sum(view_count) which is a column that doesn't exist in webrequest
[15:28:14] <addshore>	 Wait, that is not the latest patchset, But I am confused as to where it has gone now, give me 2 ticks!
[15:29:06] <nuria_>	 addshore: k
[15:30:35] <addshore>	 okay, I don't know exactly where the patchset has gone, apparently I didn't submit it, but changing the SUM to a count(*) is as far as I remember the only change
[15:33:43] <joal>	 addshore: I have it somewhere
[15:33:59] <joal>	 addshore: https://gerrit.wikimedia.org/r/#/c/298723/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
[15:34:29] <addshore>	 joal: nope, thats not the latest one I tested! as nuria_ said it still has the SUM
[15:34:40] <joal>	 addshore: hm,
[15:35:06] <joal>	 addshore: Sorry, I thought you were about to test re-running old pageview data
[15:35:09] <joal>	 addshore: my bad !
[15:35:24] <nuria_>	 addshore: you also what to filter by content type
[15:35:30] * joal stops trying to interfer
[15:36:17] <nuria_>	 addshore: do modify select a bit, try it on 1 hour of code to make sure it works, submit your change and ping me and i can look at it
[15:38:12] <addshore>	 nuria_: content type is not in the pageview table! for the days prior to today / when the things were added to the x-analytics header the pageview table is still being used!
[15:38:57] <addshore>	 And then moving forward where the header can be used this is the query https://gerrit.wikimedia.org/r/#/c/298724/5/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
[15:40:32] <grrrit-wm>	 (Restored) Addshore: Ignore namespace when matching Special:AboutTopic [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298723 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[15:40:36] <nuria_>	 addshore: ok, the code in "The job for the old webrequest entries is https://gerrit.wikimedia.org/r/#/c/298723/ " runs against webrequest correct?
[15:40:58] <addshore>	 no, against pageview_hourly
[15:41:21] <addshore>	 which yes, means SUM(view_count) is correct there!
[15:42:13] <nuria_>	 addshore: ok, let's update commit message to document this fact ok?
[15:42:57] <grrrit-wm>	 (PS3) Addshore: Ignore namespace when matching Special:AboutTopic [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298723 (https://phabricator.wikimedia.org/T138500)
[15:43:08] <addshore>	 nuria_: done! https://gerrit.wikimedia.org/r/#/c/298723/2..3//COMMIT_MSG
[15:45:09] <nuria_>	 addshore: ok. You have not run this script before correct?
[15:45:58] <addshore>	 The job (in its previous version) has run before, over the 13 days that can be seen in the top graph on https://grafana.wikimedia.org/dashboard/db/article-placeholder
[15:47:28] <nuria_>	 addshore: ok, and you know you are selecting data for all projects right/
[15:47:31] <nuria_>	 ?
[15:48:04] <addshore>	 Yup, but it is grouped!
[15:50:11] <nuria_>	 addshore: and the dashboard is showing views just for wikidata?
[15:52:22] <addshore>	 nope, for all sites
[15:54:23] <addshore>	 But the extension is currently only deployed in a few places, and the data already added to graphite is missing some of those due to the issue with the namespaces and localization that is fixed in the next version of the jo
[15:54:25] <addshore>	 *job
[15:55:36] <nuria_>	 addshore: ok, let's try to run it , i would do it for  a smaller set of data 1st to anticipate any issues
[15:56:21] <mforns>	 milimetric, back
[15:56:40] <milimetric>	 mforns: omw to the cave
[16:02:29] <joal>	 a-team: got problem joining batcave
[16:03:03] <nuria_>	 a-team: aALONE
[16:03:32] <nuria_>	 a-team: elukey coming to standup?
[16:10:05] <addshore>	 nuria_: well, the smallest section of time that makes sense is 1 day, however you coudl probably limit it to a single project for testing.
[16:10:18] <addshore>	 But the query for grouped projects doesn't take that long as it is.
[16:34:47] <nuria_>	 addshore: sounds good, do execute for a week , right? and see that you do not run into issues
[16:37:04] <wikibugs>	 Analytics-Kanban: Consider SSTable bulk loading for AQS imports - https://phabricator.wikimedia.org/T126243#2008467 (JAllemandou) A last try has been done with sorting data to build almost sorted SSTables, but with no success. Cassandra sorting strategy is very difficult to replicate, and I didn't manage to...
[16:38:47] <addshore>	 nuria_: I'll try to get around to it this evening!
[16:39:08] <nuria_>	 addshore: ok, we can touch base as you progress, thank you for your work!
[16:43:33] <addshore>	 great! :)
[16:47:42] <joal>	 urandom: o/
[16:48:27] <urandom>	 joal: o/
[16:49:01] <joal>	 urandom: Can I bother you a few minutes about data size?
[16:49:31] <urandom>	 sure
[16:50:07] <joal>	 urandom: There is some unexplained data growth from old (2.1.13) to new (2.2.6) cluster
[16:50:29] <urandom>	 oh?
[16:50:39] <joal>	 urandom: nuria_ bets it is somthing related to repliaction factor not behaving as expected (ot us not nowing how it behaves),
[16:51:09] <joal>	 urandom: I bet on internal data format change
[16:51:24] <joal>	 urandom: and your views would be more than welcome !
[16:51:51] <urandom>	 joal: i don't of any significant internal data format changes between 2.1 and 2.2
[16:52:09] <urandom>	 the big one is 3.0 (which should make data sizes smaller)
[16:52:18] <joal>	 ok :)
[16:52:22] <joal>	 I lose my bet ;)
[16:53:21] <urandom>	 joal: doesn't mean something didn't change there, but the very existence of 2.2 was to get out the changes in master before the big storage changes landed
[16:53:39] <urandom>	 it was meant to be somewhat close to 2.1
[16:53:47] <urandom>	 anyway, how much storage change?
[16:54:37] <joal>	 urandom: here is the detail: old version handle 1 year of data with replication factor 3 over 3 nodes (1 instance per node) - A bit less than 3Tb
[16:54:52] <urandom>	 3TB per node?
[16:55:11] <urandom>	 or total/cluster-wide?
[16:55:27] <joal>	 urandom: new version has about 1 month of data loaded, replcation factor 3 as well, 3 nodes but 2 instances per node (with rack policy), and uses about 200Gb per Instance
[16:55:32] <joal>	 3Tb per node
[16:56:42] <joal>	 So the diff is almost as of replication factor was 6 instead of 3
[16:57:05] <urandom>	 ?
[16:57:32] <urandom>	 ~4.8T per host as opposed to ~3T, yes?
[16:57:47] <urandom>	 (200 * 2) * 12 ~= 4.8T
[16:57:48] <wikibugs>	 Analytics, Reading-analysis, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2471562 (Nuria) >3) What websites should we compare Wikimedia sites with? For example, should we compare wikipedia.org with bi...
[16:58:03] <joal>	 correct urandom n
[17:00:13] <joal>	 urandom: other thing we changed is the rack-awareness policy ... But I don't imagine how it could have a nimpact
[17:00:48] <urandom>	 you also have a different compression algorithm
[17:00:52] <urandom>	 lz4 vs deflate
[17:01:19] <joal>	 urandom: good catch !!!
[17:02:23] <urandom>	 wow
[17:02:31] <joal>	 urandom: Would that explain a *2 in data size?
[17:02:33] <urandom>	 SSTable Compression Ratio: 0.18338380961018205
[17:02:40] <urandom>	 SSTable Compression Ratio: 0.08109077478223656
[17:03:42] <elukey>	 did the default change?
[17:03:42] <urandom>	 joal: it would probably explain a lot, but I don't know about that much
[17:03:59] <urandom>	 elukey: no, LZ4 is the default actually
[17:04:30] <joal>	 urandom: I also expect DTCS to prevent having a good compression ratio
[17:04:51] <urandom>	 joal: why is that?
[17:05:12] <joal>	 urandom: because keys are duplicated?
[17:06:03] <joal>	 meaning, most of the compression we can have is on fileds that repeat themselves
[17:06:17] <urandom>	 joal: hrmm, i dunno, DTCS would give you bigger tables
[17:06:30] <joal>	 right, therefore better compression
[17:06:35] <joal>	 hmm
[17:06:41] <joal>	 weird WEIRD !
[17:07:09] <joal>	 elukey: I think I'm gonna stop loading from now, and restart tomorrow morning after a wipe and a config change for compression
[17:07:16] <joal>	 elukey: would that be ok with you?
[17:08:24] <addshore>	 joal: just curious but what does the "sync workflow" button do in hue?
[17:08:35] <urandom>	 i guess something must have changed for you to not have the tables created with deflate
[17:09:02] <joal>	 addshore: I don't know !
[17:09:20] <joal>	 urandom: I think we changed the compression manually
[17:10:41] <urandom>	 joal: ok, in config?
[17:10:53] <joal>	 urandom: I don't think so, through CQL
[17:10:59] <urandom>	 oh, i see
[17:11:32] <elukey>	 joal: good for me, I'll wipe the cluster again tomorrow :)
[17:11:51] <joal>	 urandom: the talk we had with gwicke at that moment was that compression was not handled by restbase, therefore could be changed easily without fear of restarts
[17:12:12] <joal>	 elukey: I'll wait for nuria_ opinion, but I think that's the wisest
[17:12:15] <urandom>	 it's OK for Cassandra to change it (at any time)
[17:12:32] <joal>	 urandom: Ok, I might try that then :)
[17:12:42] <urandom>	 key_rev_value in restbase defaults to deflate (creates tables that way)
[17:12:50] <urandom>	 but you can configure it otherwises
[17:12:54] <urandom>	 otherwise
[17:13:31] <urandom>	 joal: if you change it, then new tables get the new algorithm, and the old ones get decompressed with the old algorithm (until eventually being compacted to the new)
[17:13:50] <urandom>	 so you can change it as you like
[17:15:01] <joal>	 ok urandom, updating compresion !
[17:16:14] <joal>	 !log Change compression from lz4 to deflate
[17:16:16] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[17:16:30] <joal>	 !log Change compression from lz4 to deflate on aqs100[456]
[17:16:32] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[17:16:45] <joal>	 urandom: Thanks a lot for having spotted the compression diff
[17:16:58] <joal>	 I had read this line so many times I didn't even notice :(
[17:17:09] <urandom>	 no worries
[17:17:13] <urandom>	 happy to help
[17:17:28] <joal>	 urandom: Also, I have tried to optimize bulk loading with sorting data into tables, but didn't manage to replicate cassandra sorting
[17:18:15] <urandom>	 joal: you're back to doing CQL inserts, yes?
[17:20:01] <urandom>	 joal: nm, i just saw your phab comment
[17:20:52] <joal>	 urandom: Yes, CQL insterts
[17:27:01] <joal>	 nuria_: I'll go for diner now, let's discuss cassandra in an hour or so ?
[17:27:17] <nuria_>	 joal: k
[17:28:19] <elukey>	 joal: do you need me now for cassandra?
[17:28:26] <elukey>	 otherwise I'll go afk :)
[17:32:38] <elukey>	 will be back online in a bit to check, ttl!
[17:32:58] <elukey>	 a-team: anything that you need from your dear ops?
[17:33:11] <elukey>	 in case ping me on IRC or email me :)
[17:33:12] <mforns>	 hehe
[17:33:27] <mforns>	 bye elukey!
[17:33:53] <elukey>	 o/
[17:42:34] <addshore>	 joal: is there a way to see the state of spark jobs not using hue? (as it always seems to be so slow)!
[17:46:12] <nuria_>	 addshore: did you launch it with oozie or spark shell
[17:46:23] <addshore>	 oozie!
[17:48:58] <addshore>	 For example under https://hue.wikimedia.org/oozie/list_oozie_coordinator/0028751-160630131625562-oozie-oozi-C/ It still only shows tasks for 2 days (but I can see that data from some of the other days of the same job have already completed and are in graphite)
[17:51:13] <addshore>	 elukey: Could you take a look at https://gerrit.wikimedia.org/r/#/c/298931/ ? :)
[17:54:23] <addshore>	 well nuria_ all of the data seems to be getting added to graphite fine now! I could run one more oozie job to finish filling the gap!
[17:54:38] <addshore>	 oddly the oozie jobs still show up as running on hue!
[17:54:47] <nuria_>	 addshore: elukey has gone home
[17:55:48] <addshore>	 ahhh! I didn't realise that was him signing off!
[18:07:57] <addshore>	 afk for dinner!
[18:14:26] <quiddity>	 Do we have any sort of mobile stats (mobileweb and/or apps) for "Tablet vs Phone" ?
[18:14:32] <quiddity>	 (I'm not sure if we track screen-dimensions, or if there are other ways to distinguish tablets from the rest...?)
[18:33:54] <joal>	 Thanks elukey for having asked, no need for tonight :)
[18:34:32] <joal>	 nuria_: some thoughts on cassandra?
[18:34:46] <nuria_>	 joal; give me 10 mins
[18:34:50] <joal>	 sure nuria_
[18:53:25] <wikibugs>	 Analytics, Analytics-EventLogging, Performance-Team: Stop using global eventlogging install on hafnium (and any other eventlogging lib user) - https://phabricator.wikimedia.org/T131977#2472057 (greg)
[19:10:05] <nuria_>	 joal: sorry, ready now
[19:10:08] <nuria_>	 joal: yt?
[19:10:09] <joal>	 np nuria_
[19:10:18] <joal>	 batcave or IRC?
[19:10:44] <nuria_>	 let's try irc cause my batt is super low
[19:10:51] <nuria_>	 did urandom had any tips?
[19:11:11] <joal>	 he actually pointed a conf difference I hadn't noticed: compression
[19:11:38] <joal>	 We were using lz4 on new cluster, old had deflate
[19:11:48] <nuria_>	 joal: aham
[19:11:57] <nuria_>	 joal: anad data is compressed and compacted?
[19:12:02] <nuria_>	 joal: ok i see
[19:12:12] <nuria_>	 joal: so i take a new wipeout is needed
[19:12:26] <joal>	 correct nuria: data is compacted using a strategy, and then compacted tables are compressed
[19:12:41] <joal>	 nuria_: no, actually the change has already been made
[19:12:55] <nuria_>	 joal: ah, on teh new cluster?
[19:13:01] <nuria_>	 joal: and it recompresses?
[19:13:11] <joal>	 nuria_: Since almost no data had been inserted, most of it will be recompactedc therefore recompressed, so we should be good
[19:13:40] <nuria_>	 joal: okeis! then, we can do a space check tomorrow
[19:13:45] <joal>	 nuria_: oldly compressed data is recompressed when recompacted (when tha SSTables holding those data are changed)
[19:14:11] <joal>	 nuria_: Since almost no data was loaded, most of it will be recompacted during initial load
[19:14:44] <joal>	 nuria_: tomorrow I'm not sure we'll see a diff, load is currently happening and it'll take a few days (including compaction)
[19:15:10] <nuria_>	 joal: ok, let's check back at teh end of teh week
[19:15:12] <nuria_>	 *the
[19:15:32] <joal>	 nuria_: The other thing to keep in mind is that we probably don't want to serve all our historical data wiht daily level of granularity (due to data volumes)
[19:15:52] <joal>	 nuria_: I think it would be a good discussion to have with Dan
[19:16:24] <nuria_>	 joal: I think we will always have taht use case
[19:16:33] <joal>	 ?
[19:16:35] <nuria_>	 joal: daily granularity
[19:16:52] <nuria_>	 joal: at least that is what i would expect for an API like ours
[19:16:59] <joal>	 nuria_: sure, but with data from the beginning of time, or only recent
[19:17:09] <joal>	 ?
[19:17:12] <nuria_>	 joal: from beginning
[19:17:21] <joal>	 nuria_: Ok, then we'll need hardware :)
[19:17:29] <nuria_>	 joal: as a user I  would not expect that to change
[19:17:57] <nuria_>	 joal: will our current hardware hold data for 3 years ? mmm.. i guess we do not know yet
[19:18:05] <joal>	 nuria_: While I understand, the price of the storage / hardaware for serving that use case is, IMO, too expensive given the usage people have of it (so far)
[19:18:54] <nuria_>	 joal: let's see where we are capacity wise once compression is working ok and we can plan for that after scaling
[19:19:02] <joal>	 nuria_: our current hardware (SSDs) can handle about 2 years of data as of what we know now
[19:19:12] <joal>	 right
[19:23:59] <nuria_>	 joal: ok, this is something we might need to take a 2nd look at
[19:24:37] <joal>	 nuria_: it just bothers me because of teh habit people will take to have infinite data ;)
[19:25:01] <nuria_>	 joal: infinite data sounds like a sci-fi book
[19:25:04] <joal>	 nuria_: but your approach is best: first scale in qps, then the rest
[19:25:08] <joal>	 nuria_: :D
[19:25:30] <nuria_>	 joal: also we would need to talk to mark for $$$
[19:25:39] <joal>	 nuria_: true
[19:25:57] <joal>	 nuria_: It's actually the ratio $$$/usage thatI'm not sure of
[19:29:19] <joal>	 a-team, i'm logging off for tonight
[19:29:30] <mforns>	 joal, good night!
[19:33:55] <nuria_>	 joal: agreed, let's look at this later
[21:36:26] <mforns>	 hello milimetric?
[21:36:34] <mforns>	 yt?
[21:37:02] <milimetric>	 hey yea
[21:38:20] <milimetric>	 hey mforns
[21:38:25] <mforns>	 :]
[21:38:31] <mforns>	 working?
[21:39:17] <milimetric>	 eh
[21:39:19] <milimetric>	 :)
[21:39:23] <milimetric>	 wanna chat quick?
[21:39:27] <mforns>	 hehe ok