[02:09:37] <grrrit-wm>	 (PS1) Milimetric: [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903
[02:10:33] <grrrit-wm>	 (CR) Milimetric: "Didn't get very far, Joseph, got a little bogged down with the tedious mapping code." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric)
[02:11:27] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric)
[09:09:57] <elukey>	 joal: sorry joining
[09:10:06] <elukey>	 was debugging and lost track of time -.-
[09:13:06] <joal>	 elukey: np, Lino is off schedule today
[09:13:25] <joal>	 elukey: Can we move the meeting to later on (let's say 1pm?)
[09:13:52] <elukey>	 suuuure
[09:14:31] <elukey>	 done :)
[09:14:49] <joal>	 thx mate
[10:59:18] <joal>	 elukey: meeting in batcave?
[11:00:26] <elukey>	 yep joining
[11:20:49] * elukey lunch!
[11:27:44] <joal>	 elukey: Arf, forgot one thing about conference travelling
[11:27:53] <joal>	 elukey: we'll finalize this after :)
[12:32:11] <wikibugs>	 Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2601271 (elukey) Today I tried to look into all the per article metrics in Graphite to check for any relevant pattern. The rationale is that $something triggers IOPS on disk that eventually tu...
[12:50:34] <wikibugs>	 Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2601283 (elukey) One possible way to have more insight would be http://techblog.netflix.com/2015/07/java-in-flames.html  I checked kernel+jdk should support this. The only requirement would be...
[13:44:22] <grrrit-wm>	 (PS2) Milimetric: [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903
[13:45:48] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric)
[14:00:21] <joal>	 git up
[14:00:23] <joal>	 oops :)
[14:09:33] <joal>	 milimetric: Currently reading your last scala patch
[14:09:49] <joal>	 milimetric: if you want, we can pair from now to standup
[14:10:04] <milimetric>	 working with Andrew now
[14:10:11] <joal>	 ok milimetric, np
[14:10:12] <milimetric>	 that patch is definitely mid-stream
[14:10:19] <milimetric>	 we can chat about it a bit before standup, like 10 minutes?
[14:10:23] <joal>	 milimetric: will try to move from there
[14:10:29] <joal>	 sure
[14:50:12] <milimetric>	 ok joal, wanna chat a little?
[14:50:17] <joal>	 sure milimetric
[14:50:23] <joal>	 milimetric: batcave !
[15:01:19] <wikibugs>	 Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2601560 (jcrespo) Resolved>Open This is now causing cronspam, 1 email per hour sent to root@. This is not a good solution:   ``` rsync: mkdir "/limn-pu...
[15:03:23] <wikibugs>	 Analytics-Kanban: Create clean simplewiki output from edit history reconstruction - https://phabricator.wikimedia.org/T143321#2601574 (Nuria)
[15:03:26] <wikibugs>	 Analytics-Kanban: Edit History: Review scala code functionality and make page and user output uniform - https://phabricator.wikimedia.org/T143322#2601573 (Nuria) Open>Resolved
[15:13:02] <wikibugs>	 Analytics-Kanban: Switch AQS to new cluster - https://phabricator.wikimedia.org/T144497#2601636 (Nuria)
[15:40:50] <milimetric>	 https://www.irccloud.com/pastebin/eIfNO6ru/
[15:41:01] <milimetric>	 revert weidness joal ^
[16:10:08] * elukey afk! o/
[16:58:13] <joal>	 milimetric: the sha1 thing is more difficult than I expected, I'm gonna leave it for now
[16:59:00] <joal>	 On another aspect milimetric : Seems that clickhouse is faster than druid on aggregations and same-ish on time-series
[16:59:36] <joal>	 but this is just fast-trying, more complete tests should be done
[17:01:04] <milimetric>	 huh, interesting.  I'd have expected the opposite joal
[17:01:21] <milimetric>	 (druid faster on aggregation and slower on timeseries
[17:01:27] <joal>	 right milimetric
[17:01:46] <milimetric>	 but as for the sha1, yeah, totally, make it a no-op for now and we'll populate that field later.  I think that's the beauty of this approach, doesn't have to be complete to be done
[17:01:58] <joal>	 milimetric: ok great :)
[17:02:36] <joal>	 milimetric: going through modifying every rev in-between every revert efficiently is the real
[17:02:40] <joal>	 thing
[17:26:43] <leila>	 hi milimetric. I'm back again. I have a question about analytic-users group. Can you tell me what kind of access the users in that group have?
[17:27:01] <leila>	 Stas and I are discussing some access options, and I'm not clear about that group.
[17:27:44] <nuria_>	 leila: i think you have to ask ottomata
[17:28:03] <leila>	 nuria_: thanks. ottomata, ^
[17:36:33] <ottomata>	 leila:  hiiii
[17:36:41] <ottomata>	 they'd get login access to stat1002 & stat1004
[17:36:48] <ottomata>	 and they'd have an account created in hadoop
[17:36:50] <ottomata>	 but that's it
[17:37:01] <ottomata>	 they wouldn't have access to things  that are group readable by people in  the analytics-privatedata-users group
[17:37:03] <ottomata>	 like webrequests
[17:37:13] <ottomata>	 so, its really only useful if the user wants to use hadoop with her own data
[17:37:16] <ottomata>	 not stuff we have
[17:37:19] <ottomata>	 actually
[17:37:26] <ottomata>	 there is stuff in there that should be readable by non privatedata user
[17:37:30] <ottomata>	 like ummm, pageviews?  maybe?
[17:37:35] <ottomata>	 can't remember if that is private...it might be
[17:37:58] <ottomata>	 but, other stuff like the edit history that milimetric et. al. are working on would probably be non private
[17:38:03] <ottomata>	 so people in analytics-users could access that
[17:38:16] <ottomata>	 its bascailly hadoop access without access to webrequests
[17:41:32] <leila>	 ottomata: thanks. for my understanding: are the accesses governed by database names? for example, is it correct to say that analytics-users won't have access to wmf_raw database?
[17:42:59] <leila>	 and if that's the case, will analytics-users have access to wmf database? (as you said, there is some data in wmf database, for example webrequest, or pageview_hourly, that is still sensitive, correct?)
[17:48:30] <nuria_>	 leila: no, it is not database restricted
[17:49:13] <ottomata>	 nope, access is not goverend by database
[17:49:17] <ottomata>	 its just HDFS file permissions
[17:49:23] <ottomata>	 Hive is just a mapping onto files in HDFS anyway
[17:49:39] <nuria_>	 leila: it's like a directory tree on linux
[17:49:47] <ottomata>	 the database are arbitrary
[17:49:49] <nuria_>	 leila: in some directories you have reda permits
[17:49:52] <nuria_>	 *read
[17:49:59] <ottomata>	 its all about the location of the data for the external table partitoins
[17:50:03] <ottomata>	 so if you do
[17:50:10] <ottomata>	 show partitions wmf.webrequest
[17:50:13] <nuria_>	 leila: but not in others that might be restricted just to be readable by root
[17:50:13] <ottomata>	 (you'll get a LOT)
[17:50:16] <ottomata>	 you can see the HDFS file paths
[17:50:19] <leila>	 How can I get a list of the directories/data the user will have access to if in analytics-users?
[17:50:20] <ottomata>	 and then if you do
[17:50:23] <ottomata>	 hdfs dfs -ls <path>
[17:50:27] <ottomata>	 you can see the access perms
[17:50:33] <ottomata>	 ummm
[17:50:35] <nuria_>	 joal, milimetric : so, ahem, i found 1 inconsistancy that might be notable
[17:50:48] <nuria_>	 joal, milimetric : in cassandra taht is
[17:50:49] <ottomata>	 leila:  not really sure, anything that is readable by users in that group.
[17:50:49] <ottomata>	 hm
[17:50:57] <nuria_>	 *that
[17:51:02] <leila>	 ottomata: that's already helpful. Let me dig into that.
[17:51:16] <milimetric>	 nuria_: what's the inconsistency?
[17:51:50] <nuria_>	 milimetric: that "no views" are reported as null in new cluster and zero in old cluster
[17:52:07] <nuria_>	 milimetric: [{"project":"wikidata","article":"Q604141","granularity":"daily","timestamp":"2016060100","access":"all-access","agent":"user","views":0}
[17:52:09] <nuria_>	 versus
[17:52:20] <milimetric>	 right
[17:52:32] <milimetric>	 that was a hot topic when we first launched
[17:52:44] <nuria_>	 {"project":"wikidata","article":"Q604141","granularity":"daily","timestamp":"2016060100","access":"all-access","agent":"user","views":null}
[17:52:46] <ottomata>	 leila: it might be easier to just look at perms in hdfs
[17:52:49] <ottomata>	 in /wmf/data/*
[17:52:57] <ottomata>	 also	 if you find anything that isn't as it should be
[17:52:59] <ottomata>	 let me know
[17:53:34] <ottomata>	 e.g. if you happen to find webrequests that aren't 750 hdfs:analytics-privatedata-users
[17:53:38] <milimetric>	 nuria_: yeah, we can go back to the discussions we had in the beginning, I think people preferred 0s.  But we could easily change that behavior in javascript and just store the nulls as they are
[17:53:47] <leila>	 I'll do that ottomata. :)
[17:54:54] <nuria_>	 milimetric: ya, i do not think is a deal breaker but we should dig into why that is the case, will make a note in ticket
[17:56:01] <nuria_>	 milimetric: but yes, we should pass 0s not nulls
[17:56:04] <milimetric>	 nuria_: my memory is definitely foggy, but I remember joseph doing custom logic on his loader back then.  Maybe he just forgot to do the same with the new loader
[17:56:23] <milimetric>	 either way, I think it's more efficient to store nulls
[17:56:33] <milimetric>	 I always thought we should store nulls and return 0s
[17:56:35] <nuria_>	 milimetric: but api should return 0s no question
[17:56:40] <nuria_>	 milimetric: agreed
[17:56:43] <milimetric>	 right
[17:57:20] <milimetric>	 and now we have more data about CPU usage and all that for the cluster, so we can make informed decisions
[17:58:24] <leila>	 right, so ottomata, if I understand the output of hdfs dfs -ls /wmf/data/* correctly, /wmf/data/archive/webrequest will be accessible to analytics-users?
[17:59:04] <leila>	 ow no! scratch that. that is available only to analytics-privatedata-users
[17:59:50] <ottomata>	 leila:  the ones in archive are not the unsampeld webrequests
[17:59:54] <ottomata>	 so those are readalbe
[18:00:00] <ottomata>	 those are historical copies of udp2log like data
[18:00:09] <ottomata>	 see
[18:00:14] <ottomata>	  /wmf/data/wmf/webrequest
[18:00:16] <ottomata>	 and/or
[18:00:18] <ottomata>	  /wmf/data/raw/webrequest
[18:00:27] <wikibugs>	 Analytics-Kanban: Continue New AQS Loading - https://phabricator.wikimedia.org/T140866#2602230 (Nuria) Tested a bit how are we doing consistency wise and thus far things checkout. I found 1 issue. See repro below.   Current API: http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/wikidata.org/all-...
[18:44:50] <milimetric>	 a-team: public service announcement: you have to escape semicolons in hive strings otherwise it throws an EOF error:
[18:44:58] <milimetric>	 select 'what the;'; throws an error
[18:45:05] <milimetric>	 but select 'what the\;'; is ok
[19:05:51] <wikibugs>	 Analytics, Cassandra, Discovery, Maps, and 2 others: Investigate and implement possible simplification of Cassandra Logstash filtering - https://phabricator.wikimedia.org/T130861#2602486 (Eevans) I updated https://gerrit.wikimedia.org/r/#/c/282466 to include the equivalent 2.2 configuration, and...
[19:06:52] <joal>	 nuria_, milimetric: Diff between nulls and 0 is due to me upgrading storage to use null instead of 0 (less space used)
[19:07:16] <milimetric>	 ok, that makes sense, we can just coalesce them in the output code
[19:07:18] <nuria_>	 joal: ok, I will just need to do changes to aqs to make sure those get wrapped at js layer
[19:07:18] <joal>	 nuria_, milimetric: However I'd have expected restbase to convert null to 0
[19:07:33] <nuria_>	 joal: jaja, if we write the code sure
[19:07:42] <joal>	 milimetric: I can also do that
[19:07:54] <joal>	 nuria_: My bad, I had in mind that it already did it
[19:07:58] <milimetric>	 I'll file the task so we don't forget and put it on kanban
[19:08:04] <joal>	 great
[19:08:17] <nuria_>	 milimetric: ok, was about to do it, just looked at code again and shoudl be easy
[19:08:20] <nuria_>	 *should
[19:08:36] <nuria_>	 milimetric: let me finish deploying latest changes to dashiki
[19:09:19] <wikibugs>	 Analytics-Kanban: Coalesce nulls to 0s in output - https://phabricator.wikimedia.org/T144521#2602510 (Milimetric)
[19:09:20] <milimetric>	 https://phabricator.wikimedia.org/T144521
[19:15:49] <wikibugs>	 Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2602573 (Milimetric) ping @Ottomata because he was cleaning these directories a bit, may be related
[19:16:45] <joal>	 logging off a-team, bye !
[19:18:59] <ottomata>	 byyye!
[19:23:14] <wikibugs>	 Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2602603 (Ottomata) Thanks, just merged a fix.
[19:23:56] <wikibugs>	 Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2602604 (Ottomata) Open>Resolved I'm pretty sure that cronspam problem was unrelated to this ticket. Closing.
[19:50:05] <grrrit-wm>	 (PS4) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476)
[19:53:26] <grrrit-wm>	 (PS5) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476)
[20:00:24] <grrrit-wm>	 (PS7) Nuria: Bookmark for browser dashboard regarding graph and time [analytics/dashiki] - https://gerrit.wikimedia.org/r/306980 (https://phabricator.wikimedia.org/T143689)
[20:06:55] <grrrit-wm>	 (CR) Nuria: [C: -1] "Need to fix couple comments" [analytics/dashiki] - https://gerrit.wikimedia.org/r/306980 (https://phabricator.wikimedia.org/T143689) (owner: Nuria)
[20:26:20] <nuria_>	 argh, bug ! on bookmarks
[20:38:07] <grrrit-wm>	 (PS6) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476)
[20:39:14] <grrrit-wm>	 (PS7) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476)
[20:39:28] <grrrit-wm>	 (CR) Milimetric: "ok I think this is ready for review again" [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476) (owner: Milimetric)