[03:53:20] 10Analytics, 06Performance-Team: Eventlogging client needs to support offline events - https://phabricator.wikimedia.org/T162308#3159515 (10Krinkle) [06:35:04] (03PS3) 10Fdans: Removes subjective fields from mediawikihistory query [analytics/refinery] - 10https://gerrit.wikimedia.org/r/346522 (https://phabricator.wikimedia.org/T157362) [08:30:57] hi fdans [08:31:10] Have you seen my comments on your refinery-source CR? [10:14:51] joal: hello! I have, on them right now :) [10:14:57] thank you so much for the CR [10:15:40] fdans: no prob :) [10:15:56] fdans: I was just making sure they make sense and look ok to you :) [10:39:26] (03PS1) 10Joal: [WIP] Add wikidata graph analysis code [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346726 [11:08:19] (03PS2) 10Joal: [WIP] Add wikidata graph analysis code [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346726 [11:48:20] taking a break a-team [12:01:33] * elukey brb! [12:36:45] (03PS2) 10Fdans: Remove is_productive and update time to revert [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346519 (https://phabricator.wikimedia.org/T157362) [12:36:55] (03CR) 10Fdans: Remove is_productive and update time to revert (0314 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346519 (https://phabricator.wikimedia.org/T157362) (owner: 10Fdans) [12:40:14] (03CR) 10jerkins-bot: [V: 04-1] Remove is_productive and update time to revert [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346519 (https://phabricator.wikimedia.org/T157362) (owner: 10Fdans) [12:51:25] (03PS3) 10Fdans: Remove is_productive and update time to revert [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346519 (https://phabricator.wikimedia.org/T157362) [12:58:01] (03Abandoned) 10Fdans: Removes subjective fields from mediawikihistory query [analytics/refinery] - 10https://gerrit.wikimedia.org/r/346522 (https://phabricator.wikimedia.org/T157362) (owner: 10Fdans) 
[13:11:41] (03PS1) 10Fdans: Remove is_productive and update time to revert [analytics/refinery] - 10https://gerrit.wikimedia.org/r/346755 (https://phabricator.wikimedia.org/T157362) [13:13:16] (03PS2) 10Fdans: Update jobs to remove is_productive and update time to revert [analytics/refinery] - 10https://gerrit.wikimedia.org/r/346755 (https://phabricator.wikimedia.org/T157362) [13:24:44] mooorninnnng elukey! [13:24:49] should we do an02? [13:26:37] morniiiinnnggggg [13:26:49] I just pinged you in phab :) [13:27:05] lemme grab a coffee and then I'll be ready [13:27:05] oh! ok will read [13:35:35] oh elukey, thought of something about an03 downtime annoucement [13:35:42] an03 is used for druid metadata storage [13:35:53] i think we will need to do druid downtime too [13:37:07] yeah :( [13:37:27] hm, ok so an02. does not have a partman [13:37:30] i am sorry :/ [13:38:54] it is RAID 1 across 4 drives [13:39:03] with lvm on top [13:40:30] yep I didn't have time this morning to come up with a recipe, but I can try now if you have patience [13:40:34] should be easy [13:40:37] haha, if you WANT to [13:40:48] elukey: do you think it is possible to reinstall a node wthout wiping partitions at all? [13:40:53] just to use the existing / for reinstall? [13:41:14] if we do it via console probably yes [13:41:16] that one coudl be 'wiped', but i wonder if we could do it so md0 and lvm didn't go away [13:41:17] yeah [13:41:29] all right let's do it via console :) [13:41:31] I got the drill [13:41:36] haha ok [13:41:43] batcave? lemme get settled into monitor [13:41:51] we should backup the stuff we have there remotely first i guess [13:42:02] /srv/backup is mysql backups, 24G [13:42:15] name dir is 1.7G [13:42:30] hm, i'm going to pause LVM backups of MySQL and rsync /srv/backup to stat1004 or 2 or something [13:42:31] ok? [13:42:39] ok! 
[13:42:45] going to the batcave [13:43:02] be there in a min, let me start this rsync [14:04:51] People we are reimaging analyics1002 to Debian [14:05:01] let us know if you see any issue [14:19:37] (03CR) 10Joal: [C: 031] "A lst comment, but really not preventing to merge :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346519 (https://phabricator.wikimedia.org/T157362) (owner: 10Fdans) [14:20:08] Thank you joal ! [14:22:54] (03PS4) 10Fdans: Remove is_productive and update time to revert [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346519 (https://phabricator.wikimedia.org/T157362) [14:33:47] (03CR) 10Fdans: [C: 032] Remove is_productive and update time to revert [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346519 (https://phabricator.wikimedia.org/T157362) (owner: 10Fdans) [14:46:10] 06Analytics-Kanban, 10MediaWiki-extensions-WikimediaEvents, 10The-Wikipedia-Library: Logging instrumentation is completely broken - https://phabricator.wikimedia.org/T162365#3160497 (10Milimetric) [14:46:25] 06Analytics-Kanban, 10MediaWiki-extensions-WikimediaEvents, 10The-Wikipedia-Library: ExternalLinksChange Logging instrumentation is completely broken - https://phabricator.wikimedia.org/T162365#3160512 (10Milimetric) [14:52:34] Question concerning apps: French Wikipedia's home page visits have dropped since the community has applied the new design. https://tools.wmflabs.org/pageviews/?project=fr.wikipedia.org&platform=mobile-app&agent=user&start=2016-03-02&end=2017-04-05&pages=Wikip%C3%A9dia:Accueil_principal Anyone knows why?
[14:55:34] 10Analytics-Cluster, 06Analytics-Kanban: Enable hyperthreading on analytics100[12] - https://phabricator.wikimedia.org/T159742#3160532 (10Ottomata) [14:55:38] Trizek: I would not assume that is due to nay changes deployed, requests to the main page in large amounts normally indicate connection problems (browser trying to connect to http://fr.wikipedia.org via https , will get redirected to main page and if it cannot stablish a connection the cycle will repeat) [14:56:29] So the big amount of visits is something that is now fixed? [14:57:13] That explains why the results have skyrocket before christmas! :) [14:57:51] Trizek: we do not normally need to "fix" anything, if it is connection issue normally can be traced to a browser, if it is an undetected bot those requests stop all of a sudden [14:58:44] Trizek: w/o looking at requests in detail would be hard to say but main page requests huge numbers have indicated connection problems before, as it is our default page [14:59:28] thanks nuria [15:09:12] 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 06Operations, 13Patch-For-Review: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3160560 (10Nuria) [15:09:13] 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 05WMF-NDA: Drop tables with no events in last 90 days. 
- https://phabricator.wikimedia.org/T161855#3145199 (10Nuria) 05Open>03Resolved [15:09:42] 10Analytics-Dashiki, 06Analytics-Kanban, 13Patch-For-Review: Show pageviews prior to 2015 in dashiki - https://phabricator.wikimedia.org/T143906#3160566 (10Nuria) 05Open>03Resolved [15:09:44] 06Analytics-Kanban, 13Patch-For-Review: Move reportcard to dashiki and new datasources - https://phabricator.wikimedia.org/T130117#3160568 (10Nuria) [15:11:29] 10Analytics: Polish script that checks eventlogging lag to use it for alarming - https://phabricator.wikimedia.org/T124306#3160587 (10Nuria) [15:11:33] 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 06Operations, 13Patch-For-Review: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10Nuria) 05Open>03Resolved [15:15:04] 06Analytics-Kanban, 06Operations, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3160595 (10Ottomata) [15:24:46] ottomata: [15:24:46] elukey@analytics1001:~$ sudo -u hdfs hdfs haadmin -getServiceState analytics1002-eqiad-wmnet [15:24:50] standby [15:24:52] ya! [15:24:52] rmadmin is ok too [15:24:55] \o/ [15:45:27] joal: wanna look at this jsonschema scala model thing with me real quick [15:45:29] tell me waht you thikn? [15:45:48] sure ottomata give me a minute and I'll join batcave [15:46:03] k [15:46:11] elukey: i just re-enabled the mysql backup task and puppet on an03 [15:46:24] i think an02 is good to go [15:46:33] did we want to switch it to master today? [15:46:37] or maybe on monday? 
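The `hdfs haadmin -getServiceState` check elukey pasted above verifies which master is active before a failover. A minimal Python wrapper for that check could look like the sketch below; the helper names are invented, and the subprocess call assumes a configured Hadoop client on the host, just as in the log:

```python
import subprocess


def parse_ha_state(raw: str) -> str:
    """Normalize the output of `hdfs haadmin -getServiceState <serviceId>`."""
    state = raw.strip().lower()
    if state not in ("active", "standby"):
        raise ValueError("unexpected HA state: %r" % raw)
    return state


def get_service_state(service_id: str) -> str:
    """Shell out to the same command run in the log (requires Hadoop client)."""
    out = subprocess.check_output(
        ["sudo", "-u", "hdfs", "hdfs", "haadmin", "-getServiceState", service_id]
    )
    return parse_ha_state(out.decode())
```

The analogous check for YARN ("rmadmin is ok too" above) would shell out to `yarn rmadmin -getServiceState` instead.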
[15:47:34] Trizek: look at agreggated pageviews , they have nor chnaged that much: https://tools.wmflabs.org/siteviews/?platform=all-access&source=pageviews&agent=user&start=2016-11-01&end=2017-03-31&sites=fr.wikipedia.org [15:48:24] There is a big change when we only consider mobile app nuria https://tools.wmflabs.org/siteviews/?platform=mobile-app&source=pageviews&agent=user&start=2016-11-01&end=2017-03-31&sites=fr.wikipedia.org [15:48:27] ottomata: I'd prefer on Monday JUST IN CASE [15:48:47] so we'll have time to verify that an1002 is working properly [15:48:49] wdyt? [15:48:53] (I am removing downtime) [15:49:24] Trizek: pageviews for app are affected the most when apps get released [15:49:59] Trizek: any bug in the app that -for example- does a silent request will "artifically" incresae pageviews [15:50:12] Trizek: you need to plot app releases to gether with those to see it [15:50:38] Trizek: and again, for apps is the same thing, major Mian_page requests are likely artificial [15:50:49] *artificially increase [15:51:42] Trizek: and looking a bit further to browser data (this data is not public) that spike is neither android nor ios [15:51:49] Trizek: for the app [15:52:12] elukey: sounds good [15:52:23] we'll promote it monday, do an01 reimage tuesday [15:54:18] https://usercontent.irccloud-cdn.com/file/rbGSXm7q/Screen%20Shot%202017-04-06%20at%208.52.19%20AM.png [15:54:23] cc Trizek [15:55:13] thanks nuria [15:55:46] Trizek: so, on further inspection, it really looks like "bogus" app pageviews [15:57:46] I'm not sure I'm getting everything nuria, but, at least, I known now it is not related to a design change that may have broken any parser. 
[15:57:47] 10Analytics-Cluster, 06Analytics-Kanban: Replacement of stat1002 and stat1003 - https://phabricator.wikimedia.org/T152712#3160686 (10Nuria) [15:59:50] Trizek: that graph is app pageviews by user agent [16:00:08] Trizek: the blue line is NOT android nor IOS which points to bot data [16:00:24] Trizek: will need to look at it a bit further [16:03:49] No problem nuria. Thank you for taking care of this. Plus I already have the answer I need. [16:04:27] 10Analytics: Add unique devices dataset to pivot - https://phabricator.wikimedia.org/T159471#3160706 (10Nuria) We should be able to have data per country since the beginning. Let's do daily and monthly unique devices. - create ingestion spec for druid (json) - hql for hive - create 2 oozie jobs with coordin... [16:04:42] 06Analytics-Kanban: Add unique devices dataset to pivot - https://phabricator.wikimedia.org/T159471#3160707 (10Nuria) [16:07:40] 10Analytics, 07Easy: Pre-generate mysql ORM code for sqoop - https://phabricator.wikimedia.org/T143119#3160710 (10Nuria) - changing scoop job to have a parameter for jars (with schema version changes there will be jar changes , those jars are already build) - passing jar to scoop job that will generate ORM code [16:10:19] 10Analytics, 10EventBus, 05MW-1.29-release (WMF-deploy-2017-04-04_(1.29.0-wmf.19)), 06Services (done): Create mediawiki.page-restrictions-change event - https://phabricator.wikimedia.org/T160942#3160713 (10Pchelolo) 05Open>03Resolved And the event is life! Resolving. [16:20:40] elukey: re: moving that meeting time... [16:21:04] elukey: when is that standup that you wanted to move it before? [16:21:26] 06Analytics-Kanban: Pre-generate mysql ORM code for sqoop - https://phabricator.wikimedia.org/T143119#3160760 (10Nuria) [16:22:01] urandom: I think that it was only conflicting this week, we can keep going and change if needed [16:22:04] wdyt? 
[16:22:34] elukey: up to you guys [16:22:35] 06Analytics-Kanban: Measuring non content pageviews - https://phabricator.wikimedia.org/T162310#3160775 (10Nuria) [16:22:43] 10Analytics, 06Performance-Team: Eventlogging client needs to support offline events - https://phabricator.wikimedia.org/T162308#3160776 (10Milimetric) p:05Triage>03Normal [16:23:02] elukey: happy to do it earlier if that makes it easier on you guys [16:23:39] elukey: everyone on that call but me is europe :) [16:24:09] urandom: maybe we can ask to Filippo next week so we'll have a good EU quorum :) [16:24:26] elukey: sure [16:28:12] 10Analytics, 06Discovery, 10EventBus, 10Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3160810 (10Nuria) a:03Ottomata [16:34:09] 10Analytics, 06Discovery, 10EventBus, 10Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3160832 (10Ottomata) @Smalyshev let's jump in a hangout sometime to discuss this more. Just a few quick points: > Does not have data back more than 7... [16:55:30] joal: sent thoughs about doc proposal, i think is a bit ambitious [16:55:39] joal: to analytics-internal [16:57:03] k [17:09:19] elukey: did you remove /etc/cron.daily/blogreport on eventlog1001? [17:09:22] its gone [17:11:10] ottomata: yep (do you remember yesterday's standup? :) [17:11:22] also noted in the sal, backup in /home/elukey [17:11:26] ah didn't realize you were gonna do it! thought i was ahhh ok [17:11:29] we need to email to tilman [17:11:30] i'll do that [17:11:35] have you already? [17:11:39] nope! [17:11:42] k will do [17:11:43] danke [17:16:42] logging off!! byeee [17:17:19] Bye elukey [17:22:48] AH joal! [17:22:53] i am wrong about spark json schema infer [17:22:58] default sampling ratio is 1.0 [17:23:03] so, it does actually pass through all records [17:23:11] hm [17:23:17] ottomata: WTH then ? 
[17:23:30] joal: i think it just happens that i don't have any records for my older schema that have the field [17:23:33] because the field is optional [17:23:34] which, is ok. [17:23:42] as long as we don't drop any fields on conversion [17:23:54] it is ok if the hive schema we make doesn't match the full merged history of jsonschema revisions [17:23:55] ottomata: nope [17:24:10] as long as the hive schema we make contains fields for all possible seen fields in hdfs [17:24:32] ottomata: nope, we won't drop fields - And, if we build schemas out of newer than never include a field, well, it's not present :) [17:24:38] right [17:24:39] which is fine [17:24:43] indeed [17:24:47] because it would always be NULL if we had it [17:25:00] ottomata: We might not have to go for jsonschema at then end :) [17:25:02] yea [17:25:05] Cool ! [17:25:06] i'm going to keep trying this route then [17:25:13] nice find ! [17:25:24] the .read.json method docs even say [17:25:24] * Unless the schema is specified using [[schema]] function, this function goes through the [17:25:24] * input once to determine the input schema. [17:25:31] nice learning as well, thanks for the pointer (sampling ratio for json schema inference) [17:25:40] https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONOptions.scala#L28 [17:25:57] awesome [17:34:15] ya joal, that means, that i don't need to group DFs by revision in order to find the uber schema [17:34:25] ottomata: Yup ! [17:34:29] the schema from the current load shoudl have everything [17:34:32] ottomata: Spark should do it for ya [17:34:34] and i can pass that + the hive schema to your code [17:34:39] to get any needed alters for past data [17:36:26] ottomata: sounds correct :) [17:36:33] ottomata: any issue with the ordering stuff? 
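The behavior ottomata describes can be illustrated outside Spark: with the default `samplingRatio` of 1.0, `spark.read.json` scans every record, so the inferred schema is the union of all fields actually seen, and an optional field that no record carries simply never appears. A toy stand-in in plain Python (field names invented for illustration):

```python
import json


def infer_schema(records):
    """Infer a flat field -> type-name mapping by scanning *every* record,
    mimicking spark.read.json with its default samplingRatio of 1.0."""
    schema = {}
    for raw in records:
        for field, value in json.loads(raw).items():
            # First type seen wins; the point is the union of field names.
            schema.setdefault(field, type(value).__name__)
    return schema


events = [
    '{"schema_rev": 1, "user": "a"}',
    '{"schema_rev": 2, "user": "b", "optional_note": "hi"}',
]
```

Here `infer_schema(events)` includes `optional_note` because one record carries it, while `infer_schema(events[:1])` does not, which is exactly why ottomata's older-schema field vanished: every record happened to omit it.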
[17:38:13] joal: after we fixed ordering, but i'm not totally sure why [17:38:23] will see what happens [17:40:14] ottomata: I'm having a real lot of fun with graphframes on wikidata :) [17:41:11] 06Analytics-Kanban: All Dashiki Dashboards down - https://phabricator.wikimedia.org/T162320#3161046 (10Nuria) [17:41:52] graphframes joal? [17:41:57] yessir [17:42:02] graph on spark [17:42:38] I'm going to prepare some things to show off next week ;) [17:42:48] ok awesome! [17:43:29] Off for tonight - Later a-team [17:45:48] laters! [17:58:03] ottomata: Just recall I should ask you that [17:58:19] Since we are jessie everywhere on workers, would you mind installing a few packages for me ? [17:59:03] ottomata: And lastly, would it be feasible to also make a package out of my python data? [17:59:39] ottomata: https://gist.github.com/jobar/23226cb8e62d966b8a9f0621c4d86771 [17:59:51] joal will puppetize that ya [17:59:59] package out of data? sounds familiar, what data? [18:00:17] awesome - missing only one thing, the nltk data :( [18:05:33] ottomata: the data is nltk stopwords, downloaded through python by users - You said maybe you could package it ? [18:06:02] * joal waves gently to ottomata, expecting a yes :) [18:08:51] 06Analytics-Kanban: Pagecounts all sites data issues - https://phabricator.wikimedia.org/T162157#3161226 (10Nuria) At the time of loading: www.wikidata.org needs to be changed to wikidata www.mediawiki.org needs to be changed to mediawiki [18:14:18] joal: in order to reload data in cassandra should i drop existing data? 
[18:14:45] joal: there is data for ww.wikidata that will not get overwriiten as it should be wikidata, same for 'mediawiki' [18:14:52] nuria: depends - If there are /wrong/ data in cassandra, as in, incorrect key (www.mediawiki.org for instance), would be better [18:15:01] nuria: TRUNCATE [18:15:23] joal: ok, only for those 2 projects + meta as far as i can tell [18:15:34] joal: i was not going to load the meta spike at all [18:15:42] joal: if it makes sense [18:15:42] nuria: TRUNCATE will drop everything, and you need to reload everything [18:16:06] joal: that seems fine , data is small [18:16:21] nuria: good idea - Not loading meta at load actually makes sense (represent a lot of the time anyway) [18:16:25] coll nuria [18:16:54] joal: will super document and exclude met acompletely [18:16:56] *meta [18:17:00] joal: thank you [18:17:05] nuria: np, thank you ! [18:19:37] (03PS1) 10Nuria: Corrected triming of hostname [analytics/refinery] - 10https://gerrit.wikimedia.org/r/346802 (https://phabricator.wikimedia.org/T162157) [18:25:25] (03PS2) 10Nuria: Correcting loading of pagecounts into cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/346802 (https://phabricator.wikimedia.org/T162157) [18:49:37] joal: don't see why n ot! [18:49:43] sorry, was afk for min [19:06:31] joal: can i truncate table on command line? or do i need to use something else? [19:08:01] anybody left online that can help me figure out how to use intellij with refinery-source? [19:08:03] nuria: maybe? [19:12:53] ottomata: jaja [19:13:27] ottomata: sure, i have not used it for scala for while [19:13:39] ottomata: is it scala that is not working? [19:13:51] i can't even make it just compile in the IDE [19:14:27] nuria: bc real quick? 
[19:14:31] ottomata: ah, ok, 1st thing to check is teh default java version [19:14:35] ottomata: k [19:14:54] ottomata: let me get headset cause I am at the library [19:39:09] urandom: question for you if you are arround [19:40:07] ottomata: no hear no see [19:44:01] nuria: sure! [19:44:18] urandom: to truncate data from a table can i do it from cq command line? [19:44:26] urandom: or do i need to use nodetool? [19:44:48] nuria: yeah, cqlsh [19:45:00] TRUNCATE <table>; [19:45:03] urandom: and that truncate will propagate to all nodes? [19:45:07] yes [19:45:21] pretty sure that will create a snapshot, too [19:45:42] snapshots are hard links, so over time they'll start accumulating space [19:45:52] you can clear them with the nodetool clearsnapshot command [19:46:11] (that would need to be done on all instances) [19:46:13] urandom: ok sir, than you [19:46:20] sure thing [20:17:00] <dependency> [20:17:00] <groupId>org.wikimedia.analytics.refinery.core</groupId> [20:17:01] <artifactId>refinery-core</artifactId> [20:17:01] <scope>system</scope> [20:17:01] <systemPath>/Users/otto/Projects/wm/analytics/refinery-source/refinery-core/target/refinery-core-0.0.45-SNAPSHOT.jar</systemPath> [20:17:01] </dependency> [20:17:03] nuria: ^ [20:17:19] mvn test -Dmaven.surefire.debug.test -Dsuites=org.wikimedia.analytics.refinery.job.HiveFromJsonSuite -pl refinery-job [20:36:43] hello people [20:36:44] 22:28 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [20:36:48] etc.. [20:37:10] ottomata: --^ [20:37:38] nuria: --^ [20:37:49] oo [20:37:51] I saw your messages about truncate/reload [20:38:19] elukey: oh man [20:38:24] elukey: i just did it [20:38:36] elukey: through commandline [20:38:39] don't know context here, what's up?
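urandom's recipe above (TRUNCATE via cqlsh, then `nodetool clearsnapshot` on every instance, because TRUNCATE snapshots the old data and those hard links accumulate space over time) can be captured as a small helper. This is a sketch only, with invented names, that assembles the commands rather than executing them against a real cluster:

```python
def truncate_plan(keyspace, table, instances):
    """Return the CQL to truncate a table plus the per-instance cleanup
    commands for the snapshot that TRUNCATE leaves behind."""
    cql = 'TRUNCATE "{}"."{}";'.format(keyspace, table)
    # clearsnapshot must run on all instances: snapshots are local hard links
    cleanup = ["nodetool -h {} clearsnapshot".format(host) for host in instances]
    return cql, cleanup
```

The CQL statement only needs issuing once (it propagates cluster-wide, as urandom confirms), while the cleanup list has one entry per instance.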
[20:39:17] elukey, ottomata : i just dropped data (after talking to urandom ) [20:39:35] nuria: catching up with the backlog, if you are reloading it is fine :) [20:39:48] elukey: so 404 here sounds right: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} [20:39:54] elukey: i am reloading now [20:40:01] elukey: all but meta [20:40:26] elukey: there were 3 wikis with bad data and joseph though it would be best to truncate and reload [20:40:38] elukey: so we can ignore alarm? [20:40:56] sure sure, I was checking IRC and saw the alerts so thought to ask :) [20:41:01] ok you are loading now? ok [20:41:32] elukey: so SORRY! [20:41:46] elukey: i did not though of alerts on this new data [20:41:56] ottomata: right, reloading now [20:42:31] nuria: it is super fine, I just wanted to make sure that it wasn't cassandra exploding :) [20:42:41] * elukey blames urandom [20:42:43] :P [20:42:56] elukey: ya, so SORRY again cause i should have though about data alarms [20:47:16] AQS alarms are not logged in here, I need to fix it [20:47:23] so there will be more visibility [20:47:39] AQS alarms should be logged in here though [20:47:40] grrr [20:48:00] elukey: ok, will write ticket, no worries [20:48:34] 06Analytics-Kanban: AQS alarms need to log to analytics channel - https://phabricator.wikimedia.org/T162407#3161778 (10Nuria) [20:48:52] :) [20:48:56] thanks! [20:49:28] nuria: let me know when data is reloaded so I'll ack the alarms etc.. [20:50:07] elukey: it will be couple hrs i think you might have gone to bed, but no worries, i will ping someone in ops to acknowledge if i cannot see them [20:50:42] it should recover by itself, I'll alert the other ops people [20:50:49] elukey: k [20:52:33] acked the alarms [21:23:02] 10Analytics, 06Discovery, 10EventBus, 10Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3161867 (10Smalyshev) > let's jump in a hangout sometime to discuss this more. 
Would be glad to. I'll try to set up something next week. > If there... [21:27:02] 10Analytics, 06Discovery, 10EventBus, 10Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3161868 (10Ottomata) >>If so, you may want to consider consuming from Kafka rather than EventStreams. >I am considering this too, but I assume it's mor... [21:40:36] 10Analytics, 06Discovery, 10EventBus, 10Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3161909 (10Smalyshev) > It will be more, a lot more. What language are you working in? The end consumer will be Java, but I don't want to consume the... [21:44:38] 10Analytics, 06Discovery, 10EventBus, 10Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3161911 (10Ottomata) > I'd rather have some intermediary that cleans up, deduplicates, etc. the changes. FYI, neither base Kafka Consumer clients nor... [21:47:29] 10Analytics, 06Discovery, 10EventBus, 10Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3161917 (10Smalyshev) > FYI, neither base Kafka Consumer clients nor EventStreams does this. Yes, I know :) It's one of the decisions I still haven't... [21:55:49] 10Analytics-EventLogging, 06Analytics-Kanban: Research Spike: Better support for Eventlogging data on hive - https://phabricator.wikimedia.org/T153328#3161925 (10Ottomata) Ok, after much spiking, here's the current route we'd like to go down. We are pretty sure we can infer a good enough schema for each impo... 
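The consumer-side deduplication discussed in T161731 above, which neither the base Kafka consumer clients nor EventStreams does for you, is typically a bounded cache of recently seen event ids. A minimal sketch, assuming each event carries a unique id (the class name and default capacity are invented):

```python
from collections import OrderedDict


class Deduplicator:
    """Drop events whose id was already seen, remembering only the most
    recent `capacity` ids so memory stays bounded on an unbounded stream."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.seen = OrderedDict()

    def is_new(self, event_id):
        if event_id in self.seen:
            # Refresh recency so hot ids are not evicted early.
            self.seen.move_to_end(event_id)
            return False
        self.seen[event_id] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # evict the oldest id
        return True
```

Note the trade-off this implies: a duplicate arriving after its id has been evicted will pass through, so the window size has to be tuned to the expected redelivery lag.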
[22:30:48] 06Analytics-Kanban, 13Patch-For-Review: Pagecounts all sites data issues - https://phabricator.wikimedia.org/T162157#3162003 (10Nuria) Realoaded data (took about 2 hours) and now wikidata project column looks correct on db: cassandra@cqlsh> select * from "local_group_default_T_lgc_pagecounts_per_project".dat... [23:02:50] anyone previously setup pyspark locally for doing development? Running into an issue where i can essentially do collect_list(named_struct(...)) in our prod pyspark running 1.6.0, but i downloaded pyspark 1.6.0 from http://spark.apache.org/downloads.html with 'pre-built for hadoop 2.6' and it doesn't seem to include that functionality yet. [23:03:22] comparing prod, we seem to run hadoop 2.6.0-cdh5.10.0, so would seem it should work (unless perhaps cdh backported some functionality or something?) [23:17:23] ahh, it turns out cdh backports things, so cdh hive is not the same as the hive spark is linked to. will figure out how to get a local cdh instead of just pyspark
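For reference, the `collect_list(named_struct(...))` expression in question groups rows by key and gathers one struct per row into a list; the plain-Python equivalent below (with invented column names) shows the shape of the result. The chat's conclusion explains the discrepancy: CDH backports Hive features, so the Hive that CDH 5.10's Spark 1.6 links against is not the same as the one bundled with the stock download.

```python
from collections import defaultdict

# (group_key, struct) pairs standing in for SQL rows; names are invented
rows = [
    ("enwiki", {"page": "Foo", "views": 3}),
    ("enwiki", {"page": "Bar", "views": 1}),
    ("frwiki", {"page": "Baz", "views": 2}),
]


def collect_structs(rows):
    """Rough equivalent of GROUP BY key + collect_list(named_struct(...))."""
    grouped = defaultdict(list)
    for key, struct in rows:
        grouped[key].append(struct)
    return dict(grouped)
```

Each group key maps to the list of structs from its rows, in input order, which is the structure the prod pyspark query produces.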