[02:26:09] Analytics, MediaWiki-User-preferences, Tool-Labs-tools-Database-Queries: Gadget usage statistics for Portuguese Wikipedia - https://phabricator.wikimedia.org/T61480#1837499 (coren)
[09:01:59] Analytics-Wikimetrics, Education-Program-Dashboard: I want WikiMetrics integration with the education dashboard that lets you easily pull reports about courses, institutions, etc. - https://phabricator.wikimedia.org/T92454#1837707 (awight) p:Triage>High
[09:26:54] Analytics-Kanban, operations, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837766 (jcrespo) ``` jynus@db1046:/srv$ df -h | grep /srv /dev/mapper/tank-data 1.4T 1.3T 106G 93% /srv jynus@db1046:/srv$ du -h --max-depth=2 691G ./sqldata/log 119M ./sqldata/mysql...
[09:40:35] Analytics-Kanban: Cassandra Backfill July [5 pts] {melc} - https://phabricator.wikimedia.org/T119863#1837784 (JAllemandou) NEW
[10:38:23] (PS1) Addshore: Add wikipedia ref counting script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255971 (https://phabricator.wikimedia.org/T119607)
[10:38:39] (CR) Addshore: [C: 2 V: 2] Add wikipedia ref counting script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255971 (https://phabricator.wikimedia.org/T119607) (owner: Addshore)
[10:56:05] Analytics-Kanban, operations, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837894 (jcrespo) This is a list of the first record on db1046 for each table: ``` mysql -A -BN -h db1046 log -e "SELECT table_name FROM information_schema.columns WHERE column_name='ti...
[11:08:45] (PS1) Addshore: Add instanceof tracking script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255975 (https://phabricator.wikimedia.org/T119074)
[11:08:48] Analytics-Backlog, Research consulting, Research-and-Data-Archive: Analysis on traffic through the HTTPS transition - https://phabricator.wikimedia.org/T102431#1837912 (Aklapper) Open>Resolved >>! In T102431#1675396, @ellery wrote: > @Aklapper this task is complete. @ellery: Is there a reason t...
[11:09:14] (CR) Addshore: [C: 2 V: 2] Add instanceof tracking script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255975 (https://phabricator.wikimedia.org/T119074) (owner: Addshore)
[11:13:45] (PS1) Joal: Update changelog.md for v0.0.23 deployment [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255976
[11:17:47] (PS1) Addshore: Rename wikidata_ -> wp_ ref script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255977
[11:17:55] (CR) Addshore: [C: 2] Rename wikidata_ -> wp_ ref script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255977 (owner: Addshore)
[11:18:14] (Merged) jenkins-bot: Rename wikidata_ -> wp_ ref script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255977 (owner: Addshore)
[11:20:31] Analytics-Kanban, operations, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837970 (ori) >>! In T119380#1830707, @jcrespo wrote: > I have just one question, when and who decides when new tables are to be created within a schema? At the moment it is done manuall...
[11:31:57] Analytics-Kanban, operations, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837986 (jcrespo) > If we get agreement on T119144, we could potentially drop the clientIp column (varchar(191)) from all tables. Dropping columns is not an investment work persuing. Par...
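The truncated jcrespo comments above are sizing up which tables on db1046 hold the oldest data, by listing each table in the EventLogging `log` database that has a timestamp-like column and checking its first record. As a rough, hypothetical Python sketch of the same kind of check (the host, database name, column name, and pymysql usage are illustrative assumptions, not the exact query from the task):

```python
# Hypothetical sketch: list the oldest row per EventLogging table on db1046.
# Assumes each per-schema table in the `log` database has a `timestamp` column,
# that pymysql is installed, and that ~/.my.cnf grants read access (all assumptions).
import pymysql

conn = pymysql.connect(host='db1046.eqiad.wmnet', db='log',
                       read_default_file='~/.my.cnf')
with conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.columns "
        "WHERE table_schema = 'log' AND column_name = 'timestamp'"
    )
    tables = [row[0] for row in cur.fetchall()]
    for table in tables:
        # The oldest timestamp hints at how much old data a purge could reclaim.
        cur.execute("SELECT MIN(`timestamp`) FROM `{}`".format(table))
        print(table, cur.fetchone()[0])
conn.close()
```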
[11:33:26] PROBLEM - DPKG on gadolinium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:43:04] PROBLEM - puppet last run on gadolinium is CRITICAL: CRITICAL: Puppet has 1 failures
[11:45:15] RECOVERY - DPKG on gadolinium is OK: All packages OK
[11:46:55] RECOVERY - puppet last run on gadolinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:54:24] joal, I just wanted to ask about the scala code
[11:54:41] do you want to continue pairing?
[11:55:19] mforns: depends :)
[11:55:42] mforns: I'd like to move forward some other things, but I'd also like to pair :)
[11:55:53] aha
[11:56:12] if you want, I also have things to do, currently backfilling EL
[11:56:16] mforns: Let's discuss the path forward, then I'll let you code and possibly get back to it when the other things I want to work on are done :)
[11:56:26] ok mforns
[11:56:31] ok
[11:56:42] mforns: backfilling didn't work automatically?
[11:56:52] joal, have you added more modifications to the scala code?
[11:57:01] backfilling... no... :(
[11:57:35] mforns: batcave?
[11:57:43] ok
[12:39:16] (PS1) Joal: Add LRUCache to webrequest spider identification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255986
[13:42:20] (PS2) Joal: Update changelog.md for v0.0.23 deployment [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255976
[13:45:18] (CR) Joal: "Tested on hive with a simple group by for isSpider only: new computation time between 2/3 and 1/2 of the previous one." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255986 (owner: Joal)
[13:52:12] (PS1) Addshore: Add script to count refs by type [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255991 (https://phabricator.wikimedia.org/T119777)
[13:52:51] (PS2) Addshore: Add script to count refs by type [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255991 (https://phabricator.wikimedia.org/T119777)
[14:09:42] (PS3) Addshore: Add script to count refs by type [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255991 (https://phabricator.wikimedia.org/T119777)
[14:11:48] (PS4) Addshore: Add script to count refs by type [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255991 (https://phabricator.wikimedia.org/T119777)
[14:15:21] (PS1) Addshore: Count total statements in statements_per_entity [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255995
[14:16:00] (CR) Addshore: [C: 2 V: 2] Add script to count refs by type [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255991 (https://phabricator.wikimedia.org/T119777) (owner: Addshore)
[14:16:11] (CR) Addshore: [C: 2 V: 2] Count total statements in statements_per_entity [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255995 (owner: Addshore)
[14:28:51] Analytics, Traffic, operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838151 (BBlack) With varnishkafka-1.0.7 deployed and the patch above merged, the webrequest stream now has a correct "client_ip" field that analytics...
[14:29:58] Analytics, Traffic, operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838153 (BBlack) (and, I just read @ottomata's comment above - we can certainly switch the data into "ip" instead of "client_ip". That might be simple...
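Joal's change 255986 above wraps webrequest spider identification in an LRU cache, and his code-review note reports the Hive computation time dropping to between one half and two thirds of the previous run. The real patch is Scala in analytics/refinery/source; purely as a hedged illustration of why memoization helps here (a given user-agent string recurs many times within an hour of webrequests), the same idea in Python might look like this, with the regex and names as placeholders rather than the actual refinery logic:

```python
# Hypothetical sketch of memoizing an expensive user-agent -> spider check.
# The regex below is a placeholder, not Wikimedia's real spider pattern, and the
# actual change (Gerrit 255986) lives in refinery-source (Scala), not Python.
import re
from functools import lru_cache

SPIDER_RE = re.compile(r'bot|crawler|spider|https?://', re.IGNORECASE)

@lru_cache(maxsize=10000)  # bounded cache keyed by the user-agent string
def is_spider(user_agent):
    return bool(SPIDER_RE.search(user_agent or ''))

# Because most user agents repeat heavily within an hour of traffic, the regex
# only runs on cache misses; repeated agents are answered from the LRU cache.
```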
[14:33:07] Analytics, Traffic, operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838154 (JAllemandou) No problem for me to remove and reuse ip, and remove x_forwarded_for :)
[14:36:02] morning
[14:36:18] hey ottomata :)
[14:36:20] How was your weekend?
[14:36:23] Hi ottomata
[14:58:51] Analytics-Tech-community-metrics, Possible-Tech-Projects: Misc. improvements to MediaWikiAnalysis (which is part of the MetricsGrimoire toolset) - https://phabricator.wikimedia.org/T89135#1838188 (01tonythomas)
[14:58:59] Analytics-Tech-community-metrics, Possible-Tech-Projects, Epic: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#1838192 (01tonythomas)
[15:59:21] Hi ebernhardson
[16:02:33] hi!
[16:07:30] (CR) EBernhardson: Add page_id to webrequest and pageview_hourly (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/255318 (owner: EBernhardson)
[16:07:36] (PS7) EBernhardson: Add page_id to webrequest and pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/255318
[16:13:43] ebernhardson: I didn't even have to tell you :)
[16:13:51] ebernhardson: thanks for the new patch
[16:14:09] (CR) Joal: [C: 2 V: 2] "Looks good to me!" [analytics/refinery] - https://gerrit.wikimedia.org/r/255318 (owner: EBernhardson)
[16:15:45] joal: :) thanks for merging
[16:16:09] ebernhardson: Deployment planning today, hopefully successful deploy tomorrow :)
[16:17:30] Analytics-Kanban, operations, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1838344 (Milimetric) So there seem to be two threads here. Table level partitioning seems to me to complicate replication to the slaves and complicate application logic. It doesn't seem...
[16:19:49] Analytics, Traffic, operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838361 (BBlack) Ok, the old "ip" field now has the X-Client-IP data in the webrequest logs. The remaining pending patches here are: the (updated) on...
[16:22:52] Analytics-Kanban, operations, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1838382 (jcrespo) @Milimetric: deleting data will not immediately solve the problem, as deleting data logically doesn't mean space is freed from disk. Hence the partitioning suggestion....
[16:28:35] Analytics-Kanban, operations, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1838414 (jcrespo) BTW, I found the acceleration issue: the automatic purge process was failing since some tables had been deleted.
[16:30:12] Analytics-Kanban, Database: Delete obsolete schemas {tick} [5 pts] - https://phabricator.wikimedia.org/T108857#1838427 (jcrespo) Resolved>Open This task caused the purging process to fail (T119380). Table purge_schedule has to be updated to reflect the dropped tables.
[16:30:13] Analytics-Kanban: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#1838431 (jcrespo)
[16:44:43] ottomata: as per https://gerrit.wikimedia.org/r/#/c/256002/, should we update refinement not to compute client_ip but copy it from ip?
[16:55:26] (PS1) Joal: Deprecate client_ip functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256024
[17:01:50] a-team: coming to standup
[17:01:56] ottomata: standup?
[17:02:37] (PS1) Joal: Remove client_ip computation from refine [analytics/refinery] - https://gerrit.wikimedia.org/r/256027
[17:05:34] (CR) Ottomata: "I don't think either of these functions are deprecated. We just won't use them in webrequest jobs. They are still useful, if you have th" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256024 (owner: Joal)
[17:05:43] Analytics-Kanban, Patch-For-Review: Create celery chain or other organization that handles validation and computation {kudu} [8 pts] - https://phabricator.wikimedia.org/T118308#1838677 (madhuvishy)
[17:05:44] Analytics-Kanban: Implement the logic of each node in the celery chain {kudu} [5 pts] - https://phabricator.wikimedia.org/T118309#1838676 (madhuvishy)
[17:05:47] (PS8) EBernhardson: Implement ArraySum UDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254452
[17:05:53] Analytics-Kanban, Patch-For-Review: Create celery chain or other organization that handles validation and computation {kudu} [13 pts] - https://phabricator.wikimedia.org/T118308#1838679 (madhuvishy)
[17:06:26] Analytics-Backlog, Wikimedia-Developer-Summit-2016: Developer summit session: Pageview API overview - https://phabricator.wikimedia.org/T112956#1838690 (Milimetric) > I would like to use pageview data excluding spiders and bots but I need some kind of bulk download for all projects like it was before or...
[17:06:42] (CR) EBernhardson: Implement ArraySum UDF (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254452 (owner: EBernhardson)
[17:23:38] Analytics-Kanban, Services: Response times pageview API. Dashboard . - https://phabricator.wikimedia.org/T119886#1838795 (Nuria) NEW
[17:23:43] Analytics-Kanban, Services: Response times pageview API. Dashboard . - https://phabricator.wikimedia.org/T119886#1838805 (Nuria) a:Nuria
[17:26:29] (CR) Ottomata: [C: 1] Remove client_ip computation from refine [analytics/refinery] - https://gerrit.wikimedia.org/r/256027 (owner: Joal)
[17:28:26] (Abandoned) Joal: Deprecate client_ip functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256024 (owner: Joal)
[17:48:43] Analytics-EventLogging, Analytics-Kanban: EventLogging Kafka consumer stops consuming after Kafka metadata change - https://phabricator.wikimedia.org/T118315#1838961 (Nuria)
[17:49:46] Analytics-Kanban, CirrusSearch, Discovery, operations, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1838966 (Milimetric) a:Ottomata
[17:50:20] Analytics-Kanban, CirrusSearch, Discovery, operations, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old {hawk} - https://phabricator.wikimedia.org/T118527#1838970 (Milimetric) p:Triage>High
[17:52:52] Analytics-EventLogging, Analytics-Kanban: EventLogging Kafka consumer stops consuming after Kafka metadata change. See if upgrade fixes it. - https://phabricator.wikimedia.org/T118315#1838990 (Nuria)
[17:52:58] Analytics-Kanban, Services: Response times pageview API. Dashboard . - https://phabricator.wikimedia.org/T119886#1838992 (GWicke) This data from the perspective of the frontend RESTBase cluster is already available in graphite, under restbase.v1_metrics*.
[17:57:58] halfak, any idea where Morten is?
[17:58:42] Ironholds, has he been away for a while?
[17:58:55] He's in Seattle, so I can't go throw a snowball at him
[18:00:13] ahh
[18:00:22] you have snow already?!
[18:00:33] Can ping him more directly if you're looking for him.
[18:00:38] First major snowfall was today.
[18:00:45] ~4 inches of heavy stuff
[18:00:49] * halfak went shoveling.
[18:04:54] Analytics-EventLogging, Analytics-Kanban: EventLogging Kafka consumer stops consuming after Kafka metadata change. See if upgrade fixes it. [13 pts] - https://phabricator.wikimedia.org/T118315#1839067 (Ottomata)
[18:05:06] Analytics-Kanban, Services: Response times pageview API. Dashboard . - https://phabricator.wikimedia.org/T119886#1839068 (Nuria) cache-control: max-age=3600, s-maxage=3600 should be added to AQS per @GWicke
[18:05:14] Analytics-EventLogging, Analytics-Kanban: EventLogging Kafka consumer stops consuming after Kafka metadata change. See if upgrade fixes it. {oryx} [13 pts] - https://phabricator.wikimedia.org/T118315#1839069 (Milimetric)
[18:05:29] Ironholds, shall I ping Nettrom directly?
[18:05:59] halfak, naw, we're good, just an idle query :)
[18:06:01] Analytics-Kanban, Services: Response times pageview API. Dashboard . - https://phabricator.wikimedia.org/T119886#1839071 (Nuria)
[18:06:11] Oh! Gotcha.
[18:12:30] Analytics-Kanban, Services: Response times pageview API. Dashboard . [8] - https://phabricator.wikimedia.org/T119886#1839084 (Nuria)
[18:13:14] joal, :]
[18:15:26] Analytics-Kanban: Reformat pageview API responses to allow for status reports and messages {slug} - https://phabricator.wikimedia.org/T117017#1839105 (Milimetric) @Ironholds: so far we're still of the same opinion. Filling the holes and/or adding more information to these responses should be the job of a hig...
[18:21:25] mforns: ?
[18:23:29] joal, are you staying for a while or leaving already?
[18:23:41] do we have time for scala?
[18:23:52] I'm still here, but I'll spend time with nuria on the deployment plan
[18:23:56] I see
[18:23:58] ok
[18:24:01] mforns: You can go for scala :)
[18:24:25] joal, I'll do EL backfilling and if I have time, will go for scala, ok
[18:24:43] mforns: If no scala, no bother, we'll do it tomorrow together :)
[18:26:48] kevinator: Can you give me a minute?
[18:26:51] Analytics-Kanban: Reformat pageview API responses to allow for status reports and messages {slug} - https://phabricator.wikimedia.org/T117017#1839139 (Ironholds) Multiple requests to the API? Just hit it over and over again until it 404s to identify the range of data?
[18:29:01] joal: omw to batcave
[18:29:07] k nuria
[18:29:58] nuria: you can't hear me :)
[18:46:58] (PS2) Nuria: Add LRUCache to webrequest spider identification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255986 (owner: Joal)
[18:49:30] (CR) Nuria: [C: 2] Add LRUCache to webrequest spider identification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255986 (owner: Joal)
[18:53:14] Analytics-Kanban: Remove avro schema from jar [1] - https://phabricator.wikimedia.org/T119893#1839308 (Nuria) NEW
[19:02:27] dcausse: yt?
[19:02:27] milimetric: around?
[19:06:23] Analytics-Backlog: Create cron on 1002 to remove CirrusSearchRequest partitions - https://phabricator.wikimedia.org/T119897#1839364 (Nuria) NEW
[19:09:16] ottomata: are you around?
[19:09:37] ja
[19:10:06] reviewing tomorrow's deploy with nuria: do you think we should add a field in the webrequest table for network_origin (now that we have the functions)?
[19:10:25] ottomata: --^
[19:10:35] ebernhardson: yt?
[19:10:52] nuria: yup
[19:11:23] ebernhardson: did you guys submit a patch for the camus.properties file in puppet for the new cirrus search request runs?
[19:11:47] ebernhardson: makes sense?
[19:12:45] ebernhardson: see: https://github.com/wikimedia/operations-puppet/tree/production/modules/camus/templates
[19:12:51] nuria: hmm, i don't think we did unless it was in one of david's patches, but i'm not seeing one
[19:13:00] joal: sorry on phone, few mins...
[19:13:07] np ottomata
[19:13:53] ebernhardson, dcausse: i think we need the properties file that we tested with in puppet
[19:16:11] nuria: the only patch he has there is https://gerrit.wikimedia.org/r/#/c/252432/2/modules/camus/templates/mediawiki.erb but that only adjusts camus for the schema id
[19:17:02] joal: sure why not, maybe ask bd808 (he submitted it, right?) if he'd like that
[19:17:56] ottomata: nuria thinks it's not needed, let's have bd808 decide :)
[19:18:04] k
[19:18:12] ottomata: can you review this one: https://gerrit.wikimedia.org/r/#/c/252432/2
[19:18:26] ottomata: I think it's ready as dcausse tested it a while back
[19:18:59] ottomata: i'd rather not add the bd808 ip dimensions
[19:19:22] ottomata: cause they are really meant to help when querying but i do not think they are of global interest to everyone
[19:19:23] ok cool
[19:19:23] i will merge
[19:19:31] nuria: that's fine with me too
[19:19:36] i think i don't have an opinion on that one :)
[19:19:40] ottomata: then let's proceed
[19:21:15] ottomata: about the properties, will it break not having the new camus jar?
[19:21:21] nuria: properties updated on an27, they will be used during the next camus run
[19:21:22] oh
[19:21:28] uhhh, dunno! will it?
[19:21:33] :D
[19:21:36] ottomata: you ARE teh expert
[19:21:38] *the
[19:21:38] Dunno either
[19:21:39] ajaja
[19:21:48] but i do not think it's an issue
[19:21:53] i'm not up to date on what has been merged or deployed
[19:21:54] as it adds one property
[19:21:54] oh
[19:22:01] ¯\_(ツ)_/¯
[19:22:02] that otherwise will be ignored
[19:22:03] no, the jar doesn't use these properties yet, right?
[19:22:06] ya
[19:22:08] then it will probably be fine
[19:22:14] i'll tail the logs and let you know if i see anything
[19:22:15] no, not until we deploy the latest candidate
[19:22:32] ok ottomata, I'll deploy the new jar tomorrow and will monitor
[19:22:35] k
[19:24:56] ebernhardson: there is no data on the topic you guys need to save?
[19:25:23] nuria, joal: I think I agree with nuria that my ip origin function is probably not of general interest. I'll use it in the api specific tables when I finally get them set up
[19:25:39] I think I checked with dcausse and (while changes are backwards compatible) there is no need to preserve data, correct?
[19:25:44] Thanks bd808 for the heads up
[19:25:49] bd808: thank you
[19:31:34] sometimes it looks like I am writing from an enigma machine....
[19:34:05] Analytics-Backlog, Wikimedia-Developer-Summit-2016: Developer summit session: Pageview API overview - https://phabricator.wikimedia.org/T112956#1839444 (Nuria) @Symac: The poor response times are likely due to lack of caching and thus going to storage every time. We are working on fixing that and will u...
[19:34:11] nuria: we can lose the data, we haven't started using it yet. I thought david set it up such that that wasn't necessary though
[19:34:23] ebernhardson: correct, just confirming
[19:34:34] cc joal so he is in the loop
[19:35:18] ottomata: if I want to make an http request between stat100[123] and wdqs1001.eqiad.wmnet what might I be falling over?
It looks like the firewall is open on wdqs1001, and I guess any outbound stuff is allowed from the stat servers, is there some extra firewall / segregation in the middle I am missing? :)
[19:40:37] yeah, no outbound stuff is allowed from the stat servers :p
[19:40:49] the analytics cluster is firewalled off from the prod servers
[19:41:11] Analytics-Backlog: Move App session data to 7 day counts - https://phabricator.wikimedia.org/T117637#1839469 (Nuria) p:Triage>Normal
[19:41:14] you might be able to get away with setting the http proxy
[19:41:14] though
[19:41:15] gm
[19:41:16] hm
[19:41:29] https://wikitech.wikimedia.org/wiki/Http_proxy
[19:41:30] not sure
[19:43:39] hmmm, okay *has a think*
[19:45:17] yeah, I could use the webproxy thing! then I guess I just need to change the rule at the other end to allow stuff from the webproxy / look at the X-Forwarded-For ip
[19:45:30] Analytics-Kanban: Reformat pageview API responses to allow for status reports and messages {slug} - https://phabricator.wikimedia.org/T117017#1839488 (Nuria) @Ironholds: I think part of the problem can be fixed with docs. We should certainly document that the pageview API will never have data older than May 2...
[19:49:02] ottomata: yt?
[19:49:48] ottomata: is there a way to access the hive CLI outside 1002? I think not ... but just triple checking...
[19:52:36] Analytics-Kanban: Reformat pageview API responses to allow for status reports and messages {slug} - https://phabricator.wikimedia.org/T117017#1839501 (Ironholds) Agreed, never older than May, but are you planning on clearing old data out as new data comes in for storage purposes?
[19:53:00] nuria: ja
[19:53:08] um, on analytics1027, ja
[19:53:20] i mean, anywhere it is installed and has the right configs and can talk to hiveserver/hadoop
[19:53:21] :)
[19:58:57] Analytics-Cluster, Collaboration-Team-Current, Database: Replicate Echo tables to analytics-store - https://phabricator.wikimedia.org/T115275#1839508 (Neil_P._Quinn_WMF)
[20:17:59] ottomata: see https://phabricator.wikimedia.org/T119108
[20:18:26] ottomata: is there a way to access hive w/o accessing it through a machine that has access to private data?
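For addshore's stat100x → wdqs1001 question above, the suggestion is to route the request through the HTTP proxy documented at https://wikitech.wikimedia.org/wiki/Http_proxy, since direct outbound traffic from the stat servers is blocked. A minimal Python sketch of that plan, assuming the `requests` library is available, with the proxy host/port and the wdqs endpoint path as unverified placeholders:

```python
# Minimal sketch, assuming the webproxy documented on Wikitech; the proxy
# host/port and the wdqs endpoint path below are assumptions, not verified values.
import requests

proxies = {
    'http': 'http://webproxy.eqiad.wmnet:8080',
    'https': 'http://webproxy.eqiad.wmnet:8080',
}

# Hypothetical SPARQL request against wdqs1001; the far end would still need to
# allow traffic coming from the proxy, as discussed above.
resp = requests.get(
    'http://wdqs1001.eqiad.wmnet/bigdata/namespace/wdq/sparql',
    params={'query': 'SELECT * WHERE { ?s ?p ?o } LIMIT 1', 'format': 'json'},
    proxies=proxies,
    timeout=30,
)
print(resp.status_code)
```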
[20:19:20] ah
[20:20:00] nuria: https://phabricator.wikimedia.org/T89887
[20:20:07] nuria, so
[20:20:11] there is an 'analytics-users' group
[20:20:13] that we basically don't use
[20:20:28] but, this group will allow access to hadoop/hive, but not access to the webrequest table
[20:20:35] since it is group-readable by analytics-privatedata-users
[20:20:38] but
[20:20:47] since currently we only give access to hadoop on stat1002
[20:21:02] and there aren't clean perms on stuff in /a
[20:21:13] anyone with stat1002 access can read private logs via the local filesystem
[20:21:23] 2 possible solutions:
[20:21:33] - 1. make hadoop accessible from stat1003 via analytics-users
[20:21:45] - 2. fix permissions on stat1002 so that local files are not accessible
[20:21:51] i think 2 is more difficult than it sounds
[20:22:01] there are a lot of files there, and many things create those files
[20:26:32] nuria: i would like to talk about some eventlogging stuff if you have some batcave minutes
[20:26:37] (i have a 1:1 with kevin in 30 mins)
[20:35:07] joal: didn't realize, but bblack has merged this
[20:35:07] https://gerrit.wikimedia.org/r/#/c/256002/1
[20:35:21] should be fine, just need to deploy your patch to no longer do the logic
[20:36:28] (CR) Ottomata: [C: 1] Update changelog.md for v0.0.23 deployment [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255976 (owner: Joal)
[20:54:47] just starting out playing with pyspark and hive integration. just using the shell (pyspark --master yarn --deploy-mode client) and running sqlCtx.sql("select project, sum(view_count) as view_count from wmf.pageview_hourly where year=2015 and month=11 and day=11 and hour=11 group by project") is taking several minutes, where the same query in the hive cli takes a total of 40s
[20:54:53] am i perhaps doing something wrong?
[20:55:50] i'm up to 28 minutes of cpu time on the pyspark version, doesn't seem right
[20:55:57] ebernhardson: spark does not auto-scale, so dunno
[20:56:02] how many mappers does hive make for you?
[20:56:11] ottomata: back, let me know if you want to talk
[20:56:15] with spark you have to tell it how many executors to run
[20:56:23] nuria: k, let's talk after my 1:1 with kevin
[20:56:23] ottomata: 7 mappers, 2 reducers
[20:56:34] aye, default for spark is 2 executors (processes)
[20:56:35] ottomata: the odd thing though is across all that hive used 40s of cpu time (not real time, but aggregate)
[20:56:42] ottomata: k
[20:56:42] dunno if that is the problem, but ja
[20:56:42] and pyspark is up to 30 minutes
[20:56:43] hm...
[20:56:45] hmmmmmm
[20:57:04] ebernhardson: not sure, but I have to say we haven't been impressed with the hive spark integration in our current version
[20:57:13] we like SparkSQL/dataframes
[20:57:17] but the hive integration was flaky
[20:57:26] you can try to load the files from hdfs directly into a dataframe
[20:57:32] hmm, ok.
based on my reading of the docs i can just read the parquet files directly instead
[20:57:35] yes
[20:58:49] madhuvishy: I accidentally closed IRC, sorry, what's up
[20:59:03] ebernhardson: here is an example with sequenceFile/json, but parquetFile should work similarly
[20:59:05] https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Spark
[20:59:06] Edit events with SparkSQL in Spark Python (pyspark):
[20:59:51] thanks
[21:00:02] ebernhardson: this is scala
[21:00:03] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala#L188
[21:00:05] but same idea
[21:00:11] or, https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala#L212
[21:12:03] milimetric: hmmm trying to remember what
[21:12:25] :)
[21:12:30] milimetric: oh
[21:12:41] Amanda replied saying the name program metrics makes sense
[21:12:52] was wondering if I should go ahead and change the code
[21:14:18] (PS1) Addshore: Add inline TODOs [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256117
[21:14:19] madhuvishy: yeah, totally
[21:14:25] milimetric: cool
[21:14:29] (CR) Addshore: [C: 2 V: 2] Add inline TODOs [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256117 (owner: Addshore)
[21:14:31] sounded like an even *more* global name :)
[21:14:39] ha ha
[21:15:09] GrantMakingMetrics?
[21:15:30] GrantMakingGlobalProgramMetrics
[21:18:36] oh madhuvishy I thought it could just be ProgramMetrics
[21:19:01] milimetric: he he yes that's what i am going to change it to
[21:19:02] or GrantProgramMetrics if you want to be more specific
[21:19:19] oh ok
[21:19:38] yeah GrantProgramMetrics sounds good too
[21:28:36] i'm going to have to dig into a bunch more documentation. `source = sqlCtx.parquetFile("hdfs://analytics-hadoop/wmf/data/wmf/pageview/hourly")` from pyspark also takes ages to run (this will be aggregating over a week's worth of data, so selecting individual directories or even months won't work right)
[21:31:00] ebernhardson: in either hive or spark, you'll have to specify the partitions you want
[21:31:09] you should just compute which days are 'last week'
[21:31:14] and load it up that way
[21:31:55] madhuvishy: give ebernhardson some tips :)
[21:31:57] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala#L193
[21:31:57] :)
[21:32:17] ok nuria want to batcave?
[21:32:26] ottomata: i was thinking that would happen after initializing the variable that points to the directory. clearly i have a lot to learn about this :)
[21:32:30] k
[21:32:54] give me 2 mins
[21:33:05] yeah ebernhardson it isn't automatic
[21:33:05] just like in hive, you have to tell it what data to load, so you don't hit all of it
[21:34:29] yeah I wish it was smarter about it
[21:35:36] ebernhardson: it could also be slow because of too low driver or executor memory
[21:35:44] you can pass those in the cli
[21:36:03] i think the defaults are like 512MB
[21:37:01] like this:
[21:37:12] spark-submit --master yarn --driver-memory 2g --num-executors=8 --executor-cores=1 --executor-memory=2g --class
[21:38:11] ebernhardson: http://spark.apache.org/docs/1.3.0/configuration.html
[21:38:33] madhuvishy: eventually it hit the GC overhead limit and started spewing stack traces, so that might be it as well.
it seems sensible to follow that AppSessionMetrics lead and generate a list of directories rather than passing the top-level one though
[21:39:12] ebernhardson: ya definitely ramp up memory then - it happened a lot for the session metrics job because we were loading 30 days of daya
[21:39:14] data
[22:20:46] (PS9) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the ProgramMetrics API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308)
[22:58:08] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1839888 (Ottomata) @gwicke and I discussed the schema/revision in meta issue in IRC today. He had an idea that I quite like!...
[23:21:30] mforns: I added hash-parsing to the API demo, wanted to be able to link directly like this: https://analytics.wmflabs.org/demo/pageview-api/#Beirut,Paris
[23:21:49] milimetric, awesome!
[23:22:21] mforns: the change in case you wanna update the gist / code review:
[23:22:24] https://www.irccloud.com/pastebin/XPf2Y9l1/
[23:22:39] thx
[23:25:34] milimetric: excellent!
[23:26:06] :) it's not fancy like dashiki but I wanted to link from the blog post
[23:26:51] milimetric: i was about to do that, just mentioned to marcel the other day how bookmarks make your life so much easier
[23:27:52] yes, I'm kind of worried this leads us on the path of people filing bugs and feature requests for the demo, but we'll see :)
[23:27:53] milimetric: but you need to remove those if you remove them from the search bar, right?
[23:27:54] mforns: I updated the gist
[23:28:13] nuria, yes, or add new ones if you add them to the search bar
[23:28:25] milimetric: ha ha i was just about to ask - date ranges on the url?
[23:28:27] but we can implement that later, I already updated the gist
[23:28:31] nuria: yeah, but that's what I mean, I kinda don't want it to get too fancy
[23:28:38] I was just doing it specifically to allow direct linking
[23:28:59] makes sense
[23:29:01] see?!! /me closes vim and forgets where on limn1 this code is
[23:29:19] :D
[23:29:32] project too please :P
[23:29:42] hehehe
[23:59:15] a-team, good night, see you tomorrow!
[23:59:23] have a nice night
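Pulling together the pyspark advice from the last stretch of the log — enumerate only the partition directories you need, as AppSessionMetrics does, and give the job more executors and memory than the 2-executor / ~512MB defaults — a minimal, hypothetical sketch might look like the following. It assumes the Spark 1.3-era pyspark shell (where `sqlCtx` already exists); the base path, dates, and per-hour partition layout are taken from the snippets quoted above rather than tested:

```python
# Minimal sketch for the pyspark shell started with more resources, e.g.:
#   pyspark --master yarn --driver-memory 2g --num-executors 8 \
#           --executor-cores 1 --executor-memory 2g
# Assumes Spark ~1.3 (sqlCtx predefined) and the refinery partition layout
# year=/month=/day=/hour= under the base path; dates below are illustrative.
from datetime import date, timedelta

base = "hdfs://analytics-hadoop/wmf/data/wmf/pageview/hourly"

# Build explicit per-hour paths for one week instead of pointing Spark at the
# top-level directory, so only the needed partitions are read.
start = date(2015, 11, 11)
paths = [
    "{}/year={}/month={}/day={}/hour={}".format(base, d.year, d.month, d.day, hour)
    for d in (start + timedelta(days=i) for i in range(7))
    for hour in range(24)
]

# parquetFile accepts multiple paths; register the result so plain SQL works.
df = sqlCtx.parquetFile(*paths)
df.registerTempTable("pageview_hourly_week")

weekly = sqlCtx.sql(
    "SELECT project, SUM(view_count) AS view_count "
    "FROM pageview_hourly_week GROUP BY project"
)
weekly.show()
```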