[00:02:22] madhuvishy: there's nothing built in, but you could make a method in fixtures.py to do it. You'd have to probably manually insert the whole chain of dependent objects [00:02:33] (WikiUser, CohortWikiUser, and Cohort) [00:02:42] milimetric: ya i think i'm just gonna manually change the validation_info [00:02:56] but madhuvishy I wouldn't do that, you can just test the form with a list of valid and invalid user names, no? [00:03:15] milimetric: yup! i'll do that for the controller test [00:03:16] madhuvishy: yes, that's a good idea. You can do that in setup, like common_cohort_1, then update [00:03:36] i thought this is okay just for checking the RunProgramMetrics class [00:03:49] I'm saying, if you cover the invalid cohort response path with the controller test, there's no need to do it as a unit test [00:03:56] hmmm [00:03:57] ok, as you want [00:03:58] okay [00:04:05] if this doesn't work out, i'll do that [00:30:13] (PS12) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the ProgramMetrics API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [01:03:43] Where's the code for the UA parser now? https://gerrit.wikimedia.org/r/#/c/238139/ shows it being deleted but doesn't say where it is… [01:06:51] James_F: I will check, but i think we use the upstream packages now [01:07:07] https://github.com/ua-parser [01:10:32] Ah. [01:10:49] Thanks! [01:16:05] James_F: Based on this thread, https://phabricator.wikimedia.org/T106134 I'm pretty sure that's what happened. We use to fork it before for packaging reasons, we don't anymore [01:16:12] * James_F nods. [01:16:25] I'm thinking of adding it to MW-core. [01:16:33] we have it in archiva [01:16:51] No worries, they publish a composer package [01:16:57] ua-parser:1.3.0-wmf2 [01:16:59] oh cool [01:17:13] of course, you'd need that for MW-core [01:17:29] * James_F nods. [01:17:31] Thank you! [01:17:46] no problem! [01:18:50] Analytics-Backlog, Design-Research: Bot to call global metrics to event page {kudu} - https://phabricator.wikimedia.org/T120330#1851192 (Abit) NEW a:Capt_Swing [01:26:47] (PS4) Milimetric: Add pages edited metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/174773 (https://bugzilla.wikimedia.org/73072) (owner: Mforns) [01:28:23] (CR) Milimetric: "madhuvhishy: take a look at this and see if you can use it in your patch. I'm not 100% sure if it works with Marcel's rollup thingy in al" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/174773 (https://bugzilla.wikimedia.org/73072) (owner: Mforns) [01:28:49] madhuvishy: it wasn't so bad, took like an hour with breaks for dinner and stuff [01:29:04] milimetric: okay :) i'll look at it [01:29:15] i'm trying to figure out the cohort length stuff [01:29:15] cool. have a nice night [01:29:19] oh [01:29:22] what's up with it? [01:29:35] __len__ is implemented in CohortStore [01:29:42] right, but that's like all the users, right? [01:29:46] I think that doesn't deduplicate [01:30:18] you can make an alternate method in CohortStore or the service is probably better [01:30:40] yeah i should get the original size of the cohort by deduplicating it right? [01:31:17] cool okay [01:31:24] i'll figure it out :) [01:32:58] yeah, madhu, in get_validation_info I'd just run another query [01:33:11] well... 
maybe take an optional parameter there [01:33:23] because for production those tables are getting horribly inefficient [01:33:28] (we have to clean that up sometime) [01:33:36] hmmm [01:33:58] so like get_validation_info(cohort, session, unique_users=True) [01:34:15] and if unique_users: run a separate query to get unique user names across all projects [01:34:24] ya alright [01:34:28] makes sense [01:34:37] thanks :) [01:54:22] (PS13) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the ProgramMetrics API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [09:34:41] hi a-team :] [09:48:36] Hey mforns :) [09:48:44] What's up ? [10:24:35] hey joal! was afk [10:25:08] I managed to make the fixed-point alg to work [10:25:52] and I have had some thoughts on fieldStats and choosing the dimension to anonymize [10:26:07] if you want we can talk about that [10:26:45] (PS1) Addshore: Track AVG and MAX sitelinks per item [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256910 (https://phabricator.wikimedia.org/T120357) [10:26:48] (PS1) Addshore: Add counts of entites with sitelinks & statements [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256911 [10:26:59] (CR) Addshore: [C: 2 V: 2] Track AVG and MAX sitelinks per item [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256910 (https://phabricator.wikimedia.org/T120357) (owner: Addshore) [10:27:07] (PS2) Addshore: Track AVG and MAX sitelinks per item [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256910 (https://phabricator.wikimedia.org/T120357) [10:27:11] (CR) Addshore: [V: 2] Track AVG and MAX sitelinks per item [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256910 (https://phabricator.wikimedia.org/T120357) (owner: Addshore) [10:27:23] (PS2) Addshore: Add counts of entites with sitelinks & statements [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256911 [10:27:28] (CR) Addshore: [C: 2 V: 2] Add counts of entites with sitelinks & statements [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/256911 (owner: Addshore) [11:24:48] Hey mforns :) [11:24:52] Sounds good :) [11:25:00] hi joal [11:25:03] Batcave for a minute ? [11:25:07] do you want to batcave? [11:25:08] ok [11:44:44] Analytics-Cluster, Traffic, operations, Patch-For-Review: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1851937 (fgiunchedi) confirmed, thanks @bblack! [12:45:54] mforns: I haz dataz ! [12:46:00] :D [12:46:11] joal, show me ze data! [12:46:15] :D [12:46:18] batcave ? [12:46:22] yes [12:46:23] :] [13:32:53] Analytics-Tech-community-metrics, DevRel-December-2015, Patch-For-Review: Kill out-of-date scr-organizations-summary in korma - https://phabricator.wikimedia.org/T119756#1852174 (Aklapper) Open>Resolved http://korma.wmflabs.org/browser/scr-organizations-summary.html is a 404 now and the items is... [15:11:41] hey nuria, yt? [15:12:24] Hi ottomata, I have the same question :) [15:12:27] nuria: ? [15:12:38] heheh [15:12:48] How is it for you ? 
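A rough sketch of the get_validation_info(cohort, session, unique_users=True) idea discussed above ([01:33:58]–[01:34:15]): run one extra query that deduplicates user names across all projects. The WikiUserStore / CohortWikiUserStore model and column names below are assumptions based on the models mentioned at [00:02:33], not the actual wikimetrics code.

    from sqlalchemy import distinct, func

    def get_validation_info(cohort, session, unique_users=False):
        # ... the existing valid / invalid counts stay as they are ...
        info = {}
        if unique_users:
            # Deduplicate mediawiki user names across all projects to get
            # the "original size" of the cohort (hypothetical model names).
            info['unique_users'] = session.query(
                func.count(distinct(WikiUserStore.mediawiki_username))
            ).join(
                CohortWikiUserStore,
                CohortWikiUserStore.wiki_user_id == WikiUserStore.id,
            ).filter(
                CohortWikiUserStore.cohort_id == cohort.id,
            ).scalar()
        return info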
[15:13:48] ottomata: I have found an interesting point on avro and hive: when writing avro data through hive, it's not naturally compressed [15:13:55] is goooOoOOod, working just a half day today, taking half of one of my TG holiday days today [15:14:03] OooOk [15:14:04] awesome :) [15:14:11] interesting [15:14:23] I noticed playing with discovery data [15:14:40] I re-partitioned their data --> bumped the datasize by 3 ! [15:15:14] Investigation --> Camus uses deflate compression for avro by default [15:15:27] And some special settings need to be set on hive to enable that [15:15:39] HMmMm oh yeah i remember having to modify the json outputter I wrote to just use the default mr setting... [15:15:40] Like that you know everything :) [15:16:18] i think... [15:16:26] Also on EL, I was backfilling yesterday's discussion on the chan, and I have a question [15:16:28] looking [15:16:32] ja? [15:17:08] So, when inserting on one big schema, everything is blocked (for one processor), correct ? [15:17:55] Like, inserting a few thousand events on a big table takes a long time - ok, but during that, we could also continue to work (spawn a thread per inserting batch) ? [15:17:59] ah yeah, just for sequence file output [15:18:01] i fixed it [15:18:01] https://github.com/linkedin/camus/blob/8210820797b3757dce3be35a06fa87d83a667988/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common/SequenceFileRecordWriterProvider.java#L28 [15:18:33] right ottomata :) [15:18:48] i guess we should fix that in our Avro writer too :/ [15:18:49] hm [15:18:51] hmm, um [15:18:52] yes [15:18:55] for EL yes, that is right [15:19:03] so, first of all, i don't think it should take 20+ seconds to insert [15:19:04] something is wrong there [15:19:07] oook [15:19:09] but, ja, if an insert does take that long [15:19:17] it blocks the mysql insert thread [15:19:28] Like the reason it's blocked is uncool (IO blocking) [15:19:37] so, we could parallelize that in a few ways: make a thread for each insert [15:19:38] right [15:19:40] and have a small queue [15:19:43] 4 threads or something [15:19:43] or [15:19:52] just parallelize more consumer processes, which is kiiinda the same, but not really [15:20:03] it's easier because we did the work with pykafka balanced consumer [15:20:13] can just spawn more processes in the same consumer group, no configuration or new code needed [15:20:15] yeah, not really : every consumer dequeues a lot, adding to the possible lag [15:20:22] yeah, exactly [15:20:30] hm [15:20:35] so each consumer will still get big inserts, and will still block [15:20:48] Adding consumers would work with ensuring very small memory footprint :) [15:21:01] threads you mean? [15:21:03] I think we should fix inside consumer [15:21:04] for inserting? [15:21:06] yeah [15:21:09] i mean... [15:21:12] it might not be worth it? [15:21:14] we should fix mysql [15:21:15] :) [15:21:21] but ja, that would be a better solution [15:21:43] I meant: if we reduce the number of batches queued as well as the number of events in a batch, having many consumers gets closer to spawning threads [15:22:18] As per mysql: You'll never fix it, it's a DATABASE ! [15:22:20] :D [15:22:27] hmmm, no not really, because, each consumer consumes from particular partitions, and each one will be blocked by a long insert [15:22:35] right? [15:22:35] hm [15:23:00] as long as the events are inserted randomly into partitions, i think big inserts will be possible on all partitions [15:23:03] hm, not sure I get it about the partition ...
Having consumer groups doesn't solve that issue ? [15:23:58] hmm so i guess it's the same [15:23:59] hm [15:24:10] yeah it's the same, huh? [15:24:10] hm [15:24:32] for a sec i was thinking the behavior would be different if you had a few threads inserting into mysql [15:24:37] buuuut, not really, it's the same as having more consumers [15:24:46] it's just as likely for a batch to be large in many of the threads [15:24:51] as it is for many of the consumers [15:24:56] The only difference is the memory footprint [15:25:01] why? [15:25:16] If using many consumers, each of them must be small [15:25:39] small? [15:25:42] If using threads, each consumer should be bigger (to take advantage of multiple threads dequeuing) [15:25:51] oh i see [15:25:54] ja guess so [15:26:20] In a many-consumer approach, having big consumers means taking resources from others [15:26:38] While only being able to work a small amount of it [15:26:53] With the thread approach, it's the opposite :) [15:27:39] hey dcausse [15:27:45] You here today ? [15:27:51] yes [15:27:55] cool :) [15:28:33] Have a question for you: In the camus package of the refinery, do you use the avro schemas defined in the resources folder, or was it for hive ? [15:28:38] dcausse: --^ [15:29:06] it's for kafka decoders [15:29:51] kafka avro decoders use src/main/resources to load the avro schemas [15:30:12] hive uses the inline schema duplicated in the create table script :/ [15:30:19] Yeah I know that :) [15:30:25] Arf, that's the way it is :) [15:30:48] dcausse: so we need to keep the resource export in pom.xml, correct ? [15:31:32] joal: I think we can drop it and use the default maven settings, src/main/avro was used to create java bindings but we do not generate them anymore [15:31:59] ok sounds good [15:32:02] and maybe drop the avro compiler plugin? [15:32:04] I'm gonna test that [15:32:11] hm, good point ! [15:32:55] Thanks dcausse for reminding me :) [15:33:22] yw :) [15:35:33] milimetric, around? [15:35:49] milimetric, have i shown you http://en.wikipedia.beta.wmflabs.org/wiki/DynamicGraph http://en.wikipedia.beta.wmflabs.org/wiki/DynamicGraph2 http://en.wikipedia.beta.wmflabs.org/wiki/DynamicGraph3 [15:37:24] actually dcausse, can't remove the plugin entirely: we use test schemas for timestamp [15:37:37] ok [15:38:30] But the prod part of it can, it seems [15:40:35] Analytics-Kanban: Remove avro schema from jar [1 pts] - https://phabricator.wikimedia.org/T119893#1852423 (JAllemandou) a:JAllemandou [15:41:30] Analytics-Backlog, Design Research Backlog: Bot to call global metrics to event page {kudu} - https://phabricator.wikimedia.org/T120330#1852426 (Capt_Swing) [15:51:21] (PS1) Joal: Clean refinery-camus from unnecessary avro files [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256952 (https://phabricator.wikimedia.org/T119893) [15:52:12] Also dcausse, some tricky stuff about avro and camus: camus compresses avro using deflate by default [15:52:49] you use snappy by default for other tables? [15:52:52] joal: i forget, are we using a custom avro output writer? [15:52:58] or the built in camus one? [15:53:01] a custom one, right?
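To make the thread-per-insert-batch idea from [15:19:37]–[15:26:53] concrete, here is a minimal, illustrative sketch (not the actual eventlogging mysql consumer): a small pool of insert workers fed by a bounded queue, so one slow bulk insert on a big schema table does not stall everything, while the small queue keeps the memory footprint down. The connect, insert_batch and consume_batches callables are hypothetical stand-ins.

    import queue
    import threading

    INSERT_WORKERS = 4                 # "4 threads or something"
    batches = queue.Queue(maxsize=8)   # small queue: bounds memory and lag

    def insert_worker(connect, insert_batch):
        conn = connect()
        while True:
            item = batches.get()
            if item is None:                     # sentinel: stop the worker
                break
            schema, events = item
            insert_batch(conn, schema, events)   # one bulk INSERT per batch
            batches.task_done()

    def run(connect, insert_batch, consume_batches):
        workers = [
            threading.Thread(target=insert_worker,
                             args=(connect, insert_batch), daemon=True)
            for _ in range(INSERT_WORKERS)
        ]
        for w in workers:
            w.start()
        # consume_batches() stands in for the pykafka balanced consumer loop
        for schema, events in consume_batches():
            batches.put((schema, events))        # blocks only when queue is full

As the conversation concludes, this behaves much like running more consumer processes in the same pykafka consumer group; the main practical difference is the memory footprint per process.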
[15:53:26] I guess we use the default writer, let me check the property file [15:54:05] we use com.linkedin.camus.etl.kafka.common.AvroRecordWriterProvider [15:54:35] ok, then i think you can change it just by setting etl.output.codec=snappy in the camus proerpties file [15:54:54] the proper proper thing to do would be to fix .AvroRecordWriterProvider to use the default mapreduce ones [15:54:57] which we set to snappy [15:55:05] but for some reason the camus dev didn't do that [15:55:13] buuut [15:55:21] instead of writing more code, just use their setting :) [15:56:57] is it a problem to change the output format of an existing dataset? [16:00:42] ottomata: hola [16:00:59] dcausse: It's not, hadoop uncompress correctly based on file headers [16:01:12] ok [16:04:12] dcausse: wait .. we do not use the java bindings any more for schemas? [16:04:35] nuria: no, but maybe for the tests [16:05:30] dcausse: the bindings are used when decoding though, right? when the class that has teh bindings is instantiated [16:06:45] nuria: we use now GenericDatumReader/Writers and not SpecificDatumReader/Writers [16:08:00] dcausse: but wehn we are "getting" a schema what we are getting are bindings really.. right? lemme see how that worked again [16:08:50] avro will just instantiate a generic Record where you can access data with getters like get("fieldName") [16:09:32] nuria: uhhhh, was going to ask what you think about not using /var/log/eventlogging for eventlogging output data [16:09:37] and just using /srv/log/eventlogging [16:09:44] currently they are symlinked, but i don't think i like it [16:09:49] ottomata: i like var/log/ better [16:09:57] ottomata: i mean ... [16:10:20] that is what i would expect in any system rather than /srv/ but maybe i am alone on this one [16:10:45] dcausse: ah, ok, we are getting schema here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-camus/src/main/java/org/wikimedia/analytics/refinery/camus/schemaregistry/LocalRepoSchemaRegistry.java#L86 [16:11:06] you'd expect daemon logs to be in /var/log [16:11:07] not data [16:11:23] maybe /var/lib/ or /var/spool or /var/cache, but not /var/log [16:12:01] nuria: yes, I think it makes for camus to use generic records, camus should not care about specific data model, just the timestamp [16:16:03] dcausse: ok, i remember now (sadness) why are we not using bindings cc joal [16:17:47] (CR) Nuria: "I would like to keep these as they are real useful for testing, maybe we can just comment on the pom.xml of this fact and leave things as" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256952 (https://phabricator.wikimedia.org/T119893) (owner: Joal) [16:19:47] Analytics-Tech-community-metrics, DevRel-December-2015: Explain / sort out / fix SCM repository number mismatch on korma - https://phabricator.wikimedia.org/T116483#1852550 (Lcanasdiaz) It's already working :) [16:19:57] Analytics-Tech-community-metrics, DevRel-December-2015: Explain / sort out / fix SCM repository number mismatch on korma - https://phabricator.wikimedia.org/T116483#1852551 (Lcanasdiaz) Open>Resolved [16:20:39] Analytics-Tech-community-metrics, DevRel-December-2015: Explain / sort out / fix SCR repository number mismatch on korma - https://phabricator.wikimedia.org/T116484#1852555 (Lcanasdiaz) It works :D [16:20:49] Analytics-Tech-community-metrics, DevRel-December-2015: Explain / sort out / fix SCR repository number mismatch on korma - https://phabricator.wikimedia.org/T116484#1852556 (Lcanasdiaz) Open>Resolved [16:24:13] 
ottomata: for logs, as long as we document where they are, let's change it to whatever you think is best [16:25:59] nuria: well, i was talking to bblack about systemd and logs, and he recommended that I not use custom log files at all, and just output to systemd, which allows you to read logs using the journalctl command [16:26:05] instead of the usual tail -f [16:26:16] ottomata: that seems unfriendly [16:26:19] which i think is weird, but he assures me it is better (logrotate handled automatically, etc.) [16:26:34] ottomata: we want different logs like the ones we've been having to date [16:26:43] so, in my puppet patch, I had already configured log files, and made /var/log/eventlogging be daemon logs, and /srv/log/eventlogging be output data [16:26:43] buuut [16:26:47] nuria: oh [16:26:52] they are different logs per service unit [16:26:53] that is the same [16:26:55] it's kinda the same as upstart [16:27:03] except you don't see the actual files anywhere(?) [16:27:07] you have to use the journalctl command to see logs [16:27:12] ottomata: only that you cannot use tail [16:27:16] ottomata: ahem ... [16:27:18] journalctl -f -u eventlogging-service-eventbus [16:27:24] journalctl will work like tail [16:27:31] it seems effectively the same, just no actual file to deal with [16:27:32] i guess [16:27:39] ottomata: i am not fond of that change [16:27:56] nuria: i wasn't initially, but it seems ok [16:27:59] i would just do a log file [16:28:02] except [16:28:04] ottomata: cause remember logs are also on 1002 [16:28:13] nuria [16:28:15] those are different [16:28:17] ottomata: no those are only data logs though [16:28:19] those are output to a file via eventlogging [16:28:20] right. [16:28:25] those are not operational logs [16:29:57] how is systemd going to deal with keeping 1 week of logs -which we need - [16:30:19] dunno, don't ask [16:30:21] it keeps em there [16:30:27] ottomata: also it is real handy to be able to see log sizes when doing ls -lt [16:30:35] ottomata: to eye ball errors [16:31:38] ottomata: i do not like the systemd approach, on the services i have maintained before, having distinct logs for a period of time on the machine is real useful for an eyeball view of the system [16:32:05] ottomata: I really do not see the benefit of putting it all on systemd [16:32:34] nuria: would you like to come to ops channel and say that last statement!? :o [16:32:36] haha [16:33:09] nuria: i haven't been a fan of the upstart eventlogging stuff [16:33:45] ottomata: what is it that you do not like about it? I do like having distinct logs [16:33:50] with known locations [16:33:55] this system is real small [16:34:13] nuria: i would agree about the distinct log files, but i sense that is because I don't know journalctl well, and we might actually like it [16:34:17] it looks kinda fancy [16:34:46] ottomata: looks to me that is optimized for short term system logs [16:34:50] but, for upstart, i don't like the way it forces logs to be in /var/log/upstart, i don't like how the eventloggingctl script works, and i don't like how we have the daemons all tied together [16:34:59] nuria: am asking bblack about persistence [16:35:20] ottomata: let me re-join irc cause my log in ops channel is not right [16:37:02] a-team - Global message - Currently talking with kevinator, we should all go to the engagement staff meeting and either reschedule / cancel standup [16:37:22] makes sense joal & kevinator [16:39:03] a-team: to the engagement cave!
[16:39:08] :P [16:39:10] :D [16:39:31] I'm married milimetric, can't be engaged anymore [16:39:36] omg!!! I just realized, that we can now say "A team, TO THE BATCAVE!!!" [16:39:37] lol [16:39:58] it doesn't fit ... [16:40:37] http://iqtell.com/wp-content/uploads/2014/05/The-A-TEAM.jpg fits perfectly in a batcave! :) [16:41:14] I'm gonna sell the rights to that movie for 2 million dollars and then we'll see who's laughing mforns [16:41:31] I don't imagine b. a. baracus in leotards [16:41:42] lol [16:42:01] mforns: I actually wanted to chat about the demo [16:42:07] (pageview api one) [16:42:13] milimetric, to the batcave? [16:42:16] it seems people want us to make the stats.grok.se replacement [16:42:18] sure :) [16:42:20] ok [16:46:07] Analytics-Engineering, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: WikimetricsUser reports pages edited by cohort - https://phabricator.wikimedia.org/T75072#1852602 (Abit) another thought: maybe a way to do this without wikimetrics? the wikiwomen page already has code to... [17:02:14] a-team: stanbduppp? or staff breakfast? [17:02:24] nuria: staff breakfast [17:02:25] nuria, I'm in staff breakfast [17:02:26] it's an all-hands [17:02:30] staff, as per kevinator suggestion [17:02:33] k [17:02:35] all hands [17:02:41] ooof, survey results? [17:02:44] do I want to go? [17:02:44] :) [17:02:45] HmmM dunno [17:02:47] no [17:02:50] nobody *wants* to go [17:02:54] not even me [17:02:55] do I care to go? [17:03:10] tentative yes? :) [17:03:21] mforns: is there a blue jeans meeting? [17:03:29] nuria, yes [17:04:02] nuria, I passed it to you in private [17:05:39] a-team: should we have standup after or via e-mail? [17:05:54] a-team 10:30? [17:05:54] nuria: after's fine [17:05:59] works for me [17:08:37] youtube link? [17:09:24] a-team? [17:09:50] ottomata: https://www.youtube.com/watch?v=KlNcD3WgLDA [17:10:08] joal: not public :) [17:10:14] nope [17:15:33] (CR) Joal: "nuria: test schemas are still available in the resources package. Is that enough or do you want the pom thing to be commented for future r" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256952 (https://phabricator.wikimedia.org/T119893) (owner: Joal) [17:29:57] Analytics-Tech-community-metrics: Update ITS related data from Bugzilla to Phabricator/Maniphest in project-info.json - https://phabricator.wikimedia.org/T114636#1852708 (Aklapper) So if I get things right, `its_*` defines Bugzilla and `its_1_*` defines Phabricator/Maniphest (because e.g. `its_1_closers` exi... [17:32:05] Analytics-Kanban: Eventlogging devserver needs maintenance - https://phabricator.wikimedia.org/T120245#1852711 (Nuria) [17:32:28] Analytics-Kanban: Eventlogging devserver needs maintenance [3] - https://phabricator.wikimedia.org/T120245#1852714 (Nuria) [17:43:19] Analytics-Tech-community-metrics: Update ITS related data from Bugzilla to Phabricator/Maniphest in project-info.json - https://phabricator.wikimedia.org/T114636#1852758 (Aklapper) [17:43:20] Analytics-Tech-community-metrics, DevRel-December-2015: "Tickets" (defunct Bugzilla) vs "Maniphest" sections on korma are confusing - https://phabricator.wikimedia.org/T106037#1852757 (Aklapper) [17:43:59] Analytics-Tech-community-metrics: Update ITS related data from Bugzilla to Phabricator/Maniphest in project-info.json - https://phabricator.wikimedia.org/T114636#1852759 (Aklapper) p:Low>Lowest Alright, `browser/config/README` says: > The second file: projects-info.json is not really used now. The only... 
[17:49:41] Analytics-Tech-community-metrics, DevRel-December-2015: "Tickets" (defunct Bugzilla) vs "Maniphest" sections on korma are confusing - https://phabricator.wikimedia.org/T106037#1852766 (Aklapper) Looking at `mediawiki-dashboard/browser/config/menu-elements.json` the order in the side bar menu is: * scm * m... [17:57:07] (PS2) Nuria: Clean refinery-camus from unnecessary avro files [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256952 (https://phabricator.wikimedia.org/T119893) (owner: Joal) [17:57:42] (CR) Nuria: [C: 2] "Sorry, I misunderstood your change. Things are left as they are on testing but we are removing these schemas that are no longer used plus " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256952 (https://phabricator.wikimedia.org/T119893) (owner: Joal) [17:57:51] (CR) Nuria: [V: 2] "Sorry, I misunderstood your change. Things are left as they are on testing but we are removing these schemas that are no longer used plus " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/256952 (https://phabricator.wikimedia.org/T119893) (owner: Joal) [18:00:45] dcausse: did you test the snappy thing? i assume it works, but you know... :) [18:01:44] a-team, standup? [18:01:46] ottomata: no but I will :) [18:01:53] k [18:02:02] ottomata: i'm still here in this breakfast thingy [18:02:10] yea, about to be over though [18:02:34] I have one on one w/ kevinator in theory, but I guess we can move it [18:04:13] It's the retention, stupidQ! [18:04:14] :) [18:04:29] oops, wrong ch. [18:10:36] a-team: I'm in the batcave now [18:11:29] i'm in some stnadup hangout i guess [18:11:30] coming to batcave [18:11:32] a-team: standupppp cc ottomata [18:13:21] Analytics-Kanban, Patch-For-Review: Create eventloggging alarm that triggers when sql insertion goes to zero [3] - https://phabricator.wikimedia.org/T119771#1852804 (Nuria) Open>Resolved [18:15:42] Analytics-Kanban: mobile_apps_uniques_monthly not updating {hawk} - https://phabricator.wikimedia.org/T120224#1852811 (Nuria) Please see: hive (wmf)> SELECT year, month, unique_count FROM wmf.mobile_apps_uniques_monthly WHERE platform = 'Android'; OK year month unique_count 2015 1 6218069 2015 2 5924051 2015... [18:15:58] Analytics-Kanban: mobile_apps_uniques_monthly not updating {hawk} - https://phabricator.wikimedia.org/T120224#1852812 (Nuria) Open>Resolved [18:21:06] dcausse: thank you [19:17:30] Analytics-Kanban: mobile_apps_uniques_monthly not updating {hawk} - https://phabricator.wikimedia.org/T120224#1853021 (Tbayer) Thanks everyone for addressing this quickly! I already saw the data coming in yesterday afternoon PT and updated https://www.mediawiki.org/wiki/Wikimedia_Product#Reading accordingly,... [19:25:28] Analytics-Kanban: mobile_apps_uniques_monthly not updating {hawk} - https://phabricator.wikimedia.org/T120224#1853039 (Nuria) @Tbayer: it is not possible to fill in October data as we no longer have it. Sorry about that. [19:48:42] a-team: this is the gist for the insertion, it took 2.9 seconds on my local machine: https://gist.github.com/milimetric/cbf974bf6c192fcc2487 [19:49:07] I'll be doing that on el1001 now, lemme know if you see anything wrong with it [19:51:19] a-team: it took 3.53 seconds to run there [19:52:00] so bulk inserting works fine. The thing to do next would be to try and bulk insert into one of the very small schema tables vs. 
one of the very big ones [19:52:10] yeah [19:52:21] table sizes looks like a possibility [19:53:18] I think the size hasn't changed too much, but maybe the logs for each table or the indices or something [19:53:31] something's different about certain tables that the same exact code behaves differently [19:53:32] ya [19:53:37] some metadata [19:56:06] milimetric: with data size goes indexation issues [19:56:20] So double checking on indices could be a good idea :) [19:56:43] (PS14) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the ProgramMetrics API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [19:57:13] milimetric: btw, i figured out the bug with cohort size so it should work now [19:57:24] i'm gonna try rebasing on yours [20:12:54] madhuvishy: the easiest way to rebase is to grab the checkout command from gerrit and instead of checkout FETCH_HEAD just do rebase FETCH_HEAD [20:19:43] milimetric: aah [20:20:23] that's good to know, i was able to do it though by pulling it to another branch, and rebasing mine - at research brown bag now [20:41:38] madhuvishy: were you able to get the last access jobs to start running? [20:42:16] nuria: haven't gotten to it yet - will do later today [20:43:17] madhuvishy: k, i can also do it if you want if you let me know where is the code patch you are running them from. [20:44:07] nuria: its mostly on local/home folder. i will push to gerrit in sometime. I can do it i think, but it's possible i wont figure it out then i'll ping you :) [20:44:32] madhuvishy: sounds good, i will be working late today [20:44:38] ah okay [21:08:15] Sigh. http://bluerasberry.com/2015/11/is-wikipedia-traffic-dropping/comment-page-1/#comment-22437 [21:13:00] Nemo_bis: lemme see [21:19:03] perhaps an oddball question but my google-fu is weak...is there any way to get a struct<...> output in a column from a hive select statement? i have a feeling this is very non-standard, i'm trying to write a query that regenerates a log in a current format but from data we now have in hive [21:19:24] i might need a UDF, which would be fine [21:19:50] ebernhardson: cause you want to read an avro field like array ? [21:20:31] nuria: other way around, the the old log format is tab delimited, and one of the fields is json encoded. I'm trying to reconstruct the structure of that [21:21:05] ah "build" a struct type? i doubt it [21:21:30] ebernhardson: w/o a udf that is [21:21:41] nuria: makes sense, thanks! [21:56:28] milimetric: yt? [21:56:37] milimetric: check out this user agent [22:26:54] ebernhardson: you don't mean like this right? 
select get_json_object('{"eventid":12}', '$.eventid'); [22:27:53] if you do mean that, then https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object [22:28:21] but that's typically the extent of JSON parsing in Hive, there's another one called json_tuple that I know of, and the rest is all up to UDFs [22:28:38] milimetric: other way around still :) i mean have a table field fields foo, bar, baz, and generate output of "foo\t{"some" bar, "thing: baz} [22:28:55] the tab delimiters can be handled easy by hive output formats, but not sure about generating the struct [22:29:07] ebernhardson: oh, we're doing that too, but it's more manual [22:29:09] here, one sec [22:30:23] ebernhardson: line 89 here creates JSON with collect_set [22:30:24] https://gerrit.wikimedia.org/r/#/c/236224/23/oozie/cassandra/daily/pageview_top_articles.hql [22:31:11] notice on line 54 it's replacing tabs with '' [22:31:33] sorry, line 40, and also doing other JSON escaping things there [22:32:02] so it's possible using the various aggregate functions, but certainly not pretty. We didn't love the UDFs we found though, so we just went this way [22:34:48] intersting [22:35:16] i'll have to play with that a bit :) mostly trying to figure out if i can stop generating these outputs from mediawiki but without making us rewrite processing pipelines yet [22:46:20] * ebernhardson now also has to look up what grouping sets are just because curious :) [23:37:15] (PS9) EBernhardson: Implement ArraySum UDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254452 [23:42:00] ebernhardson: did you tested your udf in the cluster? [23:42:09] ebernhardson: if so i think is ready to go [23:42:57] nuria: i havn't tested this updated one yet, just with the tests i wrote in java. sec i'll copy over the jar i just generated [23:43:16] ebernhardson: ok, let me know [23:46:32] nuria: yea works as expected [23:48:03] ebernhardson: ok, let me run tests on the patch ' [23:50:49] (CR) Nuria: [C: 2 V: 2] Implement ArraySum UDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254452 (owner: EBernhardson) [23:52:32] Analytics-Engineering, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: WikimetricsUser reports pages edited by cohort - https://phabricator.wikimedia.org/T75072#1854425 (Milimetric) @Abit: I was able to figure out a hack to make this work for the program metrics. When you te... [23:54:34] Analytics-Backlog: Set up metrics for Time on Site - https://phabricator.wikimedia.org/T119352#1854427 (Milimetric) [23:58:17] Analytics-Backlog, Datasets-Webstatscollector, Language-Engineering: Investigate anomalous views to pages with replacement characters - https://phabricator.wikimedia.org/T117945#1854435 (Milimetric) As far as I could personally tell from the anecdotal analysis I did (as in not over 6 months of data but...
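One more option for the tab-delimited-plus-JSON output ebernhardson asked about at [21:19:03] and [22:28:38], not something anyone in the channel settled on: instead of building the JSON string with concat/collect_set in HQL, stream the rows through a small Python script with Hive's TRANSFORM clause and let json.dumps handle the escaping. The column names and output format below are hypothetical.

    #!/usr/bin/env python
    # Hypothetical Hive TRANSFORM script: reads tab-separated (foo, bar, baz)
    # rows on stdin and writes the old "foo\t{json}" log format to stdout.
    import json
    import sys

    for line in sys.stdin:
        foo, bar, baz = line.rstrip('\n').split('\t')
        payload = json.dumps({'some': bar, 'thing': baz})
        sys.stdout.write('%s\t%s\n' % (foo, payload))

It would be wired up with something like ADD FILE rebuild_log.py; SELECT TRANSFORM(foo, bar, baz) USING 'rebuild_log.py' AS (line) FROM ..., at the cost of an extra streaming step compared to doing everything in HQL.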