[13:04:28] yoooooooooo
[13:16:37] morning milimetric
[13:21:33] morning drdee!
[13:25:11] hangout?
[13:28:11] i'm there drdee
[13:28:15] 1 sec
[14:32:47] (PS1) QChris: Ignore MySQL credentials and logs [analytics/geowiki] - https://gerrit.wikimedia.org/r/82413
[14:32:48] (PS1) QChris: Allow to pass MySQL config and database to make limn files from [analytics/geowiki] - https://gerrit.wikimedia.org/r/82414
[14:32:49] (PS1) QChris: Add script for cron to make limn files and push them [analytics/geowiki] - https://gerrit.wikimedia.org/r/82415
[14:36:58] (CR) Milimetric: [C: 2 V: 2] Ignore MySQL credentials and logs [analytics/geowiki] - https://gerrit.wikimedia.org/r/82413 (owner: QChris)
[14:41:30] morning average
[14:41:34] morning qchris
[14:41:40] Hi drdee
[14:41:41] morning ottomata
[14:41:41] (CR) Milimetric: [V: 1] Allow to pass MySQL config and database to make limn files from [analytics/geowiki] - https://gerrit.wikimedia.org/r/82414 (owner: QChris)
[14:42:32] morning!
[14:43:24] aaaaaand i have to run home! I'm at a cafe now, but I'm expecting the Comcast internet guy to come set up internet at my new place between 11 and 2
[14:43:34] good thing I have this mobile router!
[14:43:37] back in a bit...
[14:44:24] (CR) Milimetric: [V: 1] Add script for cron to make limn files and push them [analytics/geowiki] - https://gerrit.wikimedia.org/r/82415 (owner: QChris)
[14:58:30] baaack
[16:02:01] heya drdee, you there?
[16:02:10] always there
[16:02:23] but in the product meeting hangout
[16:02:24] what's up
[16:02:26] ?
[16:03:42] aye, just wanting some prioritization!
[16:04:57] things I have to work on:
[16:04:57] - Camus timestamp bucketing using kafka message key
[16:04:57] - reinstall hue, oozie, hive, etc. (would like to loop in leslie on this)
[16:04:57] - Install openjdk 7 on prod nodes?
[16:04:58] - Look into remote DC kafka/zookeeper options (probably have to work with mark on this)
[16:05:23] reinstallation of hue etc.
[16:05:30] and openjdk in labs first?
[16:05:48] i've done openjdk with hadoop in labs before, but then redid it with oracle jdk 6
[16:05:53] can reinstall in labs I think
[16:05:55] yeahhhhh yeah
[16:05:58] let's do that first
[16:06:01] because then we can test the hive thing
[16:06:01] k
[16:06:04] yup
[16:06:04] and building udfs, etc., right?
[16:06:09] that was one of the potential issues?
[16:07:45] yeah, but that will require more work
[16:07:52] including changing the poms
[16:07:55] to use java 1.7
[16:08:01] and use the new dependencies
[16:08:07] (cdh 4.3)
[16:08:21] i would make that a separate card
[16:08:31] and just first run the existing stack using openjdk 7
[16:11:23] right right, that's what I mean
[16:11:27] we can run the stack on openjdk 7
[16:11:38] and build using the old jdk, and make sure stuff still works, right?
[16:12:02] yup
[16:12:20] and if not we can open mingle cards :D
[16:24:35] making some food....
[17:00:52] ottomata, drdee, tnegrin, average, qchris: standup
[17:00:54] milimetric ottomata scrum
[17:00:59] eh?
[17:01:04] yeah!
[17:01:06] am I at the wrong link?
[17:01:07] it's 1pm
[17:01:14] no, I mean I'm in the standup
[17:01:52] ottomata: ^^
[17:17:17] (PS5) Stefan.petrea: Implemented Survivor metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/81421
[17:32:42] I have to go and get some stuff, brb
[19:34:41] drdee, tnegrin, I'm going to wait til the terasort is done before I run other benches and do the actual upgrade to openjdk 7
[19:34:45] it's looking fine in labs though
[19:34:53] drdee, did you want to try out any hive stuff in labs?
[19:38:56] maybe start running some of those example hive jobs from hue
[19:39:01] that's easy
[19:41:46] hmm ok
[19:41:51] oh, i'm still on 4.2.1 in labs
[19:41:52] ah
[19:41:54] gotta upgrade
[19:49:03] milimetric: http://ipython.org/ipython-doc/dev/config/extensions/autoreload.html
[19:50:32] interesting average
[19:50:43] oh, did you want to talk about being stuck?
[19:52:25] 100%
[19:52:33] milimetric: let's hangout
[19:52:42] k
[20:09:16] drdee, have you used json + hive before?
[20:09:29] nope
[20:20:37] drdee, know anything about hive partitions?
[20:20:47] is it possible to use them without partition column names in the file paths?
[20:20:49] yes
[20:21:01] i don't think you wanna do that
[20:21:25] no?
[20:21:26] don't I
[20:21:27] ?
[20:21:37] 2013/08/27/08
[20:21:43] YEAR/MONTH/DAY/HOUR
[20:21:46] that would be handy, right?
[20:22:44] yep, and that's how you want to partition
[20:23:28] fyi: http://dumps.wikimedia.org/other/pagecounts-raw/2013 -> 403
[20:23:39] yeah, but how do I do that?
[20:23:45] if I don't have column names in the path
[20:23:46] like
[20:23:58] year=2013/month=08/day=27/hour=08
[20:23:58] ?
[20:24:59] awight: not for me
[20:25:49] ok, i will mention it in -operations
[20:28:46] drdee, is that possible? ^
[20:29:16] regex?
[20:30:52] ottomata 1 sec
[20:32:40] so what you want is an external partitioned table
[20:32:53] so you do the partitioning and tell hive how it is partitioned
[20:33:05] using the PARTITIONED BY command
[20:34:19] ah, bo
[20:34:20] boo
[20:34:27] so I have to move the data so it has the column names in the file path though, right?
[20:35:03] no, i think you can say
[20:35:31] PARTITIONED BY (year INT, month INT, day INT, hour INT)
[20:36:08] the whole purpose of PARTITIONED BY is that you don't need to have the column names in the folder structure
[20:36:16] that only happens if you let Hive partition the data
[20:39:27] how does it know where to find the values in the path?
[20:42:18] i think it will split by '/'
[20:42:21] not sure
[20:42:22] though
[20:54:37] doesn't like it
[20:54:37] hm
[20:54:53] well, i'll play with this more, but i can probably adjust camus to do what we want if we have to
[20:54:54] but!
[20:54:56] good news
[20:55:09] I just created a (non-partitioned) hive external table using json data from varnishkafka imported from camus
[20:55:16] and it works!
[20:55:26] I used com.cloudera.hive.serde.JSONSerDe
[20:55:34] A-W-E-S-O-M-E
[20:56:17] I especially like the varnishkafka part!
[20:56:39] no shit :D
[20:56:42] ja! and actually, Snaps_ and drdee, remember how we were talking about the key timestamp bucketing thing?
[20:56:47] here's a q:
[20:56:48] aight
[20:56:59] drdee, are we going to do geocoding and anonymization as part of ETL?
[20:57:00] because we can!
[20:57:12] i would like to!
[20:57:16] I could write a Camus decoder or writer that does that
[20:57:19] and
[20:57:19] if so
[20:57:38] i think we don't need to worry about having the timestamp key in kafka, because at that point we'd have to parse the json anyway
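The pieces discussed above, an external Hive table over the Camus output, com.cloudera.hive.serde.JSONSerDe for the varnishkafka JSON, and partitions whose directories do not carry column names, fit together roughly as in the sketch below. This is only an illustration: the table name, columns, and HDFS paths are invented (the real varnishkafka fields and cluster paths differ), and the SerDe jar has to be registered with Hive first. The key point is that for an external table, ALTER TABLE ... ADD PARTITION ... LOCATION can map a partition onto an arbitrary directory such as 2013/08/27/08, so the year=.../month=... naming scheme is not required.

```python
import subprocess

# Illustrative only: table name, columns, and HDFS paths are made up, and the
# JSONSerDe jar must already be on Hive's classpath (ADD JAR ...).
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS webrequest (
    hostname STRING,
    dt STRING,
    ip STRING,
    uri STRING,
    user_agent STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/otto/webrequest'
"""

# Because the table is EXTERNAL, each partition can point at whatever directory
# the data already lives in: plain 2013/08/27/08 paths written by Camus work,
# no year=2013/month=08/... directory names needed.
add_partition = """
ALTER TABLE webrequest ADD IF NOT EXISTS
PARTITION (year = 2013, month = 8, day = 27, hour = 8)
LOCATION '/user/otto/webrequest/2013/08/27/08'
"""

for statement in (create_table, add_partition):
    subprocess.check_call(['hive', '-e', statement])
```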
[20:57:45] milimetric: I think you were right
[20:58:25] i was?
[20:58:27] :)
[20:58:29] you are!
[20:59:14] WERE right
[20:59:40] :D
[20:59:54] milimetric explained that I'm doing date arithmetic on what I thought were timestamps
[20:59:59] but sqlite doesn't agree
[21:00:07] I mean, that's my mistake
[21:00:11] so milimetric was right
[21:00:35] I'm actually doing arithmetic on dates and I expect them to behave like timestamps
[21:00:54] cool, that's good to know that's how that works though
[21:01:25] I thought of another way to verify, and that's to put the User.user_registration + survival_days in the select clause
[21:01:39] (however you end up doing that date arithmetic)
[21:01:48] to make sure it's calculating it properly
[21:02:00] milimetric: that's exactly what I did, I put it in the select clause
[21:02:03] and if you have a few ideas, you can add all the variations at the same time and see which one's right
[21:02:05] k, cool
[21:02:16] and it's zero all the time, but sqlite didn't want to tell me. sqlite hides facts from me
[21:02:21] it's a conspiracy
[21:04:01] now I have to look in func to see if I can find something like strftime
[21:04:20] what func?
[21:05:14] func.strftime
[21:05:21] func from sqlalchemy
[21:05:22] average, milimetric: if you are serious about date arithmetic then have a look at python-dateutil
[21:05:30] that's a really good library for doing that stuff
[21:05:42] yeah, i encountered it with timeseries
[21:05:48] i'm still not 100% sure we need it
[21:05:49] the datetime lib from python has weak timezone support
[21:05:57] right, me neither
[21:06:04] yeah, in this case we're doing simple stuff still
[21:06:14] but dateutil seems very nice
[21:06:14] but if you ever need to do timezone stuff then go with python-dateutil
[21:06:18] yep :)
[21:06:18] milimetric: mediawiki/revision.py says "rev_timestamp = Column(DateTime)"
[21:06:30] milimetric: should we do a migration to turn these into UNIX timestamps?
[21:06:30] it is! it's really awesome actually
[21:07:40] drdee: that lib is pretty cool, it solves half of the problem. the other part is how to store datetime objects with timezone included inside dbs transparently across sqlite/postgres/mysql. For me that's an open question because I'm still learning sqlalchemy
[21:07:41] I don't think they're UNIX timestamps either average, they're the mediawiki format
[21:08:08] Column(DateTime) seems to map to the mediawiki databases properly though
[21:08:23] we can't change those, but we could change how we map to them
[21:08:48] | rev_timestamp | varbinary(14) | NO | | | |
[21:09:13] right, it's a 14-digit number, right?
[21:09:18] milimetric: ^^ this is the structure of rev_timestamp in the enwiki_p database on wikimetrics.pmtpa.wmflabs
[21:09:26] milimetric: so it's a number in mysql, yes
[21:10:13] i remember when we first mapped the tables, DateTime seemed to make the most sense
[21:10:22] if you can find another type, I'm ok switching to it
[21:10:29] milimetric: should we migrate our revision.py model to treat it as a number also? (changing the type of the Column rev_timestamp)
[21:10:34] ok cool
[21:10:34] but I would rather not use a mysql-specific mapping of varbinary
[21:11:21] I will not change it for the moment, to get the metric done. but afterwards I'll give it a try. I suspect some other parts of the codebase will need changes if I do that right now
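A minimal sketch of the kind of query being debugged here, assuming the Column(DateTime) mappings quoted above. SQLite stores those values as text, so adding days directly to a column silently produces nonsense; converting both columns to unix seconds with strftime('%s', ...) keeps the arithmetic numeric (this route is SQLite-specific, MySQL would use UNIX_TIMESTAMP instead), and putting the computed expression in the select clause makes the intermediate value visible instead of hiding it inside the filter. The models below are cut-down stand-ins, not the real wikimetrics code.

```python
from sqlalchemy import Column, DateTime, Integer, cast, create_engine, func
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

# Cut-down stand-ins for the wikimetrics mediawiki mappings quoted above.
class User(Base):
    __tablename__ = 'user'
    user_id = Column(Integer, primary_key=True)
    user_registration = Column(DateTime)

class Revision(Base):
    __tablename__ = 'revision'
    rev_id = Column(Integer, primary_key=True)
    rev_user = Column(Integer)
    rev_timestamp = Column(DateTime)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

survival_days = 3
survival_seconds = survival_days * 86400

# Convert both datetimes to unix seconds so the subtraction is numeric.
reg_s = cast(func.strftime('%s', User.user_registration), Integer)
rev_s = cast(func.strftime('%s', Revision.rev_timestamp), Integer)

survivors = (
    session.query(User.user_id, (rev_s - reg_s).label('seconds_after_registration'))
    .join(Revision, Revision.rev_user == User.user_id)
    # keep the whole computed expression on one side, a plain number on the other
    .filter(rev_s - reg_s >= survival_seconds)
    .all()
)
```

For timezone-aware input coming from the UI, normalizing everything to UTC before it reaches the query (python-dateutil's tz helpers are a comfortable way to do that) keeps this arithmetic consistent across sqlite/postgres/mysql.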
[21:15:17] milimetric: are there types in mysql to store datetimes with timezone?
[21:15:33] but wait, why do we need these timezones? uhm, we don't need them
[21:15:37] why is timezone a concern here?
[21:15:38] dumb question, sorry
[21:15:40] k :)
[21:16:08] we do need timezones from the UI of wikimetrics treated properly inside the backend code
[21:16:15] I mean, that's the only time we can get stuff with timezones
[21:16:33] milimetric: http://garage-coding.com/timeconvertor2/test.html
[21:16:44] milimetric: can you have a look at this and tell me what you think? :D
[21:17:03] I tried to show it in the demo but I was completely unprepared, so that didn't come out right at that point
[21:17:14] milimetric: drag those green/red thingies around
[21:18:39] so this would be a way to pick the timezone, average?
[21:19:11] milimetric: well, yeah, it's rather clumsy right now, but I could polish it and turn it into something better-looking
[21:19:27] it's just a proof of concept
[21:19:31] drdee: there's a card for the timezone support also, right? not sure if it's for this sprint
[21:19:40] no, not yet
[21:19:44] ok
[21:19:49] i think we agreed it was not necessary for the time being
[21:20:24] interesting idea, I think it could use a little thinking about how to display the times in all three clocks, but it's cool. So green would be the start and red would be the end?
[21:20:42] milimetric: yes
[21:21:28] ok, I get it. Cool idea average, we should revisit it when we need it
[21:32:00] average, I'm not liking how disorganized the test data setup is
[21:32:01] I have two ideas
[21:32:10] 1. move it all into a SQL insert script
[21:32:28] 2. think about how to modularize it, kind of how you did
[21:32:50] for 2., I'd like to go some steps further and have the tests only create the data they need and nothing else
[21:33:00] which way would you prefer?
[21:33:54] I'd go for 2.
[21:34:55] I agree it's constructing a fair bit of things for each test. At the moment I'm much more comfortable with things because I just discovered autoreload in ipython
[21:35:33] Test running time is still a problem though. I'm not 100% sure that if we refactor the tests we'll get a better test suite running time, but at least they'll be more organized, yes
[21:36:20] milimetric: we should definitely talk about 2. when you have time
[21:37:38] ok, cool
[21:38:23] I think 2. is better too. I think with 1. it'd be easier to set up one single monolithic dataset that can be used in all situations
[21:38:31] but I think that's becoming harder and harder with so many metrics
[21:38:53] so 2. would allow us to make it very clean, and that's important I think. Otherwise it just gets harder to add tests and that ultimately impacts quality.
[21:39:09] not to mention community participation and ease of involvement for people like ryan faulkner
[21:39:29] so then it's a matter of the cleanest way to set this up
[21:39:56] after the metrics meeting Thursday I want to spend some time thinking about it and working on a cleanup
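For option 2. above, the shape could be something like the sketch below: small helpers that each test calls to build only the users and revisions it needs, rather than one monolithic dataset shared by every test. The helper names, model constructors, import paths, and the DatabaseTest base class are assumptions for illustration, not the actual wikimetrics test harness; the real Survivor metric call is left as a comment.

```python
from datetime import datetime, timedelta
from nose.tools import assert_equal

from wikimetrics.models import Revision, User  # import path assumed
from tests.fixtures import DatabaseTest        # assumed base class providing self.session


def create_user(session, name, registration):
    """Insert a single user; each test makes only the rows it needs."""
    user = User(user_name=name, user_registration=registration)
    session.add(user)
    session.commit()
    return user


def create_revision(session, user, when):
    """Insert a single revision attributed to the given user."""
    rev = Revision(rev_user=user.user_id, rev_timestamp=when)
    session.add(rev)
    session.commit()
    return rev


class SurvivorDataSetupTest(DatabaseTest):

    def test_survivor_fixture(self):
        registration = datetime(2013, 8, 1)
        dan = create_user(self.session, 'Dan', registration)
        create_revision(self.session, dan, registration + timedelta(days=4))

        # run the Survivor metric against [dan.user_id] here and assert on
        # results[dan.user_id]['survivors'], as in the pasted test snippet below
        assert_equal(self.session.query(Revision).count(), 1)
```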
[22:24:14] msg tfinc http://stat1001.wikimedia.org/public-datasets/analytics/mobile/
[22:24:23] qchris: i did
[22:24:58] qchris: "Authentication Required" "WMF E3 Metrics API"
[22:25:16] tfinc: Even plain wget is able to fetch that url for me ...
[22:25:49] drdee: Does http://stat1001.wikimedia.org/public-datasets/analytics/mobile/ need any kind of login?
[22:27:40] Since ottomata is not around I'll try to get this solved via email. Thanks tfinc for trying.
[22:28:00] qchris: sure thing. as soon as i can see the data i can let you know how many more of these we will need
[22:37:15] qchris: that's new :)
[22:37:31] ohhh, that's UMAPI
[22:37:32] mmmmmm
[22:37:36] drdee: So we do require login?
[22:37:38] no
[22:38:04] It works for me™
[22:38:45] worksforme as well
[22:38:59] qchris: about gerrit replication
[22:39:06] I sent email to ottomata about it. Let's see if he knows more.
[22:39:12] does every repo in gerrit have a github remote added?
[22:39:40] Every repo gets replicated to github. Yes.
[22:39:42] qchris: maybe because tfinc has a wmf ip address
[22:39:53] k
[22:52:42] drdee: what's the best place for me to request a feature for WikiMetrics?
[22:52:58] this would be a good place :)
[22:53:01] or....
[22:53:17] an email to the analytics mailing list
[22:53:26] or an email to the wikimetrics mailing list
[22:53:34] wow, I just weeded out like 5 bugs because of tests
[22:53:38] yay :)
[22:54:53] swweeeeeet
[22:55:11] cool. maybe I'll do both! drdee: are there currently any plans to allow people to specify user-specific date ranges within a given cohort? Let's say I've got 10 new users who got treatment X on 10 different days, and 10 users who got treatment Y on 10 other days. I want to measure ns0 edits by each user in each cohort, in the 2 weeks after that user got the treatment.
[22:56:25] you couldn't give them the treatment on the same day? :D
[22:57:03] nope :)
[22:58:24] J-Mo: we can add a "dates-relative-to" field to users as part of the cohort creation process
[22:58:32] we talked about that but have no plans of doing it soon
[22:58:44] I'll post something to the list. I've got a use case for ya
[22:58:56] cool, that'd be useful
[22:58:58] and I'll follow up at the meeting next week
[23:00:12] (PS6) Stefan.petrea: Implemented Survivor metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/81421
[23:00:16] milimetric: apart from flake8 and assert_equal(results[self.dan_id]["survivors"], False)
[23:00:19] assert_equal(results[self.evan_id]["survivors"], False)
[23:00:22] assert_equal(results[self.andrew_id]["survivors"], True)
[23:00:29] damn, I pasted .. my touchpad.. :(
[23:00:40] milimetric: apart from flake8 and UI changes needed, it's currently working
[23:00:43] tests passing
[23:01:08] milimetric: tests refactored (partially, converging to the 2. you wrote about above)
[23:04:20] for some reason, comparison of timestamps only works if I move everything to the LHS of the inequality
[23:04:35] if I have stuff on the RHS also, it doesn't work
[23:04:46] but that's not a problem since I can have everything on the LHS and the RHS will be zero
[23:04:55] something related to sqlalchemy I suppose
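One possible explanation for the LHS-only behaviour (and for sqlite "hiding facts" earlier) is SQLite itself rather than sqlalchemy: SQLite coerces between text and numbers instead of raising an error, so arithmetic against a raw datetime string quietly uses its numeric prefix. A tiny sketch of the difference:

```python
import sqlite3

con = sqlite3.connect(':memory:')

# Adding to a raw datetime string: SQLite silently uses the numeric prefix
# ('2013'), so the result is meaningless but no error is raised.
print(con.execute("SELECT '2013-08-27 08:00:00' + 1").fetchone())  # (2014,)

# Converting to unix seconds first keeps the arithmetic meaningful.
print(con.execute("SELECT strftime('%s', '2013-08-27 08:00:00') + 1").fetchone())  # (1377590401,)
```

Keeping every converted term on one side of the inequality and a plain number on the other, as described above, avoids mixing converted and unconverted values in a single comparison.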
[23:17:01] milimetric: if I make flake8 pass, should we deploy this metric?
[23:27:02] hey, that's great average, let's deploy it together tomorrow
[23:27:15] ok
[23:27:21] ping me when you're around, I can't work on it quite right now
[23:27:29] alright
[23:27:36] later guys
[23:27:41] drdee: how would I go about fiddling with hive locally?
[23:27:55] poke ottomata, get access to the labs instance
[23:28:01] average: can you demo something tomorrow?
[23:28:10] drdee: yes
[23:28:15] upload some cohorts ;)
[23:28:21] uploading now
[23:28:51] drdee: but do we have a demo tomorrow?
[23:29:01] oh yes, just saw it
[23:29:13] * average thought it was every 2 weeks
[23:46:49] it is every two weeks
[23:46:54] last week was not a sprint demo