[00:01:20] DarTar, whenever you can, join the Hangout
[00:04:50] * Ironholds hugs ^demon|headache
[00:04:56] have an ibuprofen!
[00:05:33] leila: yes, running a bit late, I need 5 more minutes
[00:05:41] np, DarTar
[00:06:18] <^demon|headache> Ironholds: I've been taking them since late morning like they're tic-tacs :\
[00:07:05] ...*hugs more*
[00:09:01] Tentatively, it appears that I have won in hadoop.
[00:09:05] \o/
[00:10:05] halfak: whadya win?
[00:10:08] Ironholds, I have a strategy that will let us split up requests by LUCID/Hash/Whatever and process the page views in order -- in a stream -- to generate session statistics.
[00:10:21] This will work for diff processing of revisions and persistence processing of diffs.
[00:11:25] that's an ottomata, not an Ironholds!
[00:11:32] difference is one of them is MEANT TO BE ON HOLIDAY >.>
[00:12:12] Oh woops.
[00:12:15] o/ ottomata
[00:12:16] :)
[00:12:25] I'm still skeptical that this will win generally
[00:12:35] I can't get the KeyFieldBasedComparator
[00:12:37] to work
[00:12:59] but when I set up a many-fielded key to partition on, it sorts the way I expect anyway.
[00:13:25] I very nearly sent you an email before I ran a set of tests to confirm that it is working.
[00:13:49] DarTar, I have another meeting in 15 min. you think you can make it to a Hangout soonish?
[00:13:49] ok dinner time and holiday start time
[00:13:57] byeyyeyeye i am glad you are winning!
[00:16:19] dear god, C++. Why are you so basic.
[00:16:26] simultaneously such a simple language and such a complex one.
[00:16:33] It's like ML and Haskell had a baby.
[00:16:49] "We're not going to write any of the functions you need, but will be pedantic enough to make writing them yourselves a /pain/"
[00:17:14] leila: yes, coming
[00:17:19] thanks!
[01:08:26] so, DarTar or tnegrin, can you write up the format changes mobile asked for for the session data?
[01:08:29] I'm blocked until I have it
[01:08:40] (also, the prototype of the how-many-unique-users...thing. is now running.)
[01:09:15] I'll update you -- Howie told me what they wanted
[01:09:40] ta
[01:09:55] if they're asking me to drop the geometric mean I am gonna make them a graph
[01:15:26] halfak: how do I run one of the unit tests?
[01:17:17] "nosetests" in the base directory
[01:17:36] * halfak has no idea why the testing library is called "nose"
[01:17:41] but it is awesome
[01:26:18] halfak, I know this one!
[01:26:18] http://nose.readthedocs.org/en/latest/more_info.html
[01:27:11] thanks
[01:27:14] That's a highly amusing piece of documentation.
[01:27:31] I like reason #2: "Pythons have noses"
[01:29:01] yup
[01:29:10] hmn. I wonder if I can make thanksgiving a floating holiday.
[01:29:26] I've got nothing to do then so I don't exactly know what to use it for.
[01:31:06] hola nuria__
[01:31:13] if you have some time, let's chat about https://gerrit.wikimedia.org/r/#/c/162194/
[01:38:16] leila: on a meeting, i will ping you after
[01:38:24] sounds good, nuria__
[01:39:17] halfak: what do you think? https://github.com/he7d3r/Revision-Scoring/commit/97a68a321fc79aa2874404f33870064def4871c4
[01:39:58] Huh. I thought I tested the same strategy and it didn't work.
[01:40:02] The test is passing?
[01:40:36] it seems to be :-)
[01:41:05] Cool!
[01:41:06] $ nosetests
[01:41:07] Ran 49 tests in 0.190s
[01:41:07] OK
[01:41:11] Looks good to me then.
[01:41:22] Nice work :)
[01:42:08] I got tips from #wikipedia-pt :-)
[02:04:02] leila: still there?
[02:04:06] yes
[02:04:35] halfak: what happened with editor metrics and hhvm, did we run that experiment?
[02:04:45] sent a Hangout nuria__
[02:05:36] leila: it doesn't work, can you join batcave?
[02:05:48] sure, nuria__. can you invite me
[02:05:52] leila: http://goo.gl/1pm5JI
[02:05:53] nuria__, we did.
[02:05:59] I have yet to complete the writeup.
[02:06:04] halfak: aaannnnnnddd???
[02:06:07] spoiler?
[02:06:24] in batcave nuria__
[02:06:28] We didn't see the effects we were looking for.
[02:06:32] nuria__, ^
[02:06:40] ay....
[02:07:04] At least they weren't terribly obvious. I have a working hypothesis. It's testable. I've already got some data on it.
[02:07:19] But I don't think that HHVM is making newbies edit more.
[02:07:57] It may be having more prominent effects elsewhere.
[02:30:04] Hi! Apologies for the silly question, but how are reverts stored internally? Is a revert simply a new revision (with duplicated content), or are reverted revisions flagged somehow?
[02:43:08] halfak, do you know where the "thank you" data is stored in labs?
[02:43:25] You can get thanks events in the logging table.
[02:43:37] But I don't think you can figure out who it was sent to.
[02:43:57] oww, so you get the user_id_from but not the _to
[02:44:18] http://quarry.wmflabs.org/query/1080
[02:44:22] Yes.
[02:45:14] I see. thanks! and: do you have a way for getting the diff of talk page messages?
[02:46:09] diffs are not stored in the db :(
[02:46:28] leila, if you want the thanks data, it's stored (unsanitised) in x1-analytics-store
[02:46:36] or x1-analytics-slave. I forget.
[02:47:24] Ironholds, I'm wondering what's the best way of getting the talk page diffs
[02:47:44] again, on labs, preferably, or using the dumps
[02:49:37] ahh
[02:49:43] for diffs, you have two options!
[02:49:49] 1: use the dumps and compute diffs by hand.
[02:49:58] 2: use the API and have it compute diffs.
[02:50:19] advantage of #1: you can go at whatever rate your machine will tolerate. Disadvantage of #1: you have to either use python or build a diff generator.
[02:50:45] advantage of #2: easy as hell. Disadvantage of #2: not tremendously fast. Also it has the silliest error handling ever (Roan has admitted this is his fault)
[02:51:03] I looked at just this problem for the purpose of crunching talkpage diffs and identifying when barnstars were issued. Wuz fun.
[02:52:07] humm. do you have sample code from that research Ironholds?
[02:52:18] and which of the two approaches did you take?
[02:55:08] I gave up in disgust, actually
[02:55:20] uhun. how about the API approach
[02:55:26] oh, that one's FUN
[02:55:26] do you have a sample code for that?
[02:55:39] I do! In fact, the API wrapper I wrote has a function which does just that.
[02:55:52] takes a revID, computes a diff (backwards, forwards, whatever, you specify)
[02:55:53] @leila we actually figured out the thanks stuff. the user_to username is stored in log_title
[02:55:59] there is only one minor caveat.
[02:56:08] (Hi @leila, by the way ;) this is Lars )
[02:56:24] see, if you ask the API for >1 uncached diff, it goes "NO". Silently. It just returns an empty string for diff #2.
[02:56:25] Hi roemheld.
[02:56:38] so if you want to make sure not to run into these problems, you have to make the requests 1 diff at a time.
[02:56:52] I think I worked out that doing that, and not attempting to internally DDOS our servers, would take approximately a year.
[02:57:14] so, roemheld, log_user is user_id_from and log_title is user_name_to?
[02:57:24] @leila exactly
[02:57:40] so I've debated building a connector to https://neil.fraser.name/software/diff_match_patch/myers.pdf 's C++ implementation, instead.
[02:57:42] as a way around that.
[02:57:43] That's tricky!
[02:57:45] I don't know if that would be useful.
[02:57:59] oh, you've found the fun in the log table! Yeah, that's...my favourite ;p
[02:58:11] lemme parse Ironholds
[02:59:02] kk
[02:59:16] top-level summary: API works but is slow, non-API you have to use python or roll your own, so I'm thinking of rolling my own.
[02:59:20] (and putting it in WMUtils)
[02:59:48] so, you're saying API is as slow as taking 1 year, Ironholds? ;p
[02:59:57] while we're talking about this: we have user_id_from and user_name_to, but the log_page is ambiguous in that it relates to a page, not a revision -- correct?!
[03:00:21] halfak, fyi, roemheld has figured out that log_title in logging is user_name_to
[03:00:39] (I thought it's user_name_from)
[03:01:06] roemheld_afk, check out log_comment or log_params
[03:01:12] leila: that was for 3 years of talkpage actions, yeah.
[03:01:30] @Ironholds thanks -- did that but to no avail
[03:01:33] I wanted to identify barnstar-givings so I could do an analysis of thanks versus barnstars, and how the effect varied from method to method and population to population
[03:01:44] roemheld_afk: this is specifically for a thanks example?
[03:01:47] yeah, Ironholds. roemheld_afk is looking at 1 year. that still will take a long time if your 1 year estimate is accurate
[03:01:50] I think that may be a feature, not a bug.
[03:02:00] thanks is designed to not make public all of "X thanked Y for Z"
[03:02:13] to avoid neutering the positive effect, goes the theory.
[03:02:47] we definitely have that data in the x1 cluster of tables, along with things like flow talkpage posts, if that's of interest to people and they're NDAd ;)
[03:03:17] roemheld_afk: yes, log_page is about the page, not a revision.
[03:03:34] ah, the deeper levels of knowledge. . . ;) @Ironholds
[03:03:41] we'll try to make do without that for now
[03:03:49] (afk for real now, be back in a bit)
[03:04:54] (time for me to hop on the bike and go home, too)
[03:05:18] Ironholds, see you tomorrow, or later tonight.
[03:06:14] probably tonight. Take care!
[03:06:26] roemheld_afk: okay! What are you working on?
[03:12:42] Ironholds, they're going to make a research page on meta about it, basically sentiment analysis on user talk pages, plus making prediction models to understand the impact of sentiments on user retention, etc.
[03:13:02] okay. really out of here. see you in a while. ;p
[03:19:54] okie!
[04:19:15] Hey guys, allow me to bump a previous question: how are reverts stored in the tool labs database? How do I know whether a specific revision was reverted?
[04:33:23] roemheld: good question!
[04:33:40] if rev_sha1 of [successive revision] matches rev_sha1 of [preceding revision], it's reverted. Probably.
[04:33:59] of course, it gets a lot more fun with talk pages, where bots automatically archive in a lot of cases, leading to that happening for non-revert reasons
[04:34:22] so for there you have to exclude revisions WHERE rev_user IN (SELECT ug_user FROM user_groups WHERE ug_group = 'bot');
[04:35:16] yes, this is precisely as silly a storage mechanism as it sounds. By which I mean, you know: not actually a storage mechanism.
[04:39:23] @Ironholds thanks.
Unfortunately I'm struggling to understand this entirely logical procedure ;)
[04:40:02] so to figure out if revision (n) was reverted, I need to look at revision (n-1) and (n+1) and compare their hashes?!
[04:40:16] is there really a revision (n+1)? I thought revisions are tree models
[05:53:39] roemheld: tree model? I mean, the contents is, but they're sequential within a page.
[05:54:45] and nope, if you wanna be accurate you need to look at, assuming A revisions in a day, whether a member of A[>n] has the same hash as A[<n], and then the reverted revisions are the members between the two matching hashes.
[05:55:07] * Ironholds thumbs up
[05:56:11] See the methodology in http://dl.acm.org/citation.cfm?doid=1240624.1240698
[06:05:37] oh, shit. . .
[06:07:03] thanks, Ironholds. so finding out how many of a user's edits were later reverted is (very) nontrivial :o
[10:39:19] Ironholds: I sent you an email
[10:39:24] I wonder if it'll go past your spam filter
[13:51:21] does anyone know how I can distinguish a redirect article from a "normal" article? Are they in different namespaces?
[13:53:30] Do you want to do it with the DB?
[13:53:39] Do you need to know what it redirects to?
[13:53:49] milimetric, ^
[13:54:19] halfak: yeah, with the db, no don't care where it points
[13:54:28] was just pondering Erik's point on our recent warehouse thread
[13:54:35] page.page_is_redirect
[13:54:38] it's boolean
[13:54:40] on the page table
[13:54:42] oh, great!
[13:54:49] :)
[13:55:02] i had left that out of the proposed draft, thanks!
[13:56:21] hth :)
[14:38:57] morning!
[14:54:38] Hey Ironholds
[14:54:52] I've got most of the session datasets ready.
[14:55:04] I still haven't sampled the lol one yet though.
[14:57:54] hey halfak!
[14:57:56] that's okay :)
[14:58:04] I've spent my morning replying to knuckleheads on stack overflow
[14:58:23] one day I'm going to become emperor of earth solely so that I can pass a law mandating that all classes featuring R open with an explanation of for loops.
[14:58:36] It was very nice to have the Wikimedia datasets pre-anonymized.
[14:58:46] and why they are inefficient?
[14:59:36] Also, I'm amazed that we know who sent thanks to whom. I was told that this was kept private.
[15:00:33] it is!
[15:00:37] by which we mean, only we have that data
[15:00:47] I can find out, most users can't. MWAHAHAHA.
[15:00:52] Well.. I can find it in quarry.
[15:00:55] It's in the labs DBs
[15:01:02] you can?
[15:01:03] oh!
[15:01:05] yeah
[15:01:10] yeah, I think there we just anonymise revision
[15:01:14] And, for loops have two levels of inefficiency.
[15:01:31] The first is growing objects. You have an empty vector, you append to it in a for loop. BAD MOVE. It has to find new memory each time.
[15:01:38] Solution: whenever possible, pre-specify output length.
[15:02:07] If you append to a vector, it shouldn't be re-allocating every time.
[15:02:19] not re-allocating, but allocating an additional N fields
[15:02:49] Yeah.. It's common to double the amount of available memory when you run out
[15:02:57] So you will need to re-allocate log(n) times.
[15:03:02] specifically log2(n)
[15:03:13] Still. not the best
[15:03:14] the second is iteratively modifying data.frames. BAD MOVE. Data.frames are non-primitive and, as such, copy-on-modify. Iterative df modifications make baby jesus cry. You can almost always use vectors instead, and then df$vector <- vector at the end, for only one memory-intensive stupid operation instead of N.
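A minimal R sketch of the two points above -- pre-allocating output instead of growing it, and building a plain vector before attaching it to the data.frame in a single step. The data and object names are made up for illustration:

    n <- 1e5
    input <- runif(n)

    # Slow: the output vector is grown on every iteration, so R has to
    # find new memory for it each time it runs out of room.
    out_slow <- c()
    for (i in seq_len(n)) {
      out_slow <- c(out_slow, input[i] * 2)
    }

    # Better: pre-specify the output length, then fill it in place.
    out_fast <- numeric(n)
    for (i in seq_len(n)) {
      out_fast[i] <- input[i] * 2
    }

    # For data.frames: build the whole vector first and attach it once,
    # rather than modifying the column inside the loop (copy-on-modify).
    df <- data.frame(x = input)
    df$doubled <- out_fast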
[15:03:56] Why is there no "rbindappend" that does this for us?
[15:04:14] I guess because we should be doing it with rbindlist instead.
[15:04:27] huh
[15:04:39] at the same time, people who rely on lapply instead are silly. Lapply is only faster because it does those things inside itself and relies on different object types. You shouldn't accept silly output formats and arghing around turning a list of vectors into a data.frame or whatever just because you can't write output <- character(length(input))
[15:04:57] Lapply is good for a lot of stuff (like, when you specifically want a list, or specifically have a list as input).
[15:05:17] but it's rare to find a situation where you're using it for vector/df based things and it's noticeably faster than properly setting up a for loop.
[15:05:38] All this concern over optimization.
[15:06:08] When I find the worst issues with code are due to a lack of coherence.
[15:06:20] Now, if you can make fast, coherent code -- all the better.
[15:18:26] halfak, define coherence?
[15:19:52] Hmmm. Not quite the right word. I mean to say that the code captures the simplest abstraction possible.
[15:20:00] This is often at the cost of performance.
[15:23:42] ahh
[15:23:44] gotcha!
[15:23:53] I like my code to be hyper-fast. everything else is secondary.
[15:23:58] I say as I roll my own TSV reader
[15:24:29] I like to spend as little time coding as possible :)
[15:24:42] This means I want to write my code to be clear and maintainable first.
[15:25:00] I'd rather let the process run overnight than spend an extra couple hours getting it a little bit faster.
[15:25:18] Now, this does not hold for operations where you know optimization can give you significant gains.
[15:25:30] this is clear when an operation is performed all of the time and is generally useful.
[15:25:37] yup. All of my optimisation is for crap in the generalised package
[15:25:50] I think that geolocation and ua-parsing is a great example of code that should be *fast*
[15:25:57] when a day of work gets me a 1-2 OOM speed improvement in code I call 500 times a day? hell yeah.
[15:25:58] exactly.
[15:26:08] I'd argue that reading the sampled logs should be fast for the same reason, mind.
[15:27:57] "reading the sampled logs"?
[15:28:25] that is, getting them into a data.frame, from the base TSV
[15:28:40] fread can't handle the idiocy of the format, read.delim is slow as hell.
[15:36:11] halfak, so I got the TSV reader working. 1 OOM faster and leaves us with a smaller memory footprint
[17:21:30] * Nettrom no standup, CHI rebuttal
[17:22:52] May the muse be with you Nettrom
[17:24:54] Nettrom, I had to write mine on Sunday.
[17:25:07] I wouldn't mind only reviews were 1.5:2 uniformly, and I've never written one before
[18:06:55] ewulczyn, https://docs.google.com/a/wikimedia.org/spreadsheets/d/1CiJdg2tK9RG9ozL6h1QtzradYL_mnPMm6s21JepmEj4/edit#gid=0 when you have a sec for non-work
[18:24:59] Hi! quick question, on Tool Labs server, larger query results should be written to /tmp/, correct?
[18:25:50] halfak, would you consider a 0.002 variation between two functions acceptable?
[18:25:53] ...it sounds stupid, but...
[18:25:57] *%
[18:29:03] tacotuesday, is it?
[18:29:24] It's tuesday, isn't it?
[18:32:27] and you get tacos?
[18:35:04] I want Tacos
[18:39:35] too bad.
[18:40:06] okay, it takes 197 seconds to read in a file using C++
[18:40:21] (well, mostly that time is spent making it processable)
[18:40:23] let's see for R...
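A rough, pure-R sketch of the shape of a hand-rolled TSV reader like the one being discussed: read raw lines, split on tabs, and build each column exactly once, with no quote or comment handling (the part of the format that trips up read.delim and fread on the sampled logs). The field count and column names are hypothetical, and the order-of-magnitude win mentioned above presumably comes from doing the equivalent in compiled code rather than in pure R:

    read_raw_tsv <- function(path, n_fields = 16) {
      lines  <- readLines(path)
      pieces <- strsplit(lines, "\t", fixed = TRUE)

      # Drop malformed rows rather than trying to guess at them.
      pieces <- pieces[vapply(pieces, length, integer(1)) == n_fields]

      # Build each column once, then assemble the data.frame in one step.
      cols <- lapply(seq_len(n_fields), function(i) {
        vapply(pieces, `[[`, character(1), i)
      })
      names(cols) <- paste0("field_", seq_len(n_fields))
      as.data.frame(cols, stringsAsFactors = FALSE)
    }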
[18:50:54] okay, we're saving about 1.2 minutes
[18:51:27] mind you, over 600 files, that's what, 10 hours?
[18:53:49] halfak: you're within the delivery area of Taco Cat, no?
[18:54:08] * halfak did not know about Taco Cat
[18:55:02] Nooo! I'm outside the delivery area.
[18:55:11] darn, sorry
[18:58:17] blah
[18:58:21] signs my workload is crazy:
[18:58:28] my grandmother emailed me two weeks ago. I just got the time to reply to her.
[18:58:54] I hear you.
[18:59:08] My mom's emails tend to sit in the queue for at least 48 hours.
[18:59:24] And I really *like* to email my mom.
[18:59:38] Ironholds: did an email from me but pretending to be from you make it through to you?
[18:59:57] nope
[19:00:04] halfak, yeah, your mom is awesome!
[19:00:07] can you check spam?
[19:00:18] it's like the best bit of me being at your wedding: I legitimately got to say "your mom says hi"
[19:00:33] also I got to see a couple of very special people be gooey at each other, but that's comparatively unimportant.
[19:00:44] the important thing is the jokes.
[19:00:46] :P
[19:02:44] halfak, DarTar, leila, question for you
[19:02:48] would you consider https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi#mediaviewer/File:Marcello_Malpighi_large.jpg a pageview?
[19:02:57] Nope
[19:03:00] Ironholds, halfak, ellery: analytics showcase
[19:03:33] Ironholds: nope (but I would count the image impression as an image view)
[19:03:48] totally
[19:03:53] but I don't care about image impressions ;p
[19:04:04] I'll care about image impressions when someone tells me it's the next step
[19:04:18] Ironholds: yup
[19:04:56] alright, thankee!
[19:05:01] I'll scold the multimedia team
[19:05:27] Ironholds, Jeff wants to talk about Hadoop/hive and this may be of interest to you. so come to the showcase if you'd like
[19:05:42] I've really got to get this pageviews stuff done :(
[19:06:09] np. he specifically brought up your name and halfak's. halfak is in. so I thought you should know.
[19:06:21] alrighty, I'll make it along
[19:21:43] okay, the cluster has spent 20 minutes prepping an exploratory query over a single hour of data.
[19:22:03] this capacity problem is getting on my nerves
[19:23:42] leila, Ironholds: lmk if you guys want to jump on the check-in with Maryana (starting in 7)
[19:23:50] sure, happy to
[19:24:35] halfak, apparently whatever revision thing you're doing is using 802816MB of memory
[19:24:50] Na. It's using 50 more GB than that
[19:24:50] clusters are /awesome/
[19:24:54] :P
[19:24:57] oh yeah, reserved space
[19:24:58] my bad
[19:25:17] It's a big job. There's no excuse for using that much memory though.
[19:25:26] It isn't my streaming jobs. That's for sure.
[19:26:24] I'll join DarTar.
[19:27:14] k cool
[19:28:00] DarTar, so when does it kick off?
[19:28:20] leaving now, it's at 11.30 - brb
[19:28:48] kk
[19:28:52] send me the link?
[19:28:56] will join in t-5
[19:29:35] done
[19:43:28] halfak, so is your massive job a technological exploration kind of thing, or for an active project?
[19:44:06] What?
[19:44:11] the big hadoop job
[19:44:14] Yes
[19:44:32] What's the difference between an exploration and an active project
[19:44:34] ?
[19:44:38] rephrase
[19:44:44] latter have more meetings
[19:44:53] heh
[19:45:09] Are you asking "is this work or play"?
[19:45:11] It's work.
[19:45:14] like: are you testing how to go about distributing this type of job over hadoop, and the results of the computations will not be used for an R&D task, or are you performing an R&D task that requires the results?
[19:45:20] naw, not work or play, just what happens at the end
[19:45:34] I appreciate my clarification is also fuzzy (working out how to do our jobs better should totally be considered a result, darnit)
[19:45:35] Oh yeah. This is quarterly prioritized stuff.
[19:45:40] gotcha; cool!
[19:45:56] It looks like it isn't blocking other jobs, but it is making them hang on "PREP" for a long time.
[19:46:11] So, I'm working with Ellery now to figure out what we can do to change priority.
[19:46:40] I've tried to lower the priority -- and it looks like the command worked, but the lower priority is not showing.
[19:47:03] The startTime lies, this has been going on for a few days.
[19:49:31] I think I might know a more efficient way to do this.
[19:49:44] I think that WikiHadoop is the problem.
[19:50:59] Kill command sent
[19:51:22] Farewell, my sweet progres.
[19:51:23] s
[19:52:53] My other thing is totally working now though. Time to change strategies :)
[19:57:45] halfak, cool! Thanks :)
[21:34:44] Anyone working from stat1003 right now?
[21:36:12] noope. leila, ewulczyn ?
[21:37:45] halfak, not at the moment but I'm writing a python script which will start running there shortly.
[21:38:04] I just sent out an email about a couple of big jobs I have going.
[21:38:15] They'll look scary, but they shouldn't be in the way of anything you try to do.
[21:38:29] As in if you look at "top" right now, you might curse me.
[21:38:31] sounds good. thanks. (email hasn't arrived yet)
[21:39:46] I won't be using stat3
[21:50:54] halfak, we can't install something like db.py on stat1003, right? this should go through ottomata
[21:53:44] or you just install it with python db.py install --local
[21:53:47] and don't tell anyone ;)
[21:56:49] I see Ironholds. so I have to download the package from somewhere and then do this?
[21:59:48] yup :(
[22:05:48] leila, I've been installing my own packages R style.
[22:07:13] * Ironholds fistbumps halfak
[22:07:17] install.packages(), aw yiss.
[22:08:32] pip install db.py
[22:09:19] nooo. pandas install failed.
[22:20:11] leila, did you sudo? :D
[22:26:11] I didn't Ironholds
[22:31:19] aha
[22:31:21] hh!
[22:31:23] *huh
[22:31:32] it's weird, ha?
[22:31:44] tnegrin, can you let me know what format changes product wants for the mobile apps session data?
[22:31:50] it's okay. I'm not in a rush for it. let's wait until ottomata comes back
[22:32:00] it's my only blocker on that front (we resolved time period questions at our weekly get-together this afternoon. Great success!)
[22:39:31] ewulczyn: running a few mins behind, let me see if we have a room
[22:40:50] Is there some big replication lag at the minute?
[22:40:59] We're sending events but I can't see them in the DB.
[22:41:33] Deskana: you should jump on #wikimedia-analytics
[22:41:39] Oh, whoops.
[22:42:51] Ironholds: hangout?
[22:44:41] tnegrin, totally
[22:44:46] just send me a link
[22:49:58] tnegrin?
[22:50:12] kk
[22:51:07] https://plus.google.com/hangouts/_/wikimedia.org/tnegrin?authuser=0
[23:15:05] Helder & ToAruShiroiNeko: re. meeting with gabriel, I'd be happy to take that on and report back
[23:15:16] However, if you want to join in, that's OK with me.
[23:15:46] It will be tomorrow at 2300 UTC
[23:16:46] I think I won't be available, but I'm ok with you reporting back
[23:17:18] Sounds good.
[23:17:33] I think this will be a politicalish meeting anyway, so it's good to save your time for more productive things :)
[23:26:05] hear, hear ;)
[23:28:05] :)
[23:28:33] btw, do you already have info on the kind of indexing (if any) you need?
[23:29:23] A btree or hash will work fine for us. We'll likely be doing simple identity matching on rev_id.
[23:29:52] At most, we'll have a record to match every revision in a wiki.
[23:30:37] Now that I've thought about it, we might want to index on (wiki, rev_id)
[23:38:12] what indexing are you talking about?
[23:39:13] Helder, halfak: I was primarily thinking about something like 'give me the pages with the highest spam score'
[23:39:26] secondary indexes on individual fields basically
[23:39:32] Ahh. That's a use-case we were not thinking of supporting.
[23:39:52] okay, easy enough then ;)
[23:42:15] * halfak imagines looking at the least damaging edits out of curiosity.
[23:42:32] halfak: but what if we have a wiki where patrolling/reverting bad edits has a big backlog? then the users could benefit from having a list starting from the more likely vandalisms
[23:43:13] or even something like "give me the non-reverted edits with the highest vandalism score"
[23:43:40] Helder, you're not wrong, but I'd expect a tool to manage that.
[23:43:43] e.g. for users who are offline for a few days to catch up in the patrolling
[23:44:08] E.g. "give me scores for the revisions made in the last week -- I'll sort and shuffle them how I like"
[23:44:25] yeah, maybe
[23:44:26] But I can certainly see this use-case. I'm just worried about scope for *us*.
[23:44:37] Now, a metadata service on the other hand ... :)
[23:45:13] to us it's very interesting to think ahead about use cases like that
[23:45:33] gwicke's "us" is the metadata service
[23:45:45] us as in restbase team
[23:47:44] Same difference?
[23:47:53] (sorry I must be missing something)
[23:48:33] my understanding is that you are currently interested in storing a fairly flat json blob with a few numbers
[23:48:46] That sounds right to me.
[23:48:50] we could store this as a json blob, or we could place each number in its own column
[23:48:55] the latter is easier to index
[23:49:15] Well, you can store the json how you like, of course.
[23:49:15] and enforces a schema
[23:49:36] You can enforce a schema on json.
[23:50:16] okay, so it'll be relatively stable?
[23:50:24] What will?
[23:50:29] the json structure
[23:50:50] Well. The way I imagine it being is stable.
[23:50:54] Potentially additive.
[23:51:07] yeah
[23:51:14] But we're talking about a system that isn't standing yet.
[23:51:27] You could specify an output format and we could just implement that.
[23:51:47] additive extension is something we plan to support for table schemas
[23:51:55] as it's easy to do
[23:52:21] if it's going to be more complex than that, then I'd lean towards storing stuff as a blob
[23:52:44] anyway, we can always start with a blob & refine later
[23:52:53] KISS & all that
[23:52:54] Hmm.. Well a mongo or postgres strategy would allow you to store and index json fields.
[23:53:24] But neither will enforce the schema for you.
[23:53:30] yeah
[23:53:39] similar with elasticsearch
[23:54:41] we've been toying with the idea of adding elasticsearch indexing at some point in the future
[23:54:56] that should also work with a json blob
[23:55:37] I really don't know what you're imagining for restbase.
[23:55:51] But I think that sounds interesting.
[23:56:26] What *is* restbase?
[23:59:54] it's mostly a storage and cache service
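For reference, a toy sketch of the storage shape being discussed above -- one row per (wiki, rev_id) holding a flat JSON blob of scores, with simple identity lookups on that pair. It uses SQLite via DBI purely for illustration; the table and field names are invented and this is not the actual RESTBase design:

    library(DBI)       # assumes the DBI, RSQLite and jsonlite packages are available
    library(jsonlite)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")

    dbExecute(con, "
      CREATE TABLE revision_score (
        wiki   TEXT    NOT NULL,
        rev_id INTEGER NOT NULL,
        score  TEXT    NOT NULL,       -- flat JSON blob of model outputs
        PRIMARY KEY (wiki, rev_id)     -- identity matching on (wiki, rev_id)
      )")

    dbExecute(con, "INSERT INTO revision_score VALUES (?, ?, ?)",
              params = list("ptwiki", 12345L,
                            as.character(toJSON(list(vandalism = 0.07),
                                                auto_unbox = TRUE))))

    dbGetQuery(con,
               "SELECT * FROM revision_score WHERE wiki = ? AND rev_id = ?",
               params = list("ptwiki", 12345L))

A "give me the highest-scoring revisions" listing would need either a separate indexed column per score or an external index (the elasticsearch idea above), which is exactly the scope question being debated.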