[12:31:43] Hey YuviPanda. Saw your message about Quarry working on phones.
[12:32:19] Not sure when I'll be writing SQL on a phone, but I suppose it might be nice to know if a query finished.
[12:34:11] halfak: yeah, and it was a one-line change anyway, so not much time spent on it
[12:36:50] Finally getting all the diffs for enwiki loaded into MongoDB.
[12:37:01] It turns out that mongo has a 16MB limit on documents.
[12:37:18] Regrettably, some of Wikipedia's revisions are > 32MB
[12:37:24] So... I had to work around it.
[12:37:44] halfak: oh wow, diffs for all revisions, ever?
[12:37:49] you must have a big mongo instance :D
[12:37:55] also consider looking at rethinkdb at some point
[12:38:09] * YuviPanda tried to set up a shared mongo instance for toollabs, and ran into so many mongo bugs...
[12:38:52] Yeah. Not a huge fan of mongo. I just need indexes on JSON blobs.
[12:38:58] heh
[12:39:14] and if you try to do performant-ish mongo, you end up doing things rdbms-ey anyway
[12:39:39] The instance I have probably won't be able to contain all of the diffs. I'll have a pretty good estimate of the storage requirements at the end though.
[12:39:49] heh, nice
[12:40:35] I *could* put my mongo DB in /data/scratch. Good idea or bad idea?
[12:40:44] TERRRIBLE IDEA
[12:40:48] heh
[12:41:08] 1. you make a request to Mongo, 2. Mongo makes a request to NFS, 3. NFS spins up its disks...
[12:41:12] will be veeeerrrryyyy slow
[12:41:27] I'm hoping that /srv (the storage partition) is local then.
[12:41:33] it is
[12:41:48] YuviPanda, if I'm constantly interacting with NFS, then the disks would remain spinning, no?
[12:42:08] halfak: true, but other projects are also going to be using NFS and the disks would be spinning for them...
[12:42:17] there's still an extra network call
[12:42:17] Indeed.
[12:42:51] Yeah. I don't think this will be high bandwidth and I'm guessing that this will be nearby in the data center. Regardless, I don't just move the DB into there.
[12:42:56] *won't
[12:43:21] :)
[12:43:28] halfak: it's not bandwidth that's the problem, it's the latency
[12:43:47] network is orders of magnitude slower than a 'local' disk
[12:44:17] Oh sure. But I'll be writing append-only :)
[12:44:35] and reading forreevvvveerrrr! :)
[12:46:09] reading forever?
[12:47:36] halfak: as in, very very slowly :)
[12:47:55] halfak: also, magnus has started using Quarry! Asked for JSONP as a feature in an email, I just re-implemented CORS
[12:49:56] YuviPanda, I don't know what you are talking about with reading, but I'm only worried about getting all the CPU work out of the way. Copying the data to a different location for reading is cheap.
[12:50:27] CPU work is going to be the same, no?
[12:50:35] or more in fact with NFS
[12:50:43] since it'll spend more time on iowait
[12:51:00] halfak: if you want mongo to be able to store data larger than local disk, you need a mongodb cluster
[12:51:01] YuviPanda, I don't think we're understanding each other.
[12:51:05] oh
[12:51:08] that's highly possible :)
[12:51:10] I just need to get the diffs generated *once*
[12:51:17] That takes a lot of CPU.
[12:51:31] then I can copy them around anywhere else.
[12:51:40] yes?
[12:51:40] Generating the diffs will take on the order of 3 weeks.
[12:51:52] Copying the data will take less than 24 hours no matter what kind of drive.
[12:52:03] * YuviPanda continues listening
[12:52:14] That's it.
[12:52:20] where does mongo come into this?
[12:52:32] Mongo is the indexed storage medium
[12:52:42] That I happen to be using right now.
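The 16MB cap mentioned at 12:37:01 is MongoDB's hard per-document BSON limit. The log doesn't say which workaround was used for the oversized revisions; one common option is GridFS, which splits a large blob across fixed-size chunk documents. A minimal sketch with pymongo, where the database name and the `rev_id` metadata field are assumptions, not taken from the chat:

```python
# Hedged sketch: GridFS is *one* way around MongoDB's 16MB document limit,
# not necessarily the workaround used in the conversation above.
import json

import gridfs
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["enwiki_diffs"]   # hypothetical database name
fs = gridfs.GridFS(db)        # GridFS stores blobs as fixed-size chunks

def store_diff(rev_id, diff_ops):
    """Store a (possibly >16MB) diff document for one revision."""
    blob = json.dumps(diff_ops).encode("utf-8")
    return fs.put(blob, filename=str(rev_id), rev_id=rev_id)

def load_diff(rev_id):
    """Fetch the most recently stored diff for a revision."""
    grid_out = fs.get_last_version(filename=str(rev_id))
    return json.loads(grid_out.read().decode("utf-8"))
```

An alternative under the same constraint would be to shard each oversized diff across several ordinary documents keyed by `(rev_id, part)`, but that trades GridFS's built-in chunk handling for hand-rolled reassembly.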
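Quarry (12:47:55) is a Flask application, and serving results with a CORS header lets other sites fetch its JSON output cross-origin without resorting to JSONP. The log doesn't show the actual change; this is only a rough sketch of the idea, with the hook and the endpoint invented for illustration:

```python
# Hedged sketch: a permissive CORS header on a Flask app so external tools
# can read JSON responses cross-origin. Illustrative only, not Quarry's
# actual implementation.
from flask import Flask, jsonify

app = Flask(__name__)

@app.after_request
def add_cors_headers(response):
    # Allow any origin to read responses; tighten this for stricter policies.
    response.headers["Access-Control-Allow-Origin"] = "*"
    return response

@app.route("/query/<int:query_id>/result")   # hypothetical endpoint
def query_result(query_id):
    return jsonify({"query_id": query_id, "rows": []})
```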
[12:53:13] right
[12:53:15] A mongo cluster would be fine, I guess, but I shouldn't need more than 500GB.
[12:53:23] hmm
[12:53:34] you'd need a cluster of 8-10 machines, I'd think
[12:53:44] no chance of doing this in prod, I suppose
[12:53:52] Why would I need 8-10 machines?
[12:54:09] since each machine is going to give you only about 100GB of usable Mongo space?
[12:54:18] hmm, 8-10 maybe too much, 6-7 at least tho
[12:54:44] Wait... this isn't configurable? I can't just go to Coren and say, "I've got this project that needs some more storage space"
[12:54:55] He'd rather have me spin up a bunch of instances with all that overhead?
[12:55:07] hahhhaaaa
[12:55:09] yeaaaaah
[12:55:14] 160GB isn't very configurable, I'm afraid
[12:55:33] Looks like this might not exist in labs then.
[12:55:41] we could make them build a new image with 500G of /srv space, but that's a bit of work
[12:55:45] and I dunno if they'll make time for it
[12:55:54] it's not impossible, just a bit more work that they haven't done before
[12:56:22] indeed, doing this on prod is going to be so much simpler
[12:56:54] But one should now experiment on prod, right?
[12:57:10] *not
[12:58:20] theoretically no
[12:58:26] does prod even have a mongo instance?
[12:58:34] I wonder if we have budget in analytics.
[12:58:47] and if it does I suspect it's going to be very outdated
[12:58:58] halfak: I know there are spare machines with 12TB disks lying around
[12:59:39] halfak: https://wikitech.wikimedia.org/wiki/Server_Spares
[12:59:56] That'd work
[13:00:05] halfak: and it'll be fairly trivial to set them up with mongo. you need to go through toby and mark tho
[13:03:24] * halfak writes an email
[13:04:40] YuviPanda, I wonder if it is worth confirming with Coren that he doesn't have something up his sleeve.
[13:05:05] halfak: yes, with Coren and andrewbogott. if they're up for providing a larger image, it should be fine
[13:05:11] It makes me sad to invite a bunch of new naysayers to the conversation.
[13:05:38] I'll hold off on the email and bring it up in -labs today.
[13:06:16] halfak: ok. an email to labs-l is probably not too bad as well. andrewbogott is going to be busy with the wikitech migration today as well, I suppose
[13:06:36] wikitech migration?
[13:07:16] halfak: wikitech.wikimedia.org was a hand-run, hand-maintained wiki, separate by itself. it is now being migrated to be managed the same way as the rest of the cluster
[13:07:50] Isn't that a bad idea? What if the cluster goes down?
[13:08:22] halfak: there is https://wikitech-static.wikimedia.org/wiki/Main_Page
[13:08:26] which is a mirror hosted elsewhere
[13:08:55] halfak: wikitech was always *in* the cluster, it just wasn't managed in the same way
[13:09:04] gotcha
[13:12:15] off for a bit
[13:12:29] halfak: hmm, where's the code that generates diffs, btw
[13:18:19] Do you want just the diffs or the whole server?
[13:18:43] https://github.com/halfak/Difference-Engine
[13:18:53] https://github.com/halfak/Deltas
[16:33:17] Ironholds: yt? we’re starting the research group
[16:33:19] Ironholds, group meeting?
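Back-of-the-envelope arithmetic behind the sizing exchange at 12:53-12:54 and the copy estimate at 12:51:52, using only the figures stated in the chat (a 500GB upper bound, roughly 100GB of usable Mongo space per labs instance, a 24-hour copy window). The per-instance figure and the overhead allowance are the estimates from the conversation, not measured values:

```python
# Rough arithmetic from the figures mentioned in the chat above.
import math

total_gb = 500             # halfak's upper bound on diff storage
usable_per_instance = 100  # YuviPanda's estimate of usable Mongo space per instance

instances = math.ceil(total_gb / usable_per_instance)
print(f"bare minimum instances: {instances}")                       # 5
print(f"with headroom/overhead: {instances + 1}-{instances + 2}")   # ~6-7

# Copy-time check: 500GB in under 24 hours needs only ~6 MB/s sustained,
# well within what any local disk or data-center link can deliver.
seconds_per_day = 24 * 60 * 60
required_mb_per_s = total_gb * 1024 / seconds_per_day
print(f"required throughput: {required_mb_per_s:.1f} MB/s")         # ~5.9
```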
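The repositories linked at 13:18 are halfak's own diff tooling (Difference-Engine and Deltas). As a stand-in illustration of what "generating diffs for every revision" involves, here is the same idea using Python's standard difflib; this is not the Deltas API, and the whitespace tokenization is deliberately naive:

```python
# Hedged sketch: an operation-level diff between two revision texts using
# only the standard library. The real pipeline in the chat uses the Deltas /
# Difference-Engine projects; this just shows the shape of the task.
from difflib import SequenceMatcher

def diff_revisions(old_text, new_text):
    """Yield (op, old_tokens, new_tokens) tuples describing the change."""
    old_tokens = old_text.split()   # naive whitespace tokenization
    new_tokens = new_text.split()
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield op, old_tokens[i1:i2], new_tokens[j1:j2]

# Example: one revision pair
ops = list(diff_revisions("the quick brown fox", "the quick red fox jumps"))
print(ops)  # [('replace', ['brown'], ['red']), ('insert', [], ['jumps'])]
```

Doing this once per revision of English Wikipedia is what makes the job CPU-bound and the "order of 3 weeks" estimate plausible; the resulting operations are what get written to the indexed store.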