[14:43:17] lexein: as <page> and </page> are always found alone on their own lines, that means you can use those as markers for the beginning and end of an article
[14:43:40] lexein: under that assumption the following oneliner allows you to parse the dump with constant memory usage and no disk usage at all
[14:43:44] curl http://dumps.wikimedia.org/enwiki/20131104/enwiki-20131104-pages-articles.xml.bz2 | bunzip2 - | perl -ne 'if(/<page>/){ print "PAGE START (buffer reset)\n"; $buffer="<page>"; } elsif(/<\/page>/) { print "PAGE END (do any processing required)\n"; $buffer.="</page>"; } else { $buffer.= $_; };'
[14:44:07] lexein: you may and should do any per-article processing in the elsif branch
[14:44:19] ooh - neat. excellent. Thanks - and thanks for addressing me by name so the ding got my attention!
[14:44:57] that's AWESOME
[14:45:07] ok, well, it's average
[14:45:13] ;)
[14:45:15] :)
[14:46:08] I had quit and come back. I so glad you kept your channel history up
[14:46:12] I -> I'm
[14:46:29] I always read all the backlog
[14:46:37] but in batches
[14:47:04] Batches? Batches? I don't need no steeking batches.
[14:47:38] * lexein delirious because insomnia
[14:48:26] millimetric - did you see what average just put up for my pretty, pretty, pretty good
[14:51:33] average - I see where to put my per-block text search. Should do nicely, if slowly over my present connection. I suppose I really should get a labs account.
[14:55:07] lexein: also, keep in mind that <ref> shows up in the dump as "&lt;ref&gt;"
[14:55:14] lexein: < is turned into &lt;
[14:55:20] lexein: > is turned into &gt;
[14:55:43] That would have thrown me. Thanks
[14:56:29] Is average your wiki username?
[14:56:45] Sending you some thanks
[14:57:23] average - I just noticed you're connecting from Romania - sorry for throwing the TV and movie references at you
[14:57:39] In case you *didn't* know them
[15:01:34] lexein: the analytics team is spread across the globe yes.
[15:02:37] lexein: as the link above is 9.6GB, you can actually just download it
[15:03:14] True - haven't found my ext drive yet. Several computer maintenance tasks have been deferred of late.
[15:03:26] lexein: you don't need the dump decompressed in order to get the information you need from it
[15:03:36] I was very happy to see that
[15:12:25] lexein: you can do xml parsing without having all the xml also..
[15:12:56] eh? That's indistinguishable from black magic
[15:13:00] * lexein runs away
[15:13:01] lexein: you can consider the bzip output as an xml stream, there are parsers on CPAN for xml streams
[15:13:14] Oh
[15:13:23] * lexein says from a distance
[15:13:48] lexein: in the oneliner above $buffer contains the xml for one single page
[15:13:58] Right. I assumed that
[15:14:17] lexein: and you can consider $buffer to be an xml. there is XML::Fast on CPAN for example, and you can use it to parse $buffer
[15:14:28] lexein: that is, if you actually need XML parsing..
[15:14:34] lexein: there's also XML::Stream
[15:14:41] and many other modules
[15:15:38] average - you won't believe what I'm trying to do. I just want to count all the and
[15:16:11] lexein: sure, you can do that
[15:16:49] At the moment Citation Style 1 templates are used at 2.7 million articles so far. I'd like to get a sense of what other styles are used, and how much.
[15:17:21] This all goes to the question of a "house style" and whether it's time to start suggesting the house style for new, non-specialized articles
[15:17:42] I'd like to be able to discuss from the data, rather than handwaving.
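A minimal sketch of the per-article processing discussed above, written out as a script rather than squeezed into the one-liner. It assumes the decompressed dump is piped in on STDIN (the same curl | bunzip2 pipeline), buffers one <page> element at a time, parses it with XML::Fast's xml2hash as suggested in the chat, and tallies citation markup. The template names it counts ({{cite web}}, {{cite book}}, {{citation}}) and the output format are only illustrative; they stand in for whichever citation styles one actually wants to measure.

#!/usr/bin/perl
# Sketch of the per-page processing described in the chat: read the
# decompressed dump from STDIN, buffer one <page>...</page> element at a
# time, parse it, and tally citation-related markup per page.
# The template names counted here are examples only.
use strict;
use warnings;
use XML::Fast;    # CPAN module mentioned above; exports xml2hash()

my %counts;        # marker => total occurrences across all pages
my $pages  = 0;
my $buffer = '';

while (my $line = <STDIN>) {
    if ($line =~ /<page>/) {
        $buffer = $line;                  # start buffering a new page
    }
    elsif ($line =~ /<\/page>/) {
        $buffer .= $line;
        $pages++;

        # $buffer now holds one complete <page> element; parse it.
        my $page = xml2hash($buffer);
        my $text = $page->{page}{revision}{text} // '';
        $text = $text->{'#text'} // '' if ref $text eq 'HASH';   # text node may carry attributes

        # The dump entity-encodes wikitext, so a literal <ref> in an article
        # shows up as &lt;ref&gt;; count both spellings to be safe.
        my $refs = () = $text =~ /(?:<ref|&lt;ref)/gi;
        $counts{'<ref>'} += $refs;

        for my $tpl ('cite web', 'cite book', 'citation') {
            my $n = () = $text =~ /\{\{\s*\Q$tpl\E\s*[|}]/gi;
            $counts{$tpl} += $n;
        }
        $buffer = '';
    }
    else {
        $buffer .= $line;
    }
}

printf "%d pages scanned\n", $pages;
printf "%-10s %d\n", $_, $counts{$_} for sort keys %counts;

It would be invoked with the same pipeline as the one-liner, e.g. curl ... | bunzip2 - | perl count-citations.pl, with the caveat noted at 14:55: the wikitext inside the dump is entity-encoded, which is why the script looks for &lt;ref as well as a literal <ref.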
[15:18:58] I plan to bring up an RFC to deprecate any citation style which, by design, obfuscates or omits information which could later be used to rescue a dead link in a cite.
[15:19:08] So the numbers will help.
[15:20:33] It might come to nothing. But it's better to know than not know.
[15:20:47] Can't work on it now, though. Maybe this evening.
[15:37:43] lexein: there's also a bzip2indexer you might be interested in, it allows you to seek in the bzip2 stream and get what you need http://www.pmx.it/bzip2indexer.html
[15:38:16] lexein: you could for example store offsets in the bzip2 that correspond to each article, and then access them
[15:38:37] (all of this evades the approch of just throwing everything inside a database and doing queries)
[15:38:42] *approach
[15:38:46] Again, amazing, and thanks very much. It's all in my notes now
[15:39:22] lexein: alternatively, if you lack disk space, you can use ZFS, which is a filesystem designed to keep the data compressed but at the same time you're able to access it just like a normal filesystem. there is a performance overhead but.. it's better than nothing
[15:40:04] so, there are many many ways you can avoid using a database for this
[15:40:13] I think I'll be able to get everything I need from the help you've given.
[15:41:11] I am not sure if they are still generated ... but there once were multistream dumps. They came with such an index.
[15:41:29] Or, someone could decripple the WP searcher, so that I could just search for "
lexein: above, if you get bzip2 offsets you can create an inverted index, and so you basically will have search capabilities inside a compressed dump without the need to have it uncompressed
[16:07:14] lexein: http://en.wikipedia.org/wiki/Inverted_index
[16:07:36] Heh. Aw, man, now I gotta go learn more stuff into my head.
[16:07:48] * lexein brain broke
[16:08:20] well, you asked how you can do it, and I told you. it depends what you need. my point is you don't need a database, and you don't need a lot of space to get what you need :)
[16:08:59] I'm seriously thankful for all the effort you put in. It's all in my notes now, and I'll have to get to it later. Seriously, I am grateful.
[17:31:20] (PS1) Milimetric: Implements the Standard Deviation Aggregator [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/97052
[17:31:34] (CR) Milimetric: [C: 2 V: 2] Implements the Standard Deviation Aggregator [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/97052 (owner: Milimetric)
[22:46:16] (PS1) Milimetric: fix column width problem [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/97128
[22:46:27] (CR) Milimetric: [C: 2 V: 2] fix column width problem [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/97128 (owner: Milimetric)
[22:49:58] DarTar: I'm considering running a query on s7-analytics that would join page and revision. Bad idea or OK to do every once in a while?
[22:51:45] halfak: you mean like this ? https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/metrics/pages_created.py#L58
[22:51:51] :)
[22:52:24] Maybe. I'm not sure. I don't read SQLAlchemy
[22:52:33] It looks like this is per-user.
[22:52:41] I'm planning to do the whole thing once.
[22:53:14] halfak: I encourage you to learn sqlalchemy, I think it's very useful
[22:53:26] I'm sure milimetric would agree
[22:53:48] Yeah. It looks interesting.
[22:54:05] But I already read SQL so well. And so does everyone else who touches the database...
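Picking up the bzip2-offset idea from earlier in the log: a hedged sketch of just the bookkeeping half, again assuming the decompressed dump is streamed in on STDIN. It records, for each article, the byte offset of its <page> line within the decompressed stream plus the page title; turning those into seekable positions inside the .bz2 is what the linked bzip2indexer (or the index shipped with a multistream dump) would be for. The output filename page-offsets.tsv is arbitrary.

#!/usr/bin/perl
# Bookkeeping sketch for the offset/index idea above: stream the decompressed
# dump on STDIN and record, for each <page>, the byte offset (into the
# decompressed stream) at which it starts, plus the page title. Mapping those
# offsets back into the compressed .bz2 is the job of a bzip2 block index or
# a multistream dump; this only builds the per-article lookup table.
use strict;
use warnings;

my $offset      = 0;     # bytes of decompressed stream consumed so far
my $page_offset = 0;     # offset where the current <page> line began
my $title;

open my $idx, '>', 'page-offsets.tsv' or die "cannot write index: $!";

while (my $line = <STDIN>) {
    if ($line =~ /<page>/) {
        $page_offset = $offset;
        undef $title;
    }
    elsif (!defined $title && $line =~ /<title>(.*?)<\/title>/) {
        $title = $1;
    }
    elsif ($line =~ /<\/page>/ && defined $title) {
        print {$idx} "$page_offset\t$title\n";
    }
    $offset += length $line;    # STDIN has no encoding layer, so length() counts bytes
}

close $idx;

The resulting offset/title table is the raw material for the inverted-index suggestion: any structure mapping search terms to titles or offsets can then point back into the dump without it ever being stored uncompressed.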
[22:54:10] it's not useful if you're just doing one-offs
[22:54:38] I've been meaning to pick up SQLAlchemy just for the interoperability.
[22:54:40] sqlalchemy is just great if you're creating queries based on multiple, modular user desires
[22:54:57] eh, interoperability is more of a theoretical tihng
[22:54:59] *thing
[22:55:05] it's not often you have the same schema on multiple dbs
[22:55:13] Oh really. Doesn't work in practice?
[22:55:22] Oh yeah.
[22:55:24] :P
[22:55:32] * average grabs popcorn
[22:55:56] * average puts back popcorn
[22:56:30] BTW, do any of you guys know who I should talk to about hitting one of the "analytics slaves" that's not S1 with a big query, someone who won't just say "no" on principle?
[22:56:41] I was thinking of ottomata, but he seems to be AFK
[22:58:57] Maybe I should ask for forgiveness rather than permission on this one. There's a reason we have these slaves available after all.
[22:59:18] you could spin up a node in labs and do your bidding on it ..
[23:00:23] but maybe that would involve re-importing the dump and stuff..
[23:00:45] Sadly, one of the things I need is the archive table which contains private data.
[23:00:51] Otherwise, I'd just do this on labs.
[23:01:29] * YuviPanda whispers 'tell your boss to tell the ops boss to get a DBA on getting the archive table on tools!'
[23:01:55] halfak: how about cleaning archive of sensitive data and throwing it on labs ?
[23:02:32] I was considering that, but it will require exporting the table, which is part of my concern for long-running queries.
[23:02:51] +1 YuviPanda
[23:02:52] you could run mysqldump with nice (man nice)
[23:03:08] I don't think that will help :P
[23:03:14] This isn't a CPU-intensive activity
[23:03:39] then use ionice
[23:03:42] halfak: it's usually easier to find ops time when someone on a team says 'I am blocked on this by ops action' :D
[23:04:51] average: it's IO on the mysql server, not the server I'm logged into.
[23:05:21] YuviPanda: Good to know. I'm currently being clever, but if clever fails I'll be bugging tnegrin.
[23:05:34] Speak of the... manager :P
[23:05:37] halfak: bug him anyway, since it'll also help other people :D
[23:06:58] halfak: oh you're right, since the db sits on a remote node
[23:08:55] tnegrin: I'm a sad panda because I can't access the archive database table on labs. Is there some way you can help me speed up the work on getting that table in? It's planned and just waiting on DBA time. See https://bugzilla.wikimedia.org/show_bug.cgi?id=49088
[23:09:19] I guess not.
[23:10:27] halfak: email him and then send someone after him, but not on friday evening :D
[23:11:01] Good call. Email will go out.
[23:11:11] see, tnegrin materialized in this chat
[23:11:26] but it sounds like this is more an ops request
[23:11:43] Yuvi thinks that it's better if it goes through managers' hands.
[23:12:09] (PS1) Milimetric: fixing cohort upload comma problem [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/97134
[23:12:23] (CR) Milimetric: [C: 2 V: 2] fixing cohort upload comma problem [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/97134 (owner: Milimetric)
[23:12:28] I hadn't seen this thread by the way
[23:12:42] I'd be interested in knowing what the issue is with exposing revision summaries
[23:12:45] for Legal
[23:13:31] halfak: I think there's another option here
[23:14:04] ask ops to allow the creation of temp tables on one of the s-* hosts
[23:14:44] that might be faster than waiting for the censored version to be replicated on labs
[23:15:24] DarTar: it went through legal, but this was just 'blocking pending DBA action' or somesuch
[23:15:34] DarTar: That sounds good. What's the right channel for that kind of request?
[23:15:37] how about taking this in #wikimedia-operations ?
[23:15:47] DarTar: https://bugzilla.wikimedia.org/show_bug.cgi?id=49088
[23:15:51] YuviPanda: is there any written response by Legal that I can look up?
[23:16:39] yep, I was referring to that ticket
[23:17:42] halfak, DarTar /j #wikimedia-operations and mention the ticket https://bugzilla.wikimedia.org/show_bug.cgi?id=49088
[23:17:42] DarTar: you will have to ask Coren or someone
[23:19:12] DarTar: dependency on 49189, so it's more springle
[23:19:15] average, YuviPanda: ...or I could go and find Michelle and ask her directly
[23:19:29] you could do that too :D
[23:19:49] Cheater -- with your co-locatedness.
[23:21:20] let me see if she's around
[23:32:33] I talked to Michelle and she sees no reason why edit summaries or any other field would need to be censored from the archive table
[23:32:33] Did you check for a virus on your system?
[23:33:02] I'll send her 49088 so she can help us figure out who made that recommendation
[23:33:37] if we can bypass censoring, that's one problem less towards replication on labs
[23:33:49] I've got to run guys. Thanks for your help.
[23:33:49] halfak, mutante, YuviPanda, average ^
[23:34:02] wooo
[23:34:07] thanks DarTar
[23:34:13] Thanks DarTar!
[23:34:14] DarTar: you should get her to comment there
[23:34:17] isn't the bug report assigned to Sean Pringle ?
[23:34:25] YuviPanda: yes
[23:34:33] average: from an ops perspective, yes
[23:34:45] oh, so there are multiple perspectives
[23:34:46] ok
[23:34:47] but we need someone from Legal to chime in
[23:35:30] I see
[23:39:33] DarTar, I can think of a fairly obvious reason as to why to exclude edit summaries
[23:39:34] The client must have been hacked
[23:41:35] I write a slanderous article and don't include an edit summary
[23:41:40] updating my comment to BZ based on what Ironholds told me
[23:41:42] so it automatically loads in *drumroll* the first slanderous line
[23:42:05] I'm not sure this was the reason why Legal recommended censoring but it sounds like an obvious candidate
[23:42:06] The marketing department made us put that there
[23:43:17] 01:32 < DarTar> I talked to Michelle and she sees no reason why edit summaries or any other field would need to be censored from the archive table
[23:43:18] Well, at least it displays a very pretty error
[23:43:29] so it needs to be censored or not ?
[23:44:07] two things:
[23:44:17] 1) I was going to add a qualifier
[23:44:19] in one message you say Michelle says there's no need to censor, now you say it would be good for it to be censored
[23:44:49] no need to censor any field other than those related to oversight
[23:45:34] 2) I'm asking her to chime in on the thread to give the rationale for the edit summary censorship, but I think Ironholds got it right
[23:46:09] feel free to CC me in