[17:02:08] djellel_: thanks for pointing out the cirrus search dumps (https://dumps.wikimedia.your.org/other/cirrussearch/current/) -- i always forget that they are there. they are actually super useful for some of the language modeling work we're doing but the HTML dumps would be different in that the cirrussearch dumps contain the processed HTML with everything but the text removed so you can't use them for identifying # of references or
[17:02:09] more complicated text/link analyses. the HTML dumps would likely be like the text output in this API result: https://en.wikipedia.org/w/api.php?action=parse&page=Chicago&format=json
[17:24:16] isaacj: cool! that's going to be a really huge corpus too.
[17:54:41] djellel_: yeah, i suspect bigger than the wikitext dumps though i'm not sure how much bigger. i think at least it'll just be the current versions for now instead of the full history, which is what would massively blow it up
[17:57:31] isaacj: https://download.kiwix.org/zim/ hints at how big (assuming LZMA compression)
[18:05:30] Yeah unfortunately I don't know how that compares to bzip2
[18:05:55] But I expect same order of magnitude as the wikitext dumps at least
[18:06:12] And hopefully not even twice as big
[19:28:14] wikipedia_en_all_maxi_2018-10.zim -- does "maxi" include the pics?
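
A minimal sketch of what the parse API result mentioned at 17:02:09 looks like in practice: it fetches the rendered HTML for one page (roughly what the HTML dumps would contain per article). The function name and User-Agent string are illustrative, and the `requests` library is assumed.

```python
# Hypothetical sketch: fetch rendered HTML via the MediaWiki parse API
# (https://en.wikipedia.org/w/api.php?action=parse&page=Chicago&format=json)
import requests

def fetch_parsed_html(title, lang="en"):
    """Return the rendered HTML body for a page title."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": title,
            "prop": "text",        # only the rendered HTML
            "format": "json",
            "formatversion": 2,    # returns "text" as a plain string
        },
        headers={"User-Agent": "html-dump-example/0.1"},
    )
    resp.raise_for_status()
    return resp.json()["parse"]["text"]

if __name__ == "__main__":
    html = fetch_parsed_html("Chicago")
    print(html[:500])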
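```

For contrast, a rough sketch of reading the cirrussearch dumps discussed at 17:02:08. This assumes the dump follows the Elasticsearch bulk-import layout (alternating metadata and document lines) with already-extracted plain text in a "text" field, which is why reference counts and link analyses are not recoverable from it; the filename below is only an example.

```python
# Assumed layout: gzipped JSON lines alternating {"index": {...}} metadata
# and document objects with fields such as "title" and "text".
import gzip
import json

def iter_cirrus_docs(path):
    """Yield (title, text) pairs from a cirrussearch content dump."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if "index" in doc:   # metadata line, skip
                continue
            yield doc.get("title"), doc.get("text", "")

# Example (hypothetical filename):
# for title, text in iter_cirrus_docs("enwiki-20181001-cirrussearch-content.json.gz"):
#     ...
```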