[10:39:31] Whenever I run npm run test, why do I get this error?
[10:39:36] Warning: Failed to load config "wikimedia/client" to extend from. Referenced from: /home/sohom/src/mediawiki/.eslintrc.json Use --force to continue.
[10:40:44] Is there any way to remove the error?
[15:26:22] hi, why is there "h" in the recommended backup command?
[15:26:24] https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki
[15:26:38] "tar zcvhf wikidata.tgz /srv/www/htdocs/wiki"
[15:26:52] why not "tar zcvf"
[15:27:31] -h, --dereference follow symlinks; archive and dump the files they
[15:27:31] point to
[15:27:46] I guess because some people may symlink in things...
[15:27:51] So it's easier to make it follow and back those up too
[15:27:52] I found that too - what is the rationale behind it
[15:28:23] yeah but in my install, there is "README.mediawiki", a link to "README"
[15:28:46] the link is stored as a regular file if -h is used
[15:29:38] (just tested)
[15:41:09] Reedy: thank you for your answer anyway..!
[18:23:54] Whenever I run npm run test, the eslint:all task fails with wikimedia/client not found...
[18:24:16] Is there any way to get npm to run cleanly?
[18:29:46] Did you run install or whatever first?
[18:33:09] Yup, I did
[19:36:43] hi, I'm trying to import a dump of wikipedia en from the enwiki-20190920-pages-articles.xml file. I understood there are about 6 million pages in this wiki, but the import process is still going while I've reached 7 million. What's happening?
[19:37:30] In other words, what's the overhead, and how high will it continue? It's been 40 days to go through the first 6 million...
[19:37:37] thank you
[19:38:12] "Recombine articles, templates, media/file descriptions, and primary meta-pages."
[19:38:25] Yes there's only ~6M articles...
[19:38:30] ala https://en.wikipedia.org/wiki/Special:Statistics
[19:38:43] But Templates and file description pages will be in the millions too I would suggest
[19:39:04] And I think "primary meta pages" would be the Wikipedia namespace... Again, there'll be a lot
[19:40:00] Nearly 50M pages on the wiki
[19:40:11] I would suggest a lot of those won't be in there... But still
[19:41:16] https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Page_count_by_namespace
[19:41:41] indeed I didn't actually read the title of the section...
[19:41:45] There's ~0.9M File pages
[19:41:56] ~0.5M Templates
[19:42:09] ~1M Wikipedia pages
[19:42:33] So I would guess at least 8.5M?
[19:42:34] what pages are not considered articles or templates or files?
[19:43:07] Everything else? Ish ;P
[19:43:08] https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Page_count_by_namespace
[19:43:18] Numerous NS on the wiki
[19:43:35] I'm presuming talk pages aren't in that file
[19:43:36] user and talk pages are not supposed to be in this archive
[19:44:28] apergos: What counts as "primary meta-pages"?
[19:44:55] out of curiosity, do you have an example of a page that is not considered an article but falls into namespace ID 0?
[19:45:17] A redirect presumably
[19:45:50] And enwiki would look to have ~3M of them
[19:54:48] ok, thank you very much Reedy, I'll wait for it to complete then :)
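On the eslint error above: "wikimedia/client" is a shareable config shipped by the eslint-config-wikimedia npm package, so the warning usually means the dev dependencies aren't (correctly) installed in that checkout. A minimal sketch of the usual fix, assuming package.json already lists eslint-config-wikimedia as a devDependency:

    cd /home/sohom/src/mediawiki
    npm ci                                      # clean install of the pinned devDependencies
    # or, if the package really is missing from package.json:
    npm install --save-dev eslint-config-wikimedia
    npm run test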
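On the tar question above: -h / --dereference makes tar archive the file a symlink points to rather than the link itself, so symlinked content under the wiki directory ends up in the backup; the trade-off, as noted, is that links such as README.mediawiki come back as regular file copies on restore. A small sketch (throwaway paths) showing the difference:

    mkdir /tmp/tar-demo && cd /tmp/tar-demo
    echo test > README
    ln -s README README.mediawiki
    tar zcf plain.tgz README README.mediawiki     # without -h: symlink stored as a symlink
    tar zchf deref.tgz README README.mediawiki    # with -h: symlink stored as a regular file
    tar ztvf plain.tgz | grep mediawiki           # lrwxrwxrwx ... README.mediawiki -> README
    tar ztvf deref.tgz | grep mediawiki           # -rw-r--r-- ... README.mediawiki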
[19:56:59] oh, talking about this, I have another question. If I understand correctly, kiwix and maybe others bundle the wikipedia media in an archive they distribute; does anybody know if it would be possible to use such archives to complete the imported wikipedia pages?
I don't find much info about mirrors containing packages of wikipedia media
[20:03:40] I guess it depends what you plan to do after...
[20:08:10] um
[20:08:31] pages-articles has stuff like templates in it
[20:08:47] no user, no talk
[20:08:57] but no just ns 0
[20:08:58] *not
[20:09:00] Reedy:
[20:09:07] Yeah, I'm reading ;P
[20:09:20] I was just curious about what specifically counts as "primary meta-pages"
[20:10:02] imports take a long time if you are using the import maintenance script, because in order to deal with the links it is probably doing parser expansion on all the wikitext
[20:10:07] which is hideously slow
[20:10:38] by 'deal with the links' I mean populate all the links tables: category links, page links etc
[20:11:20] if you want to know how many items are in the file, download the stubs and do a zcat stubsfile | grep '<page>' | wc -l
[20:15:42] <myier_> yes I knew it was going to be slow, I'm not in a hurry at all, it was just demotivating to see the expected number get passed and having no idea how far it would continue to grow
[20:16:41] <myier_> when you say the stubs, is it like enwiki-20191120-stub-articles.xml.gz, 1.9 GB?
[20:16:50] <apergos> I'm running the grep right now on the stubs file
[20:17:00] <myier_> thanks!
[20:17:24] <apergos> yes, stubs-articles is faster to decompress because it's gz, and smaller since there's no page content, only metadata about pages and revisions
[20:17:57] <myier_> that's still a lot of data for only that
[20:18:01] <apergos> what are you going to do with the wiki once you've got all the articles imported?
[20:18:12] <apergos> out of curiosity
[20:18:13] <myier_> an unconnected local mirror
[20:18:35] <apergos> that's dedication
[20:19:01] <apergos> so here's the bad news
[20:19:05] <apergos> 19799262 entries
[20:19:21] <myier_> the idea is to have a low-cost (raspberry pi 4) copy of wikipedia, for the possible hard times coming
[20:19:33] <myier_> ok... :(
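For reference, the stub-based count mentioned above in concrete form; the filename is the one cited in the discussion, and each <page> element in the stubs XML corresponds to one page in the matching pages-articles dump:

    zcat enwiki-20191120-stub-articles.xml.gz | grep '<page>' | wc -l
    # equivalently: zcat enwiki-20191120-stub-articles.xml.gz | grep -c '<page>'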
[20:19:52] <apergos> oh, someone else put it on a raspberry pi by downloading the kiwix bundle, I read somewhere
[20:19:58] <myier_> (and then create a wireless network that can access it)
[20:20:30] <apergos> I wonder how it's going to be at parsing some of those longer entries on the pi4
[20:20:34] <apergos> speed-wise
[20:21:09] <apergos> I think the kiwix copies are meant for local use on one machine, rather than for network access
[20:21:48] <myier_> I don't really know, but having a mediawiki on this kind of processor is not a problem, except that here there are more than a hundred extensions and large pages; a 5s delay is not a problem either
[20:22:06] <apergos> some of those pages will take a lot longer than 5 seconds, trust me
[20:22:09] <myier_> yes that's what I understood, and it's a specific reader, not a web browser
[20:22:11] <myier_> ok
[20:22:16] <Reedy> https://www.kiwix.org/en/documentation/how-to-set-up-kiwix-hotspot/
[20:22:32] <Reedy> But that is web based
[20:22:44] <Reedy> Based on "Using your Kiwix Hotspot"
[20:23:37] <myier_> at 7 million pages, it already takes 150G for the imported database files
[20:23:58] <myier_> I need more than twice as much
[20:24:02] <apergos> ooohhh
[20:24:25] <apergos> yeah, and imagine if you had the media files that go along with them :-/
[20:24:47] <apergos> bookmarking that, I need to look into it with some small wiki
[20:25:12] <Reedy> looks interesting indeed
[20:27:01] <myier_> indeed, I wasn't sure whether it was official or not, not having found this page before, but it's briefly described on the kiwix wikipedia page I think
[20:27:19] <apergos> oh it's official since it's on the kiwix site
[20:29:01] <Reedy> I think it's something they're still actively working on
[20:29:08] <Reedy> So maybe not completely "production ready"
[20:29:51] <Reedy> https://www.kiwix.org/en/downloads/kiwix-hotspot/
[20:29:54] <Reedy> >End users just need a WiFi-enabled device with a browser: no need to download and install anything on their devices.
[20:29:56] <Reedy> So yeah
[20:36:07] <apergos> I do think the pi4 will be very underpowered for on-demand parsing
[20:36:49] <apergos> this might be an instance where prepopulating cassandra/restbase with html expansions of the more likely useful pages might be a good idea
[20:37:07] <apergos> it adds a level of complexity though, where the kiwix thing is just grab, install, done
[20:38:01] <apergos> enwiki doesn't embed wikidata references right?
[20:38:04] <apergos> yet?
[20:43:49] <myier_> as I understand it, it's like a kiwix OS, whereas with mediawiki it's not static, possibly faster, and the computer can still be used for something else, but it requires considerably more storage space
[20:58:26] <apergos> mediawiki will be slower to serve pages that it must render than kiwix, which serves pre-rendered pages
[21:00:10] <apergos> but if you want to be able to update the pages locally or import selected updates from the live site, then mediawiki is the way to go
[21:00:26] <myier_> it's more fun anyway
[21:02:08] <apergos> :-)
[21:14:27] <apergos> so you get everything except user pages and talk pages of any sort in these pages-articles dumps; that includes project pages, categories, lua modules, portal pages, help pages, pages for any locally uploaded media, and gadgets
[21:18:02] <apergos> there are 1191896 pages in the wikipedia namespace alone
[21:19:45] <apergos> another 1878204 pages are templates
[21:19:55] <apergos> sorry, not templates. categories!
[21:21:36] <apergos> for templates we're talking about "only" 653213 pages
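A rough way to watch this kind of per-namespace breakdown on the local import as it proceeds (a sketch; database name and credentials are placeholders, and the namespace IDs are the MediaWiki defaults):

    mysql -u wikiuser -p wikidb \
      -e 'SELECT page_namespace, COUNT(*) AS pages FROM page GROUP BY page_namespace ORDER BY pages DESC;'
    # namespace 0 = articles (incl. redirects), 4 = project (Wikipedia:), 6 = File, 10 = Template, 14 = Category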
[21:35:13] <myier_> it's crazy that there can be so many templates :)
[21:35:21] <myier_> and categories!
[21:37:16] <myier_> at about 150k pages per day, it'll take some time :)
[21:37:33] <myier_> like 4 months
[22:17:21] <apergos> usually it's best to use a tool to convert these xml files into a set of sql files and then import those along with the downloaded links and other tables
[22:17:39] <apergos> I'm not sure we have any tools that are current (but it's on my todo list)
[22:18:44] <Vulpix> mwdumper is broken afaik
[22:19:08] <apergos> yes, it needs to be updated
[22:20:59] <apergos> all right, I'm wandering off for the night
[22:21:57] <apergos> good luck with the import!
[23:00:52] <myier_> yes I've looked at all the tools, didn't find any, good night apergos, thanks a lot!
[23:01:05] <myier_> (any working)
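The "import maintenance script" discussed above is MediaWiki's maintenance/importDump.php. A sketch of a typical invocation (paths are placeholders); --no-updates skips the link-table population identified above as the slow part, at the cost of a rebuild pass afterwards:

    cd /path/to/mediawiki
    php maintenance/importDump.php --no-updates /path/to/enwiki-20190920-pages-articles.xml
    php maintenance/rebuildall.php              # repopulate the link tables skipped by --no-updates
    php maintenance/rebuildrecentchanges.php    # refresh recent changes after a bulk import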