[13:37:45] hi there
[13:38:16] hello
[13:38:20] are we having the chat here?
[13:38:28] Hi
[13:38:48] in a few minutes
[13:41:58] Molly should be here soon
[13:45:34] so it will be one student and one mentor on each side? :)
[13:46:24] Looks like that. :)
[13:47:21] Yes... I hope so!
[14:00:01] Rtdwivedi, Zaran let's start
[14:00:16] Is Molly here?
[14:02:18] Raylton: I think Molly mentioned 2 pm UTC in one of her emails
[14:02:27] I don't mind waiting a few more minutes
[14:02:35] unless you have other commitments later
[14:04:55] If you do not mind waiting, okay ... but our scope is well defined and we can start without her. I do not want to block your work
[14:07:21] Rtdwivedi, Zaran, what is the agenda of this meeting?
[14:07:53] 1. How does the BookManager extension use pageMapping?
[14:08:25] 2. How does the BookManager extension use the ProofreadPage extension?
[14:08:54] I think 2. comes before 1.: 1. is a subquestion of 2., the one we want to discuss today
[14:09:10] also you might want to explain more precisely what you mean by pageMapping
[14:10:01] Rtdwivedi, define: pageMapping
[14:10:47] For any book uploaded, there are pages that we don't need to consider. Also, there are pages which have no numbers.
[14:11:47] sorry ... I cannot understand
[14:11:58] ok, let me try to jump in here
[14:12:05] let's look at an example: https://fr.wikisource.org/w/index.php?title=Livre:Marx_-_Diff%C3%A9rence_de_la_philosophie_de_la_nature_chez_D%C3%A9mocrite_et_%C3%89picure.djvu&action=edit
[14:12:19] as you can see, the pagelist tag has a few parameters
[14:12:25] In the Index page, there is a field 'pages'. The pagelist tag is used to define which page number in the wiki corresponds to which one in the book.
[14:12:39] Sorry, I was looking for an example.
[14:13:08] these parameters are there to define a mapping between page numbers in the file and page numbers as they are printed in the book
[14:13:20] very often the first few pages in a book don't have a number for example
[14:13:40] so you want to indicate that for example, the 10th page in the file has the number 1 printed on it
[14:13:50] and that the first 9 pages are unnumbered
[14:14:09] sometimes it's more complicated than that, there is one unnumbered page between each chapter, etc.
[14:14:51] the pagelist tag has a pretty flexible set of parameters/options to specify the mapping between the page number in the file, and the displayed page number in the book
[14:15:10] this is useful for display purposes
[14:15:23] for example if you display the index page that I linked to above: https://fr.wikisource.org/wiki/Livre:Marx_-_Diff%C3%A9rence_de_la_philosophie_de_la_nature_chez_D%C3%A9mocrite_et_%C3%89picure.djvu
[14:15:40] the page numbers you see correspond to the real page numbers as they are printed in the book
[14:17:39] Raylton: does that make any sense?
[14:18:03] we'll probably use djvu only to fill the content of the pages
[14:18:38] Raylton: AFAIK, yes.
[14:18:48] Raylton: the current understanding that I have, from previous discussions with you (or Molly, I don't remember)
[14:19:09] is that when using the BookManager extension jointly with the ProofreadPage extension
[14:19:22] we'll be able to define a few extra attributes when defining the book structure
[14:19:47] for example, we would be able to specify for each subdivision (chapter, etc.) a corresponding page range in the scanned book
[14:20:21] I guess that's not the top priority for you at the moment
[14:20:38] but I think it could be useful in the long run
[14:23:16] now the question is: how will you allow the user to specify the page range when defining the book structure?
[14:23:28] we will support the navigation books without digitized version. But if any digitized version ...
intend to import the metadata
[14:24:09] navigation in books*
[14:24:46] Raylton: Do you mean djvu versions?
[14:24:50] yes
[14:24:52] now the question is: how will you allow the user to specify the page range when defining the book structure?
[14:25:03] ^out of the scope
[14:25:48] Raylton: I see
[14:26:03] Raylton> we'll probably use djvu only to fill the content of the pages
[14:26:17] I'm not sure what you meant by that exactly
[14:26:29] Content extraction?
[14:26:44] but it looks like there will be some (loose) binding between the book structure as defined in BookManager, and the scanned book
[14:27:33] yes ... but we intend to do just a little integration with the version that you plan to replace tag
[14:28:27] Raylton: That has been dropped for the moment. What we had planned was a table that listed out the mapping between the page number in wiki and the book.
[14:28:44] Raylton, I see: so basically you will just be relying on whatever is specified by the or tag
[14:28:48] Some user would have to proofread this mapping as well.
[14:28:49] would, but we decided to leave, not to come within the scope of PP
[14:29:23] I was wondering if you would need to be able to access the information in an easier way
[14:29:39] to spare you the trouble of reparsing the tag
[14:30:44] Zaran: Are you proposing 'storing' the result of the parse?
[14:31:23] Rtdwivedi: I was thinking about maybe registering an API function
[14:31:41] where you could get the page numbering mapping in a structured format
[14:31:47] JSON for example
[14:33:06] Sorry... I'll be back in a few minutes
[14:33:31] see: https://meta.wikimedia.org/wiki/Book_management
[14:48:39] From what I see, the major point of integration between the two extensions is the form where the book details are entered. Is it already decided how the BookManager metadata page and ProofreadPage index page will be integrated?
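The page numbering mapping discussed above (file page to printed page, returned "in a structured format, JSON for example") can be sketched roughly as follows. This is only an illustration under stated assumptions: the rule format, the names `RULES`, `printed_label` and `mapping_as_json`, and the 20-page example book are all hypothetical and are not part of ProofreadPage or BookManager. The numbers follow the example given in the chat: the first 9 file pages are unnumbered and the 10th file page carries the printed number 1.

```python
import json

# Hypothetical mapping rules, one per contiguous range of file pages.
# "start_at" is the printed number of the first page in the range,
# or None when the pages in that range carry no printed number.
RULES = [
    {"first": 1, "last": 9, "start_at": None},   # unnumbered front matter
    {"first": 10, "last": 20, "start_at": 1},    # 10th file page is printed page 1
]


def printed_label(file_page, rules=RULES):
    """Return the printed page number for a file page, or None if unnumbered."""
    for rule in rules:
        if rule["first"] <= file_page <= rule["last"]:
            if rule["start_at"] is None:
                return None
            return rule["start_at"] + (file_page - rule["first"])
    raise ValueError(f"file page {file_page} is not covered by any rule")


def mapping_as_json(rules=RULES):
    """Serialize the full file-page -> printed-page mapping as a JSON object."""
    last_page = max(rule["last"] for rule in rules)
    mapping = {str(page): printed_label(page, rules)
               for page in range(1, last_page + 1)}
    return json.dumps(mapping)


if __name__ == "__main__":
    print(mapping_as_json())
```

A consumer such as BookManager could then read the mapping without reparsing the tag itself, which is the convenience the API idea was aiming at.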
[14:48:44] Zaran, Rtdwivedi I'm back
[14:49:13] Raylton: ok
[14:49:32] Rtdwivedi: as I currently understand, the integration will be very minimal
[14:49:48] Rtdwivedi, not yet, low priority
[14:50:25] Raylton: I'm reading the JSON schema
[14:50:31] what is the "source" of a section?
[14:51:55] source of work
[14:53:22] in case where the book is a collection of various sources/works?
[14:54:28] I cannot answer ... actually this is a point which we will discuss further
[14:54:50] oh ok
[14:55:07] but in principle it is possible to add more than one value
[14:55:35] yeah, in the long run, we could imagine specifying a page interval in a DjVu file here
[14:56:33] but I understand that's not in the current scope of the BM project
[14:57:02] do you have a wireframe of how you intend to handle the tag and ?
[14:58:23] at some point Rtdwivedi suggested a few possible ways to improve them, but it's out of scope at the moment
[14:58:43] my current opinion is that they should stay the same internally
[14:58:53] I figured it would be just one more tool in visual editor
[14:59:09] but we should provide an easy user interface to enter the tag
[14:59:15]
[14:59:38] since the syntax is a bit ugly
[15:00:24] I wanted that the user should not have to list out the pagelist tag. We should use it internally. It should not be visible to the user,
[15:02:56] I could not understand the scope of PP from the beginning. So I suggested a meeting at the beginning (and no one agreed). Our scope was virtually all defined in the "community bonding period"
[15:02:56] And as we did not know what the scope of PP was, we put the integration as low priority
[15:03:14] Raylton: Pardon me if I missed it but I am not very clear on how pageMapping is being used in BM.
[15:04:25] Are you talking about the scope of the PP GSoC project or the extension itself?
[15:04:50] both
[15:05:20] where is your planning page?
[15:05:53] https://www.mediawiki.org/wiki/Extension:Proofread_Page/GSoC
[15:06:58] It is mainly about refactoring so I don't think it will help you much in understanding the scope of the extension itself.
[15:07:39] Raylton: the scope of the extension will stay the same throughout the summer, so thinking about the integration with BookManager can be done on the current set of features
[15:09:41] Do you guys have something ready in the labs?
[15:10:28] Raylton: The features are almost the same as the version that is already deployed on Wikisource.
[15:10:34] Raylton: it's only on Gerrit for now, we'll do extensive testing on labs before merging into master
[15:11:32] but Rtdwivedi is right, since it's mainly refactoring, seeing how the extension works on Wikisource should give you a good idea
[15:12:18] and what has changed in the extension?
[15:13:38] Till now there are no changes that a user would notice but we will be integrating with VisualEditor.
[15:13:42] Raylton: so far, it's only refactoring, mainly rewriting in PHP most of the features which were written in one gigantic and unreadable piece of JavaScript
[15:14:31] Fine... but please, do a mockup
[15:14:47] it's so good to understand
[15:15:27] Sure, will do that. :-)
[15:16:07] if you integrate with visual editor, it will be sufficient to maintain an integration
[15:16:54] and import of metadata from PP is also in our plans
[15:18:20] Raylton: ok
[15:18:45] I'm sorry if your questions felt a bit out of scope
[15:19:07] I think we were not estimating properly the goals of the BM extension
[15:19:12] but thanks for clarifying things
[15:21:18] you intend to use JSON to store this data: https://fr.wikisource.org/w/index.php?title=Livre:Marx_-_Diff%C3%A9rence_de_la_philosophie_de_la_nature_chez_D%C3%A9mocrite_et_%C3%89picure.djvu&action=edit
[15:21:20] ?
[15:22:06] or do you already use it?
[15:22:13] Zaran, Rtdwivedi
[15:23:22] Raylton: it's not JSON at the moment
[15:23:33] it's stored in a MediaWiki template
[15:23:44] I currently don't see where we would use it if it were stored in JSON.
[15:23:47] (you can see it if you request the raw text of the page)
[15:24:10] Raylton: in the long run it would be better to store it in JSON, but for now we kept the historical format
[15:24:35] changing the format would require making a pass over all the already existing Index: pages
[15:24:42] and is not something we can do lightly
[15:26:14] well, then I think you should narrow the scope, because BookManager replaces that template
[15:27:50] yes, we have had this discussion by email about how the metadata should be split between BM and PP
[15:28:09] Raylton: A temporary solution would be to allow export of JSON of the Index data in PP.
[15:29:20] Zaran: What do you think about it?
[15:29:38] Rtdwivedi: good point, it would be easy to implement on our side
[15:30:16] we will also replace this template https://en.wikisource.org/wiki/Template:Header
[15:30:42] in fact we already have substituted both
[15:31:54] Raylton: Ok. Using JSON for both of them?
[15:32:48] json "is" the first. and the second is generated from json
[15:33:46] Ok.
[15:34:52] I think you could only disable both the proofread and keep it with us
[15:35:28] so the scope of PP would be narrower
[15:35:31] ;D
[15:36:18] disable both in proofread
[15:37:08] I see. I don't think we should be doing this hastily. I propose any such action after integration with BM.
[15:37:46] Raylton: indeed, we could narrow the scope of PP like that, but that would create a dependency of PP on BM
[15:37:55] and we would need to enable BM on all Wikisources
[15:38:27] which is something we could do when the integration between the two extensions is seamless
[15:38:32] are you working on the same version of PP?
[15:38:43] Raylton: Yes.
[15:40:26] oh... That leaves you free to be a little less daring
[15:41:08] But I would like to have a backup way in case of integration with BM.
[15:42:00] these two features I mentioned could be deactivated in this case.
[15:42:38] right?
[15:43:49] Not yet. Not until I work on PP side for integration with BM.
[15:44:56] Would that be too conservative? :P
[15:47:11] excuse me?
[15:47:28] btw, in fact the scope of BookManager is very broad. If you manage to provide a friendly interface to add the page range from djvu, we can do the rest
[15:49:17] I mean to say that I want to add 'the integration with BM' part first and then disable the two features.
[15:49:44] I see
[15:51:58] I think now we are clear about 'who is doing what'.
[15:53:12] we:
[15:53:14] Reading interface: metadata/list of chapter/prev-next links
[15:53:45] editing interface: Edit metadata/list of chapter/prev-next links
[15:53:52] :D
[15:54:02] we:
[15:54:21] A user friendly interface to add the page range from djvu
[15:54:34] Switch to JSON instead of templates in Index pages.
[15:55:30] :-)
[15:56:00] I love that: A user friendly interface to add the page range from djvu
[15:57:23] but on json ... we are thinking of leaving it, because there is a small problem processing json in prev-next links in very large books
[15:57:46] thinking about*
[15:58:12] eg 40.000 pages
[15:58:38] What is the problem?
[15:59:14] Raylton: what format would be faster to parse than JSON in this case?
[15:59:24] (besides, are there really books that big?)
[15:59:51] maybe mysql, but it is low priority in gsoc time
[15:59:58] Raylton: Why don't you store the prev/next information in the chapter metadata rather than the book metadata?
[16:00:28] bull's eye marktraceur !
[16:00:38] * marktraceur tries
[16:01:23] Storing smaller chunks of data at a time might be possible, and helpful - the problem with JSON is, if you need a list of chapters, you'll need to either store it somewhere else (duplicating information) or run through a lot of JSON listing them
[16:01:47] But if you're loading a big book all at once (or the list of chapters all at once) you might be OK with that
[16:03:31] So, in the book metadata, just have the list of chapters with the page number at which they begin and the rest can be done in chapter metadata.
[16:03:47] marktraceur, you talked with Molly about it?
[16:04:33] Not about this in particular, no, but I can
[16:07:32] I do not know if I understand what you mean ... but our problem is with the algorithm that processes json. He apparently need to read a large part of the list to find out which the chapters key
[16:07:48] he = it, sorry
[16:09:25] Raylton: Which the chapters key? That doesn't make any sense to me
[16:09:53] Do you mean that the entire JSON structure needs to be loaded, and that takes too much time?
[16:10:03] marktraceur: Which chapter corresponds to which page.
[16:10:07] As opposed to a SQL-like database which can have small parts loaded?
[16:10:13] Rtdwivedi: Gotcha
[16:10:33] never mind, i mean: read a large part of the list to find the current chapter
[16:10:44] Raylton: Maybe the answer is to index stuff that you need to be quickly accessible in the database and update it on edit?
[16:11:15] That way editing is the slow activity, at least
[16:12:58] we do not know yet ...
we are thinking of possibly using cache to work around that, but I'm not the best person to talk about performance issues
[16:13:30] actually found the problem, but the solution is not yet clear
[16:14:03] marktraceur, ^
[16:14:42] KK
[16:15:06] I'll keep an eye out, I'm pretty sure "Throw out ContentHandler" isn't the answer, but maybe I'm wrong
[16:16:50] marktraceur, Please talk with Molly when you can :D
[16:17:22] and matthew
[16:18:17] *nod*
[16:21:44] I thought you could work with .djvu pages (like this: https://pt.wikisource.org/wiki/P%C3%A1gina:Elementos_de_Arithmetica.djvu/20) and "interface to add the page range from djvu"
[16:21:51] Zaran, ^
[16:25:18] Raylton|lunch: can you be more specific?
[16:27:13] Zaran, you take care of these pages https://pt.wikisource.org/w/index.php?title=Especial%3A%C3%8Dndice+por+prefixo&prefix=P%C3%A1gina%3AElementos+de+Arithmetica.djvu&namespace=0
[16:28:29] and we take care of these pages:
[16:28:30] https://pt.wikisource.org/w/index.php?title=Especial%3A%C3%8Dndice+por+prefixo&prefix=Elementos+de+Arithmetica&namespace=0
[16:29:45] Raylton|lunch: I understand
[16:30:10] then you provide an interface to add the page range from djvu so we can use your work
[16:30:20] :D
[16:30:40] Zaran, ^
[16:33:47] Raylton|lunch: when you say "provide an interface to add the page range from djvu"
[16:34:01] where do you want it added?
[16:34:13] is it in the metadata provided by BM?
[16:34:46] Yeah, the schema has that already
[16:34:56] I...think
[16:35:01] The problem is it's hard to index
[16:35:06] AFAICT
[16:35:22] Zaran, like replace , by a way to use visual editor tool
[16:37:14] Zaran, No in metadata, i think
[16:37:40] marktraceur, what does that mean?
[16:40:03] Raylton|lunch: As Far As I Can Tell
[16:40:42] marktraceur, I know it
[16:42:24] I mean the sentence.
[16:44:53] Raylton|lunch: The problem, as I see it, is that when you want to determine whether a page is within a range, you need to parse a big ol' JSON file to find what the range even is.
[16:45:22] Whereas a database would offer you faster access to _only_ the range based on either the current page number or the chapter number
[16:46:18] is this about prev-next or djvu?
[16:46:25] marktraceur, ^
[16:48:24] I'm honestly not sure, entirely, but I'll talk with mwalker and GorillaWarfare about it later
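The lookup problem raised in the discussion (finding the current chapter for a page, and its prev/next neighbours, without scanning a large list) can be illustrated with a small sketch. It follows the suggestion made in the chat to keep, in the book metadata, only the list of chapters with the page at which each begins, and to index that for fast access. Everything here is an assumption made for illustration: the `CHAPTERS` data, the function names, and the choice of an in-memory sorted list with binary search instead of a database index. It does not describe how BookManager actually stores or processes its JSON.

```python
from bisect import bisect_right

# Hypothetical book metadata: chapters sorted by the page where each begins.
CHAPTERS = [
    {"title": "Preface", "start_page": 1},
    {"title": "Chapter 1", "start_page": 10},
    {"title": "Chapter 2", "start_page": 42},
]

# Precomputed index of start pages; it would be rebuilt when the book is
# edited, so that editing is the slow activity and reading stays fast.
START_PAGES = [chapter["start_page"] for chapter in CHAPTERS]


def chapter_index_for_page(page):
    """Return the index of the chapter containing `page` (binary search)."""
    index = bisect_right(START_PAGES, page) - 1
    if index < 0:
        raise ValueError(f"page {page} precedes the first chapter")
    return index


def prev_next(page):
    """Return (previous chapter title, next chapter title) for `page`."""
    index = chapter_index_for_page(page)
    prev_title = CHAPTERS[index - 1]["title"] if index > 0 else None
    next_title = (CHAPTERS[index + 1]["title"]
                  if index + 1 < len(CHAPTERS) else None)
    return prev_title, next_title


if __name__ == "__main__":
    print(CHAPTERS[chapter_index_for_page(41)]["title"])
    print(prev_next(41))
```

With the start pages kept sorted, each lookup needs a logarithmic number of comparisons rather than a pass over the whole chapter list, which is the same kind of gain a database index on the start page would give.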