[16:57:26] hey isaacj, I need to get the topic for 2M Wikidata Items, is it ok if I do this with your API? (it will be a stress test ;)
[17:07:06] dsaez: go for it -- if you crash it, that's fine. i can also provide you the code for doing this in bulk locally too
[17:50:22] what do you mean by the topic?
[17:50:33] dsaez:
[17:52:29] apergos: check this https://tools.wmflabs.org/wiki-topic/
[17:53:10] how does it work?
[17:53:56] apergos: it's based on the topic taxonomy that was developed for ORES. unfortunately there isn't a great overview of this (note to self: work on that) but i'll find a decent link
[17:54:15] halfak: might also have a good link?
[17:55:15] * apergos waits hopefully
[18:12:55] https://www.mediawiki.org/wiki/ORES#Topic_routing
[18:13:02] apergos & isaacj ^
[18:13:11] ooooohh
[18:13:14] * apergos clicks
[18:14:33] what does the model need as an input to make its prediction? wondering how this could be run locally and fed the required data from an xml dump or wikidata entity dumps or whatever
[18:14:55] idly wondering, not seriously 'yep gonna sit down and do this' wondering :-)
[18:14:56] It makes its prediction from the text of the article.
[18:15:03] No other inputs.
[18:15:05] parsed text?
[18:15:07] * halfak gets a demo
[18:15:09] halfak: thanks! didn't realize you'd updated the documentation. much appreciated!
[18:15:13] or wikitext is good enough?
[18:17:24] Wikitext is good.
[18:17:27] https://ores.wikimedia.org/v3/scores/enwiki/12124/articletopic?datasource.revision.text=%22I%20am%20a%20science.%20%20Look%20at%20me%20math.%20%20Derivative.%20%20Probability.%20Statistic.%20%20Divisor.%20%20%22
[18:18:01] It's a prediction for the text "I am a science. Look at me math. Derivative. Probability. Statistic. Divisor."
[18:36:25] hmm so if one was clever, feeding it wikitext from the enwiki article dumps you could get all the predicted topics you wanted
[18:36:37] I'm assuming that running the pretrained model is not too expensive
[18:37:38] I guess there's the matter of how much text to give it and what gets stripped out but these are implementation details
[19:01:36] apergos, I think it would be pretty straightforward to run this model on the dumps directly.
[19:01:48] I might be interested in creating a utility for that if there was a clear use-case :)
[19:02:03] well this comes back to dsaez's question
[19:02:19] if there are a number of folks who want to be able to do bulk requests, maybe it's worth it
[19:07:16] djellel: the intralanguage work for section recommendation we were just talking about is at https://arxiv.org/abs/1804.05995
[19:07:49] apergos, makes sense.
[19:09:06] pro-tip: there are so-called 'multi-stream' dumps with an index, which gives the offset into a file consisting of concatenated bz2 streams, 100 pages per stream; the offset is to the start of the stream containing the 100 pages
[19:09:26] so it's pretty easy to run right on the compressed dumps even
[19:09:44] depends again on the use case
[19:26:48] mgerlach: you may be interested in this article given our conversations about the remote conference: http://hiltner.english.ucsb.edu/index.php/ncnc-guide/
[19:54:33] leila: looks very useful. currently gathering some resources and reaching out to people who can share some experience in organizing a remote conference
[19:57:37] saw that you are giving a talk at remotecon https://www.remotecon.org/
[22:27:48] mgerlach: sounds good. yeah, I'll talk there, and I've also registered to attend at least a few of the sessions to learn how they work.
[22:30:00] mgerlach: (for your tomorrow) I'd love to see innovations in the virtual conference world where the "networking" need is addressed. Every conference that I've attended or been part of the organizing committee for cites "networking" as the number one reason people want to attend a conference. (networking can mean different things in different communities.) Can this need be satisfied in an online set-up?
[22:30:14] if yes, in what ways?
[22:30:30] I love this topic. happy to brainstorm more. :D
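
For reference, the one-off request halfak pastes at 18:17 is easy to reproduce programmatically. A minimal sketch in Python, assuming the `requests` library; the revision ID in the path (12124) is effectively a placeholder once text is supplied through the `datasource.revision.text` parameter, and the response nesting below reflects how ORES v3 replies are typically shaped:

```python
# Score arbitrary wikitext with the enwiki articletopic model by injecting
# text via the revision.text datasource (same request halfak pasted).
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/12124/articletopic"
TEXT = ("I am a science.  Look at me math.  Derivative.  "
        "Probability. Statistic.  Divisor.")

resp = requests.get(ORES_URL, params={"datasource.revision.text": TEXT})
resp.raise_for_status()

# ORES v3 nests results as {context: {"scores": {revid: {model: {"score": ...}}}}}
score = resp.json()["enwiki"]["scores"]["12124"]["articletopic"]["score"]
print(score["prediction"])   # predicted topic label(s)
print(score["probability"])  # probability per topic in the taxonomy
```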
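
The 'multi-stream' dumps apergos mentions at 19:09 pair the big bz2 file with an index whose lines read offset:page_id:title; each offset marks the start of a self-contained bz2 stream holding roughly 100 <page> elements. A rough sketch of seeking straight to one stream and decompressing only that slice; the file names below are assumptions, not taken from the chat:

```python
import bz2

DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"

def stream_offsets(index_path):
    """Yield each distinct stream offset listed in the index, in order."""
    last = None
    with bz2.open(index_path, "rt", encoding="utf-8") as f:
        for line in f:
            offset = int(line.split(":", 1)[0])
            if offset != last:
                last = offset
                yield offset

def read_stream(dump_path, offset):
    """Decompress the single bz2 stream (~100 pages of XML) starting at offset."""
    with open(dump_path, "rb") as f:
        f.seek(offset)
        decomp = bz2.BZ2Decompressor()
        chunks = []
        while not decomp.eof:
            data = f.read(65536)
            if not data:
                break  # truncated file; shouldn't happen with a complete dump
            chunks.append(decomp.decompress(data))
        return b"".join(chunks).decode("utf-8")

# Example: print the start of the first pages stream.
first_offset = next(stream_offsets(INDEX))
print(read_stream(DUMP, first_offset)[:500])
```

From there you could pull each page's wikitext with an XML parser and feed it either to a locally loaded model or to the text-based request above, which is roughly the "run right on the compressed dumps" idea.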
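
On dsaez's bulk question: besides running the model locally on dumps, ORES can also score existing revisions in batches rather than one request per page. A hedged sketch, assuming the v3 batch form with pipe-separated revids; the helper name, batch size, and pause are my own guesses, not documented limits:

```python
# Batch-score existing enwiki revisions with the articletopic model.
import time
import requests

ORES_BATCH_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

def bulk_articletopic(rev_ids, batch_size=50, pause=1.0):
    """Yield (rev_id, articletopic result) pairs for enwiki revisions."""
    for i in range(0, len(rev_ids), batch_size):
        batch = rev_ids[i:i + batch_size]
        resp = requests.get(ORES_BATCH_URL, params={
            "models": "articletopic",
            "revids": "|".join(str(r) for r in batch),
        })
        resp.raise_for_status()
        scores = resp.json()["enwiki"]["scores"]
        for rev_id in batch:
            yield rev_id, scores.get(str(rev_id), {}).get("articletopic", {})
        time.sleep(pause)  # be gentle with the shared endpoint

for rev_id, result in bulk_articletopic([123456, 234567]):
    print(rev_id, result.get("score", {}).get("prediction"))
```

For something the size of 2M Wikidata items, though, isaacj's offer of code for doing this in bulk locally (or halfak's dump utility idea) is likely the better path than hammering the public endpoint.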