[15:33:12] mgerlach: hi.
[15:33:30] mgerlach: is the research office hour today? (I don't see it in the staff calendar.)
[15:33:32] o/
[15:36:03] hmm, I see it
[15:37:41] it might be one hour shifted since I kept the time window based on UTC
[15:37:53] leila: ^
[15:38:23] and yes, it is today
[15:38:50] mgerlach: you're right. it's shifted by one hour on our end. thanks.
[15:39:24] djellel: I'll reschedule our 1:1 in light of the above.
[15:46:51] leila: sorry for the confusion (tried to keep the time consistent at least on one end ;) ), hope you can still make it.
[15:48:30] mgerlach: sure. I can make it work. I was about to cancel the Research Staff meeting and then I realized there is no reason for it. ;)
[16:17:42] isaacj: I reviewed the covid-19 data description. I can send it out to the teams for their review. shall I?
[16:21:57] leila: sure
[16:54:11] office hours will be in 10 mins
[17:00:23] hey everyone, welcome to office hours
[17:01:11] feel free to come forward and drop any questions or items around wikiresearch you would like to discuss
[17:01:54] I will try to answer or forward
[17:05:42] Hello
[17:06:19] o/ (also watching the wiki tech talk on maps but I'll be watching the chat too)
[17:06:24] hi Csisc1994 !
[17:06:28] o/ Csisc1994
[17:07:48] I am Houcemeddine Turki from University of Sfax, Tunisia
[17:08:10] I am a newly elected vice-chair of Wikimedia TN User Group
[17:09:52] I have several questions:
[17:10:37] 1. We are willing to begin working on Wikimedia Analytics. I ask if there are any tutorials about that.
[17:12:15] what types of tutorials are you looking for?
[17:12:49] Hello hello
[17:13:29] hi you all. glad to have you here for the office hours.
[17:13:35] Csisc1994, here you have the slides from a tutorial https://upload.wikimedia.org/wikipedia/commons/d/d0/Wikimedia_Public_Research_Resources.pdf
[17:13:53] Csisc1994, it is about general resources of Wikimedia
[17:14:37] about analytics itself, I don't think there is a tutorial, but stats.wikimedia.org/v2 is pretty intuitive
[17:15:08] Using APIs to browse Wikimedia pageview statistics
[17:15:30] Hi all, thanks so much for hosting these! My name is Marianne Aubin, I am a researcher at Cornell working on an analysis of including celebrity names in news headlines vs excluding them to create an information gap. I have a dataset of headlines published between 2013 and 2015, and I’d like to create a dataset of ‘famous people’ at the time
[17:15:31] and some sort of ‘famousness scale’. I have a few questions related to this endeavour. I am hoping you’re able to point me to some general directions that I can explore. I am happy to use any API to retrieve these (my preferred language is Python): 1. Given a Wikipedia article, what is the best way to validate whether it is an article about a
[17:15:32] “person”? 2. Given an article, what is the best way to check the # of pageviews it received in the timespan of 2013-2015? - On the listserv, someone pointed me to https://cricca.disi.unitn.it/datasets/. This is an awesome resource but seems somewhat unwieldy for what I am trying to achieve. Might there be any better dataset I can use? - 3. If
[17:15:32] I were to try and go an exhaustive route (pull all articles deemed a ‘person’ on wikipedia), how could I go about this? Would I have to scour every single list of lists under the “Person” category?
[17:15:41] dsaez: might help with APIs?
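For the pageview question above, a minimal Python sketch of the per-article Pageviews REST API that gets linked in the discussion below, assuming the requests library is available. As noted later in the chat, this API only has data from mid-2015 onward, and the User-Agent contact string is a placeholder.

    # Monthly pageviews for one article via the Wikimedia Pageviews REST API
    # (per-article data is only available from mid-2015 onward).
    import requests

    def monthly_pageviews(article, project="en.wikipedia",
                          start="2015070100", end="2016070100"):
        """Return (timestamp, views) pairs for the article between start and end."""
        url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
               f"{project}/all-access/all-agents/"
               f"{article.replace(' ', '_')}/monthly/{start}/{end}")
        # Wikimedia asks API clients to identify themselves; the contact address is a placeholder.
        headers = {"User-Agent": "research-example/0.1 (you@example.org)"}
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        return [(item["timestamp"], item["views"]) for item in resp.json()["items"]]

    print(monthly_pageviews("Albert Einstein"))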
[17:16:04] (sorry the formatting didn't turn out so nice, and I hope it is ok to interject with my questions)
[17:16:16] Csisc1994: analytics has some documentation on wikitech https://wikitech.wikimedia.org/wiki/Analytics
[17:16:34] and more specifically, this is the documentation for the Pageviews API: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
[17:16:53] o/ marianne
[17:17:04] Csisc1994, you can fork this repo https://github.com/digitalTranshumant/Wiki-examples/blob/master/WikiMediaPublicTools.ipynb
[17:17:21] There you have snippets of code to interact with most of the APIs
[17:18:20] dsaez mgerlach This is absolutely useful. Thank you.
[17:18:39] marianne, check the same notebook I linked before for pageviews https://github.com/digitalTranshumant/Wiki-examples/blob/master/WikiMediaPublicTools.ipynb
[17:18:54] hi marianne! thanks for your questions. for your problem I would probably use Wikidata. Through Wikidata, you can generate lists of items with specific characteristics (e.g. people, monuments etc), and then retrieve the Wikipedia articles which are linked to the items in the list
[17:18:56] or check the Pageviews API that isaacj has linked
[17:19:23] I believe the Pageviews API doesn't go back as far as 2013 sadly
[17:19:30] oh
[17:19:41] that is problematic
[17:20:04] there are pageviews from 2013, but the methodology to count them changed in 2015
[17:20:14] so results are not really comparable
[17:20:26] marianne: trying to get an example
[17:20:48] thank you so much miriam that would be super helpful!
[17:21:31] How about this https://dumps.wikimedia.org/other/pagecounts-raw/
[17:21:48] marianne https://dumps.wikimedia.org/other/pagecounts-raw/ those are the old ones
[17:22:13] you can use those dumps, you will just need to process them. you have data from 2013 to 2016
[17:22:29] yep all my articles are between jan 2013 and April 2015, so I think they would all need the old method
[17:23:12] taking a look now dsaez!
[17:23:40] marianne one way to check is Wikidata
[17:24:02] marianne: every article has a Wikidata item
[17:24:46] typically this is done by checking whether instance_of (P31) == human (Q5)
[17:24:48] https://www.wikidata.org/wiki/Property:P31
[17:24:58] https://www.wikidata.org/wiki/Q5
[17:25:32] 2. I ask if there are tasks of Wikimedia Research that can be allocated to research teams
[17:27:22] marianne: this approach is used to track gender bias based on male/female biographies in different projects, see for example http://wmdeanalytics.wmflabs.org/WDCM_BiasesDashboard/
[17:27:23] marianne: example in quarry: all wikidata items about cats, with their English Wikipedia articles (you can change the language or remove the line if you want to see articles in all languages)
[17:27:37] https://w.wiki/LDV
[17:28:35] Csisc1994: do you mean in the form of collaborations?
[17:28:57] Yes
[17:29:01] marianne now for people, you might not want to use SPARQL and use the Wikidata JSON dumps instead, to generate lists and get the links to Wikipedia: https://www.wikidata.org/wiki/Wikidata:Database_download
[17:29:21] marianne: this is because people is a very large category and the SPARQL endpoint might time out
[17:30:13] marianne: Did you say Python or R?
[17:30:36] Csisc1994: yes, we do collaborations with external researchers, see for example our list of formal collaborators https://www.mediawiki.org/wiki/Wikimedia_Research/Collaborators
[17:31:00] I'm more familiar with python but I can use R if it's easier for this. thanks all for the responses, I'm reading through everything now!
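For the old pagecounts-raw dumps mentioned above ("you will need to process them"), here is a minimal sketch of aggregating one hourly file, assuming the documented whitespace-separated layout of project code, page title, view count, and bytes transferred; the file name in the usage comment is a placeholder.

    # Aggregate per-title view counts from one hourly pagecounts-raw file
    # (the pre-2015 dumps discussed above). Each line is assumed to be:
    #   <project> <page_title> <view_count> <bytes_transferred>
    # with titles using underscores for spaces, e.g. "en Barack_Obama 4721 123456789".
    import gzip

    def hourly_counts(path, project="en", titles=None):
        """Return {title: views} for the given project, optionally restricted to `titles`."""
        counts = {}
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4 or parts[0] != project:
                    continue
                title, views = parts[1], int(parts[2])
                if titles is not None and title not in titles:
                    continue
                counts[title] = counts.get(title, 0) + views
        return counts

    # Example with a hypothetical local file name:
    # print(hourly_counts("pagecounts-20130101-000000.gz", titles={"Barack_Obama"}))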
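For the instance_of (P31) == human (Q5) check just described, one way to do it for a single article is the wbgetentities endpoint of the Wikidata API (linked a bit further down); a minimal sketch, again assuming the requests library and a placeholder User-Agent.

    # Check whether an English Wikipedia article's Wikidata item has
    # an instance-of (P31) statement whose value is human (Q5).
    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def is_human(title, site="enwiki"):
        """True if the article's Wikidata item has a P31 claim with value Q5."""
        params = {"action": "wbgetentities", "sites": site, "titles": title,
                  "props": "claims", "format": "json"}
        # Placeholder User-Agent; Wikimedia asks API clients to identify themselves.
        headers = {"User-Agent": "research-example/0.1 (you@example.org)"}
        entities = requests.get(WIKIDATA_API, params=params,
                                headers=headers).json().get("entities", {})
        for entity in entities.values():
            for claim in entity.get("claims", {}).get("P31", []):
                value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
                if isinstance(value, dict) and value.get("id") == "Q5":
                    return True
        return False

    print(is_human("Angela Merkel"))  # expected: True
    print(is_human("Eiffel Tower"))   # expected: False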
[17:31:20] mgerlach I ask if there is a process to establish such a research collaboration
[17:31:25] marianne: to process a Wikidata dump from Python: https://gist.github.com/mcobzarenco/863af69690fb44eb22a4
[17:32:05] wow GoranSM thanks! this is very useful for me too :D
[17:32:09] leila: Csisc1994 is asking about ways to establish research collaborations
[17:32:35] marianne: in R, see: https://github.com/GoranMilovanovic/R-Ladies_Belgrade_20190911, and then RWikidata_RLadies20190911.nb.html Section #3
[17:32:59] last link marianne, not to overwhelm you too much, but isaacj suggested you try this API endpoint for getting wikidata claims for a wikipedia article https://www.wikidata.org/w/api.php?action=help&modules=wbgetentities
[17:33:00] mgerlach I also ask if you can make calls for contributors to work on research tasks related to the projects of the WMF Research Team.
[17:33:01] miriam: there's also a Python package to process the dump but I can't remember which one right now
[17:33:33] thank you goranSM those GitHub links look super useful!
[17:33:45] marianne: Keep in mind that Python and R are not the most efficient languages to process the dumps. If you need something that will run every now and then and efficiently, Wikidata Toolkit in Java is the way to go.
[17:33:54] marianne: You are welcome.
[17:34:04] GoranSM i'll look for it! thanks!
[17:34:12] miriam: anytime.
[17:34:17] also whoa Miriam thanks for that API!
[17:34:34] goranSM: It's a one-time effort only, luckily for me :)
[17:35:27] no problem marianne :) thanks for your questions, I am also discovering new things!
[17:36:25] marianne: in R, to process the whole dump *without parallelization* - and that means you extract the .bz2 archive once and as a single file - it might take 10 - 15 hours or more, depending on what you need to do. I wouldn't say Python could do any better; both are interpreted.
[17:36:27] ok last question and then I'll take some time to dig through all the amazing resources shared here! I came across this user who has posted articles that could be helpful like https://en.wikipedia.org/wiki/User:West.andrew.g/2013_popular_pages and https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages Do you have any idea how reliable these
[17:36:28] are?
[17:37:42] goransm: gotcha on processing time, I'll see how it pans out
[17:37:53] 3. Are you willing to create a Wikimedia Research Taskforce about COVID-19?
[17:37:58] isaacj: do you think outreachy or gsoc could be useful for Csisc1994
[17:38:28] marianne: Finally, here's the Python package that provides classes to work with Wikidata + dump processing: https://qwikidata.readthedocs.io/en/stable/readme.html
[17:38:56] marianne: re who constitutes a celebrity, I'd encourage you to reach out to Bob West, our research fellow. He may be able to point you in a good direction. (There are things you can do with Wikidata, I'm not sure at the moment how well populated wd is on this front and what kind of biases that can introduce to your study. For example, you can look at https://www.wikidata.org/wiki/Property:P1258 and see if the
[17:38:56] value for it includes the word "celebrity".)
[17:39:09] Csisc1994: We will have a meeting tomorrow to talk about COVID and research. In the meantime you can check some of our work about finding and measuring COVID-19 related pages here: https://meta.wikimedia.org/wiki/User:Diego_(WMF)
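For the exhaustive route discussed above (all Wikidata items that are instances of human, with their English Wikipedia titles), a minimal pure-Python sketch that streams the JSON dump from the Database_download page, assuming the dump's usual one-entity-per-line JSON-array layout; the gist, qwikidata, and Wikidata Toolkit links above are more complete tools for the same job, and as noted above a full pass takes many hours.

    # Stream the Wikidata JSON dump and yield every human with an enwiki sitelink.
    import bz2
    import json

    def iter_humans_with_enwiki(dump_path):
        """Yield (QID, English Wikipedia title) for every item with P31 = Q5 in the dump."""
        with bz2.open(dump_path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.rstrip().rstrip(",")
                if not line or line in ("[", "]"):
                    continue  # skip the array brackets that wrap the dump
                entity = json.loads(line)
                is_human = any(
                    claim.get("mainsnak", {}).get("datavalue", {}).get("value", {}).get("id") == "Q5"
                    for claim in entity.get("claims", {}).get("P31", [])
                )
                if not is_human:
                    continue
                sitelink = entity.get("sitelinks", {}).get("enwiki")
                if sitelink:
                    yield entity["id"], sitelink["title"]

    # Usage with a locally downloaded dump (file name is a placeholder):
    # for qid, title in iter_humans_with_enwiki("latest-all.json.bz2"):
    #     print(qid, title)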
[17:39:55] dsaez That's excellent
[17:40:16] Csisc1994: I'd like to understand your request about tasks better. what kind of tasks are you or your lab interested in?
[17:40:20] mgerlach: for which part? if they want to mentor, the next round is not until later this year and i'm not fully certain of the criteria, though in the past we have co-mentored projects with people who are not staff at WMF. if it's as a participant, then they would have to check whether they meet the outreachy / gsoc criteria
[17:41:29] isaacj Is there a link to learn more about Outreachy and GSoC?
[17:41:51] going off to grab lunch now! thanks all so much for all the help!
[17:41:59] isaacj: as a participant, to contribute to tasks. I know that in the past there were tasks from research (?), not sure about criteria
[17:41:59] leila Minor tasks related to the project you are working on
[17:43:03] Csisc1994 outreachy: https://www.outreachy.org/ and GSoC https://summerofcode.withgoogle.com/
[17:43:17] for more wikimedia-specific links: https://www.mediawiki.org/wiki/Outreachy
[17:43:28] and https://www.mediawiki.org/wiki/Google_Summer_of_Code
[17:43:37] leila e.g. if the WMF Research Team is working on analyzing the reputation of Wikipedia articles in social media, it will need network embedding algorithms.
[17:45:07] @leila Here, you can call for contributors to work on this task
[17:46:13] leila I think that Outreachy and GSoC are excellent methods for that
[17:47:24] 4. We are interested in attending Wiki Workshop 2020 online.
[17:48:28] I ask if we can participate in it online as you said in a previous email.
[17:48:55] miriam: is there more info on attending wikiworkshop online?
[17:50:29] thanks mgerlach, yes Csisc1994! We will send information out soon. It will be a fully virtual workshop, on Hangouts or Zoom, and everyone will be welcome to attend
[17:50:41] we are finalizing the details and will get back to you asap!!
[17:51:12] @miriam Thank you.
[17:51:20] marianne: thanks for dropping by
[17:51:40] These are all my questions. I thank you for answering me.
[17:51:40] no problem Csisc1994! Stay tuned for more info!
[17:52:02] Csisc1994: I understand. Your ask is general. If our activities in GSoC and Outreachy address your request, let's explore that path, as we can use the infrastructure already in place for it.
[17:52:14] miriam I will do that. That is for sure
[17:53:08] Csisc1994: re Wiki Workshop: as miriam said, please stay tuned. (an email will go to wiki-research-l about it by 2020-03-31.)
[17:53:18] leila I asked about that as we would like to let B.Sc. students work on small tasks related to Wikimedia projects
[17:53:26] miriam: I committed us to the end of March. let's do it. ;)
[17:53:51] leila: let's do it!
[17:54:13] Csisc1994: makes sense. and I should say isaacj and miriam have been most active on the Outreachy and GSoC front on our end and will be able to tell you more if you need more details.
[17:54:45] leila That will be excellent.
[17:55:34] It will be useful for improving the Wikimedia-related skills of computer scientists.
[17:56:42] I just ask if isaacj and miriam can give me an overview of the kind of projects they are mentoring through GSoC and Outreachy
[17:56:47] yes, agreed.
unfortunately the submission period is closed to new applicants right now for this summer, but if we run a project in the winter, we can try to remember to broadcast it via wiki-research-l or this channel so that people can be aware and check it out
[17:56:51] Csisc1994: also check out the wikimedia hackathon https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2020
[17:57:08] unfortunately canceled this year
[17:57:37] Csisc1994: this was the task that i am mentoring this summer: T245848
[17:57:37] T245848: Productionize Wikidata-based Topic Model on ORES - https://phabricator.wikimedia.org/T245848
[17:57:39] but hopefully it will be happening next year (in some form)
[17:58:00] mgerlach We are also preparing for 2021.
[17:59:14] Csisc1994: the task I was mentoring during fall was about releasing data dumps from the output of ML classifiers (for categorizing sentences needing citations): https://phabricator.wikimedia.org/T233707
[17:59:25] it's been a great experience!!
[17:59:28] stashbot isaacj This is an important topic. We are also working on Social Media-Based Topic Modelling
[17:59:28] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[18:00:20] mgerlach From what I see, the topics are about Data Analytics and Social Network Analysis
[18:01:17] That is excellent. Thank you.
[18:01:45] Csisc1994: do you mean for possible projects next year?
[18:02:08] Yes
[18:02:17] sounds good
[18:03:05] mgerlach Thank you
[18:03:17] Csisc1994: in case you haven't seen it, check out the research showcase from last week on topic models, including some work from our team headed by isaacj https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2020
[18:03:45] mgerlach I attended the meeting and asked a question
[18:04:01] Csisc1994: great, thanks for coming by, hope we could help
[18:04:40] everyone, thanks for attending the office hours, and for your questions and answers
[18:04:45] mgerlach Definitely yes
[18:05:00] Thank you. Have a nice day
[18:05:19] See you
[18:05:19] Have a good day and be well
[18:05:20] thank you!!
[18:05:55] next office hours will be in one month
[18:06:10] Take care everyone.
[18:06:29] on 2020-04-23
[18:06:52] thanks again to everyone for staying around and sharing their thoughts and questions
[18:09:26] sorry, next office hours are on 2020-04-22
[18:09:56] wednesday