[17:25:13] bmansurov: Olga confirmed that Stephen is available for code review once you make the changes. Let me know if you need my help somewhere. Otherwise, thanks! :)
[17:32:47] leila: thanks, I left a question for Stephen and you. Other than that, everything is clear.
[17:32:57] great.
[17:33:20] I see you've already replied.
[17:41:07] phew
[17:46:25] leila: awesome on the survey emails
[18:08:04] apergos: thanks. and thanks for chiming in the other time.
[18:08:11] tizianop: I'd like to say hi to you. ;)
[18:08:36] leila: hi!!!
[18:09:11] tizianop: it's been a long time. all good on your end?
[18:10:16] leila: yes, everything good, thanks! You?
[18:10:58] tizianop: I'm doing well. adjusting to the void the conclusion of thewebconf has created in my life. ;)
[18:12:00] tizianop: now that you're here: what is one question that you are excited to learn the answer to when it comes to citation usage data? (your question can change over time, I want to know how you think now ;)
[18:12:02] leila: Yaaay, I heard that was a success! Congrats!
[18:12:10] tizianop: thanks!
[18:20:34] leila: there are many questions, but I got stuck on a lot of technical issues :( For example, the big limitation is that we don't have a way to understand the navigation behavior of the users.
[18:21:12] leila: An interesting question is whether there is any linguistic feature or "type of statement" in the referenced sentence that attracts user attention (first a hover and maybe a click).
[18:22:47] tizianop: re technical items. why do you call that a limitation? You can connect the data to webrequest logs and get that info. (though I confess it's not super straightforward, but that's what we have done in the past for readers)
[18:22:55] leila: using some sentence classification like in the paper "citation needed" to understand when people check the source: controversial, unclear, etc.
[18:26:03] tizianop: are you building prediction tasks? (1. build a prediction model that predicts whether the user clicks on the in-line citation and, if yes, see what features characterize the prediction. 2. given that the user clicked on the in-line citation, predict whether the user reads the citation itself further, and again see what features characterize this activity)
[18:26:12] tizianop: or are you focusing on one prediction task only?
[18:26:58] miriam_: fyi, I'm talking with tizianop about what he finds interesting in terms of the current research questions for citation research.
[18:28:42] hey leila and tizianop, sorry, I need to run, but I'll read what you have here later!
[18:28:56] miriam_: np. the ping was just FYI.
[18:29:38] leila: yes, but there is no trivial way to link the events with the web-request logs. In the logs there is only the IP (no session token) and a good portion (>16%) are for sure shared IPs (like all the users with Google Fiber share the same IP). Not a dramatic limitation, but it added a bit of overhead in getting a clean dataset
[18:30:52] tizianop: don't you have user_agent information on both ends? that plus IP should give you some reasonable accuracy.
[18:33:53] leila: the prediction task with linguistic features of the statement was the goal of the last couple of weeks, but we didn't find any strong predictive feature (spaCy to get the entities of all the referenced sentences + features of the page: topic, length, etc.)
[18:36:33] leila: now, the direction for the next weeks is to study the behavior in the page by representing the sequence of actions in a transition matrix (hover -> hover -> click)
[18:37:33] leila: we hope there is a stronger signal there
[18:40:44] tizianop: I see. I'll think about it, too, though I know you have great people already on the team. :) looking forward to learning more.
[18:44:02] leila: regarding the user-agent, yes, but there is still around 14% of ambiguous page-views. In short, 14% of the page-views are generated by IP + user-agent pairs with more than 480 pages per day (= 1 page per minute for 8 hours without breaks). I guess we have to live with this limitation and discard these events
[18:45:03] tizianop: you absolutely have to live with that limitation. ;)
[18:45:40] tizianop: the only way to properly address that is to have a unique device number shared across these datasets, which we don't have. I highly recommend that you manage the research around this constraint (which is a very messy one).
[18:46:12] tizianop: Bob and Miriam are fully aware of this. This is what we had to deal with in Why the World Reads Wikipedia, too. You really shouldn't take it upon yourself to solve it. :D
[18:47:23] tizianop: If you have other shared features, you can recover a few more percentage points, for example, if you have browser language, or when you build sessions (because the timing of some requests can help resolve some of the unassigned items). but generally, we have learned that the extra features only help slightly and they may not be worth the effort.
[18:49:05] leila: Yes, this unique identifier is what I'm working on now! Launching a big Spark job soon :)
[18:49:54] tizianop: good. I'll be quiet then. ;)
[18:50:01] tizianop: good luck!
[18:52:53] leila: Thanks! And BTW, I'll send you an email soon! I'll be in the Bay Area for two months working remotely (mainly on this project)!
[18:55:04] tizianop: eh! good to know that it's happening. I won't be around until June 28, but after that am mostly around and look forward to coworking with you. :)
[18:56:51] leila: I'll be there between 8 July and 9 September \o/
[18:57:06] great.
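
Editor's sketch (not from the conversation): two of the ideas discussed above can be illustrated in a few lines of plain Python. The first function drops page-views whose IP + user-agent fingerprint exceeds the 480-views-per-day threshold quoted in the chat; the second builds a first-order transition matrix from per-session action sequences such as hover -> hover -> click. The record fields (`ip`, `user_agent`, `day`), the function names, and the in-memory list representation are all assumptions made for illustration. The actual pipeline is described only as a Spark job, and its schema is not given here.

```python
from collections import Counter, defaultdict

# Threshold quoted in the chat: 1 page per minute for 8 hours = 480 per day.
MAX_VIEWS_PER_DAY = 480


def drop_heavy_fingerprints(pageviews, max_per_day=MAX_VIEWS_PER_DAY):
    """Keep page-views whose (ip, user_agent, day) fingerprint has at most
    max_per_day views; the rest are treated as ambiguous and discarded.

    Each page-view is a dict with hypothetical keys 'ip', 'user_agent', 'day'.
    """
    counts = Counter((pv["ip"], pv["user_agent"], pv["day"]) for pv in pageviews)
    return [
        pv for pv in pageviews
        if counts[(pv["ip"], pv["user_agent"], pv["day"])] <= max_per_day
    ]


def transition_matrix(sessions):
    """Row-normalised first-order transition probabilities between actions.

    `sessions` is an iterable of action sequences, e.g. ["hover", "click"].
    Returns a nested dict: matrix[src][dst] = P(next action is dst | src).
    """
    counts = defaultdict(Counter)
    for actions in sessions:
        # Count every consecutive pair of actions within a session.
        for src, dst in zip(actions, actions[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }
```

As the chat notes, a filter like this only works around the missing shared device identifier; it does not resolve which user generated each remaining page-view.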