[14:15:11] o/ everyone
[14:20:36] * leila waves to the channel as she prepares for breakfast and then catches the train.
[14:28:13] leila: o/
[15:06:47] hi bmansurov
[15:07:15] bmansurov, do you know any python package to parse the wikidata json dumps?
[15:07:34] o/
[15:07:46] dsaez: unfortunately, no
[15:08:07] JSON should be easy to parse as is, unless Wikidata complicates it.
[15:08:32] I've tried the simplest thing, but just reading line by line doesn't work for me
[15:09:30] dsaez: are the dumps line by line or one giant JSON structure (multiline)?
[15:09:54] you may have to read the whole file at once and parse
[15:10:07] I'm not sure, it looks like there are some breaklines (\n) in the parse
[15:10:35] I've tried the obvious thing, read line by line, and decode the json, but I get errors
[15:11:13] Since you're getting an error, I'd guess you'll have to convert the whole thing to a python object at once.
[15:12:57] I'm not sure, I don't think so, it doesn't make sense to have one huge object of 46G
[15:14:09] it's that big? I thought it was about 5GB for some reason
[15:14:26] what's the error you're getting?
[15:16:24] According to this: https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_(recommended) you can do line-by-line.
[15:17:23] dsaez: you may want to take inspiration from a PHP library: https://github.com/JeroenDeDauw/JsonDumpReader
[15:29:18] bmansurov I'll double check
[15:29:48] it would be nice to add a package to the media wiki utils lib for doing this, I'm not sure what the process for that is
[15:30:11] dsaez: pull request?
[15:30:21] hehe yes
[15:30:31] but, I mean, who has done the previous packages?
[15:30:41] here comes:
[15:30:42] git log
[15:39:08] amazing
[15:49:54] dsaez reading wikidata dumps in python was very smooth for me
[15:50:08] really, do you have any example?
[15:50:35] dsaez: are you using these?
/mnt/data/xmldatadumps/public/wikidatawiki/entities/latest-all.json.g
[15:50:36] it's very likely that I'm doing something stupid :D
[15:50:40] dsaez: are you using these? /mnt/data/xmldatadumps/public/wikidatawiki/entities/latest-all.json.gz
[15:50:40] yep
[15:51:44] with gzip.open(filename, "rb") as file:
[15:51:44]     f = file.readline()
[15:51:44]     for line in file:
[15:51:44]         d = json.loads(line[:-2])  # drop the trailing ",\n" characters
[15:52:45] dsaez ^
[15:55:00] ok
[15:55:04] I'll try
[15:55:05] thanks
[15:55:39] you're welcome, it worked on a version from, say, 6 months ago; if they didn't change it, it should be fine :)
[16:00:00] maybe the trick is the -2, I was trying with strip
[16:00:12] and with eval, that works sometimes.
[16:33:09] hey miriam_ are we meeting or do you prefer to postpone?
[16:38:58] hi lzia!
[16:39:45] dsaez: hi.
[16:39:50] dsaez: what's up?
[16:40:05] isaacj: I just got to the office. I moved our meeting to 5 min from now. Let's meet then/there.
[16:40:19] lzia: perfect -- see you then
[16:41:52] dsaez: I managed to not come to email of almost any kind for the past 3 days. I can't believe I managed it so easily. ;)
[16:42:20] dsaez: we ended up going to Pasadena to visit a couple of friends who moved there recently. The trick with me is to get out of the area, I think. ;)
[16:42:22] lzia, excellent
[16:42:28] nice
[16:42:29] dsaez: did you have a good weekend?
[16:42:47] and don't have any notifications on the phone, so you go to the email and not the email to you
[16:43:17] Good, I'm becoming a bit of a geek (I was sure that I wasn't): I spent my Sunday reading a deep learning book, for fun.
[16:43:38] dsaez: yeah. so the thing with my phone is that the apps somehow die after a few min of inactivity, and unless I actively click on them I don't see any notifications. The bug is a real feature. ;)
[16:43:59] dsaez: that is incredible. You're my role model.
[16:44:06] haha
[16:44:11] dsaez: I promise to do better after TheWebConf.
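The line-by-line approach discussed above can be fleshed out as a self-contained sketch. This is a hypothetical example, not the real pipeline: it builds a tiny fake dump in the same shape as `latest-all.json.gz` (a JSON array with one entity per line, each non-final line ending in a comma), and it uses `rstrip` instead of the fixed `line[:-2]`, so it tolerates both `\n` and `\r\n` line endings — which may explain the errors seen with plain `strip`.

```python
import gzip
import json
import tempfile

# Build a tiny fake dump mimicking the Wikidata format:
# an opening "[", one JSON entity per line ending in ",", a closing "]".
# (Hypothetical stand-in data; the real file is ~46 GB, so streaming matters.)
fake_entities = [{"id": "Q1", "type": "item"}, {"id": "Q2", "type": "item"}]
with tempfile.NamedTemporaryFile(suffix=".json.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("[\n")
    f.write(",\n".join(json.dumps(e) for e in fake_entities))
    f.write("\n]\n")

def read_dump(path):
    """Yield one entity dict per line, skipping the enclosing [ and ] lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # strip the newline (\n or \r\n) and the trailing comma, if any
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            yield json.loads(line)

entities = list(read_dump(path))
print([e["id"] for e in entities])  # → ['Q1', 'Q2']
```

Streaming with a generator keeps memory flat regardless of dump size, which is the point of the line-by-line format.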
;)
[16:44:22] dsaez: which book is that?
[16:44:27] https://livebook.manning.com/#!/book/deep-learning-with-python/chapter-1
[16:44:31] * leila looks up
[16:45:04] dsaez: right. I see how you can get attracted to reading it.
[16:45:21] nothing new there, but it's the best explanation of AI / ML / Deep Learning that I've read, it's very helpful for organizing ideas
[16:45:37] bmansurov, if you have time, have a look at that link
[16:51:43] dsaez: OK, I'll take a look.
[17:06:32] dsaez: re Scoring Platform meeting: I'm thinking we suggest to move it to some time earlier in one of the days in the week. otherwise, the conflict is not resolvable.
[17:06:36] dsaez: what do you say?
[17:07:52] sure
[17:09:47] btw, leila do we have money for buying books?
[17:10:09] dsaez: yes. talk to Dario.
[17:10:21] good.
[17:11:26] bmansurov: re T208622, Balazs and Daniel Zahn will help with it. We have some homework though. :)
[17:11:27] T208622: Import recommendations into production database - https://phabricator.wikimedia.org/T208622
[17:12:17] leila: what is it?
[17:12:25] bmansurov: It's not entirely clear to them what should be done though. Can you provide some additional details on the task? From Mark: For example, are we talking about just a single, manual one-time data load into MySQL, or is more needed (automation, Puppet work)?
[17:13:04] leila: OK, I'll update the task.
[17:13:27] bmansurov: thanks.
[17:13:49] bmansurov: I'm going to assume this is going forward. let me know if you need my help.
[17:14:06] leila: yes, thanks for helping. I'll get in touch with Daniel.
[17:14:32] yup
[17:14:51] leila: also, thanks for pinging Marko.
[17:15:21] bmansurov: just remember that Daniel is for Service Operations and Balazs for DBA. You will likely need help from both, but they're aware of the task.
[17:15:34] ok, great
[17:15:40] bmansurov: re Marko: is everything resolved now or do you want me to do anything else?
[17:16:02] leila: good for now
[17:16:12] bmansurov: perfect.
I shall close that thread then.
[17:16:42] leila: OK
[17:22:27] leila, I'll be 5 mins late for our 1:1,
[17:22:33] halfak: o/
[17:22:59] halfak: what is the IRC handle of James Hare? :) I can't find it in the welcome email or https://meta.wikimedia.org/wiki/User:Harej_(WMF) (and I know I should know it:)
[17:23:08] dsaez: no worries. I'll update the calendar.
[17:23:48] leila: harej
[17:25:31] miriam: welcome to the office. ;)
[17:25:37] dsaez: it's an overview article, not a textbook, but the one that I find useful to go back to from time to time as a quick reminder and to share with people newer to ML: https://cacm.acm.org/magazines/2012/10/155531-a-few-useful-things-to-know-about-machine-learning/fulltext
[17:35:51] isaacj that link requires login
[17:36:24] gah - sorry. try this: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
[17:36:43] btw that reminds me of a good meme about the difference between AI and ML: "if it's written in Python, it's probably ML; if it's written in PowerPoint, it's probably AI" :D
[17:37:05] :thumbs up:
[17:37:45] niiiiice.
[18:07:42] bmansurov: is it expected that I get a bad gateway at http://gapfinder.wmflabs.org/v1/en.wikipedia.org/section/article/Nelson_Mandela ?
[18:08:25] leila: yes, when the server is under load.
[18:09:00] leila: we recently increased the server timeout for Tiziano's demo, but the problem is still there.
[18:19:54] miriam: Can we start our meeting 30 mins late?
[18:26:02] bmansurov: sure, depends on when leila wants to go for lunch :)
[18:26:21] let's move it and worst case I leave a little earlier
[18:29:55] miriam: done
[19:06:49] bmansurov: re the server being under load: is this a problem we have a task for somewhere?
[19:07:03] bmansurov: it would be good to have some minimums in place to be able to keep this kind of output up.
[19:07:21] bmansurov: or, is this not the right approach and we have to host it somewhere else?
[19:08:10] dsaez: I gave a heads up to ebernhardson just now that you may reach out to him about morelike improvements and testing. (and now you can meet each other here)
[19:09:36] hi :)
[19:21:15] leila: I'm not sure if a task exists. The server is set up to do computationally expensive operations on the fly. I think one way to solve the issue is to pre-generate some of the data.
[19:21:43] leila: increasing the server capacity may help a little bit, but I doubt it's the right approach.
[19:22:07] leila: I'll create a task and we can talk about prioritizing it.
[19:22:13] o/ ebernhardson
[19:22:18] bmansurov: I see. pre-generating it makes sense, and updating it on a monthly basis. Let's see what the results of the new line of research in section recommendation will be and we can decide how to proceed here.
[19:22:24] bmansurov: sounds good.
[19:22:29] ok
[19:26:04] Hey there, as a really blunt question, is Wikimedia Research hiring at all? Or how does one become a Wikimedia Research researcher?
[19:28:58] alexrudnick: Hi. At the moment, we don't have any open positions in the Research team.
[19:29:26] alexrudnick: All openings in our team will go to https://wikimediafoundation.org/about/jobs/ so I suggest you keep an eye on that page.
[19:31:47] alexrudnick: all open positions will also be posted on wiki-research-l, so if you're on that list, you will receive a note about them: https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[19:31:57] alexrudnick: let me know if this helps
[19:31:57] leila: Thanks so much!
[19:32:23] miriam: meeting?
[19:33:26] * leila reads up alexrudnick's page
[19:35:47] alexrudnick: do you work on http://hltdi.github.io/ ?
[19:38:58] leila: That's my old research group from graduate school, yes!
[19:39:10] '
[19:39:13] 'got you.
[19:39:20] * leila reads up
[19:51:15] ebernhardson hi
[19:51:45] I'll ping you tomorrow
[19:51:53] dsaez: sure
[19:52:03] Thx
[20:05:14] dsaez: are you still working tonight?
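The pre-generation idea discussed above — paying the expensive cost once in a batch job (e.g. monthly) and serving lookups at request time, instead of computing on the fly — can be sketched as a toy illustration. Everything here is hypothetical: `expensive_recommendations` and the article names are placeholders, not the real gapfinder pipeline.

```python
import time

def expensive_recommendations(article):
    """Placeholder for the computationally expensive per-request work
    (e.g. a morelike query); sleeps briefly to simulate cost."""
    time.sleep(0.01)
    return [f"{article}_related_{i}" for i in range(3)]

# Batch step: pre-generate results for known articles once,
# e.g. in a monthly job, and store them in a lookup table.
precomputed = {a: expensive_recommendations(a) for a in ["Nelson_Mandela"]}

def serve(article):
    """Request path: serve from the precomputed table when possible,
    falling back to on-the-fly computation only on a cache miss."""
    return precomputed.get(article) or expensive_recommendations(article)

print(serve("Nelson_Mandela"))  # → ['Nelson_Mandela_related_0', 'Nelson_Mandela_related_1', 'Nelson_Mandela_related_2']
```

The trade-off is staleness versus load: precomputed results lag behind the data by one batch cycle, but request latency no longer depends on the expensive computation.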
[20:05:21] tonight<->now
[20:26:15] leila, partially, why?
[20:27:44] dsaez: ok. I have two questions for you re quiddity's feedback about the sections he saw when doing synonym labeling. One is: are the "ambiguous" sections the ones that are frequent? (I remember that you used only frequently used sections and it's slightly surprising that there are ambiguous section titles in enwiki there).
[20:28:11] Heyo :)
[20:28:12] dsaez: my second question is, what's the deal with section titles such as X or numbers? Are these frequent, too? :D
[20:28:18] quiddity: o/
[20:28:38] Yes, they are
[20:29:17] dsaez: hmm. fascinating. if you run into article examples for those, please share a sample. (like for X ;)
[20:29:27] But for the synonyms we sampled a bit differently, I need to check the code to recall
[20:29:57] dsaez: ok. and when you check it, let's put the description in the paper, too, if it's not already there.
[20:29:57] I think these come from indexes
[20:30:07] dsaez: what index?
[20:30:58] There are pages with lists of things
[20:31:02] Cities in
[20:31:20] Let me see if I find an example
[20:32:02] I don't have the code in front of me
[20:32:18] dsaez: totally fine to look at it tomorrow
[20:32:25] Ok
[20:49:09] leila, Probably these https://en.wikipedia.org/wiki/Category:Wikipedia_indexes but I'd think of them as an edge case in mainspace. -- I'm mildly surprised if they were included in the data? -- Although, I think I'd (assumed? read?) that the headings that we were given had been excerpted from a specific subset of articles? But if they were taken from all of Wikipedia then the ambiguous headings (in my list) do make sense as things that would occur frequently. :)
[20:50:39] quiddity: I'll respond when I'm back from lunch.
[20:51:06] no rush :)
[20:52:32] similarly the "1930s" headings probably would've come from lists like Timelines, e.g.
https://en.wikipedia.org/wiki/Timeline_of_LGBT_history#1930s
[20:54:52] In summary: they make more sense, now that I've thought about it more and checked my assumptions! But they were still confusing to encounter in the spreadsheet. That's all :)
[21:00:58] (Ah, and "Palmarès" is simply used in a bunch of old cycling articles! e.g. https://en.wikipedia.org/w/index.php?title=Frans_Verbeeck_(cyclist)&oldid=169448875 - and it's probably considered normal usage of a loanword per https://en.wikipedia.org/wiki/Glossary_of_cycling#palmar%C3%A8s)
[21:02:39] * quiddity also does lunch, + errands. bbl.
[21:08:53] Krenair: o/ I was wondering if you could give me access to create pages in the MediaWiki namespace at https://en.wikipedia.beta.wmflabs.org ?
[21:09:21] bmansurov, what's your username?
[21:09:30] Krenair: Bmansurov (WMF)
[21:10:37] bmansurov, try now
[21:11:20] Krenair: works. Thanks a lot!
[21:34:31] halfak: I forgot that we have a meeting. heading there now. sorry
[22:12:46] quiddity: I added some of your notes to our etherpad for this research and we will discuss them next time we meet. We may want to consider removing some of the article types prior to running the algorithm (for example the index-type ones) https://etherpad.wikimedia.org/p/stubsExpansion
[22:12:55] quiddity: thanks for the feedback.
[22:28:29] dsaez: just FYI, you asked when WSDM's camera-ready submission deadline is. It's on December 2nd.
[22:58:10] halfak: when you get a chance, can you check https://www.mediawiki.org/wiki/Wikimedia_Research/Collaborators and make sure the collaborations listed with you as point of contact are updated? If any of them can be archived, please move them to https://www.mediawiki.org/wiki/Wikimedia_Research/Collaborators/Archive or let me know and I'll do it.
[22:58:40] halfak: not sure, for example, if we should keep Amir and Morten's work there as collaborations. :)
[23:31:19] leila, roger. I'll get it squared away tomorrow :)
[23:31:27] halfak: thanks! :)