[01:39:08] halfak, you around?
[14:50:34] morning
[14:51:49] Hey Ironholds
[17:21:30] hey all! I won't make it to today's standup, conflicting schedule. Am making progress and plan to have more to report on Friday
[17:23:12] Nettrom, cool. Gonna be up for some squash tomorrow?
[17:23:25] yeah, I can play
[17:23:35] you thinking right after class at noon?
[17:23:52] Yup
[17:24:31] got it on my calendar, should be no problem
[17:24:43] will make a note to reserve a court later today
[17:25:01] Thanks :)
[17:26:02] can even send you an invitation
[17:26:05] hi-tech!
[17:26:44] ok, gotta go, see y'all later
[18:04:09] Ironholds, leila, standup!
[21:20:19] so are we doing the staff meeting or was it cancelled?
[21:20:51] Ironholds: there was an email from Toby about it yesterday that it's cancelled.
[21:21:04] okie-dokes
[21:24:46] halfak, got a minute?
[21:25:54] fhocutt, might be in and out, but I'll read scrollback if I get pulled away.
[21:27:03] would you be willing to take a look at the bot I've been writing?
[21:27:11] https://github.com/fhocutt/matchbot
[21:27:32] Ahh... That'll take a bit to get to. I can promise to give you notes by EOD.
[21:27:43] thanks very much!
[21:27:47] Anything in particular to look at?
[21:30:15] hm--it works, and I'm going to be fixing up the PEP8ification. I'd like feedback on any potential trouble points you see, and if you have general comments on the structure those would be helpful
[21:30:52] pull requests welcome?
[21:31:25] oh, and anything you notice that could make it easier to test (granted, it is going through the API so that's inherently more complicated)
[21:31:56] if you feel like it, yes, but the more commentary there is the more useful it will be for me
[21:32:10] thanks very much.
[21:33:54] I'm going to be adapting this to work in the IdeaLab and haven't done too much on that yet, so even if something's not worth fixing in this one it'll be easier to do it in the IdeaLab version
[22:30:44] halfak: I just noticed that the PMID dataset has half the records of the original snapshot from October, which sounds suspicious
[22:31:04] Yeah, there were dupes in that old one.
[22:31:06] No more dupes.
[22:31:10] I dug into that too :)
[22:31:46] ahh
[22:31:51] phew
[22:32:15] :D BTW, I am looking into ways to *not* use mwparserfromhell for detecting DOIs.
[22:32:24] It's a great library, but it is slowwwww
[22:32:42] I think we can do a good job.
[22:33:00] But one thing I'm struggling with is how to detect DOIs in raw citations.
[22:33:39] APA says "doi:0000000/000000000000 or http://dx.doi.org/10.0000/0000" at the end
[22:34:41] MLA says "doi:10.0000" at the end too
[22:35:13] So I think I'm going to try to match "doi=", "doi:" and "http://dx.doi.org/" as prefixes for a DOI-ID.
[22:35:45] * halfak feels bad for saying DOI-ID -- digital object identifier identifier
[22:36:18] DarTar, do all DOI's start with "10."?
[22:36:21] DarTar, got that WSC diagnosis done
[22:36:32] It seems that notconfusing's regex has that.
[22:36:48] Ironholds, WereSpielCheckers?
[22:36:56] WebStatsCollector
[22:37:08] Ahh.
[22:38:48] yes, they start with 10
[22:38:55] and then a dot
[22:39:00] and then some digits
[22:39:07] notconfusing, do you match any prefix before the ID?
[22:39:09] a nonzero amount of digits
[22:39:31] no, i just start with 10, that's the very first part of the pattern match
[22:39:36] Gotcha.
[22:40:01] oh, im back scrolling and i see that you dont want to use mwparserfromhell
[22:40:03] So you might match "I'm over 900010.10derpfoo"
[22:40:13] Yeah. I would if it was faster.
[22:40:52] but your task is parallelizable
[22:40:52] I use mwp for a bunch of other stuff though :)
[22:41:16] Indeed.
So I parallelize, but I still would like to have that order of magnitude back :D
[22:41:44] I wouldn't match "I'm over 900010.10derpfoo"
[22:41:50] halfak: sorry in meeting, yes they all start with “10.”
[22:41:51] but i would match "I'm over 900010.10/derpfoo"
[22:42:23] add a magnitude of cores
[22:42:29] Maybe a "\b" at the beginning would be in order.
[22:42:52] does is "|" in "\b"
[22:43:05] i mean is "|" in "\b"? I don't think so
[22:43:34] Hmm.. I don't see why it wouldn't be.
[22:43:40] \b is "word boundary"
[22:43:59] Ironholds: awesome
[22:43:59] In [1]: import re
[22:43:59] In [2]: re.findall(r'\b','|')
[22:43:59] Out[2]: []
[22:44:25] and that is often before them in wikitext
[22:44:50] You can't find "\b"
[22:44:51] :)
[22:45:10] It's not an actual char.
[22:45:57] oh, i guess i am mixing it up with the behaviour of '\w'
[22:46:17] or "\W" rather
[22:46:45] This works as expected: re.compile(r"\b10\.[0-9]+[^\s\]]*").findall("I'm over 9000|10.10/derpfoo")
[22:46:51] In [7]: re.findall(r'\W','|')
[22:46:52] Out[7]: ['|']
[22:47:01] And this doesn't match: re.compile(r"\b10\.[0-9]+[^\s\]]*").findall("I'm over 900010.10/derpfoo")
[22:48:10] halfak, ok that's correct, point taken
[22:48:43] Oh say, do you guys have a project description for what you're doing with realtime references?
[22:49:35] not quite yet, its on the todo-list
[22:49:54] kk. Will be interested in reading up when you do :)
[22:49:59] but in essence we read RCstream, get the text of every diff or new page and run that regex over it
[22:50:07] and then republish the stream
[22:50:37] eventually we will generalize, i hope to a service with which you can register a regex with us to run over every diff and then feed to you
[22:50:59] Sounds like a great idea.
[22:51:17] I hope to have a lot of these sort of services at some point.
[22:51:24] right now we have a promise and a lot of redis-based headaches
[22:51:29] There's been some movement on a proper event bus.
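The word-boundary point from the exchange above can be checked in a few lines. This is a minimal sketch that reuses the pattern quoted in the log; the helper name `find_dois` is hypothetical and only there for illustration, and the pattern is the one under discussion, not a complete DOI matcher.

```python
import re

# Pattern quoted in the log: a word boundary, the "10." prefix, one or more
# digits, then everything up to whitespace or a closing square bracket.
DOI_RE = re.compile(r"\b10\.[0-9]+[^\s\]]*")


def find_dois(text):
    """Return every DOI-like substring in text (hypothetical helper)."""
    return DOI_RE.findall(text)


# "|" is a non-word character, so a boundary exists between "|" and "1".
after_pipe = find_dois("I'm over 9000|10.10/derpfoo")

# "0" and "1" are both word characters, so there is no boundary inside
# "900010" and the pattern cannot start there.
glued = find_dois("I'm over 900010.10/derpfoo")

print(after_pipe)  # ['10.10/derpfoo']
print(glued)       # []
```

`\b` is a zero-width assertion rather than a character class, which is why searching for it inside "|" alone finds nothing while `\W` does match "|".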
[22:52:01] https://phabricator.wikimedia.org/T88459
[22:52:27] I want to have a good strategy for performing interesting computations based on that. It's a ways out though.
[22:53:26] I'd be really interested to hear what parts of the system that you guys build was the most difficult.
[22:56:24] * halfak is not good at wording today.
[22:56:58] staying connected to a public redis server
[22:57:18] * notconfusing will learn halfak the wording for great good
[22:57:25] :)
[23:02:44] halfak, I don't suppose there's a dumb way of specifying a custom field in an SQL query?
[23:03:05] i.e., I want a field that just contains "webstatscollector", or something, in every row.
[23:03:24] SET @wcs = "webstatscollector";
[23:03:36] SELECT @wcs;
[23:03:39] Bah. wsc
[23:03:48] http://dev.mysql.com/doc/refman/5.0/en/set-statement.html
[23:04:01] Oh wait.
[23:04:14] You just want to have a field be "webstatscollector"?
[23:04:18] You can just say that.
[23:04:38] e.g. SELECT "webstatscollector" AS type, * FROM ;
[23:04:42] huh!
[23:04:44] cool; thanks :)
[23:04:54] Each row will have a field called "type" with "webstatscollector" in it.
[23:05:09] np :D
[23:05:52] doing this QAwerk
[23:26:13] notconfusing, how do you handle "[...] doi: 10.1177/0002764212469365"?
[23:26:26] That could literally be part of the ID!
[23:38:06] halfak, i know, that's why i use mwparserfromhell, it handles that for me
[23:38:30] I see. It will strip out such tags before your regex?
[23:38:41] yup
[23:39:13] you can do, wikitext.strip_code(), or wikitext.get_tree() which will just put the code on different lines
[23:51:41] * halfak runs a performance test.
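The constant-column trick from the SQL exchange above can be tried without a MySQL server. This is a rough sketch using Python's built-in sqlite3 module; the `pageviews` table, its columns, and its rows are made up for illustration and are not the schema from the conversation.

```python
import sqlite3

# In-memory database with a made-up table; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (project TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("en", 10), ("de", 4)],
)

# A string literal in the select list is repeated on every row; AS names it.
rows = conn.execute(
    "SELECT 'webstatscollector' AS type, p.* "
    "FROM pageviews AS p ORDER BY p.project"
).fetchall()
conn.close()

print(rows)
# [('webstatscollector', 'de', 4), ('webstatscollector', 'en', 10)]
```

The sketch qualifies the star as `p.*` because the MySQL manual warns that an unqualified `*` mixed with other select-list items may produce a parse error; SQLite accepts either form.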
[23:55:55] notconfusing, halfak: catching up with the IRC log
[23:56:16] halfak: PMIDs are making the round of the interwebs :)
[23:57:36] \o/
[23:57:44] I'm working on getting DOIs added ASAP
[23:58:15] sweet
[23:58:44] notconfusing: very interested in hearing about any progress on the real-time diff project
[23:59:58] hey halfak, when you have a moment, you said you wanted to talk about the weird redirect issue