[11:14:02] halfak: Luis suggested at wikimania '14 to 'always keep a license header in each file, because people tend to copy single files instead of complete distributions'
[11:14:32] https://upload.wikimedia.org/wikipedia/commons/8/81/Open-Source-Hygiene-Wikimania-2014.pdf
[13:09:39] valhallasw`cloud, while I understand the idea, I'm much more of a WTFPL guy than a GPL guy.
[13:10:11] * hare wonders if wtfpllicense.org is registered
[13:10:18] or simply 'wtfpl.org'
[13:10:34] Generally, I find it impractical to have a header in each file that takes up nearly the entire screen and provides no documentation for the code beneath.
[13:10:39] wtfpl.org is taken but in the meantime there's mitlicense.org
[13:10:55] If we want to have a single line license on the top, then I'd be interested.
[13:10:59] halfak: my header is only three lines long; the third line is a link to mitlicense.org
[13:11:08] mitlicense.org serves the MIT license in human and machine readable formats
[13:11:10] But as it stands, the MIT license is about 40 lines.
[13:11:36] halfak: a single line # This file is part of . See for licensing information probably achieves the goal Luis tries to achieve
[13:11:45] Yeah. If we decided to agree on "License: MIT (mitlicense.org)" at the top, I'd be OK with that.
[13:11:49] halfak: also, Luis didn't suggest putting the complete license there :-)
[13:12:03] although I did suggest that, it seems
[13:12:16] valhallasw`cloud, still is a pain to add one more non-code-thing into my code.
[13:12:21] # This file is subject to the license terms in the LICENSE
[13:12:21] # file found in the [PROJECT] top-level directory and at
[13:12:21] # https://git.wikimedia.org/blob/[PROJECT]/HEAD/LICENSE. No
[13:12:21] # part of [PROJECT], including this file, may be copied,
[13:12:21] # modified, propagated, or distributed except according
[13:12:21] # to the terms contained in the LICENSE file.
[13:12:21] #
[13:12:22] # Copyrighted by the Mediawiki developers. See the CREDITS
[13:12:22] # file in the [PROJECT] top-level directory and at
[13:12:23] # https://git.wikimedia.org/blob/[PROJECT]/HEAD/CREDITS
[13:12:23] aaah
[13:12:31] No.
[13:12:32] ok, that's what he suggested
[13:12:32] Too much
[13:12:35] sorry for not pastebinning that
[13:16:17] * halfak reads slide deck
[13:16:37] Meh. I disagree that adding boilerplate to every file is a great idea.
[13:17:06] If people find it "expensive" and "costly" to look in a standard file called "LICENSE" at the base of the repo, I'm not sure I'm going to convince them of anything.
[13:18:40] I think that making the argument is interesting even if I don't like their proposed solution though.
[13:18:55] halfak: and now I copy a single file from the repo
[13:19:02] and suddenly there is no information anymore
[13:19:06] Yup.
[13:19:27] You made that decision even though you knew where to look for the license.
[13:19:36] Also, who copied a file from someone else's repo?
[13:19:40] *copies
[13:19:45] Everyone. Ever.
[13:19:47] I don't think I have ever done that.
[13:20:04] 'oh, I just need this part, so I'll just copy this file'
[13:20:30] Yeah. I can contrive a situation, but not imagine a realistic one.
[13:20:52] Luis mentioned several examples in the talk.
[13:21:00] It's much more common than people think
[13:21:06] yes, unfortunately that's very common...
[13:21:16] and then afterwards someone has to figure out if that code should actually have been used
[13:21:21] Have you guys ever copied a file from one of my repos?
[13:21:23] see also: linux vs SCO
[13:21:41] I guess a single line that says 'this file is from project X ' might be good enough?
[13:21:48] but for me a LICENSE itself is good enough...
[13:22:03] being a more WTFPL than GPL kind of guy myself as well, although that might be slowly changing
[13:22:06] Yeah. Single line, I can get behind
[13:22:23] But I'm going to not put it in the __doc__ block because I don't want it showing up all over in my HTML docs.
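A minimal sketch of the single-line pointer under discussion, assuming a Python module: the note lives in an ordinary `#` comment outside the docstring, so it stays out of `__doc__` and therefore out of generated HTML docs. The module text, the license named in the comment and the function are hypothetical placeholders; `ast.get_docstring` is used only to show what a documentation tool would pick up.

```python
import ast

# Hypothetical module source: docstring first, license pointer as a comment.
SOURCE = '''"""Example module docstring; this is what ends up in HTML docs."""
# License: MIT. See the LICENSE file in the (hypothetical) project repository.


def double(x):
    """Return twice x."""
    return 2 * x
'''

# Parse the source and extract the module docstring, as a doc tool would.
module_doc = ast.get_docstring(ast.parse(SOURCE))
print(module_doc)
```

The comment travels with the file when someone copies it, but it never becomes part of the documented text.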
[13:22:44] I don't think it was supposed to be in the __doc__ block
[13:22:47] when I put license headers
[13:22:51] they get put in as comments
[13:22:54] and not doc blocks
[13:22:54] no one said it should be in the __doc__
[13:23:02] So a # See https://github.com/.... for licensing information
[13:23:11] Right after the __doc__ block
[13:23:20] for example.
[13:23:20] You guys are all excited about one little note :P
[13:24:05] *shrug*. If Luis says 'this is a good idea because it makes people's life much easier downstream' when it comes to licensing then I happily follow his advice
[13:24:54] valhallasw`cloud: I think a pointer to the license / project is good enough since it solves that use case, and perhaps is more useful than just a plain license header...
[13:25:04] valhallasw`cloud, I'm not one to make an argument on authority to
[13:25:13] Even if you have a good authority
[13:25:34] well, then go watch his talk
[13:26:06] YuviPanda: yes. That's what is suggested in the talk.
[13:26:42] ok, so we all agree on just putting a '# See for project and license information?
[13:26:47] and link points to the repo?
[13:26:52] halfak: valhallasw`cloud ^
[13:26:52] sure.
[13:27:19] I thought we already had agreed on that :P
[13:27:35] I'm not sure why we're still talking about Luis' talk -- which already happened?
[13:27:48] I wish people would just trust me on authority.
[13:27:50] :S
[13:27:54] Actually, I don't
[13:28:31] halfak: because he's much better at making the point in a talk than I am over IRC? And this was 2014, so there's video of the talk.
[13:28:34] I don't think it was meant as a 'luis said this, you must listen!' but as a 'so in this talk it explains the reasoning behind that suggestion'
[13:29:08] * YuviPanda goes to do as Romans do (aka eat pizza)
[13:29:09] and yes, I /also/ sincerely meant saying 'if a license lawyer says something related to licenses is a good idea, then maybe it's not such a bad idea to trust his judgement'
[13:30:20] valhallasw`cloud, I don'
[13:30:27] t think he needs to make a point.
[13:31:18] valhallasw`cloud, that does not mean that I can not challenge the practicality of his suggestions.
[13:31:41] As I have pointed out before, "I think that making the argument is interesting even if I don't like their proposed solution though. "
[13:32:05] We can all agree that Luis raises an interesting problem and then challenge the quality of his solution.
[13:32:14] This does not call his expertise into question.
[13:32:17] Fair enough.
[13:49:05] API design question: I want to expose "Give me the pages that fall in this category." As you know, this is tricky, but WikiBrain has good APIs for this. Namely, it calculates pagerank distance in the category graph.
[13:50:07] However, "pages in category" actually means "pages that are closer to category a than to any of these other categories C", where C is typically (but not always) a set of top-level categories.
[13:50:45] Two questions: 1) I'm planning on asking the user to supply a and C. Does that make sense?
[13:51:14] Shilad, at first glance, I think there should be a default C
[13:51:21] But otherwise, totally.
[13:51:27] I'd like to specify C sometimes.
[13:51:35] Hmm... It's a little tricky to do that cross-lingually...
[13:51:44] I can understand.
[13:51:59] Can you identify "top-level categories" algorithmically?
[13:52:00] Do you know if there's some way that's not hard-coded?
[13:52:29] https://en.wikipedia.org/wiki/Category:Main_topic_classifications
[13:53:08] I wonder if each wiki has something like this category.
[13:53:31] I think most do, but there are often several choices. E.g. https://en.wikipedia.org/wiki/Category:Fundamental_categories
[13:54:42] Gotcha. Hmmm.
[13:55:09] So, in this case, I'd like to be able to specify a parent category for "C".
[13:55:21] Aha. That's a nice approach.
[13:55:24] It seems that it would be a good idea if "a" is in "C"
[13:56:08] So, a=Life&C=Fundamental_categories
[13:56:10] Also, I notice there are Wikidata pages for these topic classifications that seem robust multi-lingually: https://www.wikidata.org/wiki/Q4587687
[13:56:17] Oooh
[13:56:41] Maybe default to that parent, but allow either a parent category or a set of siblings.
[13:56:51] +1
[13:57:02] * halfak has a use for this as soon as it is ready
[13:57:02] :)
[13:57:07] :)
[14:04:39] o/ YuviPanda
[14:04:50] Ping me when you have finished pizzaing
[14:05:58] halfak: 'sup
[14:06:10] Can you ssh to ores-compute.revscoring.eqiad.wmflabs?
[14:06:16] I'm struggling.
[14:06:40] I really want to see the status of the precached script I have running there.
[14:06:47] So I don't want to reboot unless I have to.
[14:07:19] checking
[14:08:14] halfak: looks dead to me... http://tools.wmflabs.org/nagf/?project=revscoring
[14:08:25] hasn't been reporting any data for a while
[14:08:38] Why does that happen?
[14:08:47] * halfak reboots
[14:11:16] halfak: multiple reasons, common one is a runaway process waking up the kernel OOM killer
[14:11:22] which of course kills indiscriminately
[14:11:30] after a while of that not much is left to run
[14:11:45] Gotcha.
[14:11:52] Could have been precached leaking memory?
[14:12:31] could be
[14:12:45] were you running anything else on it?
[14:12:51] Hmm... Was working on Sunday when ToAruShiroiNeko was using it.
[14:13:05] precached was going for more than a week by that point.
[14:14:13] * halfak starts up a new precached
[14:14:57] I think we might need to suspend code review for ORES
[14:14:58] https://github.com/wiki-ai/ores/pulls
[14:15:11] Pull requests from more than 3 weeks ago just sitting there.
[14:16:02] * halfak closes one defunct one.
[14:17:13] \o/
[14:25:31] halfak: did ores-compute come back up?
[14:25:38] It did.
[14:25:43] precached is back online
[14:25:45] :)
[14:31:16] \o/
[14:31:37] Now to fix up ORES for the recent revscoring refactoring and do a new deploy.
[14:31:49] I'm curious to see what fitness we get with multilingual models.
[14:31:58] And better badwords detection
[14:54:02] YuviPanda, if you give me merge rights on mwapi, I'll close https://github.com/yuvipanda/python-mwapi/issues/1
[14:54:31] done
[14:55:45] * halfak likes how clean and simple this library is.
[14:59:13] YuviPanda, at some point, we should talk about how we're going to do the mwoauth/mwapi dance
[14:59:17] halfak: can you merge https://github.com/yuvipanda/python-mwapi/pull/12
[14:59:20] ?
[14:59:35] {{merged}}
[15:06:58] halfak: and https://github.com/yuvipanda/python-mwapi/pull/13 just made all checks pass
[15:23:21] halfak: +1, am ok to merge if you rebase?
[15:23:39] or, we need to find another way to trigger travis checks!
[15:39:31] YuviPanda, will do.
[15:39:43] halfak: thanks :)
[15:39:55] awight: you're my hero for tackling the sklearn package :)
[15:40:11] I'm getting my arse whupped
[15:40:26] Hopefully it ends up being helpful, but first it needs to work!
[15:40:51] I think I'm making progress, at least. Keep unblocking tests only to find other tests failing.
[15:41:04] awight: *hug*
[15:41:31] thanks, I see you're good at dealing with PTSD cases
[15:42:00] heh, been there several times myself :)
[15:42:34] Can you explain the next steps, btw? I'm not comfortable with just creating a branch somewhere on github that includes the debian packaging files.
[15:42:50] Should I be creating gerrit mirrors under operations/debian or something?
[15:43:38] awight: I've just been doing it the git-buildpackage way now - https://github.com/yuvipanda/mwparserfromhell
[15:43:45] we can sync it to gerrit later if necessary
[15:43:57] but mwparserfromhell, for example, I'm planning on getting into debian itself
[15:44:26] so you build a .deb, and then where does that go?
[15:44:43] awight: ah, *that*. I'm experimenting with ores-misc-01 for that
[15:44:48] awight: aptly.info
[15:44:55] there's role::aptly applied on that host
[15:45:22] so eventually, when this goes to prod, we'll build them on copper (a deb builder host) and put it on carbon (our internal debian repo)
[15:45:43] And it's okay that the sources aren't under our control?
[15:46:11] awight: they are, right? they're just on github *right now*, and we can put them on gerrit later if we want - just mirror github into gerrit...
[15:47:08] Hmmm sketchy
[15:47:24] Rebasing two masters is weird.
[15:47:50] I'm fine with whatever though, just wondering what the actual "completed" finish line is for the packaging task
[15:48:20] Arg! Looks like it still didn't work.
[15:49:30] Weird. Travis can't find ".travis.yaml", but it is most certainly in my PR.
[15:50:10] awight: I think it doesn't matter where they are, really... bringing them into gerrit is trivial
[15:50:21] halfak: ah yes, because it looks for .travis.yml
[15:50:25] halfak: https://github.com/yuvipanda/python-mwapi/pull/13
[15:50:26] has fixes
[15:50:40] Yeah... It's in my PR
[15:50:51] So, I don't know what the problem is.
[15:50:59] halfak: sorry, the Travis changes I made are just annoying, I should turn off GH integration until we've fixed the build.
[15:51:03] halfak: yml vs yaml
[15:51:06] OH!
[15:51:10] OK. SO it's all good then?
[15:51:26] ah, different repo I guess. I was talking about ores
[15:51:31] so https://github.com/yuvipanda/python-mwapi/pull/13 needs to be merged and then your PR needs rebasing again, I think - to actually run the tests?
[15:51:33] awight: ah yes
[15:51:56] Gotcha
[15:51:58] * halfak fixes
[15:55:53] * halfak rebases and pushes
[15:56:13] "All checks have failed"
[15:56:14] OK
[15:56:52] Looks like we have some flake8 errors.
[15:57:26] my demo script is a big problem ;)
[15:58:29] halfak: :D
[15:59:02] halfak: running 'tox' on your local machine should dtrt and give you alerts
[16:00:01] So, we get some warnings that should probably not be errors.
[16:00:23] are you talking about the 1 line / 2 line things
[16:00:34] I must admit to quite liking those :D
[16:00:57] No. Those are fine
[16:02:52] Here: https://github.com/halfak/python-mwapi/blob/master/mwapi/__init__.py
[16:03:04] We get warnings about these imports because they are "unused"
[16:03:10] but they are supposed to be there anyway.
[16:03:20] They set up the module structure for easy reference by a user
[16:03:23] I have to run.
[16:03:31] Jenny locked keys in car == Aaron bikes fast
[16:03:32] o/
[17:25:53] halfak: are you back in minnesota now/
[17:25:54] ?
[17:26:30] Yup
[18:00:31] YuviPanda, I think these are all the flake8 issues I'm going to solve in https://github.com/yuvipanda/python-mwapi/pull/11
[18:00:39] The remaining issues are actually good code.
[18:01:11] halfak: oh yeah, there's a way around that that I used someplace...
[18:02:30] # noqa
[18:03:13] actually no
[18:03:15] __all__
[18:03:16] https://docs.python.org/3.4/tutorial/modules.html#importing-from-a-package
[18:03:29] that's what I ended up doing in tools-webservice as well
[18:04:01] https://github.com/wikimedia/operations-software-tools-webservice/blob/master/toollabs/common/__init__.py
[18:11:48] halfak: heh, did you put in the note about ORES being used by huggle in the SoS?
[18:33:33] Yup
[18:33:46] YuviPanda, ^ did it get back to you?
[18:34:11] I also noted that you and awight were working on debian packaging for this step of productionization
[20:35:03] halfak: Realizing the API I'm providing for closest top-level category isn't quite what people may want. Can you verify...
[20:35:18] I now have "which of these categories is closest to this page?"
[20:35:28] I need "which pages are closest to this category?"
[20:35:51] The latter seems like a more common use case, I think.
[20:37:07] I agree, but I want both.
[20:37:28] e.g. I might have a small sample of pages that I need to learn the categories of.
[20:37:38] Or I might want to dig into a particular category.
[20:38:09] so, when we were comparing category "a" to category set "C" it seems like that was the latter?
[20:38:21] Yeah. That's right.
[20:39:25] I think this is more computationally taxing than what I have right now. It will require bfs on the entire category / page graph.
[20:39:47] But it should be doable...
[20:40:40] bs?
[20:40:40] *bfs?
[20:40:43] And it's a weird BFS with lots of simultaneous starting points...
[20:40:48] Breadth First Search
[20:43:50] Yeah. I figured as much. Then the latter is still very useful to me.
[20:44:08] Sorry. The former.
[20:44:23] How many requests/second do you think you can serve for answering, "Which category is closest to this page?"
[20:45:12] It's a very good question. Java is VERY fast for this kind of thing if you have the data structures in memory. I think it will be doable in a few millis, which would be fine.
[20:46:03] I'll code it up, time it on EN and report back :)
[20:46:39] I'm definitely going to make this an example in my algorithms class this semester!
[21:14:43] A few milliseconds and I can process all the articles in 5 hours.
[21:14:51] *all the articles in English
[21:15:03] * halfak rounds up to 5MS
[21:15:35] Bump that up to 10MS and it would take 10 hours
[21:15:51] 50MS and we're looking at two days
[21:22:32] I remember when I had a script that was going to take two months to run. I changed it to use the Labs database replications instead and then it only took 18 hours. I think in its current incarnation (I abandoned the old script and wrote a new one) it takes around 30 hours to do 100 times as many projects.
[23:32:08] * halfak gets on bike again
[23:32:10] VRRROM!
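The "BFS with lots of simultaneous starting points" from the category discussion above can be sketched as a multi-source breadth-first search. This is only an illustration under stated assumptions: it uses plain hop count rather than the pagerank distance WikiBrain is said to compute, the function name `closest_source` is made up, and the toy graph (category or page mapped to its members) is invented data, not real Wikipedia categories.

```python
from collections import deque


def closest_source(graph, sources):
    """Label every reachable node with its nearest source and hop distance.

    graph maps a node to the nodes directly below it (hypothetical layout:
    category -> subcategories and pages). Ties go to the earlier source.
    """
    owner = {}
    queue = deque()
    # Seed the queue with all starting categories at distance 0.
    for source in sources:
        if source not in owner:
            owner[source] = (source, 0)
            queue.append(source)
    # Standard BFS: the first source to reach a node is the closest one.
    while queue:
        node = queue.popleft()
        source, dist = owner[node]
        for child in graph.get(node, ()):
            if child not in owner:
                owner[child] = (source, dist + 1)
                queue.append(child)
    return owner


# Invented toy graph, for illustration only.
toy_graph = {
    "Life": ["Biology", "Organisms"],
    "Technology": ["Computing"],
    "Biology": ["Cell", "Genetics"],
    "Organisms": ["Cell"],
    "Genetics": ["Bioinformatics"],
    "Computing": ["Bioinformatics"],
}

labels = closest_source(toy_graph, ["Life", "Technology"])
print(labels["Bioinformatics"])  # ('Technology', 2)
print(labels["Cell"])            # ('Life', 2)
```

One pass labels every page, so the same result answers "which category is closest to this page?" by lookup and "which pages are closest to this category?" by grouping the labels per source.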