[08:04:02] halfak: definitely.
[08:04:04] :)
[09:12:41] halfak: I harvested features and revert status of 8.5M revisions :)
[09:12:53] zipping it
[09:13:14] (the csv file is about 1.5 GB)
[09:29:23] zipped file: 85 MB
[09:58:54] halfak: https://tools.wmflabs.org/dexbot/res_aaron.zip
[14:19:23] Woo! New data!
[16:28:24] halfak I will update them right now
[16:28:45] Thanks.
[16:28:51] I might be able to pick one up today.
[16:29:15] Do you want to continue work on your PR? https://github.com/wiki-ai/revscoring/pull/189
[16:29:21] Or should I pick that up?
[16:33:17] ToAruShiroiNeko, ^
[16:56:15] halfak sure, I do want to focus on programming tomorrow
[17:01:58] no one is in the call
[17:02:03] ToAruShiroiNeko: halfak
[17:02:46] Whoops. Joining.
[17:02:56] For some reason my calendar notifications got turned off today.
[18:00:40] BRB, must let the dog outside. Will be right back.
[18:04:28] halfak: sorry, I had trouble connecting to IRC
[18:04:32] let's talk
[18:04:45] tools.dexbot@tools-bastion-02:~/pywikibot-core$ cat res_dump_aaron.csv | grep -P "True,\d\d\d+" | wc -l
[18:04:46] 39236
[18:04:48] tools.dexbot@tools-bastion-02:~/pywikibot-core$ cat res_dump_aaron.csv | wc -l
[18:04:49] 10044001
[18:04:54] 39K out of 10M
[18:05:30] Just let the dog out. Back and ready to chat.
[18:05:56] So. Let's set some constraints.
[18:06:02] What's the date on the dump files?
[18:06:51] for every revision
[18:07:02] ?
[18:07:03] it's rev id, user id, page id
[18:07:10] time stamp
[18:07:11] text
[18:07:16] What date was the dump extracted?
[18:07:23] usually in the form of YYYYMMDD
[18:07:28] First of September
[18:07:34] there is no dump older than that
[18:07:35] 20150901?
[18:07:40] (they are broken)
[18:07:56] They are incremental?
[18:08:07] /public/dumps/public/wikidatawiki/20150806/wikidatawiki-20150806-pages-meta-history2.xml.bz2
[18:08:17] sorry, 6 of Jule
[18:08:20] *July
[18:08:46] http://dumps.wikimedia.org/wikidatawiki/
[18:08:53] That's August :P
[18:09:05] oh, and there is another dump for 26 August
[18:09:16] I'm terrible with months; we use different months
[18:09:18] :D
[18:09:31] I can extract from that
[18:09:52] Different months!? Now I need to read https://en.wikipedia.org/wiki/Iranian_calendars
[18:09:55] * halfak shakes fist
[18:10:14] Different calendars,
[18:10:25] OK. So I propose that we constrain our sampling to the edits that were saved within a single year.
[18:10:53] So if the timestamp is between "20140806000000" and "20150806000000", then we'll consider including it.
[18:11:02] from 20140806 to 20150806?
[18:11:17] That would be pretty easy to implement
[18:11:32] but still, it would be at least 33M edits
[18:11:38] * halfak runs query to check that it would be enough
[18:12:00] the third anniversary of Wikidata is close
[18:12:11] so three years and about 200M edits
[18:12:21] Yeah. Should be plenty
[18:12:33] what about six months?
[18:12:54] https://www.wikidata.org/wiki/Special:Statistics
[18:12:55] So, within that year, we want to randomly sample edits, but I propose we do it based on stratification.
[18:13:15] So, we'll have one stratum for reverted edits and another stratum for non-reverted edits.
[18:13:38] For reverted edits, we sample at 100%
[18:14:26] ok
[18:14:37] So I extract features and then we sample it?
[18:14:42] For non-reverted edits, we want to sample at whatever percentage keeps the numbers down to ~ the same as the # of reverted edits
[18:14:49] Nope. We'll sample and then extract features.
[18:14:54] ok
[18:14:59] This'll work great in Python with a generator.
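A minimal sketch of the stratified sampling described above, written as a Python generator. The function name, the row layout, and the default rate are assumptions for illustration (the non-reverted rate they settle on below is 0.07%); this is not the actual script that was used.

    import random

    # One-year window discussed above, as MediaWiki timestamps (YYYYMMDDHHMMSS).
    START, END = "20140806000000", "20150806000000"

    def sample_revisions(rows, nonreverted_rate=0.0007, seed=0):
        """Yields a stratified sample: 100% of reverted edits, a small
        random fraction of non-reverted edits.  `rows` is assumed to yield
        (rev_id, timestamp, reverted) tuples parsed from the harvested CSV."""
        rng = random.Random(seed)
        for rev_id, timestamp, reverted in rows:
            if not (START <= timestamp < END):
                continue  # outside the sampling window
            if reverted or rng.random() < nonreverted_rate:
                yield rev_id, reverted

Because fixed-width timestamps sort lexicographically, plain string comparison suffices for the window check.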
[18:15:11] even though we'll need to do a look-ahead.
[18:15:15] * halfak pulls up pairjam
[18:15:30] btw, 88.8 million revisions in that year
[18:15:30] I see
[18:16:06] so 5% sampling of reverted edits would be okay
[18:16:10] what do you think?
[18:16:16] *not-reverted
[18:16:21] *way* too high
[18:16:25] 2%
[18:16:27] I think 0.07%
[18:16:32] * Amir1 negotiates
[18:16:34] :D
[18:16:37] ha
[18:16:38] :D
[18:16:55] okay
[18:17:02] pairjam seems broken
[18:17:02] I'll go with 0.07%
[18:17:36] https://pairjam.unicornalpha.com/#h61t2t
[18:17:40] Found an alternative
[18:19:47] If I can grasp the concept, I can do all of them easily
[18:22:19] halfak: I got it
[18:22:25] You will have it by tomorrow
[18:22:29] :)
[18:22:45] Oh! Cool!
[18:22:53] I think the revert detection part might be a little funny though.
[18:22:59] I'd like to flesh it out a bit more.
[18:23:07] I wrote it another way
[18:23:25] I will show it to you
[18:23:32] it's nasty but works very well
[18:23:49] I've got to go. Be back soon
[18:24:05] bye and thanks halfak
[18:24:16] OK. Good luck and have fun.
[18:24:24] I'll put this in a gist for you :)
[18:55:17] Amir1: https://gist.github.com/halfak/f7625f75a432a30b5a35
[18:58:35] OK. With that out of the way, I'd like to dig into the edit quality campaigns.
[18:58:43] Time to get some PRs together for Wikilabels.
[19:06:12] kenrick's code is merged.
[19:06:21] Now to look at those prelabeling bits
[19:12:21] ToAruShiroiNeko, bot is not trusted in nlwiki?
[19:15:09] umm
[19:15:11] to be honest
[19:15:18] I think bot and admin are trusted by default
[19:15:33] they don't mention it, but I would add them to our trusted nevertheless
[19:16:00] bureaucrats too
[19:16:02] +1
[19:16:19] if these guys aren't trusted, you have bigger problems way beyond statistical bias in a machine learning algorithm
[19:16:25] basically it's wiki doomsday
[19:17:04] "editor"
[19:17:09] What is that group?
[19:19:21] Hi all
[19:19:42] Hi jenelizabeth
[19:20:43] can you help me understand some deep learning/LSTM maths?
[19:21:28] your best bet might be aetilley
[19:21:36] What is LSTM?
[19:22:21] long short-term memory
[19:22:36] Gotcha
[19:22:36] https://en.wikipedia.org/wiki/Long_short-term_memory
[19:22:48] Regretfully, I have no experience there.
[19:22:55] What sort of prediction tasks are you working on?
[19:23:54] ToAruShiroiNeko, should I be waiting for https://phabricator.wikimedia.org/T114502?
[19:24:12] "autoreview and editor are trusted per AS, but there is an ongoing discussion."
[19:25:20] jenelizabeth, forgot to ping when I asked the question. What sort of prediction tasks are you working on?
[19:25:44] halfak yeah
[19:25:48] it seems to be a group
[19:26:05] halfak yeah
[19:26:09] the discussion sparked yesterday
[19:26:14] I want to give it a few days
[19:26:26] meanwhile, I will try to get them to translate the UI form
[19:28:45] thanks. Also, regarding the LSTM and RNN stuff, why are variables defined as e.g. h_t for a given node?
[19:28:55] why not n_t? or cell_t? etc.
[19:29:35] what does the h actually represent as an acronym?
[19:36:23] ToAruShiroiNeko, OK. Prelabeling running on id and nl.
[19:36:32] They'll be ready in a few minutes.
[19:37:00] jenelizabeth, just guessing, but usually H would be used to represent entropy or information content?
[19:37:35] here is a variable defined as h_t?
[19:37:58] I mean h_t, where t is the subscript of h
[19:38:09] * halfak curses at mathematicians and their penchant for single-symbol variable names.
[19:38:20] Yeah, I figured LaTeX style, right?
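A reference note on the notation being asked about: in the standard LSTM formulation, h is not an acronym for entropy; h_t is the hidden state (the unit's output) at time step t. The usual update equations, with x_t the input, c_t the cell state, \sigma the logistic sigmoid, and \circ elementwise multiplication:

    \begin{aligned}
    f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
    i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
    o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
    c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(cell state)} \\
    h_t &= o_t \circ \tanh(c_t) && \text{(hidden state)}
    \end{aligned}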
[19:38:25] So where is this "h" discussed?
[19:38:35] not used LaTeX before, so complete noob to that language
[19:40:10] LaTeX is the language style that MediaWiki uses between <math> tags.
[19:56:54] I'm thinking that prediction should be in the range of reactive behavioral responses of nodes/networks/systems
[19:57:54] whereas thoughtful responses should manifest as being calculative and patient, focused and sharply handling their tensors of patience/general stress of some means...
[19:58:41] but both should be on the same plane -- scale -- with prediction being among the most reactive (even if it's indexed from a thoughtful process response)
[19:59:38] prediction shouldn't require much computational effort at all, and should be at least one step in weightfulness above a random, thoughtless response
[20:00:34] But... what to predict?
[20:00:46] Is this a theoretical or applied project?
[20:00:52] proactive behavioral response should be on the side of requiring +w^n computational effort, where each step of n manifests the depth for computing (or calculating, rather) a thoughtful response within a deep network
[20:03:07] e.g. prediction that manifests naturally should be basically a k-means response (i.e. look at neighbour responses to this input all the way back to steps^n, so it would have some computational effort regarding the depth of searching for the deepest prediction (i.e. lengthiest)); else it should be an indexed memory of some proactive thoughtful response, to save computational time and effort in reproducing that desired behavioral response
[20:03:11] if that makes sense
[20:03:58] predict anything lol, it doesn't matter what: the point is in finding optimal use of these functions/operators of nodes/networks/systems
[20:05:30] just observe how little time a predictive response requires amidst group behavior, of any species!
[20:06:49] or how little effort a brain may put into producing such a response (e.g. manifestation of a routine behavior: even though such behavior (e.g. an n-monthly occasion) requires a lot of time before being triggered, the computational effort for producing this response is, at most, minimal)
[20:08:17] Ahh. I see. So this is more theoretical.
[20:08:21] thoughtful and proactive behavioral responses require actual transformations to be performed, and learning efficiency with regard to the target problem, e.g. solving a Rubik's cube for the first time would require more computational effort
[20:08:35] or solving any high-level problem, for that matter, would also
[20:08:45] Simulated annealing
[20:09:09] indexing these solutions would not be instant either, since it would require learned indexing of the nodes/networks which helped produce those kinds of solutions
[20:11:08] so I suppose indexed predictions of thoughtful/proactive behavioral responses would require at least +w^n_tp, where tp accounts for any thoughtful/proactive response -- for the weighted increase to be at least one step/power greater than an ordinary k-means response of autonomous, thoughtless prediction
[20:11:15] yeah, theoretical, sure
[20:13:02] do you understand what I mean?
[20:14:29] thoughtful and proactive responses are difficult and therefore require more computational effort, which requires more organic energy converted into the right form in the right place (e.g. it might require a greater amount of dopamine or serotonin, or another neurotransmitter)
[20:15:24] interestingly, from personal experience... regarding serotonin... I don't think it's anything like dopamine, but rather something that finds a means by which the overall shape/structure of our mind can be acquired, as if it were a hot air balloon
[20:15:55] I've had the same trippiness effect from SSRIs and from 5-HT2A-etc.-antagonising psychedelics
[20:16:30] albeit the latter produced a complete change in how structure manifested and acquired perception...
[20:16:32] >.>
[20:17:14] yeah... neurotransmitters != brain energy, but I think I see what you're saying.
[20:17:33] Dopamine is more about reward/learning
[20:18:07] Serotonin levels can substantially affect perception.
[20:20:22] This is in a direction that I like thinking about, but I don't have the capacity for it while I'm multitasking other stuff.
[20:20:28] jenelizabeth, what lit are you drawing from?
[20:20:32] agreed! but what about them representing the brain energy of their relative functions? e.g. depression manifesting as a lack of NT production/absorption relative to an NT function of its receptor binding
[20:21:41] so perhaps we can have depression that is differentiated by relative NT binding, and different kinds of depression overall (e.g. whether it's the result of a lack of production of some NT, or a lack of absorption of some NT by those target cells/networks of a brain)
[20:22:51] Still, those are just characteristics of a system-level pattern that we don't really understand.
[20:23:15] We can replicate some of the effects, but not all of them, by introducing/reducing/inhibiting NTs
[20:23:38] you mean, we can try to replicate their effects lol
[20:23:46] Source: wife is a neuroscientist using models to study schizophrenia and epilepsy
[20:24:05] I don't think that matters at all... but rather, it's more important to find a way to acquire the desired function of your neural network
[20:24:20] does she come on here?
[20:24:53] Negative. She's a non-internet person
[20:27:36] crazy
[20:31:25] so if I understand this correctly, can an AI have psychiatric disorders?
[20:37:53] Yes!
[20:37:57] They usually do!
[20:38:14] halfak: did you hear back about the security review?
[20:38:43] I did not. I walked csteipp et al. through the code and commented on some diagrams in Phab
[20:38:51] That was a week ago, I guess.
[20:39:29] ah, cool
[20:39:35] yeah, it takes about that long or more, I guess
[20:39:42] let's poke 'em if we don't hear back by the middle of next week
[20:43:01] The part that csteipp reacted most strongly to was the pickled model files
[20:43:12] I've been looking into alternatives, and they aren't great
[20:43:33] right, since unpickling is arbitrary code execution
[20:44:38] yeah :/
[20:44:39] which psychiatric disorders do AIs usually come down with?
[20:44:42] But we control the repo
[20:44:47] other than the obvious intellectual disability disorder?
[20:45:13] (well, *that* would depend on what age of adult you're comparing the AI to)
[21:54:37] (PS1) Legoktm: build: Updating development dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/248535
[21:55:30] (CR) jenkins-bot: [V: -1] build: Updating development dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/248535 (owner: Legoktm)
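A minimal illustration of the pickle concern discussed above (not code from revscoring or the ORES deployment): unpickling executes whatever callable a payload's __reduce__ method names, so loading model files is only safe when their provenance is controlled, which is the "we control the repo" mitigation.

    import os
    import pickle

    class Malicious:
        # pickle calls __reduce__ to learn how to rebuild the object;
        # a hostile payload can return any callable plus its arguments.
        def __reduce__(self):
            return (os.system, ("echo arbitrary code executed",))

    payload = pickle.dumps(Malicious())
    pickle.loads(payload)  # runs the shell command while "loading data"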