[01:01:32] halfak: do you know of any studies or just quick-and-dirty calculations for translating from characters to words in an English Wikipedia article?
[01:01:45] like, the standard for written English is 4.5 letters per word.
[01:02:11] ragesoss, I know a few studies that did some work on readability metrics applied to Wikipedia
[01:02:17] but what would the equivalent be in terms of article bytes per prose word?
[01:03:11] https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
[01:03:36] ^ things like that
[01:03:58] halfak: I'm less interested in that, and more interested in: what's a good constant to use to estimate the number of words in an article, given that I know the number of bytes.
[01:04:20] Oh! I see. Sorry nothing like that.
[01:04:34] But I could generate you a dataset of bytes, word_count pairs.
[01:04:37] or put another way, what is the ratio of prose to markup in an average article?
[01:04:58] ragesoss, that's going to be different, but I have an answer for it.
[01:05:53] https://lists.wikimedia.org/pipermail/wiki-research-l/2013-August/002987.html
[01:06:10] Keep reading in the thread. A debate ensues about measurement strategies.
[01:06:42] "correlation for readable character size with byte size = 0.04 (i.e. none) in the sample. "
[01:06:51] https://commons.wikimedia.org/wiki/File:Bytes.content_length.scatter.correlation.enwiki.png
[01:07:00] ragesoss, yeah. That's wrong and I show why
[01:07:14] It's actually very correlated.
[01:07:20] yeah, that's what I'd expect...
[01:07:24] Unless you bound it unreasonably :)
[01:07:34] small articles could be infoboxes with no content, or pure unformatted prose, or anywhere in between.
[01:07:48] but as you grow, there's only so much markup that will get added.
[01:07:57] and it tends to settle out towards a typical ratio.
[01:08:13] My regression suggests that, if you multiply content by 1.14991, you get bytes in general.
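Editorial sketch (not part of the log): the two figures quoted so far, the regression slope of 1.14991 bytes per visible character and the 4.5 letters per word standard, can be combined into a rough word-count estimate. The function and constant names below are hypothetical, chosen only for illustration, and the slope is taken at face value from the chat for English Wikipedia only.

```python
# Rough word-count estimate from article size in bytes.
# Both constants come from the chat above; all names here are hypothetical.

BYTES_PER_VISIBLE_CHAR = 1.14991  # regression slope quoted in the chat (enwiki only)
LETTERS_PER_WORD = 4.5            # "standard for written English" quoted in the chat


def estimate_visible_chars(article_bytes: float) -> float:
    """Estimate visible (content) characters from the article's byte size."""
    if article_bytes < 0:
        raise ValueError("article_bytes must be non-negative")
    return article_bytes / BYTES_PER_VISIBLE_CHAR


def estimate_words(article_bytes: float,
                   chars_per_word: float = LETTERS_PER_WORD) -> float:
    """Estimate prose words by dividing visible characters by chars_per_word."""
    if chars_per_word <= 0:
        raise ValueError("chars_per_word must be positive")
    return estimate_visible_chars(article_bytes) / chars_per_word


if __name__ == "__main__":
    for size in (2_000, 6_000, 50_000):
        print(size, "bytes ->", round(estimate_words(size)), "words (approx.)")
```

One caveat, offered as an assumption rather than anything stated in the chat: 4.5 counts letters only, while visible characters also include spaces and punctuation, so passing a somewhat larger `chars_per_word` (for example 5.5) may give a closer estimate.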
[01:08:18] But this only applies to enwiki
[01:08:42] Other languages have charsets that require more bytes to encode.
[01:08:49] multiply 1.15 by visible characters.
[01:09:05] so, take bytes, and divide by 1.15 to estimate visible characters...
[01:09:11] Yes
[01:09:24] and then divide by 4.5 to get the standard English words estimate.
[01:09:39] awesome. that's quite good enough for my purposes.
[01:09:48] :D
[01:20:27] okay, after skimming the thread... I'm left more confused about why, in particular, looking only at articles in the 6000 bytes range had such different results than the unbounded analysis.
[01:21:13] what? 6000 is more than a stub.
[14:35:00] o/
[14:58:24] \o
[16:09:05] halfak, good news!
[16:09:12] Hey!
[16:09:14] What's up?
[16:09:24] Remember the researchers who spoke to us about modelling pageviews and the weird variations they were seeing between years?
[16:09:33] Yeah!
[16:09:40] And I offered to generate some timestamp-localised data for them?
[16:09:42] David Laniado et al.
[16:10:04] * halfak waits in anticipation
[16:10:11] when you localise the timestamps the patterns look almost the same, year after year, month after month. The model works!
[16:10:33] the differences they're seeing are probably because we're seeing changes in _where_ traffic comes from which correspondingly means changes in when traffic appears _on UTC_
[16:10:47] (I had them analyse the data, not me, that way I could be sure it was right ;p)
[16:11:04] :P Also saving Irontime for other stuff.
[16:11:21] That's great. I imagine there is a pub in the works?
[16:11:57] Also, this reminds me of your proposal to evaluate product interventions by implicit timezone (what time UTC users are active).
[16:12:13] If it works for this, it ought to work for that too.
[16:12:17] no idea, if there is they haven't involved me, just thanked me and said they'll keep me posted on their noodlings, which is enough :)
[16:13:04] and yeah, that would be hella cool. I have no idea if any teams would be interested in working on it, mind! The Discovery AB testing isn't really intervention-centred
[16:13:59] Ironholds, do people who are active when THE WEST is asleep get the same zero result rate?
[16:14:10] ....ooooooooooh
[16:14:13] :D
[16:14:27] * halfak is really just riffing on Ironholds' past ideas
[16:14:33] I bet it's higher
[16:14:35] I bet it's a LOT higher
[16:14:46] because we're dealing with Asia and we SUCK at multibyte chars
[16:14:49] That sounds like a /HYPOTHESIS/.
[16:14:52] * halfak puts on sunglasses
[16:14:59] hahaha
[16:15:14] I'm already working on 3 papers! 4 if Nate's silence is "I got distracted" and not "we decided to do the thing without ya"
[16:15:26] but I submit 1 before christmas break so that should help.
[16:16:25] Cool! I've been super slow on papers recently. Lots of engineering work.
[16:16:27] But soon.
[16:16:40] Got a doozy about algs and protected classes.
[16:16:55] Did I show you this thing? http://socio-technologist.blogspot.com/2015/12/disparate-impact-of-damage-detection-on.html
[16:17:06] you did not! *reads*
[16:17:18] and yeah, I've been feeling pretty slowpoke myself
[16:17:32] I've been working to characterize and try the obvious remedies. Once I fail or succeed, I'll have lots to write about.
[16:20:27] halfak, that is an excellently cool writeup
[16:21:26] I was originally going to start a conversation about it right away with my users, but I figured they deserved a thorough investigation before I proposed cutting the fitness of the models.
[16:21:42] So, most likely, we'll end up having a hard discussion after the holidays.
[16:28:10] yeah. That's one of those topics I would *love* to see covered at a conference
[16:28:36] I'd probably title it "How Do You Solve A Problem Like Algorithmic Racism" just for the pop culture reference
[16:30:25] Wait. What is that referencing?
[16:32:45] https://www.youtube.com/watch?v=M1HwVmY28Pk
[16:33:14] The Sound of Music!
[16:33:33] Oh! How do you solve a problem like Maria!
[16:33:47] yeah!
[16:34:04] I've been on a real musical kick recentl-wait.
[16:34:09] :P
[16:34:12] say, halfak. You like imaginative and lyrical hip-hop.
[16:34:16] do you like musicals?
[16:34:17] I do
[16:34:25] do you like history?
[16:34:25] I tolerate musicals
[16:34:29] I like history
[16:34:33] That's 2.5 out of 3
[16:34:43] are you aware that there is a hip-hop musical about the life of Treasury Secretary Alexander Hamilton?
[16:34:55] I was not
[16:35:08] okay, let's try this out
[16:35:23] https://www.youtube.com/watch?v=MEm2lx2YD3M is Washington and LaFayette working on strategy prior to Yorktown
[16:35:38] it currently holds, at 6.5 words a second, the record for fastest song ever on broadway
[16:35:45] ha.
[16:35:48] if you like it, I know a guy who can get you the whole original cast recording ;)
[16:35:52] * halfak starts listening
[16:41:37] * halfak needs a playlist
[16:41:49] Oh. here: https://www.youtube.com/watch?v=Zp9HUc9HraQ&list=PLUSRfoOcUe4avCXPg6tPgdZzu--hBXUYx
[17:05:40] halfak, it's kinda amazing
[17:05:50] Hey Guerillero|BNC.
[17:05:54] I have data for you!
[17:06:04] * halfak moves data to public server
[17:06:09] :D
[17:06:18] how many gigs?
[17:07:06] 15 MB. 176,150 anonymous edits scored with three models (reverted, goodfaith, damaging)
[17:07:23] ok, that is a reasonable size
[17:21:32] halfak, yay!
[17:23:35] OK. The dataset will be in here some time in the next 30 minutes: http://datasets.wikimedia.org/public-datasets/enwiki/eq_studies/
[17:23:46] There's an rsync cycle for uploading that I can't control.
[17:23:57] there's a million bits that haven't crossed
[17:24:01] but just you wait! just you wait!
[17:24:11] https://www.youtube.com/watch?v=Zp9HUc9HraQ&list=PLUSRfoOcUe4avCXPg6tPgdZzu--hBXUYx this video is unavailable
[17:24:18] and that's with my wmf account too
[17:24:23] halfak:
[17:24:40] Guerillero|BNC, see dataset link above.
[17:24:58] apergos, weird. It's just a playlist of the musical that Ironholds has been discussing.
[17:25:27] so bizarre
[17:25:33] apergos, aw :(. Geo-blocking?
[17:26:12] bummer
[17:26:14] could well be
[17:26:23] halfak, it 404s
[17:26:40] Guerillero|BNC, it will for a few more minutes. Sorry for the trouble.
[17:26:47] okay
[19:14:56] leila:
[19:15:03] we had a question in the tech channel:
[19:15:08] https://www.irccloud.com/pastebin/dqiXwnFl/
[19:15:25] kevinator: I'm on it. thanks.