[17:40:14] I've been playing around with some login stuff lately and realized our captcha can often get pretty explicit. I put some screenshots on Twitter (obviously). https://twitter.com/niharikakohli29/status/854025939717545984
[17:40:42] I've been testing on Beta cluster. Not sure if it's the same on English/other wikis.
[17:50:44] Niharika, just add everything you see to the blacklist, it will actually take effect now that we're regenerating captchas monthly
[17:51:14] MaxSem: But "boob" has been in the blacklist since 2014-06-24
[17:51:16] if you are worried about "shag" ... you are worried about too many things :)
[17:51:24] uh?
[17:52:04] https://gerrit.wikimedia.org/r/#/c/141731/
[17:52:09] Um, I certainly wouldn't like a 10-year-old kid creating an account to see that. :)
[17:53:11] That last one might have been "snag" rather than "shag"?
[17:53:24] Reedy, I thought we were regenerating captchas now
[17:53:32] anomie: Na, it was shag. That logged me in.
[17:53:43] ok then
[17:56:03] MaxSem: I'm sure there are plenty of lists for known explicit words out there. Can't we just use one of those to start with instead of starting one from scratch?
[17:56:34] patches welcome :)
[17:56:52] :) On it!
[17:59:08] Hmm. From https://phabricator.wikimedia.org/T159581#3084791 it looks like we're not even using that blacklist anyway.
[18:01:58] although "boob" should still have been blacklisted.
[18:02:30] I can't find the blacklist it appears to be using.
[18:04:03] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/mediawiki/manifests/maintenance/generatecaptcha.pp;143f1bc15e98a9e679268b548a4e0b465288ef3a$23 makes me suspect it's not in public Gerrit anywhere.
[18:04:44] Interesting. I wonder why.
[18:13:52] anomie, however even that list (which is much longer) has "boob" in it
[19:16:35] MaxSem: How do you submit patches to the private Gerrit repo?
[19:16:44] it's in the private puppet repo
[19:17:19] And we do. If mediawiki/swift interfacing doesn't crap itself
[19:39:17] Reedy: only roots can see/patch the private puppet AFAIK. When I need something there I usually email a root
[19:39:56] Yeah, exactly
[19:40:15] ah. I get the snark in the question now
[19:40:21] *whoosh*
[20:12:09] Is there a good reason for having that list private?
[20:13:53] not sure
[20:13:58] the blacklist makes less sense
[20:14:23] the actual wordlist makes more sense... As it's giving a list of potential values it could be... So it helps narrowing
[20:18:07] Should I file a ticket to make it public?
[20:20:52] Maybe
[20:21:37] Niharika: I think it might be better to just do like you said before and reuse a 3rd party list
[20:22:01] Reedy: But we'd still have to maintain it.
[20:22:11] Well, we don't really maintain ours now
[20:22:12] So...
[20:22:18] That's the problem, right?
[20:22:28] I don't know how much of a problem it actually is
[20:22:30] If it was public, it would have been maintained better.
[20:22:37] We do get the odd word that pops up
[20:22:50] But I don't think we get too many that are problematic
[20:25:30] Reedy: Why did I see "boob" if that's already in the blacklist? Is it possibly because I was testing on Beta?
[20:25:39] Yeah
[20:25:44] Beta has a tiny blacklist
[20:26:11] And a small wordlist too, I think
[20:26:39] Reedy: And does every language have its own wordlist and blacklist for captcha?
[20:26:43] Nope
[20:26:47] There's just one, the English one
[20:27:02] I see.
[20:27:08] Captchas suck
[20:27:16] We really need to decide longer term what we're doing about them
[20:27:22] There are some suggestions that we should remove them completely
[20:27:35] Totally. I like the nice image captcha most websites have now.
[20:27:43] Yeah
[20:27:52] There's been a few like that, or Google's nocaptcha stuff etc
[20:42:45] nocaptcha relies on tracking cookies
[20:43:41] the core captcha problem is that to do anything "nice" is completely non-trivial and likely to be privacy invading
[20:44:56] the image captchas at google are built on a large corpus of classified image data. if we did that via commons then you'd just have to use tineye to find the image on commons and look at the classifications
[20:45:24] nocaptcha things mostly use session/tracking cookies with other services
[20:45:33] And of course, our solution currently is very non-accessible
[20:45:46] google's always gives me images to classify because I shed their cookies
[20:46:08] accessibility and captcha are pretty incompatible
[20:46:49] I have heard from several people that when we tried turning the captcha off on mw.o it grew spam bots at an alarming rate
[20:47:06] but I don't know that there is published data on that anywhere
[20:58:32] The image captcha I was talking about is not real images, but those tiny fontawesome icon-like images. "Click on the umbrella" etc. Can't remember where I have used it recently...
[21:15:40] bd808: It did. My theory is this.... The bots that are sophisticated enough to spam us are focused on more important things. The ones that weren't are run by people dumb enough to care about spamming mw.org.
[21:16:23] (sophisticated enough to break a rudimentary captcha, that is)
[21:17:17] I'm pretty sure most /serious/ linkspam bots don't actually target us a ton, community tools/monitoring/bots clean them up too quickly for it really to help much.
[21:17:30] But if we turn off the captcha, we get the idiots who don't know/care.
[21:17:51] There's probably a better way than captchas to detect linkspammers :)
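
Editor's note: the wordlist/blacklist mechanism discussed above can be illustrated with a minimal sketch. This is not the actual captcha generation script. The function names, the two-word join, and the sample lists are all assumptions made for illustration. It only shows why a short blacklist (as on Beta) lets explicit words through, and why matching on substrings of the joined word matters.

```python
import random


def is_clean(candidate, blacklist):
    """Return True if no blacklisted term occurs anywhere in the candidate.

    Matching is case-insensitive and substring-based, so a blacklisted
    term formed across the join of two harmless words is caught as well.
    Blacklist terms are assumed to be lowercase and non-empty.
    """
    lowered = candidate.lower()
    return not any(term in lowered for term in blacklist)


def pick_captcha_word(wordlist, blacklist, rng, max_tries=1000):
    """Join two random words; retry while the result hits the blacklist.

    Hypothetical helper: the real generator may build words differently.
    """
    for _ in range(max_tries):
        candidate = rng.choice(wordlist) + rng.choice(wordlist)
        if is_clean(candidate, blacklist):
            return candidate
    raise RuntimeError("no acceptable captcha word found")


if __name__ == "__main__":
    # Sample data only; the production lists are not public.
    wordlist = ["snag", "boob", "tree", "lamp"]
    blacklist = {"boob", "shag"}
    rng = random.Random(0)  # seeded so repeated runs pick the same word
    print(pick_captcha_word(wordlist, blacklist, rng))
```

With an empty or tiny blacklist, `is_clean` accepts nearly everything, which matches the behaviour reported on Beta.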