[05:22:49] Scoring-platform-team, Gadgets, Code-Health, artificial-intelligence: Detect/flag potentially malicious gadget/javascript edits - https://phabricator.wikimedia.org/T208140 (Bawolff) So it's impossible to tell if a program is evil or not, definitively (Assuming you have some technical definition of...
[10:23:34] Scoring-platform-team, Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for Galician Wikipedia - https://phabricator.wikimedia.org/T201146 (Theklan) Actually you don't have to evaluate the REAL content of the article, but the structural completeness: if it...
[13:14:21] o/
[15:00:55] afk for lunch, will be back soon
[15:27:42] back now
[15:48:34] Forgot about that.
[15:48:38] awight, o/
[15:49:19] I want to take a "comp day" today. Been working through the last couple of weekends :\ OK if I move our 1:1 to the same time tomorrow?
[15:50:25] I'll still be at sync :)
[15:50:51] halfak: I want to get celery four on beta cluster. Do you think it's okay?
[15:51:26] Amir1, probably. I don't know if I've reviewed your test documentation on wmflabs cluster.
[15:51:30] Is that written up?
[15:51:37] halfak: That works. I had some terrible RL shit happen this morning so I'm gonna be a mess anyway.
[15:51:41] hmm, I also need to deploy wikilabels today to fix the z-index issue
[15:51:51] oh no. Sorry to hear that awight
[15:51:54] halfak: it's a response on the PR
[15:52:00] Thanks, just life
[15:52:00] Gotcha.
[15:52:40] * Amir1 offers a hug
[15:58:20] awight: hello! running 5 mins late for the meeting you invited me to, chomping down on dinner fast!
[16:01:43] Uh... not allowed to join my own hangout
[16:01:44] halfak: Just checking, I'm not fired right? "Error: You're not allowed to join this video call"
[16:01:45] Working on it.
[16:01:47] hahaha
[16:01:50] yeah. wtf
[16:02:14] looks like ewhit is in there somehow
[16:02:27] halfak: I can't join
[16:02:36] Here's a random new one: https://meet.google.com/otv-auwr-yus?hl=en&authuser=0
[16:02:55] Can someone ping ewhit to notify of the new meet?
[16:03:21] https://meet.google.com/jzp-npps-khm
[16:03:25] argh
[16:03:26] ^ This is the new official one.
[16:03:28] hey
[16:03:45] ah
[16:03:58] harej: https://meet.google.com/jzp-npps-khm
[16:04:27] saurabhbatra: ^ also FYI, something happened to the hangout so here's the new one
[16:33:24] Scoring-platform-team (Current), Wikilabels, User-Ladsgroup: The skip button doesn't work on fullscreen - https://phabricator.wikimedia.org/T208022 (Ladsgroup) Deployed
[16:53:18] PROBLEM - ssh on ORES-worker02.experimental is CRITICAL: connect to address ores-worker-02.ores.eqiad.wmflabs and port 22: No route to host
[16:53:21] PROBLEM - ping4 on ORES-worker01.experimental is CRITICAL: CRITICAL - Host Unreachable (ores-worker-01.ores.eqiad.wmflabs)
[16:53:22] PROBLEM - check load on ORES-web02.Experimental is CRITICAL: connect to address 10.68.23.111 port 5666: No route to host; connect to host ores-web-02.ores.eqiad.wmflabs port 5666: No route to host
[16:53:26] PROBLEM - puppet on ORES-worker02.experimental is CRITICAL: connect to address 10.68.22.195 port 5666: No route to host; connect to host ores-worker-02.ores.eqiad.wmflabs port 5666: No route to host
[16:53:32] PROBLEM - puppet on ORES-web02.Experimental is CRITICAL: connect to address 10.68.23.111 port 5666: No route to host; connect to host ores-web-02.ores.eqiad.wmflabs port 5666: No route to host
[16:53:37] PROBLEM - check load on ORES-worker01.experimental is CRITICAL: connect to address 10.68.19.132 port 5666: No route to host; connect to host ores-worker-01.ores.eqiad.wmflabs port 5666: No route to host
[16:53:40] PROBLEM - Host ORES-worker01.experimental is DOWN: CRITICAL - Host Unreachable (ores-worker-01.ores.eqiad.wmflabs)
[16:53:42] PROBLEM - check users on ORES-web02.Experimental is CRITICAL: connect to address 10.68.23.111 port 5666: No route to host; connect to host ores-web-02.ores.eqiad.wmflabs port 5666: No route to host
[16:53:43] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:44] PROBLEM - ping4 on ORES-worker02.experimental is CRITICAL: CRITICAL - Host Unreachable (ores-worker-02.ores.eqiad.wmflabs)
[16:53:46] PROBLEM - ssh on ORES-web02.Experimental is CRITICAL: connect to address ores-web-02.ores.eqiad.wmflabs and port 22: No route to host
[16:53:49] paladox, ^?
[16:53:53] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:53] PROBLEM - check load on ORES-worker02.experimental is CRITICAL: connect to address 10.68.22.195 port 5666: No route to host; connect to host ores-worker-02.ores.eqiad.wmflabs port 5666: No route to host
[16:53:54] uh
[16:54:02] PROBLEM - check users on ORES-worker02.experimental is CRITICAL: connect to address 10.68.22.195 port 5666: No route to host; connect to host ores-worker-02.ores.eqiad.wmflabs port 5666: No route to host
[16:54:05] i wonder why that's happening
[16:54:13] oh network down?
[16:54:16] PROBLEM - ping4 on ORES-web02.Experimental is CRITICAL: CRITICAL - Host Unreachable (ores-web-02.ores.eqiad.wmflabs)
[16:54:18] Seems like it might be.
[16:54:23] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:54:25] PROBLEM - check disk on ORES-web02.Experimental is CRITICAL: connect to address 10.68.23.111 port 5666: No route to host; connect to host ores-web-02.ores.eqiad.wmflabs port 5666: No route to host
[16:54:26] ores.wmflabs.org doesn't load.
[16:54:32] PROBLEM - Host ORES-web02.Experimental is DOWN: CRITICAL - Host Unreachable (ores-web-02.ores.eqiad.wmflabs)
[16:54:35] PROBLEM - check disk on ORES-worker02.experimental is CRITICAL: connect to address 10.68.22.195 port 5666: No route to host; connect to host ores-worker-02.ores.eqiad.wmflabs port 5666: No route to host
[16:55:20] PROBLEM - Host ORES-worker02.experimental is DOWN: CRITICAL - Host Unreachable (ores-worker-02.ores.eqiad.wmflabs)
[16:56:46] halfak is it on labvirt1015?
[16:57:41] Scoring-platform-team: Make a wikilabels view to see a single task. - https://phabricator.wikimedia.org/T208239 (notconfusing)
[16:59:22] RECOVERY - Host ORES-worker02.experimental is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms
[16:59:23] RECOVERY - puppet on ORES-worker02.experimental is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
[16:59:33] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 442 bytes in 6.575 second response time
[16:59:41] Good Q paladox. /me looks
[16:59:42] RECOVERY - ping4 on ORES-worker02.experimental is OK: PING OK - Packet loss = 0%, RTA = 1.08 ms
[16:59:47] halfak apparently it is
[16:59:48] RECOVERY - Host ORES-worker01.experimental is UP: PING OK - Packet loss = 0%, RTA = 1.75 ms
[16:59:48] RECOVERY - check load on ORES-worker02.experimental is OK: OK - load average: 0.00, 0.00, 0.00
[16:59:50] saurabhb: Neat! So, I wanted to share a small thought about the negative examples.
[16:59:51] since they just fixed it
[16:59:53] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.542 second response time
[17:00:00] RECOVERY - check users on ORES-worker02.experimental is OK: USERS OK - 0 users currently logged in
[17:00:12] haha. Sorry to ping you unnecessarily then and thanks for chiming in, paladox
[17:00:23] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 442 bytes in 0.547 second response time
[17:00:28] halfak :)
[17:00:30] RECOVERY - Host ORES-web02.Experimental is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms
[17:00:32] RECOVERY - check disk on ORES-worker02.experimental is OK: DISK OK
[17:00:54] I have to be afk for a couple of hours, will be back soon
[17:00:56] saurabhb: The positive examples are probably very atypical, e.g. large edits and so on. We might want to increase the number of negative examples which share some of those basic characteristics.
[17:01:09] awight: ping you back in 5 if that's okay?
[17:01:17] sure!
[17:01:47] o/
[17:03:12] awight, I was thinking the same thing.
[17:03:37] I'm not quite sure how to re-extrapolate that in evaluation, but I think the model will learn more useful things.
[17:03:51] Also, I think we need to model at the level of user -- and not edit.
[17:03:56] Maybe editsession.
[17:03:57] hmm
[17:05:50] awight: sorry about that!
[17:05:57] Woops. Just saw that saurabhbatra jumped out.
[17:06:19] I was agreeing with awight and suggesting that we make the focus of prediction either an entire user's history or a specific edit session.
[17:07:33] that's a great point to think about
[17:08:09] i like the idea of making the predictions based on the user's history
[17:08:40] Seems like various patterns of edits would indicate things more clearly.
[17:08:52] E.g. working on only one page or closely linked pages.
[17:09:14] that might complicate things more though
[17:09:56] i'm trying to think of a high-level way to classify the aggregate of edit scores in a meaningful way
[17:11:06] anyway, i think whichever path we choose, we will have to classify individual edit sessions either way
[17:11:36] i think it might be easier to make this decision when we reach that stage
[17:17:14] saurabhbatra, fair point for sure. I've been working on how ORES might do session/user-based predictions.
[17:17:25] So maybe I can get that worked out in the meantime.
[17:18:13] ah, I just saw awight's comment about the negative samples
[17:19:08] random sampling here might not work. afaik a lot of text classification problems have a bias against/for lengthy texts
[17:19:48] the easy solution being that we clip the text at N no. of words
[17:20:32] We could also have features that are normalized by length.
[17:20:57] halfak: agreed
[17:21:23] text classification is hard even when it's easy *sigh*
[17:21:29] ^ +1
[17:21:37] OK I think it's time for me to head out (didn't get a weekend, need some time AFK)
[17:21:49] halfak: have a great one! o/
[17:22:00] o/
[17:22:06] harej, awight, Amir1: ^
[17:22:15] o/
[17:23:22] halAFK: saurabhbatra: As a side note about editor histories, another thing we might be looking for is an account with very little editing experience which creates a fully-compliant article with citations etc., out of the blue.
[17:24:50] awight: i'm making a note of all these ideas
[17:24:57] will share the doc w you
[17:25:02] :)
[17:26:17] awight: i think the first step is getting the dataset for the negatives
[17:26:31] That does sound fun!
[17:26:55] so i was looking at the dataset for the positives
[17:27:10] https://figshare.com/articles/Known_Undisclosed_Paid_Editors_English_Wikipedia_/6176927
[17:27:12] It might be worth extracting some of our features for the positives, to get an idea of average length, variance in length, etc
[17:27:37] ahh
[17:27:40] excellent point
[17:28:04] so the dataset mentions a case_page_name
[17:28:19] i'm assuming we can use that as a reference to get to the actual edit?
[17:29:30] oh, I didn't realize! Looking...
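The length-handling ideas raised above (clipping text at N words, normalizing features by length, and checking average length and variance for the positives) could be sketched roughly as follows. This is only an illustration: the helper names `clip_words`, `per_word_rate`, and `length_summary` are hypothetical, and splitting on whitespace is an assumed stand-in for whatever tokenization the real features would use.

```python
from statistics import mean, pvariance


def clip_words(text, max_words):
    """Keep only the first max_words whitespace-separated tokens."""
    return " ".join(text.split()[:max_words])


def per_word_rate(count, text):
    """Normalize a raw count by the number of words; 0.0 for empty text."""
    n_words = len(text.split())
    return count / n_words if n_words else 0.0


def length_summary(texts):
    """Return (mean, population variance) of word counts across texts."""
    lengths = [len(t.split()) for t in texts]
    return mean(lengths), pvariance(lengths)
```

Running `length_summary` over the positive examples first would give a target length distribution, which could then guide both the clipping threshold and which negatives to sample.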
[17:32:42] saurabhbatra: I think we proceed using user_name
[17:33:22] ahh
[17:33:24] https://en.wikipedia.org/w/index.php?title=Special%3AContributions&contribs=user&target=Annakoppad&namespace=&tagfilter=&start=&end=
[17:33:33] so edits by a user are publicly available
[17:33:52] Some of the edits might be deleted, which probably requires an NDA refresh btw.
[17:34:28] darn, that might take some time...
[17:35:06] is it fair to classify all the edits made by these users as paid promotion?
[17:35:23] I'm ignorant about most of the behind-the-scenes stuff, but I imagine that it's easier if you're not touching fundraising data.
[17:35:39] saurabhbatra: Good question, it probably isn't fair
[17:37:05] I have to run for 10 min, sorry
[17:37:22] We might need to hand-label edit sessions for these users, I'm not sure yet
[17:37:28] It's worth spot checking
[17:38:00] yup i'm looking at that rn
[17:38:27] ping me whenever you're back on!
[17:52:30] awight: looks like it's going to be a darn tough task
[17:58:15] saurabhbatra: I'm back
[17:58:48] Yeah this is a several-month task for sure, but IMO there are plenty of incremental steps to take us forward.
[17:59:03] awight: so i've been looking at the edits made by the people implicated
[17:59:52] it seems like there are suspicious edits made by a user
[18:00:06] and then a lot of other sockpuppets of the same user
[18:00:33] ah yeah that's a nasty bit of wiki
[18:00:55] how do they ascertain that a person is a sockpuppet of xyz?
[18:01:13] manually going through their edits and finding the same COI?
[18:01:23] I think it's a big deal, takes a lot of research, and it can even involve IP addresses.
[18:01:43] https://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry
[18:01:48] https://en.wikipedia.org/wiki/Wikipedia:Sockpuppet_investigations
[18:01:58] ah, yes i would imagine
[18:02:07] it is a rather serious allegation
[18:02:23] Yeah, sockpuppeting can get you kicked to the curb even if you otherwise have an acceptable record.
[18:02:26] (CR) EBernhardson: "As mentioned in the ticket, implementing the content handler side only indexes the data, it doesn't implement any method of retrieving tha" [extensions/JADE] - https://gerrit.wikimedia.org/r/470061 (https://phabricator.wikimedia.org/T206352) (owner: Awight)
[18:02:37] The message it sends is "I'm not engaging your community in good faith"
[18:02:49] There's a detail to keep in mind, when we're solving this: there are legitimate reasons a person would use a secondary account, https://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry#Legitimate_uses
[18:02:52] harej: agreed
[18:03:18] We actually don't want to build a machine that can profile and match users AIUI
[18:03:47] awight: do you think there might be an easy programmatic way to determine in which edit the COI arose for the flagged users?
[18:03:59] we might be able to get our dataset from there
[18:04:14] No such thing exists AFAICT
[18:04:42] awight: https://github.com/wikimedia/wikilabels/pull/249 When you have time :D
[18:05:39] for ex. a user makes a wiki about a not so well known Indian athlete and a couple of Indian professors before making changes to a wiki about a hospital running corporation (the one which he/she is flagged for)
[18:06:23] https://en.wikipedia.org/w/index.php?limit=50&title=Special%3AContributions&contribs=user&target=Annakoppad&namespace=&tagfilter=&start=&end=
[18:06:36] Sometimes they make it easy for us -- the username editing the article is the same as the article name
[18:06:52] hahaha
[18:06:56] It makes me sad when this happens because it's obvious they're unaware of the rules. It's not even like they're trying to pull a fast one on us.
[18:07:09] Scoring-platform-team (Current), editquality-modeling, revscoring, artificial-intelligence: Create a newcomerquality meta-model for revscoring - https://phabricator.wikimedia.org/T205926 (notconfusing) Starting repo with ipynb documenting work so far: https://github.com/notconfusing/newcomerquali...
[18:07:23] harej: agreed, I too was unaware of that rule until a recent time :P
[18:07:31] although it made sense the moment i read it
[18:08:02] Wikipedia is frustrating because the software lets you do nearly unlimited things, but there are social conventions that are not obvious.
[18:08:38] So learning Wikipedia is as much socializing norms as it is learning how to use a piece of technology.
[18:09:05] wikimedia/wikilabels#428 (master - 0295bb0 : Amir Sarabadani): The build was fixed. https://travis-ci.org/wikimedia/wikilabels/builds/447942242
[18:09:06] yes, i get that now. it's a lot more community driven than i thought it would be
[18:09:34] saurabhbatra: It's possible that the initial edits are to make it look like less of a single-purpose account, or just an innocent and prolific editor.
[18:10:31] It does seem like we need to do work in order to find the COI article, though. It's possible we could get help with that, from Doc James and the other people involved in compiling the original list.
[18:11:39] If I wanted to reach out to all the users who are signed up for edit-quality-labelling (23 users https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_quality). Would you say I should talk-page them individually, or {{ping}} them from the project talk page?
[18:11:39] [1] https://meta.wikimedia.org/wiki/Template:Reply_to - Redirect from https://meta.wikimedia.org/wiki/Template:ping?redirect=no
[18:11:52] another example - https://en.wikipedia.org/w/index.php?limit=50&title=Special%3AContributions&contribs=user&target=Bengaloorugirl&namespace=&tagfilter=&start=&end=
[18:12:25] a good comms question for harej?
[18:12:50] There are limits for pinging, so I would go to each of their talk pages, notconfusing.
[18:13:10] Thanks. I believe the ping limit is 10 users per ping.
[18:13:20] But I'll do the leg work.
[18:13:25] so this person makes an account, corrects some grammar in a random article, and then goes to the COI notice of Annakoppad (the original person who's involved in the COI) and starts the discussion on behalf of him/her
[18:13:28] I also once overheard a conversation that leads me to believe the limit isn't easily worked around, either.
[18:13:34] and then gets flagged and taken down
[18:14:16] so in this case, there is no implicating edit!
[18:15:17] There's also this concept of a single purpose account. https://en.wikipedia.org/wiki/Wikipedia:Single-purpose_account
[18:15:31] If someone starts engaging in politicking on their second edit they might be flagged as such.
[18:15:50] In my opinion it's one of the more problematic labels we apply to people (people sometimes have narrow interests, it's a thing) but I think it's something to be aware of as a concept.
[18:16:47] ah, yes i get it now
[18:17:23] tough to code such policy to get our edits though :\
[18:17:34] to put in code*
[18:18:10] saurabhbatra: FWIW, Annakoppad actually did some extensive editing outside the article we know is a COI. I'm not sure how we can tell whether these are also contracted, or just a labor of love.
[18:19:09] some of them are fishy I agree, for ex. some of the articles about the Indian professors contain details that would be hard to come by if you were not closely linked to that person
[18:19:22] extensive quotes and whatnot
[18:20:33] And flowery language
[18:21:12] I'm starting to think that the editor's relationship with each article is on a per-article basis, so we can look at their history editing just that article and its talk page (and maybe "articles for deletion" discussion about that page?), but that our prediction for one (editor, article) pair shouldn't affect our prediction for any other pair?
[18:21:12] harej: yup, the grammar is not impeccable
[18:22:11] awight: interesting concept
[18:22:41] but in the end we want the blame to fall on a user
[18:23:14] saurabhbatra: oh hey! https://en.wikipedia.org/w/index.php?title=Wikipedia:Conflict_of_interest/Noticeboard&oldid=815686276#Annakoppad
[18:24:35] i think i'll manually go through some other user histories
[18:24:40] and try to find the pattern
[18:25:13] That noticeboard is a gold mine fwiw
[18:25:47] rn i have only found 2 cases where people have been flagged for COI - 1. if their article has a COI issued against them, 2. if they try to advocate for the people judged guilty in 1
[18:26:47] if we can get a limited no. of such sub-categories, we can then make some progress on getting the "positive" edits out from wikipedia and into a dataset
[18:28:36] once we have positives and an analysis of trends on them (doc. length etc.) we can go on to get a dataset of corresponding negatives
[18:31:25] awight: how much do you know about on-wiki gadgets using labels.wmflabs.org as an external dependency?
[18:31:36] awight: catch ya tomorrow, i'll work on this some more and update you on the progress o/
[18:31:39] harej: Hadn't heard of it until now
[18:31:45] saurabhbatra: Thanks!
[18:32:04] I'll bother a.aron when he returns to work.
[18:32:22] Either he's responsible or he knows who is.
[18:32:45] (Security people are getting antsy about gadgets loading non-production resources)
[18:35:50] harej: What breadcrumbs do you have?
[18:36:56] harej: Gadgets are defined in wiki pages, right?
[18:36:56] https://en.wikipedia.org/w/index.php?search=labels.wikimedia.org&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B0%2C1%2C2%2C3%2C4%2C5%2C6%2C7%2C8%2C9%2C10%2C11%2C12%2C13%2C14%2C15%2C100%2C101%2C108%2C109%2C118%2C119%2C710%2C711%2C828%2C829%2C2300%2C2301%2C2302%2C2303%5D%7D&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns6=1&ns7=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12
[18:37:02] =1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns108=1&ns109=1&ns118=1&ns119=1&ns710=1&ns711=1&ns828=1&ns829=1&ns2300=1&ns2301=1&ns2302=1&ns2303=1
[18:37:14] ^ https://tinyurl.com/yb3vfw6p
[18:37:49] May not be on English Wikipedia
[18:38:02] oh I used the wrong URL anyway
[18:38:38] There are user JS scripts
[18:38:47] No gadgets that I see (on enwiki), https://en.wikipedia.org/w/index.php?search=labels.wmflabs.org&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B2300%2C2302%5D%7D&ns2300=1&ns2302=1
[18:39:17] but here's the user JS, https://tinyurl.com/yc8os2op
[18:39:33] lots of this:
[18:39:34] mw.loader.load( '//labels.wmflabs.org/gadget/loader.js' );
[18:39:46] https://en.wikipedia.org/wiki/User:SPQRobin/common.js
[18:40:15] Interesting that we take this approach; couldn't we get away with productionizing the gadget?
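For anyone auditing user scripts for the loader call quoted above, a minimal sketch of pulling out `mw.loader.load()` arguments that point at the external host might look like this. It assumes the page source has already been fetched as a string; the function name `external_loads` is hypothetical, and the regex only covers the simple quoted-string form seen in the log, not every way a script could build the URL.

```python
import re
from urllib.parse import urlparse

# Matches mw.loader.load( '<arg>' ) with single or double quotes.
LOADER_CALL = re.compile(r"""mw\.loader\.load\(\s*(['"])(?P<arg>[^'"]+)\1""")


def external_loads(js_source, host="labels.wmflabs.org"):
    """Return mw.loader.load() arguments whose URL host equals `host`."""
    args = [m.group("arg") for m in LOADER_CALL.finditer(js_source)]
    # urlparse handles protocol-relative URLs such as //host/path;
    # plain module names have an empty netloc and are skipped.
    return [a for a in args if urlparse(a).netloc == host]
```

Comparing the parsed host rather than doing a substring match avoids flagging unrelated URLs that merely mention the host name in a path or query string.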
[18:40:25] (CR) EBernhardson: [WIP] Index some data extracted from judgment page content (3 comments) [extensions/JADE] - https://gerrit.wikimedia.org/r/470061 (https://phabricator.wikimedia.org/T206352) (owner: Awight)
[18:41:49] yeah it's not much of a script, https://labels.wmflabs.org/gadget/loader.js
[18:42:04] We can easily make that into a module in Extension:ORES
[18:47:48] harej: ooh there's more, https://labels.wmflabs.org/gadget/WikiLabels.js
[18:48:34] The big question I guess, is whether it's set up this way to work around cross-site scripting.
[18:49:21] Well, it's mildly urgent in that Security wants to enforce a Content Security Policy that forbids cross-site scripting.
[18:49:27] Or at least restricts it heavily.
[18:53:02] I have the gadget installed now, but not seeing any UI. Wonder what it goes.
[18:53:05] *does
[18:53:58] https://phabricator.wikimedia.org/T102333
[18:54:37] declined
[19:02:10] https://trello.com/c/lNSuKllQ/205-deploy-wikilabels-to-labelswmflabsorg-and-update-docs
[19:04:35] Yeah let's check with A.aron, but it seems that the old quasi-gadget was deprecated by the labs site, doesn't work any more. and we're free to disable in a way that doesn't break mw.loader.
[19:04:39] *,
[19:37:55] harej: Interesting responses from ebernhardson in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/JADE/+/470061/
[19:38:09] harej: I could use some help prioritizing our search use cases BTW
[19:39:39] o/
[19:39:49] harej, awight: we don't use that anymore.
[19:39:59] Now we send people directly to labels.wmflabs.org
[19:40:07] We're looking at editors doing GUI searches, for stuff like "not damaging" or "Huggle", but I couldn't explain the scenario yet.
[19:40:20] halAFK: Perfect, we'll just work on deprecating then.
[19:40:52] Or um helping users remove these references.
[19:41:53] awight, assign me the search use-case task and I'll spend some time on it later.
[19:42:06] Oh. Or harej!
[19:42:08] Of course.
[19:42:38] JADE, Scoring-platform-team (Current), Advanced-Search, Discovery-Search, and 4 others: Extract judgment data for end-user search indexing - https://phabricator.wikimedia.org/T206352 (awight)
[19:44:07] JADE, Scoring-platform-team (Current), Advanced-Search, Discovery-Search, and 4 others: Extract judgment data for end-user search indexing - https://phabricator.wikimedia.org/T206352 (awight) @Harej I came up with some use cases off-the-cuff, plus some input from @halfak above, but I'd like to ha...
[19:44:10] noted
[19:50:21] harej: ^ I left some notes, but what I'm thinking is that we need explicit scenarios to understand whether there's an obvious workaround, and how to keep the search syntax intuitive.
[19:55:50] awight: i might have misinterpreted what's wrong with your patch, if you want "foo bar" typed into search without any qualifiers to match origin it needs what i mentioned, re-reading over it though you only want the keyword to work which should be possible with our current setup. I'll need to pull it down to see why it's not happening. It might only be that you need to re-create the search
[19:55:56] index after defining your new fields
[19:56:41] ebernhardson: I did rebuild the index using forceSearchIndex.php
[19:57:27] hmm, that should have been all that's necessary. I'll setup jade and figure out why
[19:57:57] We need to clarify our use cases anyway, but one issue is that we probably want to search for either "damaging" or "nondamaging" values, so the search syntax is going to be non-trivial no matter what.
[19:58:21] Ah, there's also a language issue, do we want the keywords to be translated...
[19:58:41] ebernhardson: Don't bother debugging yet IMO, we'll have a better-defined failure shortly ;-)
[19:58:46] :)
[19:59:42] The earlier JADE search patch is working nicely BTW, I was able to render our data into its wikitext form and then into HTML, and use that to populate the result summaries and fulltext field that will be searched by default.
[19:59:55] awight: currently that would be fairly straightforward, all keywords can be easily negated, so damaging and non-damaging are the same query but in a different spot (must vs must not)
[19:59:55] That knocks out 90% of our needs right off the bat.
[20:00:31] basically negation comes for free with filtering keywords
[20:01:11] Inlining the keywords with "all" data might be a problem though, the text itself will have another copy of that information but rendered differently, let me show rather than tell...
[20:01:25] For example: https://en.wikipedia.beta.wmflabs.org/wiki/Judgment:Diff/376901
[20:01:46] You can see that "damaging" and "not damaging" already appear in the text.
[20:02:14] awight: I'm not sure that search is that important in this case, but I'm eager to hear more about your thoughts.
[20:02:27] My (very imaginary) use case is that we might want to search for "preferred judgment is 'damaging', but ORES damaging score is < 0.2"
[20:03:05] Thinking of the research use case, won't they have access to more specialized tooling?
[20:03:11] harej: Agreed, I'm thinking it's just a convenience.
[20:03:11] awight: right, but the presence of damaging or not damaging in that text isn't definitive, it could be in the notes for example.
[20:03:22] for research, i imagine they want a csv dumped into pandas or whatever
[20:03:34] harej: yes, researchers will be going through a stat* replica, I believe.
[20:03:41] yeah or a dump, exactly.
[20:04:49] ebernhardson: Yeah that's what I mean, there's no good workaround for searching for "*damaging" in the text, except perhaps a fragile regex that I wouldn't want to recommend.
[20:05:55] harej: ebernhardson: Another possibility is that Extension:JADE will provide a specialized interface for finding the sort of content I'm thinking about.
[20:06:55] awight: for your mentioned use case, "is damaging with score in range", that's basically the primary thing we support in CirrusSearch for extensions to do things with
[20:07:09] awight: in my mind that specialized interface would still call full text under the hood, but hide the keywords into an interface
[20:07:13] oh good!
[20:07:30] "full text"?
[20:07:58] awight: we think of search as essentially two things, either autocomplete or full text. Full text is where all the fancy processing happens
[20:08:16] From how I understand it, that seems like a nice approach. We let Cirrus do all the indexing and then tap into that from custom UIs.
[20:08:44] That sounds about right to me.
[20:08:57] The ORES score part might be tricky since it requires a join with custom tables, but we can cross that bridge when we come to it.
[20:09:18] Are we talking about search in the context of the search bar?
[20:09:36] Can't we put ores predictions in elastic search?
[20:09:55] you can, and then do range searches on them.
[20:10:02] or really range filters
[20:11:00] halAFK: We can, but there's already an expensive infrastructure to do that, it would be nice if we didn't need to build and maintain it twice.
[20:11:15] harej: full text search is the search bar, but there is no reason the user has to interact directly with that. They could interact with something that constructs full text search queries under the hood and hides it from them
[20:11:21] harej: Exactly, search bar, search page, advanced search toolbar.
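To make the "judgment is damaging, but score is below a threshold" idea concrete, here is a rough sketch of an Elasticsearch-style bool query body built in Python. It assumes both values were indexed as fields, which the discussion above leaves undecided; the field names `jade_damaging` and `ores_damaging_score` and the function `judgment_query` are hypothetical, and this is not how CirrusSearch keywords are actually implemented.

```python
import json


def judgment_query(judgment_field, score_field, max_score, damaging=True):
    """Build a filter-only bool query: judgment term plus a score range."""
    term = {"term": {judgment_field: True}}
    bool_body = {"filter": [{"range": {score_field: {"lt": max_score}}}]}
    # Negation reuses the same clause, placed under must_not instead of filter.
    if damaging:
        bool_body["filter"].append(term)
    else:
        bool_body["must_not"] = [term]
    return {"query": {"bool": bool_body}}


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    body = judgment_query("jade_damaging", "ores_damaging_score", 0.2)
    print(json.dumps(body, indent=2))
```

The same term clause serves both the "damaging" and "not damaging" searches, which is the sense in which negation comes for free with filter keywords.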
[20:13:26] awight, +1 for not duplicating stuff
[20:15:15] I don't think we can consider moving the ORES indexing to be managed by the search engine, since it needs to be available for joining in pager queries...
[20:15:45] * awight has to run for a most unfortunate errand, back < 1hr
[20:22:54] Scoring-platform-team (Current), editquality-modeling, revscoring, artificial-intelligence: Create a newcomerquality meta-model for revscoring - https://phabricator.wikimedia.org/T205926 (notconfusing) Note: I sent this talkpage message to everyone on `en:Wikipedia:Labels/Edit quality` == Invita...
[21:05:02] back...
[23:23:11] Want a sexist AI? My phone’s voice mail transcriber referred to my psychiatrist “Erin” as “Aaron”
[23:31:47] Professional name... male
[23:31:49] gross
[23:32:35] Did I already mention that our drafttopic model is categorizing articles as European history and Culture when the word "von" appears?
[23:57:59] Usually a good guess, yes?
[23:58:52] This is the part where you tell me of a case where "von" has another meaning.