[01:36:33] 06Revision-Scoring-As-A-Service, 10wikilabels: [Investigate] Intermittent performance issues with wikilabels - https://phabricator.wikimedia.org/T130872#2262312 (10Ladsgroup) I'm closely monitoring this. We had several cases today and yesterday that took more than one second. Most of them were strange... [13:19:02] o/ [13:29:34] o/ [14:12:23] halfak: o/ [14:13:17] o/ [14:16:13] halfak: I've got some statistics for the intermittent issue: https://phabricator.wikimedia.org/T130872#2262312 [14:16:51] I would really appreciate it if akosiaris sent me logs or gave me some direction [14:17:06] Awesome. I was just experiencing it yesterday. [14:17:07] (in the meantime it seems one of the tables doesn't have a proper index) [14:17:21] But in my case, the server would 502 when trying to return workset stats. [14:17:38] Looks like this is what showed up in the logs! [14:18:47] yeah [14:19:04] one of the things is that workset stats seems to be slow [14:19:18] the index on the workset seems okay [14:19:25] but maybe something else is missing [14:20:39] Yeah. Was thinking that there is something going on with that query that makes it slower than expected. [14:23:26] So weird. [14:23:37] The query finishes in less than a tenth of a second right now. [14:23:40] getting the query plan. [14:23:45] Amir1: I'll look into the Postgres issue today. In an interview right now [14:24:06] thanks akosiaris [14:24:29] 06Revision-Scoring-As-A-Service, 10wikilabels: [Investigate] Intermittent performance issues with wikilabels - https://phabricator.wikimedia.org/T130872#2264079 (10Halfak) Here's the query plan for an example query (the first workset stats query to fail in @Ladsgroup's paste above): ``` Aggregate (cost=452.4... [14:24:52] * halfak fixes missing paren [14:25:43] 3 index-only scans and an aggregate. [14:25:49] halfak: one of the things that can improve its performance (but only a little) is making an index called workset_user_campaign [14:26:04] Why is that? 
[14:26:09] on workset, based on user and campaign [14:26:17] Yeah, but what query would it help? [14:26:28] the same query here [14:26:55] it does 'user_id' AND campaign_id [14:26:55] Why do we need an index that covers campaign? [14:27:08] because the WHERE has both of them [14:27:40] Would you put that index on the workset table? [14:27:50] so basically every time we ask for stats, it goes through all worksets in all campaigns and then searches for the campaign [14:28:10] I don't think that's right. [14:28:58] sorry [14:28:59] It goes directly to the workset that is requested via a Primary Key lookup. [14:29:22] Then it looks up all tasks in that workset. Then it looks up all labels for those tasks. Then it aggregates. [14:29:32] the query that fails actually goes through all worksets in a campaign [14:29:54] http://labels.wmflabs.org/users/43607616/32/?callback=jQuery111305071653840132058_1462215727833&worksets=stats&_=1462215727834 [14:29:58] this is one of them [14:30:24] all worksets in a certain campaign that are assigned to a certain user [14:30:26] Gotcha. Checking that out now. [14:30:53] i.e. the for_user function in the workset class [14:31:57] OK. I think I see what you mean for that index. [14:32:49] as I said it's not very helpful since we already index user on workset [14:33:21] and a user doesn't have so many worksets in so many campaigns [14:36:05] I do. I think that might be where I feel the pain. [14:36:26] okay [14:36:36] so I'm going to implement it [14:36:58] Let me work out the query plan first. [14:37:13] sure [14:40:31] OK. Testing some things out on the staging server [14:40:39] I think we might need to apply the index manually. [14:46:35] 06Revision-Scoring-As-A-Service, 10wikilabels: [Investigate] Intermittent performance issues with wikilabels - https://phabricator.wikimedia.org/T130872#2264124 (10Halfak) Looks like we can improve the basic scan for relevant worksets with an index on user_id and campaign_id on workset. 
``` u_wikilabels_test... [14:47:18] Amir1, https://phabricator.wikimedia.org/T130872#2264124 [14:47:23] Looks good. Let's do it. [14:48:34] I can quickly just apply the index to prod [14:48:40] awesome [14:48:44] I'll do it [14:48:47] Can you submit a PR for the schema doc? [14:49:01] sure [14:49:05] wait a sec [14:49:07] and I'll do it [14:50:14] halfak: nice [14:50:49] I'm going to run this in prod after I snap a backup [14:50:51] CREATE INDEX workset_user_campaign ON workset (user_id, campaign_id); [14:51:07] halfak: one thing: since we index based on user and campaign, we already have an index on user [14:51:14] we can just delete that [14:51:21] Seems like it, yeah [14:51:26] \o/ [14:51:31] let me do it [14:51:39] On prod or in our schema? [14:52:11] the PR [14:52:19] and if you want, on prod [14:56:46] halfak: I haven't tested it [14:56:58] let me run a quick test on the staging server [14:59:38] Taking a long time to drop workset_user. [15:00:39] ORES is down [15:00:45] it whaatt [15:01:09] up? [15:01:15] Hmm... Well I got an Icinga alert [15:01:21] So it must have been down for a moment. [15:02:00] halfak: we can simply set up Icinga to report here too [15:02:13] Probably, yeah [15:02:49] it's some settings in operations/puppet [15:03:00] if you want to have it, tell me and I'll do it [15:03:25] Let's do it [15:04:02] sure [15:04:05] :) [15:05:08] Just got the recovery notification :) [15:07:07] it seems I need to go earlier, the university is closing earlier today [15:07:14] but I'll be back very soon [15:07:41] btw. halfak the load_schema worked really smoothly [15:08:06] Nice [15:08:14] but I haven't checked whether we have the index or not [15:08:28] The index is there :) [15:12:13] {{merged}} [15:12:14] 10[1] 04https://meta.wikimedia.org/wiki/Template:merged [15:13:09] Wiki labels is down :( [15:13:23] Can't ssh to the vm [15:14:23] Yup. Fully down. [15:18:47] and we're back! 
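[Editor's note] The composite-index idea discussed above can be sanity-checked locally. This is a minimal sketch using SQLite (not the production Postgres, and with a made-up minimal table shape), showing that a single `(user_id, campaign_id)` index covers a WHERE clause that filters on both columns, which is the lookup the `for_user` query performs:

```python
import sqlite3

# Toy stand-in for the wikilabels workset table (column names taken from
# the discussion above; the real schema lives in Postgres).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE workset "
    "(id INTEGER PRIMARY KEY, user_id INTEGER, campaign_id INTEGER)"
)
con.execute(
    "CREATE INDEX workset_user_campaign ON workset (user_id, campaign_id)"
)

# EXPLAIN QUERY PLAN shows the planner using the composite index for a
# user+campaign lookup, so a separate index on user_id alone is redundant.
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM workset WHERE user_id = ? AND campaign_id = ?",
    (43607616, 32),
).fetchall()
print(plan)
```

Note that because the composite index's leading column is `user_id`, it also serves queries filtering on `user_id` alone, which is why dropping the old `workset_user` index is safe.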
[15:45:10] argggg [15:45:16] I'm back now [15:45:24] semi-afk for dinner [15:51:05] kk no worries. Looks like I'm getting a little bit better performance from wikilabels :) [16:00:09] \o/ [16:16:04] halfak: What is the plan for today? mwparserfromhell performance? [16:16:34] Yeah. I'm planning to work on that and to organize a meeting for the ORES paper. [16:17:17] cool [16:17:31] halfak: do you have anything in mind for me? [16:18:02] 06Revision-Scoring-As-A-Service, 10revscoring: Apply regex performance optimizations to badwords/informals detection - https://phabricator.wikimedia.org/T134267#2264623 (10Ladsgroup) [16:18:15] sorry, my mind is stretched a little thin recently. I'll take a look at the backlog board [16:18:32] no worries [16:19:00] I can search and find something on my own but I thought you might find something more urgent, more useful [16:19:13] Oh! I have a great one! [16:19:24] * Amir1 is all ears [16:19:27] So, a lot of our features don't have that much predictive value. [16:19:32] Yet we generate them anyway. [16:19:46] I was meaning to run some tests on the edit quality model to see what features we could drop. [16:20:14] that's pretty straightforward [16:20:36] So, there's a member variable of gradient boosting classifier models called "feature_importances_" [16:20:46] That should give you a sense for which features you can likely remove first. [16:21:03] nice [16:21:05] on it [16:21:20] :) [16:22:06] I haven't made a task for this yet [16:22:06] halfak have you got any requests for loading campaigns in arwiki? [16:22:18] GhassanMas, not sure I understand [16:23:05] when a user asks to load campaigns in wikilabels to label them as goodfaith, damaging... etc. [16:24:33] GhassanMas, there is an arwiki campaign for labeling things damaging/goodfaith right now. [16:24:39] So I'm not understanding. [16:24:48] Maybe you are asking if the campaign has progressed? 
[16:24:57] http://labels.wmflabs.org/campaigns/arwiki/?campaigns=stats [16:25:07] Looks like we have 76 of 4977 labels [16:26:53] yeah I meant the portion of the campaign which is about 50 samples [16:27:58] uhh, I don't know how to motivate the arwiki community about it [16:28:37] I posted about the project on the news page related to the editors [16:28:55] I think we need a local. Amir1 has a lot of experience with motivating work on campaigns for Wikidata and Persian Wikipedia [16:28:56] you once asked to collect a group of editors and admins for that purpose [16:29:00] Amir1, Any pro tips? [16:29:36] I used the anti-vandal bot [16:29:40] and scored revisions [16:29:57] and it didn't work very well since it was using the reverted model [16:30:07] and I told them if you want to get this better [16:30:10] label here [16:31:47] who does "them" refer to? [16:33:10] people in the wiki [16:33:14] patrols [16:33:16] etc. [16:34:36] I tried to inform these kinds of editors on their talk pages but I got nothing [16:34:51] I will try it again [16:35:24] Maybe if we had a good explanation of the value of the system. [16:35:40] E.g. https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/ in Arabic [16:36:21] I'd definitely encourage people to try out ScoredRevisions and tell them that the system gets better if you do the Wiki labels campaign. [16:37:53] yeah I thought of translating the documentation page related to the project into Arabic. [16:38:34] we may also refer to the blog post in Arabic [16:43:05] I think it may be better to start with the documentation page since it has instructions on how to install the gadget [16:44:22] halfak: okay, I ran some tests on enwiki.damaging [16:44:30] I'm trying to increase the work [16:44:49] but for now, the least important feature is feature.revision.user.has_advanced_rights [16:45:06] Yeah. Probably already covered by is_curator and is_admin. 
[16:45:09] I think that's because we already catch the signal via user.age and stuff like that [16:45:17] But that feature is really cheap to compute. [16:45:28] In order to get a performance boost, we need to remove whole classes of features. [16:45:29] let me try the next one [16:45:50] I wonder how much value we get out of using mwparserfromhell at all. [16:45:53] feature.revision.page.is_draftspace [16:45:56] Or dictionary stuff. [16:45:59] that's the second one [16:46:12] I bet it almost never even comes up -- draftspace. [16:46:14] can you tell me which features are using mwparserfromhell [16:46:32] I must say, I checked the most important ones [16:46:39] and these came up [16:47:02] we could use SVM to check which features are most important [16:47:06] >>> enwiki.damaging[4] [16:47:06] Buffer 4 is empty. [16:47:06] ) + 1))> [16:47:06] >>> enwiki.damaging[11] [16:47:06] [16:47:06] >>> enwiki.damaging[8] [16:47:06] Buffer 11 is empty. [16:47:06] Buffer 8 is empty. [16:47:06] [16:48:31] halfak: the third least important feature: [16:48:37] 10Revision-Scoring-As-A-Service-Backlog: Translating the documentation page of Wikilabels to Arabic - https://phabricator.wikimedia.org/T134405#2264693 (10Ghassanmas) [16:49:46] Amir1, are any of the badwords features useful? [16:50:26] We should probably sum their importance. Once we have the badwords extracted, all the dependent features are blazing fast to generate. 
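[Editor's note] The "sum their importance" idea can be sketched in plain Python. The feature names and importance values below are made up for illustration; in practice the pairs would come from zipping the model's feature list with its `feature_importances_`:

```python
from collections import defaultdict

# Hypothetical (feature name, importance) pairs standing in for
# zip(feature_names, model.feature_importances_).
importances = [
    ("badwords.revision.diff.match_delta_sum", 0.004),
    ("badwords.revision.diff.match_prop_delta_sum", 0.003),
    ("informals.revision.diff.match_delta_sum", 0.030),
    ("user.is_anon", 0.210),
]

# Sum importance per feature group (first dotted component) so that
# cheap dependent features are judged together rather than one by one:
# once badwords are extracted, all badwords-derived features are nearly
# free, so the group total is what matters for a keep/drop decision.
group_totals = defaultdict(float)
for name, importance in importances:
    group_totals[name.split(".")[0]] += importance

for group, total in sorted(group_totals.items(), key=lambda kv: -kv[1]):
    print(group, round(total, 3))
```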
[16:50:52] this is the fourth one [16:51:21] the fifth one and afterwards are actually not really bad [16:52:01] informal ones are actually the fifth and the sixth [16:52:21] but they are more important by an order of magnitude [16:56:50] https://gist.github.com/Ladsgroup/ed92b16d3e67ef9cdd8a8f7e4d509682 [16:56:54] halfak: ^ [16:57:37] list.sort() <3 [16:58:07] Maybe also print("\n".join(str(v) for v in list)) [17:00:55] okay [17:01:01] wait a sec [17:06:01] halfak: https://gist.github.com/Ladsgroup/ed92b16d3e67ef9cdd8a8f7e4d509682 [17:16:47] I'm surprised at how useful dictionary is. [17:17:39] Yikes. Look at the power of is_anon and seconds since registration :( [17:34:48] so new users tend to cause more vandalism than older ones, maybe because they lack knowledge of the wiki rules [17:35:50] GhassanMas, indeed. this is the difference between "goodfaith" and "damage" [17:36:11] heh, now I really understand the difference! [17:36:26] A new user who is trying to contribute, but doesn't know the rules, will cause "damage", but not "vandalism" [17:36:34] "Vandalism" is purposeful :) [17:36:44] The opposite of "goodfaith" is "vandalism" [17:43:00] yeah, the thing about not knowing the rules was what I was missing :) [17:55:19] Got stuck working on yearly review things for WMF. [17:55:35] Will pick up performance stuff re. mwparserfromhell again in a little while. 
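[Editor's note] The ranking approach from the gist discussion above (pair each feature with its importance, `list.sort()`, then join-print) can be sketched like this. The names and values are invented for illustration; in practice the numbers would come from a fitted gradient boosting classifier's `feature_importances_`:

```python
# Made-up importances standing in for model.feature_importances_.
importances = [0.0001, 0.0004, 0.0008, 0.1200, 0.1500]
feature_names = [
    "revision.user.has_advanced_rights",
    "revision.page.is_draftspace",
    "revision.diff.badwords_added",
    "revision.user.is_anon",
    "revision.user.seconds_since_registration",
]

# Sort ascending so the least important features -- the candidates for
# removal -- come first.
ranked = sorted(zip(importances, feature_names))
print("\n".join(str(pair) for pair in ranked))
```

With the illustrative values above, `has_advanced_rights` ranks last in importance, matching the observation in the log that its signal is already covered by other user features.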
[17:56:13] 06Revision-Scoring-As-A-Service, 10wikilabels: [Investigate] Intermittent performance issues with wikilabels - https://phabricator.wikimedia.org/T130872#2265026 (10akosiaris) @Ladsgroup : Logs are in P3000 [17:57:29] 06Revision-Scoring-As-A-Service, 10wikilabels: [Investigate] Intermittent performance issues with wikilabels - https://phabricator.wikimedia.org/T130872#2265029 (10Ladsgroup) thanks :) [18:01:53] 10Revision-Scoring-As-A-Service-Backlog, 10bwds, 10revscoring: Language assets for Norwegian - https://phabricator.wikimedia.org/T131855#2265051 (10Ladsgroup) [18:47:19] 10Revision-Scoring-As-A-Service-Backlog: Translating the documentation page of Wikilabels to Arabic - https://phabricator.wikimedia.org/T134405#2265143 (10Ghassanmas) The main and important parts of the page, including the instructions to install the gadget, have been translated; we may still need to translate the... [18:47:42] need to go home [18:47:46] leaving [18:48:14] o/ [18:48:18] darn [20:06:18] halfak: I just made running bwds much easier for new languages [20:06:47] I need to run a test and after that we can do all languages at once [20:12:11] \o/ [20:12:14] That's awesome! [20:12:22] What do you think of moving it over to editquality? [20:28:55] halfak: that's not a bad idea [20:29:02] I made a phab card already [20:29:09] but I haven't moved it yet [22:26:39] 10Revision-Scoring-As-A-Service-Backlog, 07Documentation, 07I18n: Translating the documentation page of Wikilabels to Arabic - https://phabricator.wikimedia.org/T134405#2266035 (10Danny_B)