[00:06:07] Revision-Scoring-As-A-Service, revscoring: Implement abstraction for Sparse Feature Vectors - https://phabricator.wikimedia.org/T132580#2596409 (Halfak) I did some work yesterday in #paws but I kept running into memory issues that made extracting the hash delta tables impossible. See http://paws-public....
[13:38:55] Arg. Coming back to my JSON issues now
[13:38:59] * halfak grumbles
[13:41:41] Well... I guess I can convert dicts to lists of pairs
[13:42:07] E.g. {1: "foo", 2: "bar"} --> [[1, "foo"], [2, "bar"]]
[13:42:08] [1] https://meta.wikimedia.org/wiki/1%2C_%22foo%22
[13:42:17] Hey AsimovBot
[13:42:30] So, this shouldn't result in any sort of serious performance cost
[13:42:40] A dict needs to be rebuilt during serialization anyway.
[13:43:01] And nearly all of my operations end up iterating through the dict anyway.
[13:43:22] Still, I can't have my datasources assuming that all operations will involve this iteration pattern.
[13:44:22] I could switch to pickle, but that'd be gross.
[13:45:06] I could use python's serialization with repr/eval, but that's (1) not safe and (2) uncommon
[14:21:52] Revision-Scoring-As-A-Service, revscoring: Implement abstraction for Sparse Feature Vectors - https://phabricator.wikimedia.org/T132580#2598281 (Halfak) OK. So thinking about the serialization problem. I think that it might be easy to just use `pickle` or something like it. But if we do, then we'll lo...
[14:39:38] Revision-Scoring-As-A-Service, revscoring: Implement abstraction for Sparse Feature Vectors - https://phabricator.wikimedia.org/T132580#2598328 (Halfak) No good options. Let me describe the two I am seriously considering. # 1. Just use pickle, dill, msgpack, etc. This will make lines of observations...
[14:48:18] Amir1, I'm going crazy
[14:48:26] https://phabricator.wikimedia.org/T132580#2598281
[14:48:36] This will definitely need some review.
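[Editor's note] The "lists of pairs" idea from 13:41-13:42 can be sketched roughly as below. This is only an illustration, assuming plain `json` from the standard library; the helper names `dict_to_pairs` and `pairs_to_dict` are hypothetical and are not taken from revscoring.

```python
import json


def dict_to_pairs(d):
    """Hypothetical helper: turn a dict into a list of [key, value] pairs
    so that non-string keys (e.g. integer hashes) survive JSON encoding."""
    return [[key, value] for key, value in d.items()]


def pairs_to_dict(pairs):
    """Hypothetical helper: rebuild the dict from a list of [key, value] pairs."""
    return {key: value for key, value in pairs}


original = {1: "foo", 2: "bar"}

# Plain JSON objects only allow string keys, so the integer keys are coerced.
naive = json.loads(json.dumps(original))

# Encoding the dict as a list of pairs keeps the key types intact.
encoded = json.dumps(dict_to_pairs(original))
restored = pairs_to_dict(json.loads(encoded))

print(naive)     # keys are now strings
print(encoded)   # JSON array of [key, value] arrays
print(restored)  # integer keys are back
```

The cost is one extra pass over the items on each side, which matches the remark that the dict has to be rebuilt during serialization anyway.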
[14:48:40] * Amir1 calms down halfak
[14:50:00] I'm reading :)
[14:51:32] halfak: the cache part in https://phabricator.wikimedia.org/T132580#2598328
[14:51:45] do we want to return it in responses to user?
[14:51:57] (why we return cache?)
[14:52:33] Amir1, this is for training models and fitting feature selectors (part of vectorization)
[14:52:56] the cache is placed in "observations" that are used in training/testing
[14:53:54] hmm, It's more clear to me now
[14:54:17] I'm rewriting the "extract" utility so that it only populates the cache with requested values.
[14:54:39] Then those requested values will be directly accessed for all sorts of training/testing/fitting/etc
[14:54:58] halfak: regarding Pure JSON, if the dict already sends string as key, I guess it works just fine
[14:55:09] Amir1, indeed, but this is not guaranteed.
[14:55:10] but you need to convert keys to string before converting to json
[14:55:14] In fact it is undesirable.
[14:55:21] Amir1, but how to know when to convert back?
[14:55:54] E.g. is "1" a number that appeared in the text or is 1 a hash?
[14:56:14] I was thinking if it's possible to convert and use them as string everywhere
[14:56:34] so we don't need to convert back
[14:57:35] sys.getsizeof("1") --> 50, sys.getsizeof(1) --> 28
[14:58:21] Okay, now I'm convinced :D
[14:58:38] Even more so: sys.getsizeof("Longword") --> 57, sys.getsizeof(900065) --> 28
[14:58:42] :)
[14:58:56] 900065 is the 2^20 hash of "Longword"
[14:59:23] * halfak punches JSON in the face
[16:47:20] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, User-Ladsgroup: Redundant results in ORES review tool - https://phabricator.wikimedia.org/T144233#2598820 (Ladsgroup) I was able to reproduce it, I'm getting a sense why it's happening but I'm not sure how I can fix it.
[17:35:24] o/
[17:35:26] Just got back
[18:29:36] legoktm: hey, do you have a minute to check this?
https://gerrit.wikimedia.org/r/#/c/307624/
[18:29:36] IT WORKS
[18:29:51] (Referring to my work on the new caching pattern)
[18:32:02] And our feature extractor is about twice as fast
[18:32:09] Because we're using batching now ^_^
[18:32:48] \o/
[18:33:40] These should really be separate pull requests. I'll try to see if I can get that worked out.
[18:34:50] Amir1: lemme see
[18:40:05] (CR) Legoktm: [C: -1] Improve CheckModelVersions.php (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[18:42:09] (CR) Legoktm: [C: -1] "Per CR" [extensions/ORES] - https://gerrit.wikimedia.org/r/306316 (owner: Catrope)
[18:56:48] (CR) Ladsgroup: Improve CheckModelVersions.php (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[19:01:31] (CR) Legoktm: Improve CheckModelVersions.php (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[19:01:41] Amir1: is there a bug filed for having ORES purge varnish cache?
[19:01:52] (or, have varnish not cache that page)
[19:01:58] Good Q. And no
[19:02:07] legoktm: https://phabricator.wikimedia.org/T144193
[19:02:15] Oh... Yes :)
[19:02:25] halfak: You made it :D
[19:02:36] forgot
[19:02:39] Super scattered.
[19:02:40] heh
[19:02:51] for all ORES responses? or just some?
[19:03:11] all responses
[19:03:16] legoktm, for some. Most of the interesting ones
[19:03:16] :P
[19:03:28] The homepage and other html/js assets can be cached.
[19:03:41] we have redis cache, varnish one is basically redundant
[19:03:46] Anything that generates a score or model_info is no-cache
[19:07:25] ok, mind if I take a stab at working on that?
[19:07:42] legoktm, is done
[19:07:52] heh :P
[19:07:52] Thanks though! May I entice you into looking at something else?
[19:08:04] (PS2) Ladsgroup: Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195)
[19:08:26] sure
[19:08:42] * halfak digs for something that's in the same general class of thing
[19:09:57] is it just me or every one does "ssh vagrant" instead of "vagrant ssh" sometimes too?
[19:10:50] legoktm, if you want to look at an ORES thing, this might be fun. https://phabricator.wikimedia.org/T140364
[19:11:07] Otherwise, I'd try to get you to review the status of the WikiLabels extension
[19:11:15] I'd really like your input on next steps there
[19:11:50] what is WikiLabels?
[19:12:01] https://meta.wikimedia.org/wiki/Wiki_labels
[19:12:11] It's how we get data for training models for ORES
[19:13:39] and you want to productionize all of that?
[19:14:05] legoktm, if possible. In the meantime, make it easier to maintain.
[19:14:12] And easier for users to work with
[19:15:20] ok, so what's the current status? :P
[19:15:49] OK For the running system, the base code lives here
[19:15:52] https://github.com/wiki-ai/wikilabels
[19:16:08] This implements a service that hosts JS and API necessary for the gadget on-wiki to work
[19:16:23] This is our configuration for WMF labs https://github.com/wiki-ai/wikilabels-wmflabs-deploy
[19:16:49] You could load this JS in your global.js https://labels.wmflabs.org/gadget/WikiLabels.js
[19:17:04] But we copy it to https://meta.wikimedia.org/wiki/MediaWiki:Gadget-WikiLabels.js on deployments
[19:17:40] bmansurov_away started work on an extension. See https://www.mediawiki.org/wiki/Extension:WikiLabels
[19:18:56] (PS3) Ladsgroup: Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195)
[19:19:27] https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/WikiLabels heh
[19:20:31] Labels and tasks are arbitrary JSON blobs.
We store them in postgres DB now (native types for them), but we could store them as strings in MySQL since we don't usually query based on their values.
[19:23:31] how does this differ conceptually from things like change tagging?
[19:25:08] legoktm, not sure that it does
[19:25:08] I confirm that upsert works just fine
[19:25:24] legoktm, what is change tagging?
[19:25:42] https://en.wikipedia.org/wiki/Special:Tags those things
[19:25:42] I tested this patch on my lab rat (mw-revscoring.wmflabs.org) and worked
[19:25:52] legoktm, oh... well... in lots of ways
[19:26:03] legoktm: https://gerrit.wikimedia.org/r/#/c/307624/ for when you're done :D
[19:26:17] We have a notion of campaigns -- pre-sampled set of things (could be users, user-sessions, revisions, pages, etc.)
[19:26:41] Users work on randomly sampled subsets of a campaign called "worksets" -- this allows us to have statistical validity
[19:26:51] We can request N labels per item
[19:26:59] Labels (unlike tags) can be complex objects
[19:27:15] Which is sometimes necessary. Though many are boolean or categorical
[19:27:21] (CR) Legoktm: [C: -1] Improve CheckModelVersions.php (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[19:28:27] alright
[19:29:47] (CR) Ladsgroup: Improve CheckModelVersions.php (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[19:31:15] (PS4) Ladsgroup: Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195)
[19:32:22] Amir1: do you want to get rid of $dbr entirely to avoid confusion? you can use $dbw->addQuotes(), it's the same.
[19:33:28] okay
[19:33:29] one it
[19:33:31] *on
[19:34:59] (PS5) Ladsgroup: Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195)
[19:35:49] sorry, last nitpick :P still call the variable "$dbw"
[19:35:53] so it's obvious it's a master
[19:36:15] okay :D
[19:36:57] (PS6) Ladsgroup: Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195)
[19:38:12] Revision-Scoring-As-A-Service, revscoring: Update yamlconf so that import_path can handle deep attributes - https://phabricator.wikimedia.org/T144430#2599592 (Halfak)
[19:39:39] Revision-Scoring-As-A-Service, revscoring: Update yamlconf so that import_path can handle deep attributes - https://phabricator.wikimedia.org/T144430#2599608 (Halfak) https://github.com/halfak/yamlconf/compare/ac8aec1d223d9574b0c61f801232fb78d63d09bd...master
[19:39:48] Revision-Scoring-As-A-Service, revscoring: Update yamlconf so that import_path can handle deep attributes - https://phabricator.wikimedia.org/T144430#2599609 (Halfak) a:Halfak
[19:45:08] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, Schema-change, User-Ladsgroup: oresm_model index should not be unique - https://phabricator.wikimedia.org/T144432#2599649 (Ladsgroup)
[19:46:04] btw.
until https://phabricator.wikimedia.org/T144432 is not resolved we can't update any models that ores review tool usees
[19:46:07] *uses
[19:47:21] legoktm: https://gerrit.wikimedia.org/r/307624 :D
[19:47:37] Amir1, woops
[19:50:26] I'm happy we are finding all bugs in model update procedure
[19:50:40] I can sleep tonight
[19:54:50] (PS7) Legoktm: Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[19:55:03] (CR) Legoktm: [C: 2] Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[19:55:57] (Merged) jenkins-bot: Improve CheckModelVersions.php [extensions/ORES] - https://gerrit.wikimedia.org/r/307624 (https://phabricator.wikimedia.org/T144195) (owner: Ladsgroup)
[20:07:18] wiki-ai/revscoring#796 (feature_vector - 7b7ea17 : halfak): The build was broken. https://travis-ci.org/wiki-ai/revscoring/builds/156624575
[20:18:36] I <3 wheels.
[20:19:09] pip install ores was super fast
[20:20:56] \o/
[20:21:00] Agreed
[20:23:25] halfak: what's the best way to switch from a pip install to running out of a git clone so I can test my changes?
[20:24:10] pip uninstall ores && python setup.py worked
[20:24:18] legoktm, if you execute tests from the base of the git repo, it will use the content of the repo as the module
[20:24:27] you can download our wheels repo
[20:24:39] (also you can build your own wheels and keep it somewhere)
[20:24:47] If, for example, you made changes to ORES and wanted to test those, run ORES via "./utility"
[20:24:48] I don't want to run the tests, I want to make changes and run the server
[20:25:01] legoktm, "./utility applications.wsgi"
[20:25:04] ah
[20:25:06] ok
[20:25:08] :)
[20:27:07] legoktm: https://gerrit.wikimedia.org/r/307077
[20:27:22] I also have this one in core that I need for some bugs in ores review too
[20:28:37] uh, I'll look at that later? I just got into messing with ores :P
[20:29:36] I want to distract you from ores mwhaaaa
[20:29:45] don't worry, It can wait :)
[20:30:27] oh and when halfak is not around, you can ask questions from me (If I'm around)
[20:46:02] I'm about to run, but what are the guidelines on using external libraries? there are a few that I've found that do auto-swagger documentation, and if they fit our needs, would be nice to use them instead of writing our own
[20:51:43] (PS1) Ladsgroup: Not including results when oresm_is_current = 0 [extensions/ORES] - https://gerrit.wikimedia.org/r/307870 (https://phabricator.wikimedia.org/T144233)
[21:57:07] Working to speed up the hashing/gramming process right now
[22:03:05] It looks like we take about 0.5 seconds to fully gram & hash the most recent revision of [[:en:Biology]]
[22:03:05] [2] https://meta.wikimedia.org/wiki/:en:Biology
[22:17:13] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, Patch-For-Review, User-Ladsgroup: Redundant results in ORES review tool - https://phabricator.wikimedia.org/T144233#2592543 (Legoktm) Can you explain what was wrong and how you fixed it?
[23:01:14] hash deltas extracted
[23:01:21] I'm now training the tfidf selector :)
[23:12:01] OMG I have an 11k pickle file with my trained TFiDF selector!
[23:13:12] ooooooh
[23:13:33] Oh... So other than some massive performance issues, this works!
[23:13:36] VICTORY
[23:13:39] (For today)
[23:54:21] "only" detected as 82.64% likely to be bad
[23:54:30] https://ores.wikimedia.org/v2/scores/eswiki/reverted/93317612?features
[23:54:45] quite impressive given that it didn't detect anything special about the words themselves
[23:56:04] (text means "he raped all his maids")
[23:56:14] congrats