[05:25:53] ORES, Scoring-platform-team, Analytics, Dumps-Generation, and 3 others: Decide whether we will include raw features - https://phabricator.wikimedia.org/T211069 (ArielGlenn) >>! In T211069#5229037, @awight wrote: > Another production feature store framework we might learn from, > https://www.logic...
[07:55:22] PROBLEM - puppet on ORES-web01.Experimental is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 2 minutes ago with 19 failures. Failed resources (up to 3 shown): Package[ldap-utils],Package[libnss-ldap],Service[nscd],Service[nslcd]
[08:18:17] PROBLEM - puppet on ORES-web02.Experimental is CRITICAL: CRITICAL: Puppet has 26 failures. Last run 2 minutes ago with 26 failures. Failed resources (up to 3 shown): Package[ntp],Service[systemd-timesyncd],Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer]
[08:24:33] RECOVERY - puppet on ORES-web01.Experimental is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:38:21] PROBLEM - puppet on ORES-web02.Experimental is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 19 minutes ago with 27 failures. Failed resources (up to 3 shown): Service[ssh],Package[ntp],Service[systemd-timesyncd],Service[prometheus-node-exporter-ipmitool-sensor.timer]
[11:46:17] RECOVERY - puppet on ORES-web02.Experimental is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[13:38:30] o/
[13:56:20] Scoring-platform-team (Current), editquality-modeling, Chinese-Sites, artificial-intelligence: Train/test zhwiki editquality models - https://phabricator.wikimedia.org/T224481 (Shizhao)
[14:28:16] groceryheist, ping me when you are in
[14:28:25] I have some thoughts on building the revert models.
[14:48:45] PROBLEM - puppet on ORES-worker02.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:53:07] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[14:53:17] PROBLEM - puppet on ORES-web02.Experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:53:59] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/ORES
[14:58:57] halfak: ping
[14:59:23] Hey groceryheist!
[14:59:36] So I was wondering why you wanted to train the reverted models using data after ORES was deployed.
[15:00:08] Wouldn't it be more appropriate to train on revert dynamics *before* ORES is deployed?
[15:00:21] PROBLEM - puppet on ORES-web01.Experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:03:09] it's a little bit complicated
[15:03:23] PROBLEM - puppet on ORES-worker01.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:03:37] Indeed it is.
[15:04:34] in brief, we want to measure the difference (caused by ORES) in the difference in probability of being reverted between anons / non-anons
[15:04:42] hence the "difference in differences" model
[15:05:12] so we at least want to /predict/ the probability of being reverted before and after ORES
[15:05:43] if we only had a sample from after ORES, well, maybe it doesn't generalize to before ORES
[15:06:14] and it would also complicate the estimation process
[15:06:30] since we'd need some weird two-stage approach
[15:06:43] so it's much simpler from an econometric perspective if we have edits from before and after
[15:07:00] I had a thought yesterday I'd like to share with you
[15:07:22] it raises a relatively simple analysis that we might want to do before we proceed any further
[15:07:40] here it is:
[15:08:00] Aren't we using this model as a propensity baseline to control for the possibility that the vandalism rate changes?
[15:08:08] ok
[15:08:14] i'll hold that thought
[15:08:31] so the propensity score part is different
[15:08:45] I don't see a clear explanation for why we would train on data after ORES deployment for such a propensity model.
[15:08:52] that's only about trying to correct for a non-random treatment assignment
[15:08:59] Right.
[15:09:06] Isn't that our primary goal with this reverted model?
[15:09:18] I'll try again more formally
[15:09:23] we want to estimate
[15:10:35] [P(reverted | newcomer, ores, ...) - P(reverted | ~newcomer, ores, ...)] - [P(reverted | newcomer, ~ores, ...) - P(reverted | ~newcomer, ~ores, ...)]
[15:11:23] so we could assume that ores is independent of everything
[15:11:34] in which case we can do as you suggest
[15:12:35] but if we don't assume that, then there isn't any way to get ~ores out of the conditions on the right-hand side of that difference
[15:13:27] if we have a logistic regression model like:
[15:13:55] reverted ~ newcomer + ores + newcomer&ores + ...
[15:14:24] then that difference is just the coefficient for newcomer&ores
[15:15:40] there are still some problems with that approach, which we could discuss, but I'll stop there for now and go make coffee
[15:19:31] * halfak is thinking it through
[15:19:35] sorry, that coefficient is the difference on the log-odds scale; exp(B_newcomer&ores) gives the odds ratio, not the difference in probability.
[15:20:32] This doesn't quite answer the question in a direct way.
[15:21:15] You say, "if we had a logistic regression model like..." but I was expecting something like, "If we train on data before and after ORES is deployed"
[15:21:27] But I think I get it, because you want to have ORES as a predictor.
[15:21:37] ORES is going to be a messy variable.
[15:22:05] Because there's the ORES service deployment that gets picked up by various tools. Then there's the RCFilters deployment that picks up the ORES service as well.
[15:22:21] Then, independent of that, there's people using ORES-powered tools.
[15:22:34] They might have been using those tools beforehand without ORES support.
[15:23:34] i see
[15:23:47] in practice I'm using "ores" to mean "ORES-powered RCFilters"
[15:23:54] those are the dates I'm using
[15:24:06] yeah, it's because ores is a predictor
[15:24:18] you can't have that without data from before then
[15:24:48] i'm mainly thinking about RCFilters
[15:26:25] groceryheist, I think I might have not been clear. I was originally wondering why we wanted to train on data *after* ORES deployment.
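(Editor's note: a minimal sketch of the model groceryheist writes out above, using statsmodels' R-style formula API. The input file and the column names `reverted`, `newcomer`, and `ores` — all 0/1, with `ores` marking edits made after the ORES-powered RCFilters deployment — are hypothetical stand-ins, not the team's actual schema.)

```python
# Hypothetical sketch: fit the difference-in-differences logistic
# regression "reverted ~ newcomer + ores + newcomer:ores" from the chat.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

edits = pd.read_csv("edits.csv")  # placeholder input: one row per edit

# newcomer:ores is the interaction term; its coefficient is the
# difference-in-differences estimand on the log-odds scale.
result = smf.logit("reverted ~ newcomer + ores + newcomer:ores",
                   data=edits).fit()
print(result.summary())

# Exponentiating the interaction coefficient gives a ratio of odds
# ratios, not a difference in probabilities.
print(np.exp(result.params["newcomer:ores"]))
```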
[15:26:35] It's clear to me why we would train on data *before* ORES deployment.
[15:27:14] oh i see
[15:27:24] we want before and after
[15:27:28] But it seems like the answer to my question is: we would like to use the beta deployment of ORES-enabled RCFilters in order to measure the effect.
[15:27:47] yes
[15:27:47] So training on data from when ORES was present and when it was not would allow us to learn the effect of ORES as part of the model.
[15:27:54] yes
[15:28:07] that's right
[15:28:10] Thus we won't be using it to estimate propensity independently of our target model.
[15:28:18] It will be part of the target model.
[15:28:24] yeah, so earlier I wasn't sure if this would be the right way to do it
[15:28:44] Which addresses my concerns about using a logistic model. We want to inspect the coefficients in order to learn about the direction and significance of the effect.
[15:28:45] or if we would estimate P(reverted | ...) and then compare the predictions of that model
[15:28:55] The logistic model fits very poorly compared to an RF or GB.
[15:29:21] so I don't really care so much about predictive performance, but rather about having unbiased coefficients
[15:29:42] RF and GB do not give much thought to bias in the bias/variance trade-off
[15:30:05] it's fine if the model doesn't predict well, as long as the coefficient we care about is well estimated
[15:30:28] which is going to be a big assumption
[15:30:29] ^ Wouldn't be true if we were using it to model propensity.
[15:30:54] won't the errors be correlated with the predictors?
[15:31:07] Going to update current work then will join stand-up
[16:04:42] ORES, Scoring-platform-team (Current), editquality-modeling, revscoring, and 2 others: ORES deployment: Early June - https://phabricator.wikimedia.org/T224484 (Halfak) a: Halfak
[16:06:44] ORES, Scoring-platform-team (Current), editquality-modeling, Epic, artificial-intelligence: ORES bias analysis - https://phabricator.wikimedia.org/T224901 (Halfak)
[16:06:54] ORES, Scoring-platform-team (Current), editquality-modeling, Epic, artificial-intelligence: ORES bias analysis - https://phabricator.wikimedia.org/T224901 (Halfak) a: Groceryheist
[16:08:03] ORES, Scoring-platform-team (Current), editquality-modeling, artificial-intelligence: Fit models for revert prediction - https://phabricator.wikimedia.org/T224902 (Halfak)
[16:08:52] ORES, Scoring-platform-team (Current), editquality-modeling, Epic, artificial-intelligence: ORES bias analysis - https://phabricator.wikimedia.org/T224901 (Halfak) From @Groceryheist: Look for threshold effects. It is more "simple and sensitive" compared to doing the DID analysis.
[16:10:23] halfak: does errors being correlated with predictors make sense as a rationale for logistic regression compared to RF or GB?
[16:14:21] Yeah. I think that the interpretability of a logistic regression is a critical distinction.
[16:15:15] I was previously under the impression that we would use the probability estimate of the "revert" model as a propensity estimator in a logistic regression.
[16:25:19] sorry about that confusion, I probably said something to that effect earlier on
[16:26:04] i'm going to focus today on creating nice visualizations for each wiki that should show how scores influence (or don't) whether edits are reverted
[16:40:59] Cool. I look forward to that :) It's gonna be hard to visualize, I suspect, because of the noise. I'm working on something similar now.
[16:41:09] I wonder if a timeseries model might help with the noise.
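(Editor's note: tying back to the exchange above about inspecting coefficients — since the interaction coefficient lives on the log-odds scale, one common way to recover the probability-scale difference in differences is to average predicted probabilities over the four newcomer × ores cells. A hypothetical sketch, reusing the `result` and `edits` objects from the previous snippet; this is one standard translation, not necessarily the approach the team settled on.)

```python
# Hypothetical sketch: translate the fitted log-odds model back to the
# probability scale by predicting each edit's revert probability under
# all four counterfactual newcomer/ores settings, then differencing.
def did_in_probabilities(result, edits):
    cell = {}
    for newcomer in (0, 1):
        for ores in (0, 1):
            counterfactual = edits.assign(newcomer=newcomer, ores=ores)
            cell[newcomer, ores] = result.predict(counterfactual).mean()
    # [P(rev|new,ores) - P(rev|~new,ores)]
    #   - [P(rev|new,~ores) - P(rev|~new,~ores)]
    return (cell[1, 1] - cell[0, 1]) - (cell[1, 0] - cell[0, 0])

print(did_in_probabilities(result, edits))
```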
[16:41:15] Vandalism seems to have a weekly period :)
[16:44:53] I have an even simpler idea that doesn't require accounting for time
[16:45:25] y-axis is probability of being reverted
[16:45:30] x-axis is ORES score
[16:45:36] looking for jumps at the thresholds
[16:45:56] post-ORES there should be jumps
[16:46:02] pre-ORES there should not be
[16:46:08] again, by "ores" i mean RCFilters
[16:46:17] RECOVERY - puppet on ORES-web02.Experimental is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[16:46:22] still need to flip that bit in my brain
[16:46:45] RECOVERY - puppet on ORES-worker02.experimental is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:53:22] RECOVERY - puppet on ORES-web01.Experimental is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[16:56:23] RECOVERY - puppet on ORES-worker01.experimental is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[17:04:05] groceryheist, we might find the jumps are delayed. How will we know?
[17:04:33] * halfak curses both puppet and icinga2
[17:04:47] There's got to be a better way to deal with this flood of warnings!
[17:05:49] how about I just look at a sample of edits from several months after the change?
[17:24:16] Right on. I think that makes sense.
[17:24:33] Might want to take a couple of samples if it's tractable.
[17:24:45] yeah
[17:24:57] I'll work up to taking a sample for each week
[17:24:59] I bet on some wikis there's some initial excitement that dwindles, and on others there's a slow ramp-up. And finally, in others, no one cares and so ORES is unused.
[17:25:09] so we can see if it takes place over time
[17:25:15] right
[17:25:26] which might explain why average effects are hard to observe
[17:25:27] From what I heard from Dutch Wikipedians, people really don't even realize ORES is available!
[17:25:54] They've got an open proposal right now to turn off Mobile editing entirely because of "too much vandalism" :(
[17:26:06] yeah, so maybe there needs to be something of a promotional campaign
[17:26:12] :(
[17:26:21] So I showed some of the patrollers that ORES makes that easier. Hopefully it will help, but it certainly showed me something about the need for better communication around our deployments.
[17:26:31] hmm
[17:26:32] Right.
[17:26:47] I'm heading to lunch
[17:26:50] Back in ~ an hour
[17:27:01] which is a little surprising, since I discovered these just through my watchlist
[17:27:08] later
[17:28:03] ORES, Scoring-platform-team, Analytics, Patch-For-Review, Services (watching): Wire ORES recent_score events into Hadoop - https://phabricator.wikimedia.org/T209732 (Ottomata) There is a lot of talk about building a generic 'ML Pipeline' over the next year or two. This would likely include a...
[17:29:20] ORES, Scoring-platform-team, Analytics, Dumps-Generation, and 3 others: Decide whether we will include raw features - https://phabricator.wikimedia.org/T211069 (Ottomata) Hopsworks looks really awesome; at least its claims do!
[17:40:42] ORES, Scoring-platform-team, Growth-Team, MediaWiki-extensions-WikibaseClient, and 4 others: ORES/ChangesListHooksHandlerTest causing build failures in other repos (e.g. UploadWizard) - https://phabricator.wikimedia.org/T224672 (kostajh) Resolved→Open > AFAICT this is now Resolved. I wa...
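(Editor's note: the threshold plot groceryheist describes above — revert rate on the y-axis, ORES score on the x-axis, looking for jumps at the RCFilters thresholds pre vs. post deployment — might look like the sketch below. The input file, column names, deployment date, and threshold value are all placeholders; as the chat notes later, the real per-wiki historical thresholds only live in gerrit.)

```python
# Hypothetical sketch: bin edits by ORES damaging score, plot the
# empirical revert rate per bin before and after the RCFilters
# deployment, and mark a threshold where a post-deployment jump would
# appear. Assumed columns: score (0..1), reverted (0/1), timestamp.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

THRESHOLD = 0.62                            # placeholder threshold value
deploy_date = pd.Timestamp("2017-05-04")    # placeholder deployment date

edits = pd.read_csv("scored_edits.csv", parse_dates=["timestamp"])
edits_pre = edits[edits["timestamp"] < deploy_date]
edits_post = edits[edits["timestamp"] >= deploy_date]

def plot_revert_rate(sample, label):
    bins = np.linspace(0.0, 1.0, 41)
    rate = sample.groupby(pd.cut(sample["score"], bins),
                          observed=True)["reverted"].mean()
    centers = [interval.mid for interval in rate.index]
    plt.plot(centers, rate.values, marker=".", label=label)

plot_revert_rate(edits_pre, "pre-RCFilters")
plot_revert_rate(edits_post, "post-RCFilters")
plt.axvline(THRESHOLD, linestyle="--", color="gray")  # look for a jump here
plt.xlabel("ORES damaging score")
plt.ylabel("P(reverted)")
plt.legend()
plt.show()
```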
[17:54:11] ORES, Scoring-platform-team (Current), editquality-modeling, artificial-intelligence: Visualize the relationship between the probability of reversion and ores scores - https://phabricator.wikimedia.org/T224918 (Groceryheist)
[17:55:34] ORES, Scoring-platform-team (Current), editquality-modeling, Epic, artificial-intelligence: ORES bias analysis - https://phabricator.wikimedia.org/T224901 (Groceryheist) I created a task T224918 for that analysis.
[17:58:12] ORES, Scoring-platform-team (Current), editquality-modeling, artificial-intelligence: Fit models for revert prediction - https://phabricator.wikimedia.org/T224902 (Groceryheist)
[18:00:02] (PS1) Kosta Harlan: (WIP): Re-enable test [extensions/ORES] - https://gerrit.wikimedia.org/r/514086 (https://phabricator.wikimedia.org/T224672)
[18:50:50] harej, we should do one of these with the Growth team around RCFilters: https://en.wikipedia.org/wiki/Responsibility_assignment_matrix
[18:57:07] https://www.projectsmart.co.uk/how-to-do-raci-charting-and-analysis.php
[21:35:43] we don't have historical values for the thresholds anywhere but gerrit?
[23:16:40] it would be nice to have a central table with the dates each model version was deployed
[23:16:48] should I create a task to make one?
[23:38:29] crap, I can't find this database: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/ORES/Recent_scores
[23:41:36] it's slightly inconvenient