[00:06:03] ORES, Scoring-platform-team (Current): Non-root features no longer being injected. - https://phabricator.wikimedia.org/T222121 (Halfak) Open→Resolved
[11:42:37] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, User-Sebastian_Berlin-WMSE, and 2 others: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 (Gilles) I'm swamped with my main work on the Performance team, so this has been on the backburner, so...
[12:45:36] PROBLEM - puppet on ORES-redis02.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:46:41] PROBLEM - puppet on ORES-worker02.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:49:14] PROBLEM - puppet on ORES-web02.Experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:53:06] ^ cloud related
[12:54:59] PROBLEM - puppet on ORES-web01.Experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:57:59] PROBLEM - puppet on ORES-worker01.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:00:25] ACKNOWLEDGEMENT - puppet on ORES-worker01.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Zppix -cloud related
[13:00:34] ACKNOWLEDGEMENT - puppet on ORES-web01.Experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Zppix -cloud related
[13:01:01] ACKNOWLEDGEMENT - puppet on ORES-web02.Experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Zppix -cloud related
[13:01:13] ACKNOWLEDGEMENT - puppet on ORES-worker02.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Zppix -cloud related
[13:01:26] ACKNOWLEDGEMENT - puppet on ORES-redis02.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Zppix -cloud related
[13:01:45] (spam done)
[13:22:59] RECOVERY - puppet on ORES-web01.Experimental is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[13:27:59] RECOVERY - puppet on ORES-worker01.experimental is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:32:35] o/
[13:32:42] halfak: o/
[13:43:36] RECOVERY - puppet on ORES-redis02.experimental is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:44:41] RECOVERY - puppet on ORES-worker02.experimental is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[13:47:14] RECOVERY - puppet on ORES-web02.Experimental is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:54:55] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, User-Sebastian_Berlin-WMSE, and 2 others: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 (Halfak) No worries. I can work from this. Do you have any datasets extracted that I could work from...
[13:56:40] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, User-Sebastian_Berlin-WMSE, and 2 others: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 (Gilles) You can find the data in /home/gilles/articlequality/datasets on stat1007
[14:00:15] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, User-Sebastian_Berlin-WMSE, and 2 others: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 (Halfak) Excellent! Thank you.
[14:05:29] Scoring-platform-team, Serbian-Sites: Investigate srwiki goodfaith model, why is it so bad? - https://phabricator.wikimedia.org/T199355 (Zoranzoki21) >>! In T199355#5143550, @Halfak wrote: > An etherpad is directly editable. You should be able to just type into it. Yes, thanks! >>! In T199355#5143703...
[14:34:47] wikimedia/articlequality#161 (svwiki_init - 696b537 : halfak): The build passed. https://travis-ci.org/wikimedia/articlequality/builds/526466674
[15:35:07] so yesterday I got Wikidata ids from mariadb and looked up gender from Wikidata,
[15:36:56] joal says I might be able to do it all on Spark
[15:37:17] so I'm gonna take a look at that approach instead and then see how well that does
[15:37:49] I'm also doing a back-of-the-envelope calculation about how many labels (if any) I want.
[15:38:39] for the figure-8 idea
[15:44:45] Async time. Yesterday and today and Wednesday is job interview stuff. Consolidation of my workflows is now done. Normal work resumes.
[16:12:08] ORES, Scoring-platform-team, Operations: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (thcipriani)
[16:12:51] ORES, Scoring-platform-team, Operations: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (thcipriani)
[16:14:22] ORES, Scoring-platform-team, Operations, Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (thcipriani)
[16:16:11] ORES, Scoring-platform-team, Operations, Release Pipeline, and 2 others: Execution of the deployment pipeline should be configurable via .pipeline/config.yaml - https://phabricator.wikimedia.org/T210267 (thcipriani)
[16:20:53] Hey folks. Just got out of the tech management meeting. Yesterday I did hiring stuff and digging into srwiki weirdness. Today I'll be working on annual planning, hiring, srwiki stuff and the svwiki article quality model.
[16:21:46] groceryheist, let me know how many labels you might need.
[16:33:47] halfak last night I thought that 45,000 is probably enough, but we may need up to 225,000
[16:33:54] depending on how much statistical power we need
[16:34:01] Woah. That's insane.
[16:34:04] it might be in that range or somewhat less
[16:34:09] Many of our wikis don't even have that many edits in a year!
[16:34:09] yeah
[16:34:16] Why is it so high?
[16:35:31] 75 wikis, 3 time points, 4 strata (newcomers, anons, gender, geolocation), 50 samples for each
[16:36:01] 75 wikis is based on having 25 wikis in the treatment group and 50 in the "control" group
[16:36:35] 50 samples for each stratum seems pretty ambitious
[16:38:53] this is for the design that lets us make relatively strong causal claims
[16:39:40] without doing any auxiliary modeling
[16:40:27] halfak: new team member hype?
[16:40:35] but yeah we'll only need 600 edits per wiki or so
[16:40:57] groceryheist, maybe we can target a subset of wikis with this work. What do you think?
[16:41:13] I don't think we have RC Filters for 75 wikis.
[16:41:16] Probably more like 30
[16:41:34] this is for a DID-style design, so there's a comparison group
[16:43:33] I count 26 wikis with it enabled
[16:43:38] with rcfilters
[16:44:12] I'm guessing we build a synthetic control group with about twice that many
[16:44:27] and that should give us enough statistical power
[16:45:14] the ways around this are to measure bias using a counter-factual model (increases study complexity, and requires additional assumptions)
[16:46:41] we could also cut some outcomes (i.e. just focus on gender and newcomers)
[16:46:49] (or geoloc and newcomers)
[16:48:53] Another option is to use the labels we already have to build well-calibrated (or well-balanced) (but less accurate) "damaging" models. We can then use these models to measure fairness over time in a within-wikis design (like an interrupted time series)
[16:49:35] but if we do the counterfactual model or the within-wikis design, then we don't need new labels
[16:50:31] since our conversation yesterday I think I prefer to get the new labels, I think it would provide comparable evidence, and reduce complexity.
[16:50:50] evidence of a comparable quality (trading one acceptable limitation for another).
[16:53:40] if it takes 10 seconds to label an edit and we pay 6$/hour then 45000 labels is a 7.5k buy.
[16:54:40] iios
[16:55:00] sorry that 7.5k number might be typo'd
[16:56:02] 750$
[16:56:52] 10 seconds/edit * 45000 edits / 60 sec/min / 60 min/hour * 6$/hour
[16:57:32] Why do we need labels for wikis where we haven't deployed? I thought we'd be using the labels to control exogenous changes.
[16:58:17] we want to compare what happens on the wikis where we deployed to what happens on the wikis where we don't deploy
[16:59:46] https://en.wikipedia.org/wiki/Difference_in_differences
[17:00:42] Right. I was thinking in ITS.
[17:00:45] ^ this is the basic idea, but with a third data point prior to treatment we can do better
[17:00:49] gotchya
[17:01:11] so if we do ITS then I don't think we need new labels, we can just use the labels we already have, but with a better model
[17:01:11] "third data point"?
[17:01:27] DID carries a "parallel trends assumption"
[17:01:40] with two data points prior to treatment you can relax it
[17:01:45] o/ Zppix. Just saw your Q. Yeah groceryheist is working with us for the next couple of months to study the effects of ORES.
[17:02:14] halfak: dont forget how to destroy everything 101 its the course i teach its part of onboarding xD
[17:02:22] Yeah. Sorry I don't see what you mean by "two data points". Do you mean "two samples"?
[17:02:40] Zppix, also making icinga2 stop yelling at us.
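The label-count and cost figures quoted above (45,000 labels, 600 edits per wiki, 750$) can be checked with a short sketch. All inputs are the numbers stated in the chat; the variable names are made up for illustration, and the 225,000 upper bound is not derived here because the chat does not say how it was obtained.

```python
# Back-of-the-envelope check of the figures quoted in the chat.
# Variable names are illustrative; the values come from the messages above.
WIKIS = 75               # 25 treatment wikis + 50 comparison wikis
TIME_POINTS = 3          # samples taken at three points in time
STRATA = 4               # newcomers, anons, gender, geolocation
SAMPLES_PER_CELL = 50    # labels per wiki, time point and stratum
SECONDS_PER_LABEL = 10   # assumed labeling speed
PAY_PER_HOUR = 6         # assumed pay rate in dollars

labels_per_wiki = TIME_POINTS * STRATA * SAMPLES_PER_CELL
total_labels = WIKIS * labels_per_wiki
labeling_hours = total_labels * SECONDS_PER_LABEL / 60 / 60
cost = labeling_hours * PAY_PER_HOUR

print(f"labels per wiki: {labels_per_wiki}")
print(f"total labels:    {total_labels}")
print(f"labeling hours:  {labeling_hours}")
print(f"cost in dollars: {cost}")
```

This reproduces 600 labels per wiki, 45,000 labels overall, and 125 hours of labeling, which at 6$/hour comes to 750$ rather than 7.5k.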
[17:03:27] So DID measures two samples (treatment group and control group) at two points in time (before and after treatment)
[17:03:31] halfak: most of the alerts are -cloud related I think what needs to happen is better communication with cloud services when they deploy changes that could break things
[17:04:16] I talked to AndrewBogott about one of them. I think we can fix it with a longer timeout. Any interest in doing that work? I'm referring to the puppet catalog fail.
[17:04:20] Zppix, ^
[17:04:26] If you have three points in time (2 before treatment, one after treatment) then you don't have to assume parallel trends, you just have to assume that the trend is linear and you can model it using the two points before the treatment.
[17:04:48] halfak: yeah thats easy to do i can have it deployed in about an hour
[17:05:10] Zppix, awesome. I think 5 minutes or more of failure would be long enough to wait before alerting.
[17:05:19] halfak: alright
[17:06:08] above i said "we can just use the labels we already have, but with a better model" I just mean a "bias-corrected" (but less accurate) model.
[17:06:26] Right. This makes the results more complicated.
[17:06:34] But requires less manual labor.
[17:06:47] The manual labor is a risk because there's a lot that could go wrong (funding, bad data, etc.)
[17:06:57] okay
[17:07:41] I think if we take an approach that requires more modeling, then we have two options (ITS using the labels we already have, vs the counter-factual approach)
[17:09:02] getting new manual labels is a risk, but we don't have to do it.
[17:12:08] Right. OK. So here's my proposal. The person I need to work with (head of product strategy) to get labels is OOO until late next week. Let's proceed as though we are going to do ITS or counter-factuals until then and I'll check then what it will take to continue with labels.
[17:12:10] Sound OK?
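The difference between the two designs discussed above can be sketched in a few lines. This is only one reading of the three-time-point idea, assuming equally spaced measurements and group-level outcome means; the function names and all numbers are hypothetical and are not data from any wiki.

```python
# Sketch of the two estimators discussed above, using made-up group means.
# Function names and numbers are hypothetical, for illustration only.

def did_two_period(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Classic DID: change in the treated group minus change in the
    comparison group. Relies on the parallel trends assumption."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

def did_linear_trend(treat, ctrl):
    """DID with two pre-treatment points and one post-treatment point.

    Each argument is (y0, y1, y2): outcome means at two equally spaced
    pre-treatment times and one post-treatment time. Each group's own
    linear pre-trend is extrapolated to the post period, so the groups
    may trend differently as long as each trend is linear."""
    def deviation_from_trend(points):
        y0, y1, y2 = points
        expected_post = y1 + (y1 - y0)  # linear extrapolation of the pre-trend
        return y2 - expected_post
    return deviation_from_trend(treat) - deviation_from_trend(ctrl)

# Hypothetical outcome means (for example, reverts per 1000 edits).
treated = (10, 12, 20)     # already rising by 2 per period before treatment
comparison = (10, 11, 12)  # rising by 1 per period throughout

two_period_estimate = did_two_period(treated[1], treated[2],
                                     comparison[1], comparison[2])
linear_trend_estimate = did_linear_trend(treated, comparison)
print(two_period_estimate, linear_trend_estimate)
```

With these made-up numbers the two-period estimate is 7 and the trend-adjusted estimate is 6. The gap is the part of the treated group's change that its steeper pre-treatment trend already accounts for.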
[17:12:13] groceryheist, ^
[17:12:37] halfak: sounds great
[17:12:55] fwiw we only need this for the bias outcome
[17:13:15] for the other outcomes (broadening participation, time to revert, macro-level changes) we don't
[17:13:29] and we can do a really strong DID++ design
[17:13:45] so I can just focus on that until we find out if we can use labels
[17:14:54] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, User-Sebastian_Berlin-WMSE, and 2 others: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 (Halfak) https://github.com/wikimedia/articlequality/pull/82 OK we have a model. Fitness isn't reall...
[17:15:04] groceryheist, sounds great
[17:15:08] I'm off to lunch then
[18:34:12] Jade, Scoring-platform-team (Current), Cleanup, User-Ladsgroup: Archive "JADE" extension repository - https://phabricator.wikimedia.org/T221437 (thcipriani) Resolved→Open This is not done until it has been removed from make-wmf-branch. I don't know the status of JADE so I don't know what...
[18:34:14] Jade, Scoring-platform-team (Current), MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review: Rename "JADE" extension to "Jade" - https://phabricator.wikimedia.org/T211046 (thcipriani)
[20:00:29] got some help from joal to get wikidata info in spark
[20:00:36] Nice
[20:00:42] now I'm gonna see how well it works
[20:00:49] kaylea, FYI I missed the deployment window today for ORES, so that bug fix is likely to go out tomorrow.
[20:36:57] hey halfak CDSC want me to write a version of that memo for the CDSC blog. cool?
[20:37:15] CDSC?
[20:37:21] Memo?
[20:38:03] community data science collective
[20:38:23] memo: https://meta.wikimedia.org/wiki/Research:Exploring_systematic_bias_in_ORES/Calibration_and_balance_for_newcomers_and_anons#Damaging_models
[20:43:04] Gotcha. I think that's very cool. Rock on :)
[20:47:03] halfak: what bit of ores- or jade-related work do you think would help you the soonest?
[20:47:36] Jade design stuff most likely. Alternatively, if that is blocked, Jade pitch-deck.
[20:48:21] 👍
[20:48:39] huh, IRCCloud has an emoji selector now. it's so subtle I don't notice it
[20:49:14] 🚀
[20:50:32] Scoring-platform-team, Serbian-Sites: Investigate srwiki goodfaith model, why is it so bad? - https://phabricator.wikimedia.org/T199355 (Halfak) OK it's clear that we would benefit from re-labeling these 500 revisions using Wiki labels. I'm working to get a campaign loaded. I'd like to call it somethin...
[20:57:22] Scoring-platform-team, Serbian-Sites: Investigate srwiki goodfaith model, why is it so bad? - https://phabricator.wikimedia.org/T199355 (Halfak) In the meantime, I added the campaign here: https://labels.wmflabs.org/ui/srwiki/ Please pick up these edits and re-label them as we were doing in the etherpad...
[20:57:49] Scoring-platform-team (Current), Wikilabels, editquality-modeling, Serbian-Sites, artificial-intelligence: Investigate srwiki goodfaith model, why is it so bad? - https://phabricator.wikimedia.org/T199355 (Halfak) a:Halfak
[20:58:22] I want an emoji selector. My usual pattern is googling for "unicode "
[20:58:34] 🚲
[21:06:13] MediaWiki-extensions-ORES, ORES, Scoring-platform-team, Growth-Team: Non-overlapping threshholds in ORESModels on lvwiki - https://phabricator.wikimedia.org/T221871 (Harej) p:Triage→Normal
[21:06:42] MediaWiki-extensions-ORES, ORES, Scoring-platform-team, Growth-Team: Why are there three Q-marks (???) in threshholds in Special:ORESModels? - https://phabricator.wikimedia.org/T221870 (Harej) p:Triage→Normal
[21:08:06] halfak: I triaged the untriaged. Feel free to move tasks into "Ready to Go"
[21:31:20] Will do. Thanks.
[21:31:26] Just pulled a couple onto the main board.
[21:47:00] (CR) jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ORES] - https://gerrit.wikimedia.org/r/507471 (owner: L10n-bot)
[21:59:36] Scoring-platform-team (Current), Wikilabels, editquality-modeling, Serbian-Sites, artificial-intelligence: Investigate srwiki goodfaith model, why is it so bad? - https://phabricator.wikimedia.org/T199355 (Zoranzoki21) >>! In T199355#5148701, @Halfak wrote: > OK it's clear that we would benef...