[02:03:29] (PS2) Legoktm: build: Updating mediawiki/mediawiki-codesniffer to 0.10.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/360227
[02:03:31] (CR) jerkins-bot: [V: -1] build: Updating mediawiki/mediawiki-codesniffer to 0.10.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/360227 (owner: Legoktm)
[02:12:58] (PS3) Legoktm: build: Updating mediawiki/mediawiki-codesniffer to 0.10.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/360227
[02:29:45] (PS1) Legoktm: Remove redundant or useless PHPCS rules [extensions/ORES] - https://gerrit.wikimedia.org/r/363516
[02:59:29] Scoring-platform-team-Backlog: Use multithreading in test_model - https://phabricator.wikimedia.org/T169843#3410274 (awight)
[03:37:54] Scoring-platform-team-Backlog: Investigate whether Python regex is faster when non-capturing, use consistently. - https://phabricator.wikimedia.org/T169846#3410323 (awight)
[03:38:03] Scoring-platform-team-Backlog: Investigate whether Python regex is faster when non-capturing, use consistently. - https://phabricator.wikimedia.org/T169846#3410335 (awight) p:Triage>Low
[03:38:49] Scoring-platform-team-Backlog: Investigate whether Python regex is faster when non-capturing, use consistently. - https://phabricator.wikimedia.org/T169846#3410323 (awight)
[03:47:06] Scoring-platform-team-Backlog: Use multithreading in test_model - https://phabricator.wikimedia.org/T169843#3410339 (awight) I don't understand this output from htop... there are 14 lightweight test_model threads, which don't seem to take up any extra memory, nor cpu time. ``` Mem[||||||||||||||||...
[04:02:34] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3382708 (awight) The gist looks perfectly "right", like you did it wrong in the right way. I only have really fringe gues...
[04:32:36] Scoring-platform-team-Backlog: Use multithreading in test_model - https://phabricator.wikimedia.org/T169843#3410396 (awight) I took a look at this to see if it'd be easy, and I found that `test()` is built into `Model`. Unfortunately, that means that it doesn't benefit from the worldly threading stuff in `S...
[04:33:36] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, Wikidata, Wikimedia-log-errors: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" - https://phabricator.wikimedia.org/T169429#3410401 (Krinkle)
[04:33:53] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, Wikimedia-log-errors: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" - https://phabricator.wikimedia.org/T169429#3398107 (Krinkle)
[04:35:03] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, Wikimedia-log-errors: Fatal DBQueryError "Read timeout is reached" exception when using ORES on Special:Contributions - https://phabricator.wikimedia.org/T169429#3398107 (Krinkle)
[04:37:56] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, Wikimedia-log-errors: Fatal DBQueryError "Read timeout is reached" exception when using ORES on Special:Contributions - https://phabricator.wikimedia.org/T169429#3398107 (awight) Possible duplicate of T168096?
[05:40:28] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3410475 (awight) s/unknown hero/akosiaris/ -- Sorry I didn't read the incident report first; and thank you! P5633 was al...
[11:28:36] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3382708 (akosiaris) >>! In T168965#3409787, @Halfak wrote: > https://gist.github.com/halfak/b31b8ddc38ca701c4c964478a53da7...
[11:55:26] Scoring-platform-team, Wikilabels: [Discuss] Wikilabels routes refactor - https://phabricator.wikimedia.org/T165046#3411237 (Pginer-WMF) >>! In T165046#3383143, @Halfak wrote: > The quick answer is **both**! There's a UX theoretical reason why people like doing work in chunks that take less than 5 minut...
[12:30:11] Scoring-platform-team-Backlog, ORES, Patch-For-Review: Implement twemproxy for ORES in production - https://phabricator.wikimedia.org/T122676#3411277 (akosiaris) >>! In T122676#3407572, @Halfak wrote: > Ok. This *looks* simple. > > Is tewmproxy known to be running on "localhost" and in a good state...
[13:02:01] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, User-Ladsgroup, Wikimedia-log-errors: Fatal DBQueryError "Read timeout is reached" exception when using ORES on Special:Contributions - https://phabricator.wikimedia.org/T169429#3411332 (Ladsgroup) a:Ladsgroup
[13:23:46] o/
[13:24:54] Saw a heroic effort to get fajne connected to one of the ORES hosts in the -cloud channel when I logged in this morning.
[13:49:34] halfak: got 5 mins ?
[13:49:39] Yup.
[13:49:41] What's up?
[13:49:42] I 've migrated codfw to use nutcracker
[13:49:50] most things seem to be ok
[13:50:07] we 've had no alerts, a 10 min period of 0 cache hit ratio
[13:50:16] and here's the weird thing
[13:50:24] increased response times for change prop
[13:50:36] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-1h&to=now-1m&panelId=15&fullscreen
[13:50:56] that's what I am looking into now
[13:51:14] uhoh. Looks like things are now timing out.
[13:51:30] To affect changeprop, I'd expect it's on write to redis
[13:52:16] hmm
[13:52:23] I guess it could be on read.
[13:52:26] yeah no scores returned either
[13:52:38] Each request will involve a read to redis, some work, and then a write,
[13:53:05] Connection to Redis lost: Retry (2/20) in 1.00 second.
[13:53:06] aha
[13:53:16] ok lemme find out why and let you know
[13:53:23] how come no alerts were issued ? weird
[13:53:28] OK. Just checking. Are we having downtime right now?
[13:53:47] cause it looks like it to me
[13:53:49] yeah I 'll revert for a number of hosts
[13:53:53] kk
[13:54:38] [2017-07-06 13:54:32.714] nc_redis.c:1092 parsed unsupported command 'MULTI'
[13:54:45] are you using multi by any chance ?
[13:57:48] https://github.com/twitter/twemproxy/blob/master/notes/redis.md#transactions
[14:00:50] back in business
[14:02:14] hmm... I don't think so.
[14:02:28] I guess it could be something our connector is doing.
[14:03:47] Amir1, staff/backlog?
[14:03:52] Scoring-platform-team-Backlog, ORES, Patch-For-Review: Implement twemproxy for ORES in production - https://phabricator.wikimedia.org/T122676#3411664 (akosiaris) codfw has been migrated to use nutcracker and reverted. This has backfired majestically. The reason being ``` [2017-07-06 13:59:15.068] nc...
[14:04:13] halfak: I thought it's not in two hours
[14:04:19] didn't you move it?
[14:04:34] anyway, I'll be joining in five minutes
[14:04:39] Nope. We did discuss that. :/
[14:04:40] o/
[14:04:40] kk
[14:04:45] o/ awight
[14:04:53] https://hangouts.google.com/hangouts/_/wikimedia.org/staff-backlog
[14:04:56] ^ new hangout name
[14:05:02] oic
[14:05:03] Just in case your cal had the old one
[14:05:10] it does
[14:33:41] Scoring-platform-team, Cloud-VPS, ORES: Set up larger ores-compute instance - https://phabricator.wikimedia.org/T169809#3411767 (Halfak) a:Halfak
[14:35:23] Scoring-platform-team, WMF-Communications, Wikimedia-Blog-Content: Announce new team: "Scoring Platform" - https://phabricator.wikimedia.org/T169755#3411782 (Halfak) a:Halfak
[14:54:27] Scoring-platform-team, draftquality-modeling, artificial-intelligence: Experiment with Sentiment score feature for draftquality - https://phabricator.wikimedia.org/T167305#3411823 (awight) Finally, I have some progress to report. I've extracted the baseline feature list and the sentiment-laden featu...
[14:58:15] Scoring-platform-team-Backlog, Wikidata, artificial-intelligence: [Spike] Use suggested properties to get signal for completeness - https://phabricator.wikimedia.org/T158430#3411853 (Halfak) a:Glorian_WD>None
[14:58:40] Scoring-platform-team-Backlog, draftquality-modeling, artificial-intelligence: [Discuss] Hosting the monthly draft quality dataset on labsDB - https://phabricator.wikimedia.org/T167697#3411855 (Halfak)
[14:59:07] awight would something like https://phabricator.wikimedia.org/P5690 fix puppet errors for ores?
[14:59:18] notice the $ores_config_user and group params i added.
[14:59:32] I will publish the change on your change if you like that :)
[15:01:23] Scoring-platform-team-Backlog: Send celery logs and events to logstash - https://phabricator.wikimedia.org/T169586#3411884 (Halfak)
[15:03:53] paladox: It looks good--can we set ores::web::ores_config_user directly in hiera, even?
[15:04:00] yes
[15:04:15] it sets the default to the deploy user currently there
[15:04:20] to prevent any problems in prod
[15:04:36] and in labs we can add that to the hiera config section of horizion
[15:04:46] ether globally in the ores project or idividually
[15:05:01] I mean, set ores::web::ores_config_user = "%{hiera('ores_config_user')}"
[15:05:23] uh yeh
[15:05:27] hiera() will do it
[15:05:30] Scoring-platform-team-Backlog, draftquality-modeling, artificial-intelligence: [Discuss] draftquality on a sample, humongous everything, or something else? - https://phabricator.wikimedia.org/T168909#3411912 (Halfak) I'm a fan of a stratified sampling strategy. We should balance the # of OK and !OK...
[15:05:32] so in hiera we set it like
[15:05:39] ores_config_user:
[15:05:52] $ores_config_user = hiera('ores_config_user', 'deploy-service'),
[15:05:52] $ores_config_group = hiera('ores_config_group', 'deploy-service'),
[15:06:12] I was thinking, in hiera we set
[15:06:14] ores_config_user:
[15:06:19] ores::web::ores_config_user = "%{hiera('ores_config_user')}"
[15:06:32] but I'm fine with the way you did it, as well.
[15:06:47] oh
[15:06:59] yep, the way i did it is easier :)
[15:08:19] Scoring-platform-team-Backlog: Design how we'll train models which depend on private data - https://phabricator.wikimedia.org/T168908#3380311 (Halfak) I checked on the new incoming stat* boxes. They will be using Debian Stretch -- so we'll be able to use them to train model. We should use the labs boxes an...
[15:09:30] Scoring-platform-team-Backlog: Design how we'll train models which depend on private data - https://phabricator.wikimedia.org/T168908#3411938 (awight) We discussed a two-step solution, # For now, protect the files on labs by making them readable by a Un*x group including only NDA users. # Extract and build t...
[15:10:43] Scoring-platform-team-Backlog: Bury horrors of the editquality makefile - https://phabricator.wikimedia.org/T168455#3364890 (Halfak) Two processes that we follow: 1. Label the "needs_review" edits 2. Label a balanced sample of "needs_review" and not edits We started with (1) and then switched to (2) befor...
[15:10:53] paladox: lol, lemme vouch for that in gerrit
[15:11:02] ok :)
[15:15:58] awight what do we want the user to be on labs?
[15:16:55] paladox: How about nobody/nogroup
[15:17:01] ok thanks :)
[15:18:06] +1 to akosiaris's CR suggestion...
[15:18:55] awight done :)
[15:19:03] he suggests a more descriptive commit msg
[15:19:15] not a bad idea either :-)
[15:19:50] awight could you do the commit msg please? :)
[15:19:55] sure!
[15:20:14] thanks :)
[15:25:43] Scoring-platform-team, Wikilabels: [Discuss] Wikilabels routes refactor - https://phabricator.wikimedia.org/T165046#3411962 (Halfak) One benefit to presenting worksets is that it would make it easier for users to review old work. Other than the tasks in a workset, they can be identified by their tempora...
[15:26:19] it's merged awight :)
[15:27:19] awight could you re run puppet please on ores-web-04?
[15:28:06] Scoring-platform-team-Backlog, draftquality-modeling, artificial-intelligence: [Discuss] draftquality on a sample, humongous everything, or something else? - https://phabricator.wikimedia.org/T168909#3411973 (awight) Sounds like territory where we would want a dedicated data set balancing utility, wh...
[15:28:15] paladox: wonderful, thanks!
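Note on the hiera() exchange above: the two-argument form `hiera('ores_config_user', 'deploy-service')` returns the value found in the hierarchy when the key is set and falls back to the second argument otherwise. The Python below is only a toy model of that lookup-with-default behaviour, not Puppet code; the `hiera` function, `ores_config_owner` and the two dictionaries are illustrative stand-ins, while the key names and the values `deploy-service`, `nobody` and `nogroup` are taken from the conversation.

```python
def hiera(key, default, data):
    """Toy stand-in for Puppet's hiera(key, default).

    Returns the value stored under `key` in the hierarchy data if present,
    otherwise the supplied default.
    """
    return data.get(key, default)


def ores_config_owner(hiera_data):
    """Resolve the owner and group of the ORES config file."""
    user = hiera("ores_config_user", "deploy-service", hiera_data)
    group = hiera("ores_config_group", "deploy-service", hiera_data)
    return user, group


# Production sets nothing, so the defaults apply (assumption: empty data).
production = {}

# Labs overrides both keys, as agreed in the channel.
labs = {"ores_config_user": "nobody", "ores_config_group": "nogroup"}

print(ores_config_owner(production))
print(ores_config_owner(labs))
```

Keeping the default in the class parameter means production needs no hiera change at all, which is the "prevent any problems in prod" point made above.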
[15:28:23] your welcome :)
[15:29:33] RECOVERY - puppet on ores-web-04 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[15:29:38] yay
[15:29:40] :)
[15:29:51] awight ^^
[15:30:20] Notice: /Stage[main]/Ores::Web/Ores::Config[main]/File[/etc/ores/99-main.yaml]/owner: owner changed 'root' to 'nobody'
[15:30:23] Notice: /Stage[main]/Ores::Web/Ores::Config[main]/File[/etc/ores/99-main.yaml]/group: group changed 'root' to 'nogroup'
[15:30:27] yes, that's rad
[15:30:32] :)
[15:30:42] Civic duties call, back in 90 minutes...
[16:28:35] Scoring-platform-team-Backlog, draftquality-modeling, artificial-intelligence: [Discuss] draftquality on a sample, humongous everything, or something else? - https://phabricator.wikimedia.org/T168909#3412160 (Halfak) Hmm... So far we have been getting away with storing that data in the Makefile itse...
[16:34:21] Let me just air semi-publicly that I feel really weird about "staff community time".
[16:34:35] A group of people that size should not have nothing to discuss...
[16:35:12] Maybe it was my hosting, or maybe we have zero experience with self-governance and internal culture?
[16:49:40] awight, seems it's mostly broadcast these days.
[16:51:03] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3412341 (Halfak)
[16:51:05] Scoring-platform-team, ORES, User-Zppix: Extend icinga check to catch 500 errors like those of the 20170613 incident - https://phabricator.wikimedia.org/T167830#3412342 (Halfak)
[16:55:20] halfak: for sure, but what I misjudged was how much effort it would take to break out of that mold.
[16:55:41] tonguetips are not exactly laden with Things to Say, it seems
[16:55:49] Right. I can see.
[16:55:51] What?
[16:55:55] tonguetips?
[16:56:01] Things to Say?
[16:58:31] awight: while I remember it: we talked about networks and graph DBs and such a few days ago, and you asked if I had anything written up about the Wikidata network analysis I did for the article importance project. I do now: https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance/Wikidata_side_chain
[16:58:40] WOOT
[16:58:47] Nettrom: thanks for that!
[16:59:30] awight: You’re welcome! Let me know if there are things I failed to describe well, or something. Not easy to know what exactly to write when I know the project too well.
[17:08:20] Scoring-platform-team, draftquality-modeling, artificial-intelligence: Experiment with Sentiment score feature for draftquality - https://phabricator.wikimedia.org/T167305#3412482 (awight) Here are the baseline model's statistics, ``` (.env)awight@ores-compute-01:/srv/awight/draftquality$ revscoring...
[17:17:19] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, User-Ladsgroup, Wikimedia-log-errors: Fatal DBQueryError "Read timeout is reached" exception when using ORES on Special:Contributions - https://phabricator.wikimedia.org/T169429#3412538 (Krinkle) @awight Is likely f...
[17:18:37] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, User-Ladsgroup, Wikimedia-log-errors: Fatal DBQueryError "Read timeout is reached" exception when using ORES on Special:Contributions - https://phabricator.wikimedia.org/T169429#3412561 (awight) @Krinkle You're righ...
[17:45:16] Scoring-platform-team, Patch-For-Review: ORES puppet error on labs boxes, unable to set user to "deploy-service" - https://phabricator.wikimedia.org/T169164#3412646 (awight) a:awight>Paladox
[17:45:28] Scoring-platform-team, Patch-For-Review: ORES puppet error on labs boxes, unable to set user to "deploy-service" - https://phabricator.wikimedia.org/T169164#3389235 (awight) Open>Resolved
[18:16:59] (testing)
[18:17:45] o/
[18:17:49] intruder alert
[18:18:15] :D
[18:24:11] O/
[18:27:02] Scoring-platform-team, Wikimedia-Mailing-lists: Create scoring-internal mailing list for Scoring Platform team - https://phabricator.wikimedia.org/T169915#3412767 (Halfak)
[18:30:18] halfak: who would be able to access the interal list?
[18:30:36] Just the staff who want to take sick days and that sort of stuff.
[18:30:56] We'd only use it for that sort of thing
[18:31:30] oh true... see if i want to take a sick day i just dont show up :P
[18:34:39] :)
[18:38:22] * halfak runs to get some lunch
[18:38:24] brb
[18:40:06] late lunch see u in a bit though halfak
[19:11:07] Scoring-platform-team, Wikimedia-Mailing-lists: Create scoring-internal mailing list for Scoring Platform team - https://phabricator.wikimedia.org/T169915#3412968 (RobH) Open>Resolved This list has been created with the auto generate password option, so @halfak should have gotten an email with it.
[19:22:06] ^ that was really quick
[19:22:18] halfak can you teach me how you manage to have tasks done so quicky lol
[19:22:49] \o/ Yay it's done! I guess I have a bit of social capital :)
[19:23:07] I also try to read the docs so I put the task in the exact right place-- most of the time :S
[19:24:55] but its fun to put tasks in everywhere but the right place.. jk i just suck at putting stuff where it gose
[19:24:58] goes*
[19:25:09] i guess gose could work lol
[19:31:20] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3413099 (Halfak) Just updated https://gist.github.com/halfak/b31b8ddc38ca701c4c964478a53da75f And confirmed that ORES tim...
[20:14:05] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3413258 (awight) I found that the regex can be reduced to this and still causes the (near-)infinite loop: > bad_re = re.co...
[20:26:27] halfak: one-word patch FTW!
[20:26:36] :D
[20:26:55] Holy moley. You checked it out while I was still submitting it :)
[20:27:19] \o/
[20:27:23] Is ORES CI broken?
[20:27:27] I'll have some notes for the fab task soon.
[20:27:39] Hmm... maybe. CI for ORES is newish
[20:28:20] halfak: bad news...
[20:28:21] celery.worker.job: ERROR: Task ores.scoring_systems.celery_queue._process_score_map[9999f061-faea-4269-84e0-a3d2afc52250] raised unexpected: ValueError('signal only works in main thread',)
[20:28:37] Oh damn. lol
[20:28:42] https://travis-ci.org/wiki-ai/ores/builds/250912787#L1106
[20:28:59] So it works fine in practice, but if you start a thread within a test to mimic a separate environment, it gets mad
[20:29:00] Hmm
[20:29:08] Good thing for tests!
[20:29:20] Well in this case, the only thing that fails is the test
[20:29:27] So I'm not sure I'm excited.
[20:29:32] O_o
[20:29:56] Are production workers running in the "main thread"?
[20:30:07] Yes
[20:31:21] Here's the thing: https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/tests/test_celery_queue.py#L23
[20:31:22] Not sure how that can be the case, aren't there some 100x workers per core? I don't think we're running a linux process for each one
[20:31:29] We are
[20:31:32] That'
[20:31:38] s how concurrency works in python
[20:32:09] * halfak thinks
[20:32:36] awight@scb1002:~$ ps auxxww|grep ores | grep python | wc -l
[20:32:36] 46
[20:32:39] mebbe so
[20:33:45] OK so here's what I propose. We should have a real-world celery test in our CI -- not some weird thing wrapped in a nosetest function.
[20:34:02] In the nosetests function, we should test some basic things that work fine without starting a celery server
[20:34:22] And then we should start up ORES with celery running and throw some requests at it.
[20:34:26] In our Travis VM
[20:34:30] Sounds good in theory.
[20:34:49] I've wanted to have a real ORES server running in travis anyway
[20:35:16] Crap
[20:35:25] Every little thing turns into a big thing :/
[20:35:33] Well.. let me get my notes together first :)
[20:35:42] /o\
[20:39:59] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3413321 (Halfak) OK! Figured it out. So a Threading-based timeout will fail to kill an issue like this because of python...
[20:40:16] ores dev_server is no longer a thing?
[20:40:30] Oh yeah. Should remove any references to that
[20:40:36] will do.
[20:40:54] but also, I want to have a dev_server. Is there still a way to do that?
[20:41:08] ./utility applications.wsgi
[20:41:13] ty
[20:41:14] See ./utility -h :)
[20:41:29] or if installed, "ores applications.wsgi"
[20:42:18] The blog post about this is going to be fun :)
[20:54:24] I have a nice test endpoint going, but now I'm getting greedy and want the whole celery soup.
[20:54:55] You're not working on this CI problem are you?
[20:55:00] nope
[20:55:04] OK good :)
[20:55:09] * halfak hacks away
[20:55:31] I'm trying to understand more about the TaskRevokedError
[20:56:20] gotcha
[20:56:38] ... so I need a captive celery
[20:57:06] I'll just follow the outline of our vagrant role, if there's no silver bullet laying around
[20:57:23] "captive celery"?
[20:58:24] Um I want to start up a celery broker and workers
[20:59:24] Gotcha. I'm working on a nice clean update to the default ORES config that will do that.
[20:59:48] ooh
[20:59:52] want me to pick that up?
[21:00:02] or is this part of the CI hacking?
[21:00:07] https://gist.github.com/halfak/eb8a0338499bb93b6eb2fb1f05c4fa62
[21:00:16] LOL
[21:00:35] Stick this in a file in your config dir or just make the edit to the stock config file.
[21:00:52] copy
[21:00:55] Oh wait. This will sort before the primary config (bad)
[21:01:07] rename it to "zz-celery.yaml"
[21:01:15] Needs to sort after it
[21:01:30] which should come first? The stock config?
[21:01:45] Yup
[21:01:54] Later sort order get precedence
[21:04:02] That's mostly working, but now I get a timeout. Perhaps I haven't started any workers?
[21:04:19] Oh yeah! Of course.
[21:04:23] ./utility applications.celery
[21:04:45] Need to run both wsgi and celery :)
[21:05:30] cool! My screwy steps accidentally caused me to have a TaskRevokedError
[21:05:43] that was beautiful
[21:06:21] nice :)_
[21:07:57] Scoring-platform-team: On ORES, some revisions frequently return TaskRevokedError - https://phabricator.wikimedia.org/T169367#3413453 (awight) I just stumbled upon steps to reproduce locally: * Use scoring_system: local_celery in config * ores applications.wsgi * Request a score via the web interface and le...
[21:13:57] Scoring-platform-team: On ORES, some revisions frequently return TaskRevokedError - https://phabricator.wikimedia.org/T169367#3413472 (awight) I'm seeing a pretty mundane failure, the key `celery-task-meta-testwiki:revid:0.0.0:641962090` contains (pickled): > {'traceback': None, 'children': [], 'result': Tas...
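Note on the "Later sort order get precedence" rule above: it can be pictured as config fragments being merged in sorted filename order, with each later file overriding keys from earlier ones. The sketch below assumes that model and uses plain dictionaries in place of parsed YAML; it is not the actual ORES config loader. The file names `main.yaml` and `celery.yaml`, the value `single_thread` and the `timeout` key are hypothetical, while `zz-celery.yaml` and `local_celery` come from the conversation.

```python
from copy import deepcopy


def merge(base, override):
    """Recursively merge `override` into a copy of `base`; override wins."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def load_layered(fragments_by_filename):
    """Apply fragments in sorted filename order, so later names take precedence."""
    result = {}
    for name in sorted(fragments_by_filename):
        result = merge(result, fragments_by_filename[name])
    return result


# Hypothetical stock config and celery override, already parsed into dicts.
stock = {"scoring_system": "single_thread", "timeout": 15}
celery_override = {"scoring_system": "local_celery"}

# "celery.yaml" sorts before "main.yaml", so the stock config overrides it (bad).
too_early = load_layered({"main.yaml": stock, "celery.yaml": celery_override})

# "zz-celery.yaml" sorts after "main.yaml", so the override takes effect.
renamed = load_layered({"main.yaml": stock, "zz-celery.yaml": celery_override})

print(too_early["scoring_system"], renamed["scoring_system"])
```

Under this model the rename is the whole fix: the override file's content is unchanged, only its position in the sort order moves.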
[21:21:27] Scoring-platform-team: On ORES, some revisions frequently return TaskRevokedError - https://phabricator.wikimedia.org/T169367#3413497 (awight) The example bad record in production has cleared itself. I recommend we close this task until it happens again...
[21:22:02] Scoring-platform-team: On ORES, some revisions frequently return TaskRevokedError - https://phabricator.wikimedia.org/T169367#3413498 (awight) @Ragesoss Have you seen any of these errors in the last 24h or so?
[21:23:36] halfak: Anything I can do to support the timeout work, or shall I go back to the flagged revs model for a bit?
[21:24:22] flagged revs for a bit. I'll have something for you to look at in ~15 minutes, I think
[21:25:13] sounds fun!
[21:30:13] Scoring-platform-team: On ORES, some revisions frequently return TaskRevokedError - https://phabricator.wikimedia.org/T169367#3413550 (Ragesoss) @awight Yes, ~12 hours ago. I marked the spec as 'pending' so it doesn't break the build, but that test failed in one of the travis-ci jobs the last time I pushed...
[21:31:07] ragesoss: o/ Nice to see you here!
[21:31:17] Do you have the offending revid on hand?
[21:32:41] awight, I think there's one in the task
[21:32:48] awight: looking...
[21:32:49] Is that still returning an error?
[21:32:57] halfak: That one went good on us, sadly
[21:33:03] Oh. bastard
[21:33:06] ;)
[21:33:30] * halfak wonders if akosiaris' redis work had anything to do with that
[21:34:00] halfak: do you think you can review this? https://github.com/wiki-ai/ores/pull/214
[21:34:09] Sure! looking now
[21:34:10] the eslint passes and added to travis \o/
[21:34:35] Nice
[21:34:36] Merged
[21:34:40] it was 641962088
[21:34:52] yess
[21:34:52] Amir1: Great news, thanks!
[21:34:54] Thanks!
[21:35:02] awight: 675892696 was also having the issue at the time I reported it.
[21:35:22] (if the first one is bad, my test fails before I can see if the second one did as well)
[21:35:41] ragesoss: argh--those are both good now
[21:35:50] I need to put the task on ice until it happens again
[21:36:17] cos there just isn't enough evidence left
[21:37:56] awight: do different data centers have different caches?
[21:38:07] The fact that you were hitting those revids from an automated test is intriguing, though. That semi-confirms our guess that someone requested those scores during an outage on 6-23. We just have no clue why it would have persisted so long, should have cleared after 24hr.
[21:38:08] ragesoss, they do
[21:38:22] oho! Good point
[21:38:42] one thing we (me and the interns) noticed is that it would pass for one person but not another.
[21:39:23] they in India, me in US west coast, and CI in... wherever travis has its Amazon machines.
[21:40:01] Scoring-platform-team, MediaWiki-extensions-ORES, MW-1.30-release-notes (WMF-deploy-2017-06-27_(1.30.0-wmf.7)), Patch-For-Review, and 2 others: [Discuss] Make ORES Review Tool preferences more prominent - https://phabricator.wikimedia.org/T167910#3413571 (Ladsgroup) @jmatazzoni I couldn't find a...
[21:40:57] but also, often only one of two parallel builds would fail.
[21:44:53] I just checked the cache and they're good in both places... lemme see if I can hit the codfw endpoint though
[21:45:59] I restarted a CI build. first job passed, waiting to see about the second.
[21:46:17] not currenty seeing the error locally either.
[21:47:12] * awight despairs
[21:47:43] * halfak bets it was a fluke issue anyway
[21:47:49] i will keep an eye out and update if I see it again.
[21:48:17] (I could unmark the test as pending... then I'll definitely notice if it breaks the build)
[21:48:50] (then again, I could not do that)
[21:49:02] (yeah, not sounds good)
[21:49:12] That sounds perfect. I'll leave breadcrumbs to help diagnose faster next time.
[21:49:28] hehe no worries if you don't want to break your build for us.
[21:50:02] I can offer to do that remotely if you need ;)
[21:50:15] don't mind breaking the build, but too lazy to do it now.
[21:50:17] :)
[21:50:52] awight: the dashboard has quote-unquote 100% ruby test coverage these days, btw.
[21:50:58] :-)
[21:51:59] API tests working
[21:52:01] OMG
[21:52:10] https://github.com/wiki-ai/ores/pull/216
[21:52:13] awight, ^
[21:52:19] Took a little longer than 15 minutes
[21:52:26] Scoring-platform-team: On ORES, some revisions frequently return TaskRevokedError - https://phabricator.wikimedia.org/T169367#3413608 (awight) Ragesoss mentioned that the failure was intermittent, which makes it sound like the bad data was only cached at one data center. In my earlier tests, I only investig...
[21:52:34] halfak: looking
[21:53:43] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, User-Ladsgroup, Wikimedia-log-errors: Fatal DBQueryError "Read timeout is reached" exception when using ORES on Special:Contributions - https://phabricator.wikimedia.org/T169429#3413610 (Ladsgroup) It's not related...
[21:54:09] In my other PR, I'll disable the threaded celery tests and add a timeout test to api_test
[21:54:23] So we can be use that signal works in practice
[21:56:54] im going to put scoring team on T168584 under monitoring okay halfak?
[21:56:55] T168584: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584
[21:57:06] yes please
[21:57:17] Scoring-platform-team-Backlog, MediaWiki-Database, MediaWiki-extensions-ORES, User-Ladsgroup, Wikimedia-log-errors: Fatal DBQueryError "Read timeout is reached" exception when using ORES on Special:Contributions - https://phabricator.wikimedia.org/T169429#3413646 (Ladsgroup) Well, it's not to...
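Note on the signal-based timeout mentioned at 21:54 and the earlier `ValueError('signal only works in main thread',)`: Python only allows signal handlers to be installed from the main thread, which is why a timeout built on signals works in a worker process but fails inside a thread started by a test. The sketch below is a generic illustration of that behaviour on Unix, not the ORES or revscoring implementation; `Timeout`, `call_with_timeout` and `try_in_thread` are hypothetical names, and `time.sleep` stands in for slow work rather than for a regex match.

```python
import signal
import threading
import time


class Timeout(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def _raise_timeout(signum, frame):
    raise Timeout()


def call_with_timeout(func, seconds):
    """Run func() and raise Timeout if it takes longer than `seconds`.

    Unix only (uses SIGALRM). Must be called from the main thread, because
    signal.signal() raises ValueError anywhere else.
    """
    old_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    finally:
        # Cancel any pending alarm and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)


def try_in_thread():
    """Attempt the same call from a non-main thread and collect the error."""
    errors = []

    def target():
        try:
            call_with_timeout(lambda: None, 1.0)
        except ValueError as error:
            errors.append(str(error))

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return errors


# Main thread: the slow call is interrupted after roughly 0.1 seconds.
try:
    call_with_timeout(lambda: time.sleep(5), 0.1)
except Timeout:
    print("timed out in main thread")

# Worker thread: installing the handler fails before anything runs.
print(try_in_thread())
```

This matches the proposal in the channel: keep thread-free checks in the unit tests and exercise the timeout against a separately started server, where each worker is its own process with its own main thread.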
[21:57:21] Still unclear about when our wikilabels DB is going to be restarted
[21:58:06] Oh I see a recent reply
[21:58:28] Scoring-platform-team, DBA, Operations, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3413650 (Halfak) Yup! That works!
[21:58:31] Scoring-platform-team, DBA, Operations, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3413651 (Zppix)
[21:58:51] done
[21:59:14] Scoring-platform-team, Wikilabels: Notify Wikilabels users of short downtime on July 11 @ 1400 UTC - https://phabricator.wikimedia.org/T169933#3413655 (Halfak)
[22:01:42] halfak: want me to find a way to put an msg on wikilabels site and send a message to wiki-ai list and maybe wikipedia list?
[22:01:55] Scoring-platform-team, Wikilabels: Notify Wikilabels users of short downtime on July 11 @ 1400 UTC - https://phabricator.wikimedia.org/T169933#3413677 (Halfak) https://github.com/wiki-ai/wikilabels-wmflabs-deploy/pull/37
[22:02:12] Wikilabels site message is ready to go.
[22:02:27] ill metge
[22:02:28] merge
[22:02:34] Feel free to email ai/wikitech and post on [[:m:Wiki labels]]
[22:02:34] [2] https://meta.wikimedia.org/wiki/:m:Wiki_labels
[22:02:45] want me to merge ?
[22:03:33] halfak ^
[22:03:50] Hmm... Looks like last time we made a page on meta to discuss the outage.
[22:03:58] Let me change the URL and then let's merge
[22:04:02] ok
[22:04:18] let me know when you get the meta page up and link me so i can send out emails
[22:05:31] https://meta.wikimedia.org/wiki/Wiki_labels/2017-07-11
[22:05:33] It'
[22:05:35] s really brief
[22:05:40] But we can add details.
[22:06:48] k
[22:06:59] OK to merge if you don't see any typos
[22:09:34] amir beat me to it
[22:17:01] sent halfak
[22:17:05] Thanks
[22:18:36] np
[22:19:07] awight, https://github.com/wiki-ai/ores/pull/215 is ready for another look
[22:19:15] k
[22:20:10] I've got to pack up and hit the road. I have a bored puppy waiting for me at home ^_^
[22:20:28] enjoy!
[22:21:05] have a good day
[22:22:35] o/
[22:45:22] Scoring-platform-team, editquality-modeling, User-Ladsgroup, artificial-intelligence: Flagged revs approve model to fiwiki - https://phabricator.wikimedia.org/T166235#3413837 (awight) @Zache Thank you for digging this up! Your queries also help, now I understand that "unapprove" isn't the same a...
[22:46:39] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Why don't timeouts work during long regular expression matching? - https://phabricator.wikimedia.org/T168965#3382708 (awight) a:Halfak
[23:56:40] Scoring-platform-team, MediaWiki-extensions-ORES, MW-1.30-release-notes (WMF-deploy-2017-06-27_(1.30.0-wmf.7)), Patch-For-Review, and 2 others: [Discuss] Make ORES Review Tool preferences more prominent - https://phabricator.wikimedia.org/T167910#3414308 (jmatazzoni) ===Please take a pause I sai...