[00:04:07] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[00:06:51] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 4.963 second response time https://wikitech.wikimedia.org/wiki/ORES
[01:42:31] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[01:43:53] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/ORES
[13:15:47] ORES, Scoring-platform-team: [Discuss] Future ORES architecture - https://phabricator.wikimedia.org/T226193 (akosiaris) Hey, just saw this. I am around now. I have some minor ML pipeline experience so I am not sure of how much help I would end up being, but I wouldn't mind discussing.
[14:57:06] It looks like we haven't done a deploy in wmflabs for a long time. I'm going to try getting everything up to date there.
[15:04:23] ORES, Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (Halfak) Confirmed that even locally, I see: ` 0.00615000724792 0.00727391242981 0.00678396224976 0.00703287124634 0.00627303123474 0.00723099708557 0.219942092896 0.812291145325 1.290593...
[15:26:41] o/ accraze
[15:26:54] Looks like we have an allhands meeting today during standup so let
[15:26:59] 's async OK?
[15:27:14] sounds good halfak
[15:29:32] accraze, BTW, I silenced icinga-wm here until I get to the bottom of this issue. We'll still get the regular notifications via email and in the -operations channel if ORES in prod has an issue.
[15:32:32] My next step now is to bring our wmflabs install up to date and see if I'm getting the same weird behavior then.
[16:30:56] (PS1) Halfak: Splits the configuration files for easier editing. [services/ores/deploy] - https://gerrit.wikimedia.org/r/534638
[16:31:04] accraze, https://gerrit.wikimedia.org/r/#/c/mediawiki/services/ores/deploy/+/534638
[16:31:30] What do you think of this? It makes working on the config a lot easier for me and changes nothing practically.
[16:31:38] Since they get merged anyway.
[16:41:58] (PS2) Halfak: Splits the configuration files for easier editing. [services/ores/deploy] - https://gerrit.wikimedia.org/r/534638
[17:04:37] taking a look now
[17:06:44] halfak yeah this looks good, i'll +2 it
[17:08:55] (CR) Accraze: [C: +2] "LGTM" [services/ores/deploy] - https://gerrit.wikimedia.org/r/534638 (owner: Halfak)
[17:09:17] also here is my async update:
[17:09:31] Thanks
[17:10:26] Y: finally got the jade mw extension integration tests passing, fixed some mw namespacing issues, started on the action=jadecreateandpropose api module
[17:11:16] T: continue on action=jadecreateandpropose, also one more interview for assoc SWE in ~1hr
[17:27:39] OK update time.
[17:28:38] Y: Mostly catching up on email. I've also been digging through a bit of ORES code (prompted by email) and deployment code (prompted by wmflabs ongoing incident). Eventually I just told icinga-wm to shuttup and it's been brilliant.
[17:29:26] T: I'm working on getting fresh deploy of ORES to WMFlabs and doing some config cleanup to make my life easier. If this addresses the instability in WMFLabs, then I'll be moving on to goals work for WMF bureau.
[17:29:37] Oh also, I reviewed and merged the autodocs PRs.
[17:31:02] awesome!
[17:53:31] Scoring-platform-team (Current): Automate docs build for model repos - https://phabricator.wikimedia.org/T230517 (ACraze) Alright, looks like all model building libs have docs being built on RTD now: https://drafttopic.readthedocs.io https://draftquality.readthedocs.io https://editquality.readthedocs.io ht...
[19:37:17] OMG, I keep running into weird problems with flask.
[19:47:46] ORES, Scoring-platform-team (Current): Feature injection doesn't work when using "?revids=" param - https://phabricator.wikimedia.org/T232143 (Halfak)
[19:47:50] ORES, Scoring-platform-team (Current): Feature injection doesn't work when using "?revids=" param - https://phabricator.wikimedia.org/T232143 (Halfak) a:Halfak https://github.com/wikimedia/ores/pull/330/files
[19:48:40] Was a pytest_flask and pytest version mismatch.
[19:54:09] * halfak waits for our staging environment to respond at all...
[19:54:28] Hmm. Works on the web, but not via the local machine...
[19:56:29] Aha! We have something running on port 80 but it doesn't respond. ORES is on port 8081 on staging.
[19:56:39] Somehow, I don't see any issue there.
[20:01:50] wikimedia/ores#1359 (injection_fix - 10af57b : halfak): The build failed. https://travis-ci.org/wikimedia/ores/builds/581374245
[20:10:07] Confirmed that I see the issue on all web nodes but not staging.
[20:10:09] Weird!
[20:13:31] There must be something that the web nodes do that our staging node doesn't even when it doesn't need to talk to redis or celery.
[20:13:38] Maybe it's metrics.
[20:16:11] Aha! staging uses a local statsd.
[20:18:57] I think we're blocking because we're trying to send stats to graphite and it isn't working!
[20:23:21] ^ that would make sense halfak
[20:28:56] I started looking at this bug because I just happened to notice it in the logs and now here I am concluding it is the culprit :)
[21:17:07] ORES, Scoring-platform-team: ORES is creating a log of metrics. This is due to revid count-based metrics - https://phabricator.wikimedia.org/T232164 (Halfak)
[21:17:18] ORES, Scoring-platform-team: ORES is creating a log of metrics. This is due to revid count-based metrics - https://phabricator.wikimedia.org/T232164 (Halfak) https://github.com/wikimedia/ores/pull/331
[21:17:27] ORES, Scoring-platform-team (Current): ORES is creating a log of metrics. This is due to revid count-based metrics - https://phabricator.wikimedia.org/T232164 (Halfak)
[21:19:39] wikimedia/ores#1365 (statsd_no_counts - 7abb07e : halfak): The build passed. https://travis-ci.org/wikimedia/ores/builds/581408001
[21:52:53] Why must we deal with floating point rounding errors? They are simply inhumane.
[21:55:50] ORES, Scoring-platform-team (Current): ORES is creating a lot of metrics. This is due to revid count-based metrics - https://phabricator.wikimedia.org/T232164 (Halfak)
[21:55:59] ORES, Scoring-platform-team (Current): ORES is creating a lot of metrics. This is due to revid count-based metrics - https://phabricator.wikimedia.org/T232164 (Halfak) a:Halfak
[21:58:06] ORES, Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (Halfak) So I've worked out that this issue does not happen on staging. I think that is because staging is sending metrics to localhost. So my newest and bestest hypothesis is that we're...
[22:33:56] haha, use BigNumber if we have infinite computational resource
[22:34:22] @halfak, I filed a bug about recent increasing of ORES trust score
[22:34:31] @halfak, I filed a bug about recent increasing of ORES trust scoring API latency
[22:35:17] oh, you already responded
[22:36:49] ORES, Scoring-platform-team: ORES API latency too high - https://phabricator.wikimedia.org/T231776 (Xinbenlv) Oh, I thought `ores.wmflabs.org` was the production API. I didn't know the existence of `ores.wikimedia.org`. I was using a single one API call that loads 100 rev score in that one request. Does...
[22:37:51] I'd also like to chat about whether there is anything WikiLoop Battlefield can provide to help with training data input for ORES
[23:01:07] o/ xinbenlv!
[23:01:42] Re. querying ORES, we'll implement a limit on the number of revids you can request at a time soon. You should consider limiting the revids per request to 50
[23:01:57] That maximizes our batch processing anyway.
[23:02:13] When you request 100 revids, internally, we split it into 50-revision batches.
[23:02:42] re. training data, if you can hang onto it, we'll be very interested in having you create a dump for us at some point in the future.
[23:03:06] We're working on a collaboratively edited repository for training data right now ("Jade").
[23:03:40] Eventually, we want to make it easy for you to connect WikiLoop to it directly :D
[23:23:35] xinbenlv, another thing. You can make up to 4 parallel requests to ores at the same time. So essentially you should expect to be able to run 4x50 revisions every 6 seconds or so.
[23:23:49] Depending on a lot of things. But that throughput should be attainable.
[23:33:07] You can get a dump with one click here: http://battlefield.wikiloop.org/api/markedRevs.csv
[23:33:21] Or at the "Download" menu item of http://battlefield.wikiloop.org
[23:33:48] @halfak. Cool, glad to know the new limit is 50. Let me know once that has stabilized.
[23:34:26] I was actually curious, have you ever considered doing an ORES score on every individual revision, and then keeping a database of that
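
The client-side advice from 23:01 to 23:23 (at most 50 revids per request, up to 4 requests in flight) could be sketched roughly as below. This is only an illustration under assumptions, not ORES or WikiLoop code: `fetch_batch` is a hypothetical caller-supplied function that sends one request for a batch and returns a dict of scores keyed by revid, and `fake_fetch` is a stand-in so the sketch runs without any network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Sequence

BATCH_SIZE = 50    # revids per request, as suggested in the chat
MAX_PARALLEL = 4   # parallel requests, as suggested in the chat


def chunked(revids: Sequence[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive batches of at most `size` revids."""
    for start in range(0, len(revids), size):
        yield list(revids[start:start + size])


def score_all(
    revids: Sequence[int],
    fetch_batch: Callable[[List[int]], Dict[int, dict]],
    max_parallel: int = MAX_PARALLEL,
) -> Dict[int, dict]:
    """Score all revids with at most `max_parallel` batch requests in flight.

    `fetch_batch` is hypothetical: it should perform one request for the
    given batch and return {revid: score}.
    """
    scores: Dict[int, dict] = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map yields results in batch order
        for batch_scores in pool.map(fetch_batch, chunked(revids)):
            scores.update(batch_scores)
    return scores


def fake_fetch(batch: List[int]) -> Dict[int, dict]:
    """Stand-in for a real HTTP call; echoes each revid back."""
    return {revid: {"revid": revid} for revid in batch}


if __name__ == "__main__":
    result = score_all(list(range(1, 121)), fake_fetch)
    print(len(result), "revisions scored")
```

Passing the request function in as an argument keeps the batching and concurrency limits separate from whatever HTTP client and URL scheme the caller actually uses.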