[00:52:31] 06Revision-Scoring-As-A-Service, 10Edit-Review-Improvements-RC-Page, 10ORES, 06Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), and 2 others: Manage ORES preferences on Watchlist (and Contributions) - https://phabricator.wikimedia.org/T160475#3188566 (10jmatazzoni)
[01:06:44] 06Revision-Scoring-As-A-Service, 10Edit-Review-Improvements-RC-Page, 10ORES, 06Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017): Damaging levels on Polish Wikipedia overlap too much - https://phabricator.wikimedia.org/T161655#3188616 (10jmatazzoni)
[01:22:41] yay! I added lots of new functionality and somehow still reduced the # of lines by almost 1k
[01:22:42] :D
[03:39:51] 10Revision-Scoring-As-A-Service-Backlog, 10ORES: Configure deploy to include CODFW and use the new oresrdb - https://phabricator.wikimedia.org/T159397#3188799 (10Ladsgroup) The changeprop already sends events to codfw. Once it started to send them I checked grafana and it looked okay (note the codfw nodes in p...
[07:36:51] 06Revision-Scoring-As-A-Service, 10DBA, 10rsaas-articlequality: [Discuss] Hosting the monthly article quality dataset on labsDB - https://phabricator.wikimedia.org/T146718#3189093 (10Marostegui) >>! In T146718#3186337, @Halfak wrote: > Hi @Marostegui. > > * Requirements are 15.6GB with an additional 2GB p...
[10:14:33] 06Revision-Scoring-As-A-Service, 10Collaboration-Community-Engagement, 10Edit-Review-Improvements-RC-Page, 06Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), 06Community-Liaisons (Apr-Jun 2017): Communicate new beta prefs and changes to ORES users... - https://phabricator.wikimedia.org/T163153#3189418
[10:56:22] Amir1: yeah I know why scb1003 gets more precaching queries than the other hosts. changeprop is requesting all the precaching queries and is running on the same boxes as ores. Now, IIRC some kafka node driver does not play well so it's not fully parallelized yet across multiple hosts. The end result is changeprop on scb1003 is (probably?) the most active one and as a result it sends most requests to the local ores uwsgi.
[10:57:55] Oh, Thanks
[11:38:30] 06Revision-Scoring-As-A-Service, 10MediaWiki-Database, 10MediaWiki-extensions-ORES, 15User-Ladsgroup: SpecialRecentChangesLinked::doMainQuery bad query bringing down database server - https://phabricator.wikimedia.org/T163063#3189505 (10jcrespo) >>! In T163063#3185032, @Ladsgroup wrote: > The reason that w...
[13:38:05] 06Revision-Scoring-As-A-Service, 10MediaWiki-Database, 10MediaWiki-extensions-ORES, 15User-Ladsgroup: SpecialRecentChangesLinked::doMainQuery bad query bringing down database server - https://phabricator.wikimedia.org/T163063#3189717 (10Ladsgroup) At that time, ORES wasn't deployed anywhere in production.
[13:43:40] 06Revision-Scoring-As-A-Service, 10MediaWiki-Database, 10MediaWiki-extensions-ORES, 15User-Ladsgroup: SpecialRecentChangesLinked::doMainQuery bad query bringing down database server - https://phabricator.wikimedia.org/T163063#3189733 (10jcrespo) Then the problem is not ORES.
[14:12:43] akosiaris, does this effectively reduce our capacity? Since scb1003 will likely complain of overload errors sooner than when all servers are saturated with requests?
[14:22:15] 06Revision-Scoring-As-A-Service, 10DBA, 10rsaas-articlequality: [Discuss] Hosting the monthly article quality dataset on labsDB - https://phabricator.wikimedia.org/T146718#3189849 (10Halfak) @Marostegui, we've already experimented heavily with usage of this table by researchers at the above mentioned worksh...
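A toy sketch (hypothetical request volumes; host names mirror the ones in the log) of the imbalance akosiaris describes at [10:56:22]: the changeprop instance that ends up doing the consuming sends every precache request to the ores uwsgi on its own box, while the load balancer spreads external requests across all hosts.

```python
# Hypothetical illustration only: why one scb host sees far more ores traffic.
# The request volumes are made up; only the shape of the distribution matters.
from collections import Counter
import itertools

hosts = ["scb1001", "scb1002", "scb1003", "scb1004"]
requests = Counter()

# changeprop runs on the same boxes as ores; the instance doing the consuming
# (here scb1003) sends all precache requests to its local ores uwsgi.
active_changeprop = "scb1003"
requests[active_changeprop] += 4000   # precache requests (hypothetical volume)

# External requests go through the load balancer and are spread across hosts
# (plain round robin here; production uses a weighted round robin).
external = itertools.cycle(hosts)
for _ in range(6000):                 # external requests (hypothetical volume)
    requests[next(external)] += 1

for host in hosts:
    print(f"{host}: {requests[host]} requests")
# scb1003 gets its round-robin share plus all the precaching traffic, so it
# saturates first even though total cluster capacity is unchanged.
```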
[14:29:58] halfak: I think the overall capacity is the same, I don't see how that changes the sum. Now it is unbalanced which is not a good thing, but funny thing is this is dynamic. I can stop changeprop on scb1003 and it will be another host that exhibits the same behaviour (see https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1 where scb1004 rushed in to do the work). It's a known issue with services hitting services cu
[14:30:42] I suppose we do re-balance in the celery queue, but not with web workers.
[14:30:57] Once a single server's workers are occupied, it'll start to block new requests.
[14:31:18] if a single server stops responding it will be depooled automatically
[14:31:27] akosiaris, that'd be bad
[14:31:28] :P
[14:31:47] it will also be repooled automatically when it starts serving requests again
[14:31:58] is that better? ;-)
[14:32:17] Yeah, but not great, right?
[14:32:27] Just asking these questions to help me understand the implications.
[14:32:40] actually from the load balancer's POV, it's almost great
[14:33:01] the one thing we would like to implement over there is gradually increasing the weight of an app worker
[14:33:10] but that's more or less the only feature missing
[14:33:14] akosiaris, but in this case, we'll never be able to use the available capacity of all servers.
[14:33:58] One server will have nearly half its capacity taken up by precaching and the other three will only fill up half way before failures start to come and the LB makes some changes.
[14:34:09] if we are in a situation where hosts get depooled because they can't serve requests we have way bigger problems than not using all available capacity
[14:34:21] akosiaris, yeah. That's what I'm talking about.
[14:34:29] I don't want to depool unnecessarily.
[14:34:39] In this case, it seems like we might.
[14:35:19] in this case, and the way the architecture is right now
[14:35:34] what will happen is that 1 host will end up serving only precaching
[14:35:51] akosiaris, oh. So the other hosts will preferentially get external requests?
[14:35:55] yes
[14:35:58] Oh that
[14:36:00] s cool then
[14:36:03] :)
[14:36:05] :-)
[14:36:21] So when the hordes of ORES users show up, they'll fill up the capacity of everything but scb1003.
[14:36:32] HORDES
[14:36:36] lol
[14:36:54] fwiw it's very easy for us to change the weight of an appserver
[14:37:19] if we notice a problem it's one CLI command to make any server get a different % of requests
[14:37:37] it's a weighted round robin
[14:37:40] that's for external requests
[14:38:07] unfortunately precaching has this weird thing with changeprop not currently honoring that
[14:38:19] Gotcha.
[14:38:22] we are aware, just not overly worried yet
[14:38:28] but the plan is to fix it
[14:38:32] * halfak shakes fist at changeprop, but not that hard.
[14:39:16] unrelated note, ORES is going a/a today
[14:39:26] active/active that is
[14:39:37] requests will start flowing to both DCs soon
[14:40:02] that's in preparation for the DC switchover which is tomorrow
[14:40:17] akosiaris, \o/
[14:40:23] Cool. I'm looking forward to that :D
[14:40:26] https://wikitech.wikimedia.org/wiki/Switch_Datacenter
[14:41:43] Looks like this died: https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&panelId=15&fullscreen
[14:41:45] Hmm
[14:41:53] * halfak curses changeprops changing metric names
[14:46:02] akosiaris, it looks like I have a metric in changeprop called ores-cache-1 and ores-cache-2. Is 1 eqiad and 2 codfw?
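For context on the weighted round robin mentioned at [14:37:19]-[14:37:40]: a toy sketch of how per-host weights shift the share of external requests each ores backend receives. The weights are made up, and this is not the production LVS/PyBal scheduler, just an illustration of the idea.

```python
# Toy weighted round robin for external ores requests. Weights are
# hypothetical; this is not the production load balancer implementation.
import itertools
from collections import Counter

weights = {
    "scb1001": 10,
    "scb1002": 10,
    "scb1003": 3,   # give the precaching-heavy host a smaller external share
    "scb1004": 10,
}

# Naive WRR: repeat each host `weight` times and cycle through the schedule.
schedule = itertools.cycle(
    [host for host, w in weights.items() for _ in range(w)]
)

counts = Counter(next(schedule) for _ in range(33_000))
for host in weights:
    print(f"{host}: {counts[host]} external requests")
# Lowering one host's weight is the kind of one-command adjustment described
# above: the other hosts absorb the difference proportionally.
```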
[14:46:10] (they used to not be numbered)
[14:50:22] I honestly don't know.. looking
[14:50:37] it's quite possible
[14:52:13] halfak: ah found it. yes you are right
[14:52:29] cache_1 => eqiad, cache_2 => codfw
[14:52:34] confusing naming I'll admit
[14:52:49] I wonder how it gets generated
[14:54:35] halfak: ah https://gerrit.wikimedia.org/r/#/c/347980/1/scap/templates/config.yaml.j2
[14:54:37] there's the answer
[14:54:57] hmm my patch at least had these clearly named...
[14:55:11] \o/ thanks dude.
[14:55:43] It's a bummer to not have them named, but for now I'll fix grafana.
[15:00:09] 10Revision-Scoring-As-A-Service-Backlog, 10ChangeProp, 10ORES, 06Services (blocked): Change ORES rules to send all events to new "/precache" endpoint - https://phabricator.wikimedia.org/T158437#3189939 (10Halfak)
[15:11:23] 06Revision-Scoring-As-A-Service, 10ORES: Update grafana to split metrics by eqiad and codfw - https://phabricator.wikimedia.org/T163212#3189973 (10Halfak)
[15:11:25] 06Revision-Scoring-As-A-Service, 10ORES: Update grafana to split metrics by eqiad and codfw - https://phabricator.wikimedia.org/T163212#3189986 (10Halfak) https://grafana.wikimedia.org/dashboard/db/ores
[15:11:31] 06Revision-Scoring-As-A-Service, 10ORES: Update grafana to split metrics by eqiad and codfw - https://phabricator.wikimedia.org/T163212#3189987 (10Halfak) a:03Halfak
[15:15:13] 06Revision-Scoring-As-A-Service, 10Collaboration-Community-Engagement, 10Edit-Review-Improvements-RC-Page, 06Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), 06Community-Liaisons (Apr-Jun 2017): Communicate new beta prefs and changes to ORES users... - https://phabricator.wikimedia.org/T163153#3190009
[17:14:19] 10Revision-Scoring-As-A-Service-Backlog, 10ORES: On labels.wmflabs.org, make the buttons more visible when they have been selected - https://phabricator.wikimedia.org/T163222#3190410 (10Trizek-WMF)
[17:16:33] 10Revision-Scoring-As-A-Service-Backlog, 10OOjs-UI, 10ORES: On labels.wmflabs.org, make the blue buttons more visible when they have been selected - https://phabricator.wikimedia.org/T163222#3190425 (10Trizek-WMF)
[19:14:06] 06Revision-Scoring-As-A-Service, 10revscoring: Implement "thresholds", deprecate "pile of tests_stats" - https://phabricator.wikimedia.org/T162217#3191073 (10Halfak) I just pushed a bunch of stuff to the branch. Here's what model information looks like now: ``` GradientBoosting(max_depth=3, scale=false, min_...
[19:17:49] 06Revision-Scoring-As-A-Service, 10revscoring: Implement "thresholds", deprecate "pile of tests_stats" - https://phabricator.wikimedia.org/T162217#3191091 (10Halfak) I'm thinking that we'll want to provide access into the formatted JSON information about a model so someone could select the specific data they w...
[22:07:15] 06Revision-Scoring-As-A-Service, 10Edit-Review-Improvements-RC-Page, 10ORES, 06Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), and 2 others: Tweak ORES-Related Preferences for Watchlist and RC Page ahead of next release - https://phabricator.wikimedia.org/T162831#3191956 (10Etonkovidova) (1) beta...
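Relating to the thresholds/model-information work in T162217 above: a hedged sketch of pulling a model's statistics as JSON from the public ORES API so a caller can pick out only the fields they need. The v3 URL, the model_info parameter, and the response layout are assumptions based on the public ORES API; the exact shape of the thresholds output discussed in the task may differ.

```python
# Hedged sketch: fetch a model's statistics JSON from the public ORES API.
# The v3 endpoint, the model_info parameter, and the response layout are
# assumptions; adjust to whatever the thresholds work in T162217 ships.
import json
import requests

ORES = "https://ores.wikimedia.org/v3/scores/enwiki/"

resp = requests.get(
    ORES,
    params={"models": "damaging", "model_info": "statistics"},
    headers={"User-Agent": "ores-thresholds-sketch"},
    timeout=30,
)
resp.raise_for_status()

# Assumed layout: {"enwiki": {"models": {"damaging": {"statistics": {...}}}}}
info = resp.json()["enwiki"]["models"]["damaging"]

# List the available statistics keys so a user can decide which specific
# threshold or metric they actually want to drill into.
print(json.dumps(sorted(info.get("statistics", {}).keys()), indent=2))
```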