[01:50:53] 10Scoring-platform-team, 10Edit-Review-Improvements-Integrated-Filters, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), 10MW-1.31-release-notes (WMF-deploy-2017-10-03 (1.31.0-wmf.2)), and 2 others: PHP Warning: Attempted to serialize unserializab... - https://phabricator.wikimedia.org/T176236#3659452
[03:21:41] (03PS14) 10Awight: [WIP] Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[03:22:47] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053) (owner: 10Awight)
[16:30:45] o/
[16:37:26] o/ awight
[16:46:47] eek. Just took another look at T176456 with jmat
[16:46:47] T176456: ORES on Watchlist causes big slowdown—especially with 'Last revision' filter turned on - https://phabricator.wikimedia.org/T176456
[16:47:29] Unfortunately, parts of the query are getting cached so it's really hard to tell how changing the query impacts performance.
[16:49:38] Yeah that seems like a PITA.
[16:53:33] Gimme a sanity check on this—when you have an expensive query but are limiting the results to a small number of rows, doesn't MySQL do all the processing to create the full list, then return the requested size?
[16:54:09] especially when sorting results
[16:54:09] yeah. There's just no other way to do it (that I can think of)
[17:15:35] k, setting that down now.
[17:15:51] Back to Extension:ORES
[17:21:34] o/
[17:38:35] halfak: awight: we have three signups for the working group updates already https://meta.wikimedia.org/wiki/Global_message_delivery/Targets/JADE
[17:38:43] Right on!
[17:38:53] Yay. I'll keep following up on those messages, get y'all the draft of the email to lists in a bit
[17:44:13] (03PS15) 10Awight: Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[17:44:31] (03PS16) 10Awight: Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[17:46:18] halfak: Amir1: ^ That should be pretty close to what we need. The one important thing I'm not covering with a test yet is the new->old fallback itself, but there's a unit test for the internals that the fallback relies on, so I'd be happy to just smoke test.
[17:46:20] Doing that now.
[17:47:36] I'll get the wmflabs deployment out today. Maybe we can do the beta deployment too?
[17:50:23] Happy to help with any of that. We might get the extension merged by then, if you're thinking that too.
[17:51:04] Right. I don't want to deploy to beta until we have that merged.
[17:51:12] I'll start on the deploy to wmflabs ASAP.
[18:17:03] 10Scoring-platform-team, 10JADE, 10Epic: Deploy JADE MVP API in labs - https://phabricator.wikimedia.org/T176333#3661878 (10Keegan)
[18:17:05] 10Scoring-platform-team, 10Community-Liaisons, 10JADE: Set up working group for JADE - https://phabricator.wikimedia.org/T170954#3661876 (10Keegan) 05Open>03Resolved >>! In T170954#3656259, @Qgil wrote: > Moving this task back to the #community-liaisons backlog. Is this support request still active? Is i...
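A way to check halfak's answer (16:54) empirically, and to sidestep the caching problem from 16:47. Everything below is illustrative: the query is a simplified stand-in for the real watchlist query, and the connection parameters are made up. The key signal is "Using filesort" in EXPLAIN's Extra column, which means MySQL examines and sorts every matching row before LIMIT takes effect.

```python
# Hypothetical sketch, not the query from T176456. The join against
# ores_classification approximates how the ORES extension filters
# recentchanges; names of credentials and hosts are assumptions.
import pymysql

conn = pymysql.connect(host="localhost", user="research",
                       password="...", database="enwiki")

query = """SELECT rc_id
           FROM recentchanges
           JOIN ores_classification ON oresc_rev = rc_this_oldid
           WHERE oresc_probability > 0.8
           ORDER BY rc_timestamp DESC
           LIMIT 50"""

with conn.cursor() as cur:
    cur.execute("EXPLAIN " + query)
    for row in cur.fetchall():
        print(row)  # look for "Using filesort" / "Using temporary"
    # SQL_NO_CACHE keeps the (pre-MySQL-8.0) query cache out of timings.
    cur.execute(query.replace("SELECT", "SELECT SQL_NO_CACHE", 1))
```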
[18:17:40] 10Scoring-platform-team, 10Community-Liaisons, 10JADE: Set up working group for JADE - https://phabricator.wikimedia.org/T170954#3661880 (10Keegan)
[18:17:42] 10Scoring-platform-team (Current), 10JADE: Create list of ORES collaborators (focus on language asset helpers) - https://phabricator.wikimedia.org/T174685#3661879 (10Keegan) 05Open>03Resolved
[18:23:15] (03PS17) 10Awight: Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[18:23:42] RoanKattouw: ^ That's ready for CR if you feel like it.
[18:25:02] Pretty nasty thing I did, but I use a magic empty result to allow us to cache a failure, so we only test the API endpoint compatibility once per minute.
[18:32:47] ^ halfak: Amir1: smoke tested and updated patch
[18:39:19] I'll start on the blog work update
[18:39:33] * awight catches a cricket and puts it in my taco
[18:40:43] Nice!
[18:41:10] Also ewwww.
[18:41:21] working on WMFlabs deployment
[18:44:10] halfak: Do you need me to CR or merge anything? Otherwise, I'm thinking of breaking for lunch in 10min
[18:44:28] awight, one quick thing since you are here.
[18:44:44] https://github.com/wiki-ai/ores-wmflabs-deploy/pull/88
[18:44:54] kk
[18:46:12] Just a little thing to make life easier that I'll lose if we don't do it now :)
[18:46:48] It could be merged, but there's at least one "deploy" in need of substitution
[18:46:58] halfak: when you get time https://www.mediawiki.org/wiki/User:Keegan_(WMF)/sandbox#Email_draft
[18:47:04] Short and to the point
[18:47:28] halfak: also search for the 'deploy' e.g. in the function default param value
[18:49:11] awight, got it
[18:49:22] * halfak reads draft
[18:50:11] merged
[18:50:45] thanks!
[18:50:55] I'll have wmflabs ready by the time you are done with lunch :D
[18:51:05] staging should be good now
[18:53:37] If you have down time, take a look at the horrors of my extension patch...
[18:53:59] lol @ downtime
[18:53:59] I'm sad about how kludgily I approached the migration
[18:54:05] ;)
[18:54:25] on the bright side, all that stuff can be torn out as soon as we deprecate the old code and config
[18:54:31] \o/ right :)
[18:56:33] (03CR) 10jerkins-bot: [V: 04-1] Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053) (owner: 10Awight)
[19:00:14] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:00:49] Oh shaddap
[19:01:45] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:01:45] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:03:13] We had a weird OOM on deploy
[19:05:38] Yup. Looks like we have outgrown our staging machine :/
[19:05:54] Oh wait. No, staging is fine. This is just a web node.
[19:06:02] hrrm...
[19:06:48] strange.
[19:07:20] We can fit both into 16GB of memory on staging, but can't fit just web into 8GB of memory on a web node.
[19:09:07] Looks like our per-process memory usage has increased by a factor of 3.
[19:09:10] :S
[19:09:37] uwsgi used to be 792MB and now it's 3.1GB
[19:09:51] Hmmmmmmhmmmhmmm
[19:10:10] Thresholds information takes up too much space.
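A minimal sketch of the "magic empty result" trick from 18:25, in Python rather than the extension's PHP: on failure, cache a sentinel value with a short TTL so the endpoint-compatibility probe runs at most once per minute. All names here are hypothetical, not the actual patch.

```python
import time

FAILURE = object()   # the "magic empty result" sentinel
TTL = 60             # seconds: at most one compatibility probe per minute
_cache = {}          # key -> (expires_at, value)

def cached_thresholds(fetch_from_new_api):
    entry = _cache.get("thresholds")
    if entry is None or entry[0] < time.time():
        try:
            value = fetch_from_new_api()
        except IOError:
            value = FAILURE          # cache the failure itself
        _cache["thresholds"] = (time.time() + TTL, value)
    else:
        value = entry[1]
    # A cached failure tells the caller to fall back to the old API
    # without re-probing the new endpoint on every request.
    return None if value is FAILURE else value
```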
[19:14:14] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 1.127 second response time
[19:14:44] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 459 bytes in 0.650 second response time
[19:14:45] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 459 bytes in 1.057 second response time
[19:37:10] OK, so this might be a blocker.
[19:37:32] I did not expect memory usage to jump so much.
[19:37:34] * halfak thinks
[20:30:21] 10Scoring-platform-team, 10ORES: Model information UI (graphs and statistics) - https://phabricator.wikimedia.org/T140364#3662288 (10Halfak)
[21:32:54] FYI, I'm stress testing the new ORES cluster
[21:33:10] I don't see any needles moving in Grafana, however.
[21:35:31] o/
[21:37:58] the cluster is ores* on grafana, right awight?
[21:38:10] Zppix: exactly
[21:38:14] Maybe the data just hit?
[21:38:23] awight: that's what I am thinking
[21:39:08] nice.
[21:39:57] awight: looks good on grafana, if I'm reading this data right (lord knows I suck at reading grafana)
[21:40:21] yes!
[21:40:53] Next test, I'll dial it up a bit. CPU usage is around 6%
[21:41:41] Increased the load 6x...
[21:44:03] halfak: I believe the new cluster is handling 100x the normal load on scb*
[21:44:23] Holy moley.
[21:44:27] lol, something broke
[21:44:41] * halfak loads up grafana
[21:45:03] This was supposed to be a background task, but it's fun :p
[21:45:51] I'm still recovering from the thought that our "thresholds" system requires a huge amount of additional memory.
[21:46:09] Probably Python's array storage is horribly inefficient
[21:46:16] I know it compresses really well.
[21:46:19] oh
[21:46:24] And we're using pseudo-dicts as rows!
[21:46:53] I wonder how many rows there are.
[21:47:49] 10Scoring-platform-team, 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662554 (10Halfak)
[21:48:18] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662569 (10Halfak)
[21:48:34] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662554 (10Halfak) My hypothesis is that it's all 'thresholds' because that's the new big thing.
[21:51:04] OK, so it looks like we're keeping data on ~20000 unique thresholds for every label.
[21:51:25] So basically we have a unique threshold for every single input observation in testing.
[21:52:54] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662588 (10Halfak) ``` >>> m = Model.load(open("submodules/editquality/models/enwiki.damaging.gradient_boosting.model")) >>> len(m.info['st...
[21:53:24] maybe a bit overboard
[21:54:15] Still, 20k * 10 floats shouldn't be that much data, right?
[21:54:39] Like, 160 kilobytes
[21:54:55] in ANSI C, maybe
[21:55:27] Are we using arbitrary precision? :)
[21:55:53] halfak: check this out, https://stackoverflow.com/questions/16972501/numpy-size-of-data-type
[21:55:53] Should be a 32-bit float
[21:55:58] It behaves like one
[21:57:08] Seems that really bad things happen when ORES overloads
[21:57:16] ?
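Back-of-envelope for the 21:51-21:54 exchange: ~20000 threshold rows of ~10 floats each is tiny as packed data, but each row stored as a Python dict of boxed floats costs an order of magnitude more. The field names below only approximate the statistics revscoring keeps per threshold, and exact sizes vary by CPython version.

```python
import sys

n_rows = 20000

# What "in ANSI C, maybe" would look like: packed 64-bit floats.
packed = n_rows * 10 * 8
print("packed: %.0f KB" % (packed / 1024))  # ~1562 KB

# One row as a pseudo-dict, roughly the shape revscoring uses.
row = {"threshold": 0.724, "match_rate": 0.2, "filter_rate": 0.8,
       "recall": 0.9, "!recall": 0.95, "precision": 0.3, "!precision": 0.97,
       "accuracy": 0.8, "fpr": 0.1, "f1": 0.45}
per_row = sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row.values())
print("as dicts: %.1f MB" % (n_rows * per_row / 1024 ** 2))
# Roughly 10-15 MB per label on CPython 3.x, and that multiplies across
# labels, models, and every uwsgi/celery process holding its own copy.
```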
[21:57:48] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&panelId=3&fullscreen&from=1507236425413&to=now&refresh=10s
[21:59:14] I'll let the celery queue run down this time
[21:59:21] and see if I need to restart services
[21:59:25] awight: any clue what that is on the non-admin URL? (no access)
[21:59:47] oops
[21:59:48] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1507236425413&to=now&refresh=10s
[21:59:52] ty
[22:01:18] awight, not sure what you are showing me
[22:01:52] halfak: That most of the server nodes shut down and did nothing during the overload condition
[22:02:36] Shutting down is a surprise.
[22:03:02] If you did something to redis, that could explain it.
[22:03:12] It's important that redis doesn't get in a weird state
[22:03:20] Didn't you delete the queue a while back?
[22:08:00] Yeah... so a model doesn't take up that much space in memory. Hmm
[22:10:52] I haven't touched the queue today
[22:11:32] halfak: watch out for the n.b. in that datatype.itemsize hack. Apparently it only measures the size of the array elements, but not the array object attributes. Or something.
[22:11:45] halfak: How can I get redis back to a pristine state?
[22:12:04] flush all, restart workers and uwsgi
[22:12:05] Pretty sure this overload crash is something I've been seeing since June
[22:12:10] kk
[22:12:14] I've not seen workers stop working
[22:12:28] Except for that one instance of crazy downtime.
[22:12:30] I can flush from the redis console, then I'll deploy from tin, perhaps, to restart all services?
[22:12:38] Overload is intended.
[22:12:44] Right
[22:13:57] ok. flushed and restarted.
[22:15:14] halfak: "It's largely inspired by the false positive reporting work that currently occurs by hand on wiki pages." does that work for you?
[22:15:30] Fiddin' to send
[22:15:30] here goes another stress test.
[22:15:57] Keegan +1
[22:16:06] * Keegan thumbs up
[22:24:49] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662735 (10Halfak) With one model loaded (enwiki damaging): ``` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAN...
[22:42:04] Baffling
[22:42:17] The load isn't well balanced
[22:42:23] servers haven't actually shut down
[22:42:45] mmm, ah, I'm wrong about what I'm doing.
[22:43:11] I'm hitting the wsgi endpoint on each server, but that's not where the actual work is performed.
[22:49:06] It's what the load balancer does.
[22:51:11] ooh, I'm looking at one of the machines currently not doing work (ores1003), and it was showing 1 celery thread at 100%, all others at 0. Now it's switched to loading all threads.
[22:52:38] Funky.
[22:57:33] So I've determined that info is probably taking up a bunch of space. But I don't know why that is. I'll be picking that up again tomorrow.
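What "flush all, restart workers and uwsgi" (22:12) looks like scripted with redis-py. In practice this was a FLUSHALL from the redis console plus a redeploy from tin; the host and service names below are assumptions, not the production commands.

```python
import subprocess
import redis

# Wipe the celery queue and the score cache together, so no stale
# state survives into the restarted workers.
r = redis.StrictRedis(host="localhost", port=6379)
r.flushall()

# Restart the services so every worker reconnects with a clean slate.
for service in ("celery-ores-worker", "uwsgi-ores"):
    subprocess.check_call(["sudo", "service", service, "restart"])
```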
[22:57:46] I have some ideas related to __slots__ :)
[22:57:47] o/
[23:03:55] 10Scoring-platform-team, 10MediaWiki-Vagrant, 10MediaWiki-extensions-ORES: Can't enable ores role in vagrant - https://phabricator.wikimedia.org/T177555#3662836 (10Mooeypoo)
[23:13:59] (03PS1) 10Sbisson: WLFilters: Temporarily stop respecting hideNonDamaging on WL with beta feature [extensions/ORES] - 10https://gerrit.wikimedia.org/r/382627
[23:22:29] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3662930 (10awight) Ran a few tests today, and found that the filehandle issue is not solved. The celery service died on several node...
[23:23:04] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3662933 (10awight)
[23:23:07] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3662931 (10awight) 05Resolved>03Open Reopening, I saw this error kill the celery worker on a...
[23:26:35] 10Scoring-platform-team, 10MediaWiki-Vagrant, 10MediaWiki-extensions-ORES: Can't enable ores role in vagrant - https://phabricator.wikimedia.org/T177555#3662836 (10Reedy) Have you done a git update recently? Is it working successfully? The problem doesn't look to be ores, it looks to be vagrant; it's not ru...
[23:52:14] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3663066 (10Halfak) I just tried the article quality model for enwiki and found a much larger RES ``` PID USER PR NI VIRT RES...
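The __slots__ idea from 22:57, sketched: declaring slots removes each instance's per-object __dict__, which is most of the per-row cost when threshold rows are stored as objects. The classes below are illustrative, not revscoring's actual statistics types.

```python
import sys

class SlottedRow:
    # Fixed attribute slots: no __dict__ is allocated per instance.
    __slots__ = ("threshold", "precision", "recall", "f1")

    def __init__(self, threshold, precision, recall, f1):
        self.threshold = threshold
        self.precision = precision
        self.recall = recall
        self.f1 = f1

class PlainRow:
    def __init__(self, threshold, precision, recall, f1):
        self.threshold = threshold
        self.precision = precision
        self.recall = recall
        self.f1 = f1

s = SlottedRow(0.724, 0.3, 0.9, 0.45)
p = PlainRow(0.724, 0.3, 0.9, 0.45)
print(sys.getsizeof(s))                              # just slot pointers
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # instance + its dict
# The slotted version is typically a few times smaller per instance,
# before even counting the boxed float values themselves.
```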