[00:10:51] <mutante>	 the package for Icelandic will now be installed by puppet. i checked on ores1001 but that's it 
[00:11:01] <mutante>	 (meaning the others will follow)
[01:13:06] <wikibugs>	 Scoring-platform-team (Current), editquality-modeling, Patch-For-Review, User-Ladsgroup, artificial-intelligence: Train/test reverted model for Icelandic - https://phabricator.wikimedia.org/T181099#3836349 (awight) aspell-is is available on scb* now, so we're ready to deploy on Thursday.
[01:51:05] <xinbenlv>	 Hi wikimedia-ai team, is there an official rate limit for querying ORES service?
[01:51:34] <xinbenlv>	 I can't find it on https://ores.wikimedia.org
[02:22:06] <halfak>	 xinbenlv, officially, no more than two parallel connections at a time. 
[02:23:53] <halfak>	 I have to run.  I'll be back online around 1500 UTC
[02:23:55] <halfak>	 o/
[03:02:52] <Zppix>	 xinbenlv: im here if you need anything
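A minimal sketch of how a client could stay within the two-parallel-connection limit halfak quotes above. The function name `score_all` and the `fetch_score` callable are hypothetical and stand in for whatever request code the client already has; no ORES endpoint or response format is assumed.

```python
from concurrent.futures import ThreadPoolExecutor

# The limit quoted in the channel: no more than two parallel connections.
MAX_PARALLEL = 2


def score_all(rev_ids, fetch_score, max_parallel=MAX_PARALLEL):
    """Call fetch_score(rev_id) for every id, never more than max_parallel at once.

    fetch_score is a hypothetical callable supplied by the caller (for example
    a function that performs one HTTP request and returns the parsed result).
    Results come back in the same order as rev_ids.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # The pool size is what caps concurrency: at most max_parallel
        # fetch_score calls are in flight at any moment.
        return list(pool.map(fetch_score, rev_ids))
```

Capping the pool size rather than sleeping between requests keeps throughput as high as the limit allows while making it hard to exceed by accident.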
[06:17:05] <travis-ci>	 eisenhaus335/wikilabels#107 (patch-1 - bed0158 : eisenhaus335): The build has errored. https://travis-ci.org/eisenhaus335/wikilabels/builds/316266812
[06:34:50] <travis-ci>	 eisenhaus335/wikilabels#107 (patch-1 - bed0158 : eisenhaus335): The build has errored. https://travis-ci.org/eisenhaus335/wikilabels/builds/316266812
[10:39:02] <travis-ci>	 eisenhaus335/wikilabels#107 (patch-1 - bed0158 : eisenhaus335): The build has errored. https://travis-ci.org/eisenhaus335/wikilabels/builds/316266812
[11:15:52] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3836942 (akosiaris) >>! In T181661#3834939, @awight wrote: > Looks like I'm getting the same error. >  >> commit b67bba77acb...
[11:28:36] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3836975 (awight) Not sure if this is related, but now I'm seeing a deploy-local failure with no diagnostics other than error...
[11:34:25] <wikibugs>	 Scoring-platform-team, MediaWiki-extensions-ORES, Patch-For-Review, Regression, and 4 others: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354#3821279 (Johan) Asking because #user-notice has been added – is this problem isolated to fa.wikipedia...
[11:35:27] <wikibugs>	 Scoring-platform-team, ORES: Exception killing threads in ORES celery workers - https://phabricator.wikimedia.org/T182862#3837001 (awight)
[11:36:52] <wikibugs>	 Scoring-platform-team, MediaWiki-extensions-ORES, Patch-For-Review, Regression, and 4 others: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354#3837012 (awight) @Johan This only affected fawiki, to my knowledge.  Feel free to change tags as nece...
[11:37:25] <wikibugs>	 Scoring-platform-team, MediaWiki-extensions-ORES, Patch-For-Review, Regression, and 4 others: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354#3837013 (Ladsgroup) It was only in fawiki but user-facing, if you think it's not needed feel free to...
[11:41:07] <wikibugs>	 Scoring-platform-team, MediaWiki-extensions-ORES, Patch-For-Review, Regression, and 3 others: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354#3837014 (Johan) Thanks. The reason I'm asking is that Tech News is a very ineffective way to reach ou...
[11:54:59] <wikibugs>	 (PS1) Ladsgroup: Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243
[11:55:38] * awight lunges to CR
[11:55:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243 (owner: Ladsgroup)
[11:56:06] <wikibugs>	 (PS2) Ladsgroup: Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243
[11:57:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243 (owner: Ladsgroup)
[12:00:07] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837059 (mmodell) `scap deploy-log -v` reveals more:  ``` 11:27:10 [ores1001.eqiad.wmnet] Unhandled error: Traceback (most r...
[12:06:09] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837081 (awight) Those revisions aren't in gerrit.  I think the github -> gerrit mirroring broke when we were messing around...
[12:14:37] <wikibugs>	 Scoring-platform-team, Phabricator: Access request: Phabricator Repository-Admins - https://phabricator.wikimedia.org/T182864#3837084 (awight)
[12:18:38] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837099 (awight) The Phabricator control panels look happy, https://phabricator.wikimedia.org/source/editquality/manage/uris...
[12:24:37] <wikibugs>	 Scoring-platform-team, ORES, Scap: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837105 (awight)
[12:24:53] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797363 (awight) Oops—we aren't expecting this repo to be mirrored to gerrit.  So the surprise is that the revision exists i...
[12:26:06] <wikibugs>	 Scoring-platform-team, Phabricator, Repository-Admins: Access request: Phabricator Repository-Admins - https://phabricator.wikimedia.org/T182864#3837122 (Aklapper)
[12:31:09] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837143 (mmodell) @awight: yeah, I'm getting to the bottom of it now. The issue is that the commit does not exist on a local...
[12:43:23] <wikibugs>	 (PS3) Ladsgroup: Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243
[12:44:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243 (owner: Ladsgroup)
[12:45:27] <wikibugs>	 Scoring-platform-team, ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3837174 (awight)
[12:45:33] <wikibugs>	 Scoring-platform-team, ORES, Operations, Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837171 (awight) Open>Resolved a:awight Using a workaround for T182865, where we go into submodules and checkout...
[12:48:15] <wikibugs>	 Scoring-platform-team, JADE, Design: Design conceptual prototype of JADE integration with MediaWiki - https://phabricator.wikimedia.org/T182829#3837175 (Pginer-WMF) >>! In T182829#3835955, @Halfak wrote: > I had a conversation with @jmatazzoni and @Catrope.  They suggested that they'd be interested i...
[12:49:26] <awight>	 akosiaris: Is ores1004 okay?  A puppet change from yesterday hasn’t propagated there yet.
[12:50:03] <awight>	 Specifically, the aspell-is package added by https://gerrit.wikimedia.org/r/#/c/398078/
[12:54:18] <awight>	 The last Puppet run was at Tue Dec 12 15:34:12 UTC 2017 (2713 minutes ago). Puppet is disabled. akosiaris testing
[12:54:29] <awight>	 no problem.  As long as there’s an explanation :)
[12:58:03] <wikibugs>	 (PS4) Ladsgroup: Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243
[13:02:12] <wikibugs>	 (PS5) Ladsgroup: Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243
[13:15:19] <wikibugs>	 Scoring-platform-team, ORES, Scap: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837105 (mmodell) Ok I'm going to try to summarize what we learned by quite a lot of manual poking at the `ores1001` target to form a hypothesis and...
[13:16:16] <wikibugs>	 Scoring-platform-team, ORES, Scap: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837277 (mmodell) So I'm going to figure out what needs to change to make git use the right refspec in the submodule update in deploy-local. Probabl...
[13:16:23] <wikibugs>	 Scoring-platform-team, ORES, Graphite: ORES web worker memory usage graph is meaningless - https://phabricator.wikimedia.org/T182871#3837278 (awight)
[13:16:44] <wikibugs>	 Scoring-platform-team, ORES, Scap, Release-Engineering-Team (Kanban): Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837297 (mmodell) p:Triage>High a:mmodell
[13:31:12] <icinga2-wm>	 PROBLEM - puppet on ores-redis-01 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:32:42] <icinga2-wm>	 PROBLEM - puppet on ores-web-05 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:35:25] <icinga2-wm>	 PROBLEM - puppet on ores-worker-10 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:36:09] <wikibugs>	 Scoring-platform-team, ORES, Graphite: ORES timeout error graph is incorrect - https://phabricator.wikimedia.org/T182876#3837379 (awight)
[13:38:31] <wikibugs>	 Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3837415 (awight) Ran another test: https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=15132561...
[13:39:29] <awight>	 paladox: Remind me later, let’s try to make the icinga2 warnings more clear when failures are happening on non-production machines...
[13:39:43] <Zppix>	 awight: ok
[13:40:51] <Zppix>	 We can rename the hosts in the warnings to do labs<hostname>
[13:52:02] <wikibugs>	 (CR) Awight: [C: 2] "Great work!" [extensions/ORES] - https://gerrit.wikimedia.org/r/398243 (owner: Ladsgroup)
[13:52:15] <awight>	 Zppix: +1 good idea, thanks!
[13:52:17] <awight>	 biab
[13:53:03] <Zppix>	 paladox: we need to rename ores hostnames in icinga2 to labs<hostname> to denote non prod
[13:53:07] <Zppix>	 As per awight
[13:53:51] <wikibugs>	 (Merged) jenkins-bot: Fully PSR-4'd extension [extensions/ORES] - https://gerrit.wikimedia.org/r/398243 (owner: Ladsgroup)
[13:57:19] <wikibugs>	 Scoring-platform-team, VPS-project-icinga2, User-Zppix: Rename hostnames in warnings to denote non prod - https://phabricator.wikimedia.org/T182880#3837455 (Zppix)
[14:01:42] <icinga2-wm>	 RECOVERY - puppet on ores-redis-01 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[14:02:13] <icinga2-wm>	 RECOVERY - puppet on ores-web-05 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[14:04:54] <icinga2-wm>	 RECOVERY - puppet on ores-worker-10 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[14:45:20] <Zppix>	 halfak:  just wanted to let you know of T182880
[14:45:20] <stashbot>	 T182880: Rename hostnames in warnings to denote non prod - https://phabricator.wikimedia.org/T182880
[14:46:12] <wikibugs>	 Scoring-platform-team, VPS-project-icinga2, User-Zppix: Rename hostnames in warnings to denote non prod - https://phabricator.wikimedia.org/T182880#3837455 (Halfak) Can you give me an example message and what the desired new message would be?
[14:47:22] <wikibugs>	 Scoring-platform-team, VPS-project-icinga2, User-Zppix: Rename hostnames in warnings to denote non prod - https://phabricator.wikimedia.org/T182880#3837606 (Zppix) >>! In T182880#3837599, @Halfak wrote: > Can you give me an example message and what the desired new message would be? <icinga2-wm> RECOV...
[14:50:19] <codezee>	 halfak: if it looks good to you can we merge multilabel to master? so that we at least have multilabel classification. Addressing time issue in the same patch would make it unwieldy
[14:50:44] <halfak>	 +1
[14:50:52] <halfak>	 Zppix, replied in the task
[14:56:23] <wikibugs>	 Scoring-platform-team, VPS-project-icinga2, User-Zppix: Rename hostnames in warnings to denote non prod - https://phabricator.wikimedia.org/T182880#3837627 (Halfak) How about something like this?    <icinga2-wm> RECOVERY - puppet on ores-redis-01.eqiad.wmflabs is OK: OK: Puppet is currently enabled,...
[15:03:00] <halfak>	 codezee, I see some issues with the multilabel PR but they won't affect current work.  So I'm merging. 
[15:03:04] <halfak>	 I'll submit a follow-up.  
[15:03:08] <halfak>	 It's my own fault
[15:03:14] <halfak>	 You followed my suggestions :) 
[15:03:32] <Zppix>	 halfak: that could be done
[15:04:16] <wikibugs>	 Scoring-platform-team, VPS-project-icinga2, User-Zppix: Rename hostnames in warnings to denote non prod - https://phabricator.wikimedia.org/T182880#3837648 (Zppix) a:Zppix
[15:04:18] <halfak>	 Nice work codezee 
[15:04:35] <Zppix>	 halfak: ill get a patch out for ya later today
[15:07:25] <codezee>	 halfak: thanks, there was significant help from you on abstracting binarizer stuff :)
[15:08:24] <codezee>	 btw, when I saw your PR, there were already a lot of changes exactly the same as yours locally which i was ready to push ;) just except the binarizer thing
[15:08:39] <travis-ci>	 wiki-ai/revscoring#1364 (master - 176c032 : Aaron Halfaker): The build was broken. https://travis-ci.org/wiki-ai/revscoring/builds/316465316
[15:09:31] <halfak>	 codezee, sorry about that.  I got overly excited and didn't see you online so I went for it. 
[15:09:33] <codezee>	 halfak: what kind of issues?
[15:09:48] <halfak>	 codezee, in the label-config file. 
[15:09:57] <halfak>	 Using keys as the labels only allows us to have strings :/
[15:12:50] <codezee>	 halfak: you mean in read_labels_and_population_rates ?
[15:13:10] <halfak>	 right
[15:13:28] <halfak>	 I'll have a proposed change that I think could make sense shortly.  It's an easy switch. 
[15:13:50] * codezee goes about updating revscoring at the numerous places he has branched for testing...
[15:13:58] <halfak>	 :) 
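A small illustration of the limitation halfak describes, assuming a JSON-style label config. The log does not show the actual file format or the structure of the follow-up change, so both layouts and the population rates below are hypothetical.

```python
import json

# Hypothetical layout 1: labels used as mapping keys.
# JSON object keys are always strings, so non-string labels lose their type.
keyed = {True: 0.03, False: 0.97}
round_tripped = json.loads(json.dumps(keyed))
print(round_tripped)  # the boolean labels come back as the strings "true"/"false"

# Hypothetical layout 2: each label stored as a value in a list of entries.
# Values keep their JSON type, so booleans (or numbers) survive the round trip.
listed = [
    {"label": True, "population_rate": 0.03},
    {"label": False, "population_rate": 0.97},
]
restored = json.loads(json.dumps(listed))
print([entry["label"] for entry in restored])  # labels are still booleans
```

Storing the label as a value rather than a key is one way to support non-string labels; whether it matches the change halfak later proposed is not stated in the log.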
[15:29:12] <awight>	 halfak: o/
[15:29:26] <halfak>	 o/ awight 
[15:29:30] <awight>	 I was able to deploy this morning, I think twentyafterfour found the last (recently) broken piece
[15:29:35] <awight>	 So I ran a stress test
[15:29:39] <halfak>	 Oh great!  
[15:29:42] <awight>	 Very strange results, want to try to interpret?
[15:29:46] <halfak>	 Sure. 
[15:29:49] <awight>	 https://phabricator.wikimedia.org/T182249#3837415
[15:36:43] <awight>	 In other interesting news, https://phabricator.wikimedia.org/T182876 and from my reading of the code, the only way that can happen is if we’re aborted by the outer util.timeout
[15:36:44] <halfak>	 What do you mean by "sine wave"
[15:36:51] <halfak>	 I've never seen that discussed before. 
[15:37:02] <awight>	 I’ve noticed it before but haven’t documented.
[15:37:09] <awight>	 Do you have the graph open?
[15:37:27] <awight>	 https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1513256100000&to=1513257600000
[15:37:38] <awight>	 It’s very pronounced, and affects all the machines at the same time.
[15:38:00] <halfak>	 Looks like the wave chills out over the course of 10 minutes. 
[15:38:10] <awight>	 right?
[15:38:13] <awight>	 it’s bizarre
[15:38:20] <awight>	 I’m loath to speculate what that’s about
[15:38:36] <halfak>	 "Possibly returning immediately with TimeoutError" :/
[15:38:41] <halfak>	 Doesn't seem possible.
[15:39:35] <awight>	 Please comment about why… hehe yeah I tried not to muddy the water with guesswork and that’s an example of why I should avoid it while making observations.
[15:39:52] <halfak>	 awight, did you deploy new ORES -- the one that includes "features" in the task_id?
[15:39:55] <awight>	 What are the other 120 uwsgi workers doing...
[15:39:57] <awight>	 yes
[15:40:23] <awight>	 Think it’s worth running with an older stable revision?
[15:40:44] <halfak>	 We should only be *able* to run 150 web workers in parallel with this setup. 
[15:41:04] <halfak>	 Well... then again, we should be *using* the other workers -- they'll just be blocked on celery. 
[15:41:08] <awight>	 K I was thinking about that too, cos they are 1:1 plugged into Celery workers
[15:41:16] <halfak>	 awight, na.  I think this is a good version. 
[15:41:20] <awight>	 ok
[15:41:51] <halfak>	 Why only running against 7 machines?
[15:42:29] <halfak>	 I'm wondering if our limit is celery/redis 
[15:42:40] <halfak>	 We are sending a good chunk of data to our celery workers. 
[15:42:48] <halfak>	 We could move IO to a celery task. 
[15:42:52] * halfak thinks about that. 
[15:43:06] <awight>	 I left ores1004 out because it’s damaged.  puppet is turned off, so it doesn’t have aspell-is, and due to new code, both services are down.
[15:43:09] <halfak>	 I'm not sure if that helps at all. 
[15:43:19] <halfak>	 Sure.  But why not run 8 servers?
[15:43:39] <awight>	 I’d be interested in adding instrumentation to get whatever data we need before changing code...
[15:43:52] <halfak>	 What instrumentation would you want?
[15:44:10] <awight>	 I left out ores1001 cos I thought you had done so historically, something about redis being on that machine.
[15:44:20] <awight>	 there were 8 celery nodes running, though.
[15:44:32] <halfak>	 Nope.  We've been running it in tests. Looks like redis doesn't make that much of a dent
[15:44:41] <awight>	 so the 7 uwsgi endpoints made work available for 8 celery machines.
[15:44:44] <halfak>	 Ahh yeah. 
[15:45:03] <awight>	 OK cool, I’ll add ores1001 to the endpoints.  The extra uwsgi might be significant?
[15:45:04] <halfak>	 theoretically we should have been *able* to max-out celery
[15:45:11] <awight>	 right…
[15:45:12] <halfak>	 Na.  I don't think so.  
[15:45:51] <awight>	 I donno what extra instrumentation we would want, yet.  One thought is to watch the celery task queue size.  But that looks really weird on production so I’m not sure we’re looking at the right thing.
[15:46:02] <awight>	 I think there are abandoned tasks that never expire.
[15:46:16] <awight>	 Are there any other internal queues?
[15:46:23] <halfak>	 5.6k scores per minute is pretty good.  How do you feel about our overload dynamics?
[15:46:30] <halfak>	 awight, not that I know of
[15:46:56] <halfak>	 Looks like we *are* overloading.  So we're riding the line at this request rate. 
[15:46:58] <awight>	 We’re still being limited at the web layer, that’s why there were almost no overload events.  I was set to 60,000k req/min theoretically.
[15:47:26] <halfak>	 Good point. 
[15:47:28] <awight>	 Yeah I think #busy uwsgi==#celery seems to be why it was possible to sometimes hit overload.
[15:47:50] <halfak>	 We need 600 more active web requests than we have active celery workers in order to overload. 
[15:48:05] <awight>	 That should have happened in a matter of seconds.
[15:48:18] <halfak>	 Right.  I'm just saying it's not the uwsgi workers that are limiting that. 
[15:48:53] <awight>	 yes.  150 uWSGIs are blocking until they get a response back, but I want to understand the other 120.
[15:49:01] <awight>	 er mas o menos
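The overload arithmetic halfak states above can be written out as a short sketch. It assumes overload starts once in-flight web requests exceed active celery workers by 600 (treating "600 more" as an inclusive boundary is a guess), and it takes 150 as the active celery worker count and 150 + 120 as the busy uwsgi count from the figures quoted in this discussion. All names are hypothetical and none of this is taken from the ORES code.

```python
# Assumption from halfak's message: overload needs 600 more active web
# requests than there are active celery workers.
BACKLOG_ALLOWANCE = 600


def overload_threshold(active_celery_workers, backlog_allowance=BACKLOG_ALLOWANCE):
    """Number of in-flight web requests at which overload would begin."""
    return active_celery_workers + backlog_allowance


def is_overloaded(active_web_requests, active_celery_workers,
                  backlog_allowance=BACKLOG_ALLOWANCE):
    """True once web requests reach the threshold (inclusive, by assumption)."""
    return active_web_requests >= overload_threshold(
        active_celery_workers, backlog_allowance
    )


print(overload_threshold(150))       # 750
print(is_overloaded(150 + 120, 150))  # False: 270 busy web workers is well short of 750
```

Under these assumptions, the roughly 270 busy uwsgi workers being discussed are far below the point where overload responses would be expected, which is consistent with the observation that almost no overload events occurred.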
[15:49:15] <awight>	 What is a “busy” web worker?
[15:49:29] <awight>	 That’s a metric coming from uWSGI upstream, right?
[15:49:29] <halfak>	 Good Q.  I can tell you what I assume. 
[15:49:47] <halfak>	 A web worker is a single python process.  While that process is "handling a single request" it is busy. 
[15:49:52] <halfak>	 So from request to response.
[15:49:56] <awight>	 fwiw http://uwsgi-docs.readthedocs.io/en/latest/Metrics.html
[15:49:59] <halfak>	 It is coming from uwsgi
[15:50:22] <halfak>	 So one thing I am worried about is -- could we be drawing from celery's cache somehow?  
[15:50:31] * halfak looks at more graphs. 
[15:51:19] <halfak>	 Yeah...no
[15:51:20] <halfak>	 Hmm
[15:51:36] <awight>	 We’re using a metric uwsgi.core.busy_workers
[15:52:38] <awight>	 One thing.  Let me try parallel tests just in case.
[15:53:13] <awight>	 akosiaris: Mind reenabling puppet on ores1004?
[15:53:26] <awight>	 I think we’ve solved the blocker.
[15:53:34] <halfak>	 I'm going to duck back into label-config land while you try another test. 
[15:53:41] <halfak>	 I think parallel stress tests is a good idea. 
[15:53:56] <akosiaris>	 awight: done
[15:54:08] * awight missed a beat
[15:54:11] <awight>	 akosiaris: Thanks!
[15:54:35] <awight>	 halfak: I have 4 hours of things that prevent me from being production, so no rush.
[15:54:45] <awight>	 *productive
[15:55:31] <awight>	 halfak: Think it’s worthwhile to distinguish between the different types of TimeoutError?
[15:56:29] <halfak>	 awight, looks like ores.ores*.uwsgi.core.overloaded.count might give us some indication
[15:56:34] <awight>	 ooh
[15:56:35] <awight>	 ty
[15:56:40] <awight>	 I’ll make a graph
[15:57:07] <awight>	 [switching networks]
[15:58:14] <halfak>	 awight, did you find any definition of these uwsgi metrics?
[15:58:19] <halfak>	 I'm tired of guessing at names 
[15:58:45] <halfak>	 http://uwsgi-docs.readthedocs.io/en/latest/Metrics.html#officially-registered-metrics
[15:58:57] <halfak>	 Best I have ^ but mostly useless
[16:02:16] * halfak begins to go insane
[16:06:04] <halfak>	 I'm not seeing any meaningful increase in ores1003.uwsgi.workers.*.requests.count during the stress test
[16:06:05] <halfak>	 WTF
[16:07:29] <awight>	 overloaded.count is interesting.  About 20 of whatever that unit is, on each machine.
[16:07:38] <awight>	 Sorry, just caught up on backscroll from logs.
[16:07:51] <awight>	 No, there seems to be no document-effing-tation
[16:10:15] <awight>	 grr, /me clones https://github.com/unbit/uwsgi
[16:14:35] <awight>	 https://github.com/unbit/uwsgi/blob/master/core/metrics.c
[16:45:04] * halfak is in meetings for the next 2.5 hours :( 
[16:49:18] <awight>	 worker.*.core.*.exceptions.count is at 20 fwiw
[16:49:56] <awight>	 wish I knew what that meant either
[16:50:27] <awight>	 meh all these metrics look the same.
[16:54:19] <awight>	 Rats, ores1001 redis isn’t emitting metrics.
[16:55:47] <Zppix>	 awight: ill be working on that icinga thing now!
[16:56:16] <awight>	 Zppix: That’s awesome, you’ll be single-handedly responsible for lowering this channel’s blood pressure 20 points.
[16:57:20] <wikibugs>	 Scoring-platform-team, JADE, Design: Design conceptual prototype of JADE integration with MediaWiki - https://phabricator.wikimedia.org/T182829#3837967 (Halfak) Hi @Pginer-WMF!  Thank you for taking a look.  While I'm really excited about using this data for model auditing (false-positive/true-positi...
[16:58:59] <apergos>	 do you want the messages to say cloud- something instead of labs-something?  just because of the rebranding...
[17:00:36] <Zppix>	 apergos: this is for icinga2-wm
[17:00:46] <Zppix>	 awight: cloud or labs?
[17:01:10] <Zppix>	 Or i can go generic "non-prod"
[17:01:26] <apergos>	 yes, for the alert messages
[17:01:43] <apergos>	 I don't know what's better, I just know that referring to things as "labs" is slowly being phased out
[17:02:21] <Zppix>	 apergos: i know but cloudvps doesnt really roll off the tongue :P
[17:04:29] <apergos>	 wmcloud? :-P    dunno, maybe the folks in their channel will have a better idea
[17:05:21] <Zppix>	 I think ill do wmcloud
[17:05:22] <apergos>	 I'm definitely butting in where I have no say, so bear that in mind (except that we read these alerts too) 
[17:05:24] <apergos>	 ok
[17:05:24] <awight>	 “cloud” might not be nebulous enough
[17:06:06] <wikibugs>	 Scoring-platform-team, JADE, Design: Design conceptual prototype of JADE integration with MediaWiki - https://phabricator.wikimedia.org/T182829#3837977 (Halfak) FWIW, I think an important implication of the point I'm making above is that users will want to provide a structured judgement of *every edi...
[17:07:01] <Zppix>	 awight: i can do whatever... i just need confirmation on what im naming this otherwise i may just end up going with host.ores.eqiad.wmflabs
[17:08:11] <awight>	 Zppix: how about .experimental
[17:08:24] <awight>	 We don’t have a good name for this cluster yet, but that should at least keep heart rates down
[17:11:12] <Zppix>	 Hmmm ok
[17:15:36] <awight>	 halfak: parallel test harnesses are able to break through the suspicious 150 barrier, check out the graphs.
[17:16:00] <awight>	 I started the second tester at 17:11:03
[17:20:26] <halfak>	 cool!
[17:20:49] <Zppix>	 halfak: awight  https://gerrit.wikimedia.org/r/398287
[17:20:52] <Zppix>	 Please review
[17:21:20] <awight>	 halfak: However, we didn’t even nudge Celery performance.  Good thing to know, at least!
[17:22:45] <halfak>	 Interesting.  I suspect redis is our bottleneck now.  I wonder if we're dumping serialized pre-feature-extraction data into redis records. 
[17:22:52] <halfak>	 That might be slow-ish.
[17:23:03] <halfak>	 Oh wait... though would our celery queue fill up?
[17:23:04] <halfak>	 hmm
[17:24:26] <awight>	 I’m watching the celery queue size, and it never goes over 1000
[17:30:15] <awight>	 What.  Scores processed is falling for all machines.
[17:30:31] <awight>	 Glad you talked me into stress-testing until we get this right.
[17:32:23] <wikibugs>	 Scoring-platform-team, VPS-project-icinga2, User-Zppix: Rename hostnames in warnings to denote non prod - https://phabricator.wikimedia.org/T182880#3838053 (Zppix) Open>Resolved
[17:32:36] <icinga2-wm>	 CUSTOM - ping4 on ores-web-05 is OK: PING OK - Packet loss = 0%, RTA = 2.31 ms paladox Test awight zppix
[17:32:50] <awight>	 darn!
[17:33:14] <icinga2-wm>	 CUSTOM - Host ores-web-05 is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms zppix Testing new display names
[17:33:29] <Zppix>	 Ffs
[17:33:52] <awight>	 Zppix: maybe the alert template has <shorthost> hardcoded.  I was saying in Gerrit, it would be perfect to just change that template to <fqdn>
[17:33:55] <awight>	 (variable names not included)
[17:34:09] <Zppix>	 Hmm
[17:34:18] <Zppix>	 Paladox ^
[17:34:35] <paladox>	 it's because it needs to be renamed in
[17:34:36] <paladox>	 object Host "ores-lb-02" {
[17:34:40] <paladox>	 which is what the script uses
[17:34:54] <awight>	 can we change the alert script though?
[17:35:53] <Zppix>	 Possibly
[17:36:00] <Zppix>	 Hmm
[17:36:12] <Zppix>	 Paladox?
[17:36:51] <paladox>	 Depends because display_name is not used every where
[17:36:58] <paladox>	 so possibly will be hit by undefined
[17:37:01] <wikibugs>	 Scoring-platform-team, Wikilabels, Google-Code-in-2017: Provide a pytest for database of wikilabels - https://phabricator.wikimedia.org/T179014#3838083 (Ladsgroup) This definitely needs more work, one thing can be that the database has not been properly set up.
[17:38:46] <icinga2-wm>	 CUSTOM - check load on ORES-lb02.Experimental is OK: OK - load average: 0.08, 0.05, 0.01 paladox testing awight zppix
[17:38:55] <awight>	 nice one!
[17:41:19] <Zppix>	 Yay
[18:01:34] <codezee>	 halfak: thanks for the invite, those were interesting discussions... :)
[18:01:45] <halfak>	 \o/  glad to have you
[18:01:57] <halfak>	 I imagine it's nice to have a window into the researchy discussions ^_^
[18:02:15] <Zppix>	 halfak: hows that project?
[18:02:18] <Zppix>	 Reporter
[18:02:22] <codezee>	 maybe i'll also try to contribute from next time... \o/
[18:10:19] <wikibugs>	 Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838216 (awight) Ran a tricky test, in which I stepped up from 1 to 3 test harnesses, then back down. * tester...
[18:35:51] <wikibugs>	 Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838258 (Halfak) Based on this report, I think we should go live with this.  Any follow-up stress testing can...
[18:37:01] <halfak>	 I'm not feeling great today.  I was trying to tough it out, but I think I should lay down. 
[18:37:14] <halfak>	 I'll be AFK but available via gchat/hangouts
[18:37:22] <Zppix>	 Fcc just voted against net neutrality
[18:39:04] <halfak>	 Fuuu
[18:39:12] <halfak>	 codezee, ^ see that PR I put together
[18:39:23] <halfak>	 Note that it changes the structure of labels-config. 
[18:39:25] <halfak>	 OK I'm out 
[18:39:33] * codezee looking
[18:39:58] <awight>	 halfak: More strange results in the next comment above ^
[18:40:49] <awight>	 halfak: Just saw your note.  So your assumption is that we’ll be unable to figure this out in a reasonable amount of time?  That does seem fair.
[18:41:08] <travis-ci>	 wiki-ai/revscoring#1365 (labels_config - feb1caf : halfak): The build passed. https://travis-ci.org/wiki-ai/revscoring/builds/316564891
[18:46:55] <wikibugs>	 Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838288 (awight) I'm happy with that.  It looks like it's going to be difficult to break through this ceiling,...
[19:04:26] <awight>	 fyi, scb1001 is still down
[19:04:49] <awight>	 k back now
[19:05:23] <awight>	 everything looks good.  I’m running down the street to get better wifi
[19:13:02] <awight>	 halfak: What do you think about my guesstimated changes, maybe 150 -> 135 celery workers and 230 -> 160 web workers?
[19:13:40] <awight>	 akosiaris was saying we should tune for capacity and not for the hardware, but IMO we hit a hardware limit and have to scale back accordingly.
[19:14:51] <Zppix>	 awight: he went to lay down aaron isnt feeling well
[19:15:13] <Zppix>	 He said he can be contacted via gchat or hangouts
[19:15:17] <awight>	 ty
[19:15:21] <Zppix>	 Np
[19:27:02] <wikibugs>	 Scoring-platform-team, ORES, Graphite: ORES timeout error graph is incorrect - https://phabricator.wikimedia.org/T182876#3838371 (awight) There's one code path that can throw a TimeoutError without adding to this metric, it's the outer timeout in ores/util.py.  Interesting that we're hitting this cod...
[19:36:41] <Zppix>	 awight: if you're deploying today i can be around to assist if needed
[19:44:05] <wikibugs>	 Scoring-platform-team, ORES, Graphite: ORES timeout error graph is incorrect - https://phabricator.wikimedia.org/T182876#3838430 (awight) The last comment was wrong, I see how the timeout is caught and metrics are recorded.  I currently can't find any code paths to explain the missing metrics.
[20:27:00] <wikibugs>	 Scoring-platform-team, ORES, Scap: scap deploy --service-restart doesn't affect ORES celery - https://phabricator.wikimedia.org/T182912#3838509 (awight)
[20:28:30] <awight>	 halfak: :D: pssh -h ores-hosts -P "ps auxxww|grep celery|wc -l"
[20:32:01] <icinga2-wm>	 PROBLEM - puppet on ORES-redis01.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:32:09] <icinga2-wm>	 PROBLEM - puppet on ORES-web05.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:42:21] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3838537 (10thcipriani)
[20:55:33] <Zppix>	 awight: those puppet errors normal?
[20:56:12] <awight>	 I’m just trying to give icinga2 something to think about
[20:56:17] <awight>	 No, I have no clue though
[20:56:26] <awight>	 thanks to the .experimental suffix, we don’t have to worry :D
[20:56:53] <Zppix>	 awight: if you want I can silence them, I'm in the UI right now trying to figure something out
[20:57:11] <awight>	 Zppix: it’s your call!
[20:57:13] <awight>	 ty
[20:57:58] <icinga2-wm>	 ACKNOWLEDGEMENT - puppet on ORES-redis01.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues zppix Silence! (Approved via awight.eqiad.irc.net)
[20:58:10] <icinga2-wm>	 ACKNOWLEDGEMENT - puppet on ORES-web05.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues zppix Ack
[20:58:28] <Zppix>	 awight: done
[21:01:31] <icinga2-wm>	 RECOVERY - puppet on ORES-redis01.experimental is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[21:02:39] <icinga2-wm>	 RECOVERY - puppet on ORES-web05.experimental is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[21:16:54] <wikibugs>	 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3838621 (10awight)
[21:16:59] <wikibugs>	 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838619 (10awight) 05Open>03stalled I think we've got our tuning parameters!  45 minutes of overload, and ev...
[21:17:07] <wikibugs>	 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3456617 (10awight)
[21:17:09] <wikibugs>	 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838622 (10awight)
[21:23:02] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Operations: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3838630 (10awight)
[21:23:04] <wikibugs>	 10Scoring-platform-team, 10ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3838631 (10awight)
[21:23:07] <wikibugs>	 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3838628 (10awight) 05Open>03Resolved Ok, done for real now.  @Halfak and I decided that the remaining bottlenecks are something n...
[21:23:44] <wikibugs>	 10Scoring-platform-team, 10ORES: Make sure ORES is compatible with stretch - https://phabricator.wikimedia.org/T182799#3838633 (10awight)
[21:23:46] <wikibugs>	 10Scoring-platform-team, 10ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3355113 (10awight)
[21:24:08] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Operations, 10Scap: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3838636 (10awight)
[21:24:10] <wikibugs>	 10Scoring-platform-team, 10ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3355113 (10awight) 05stalled>03Open Unstalling, now that the stress testing is complete.
[21:25:24] <wikibugs>	 10Scoring-platform-team, 10ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3838640 (10awight)
[21:25:27] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Operations, 10Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3838638 (10awight) 05Open>03Resolved I haven't seen this issue in a few weeks, closing.  Thank you!
[21:30:15] <wikibugs>	 10Scoring-platform-team, 10ORES: Make sure ORES is compatible with stretch - https://phabricator.wikimedia.org/T182799#3838643 (10awight) I'm reconsidering my proposal to use python3.4.  It's only available by adding jessie as an apt source, and causes some annoying dependency fu such as downgrading findutils...
[21:34:23] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Graphite: Add a graph of ORES Celery task queue length - https://phabricator.wikimedia.org/T182914#3838644 (10awight)
[21:35:42] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Graphite: Look at additional uWSGI metrics for potential use in the ORES dashboard - https://phabricator.wikimedia.org/T182915#3838656 (10awight)
[22:22:23] <wikibugs>	 10Scoring-platform-team, 10MediaWiki-extensions-ORES: Extension:ORES caused MW train rollback - https://phabricator.wikimedia.org/T182921#3838777 (10awight)
[22:22:35] <wikibugs>	 10Scoring-platform-team, 10MediaWiki-extensions-ORES, 10Wikimedia-Incident: Extension:ORES caused MW train rollback - https://phabricator.wikimedia.org/T182921#3838787 (10awight)
[22:28:52] <wikibugs>	 (03PS1) 10Chad: Avoid notice when oresm_name property doesn't exist [extensions/ORES] - 10https://gerrit.wikimedia.org/r/398381
[22:32:25] <wikibugs>	 (03CR) 10Awight: [C: 032] "Safe fix!" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/398381 (owner: 10Chad)
[22:36:56] <wikibugs>	 (03Merged) 10jenkins-bot: Avoid notice when oresm_name property doesn't exist [extensions/ORES] - 10https://gerrit.wikimedia.org/r/398381 (owner: 10Chad)
[22:40:24] <wikibugs>	 (03PS1) 10Chad: Avoid notice when oresm_name property doesn't exist [extensions/ORES] (wmf/1.31.0-wmf.12) - 10https://gerrit.wikimedia.org/r/398383
[22:40:42] <wikibugs>	 (03CR) 10Chad: [C: 032] Avoid notice when oresm_name property doesn't exist [extensions/ORES] (wmf/1.31.0-wmf.12) - 10https://gerrit.wikimedia.org/r/398383 (owner: 10Chad)
[22:42:21] <wikibugs>	 (03Merged) 10jenkins-bot: Avoid notice when oresm_name property doesn't exist [extensions/ORES] (wmf/1.31.0-wmf.12) - 10https://gerrit.wikimedia.org/r/398383 (owner: 10Chad)