[09:47:24] (CR) jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ORES] - https://gerrit.wikimedia.org/r/536467 (owner: L10n-bot)
[14:09:25] o/ akosiaris
[14:09:38] Are you available for a quick chat about https://phabricator.wikimedia.org/T232494 ?
[14:27:53] * halfak read https://www.reddit.com/r/Python/comments/4s40ge/understanding_uwsgi_threads_processes_and_gil/d56f3oo?utm_source=share&utm_medium=web2x
[14:28:05] It's a pretty good discussion of uwsgi threads but not much help to us.
[15:14:45] halfak: o/
[15:14:48] I am available now
[15:16:07] Hey! So I'm not sure I set up the git::lfs class right. I thought we might make some fast progress in IRC rather than going back and forth on gerrit.
[15:16:49] I decided to emit a fail() when the OS doesn't have a git-lfs package. That seemed better than silently continuing without it.
[15:17:06] that's fine
[15:18:08] ah, I see the problem, it's being used in ores::base. It should be used in profile::ores::web && profile::ores::worker
[15:18:31] We need it on our nodes for building models too. See my discussion in the task.
[15:18:59] add it to a profile class that this nodes have?
[15:20:33] these*
[15:20:33] which nodes are these btw
[15:20:35] stat1007.eqiad.wmnet and ores-misc-01.eqiad.wmflabs
[15:20:52] profile::analytics::cluster::packages::common ?
[15:21:05] looks sane for stat1007
[15:22:58] ores-misc-01 seems to not have any applicable profile applied. Maybe role::labs::ores::staging ?
[15:24:02] Right. That pulls from role::labs::ores::base
[15:24:23] stat1006/7 have ores::base too for the same reasons.
[15:25:22] It's a nice fit in base because we're gonna need git LFS and all of those enchant packages any time we're doing stuff with ORES.
[15:25:45] Building models or actually hosting them.
[15:28:58] akosiaris, do you think we should have a profile specifically for importing ores::base, but not setting up uwsgi or celery workers?
[15:30:31] that would kind of violate our current puppet organization (which is documented in https://wikitech.wikimedia.org/wiki/Puppet_coding#Organization)
[15:30:43] which btw is enforced by jenkins (hence the -1 you got from jenkins)
[15:30:54] enforced up to a point ofc
[15:32:11] the way I see it, it makes way more sense to change ores::base to reflect its actual nature (which is to install aspell/myspell/hunspell)
[15:32:12] Aha. I couldn't figure out that failure. :)
[15:32:27] which is a lot more work ofc
[15:32:38] but nothing in ores::base is really ORES specific
[15:33:30] if it is going to be reused in other places than the nodes that are running ORES, it's perhaps best split off the ORES module
[15:34:32] that being said, a quick way out of this without all this major refactoring (which I am not sure is worth it) would probably indeed be a profile class that includes ores::base and git::lfs
[15:34:40] and then apply it to any role you want
[15:34:50] s/apply/include/
[15:39:53] * akosiaris has to run
[15:45:40] akosiaris, I would say the enchant libs are ORES specific.
[15:45:55] Maybe I'm missing something. We use those to build, test, and deploy models for ORES.
[15:46:46] I can work on an ores::misc profile for non-web/worker nodes and pull it in from there.
[15:47:00] I'll do that now. Thanks for your thoughts :)
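
The "quick way out" akosiaris suggests above (a profile class that wraps ores::base and git::lfs, included from whatever roles need it) could look roughly like the sketch below. This is only an illustration of the idea under discussion, not the patch that was actually written; the class name profile::ores::misc is borrowed from halfak's last message and may not match what eventually landed in operations/puppet.

    # Hypothetical sketch of the profile discussed above; names are assumptions.
    class profile::ores::misc {
        # Spell-check libraries (aspell/myspell/hunspell/enchant) and other
        # dependencies needed wherever ORES models are built or hosted.
        include ::ores::base

        # Git LFS support for fetching large model files; the git::lfs class
        # under review fail()s on distributions without a git-lfs package.
        include ::git::lfs
    }

A role such as role::labs::ores::staging would then include this profile, which keeps the role / profile / module layering from the Puppet coding guidelines intact without refactoring ores::base itself.
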
[16:15:57] (PS1) Mainframe98: Let the special page factory construct the SpecialPages [extensions/ORES] - https://gerrit.wikimedia.org/r/536632
[16:22:48] ORES, Scoring-platform-team: ORES query with many statistics results in 503 - https://phabricator.wikimedia.org/T232855 (Tgr)
[18:12:25] Changing locations. BRB
[19:20:00] Scoring-platform-team, editquality-modeling, artificial-intelligence: Why is jawiki's goodfaith model so bad? - https://phabricator.wikimedia.org/T230953 (Keegan) I'll see if I can find someone next week for you.
[19:24:08] Scoring-platform-team, editquality-modeling, artificial-intelligence: Why is jawiki's goodfaith model so bad? - https://phabricator.wikimedia.org/T230953 (Halfak) \o/ Thank you.
[20:43:58] Hmm. Looks like ores-wmflabs is struggling right now.
[20:44:02] I can't get a score out of it.
[20:44:42] Hmm. Maybe it's just overloaded.
[20:44:48] * halfak checks the celery queue
[21:09:57] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:10:09] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2097 bytes in 1.541 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:10:25] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2097 bytes in 0.583 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:14:45] Aha. Redis took a dump.
[21:14:53] Because the VM is on a host that is overloaded.
[21:41:35] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:53:10] OK, we have a new redis and it isn't overloaded.
[21:53:24] I'm working on adding one more celery worker. All of our workers are *pinned*
[21:53:35] on the new machines?
[21:59:31] Oh, the celery workers are pinned. The uwsgi workers seem to be happy.
[22:00:12] But for some reason, our back pressure isn't really working right.
[22:01:35] We should be able to see requests pile up in celery when we're overloaded.
[22:02:07] we check the length of the celery task queue on redis to know when to start replying with 500 errors
[22:02:31] But the celery queue isn't filling up. I'm not sure what is up with that.
[22:03:39] Anyway, we have a big quota, so I want to get another worker in there and reconsider our options on Monday.
[22:03:45] This is proving to be a *huge* PITA.
[22:04:07] It feels like new types of issues are coming from left field constantly.
[22:04:15] like the iowait issue for our redis VM
[22:04:21] I was really hoping that was going to explain more.
[22:04:33] Turns out it was unrelated. :|
[22:05:25] OK, ores-worker-04 is coming online now.
[22:05:38] We'll see if the requests fill the available space like a goldfish.
[22:05:40] :)
[22:12:51] Welp, the new node (for reasons I can't figure out) keeps running out of memory.
[22:12:59] So I'm going to give up and go home.
[22:14:05] o/
[22:14:14] later halfak, have a good weekend!
[22:31:11] wikimedia/editquality#689 (docs-test-ci - 3ceef13 : Andy Craze): The build was fixed. https://travis-ci.org/wikimedia/editquality/builds/584798522
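
The back-pressure mechanism described in the 22:02 exchange is that the web layer checks the length of the Celery task queue in Redis and starts refusing work once it passes a threshold. The snippet below is only a simplified sketch of that idea in Python, not ORES's actual code: the queue name, threshold, connection details, and function name are assumptions.

    # Simplified sketch of the overload check described at 22:02; not ORES's
    # real implementation. Queue name, threshold, and host are placeholders.
    import redis

    MAX_QUEUE_LENGTH = 100  # hypothetical threshold

    # With the Redis broker, Celery keeps pending tasks in a Redis list named
    # after the queue; the default queue is called "celery".
    conn = redis.Redis(host="localhost", port=6379)

    def overloaded():
        """Return True when the Celery task queue has piled up past the threshold."""
        return conn.llen("celery") > MAX_QUEUE_LENGTH

A web worker would consult a check like this before enqueuing new scoring work and reply with an overload error instead (the chat mentions 500s; T232855 in the log concerns 503s). The puzzle halfak describes is that the queue was not filling up even though the service was clearly overloaded, so the check never tripped.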