[09:47:24] (CR) jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ORES] - https://gerrit.wikimedia.org/r/536467 (owner: L10n-bot)
[14:09:25] o/ akosiaris
[14:09:38] Are you available for a quick chat about https://phabricator.wikimedia.org/T232494 ?
[14:27:53] * halfak read https://www.reddit.com/r/Python/comments/4s40ge/understanding_uwsgi_threads_processes_and_gil/d56f3oo?utm_source=share&utm_medium=web2x
[14:28:05] It's a pretty good discussion of uwsgi threads but not much help to us.
[15:14:45] halfak: o/
[15:14:48] I am available now
[15:16:07] Hey! So I'm not sure I set up the git::lfs class right. I thought we might make some fast progress in IRC rather than going back and forth on gerrit.
[15:16:49] I decided to emit a fail() when the OS doesn't have a git-lfs package. That seemed better than silently continuing without it.
[15:17:06] that's fine
[15:18:08] ah, I see the problem, it's being used in ores::base. It should be used in profile::ores::web && profile::ores::worker
[15:18:31] We need it on our nodes for building models too. See my discussion in the task.
[15:18:59] add it to a profile class that this nodes have?
[15:20:33] these*
[15:20:33] which nodes are these btw
[15:20:35] stat1007.eqiad.wmnet and ores-misc-01.eqiad.wmflabs
[15:20:52] profile::analytics::cluster::packages::common ?
[15:21:05] looks sane for stat1007
[15:22:58] ores-misc-01 seems to not have any applicable profile applied. Maybe role::labs::ores::staging ?
[15:24:02] Right. That pulls from role::labs::ores::base
[15:24:23] stat1006/7 have ores::base too for the same reasons.
[15:25:22] It's a nice fit in base because we're gonna need git LFS and all of those enchant packages any time we're doing stuff with ORES.
[15:25:45] Building models or actually hosting them.
[15:28:58] akosiaris, do you think we should have a profile specifically for importing ores::base, but not setting up uwsgi or celery workers?
[15:30:31] that would kind of violate our current puppet organization (which is documented in https://wikitech.wikimedia.org/wiki/Puppet_coding#Organization)
[15:30:43] which btw is enforced by jenkins (hence the -1 you got from jenkins)
[15:30:54] enforced up to a point ofc
[15:32:11] the way I see it, it makes way more sense to change ores::base to reflect its actual nature (which is to install aspell/myspell/hunspell)
[15:32:12] Aha. I couldn't figure out that failure. :)
[15:32:27] which is a lot more work ofc
[15:32:38] but nothing in ores::base is really ORES specific
[15:33:30] if it is going to be reused in other places than the nodes that are running ORES, it's perhaps best split off the ORES module
[15:34:32] that being said, a quick way out of this without all this major refactoring (which I am not sure is worth it) would probably indeed be a profile class that includes ores::base and git::lfs
[15:34:40] and then apply it to any role you want
[15:34:50] s/apply/include/
[15:39:53] * akosiaris has to run
[15:45:40] akosiaris, I would say the enchant libs are ORES specific.
[15:45:55] Maybe I'm missing something. We use those to build, test, and deploy models for ORES.
[15:46:46] I can work on an ores::misc profile for non-web/worker nodes and pull it in from there.
[15:47:00] I'll do that now. Thanks for your thoughts :)
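
The "quick way out" akosiaris suggests above (a profile class that wraps ores::base and git::lfs, included from whatever roles need it) could look roughly like the sketch below. This is only an illustration of the idea under discussion, not the patch that was actually written; the class name profile::ores::misc is borrowed from halfak's last message and may not match what eventually landed in operations/puppet.

    # Hypothetical sketch of the profile discussed above; names are assumptions.
    class profile::ores::misc {
        # Spell-check libraries (aspell/myspell/hunspell/enchant) and other
        # dependencies needed wherever ORES models are built or hosted.
        include ::ores::base

        # Git LFS support for fetching large model files; the git::lfs class
        # under review fail()s on distributions without a git-lfs package.
        include ::git::lfs
    }

A role such as role::labs::ores::staging would then include this profile, which keeps the role / profile / module layering from the Puppet coding guidelines intact without refactoring ores::base itself.
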
[16:15:57] (PS1) Mainframe98: Let the special page factory construct the SpecialPages [extensions/ORES] - https://gerrit.wikimedia.org/r/536632
[16:22:48] ORES, Scoring-platform-team: ORES query with many statistics results in 503 - https://phabricator.wikimedia.org/T232855 (Tgr)
[18:12:25] Changing locations. BRB
[19:20:00] Scoring-platform-team, editquality-modeling, artificial-intelligence: Why is jawiki's goodfaith model so bad? - https://phabricator.wikimedia.org/T230953 (Keegan) I'll see if I can find someone next week for you.
[19:24:08] Scoring-platform-team, editquality-modeling, artificial-intelligence: Why is jawiki's goodfaith model so bad? - https://phabricator.wikimedia.org/T230953 (Halfak) \o/ Thank you.
[20:43:58] Hmm. Looks like ores-wmflabs is struggling right now.
[20:44:02] I can't get a score out of it.
[20:44:42] Hmm. Maybe it's just overloaded.
[20:44:48] * halfak checks the celery queue
[21:09:57] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:10:09] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2097 bytes in 1.541 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:10:25] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2097 bytes in 0.583 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:14:45] Aha. Redis took a dump.
[21:14:53] Because the VM is on a host that is overloaded.
[21:41:35] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/ORES
[21:53:10] OK, we have a new redis and it isn't overloaded.
[21:53:24] I'm working on adding one more celery worker. All of our workers are *pinned*
[21:53:35] on the new machines?
[21:59:31] Oh, the celery workers are pinned. The uwsgi workers seem to be happy.
[22:00:12] But for some reason, our back pressure isn't really working right.
[22:01:35] We should be able to see requests pile up in celery when we're overloaded.
[22:02:07] we check the length of the celery task queue on redis to know when to start replying with 500 errors
[22:02:31] But the celery queue isn't filling up. I'm not sure what is up with that.
[22:03:39] Anyway, we have a big quota, so I want to get another worker in there and reconsider our options on Monday.
[22:03:45] This is proving to be a *huge* PITA.
[22:04:07] It feels like new types of issues are coming from left field constantly.
[22:04:15] like the iowait issue for our redis VM
[22:04:21] I was really hoping that was going to explain more.
[22:04:33] Turns out it was unrelated. :|
[22:05:25] OK, ores-worker-04 is coming online now.
[22:05:38] We'll see if the requests fill the available space like a goldfish.
[22:05:40] :)
[22:12:51] Welp, the new node (for reasons I can't figure out) keeps running out of memory.
[22:12:59] So I'm going to give up and go home.
[22:14:05] o/
[22:14:14] later halfak, have a good weekend!
[22:31:11] wikimedia/editquality#689 (docs-test-ci - 3ceef13 : Andy Craze): The build was fixed. https://travis-ci.org/wikimedia/editquality/builds/584798522
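
The back-pressure mechanism described in the 22:02 exchange is that the web layer checks the length of the Celery task queue in Redis and starts refusing work once it passes a threshold. The snippet below is only a simplified sketch of that idea in Python, not ORES's actual code: the queue name, threshold, connection details, and function name are assumptions.

    # Simplified sketch of the overload check described at 22:02; not ORES's
    # real implementation. Queue name, threshold, and host are placeholders.
    import redis

    MAX_QUEUE_LENGTH = 100  # hypothetical threshold

    # With the Redis broker, Celery keeps pending tasks in a Redis list named
    # after the queue; the default queue is called "celery".
    conn = redis.Redis(host="localhost", port=6379)

    def overloaded():
        """Return True when the Celery task queue has piled up past the threshold."""
        return conn.llen("celery") > MAX_QUEUE_LENGTH

A web worker would consult a check like this before enqueuing new scoring work and reply with an overload error instead (the chat mentions 500s; T232855 in the log concerns 503s). The puzzle halfak describes is that the queue was not filling up even though the service was clearly overloaded, so the check never tripped.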