[00:00:03] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 9.035 second response time [00:00:13] > awight: stopped beta ores. [00:01:00] One more test. [00:01:35] * halfak logs [00:01:48] yep this is working. [00:02:05] let me know when you re-enable :) [00:02:07] I checked the requests by restarting the service right away. The S:RC page doesn’t recover for 1 minute [00:02:09] I did [00:02:21] kk [00:02:22] ty :) [00:02:25] n/p [00:02:36] YAS [00:02:38] * halfak is running backup wherever he can :) [00:02:42] S:RC recovered. [00:02:46] Its !log deployment-prep fyi halfak [00:02:55] So now I really need to relocate, back in 5 [00:03:07] yarg [00:03:41] Get it right for once geeez xD halfak [00:03:52] :P Zppix [00:03:54] Still around [00:03:56] This is a long day [00:04:01] I'm so ready to be done with it. [00:04:29] Ores is been such a whiny b#### lately [00:04:46] Na. ORES is OK. We need to spend more focus on ORES Ext. [00:05:20] I neglect it with scheduling work because I am not much of a contributor to the ORES ext [00:05:22] actually ORES infra needs some love as well [00:05:30] Well that isn't wrong. [00:05:32] Yea [00:05:44] things like uwsgi logging at various levels and not just INFO at logstash [00:05:51] akosiaris, if you have some to-do's that's something I'll tackle. [00:05:55] or splitting the workers for the HTTP hosts [00:06:10] web vs. celery? [00:06:18] yeah, that's a ball in my court [00:06:18] I'd love that. I need $$ :) [00:06:22] We do that in labs :D [00:06:30] s/labs/cloud/ [00:06:37] I know. And hardware wise we are ready to do that in production as well [00:06:45] as soon as the stress tests are done [00:06:55] akosiaris, next FY, let's collab on a request for a few dedicated web nodes. [00:07:06] I had no idea these stress tests would be so involved! [00:07:18] actually we are gonna collab for moving to kubernetes :P [00:07:20] But I'm stoked about the boost in capacity that it looks like we can get :) [00:07:22] Oh yeah [00:07:23] That [00:07:25] Sure :) [00:07:44] Just so long as you're down for carrying some of that weight. :D [00:07:53] a lot of the weight probably [00:08:01] How long until we start experimenting with that? [00:08:13] but at least it will be a goal for department & team so I will be able to justify the time [00:08:27] cause that's one resource I don't have a lot of unfortunately [00:08:44] so, next Q is the first application running on top of kubernetes [00:08:56] whoa I missed something fun. [00:08:57] with the point of learning as much as possible from the migration [00:09:05] kubernetes! [00:09:10] We're going to migrate right now ;) [00:09:12] infrastructure + development wise [00:09:13] loool [00:09:34] We decided that none of us sleep until k723642s is working [00:09:47] As long as we can do WMF budgeting in PAWS [00:10:01] OMG PAWS. I need to write that proposal. [00:10:02] so, second quarter of next year we can probably have a real shot at a migration [00:10:15] OK that sounds good to me. I'll keep in mind :) [00:10:31] but anyway, should we deploy the patch to ores:ext ? [00:10:32] In the short term, make ores work, medium, make ORES cluster work, [00:10:38] from what I gather it worked on beta ? [00:10:44] Right so awight has something mostly ready [00:10:48] I just have a couple questiosn. [00:11:09] Why lockTSE of 10? 
[00:11:16] The examples all set it to 30 [00:11:17] (03PS1) 10Awight: Revert "Fallback to old thresholds API as necessary (take 2)" [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393959 (https://phabricator.wikimedia.org/T179602) [00:11:20] (03PS1) 10Awight: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393960 (https://phabricator.wikimedia.org/T181567) [00:11:21] Is this some sort of wait>? [00:11:40] I think those are seconds. [00:11:48] yeah I cribbed directly from examples, I’m not gonna try that tonight. [00:11:49] OK I was guessing that. [00:12:11] OK I guess I just didn't see the examples with 10. [00:12:13] I’d prefer to read the code, etc. Feel free to try to decode what the heck is going on in WANObjectCache, though! [00:12:45] Yeah. That's a big mess :| [00:12:53] Well, not a mess. Just deep [00:12:56] Maybe a mess too [00:13:11] Smoke test says a lot. [00:13:16] (03CR) 10Awight: [C: 032] "Self-merge backport." [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393959 (https://phabricator.wikimedia.org/T179602) (owner: 10Awight) [00:13:16] What did you confirm in the smoke test? [00:13:21] (03CR) 10Awight: [C: 032] "Self-merge backport." [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393960 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:13:42] https://gerrit.wikimedia.org/r/#/c/393945 BTW [00:13:50] for akosiaris if you speak MW PHP [00:14:09] jenkins doesn't like something [00:14:39] oh my MW PHP is very bad [00:14:46] but I 'll have a look [00:14:57] Probably about as good as mine [00:15:17] I used to write a lot of PHP back in... 4? When did PHP get new style classes? [00:15:22] halfak: So what I confirmed was * code doesn’t blow up in either sucess or failure branches, * when the service goes down and we want a threshold, we correctly put [] in the cache with TTL 60s. * Requesting the page within that window doesn’t cause any more service requests, and * Once the 60s is up, the extension tries again. If it fails, it caches another [], if it succeeds then we’re back to 1 day caching of good thresholds. [00:15:24] 5 [00:15:37] lol /me wonders if halfak is thinking Perl [00:15:38] Right at the beginning of 5 is when I checked out. [00:15:41] nope [00:15:47] lol [00:15:52] other $var lang. [00:16:06] Fatal error: Cls: Expected string or object in /home/jenkins/workspace/mwext-testextension-hhvm-jessie/src/extensions/ORES/includes/Stats.php on line 121 [00:16:43] Sooo much useless punctuation in PHP [00:16:57] huh [00:17:37] => ? [00:17:38] akosiaris: where’s that from? 
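Annotation: the smoke test awight describes above is the extension's threshold fallback behavior: a successful thresholds fetch is cached for a day, while a failure caches an empty result for 60 seconds, so at most one upstream request per (minute x model x wiki) is retried, matching the patch title. The real code is PHP built on MediaWiki's WANObjectCache; the following is only a minimal Python sketch of that negative-caching pattern, with every name (fetch, _cache, get_thresholds) hypothetical.

```python
import time

GOOD_TTL = 24 * 3600   # successful thresholds cached for a day
ERROR_TTL = 60         # failures cached for 60s: at most one retry per minute

_cache = {}  # key -> (expires_at, value); stand-in for memcached/WANObjectCache

def get_thresholds(wiki, model, fetch):
    """Return thresholds for (wiki, model), rate-limiting upstream failures.

    `fetch` is a callable that talks to the ORES service and may raise.
    """
    key = ('thresholds', wiki, model)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]            # cached value, possibly the empty "failure" marker
    try:
        value = fetch(wiki, model)
        _cache[key] = (time.time() + GOOD_TTL, value)
    except Exception:
        # Negative caching: remember the failure briefly so page views within
        # the next minute do not hammer the service again.
        value = []
        _cache[key] = (time.time() + ERROR_TTL, value)
    return value
```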
[00:17:57] awight: last few lines of https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/24606/console [00:19:24] (03Merged) 10jenkins-bot: Revert "Fallback to old thresholds API as necessary (take 2)" [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393959 (https://phabricator.wikimedia.org/T179602) (owner: 10Awight) [00:20:28] (03Merged) 10jenkins-bot: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393960 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:21:51] omg I copypasta’d straight out of an example, smh [00:22:28] (03PS2) 10Awight: Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) [00:23:06] halfak: Thanks for the company. The extension backports are ready to deploy. [00:23:50] Ha! OK [00:24:18] * halfak waits on Jenkins [00:24:32] awight, did that code run in the smoke test? [00:24:46] No, we ran master. [00:25:00] I don’t think we can easily put beta on this branch. [00:25:48] the… tests passed is all I can say about the branch,. [00:26:37] OK so we need to make a call. Gotcha. Once we merge you can smoke test? [00:26:55] I could just run this branch locally... [00:27:12] Lemme to an abbreviated test, just general smoke rather than the specific cache thing. [00:27:14] *do [00:27:19] OK [00:28:43] hooks are firing correctly. [00:29:51] “scap sync-vile” [00:29:57] ARISE [00:30:35] jenkins is taking it's sweet time on https://gerrit.wikimedia.org/r/#/c/393945/ [00:30:53] It really is [00:31:04] Here goes. [00:32:54] (03CR) 10Halfak: [C: 032] Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:32:58] Done [00:33:03] oh shit [00:33:10] k we can look at that on beta [00:33:19] I just deployed the parent commit though [00:33:26] Sorry I misunderstood. Might be better? [00:34:14] Sure, we can try that after some local testing [00:34:25] so far, the more basic patch is holding its own though [00:36:27] Actually, there haven’t been any ORES 400s since 00:08, so we wouldn’t know. [00:36:52] We need the error window. [00:37:10] Meanwhile, I need to backport to wmf.10 to say I did. [00:37:11] oh... huh [00:37:31] Why not. Weird! [00:38:11] (03PS1) 10Awight: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.10) - 10https://gerrit.wikimedia.org/r/393966 (https://phabricator.wikimedia.org/T181567) [00:38:21] just to screw with us, apparently. [00:38:30] I think ORES may be dating SKYNET [00:38:43] But this is MW :P [00:38:51] (03Merged) 10jenkins-bot: Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:39:09] (03CR) 10Awight: [C: 032] "Self-merging backport." [extensions/ORES] (wmf/1.31.0-wmf.10) - 10https://gerrit.wikimedia.org/r/393966 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:39:19] I think the error window is starting [00:39:42] per https://grafana-admin.wikimedia.org/dashboard/db/ores?panelId=2&fullscreen&orgId=1&from=now-30m&to=now-1m [00:39:51] wheee! [00:39:57] throw your hands in the air [00:40:09] the rest of the graphs in that dashboard are for the first time at close to 1k scores btw [00:40:34] No 400s [00:41:52] akosiaris: Can you pull user agents easily? 
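Annotation: the user-agent tallies akosiaris posts a few minutes later ("1767 MediaWiki/1.31.0-wmf.8", "1640 ChangePropagation/WMF") answer the question just asked; the log does not show how they were produced. As an illustration only, a tally like that can come from any access log that records the User-Agent field. The file path and combined-log-format assumption below are hypothetical, not the actual production pipeline.

```python
import re
from collections import Counter

# Last two quoted fields of a combined-format line are "referer" "user-agent".
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

counts = Counter()
with open('access.log') as log:        # hypothetical path
    for line in log:
        m = UA_RE.search(line)
        if m:
            counts[m.group('ua')] += 1

for ua, n in counts.most_common(10):
    print(f'{n:6d} user agent "{ua}"')
```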
[00:41:54] Looks like we have a big increase in requests. [00:42:11] Not external. This is precaching [00:42:14] (03Merged) 10jenkins-bot: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.10) - 10https://gerrit.wikimedia.org/r/393966 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:42:21] halfak: yes [00:42:31] WThorridF [00:42:48] Did some continent just wake up? [00:43:30] awight: yeah give a min [00:43:46] Europe is starting to get very early morn (~midnight) halfak [00:43:50] akosiaris: not necessary, thanks [00:44:04] Maybe east asia [00:44:07] akosiaris: The graphs are showing that it’s legitimate internally-generated requests. [00:45:05] 1767 user agent "MediaWiki/1.31.0-wmf.8" [00:45:05] 1640 user agent "ChangePropagation/WMF" [00:45:10] halfak: waidaminute. These are scores errored, but nothing is overloaded [00:45:14] that's the 2 top since 00:30 [00:45:16] what does that mean? [00:45:18] akosiaris: ty! [00:46:04] we have someone who sends just "Mozilla" as a user agent [00:46:08] halfak: Not timed out either?? [00:46:12] cause that's believable.. lol [00:46:30] akosiaris: that could be any amnt of browsers/clients [00:46:31] akosiaris: We should email Mozilla and give them a piece of our minds [00:46:55] awight: be nice... now Mozilla didnt mean jt xD [00:47:16] Zppix: no not really. It's just Mozilla nothing like https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent/Firefox [00:47:52] instead of "Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" it's .... "Mozilla" [00:48:03] so it's 99% phabricated and not real [00:48:15] I am leaving a 1% just to play it safe [00:48:38] anyway halfak awight things are finally looking good ? [00:49:10] I think so. halfak just realized, the “scores errored” is incredibly low, barely a handful. [00:49:14] I'm looking into it. [00:49:35] Oh damn. KK [00:49:56] 400s are gone so I’m not going to rush that last cache patch. [00:51:32] 10Scoring-platform-team, 10MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), 10Patch-For-Review: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3795157 (10awight) Still need to better understand, and test, https://gerrit.wikimedia.org/r/#/c/39... [00:51:47] ok, I am going to re-enable celery on scb1002 and scb1001 [00:52:21] OK [00:53:03] Might hit memory errors again though :/ [00:53:22] yeah that's what I want to see [00:53:26] This is random, but if there was some underlying thing which was causing off-schedule memcache purges for our thresholds data (I’m suspicious of that), then we’ll start hearing complaints about enwiki etc. not having ORES features for 1min at a time now. [00:53:41] akosiaris: What time zone are you in? [00:53:47] UTC+2 [00:53:59] Man. Late night for you [00:54:05] yes [00:54:09] I'm tired and I'm UTC-5 [00:54:10] ACK [00:54:15] I don't know what tired is [00:54:16] I say good night, sir! [00:54:22] Right on. [00:54:31] I could go for a few more hours lol [00:55:10] I 'll watch memory for a few more mins on those 2 hosts and go to bed if all is ok [00:55:35] +1 [00:55:50] awight, we are due for another cycle if the hourly cadence holds. [00:56:19] No problem, I can lurk for a bit. [00:56:44] 4(noping)pergos gave me some homework I need to get started on :D [01:03:27] Still nothing [01:03:36] * halfak squints [01:05:04] memory at scb1001, scb1002 looks stable at ~23GB [01:06:03] so here's a good question. 
Should we stick with the no persistence for queue redis or not ? [01:06:57] it does look like ORES can survive just fine losing the queue [01:06:59] IMO persistence is helpful for the score cache, but unwanted for the celery cache [01:07:18] +1 [01:07:41] But all we lose if we have no persistence for either is /somwhat/ (unknown) higher load until the cache warms up again [01:08:15] Right. Not a huge problem [01:08:52] ok then. I 'll upload a patch tomorrow to remove the queue persistence [01:08:54] Oh I guess we can guess the extra load just from the cache hit ratio. [01:10:00] Precaching (which obviously does not involve cache hits) is the dominant request load. [01:10:02] halfak: cache hit rate averages as high as 75%. I think that’s 4x the load if we lose the cache. [01:10:22] Na. Changeprop isn't include in cache hit rates :) [01:10:29] ooh [01:10:33] changeprop is about 10x requests themselves [01:10:33] sneaky [01:10:51] Also mediawiki maintains its own cache [01:11:39] So at 7% cache hit rate, the cache barely makes any improvement. [01:11:54] It’s just useful for research end-users. [01:12:43] While you’re chasing mirages: I think this is a book I saw once. Really bizarre but fun, https://www.amazon.com/Mirages-Anomalous-Rainbows-Electromagnetic-Phenomena/dp/0915554127 [01:14:14] awight, it's speed for patrollers :) [01:14:18] speed is really importanty [01:14:29] ah right so Huggle? [01:14:32] Right [01:14:36] neat. [01:14:44] You load up huggle and it's wants the score for the last N revisions [01:14:45] BAM [01:14:47] you've got it. [01:15:12] I think the window passed... score rates have dropped and we suffered no issue [01:15:32] akosiaris: that was above and beyond… I hope you get to sleep in. [01:15:42] and that's my queue :) [01:15:48] good night and good luck [01:15:49] Right thanks all. [01:17:17] I'll keep some light monitoring going. [01:32:10] (03CR) 10Krinkle: Rate limit thresholds failures to once per (minute x model x wiki) (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [01:38:11] (03CR) 10Krinkle: Rate limit thresholds failures to once per (minute x model x wiki) (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [02:59:11] hi again [02:59:18] I'll try and work on the poolcounter stuff tonight [04:41:45] (03PS1) 10Prtksxna: Unify BetaFeatures SVG screenshot markup [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393989 (https://phabricator.wikimedia.org/T180427) [04:54:23] Checking in. ORES looks good. [07:18:32] (03CR) 10VolkerE: [C: 04-1] Unify BetaFeatures SVG screenshot markup (033 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393989 (https://phabricator.wikimedia.org/T180427) (owner: 10Prtksxna) [08:02:42] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3795673 (10akosiaris) [08:02:45] 10Scoring-platform-team (Current), 10Operations, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3795671 (10akosiaris) 05Resolved>03Open Re-opening per the following: After a brief discussion in #wikimedia-ai at ~01... 
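Annotation: the cache-loss arithmetic discussed around 01:08-01:12 follows from a simple formula: if the hit rate is h, losing the cache multiplies backend scoring load by 1/(1 - h). That reproduces both figures quoted above: the "4x" at a 75% hit rate, and why the ~7% effective rate (once ChangeProp precaching and MediaWiki's own cache are accounted for) "barely makes any improvement".

```python
def load_multiplier(hit_rate):
    """Backend load factor if the cache disappears: every request becomes a miss
    instead of only the (1 - hit_rate) fraction that misses today."""
    return 1 / (1 - hit_rate)

print(load_multiplier(0.75))  # 4.0   -> the "4x" figure
print(load_multiplier(0.07))  # ~1.08 -> why 7% "barely makes any improvement"
```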
[10:10:55] (03PS4) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) [10:11:03] (03CR) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation (037 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [10:11:59] (03CR) 10jerkins-bot: [V: 04-1] Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [10:20:23] heads up we are going to have a minor downtime in scorings as I deploy https://gerrit.wikimedia.org/r/394022 [10:20:25] (03PS5) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) [10:20:29] Amir1: ^ [10:20:50] I am guessing aaron is not yet around so I 'll avoid pinging [10:20:51] akosiaris: Okay, I'm around to monitor [10:20:57] yeah [10:20:59] :) [10:21:16] we are going to lose all the queue jobs btw [10:21:25] but that's the entire idea behind this [10:22:50] ok I am seeing clients reconnected quickly on codfw [10:22:56] connected_clients:336 [10:24:37] some minor overload errors as well on codfw (~10) [10:24:42] have subsided already [10:26:07] scorings have returned to previous levels, this looks ok. Moving on to eqiad [10:29:14] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:37] ack'd [10:31:08] connected_clients:602 [10:31:10] I see errors are going up [10:31:16] eqiad clients are going up as well [10:33:37] hmmm it's not converging yet [10:33:49] it should have recovered by now [10:34:07] # Clients [10:34:07] connected_clients:768 [10:34:07] client_longest_output_list:0 [10:34:07] client_biggest_input_buf:260640 [10:34:07] blocked_clients:0 [10:34:18] we got a ton of clients already reconnected to redis [10:36:55] akosiaris: The jobs are recovering [10:37:04] maybe it's just too slow [10:37:23] https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1 [10:37:31] yeah.. overloads are now zero, but we have a steady flow of scores errored at around 12,5 [10:37:48] which is higher than normal [10:38:24] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 8.030 second response time [10:41:08] I think it's because of the "moving average" thing [10:41:28] akosiaris: it's six now, and it's dropping [10:41:57] yeah looks like it's recovering [10:42:23] but this window was rather large.. some 6-7 minutes of overloads [10:42:46] maybe killing persistency wasn't that good an idea after all [10:43:01] now every time redis gets restarted we will probably have a similar event [10:44:21] maybe we should have a fallback [10:44:34] I remember some discussions about twmproxy [10:44:34] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:45:13] WTF [10:45:29] graphs are all fine [10:45:44] akosiaris: ^ [10:46:07] Amir1: yeah we enabled it. 
And then had to rollback immediately [10:46:19] something in ores (celery probably) uses the MULTI redis command [10:46:26] and it's not twmproxy compliant [10:46:34] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 5.528 second response time [10:47:27] Amir1: so we did have a set of overload errors again... on 10:42 [10:47:32] it has subsided again [10:47:42] I am going to lower the celery concurrency on scb1001, scb1002 [10:47:46] these are less powerful boxes [10:48:00] they shouldn't be having the same amount of workers as the other too [10:48:14] okay [10:48:18] that will lower overall capacity a bit, but better to have lower capacity than that ^ [10:49:24] We should finish fixing the ores for the new boxes that should be the most robust solution I guess [10:49:29] yes [11:00:04] yeah, lowering the concurrency on scb1001, scb1002 seems to have helped the boxes themselves... load is dropping [11:00:16] so does memory usage [11:00:20] let's see if it lasts [11:01:45] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:01:53] oh for the love of [11:02:27] 1k overload errors the moment I restarted celery ? [11:02:44] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 7.035 second response time [11:03:02] well it's recovering faster this time around [11:06:47] so this feel more fragile than I would like it to be [11:11:04] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:11:30] akosiaris: this looks worrying [11:11:45] damn [11:11:54] yeah, it's overloading again [11:12:27] scb1001 had a minor overload spike [11:13:02] it's weird btw this check is reporting but others are not [11:13:19] it's clearly unearthing some underlying issue [11:14:04] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 3.527 second response time [11:14:36] all scb boxes had overload spikes... sigh [11:19:12] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796085 (10akosiaris) [11:19:16] 10Scoring-platform-team (Current), 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3796083 (10akosiaris) 05Open>03Resolved Re-resolving. This has been deployed. The deploy in codfw... [11:20:20] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Introduce ModelLookup interface and its SQL implementation (035 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [11:35:24] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.015 second response time [11:36:22] damn [11:36:25] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 2.022 second response time [11:36:34] yeah so the service is suffering [11:36:50] the job rate seems to be unstainable ? 
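Annotation: the twemproxy rollback mentioned above is attributed (the chat's conjecture) to something on the ORES side issuing Redis MULTI/EXEC transactions, which twemproxy does not support. In redis-py, MULTI comes from transactional pipelines; the sketch below only illustrates where MULTI appears, it is not the actual celery broker code.

```python
import redis

r = redis.Redis(host='localhost', port=6379)   # assumed local instance

# transaction=True wraps the queued commands in MULTI ... EXEC,
# which a twemproxy frontend cannot proxy.
with r.pipeline(transaction=True) as pipe:
    pipe.lpush('queue', 'job-1')
    pipe.expire('queue', 3600)
    pipe.execute()

# transaction=False batches the same commands without MULTI/EXEC,
# which a proxy can handle.
with r.pipeline(transaction=False) as pipe:
    pipe.lpush('queue', 'job-1')
    pipe.expire('queue', 3600)
    pipe.execute()
```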
[11:39:35] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:40:44] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 7.033 second response time [11:49:02] no real useful logs into what's going on [11:50:22] Amir1: what does overload mean exactly ? it's just that there are no available celery workers to work on the jobs ? [11:50:32] I am guessing that, but want to make sure [11:58:05] akosiaris: when the celery queue is 100 it stops accepting scores [11:58:31] that's per host or globally ? [11:58:40] globally AFAIK [11:59:48] hmm maybe I should increase that [11:59:57] do we have any good reason it's at 100 ? [12:00:03] and not.. say 150 ? [12:00:58] it does look to be per host however [12:01:09] it's definitely a setting per host [12:01:26] the queue itself is global but the setting is per host [12:10:24] someone external is hammering codfw ORES pretty good. https://grafana.wikimedia.org/dashboard/db/ores?panelId=1&fullscreen&orgId=1&from=1511947481662&to=1511956735700 [12:10:36] but I don't think they are causing problems yet. [12:14:41] I think I 'd like to conservatively increase it [12:14:47] the celery queue limit that is [12:14:51] for 100 to say 120 [12:14:54] from* [12:26:16] Amir1: so https://gerrit.wikimedia.org/r/#/c/394047/ is up, lemme know what you think [12:26:44] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.025 second response time [12:30:08] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796190 (10awight) Looks like we'll be doing the same thing today. There have been intermittent overload incidents for the last two hours, during... [12:35:33] 10Scoring-platform-team, 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796193 (10awight) @akosiaris And thoughts about how we would troubleshoot this metrics problem? I'd like to review the modules responsible for... [12:35:42] 10Scoring-platform-team (Current), 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796195 (10awight) [12:38:21] oboy [12:38:33] It’s been raining errors again. [12:47:04] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 4.029 second response time [13:01:24] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.024 second response time [13:03:34] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 849 bytes in 7.036 second response time [13:03:52] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796259 (10awight) [13:09:44] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:35] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796299 (10awight) @akosiaris I'd like to get our celery logs routed to logstash, at INFO level. 
We could just pipe into a file too, per... [13:16:45] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 6.533 second response time [13:20:54] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:54] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 6.557 second response time [13:23:29] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3796325 (10awight) @thcipriani @mmodell Is the fix for T179013 deployed to production? I'm hoping the fix will be that s... [13:24:54] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.018 second response time [13:26:35] halfak: solid overload errors from eqiad [13:30:53] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3796332 (10Aklapper) [13:34:06] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3796335 (10Aklapper) @Ladsgroup: Thanks! Assuming this is about Python, imported as... [13:35:21] o/ [13:35:23] Damn [13:35:26] halfak: yep [13:36:07] LMK when you’re up to speed, I need to relocate so I’m not entangled with family [13:36:14] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 8.039 second response time [13:36:31] I’d say we should failover to codfw again, but that didn’t work out so well yesterday. [13:37:15] I want to turn on celery logging at INFO level we we can see what’s happening during high-cpu usage. It’s gotta be something unusual, cos baseline CPU usage is so steady. [13:37:53] Also pushing for releng to fix deployment to the new cluster. I haven’t quite got the heart to make that UBN [13:37:58] but maybe we should. [13:38:10] I think it just involves deploying a patch they’ve already written. [13:38:11] We are failing to CODFW [13:38:25] Looks like three nodes in CODFW and one node in EQIAD are responding [13:38:40] Why would the cache rate be falling btw? [13:38:55] *cache hit rate [13:39:01] Only guess is that we're getting requests for non-recent revisions. [13:39:18] There was that user in here. Victor --> x.... [13:39:25] Was just about to run a big analysis. [13:39:30] Might be hammering us to death :) [13:39:37] xinblev [13:40:17] lol [13:40:29] A scapegoat would be helpful right about now [13:40:54] uwsgi and celery are doing *something* on scb1003 but according to Grafana, they aren't scoring revisions. [13:41:25] This doesn’t make sense to me: overload errors are 1k/min on eqiad, but scores errored only 10/minute [13:41:47] Scores errored != Overload errors. [13:41:55] An overload error happens before scoring can start [13:42:08] k. I’m reading scoring_system metrics code [13:42:26] What makes you say we’re failing over to codfw? 
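Annotation: the "overload" errors being counted here are, as explained above, rejections that happen before any scoring starts: when the Celery queue length reaches the configured maximum (100 at this point, a per-host setting against a shared queue), the web layer refuses new work. A minimal sketch of that kind of guard, assuming a Redis list as the broker queue; the names (QUEUE_KEY, OverloadError, submit_score_job) are hypothetical, not the actual ores implementation.

```python
import redis

QUEUE_MAXSIZE = 100      # the per-host limit discussed above
QUEUE_KEY = 'celery'     # assumed name of the broker's Redis list

class OverloadError(Exception):
    """Returned to the client before any scoring work is queued."""

def submit_score_job(r: redis.Redis, enqueue, rev_id):
    # The queue itself is shared (global), but each web host applies its own
    # configured maximum before accepting more work.
    if r.llen(QUEUE_KEY) >= QUEUE_MAXSIZE:
        raise OverloadError(f'queue >= {QUEUE_MAXSIZE}, refusing rev {rev_id}')
    return enqueue(rev_id)
```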
[13:42:27] We got hammered at 13:04 [13:42:36] codfw is handling a lot of requests [13:42:38] Looks like we’re getting 1k/min failed requests [13:42:42] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-3h&to=now-1m [13:43:03] We're returning scores for 1000 per minute. [13:43:09] (not cached) [13:43:52] Also, “all scores returned” shows all machines are returning scores, why do you say that only four nodes are responding? [13:44:04] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.015 second response time [13:44:23] awight, only 4 nodes are generating scores. [13:44:36] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-3h&to=now-1m&panelId=3&fullscreen [13:44:39] Is “scores returned” at the wsgi level? [13:44:52] Yes [13:45:02] ok [13:45:14] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 2.536 second response time [13:45:35] halfak: What do you think about logging celery at INFO level? [13:45:55] Bad [13:45:58] https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/scoring_system.py#L165 [13:46:25] https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/celery_queue.py#L68 [13:47:11] Arg I need to change locations too. [13:47:42] halfak: Sorry, is your point that we get a metrics point for everythign that would be celery logging? [13:48:14] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 871 bytes in 0.020 second response time [13:51:03] halfak: I’m thinking, celery=INFO will let us see anything weird in lookup vs process score, and warnings about the redis queue. [13:51:12] currently those go to /dev/null [13:51:38] What was “Bad” in response to? [14:00:56] awight, OH i see I misread your question. I think that logging celery to see what is up is a good idea. [14:01:18] Want to sign here: https://phabricator.wikimedia.org/T181621#3796299 [14:01:20] I'm not sure anything weird is happening beyond celery getting jammed because of OOM and a *huge* rate of external requests. [14:01:36] It’s the OOM, yeah. [14:01:55] awight: [14:01:56] halfak: akosiaris: Nov 29 10:32:59 scb1003 celery-ores-worker[13231]: MemoryError: [Errno 12] Cannot allocate memory [14:03:20] awight, why are we not addressing this obvious problem and cutting the # of workers? [14:03:25] akosiaris: halfak: oh hey. This is the entire contents of scb1003 ores/app.log, [14:03:28] Connection to Redis lost: Retry (0/20) now. [14:03:29] Connection to Redis lost: Retry (1/20) in 1.00 second. [14:03:30] Connection to Redis lost: Retry (2/20) in 1.00 second. [14:03:31] Connection to Redis lost: Retry (3/20) in 1.00 second. [14:03:32] Connection to Redis lost: Retry (4/20) in 1.00 second. [14:03:33] Connection to Redis lost: Retry (5/20) in 1.00 second. [14:03:37] woah [14:03:41] um. [14:03:46] wtf [14:03:52] Why is everything going wrong at once? [14:04:06] I think akosiaris might be recovering from last night, we should grab another opsen? [14:04:25] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 2.036 second response time [14:05:12] back [14:05:20] what on earth is going on ? [14:05:25] hey sorry to shout [14:05:27] IONO [14:05:35] akosiaris, same shit, more confusion [14:05:39] But see scb1003:/srv/log/ores/app.log [14:05:43] The entire contents are... 
[14:06:07] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796413 (10awight) Most of our worker boxes are down. Here's the app.log from one that's down, last written to 2 hours ago: ``` Connection to Red... [14:06:13] ^ [14:06:28] Also, a shitton of OOM deaths. [14:06:44] not in that file though. [14:06:56] akosiaris: Do I have privs to read oom_kill logs? [14:07:21] I can’t find that anywhere. syslog and kern.log are read group adm [14:07:34] I propose we reduce the # of workers and restart everything. I want to get these OOM out of the way before we start trying to figure anything out. [14:08:01] akosiaris dropped #workers to 20 already, on scb1001-2 [14:08:09] And still OOM? [14:08:16] Then it's not us for sure! [14:08:21] mmm [14:08:37] I’m thinking of the OOMs we saw on beta... [14:08:53] Beta? What? [14:09:05] This 100% celery behavior reminds me of when we had a rogue revid and an regex problem. [14:09:09] Maybe we have another one of those. [14:09:27] Also possible is that scb1001 is handling a ridiculous load. [14:09:34] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.015 second response time [14:09:44] Looks like we're not 100% on a single celery process for long [14:09:52] So it's not like the roque revid event. [14:10:00] akosiaris: Can we set up INFO logging btw? https://phabricator.wikimedia.org/T181621#3796299 [14:10:15] I can make the puppet patch if you agree. [14:10:27] ok backlog read, starting to look at what's happening [14:11:05] there was no OOM on scb1003 today [14:11:24] oh good. [14:11:51] in fact memory has been pretty stable [14:11:55] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All [14:11:59] akosiaris: I don’t understand. > Nov 29 10:32:59 scb1003 celery-ores-worker[13231]: MemoryError: [Errno 12] Cannot allocate memory [14:12:29] when I say OOM, I mean the kernel OOM killer [14:12:44] that's a process the kernel spawns when the box is under intense memory pressure [14:12:54] right [14:12:58] akosiaris: BTW I’m pretty sure that the memory graphs for scb1001-2 are wrong, https://phabricator.wikimedia.org/T181544 [14:13:04] it kills processes until it has managed to stabilize the situation [14:13:34] OK I’ll believe that OOM killer wasn’t involved, but how would it *not* get involved if we’re seeing errors like “cannot allocate memory"? [14:13:47] I need to leave my desk for 45 mins. I'll be back on as soon as I can. [14:14:10] awight: that's a good question [14:14:24] awight: which graphs is that task referring to ? [14:15:31] cause this https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=14&fullscreen&orgId=1&var-server=scb1001&var-network=eth0 and this https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=88&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All [14:15:33] akosiaris: This one, https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511954088176&to=1511964828177&panelId=6&fullscreen [14:15:37] are in agreement [14:16:17] Do these graphs show the OOM conditions yesterday? 
[14:16:36] yes [14:16:38] for example [14:16:45] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 8.036 second response time [14:16:46] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=14&fullscreen&orgId=1&var-server=scb1001&var-network=eth0&from=1511951404293&to=1511951681817 [14:16:57] you can see it's hitting the ceiling very very clearly [14:17:08] ah nice [14:17:21] this does to https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=88&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All&from=1511951282292&to=1511951830300 [14:17:29] but we need to graph the ceiling there as well [14:18:18] This graph never drops below 19GB free, https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511792288603&to=1511965028603&panelId=6&fullscreen [14:18:29] this https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511954088176&to=1511964828177&panelId=6&fullscreen&edit is wrong though [14:18:40] no way that box has 21 G free currently [14:19:03] 10Scoring-platform-team (Current), 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796458 (10awight) Here's a graph that shows the OOM ceiling correctly, for comparison: https://grafana.wikimedia.org/dashboard/file/se... [14:19:20] ok cool, thanks for convincing me that we have something that shows the right number. [14:20:03] so back to what's happening [14:20:12] we are having huge overload errors which means 1 thing [14:20:35] we are pushing way too many jobs and the max_queue limit built into ores is stopping us [14:21:06] stopping us? It seems that OOM and possibly redis errors are stopping us. [14:21:34] yeah that oom thing is something I need to look into better [14:21:35] I think we’re all in agreement that it’s fine to turn down the celery worker count until the machines can handle max load, if that’s what you mean. [14:21:41] let me though first look at redis [14:21:45] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.029 second response time [14:21:46] kk [14:22:15] akosiaris: What do you think about trying to pool the new cluster? Is that pointless until we clean the old house? [14:22:44] I 'd like us to pool it, just not sure it should be a kneejerk reaction [14:22:55] [761 | signal handler] (1511951211) Received SIGTERM scheduling shutdown... [14:22:56] aha [14:22:58] k that’s how I feel. Also, we’re blocked on releng [14:23:01] wat [14:23:02] so that's oresrdb1001 [14:23:09] yikes. [14:23:29] BTW it’s fine to reduce the number of keys cached in redis, I’m not sure if Redis sets that limit itself or we do. [14:23:42] it's memory based [14:23:51] we have it capped at 1G for the queue and 6G for the cache currently [14:23:55] we can increase both [14:23:56] ok. [14:24:00] Low is fine. [14:24:35] so the other thing that happened is that the cache redis also exhibited [14:24:41] [754] 29 Nov 10:27:06.100 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis. 
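Annotation: the "capped at 1G for the queue and 6G for the cache" figures above are Redis maxmemory settings on the two oresrdb instances. A quick way to confirm them with redis-py is sketched below; the ports and the full hostname are assumptions (the log only names "oresrdb1001"), so treat this as illustrative.

```python
import redis

# Assumed layout: one instance for the Celery queue, one for the score cache.
instances = {'queue (1G cap)': 6379, 'cache (6G cap)': 6380}   # ports are a guess

for name, port in instances.items():
    r = redis.Redis(host='oresrdb1001.eqiad.wmnet', port=port)  # FQDN assumed
    maxmemory = int(r.config_get('maxmemory')['maxmemory'])
    used = r.info('memory')['used_memory']
    print(f'{name}: {used / 2**30:.2f} GiB used of {maxmemory / 2**30:.2f} GiB cap')
```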
[14:25:18] ah the redis restarting was me [14:25:27] for enabling the non-persistency setting [14:25:28] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796496 (10awight) oresrdb1001 is dying: ``` [09:22am] akosiaris: [761 | signal handler] (1511951211) Received SIGTERM scheduling shutdown... [09:2... [14:25:33] Oops. Let me remove my uninformed comment. [14:25:38] ok that's expected [14:25:47] Removed. [14:26:54] So, it could be as simple as, several of the celery managers never recovering from redis shutdown. [14:26:55] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 847 bytes in 6.533 second response time [14:26:57] so the logs about [14:27:01] Connection to Redis lost: Retry (3/20) in 1.00 second. [14:27:05] are untimestamped [14:27:09] we need to fix that [14:27:13] The file has a timestamp though [14:27:17] yes +1 [14:27:20] yeah 10:26 [14:27:25] so it's the same thing [14:27:30] I think we’re running with default logging config. [14:27:32] the restart I did for the persistency thing [14:27:46] so redis misbehaving is a red herring up to now I think [14:27:52] I’ll put that in our wikitech notes, that celery workers have to be restarted after redis [14:29:14] akosiaris: Argh, I just realized that our logging config is checked in with the source, and there’s no mechanism to look in another place that can be managed by puppet. I’ll modify in source for now. [14:30:08] so one thing that changed is that I lowered celery consistency [14:30:10] eer [14:30:12] concurrency [14:30:26] maybe I could bump that [14:30:31] that was a good idea to lower [14:31:05] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:24] so about that (04:11:59 μμ) awight: akosiaris: I don’t understand. > Nov 29 10:32:59 scb1003 celery-ores-worker[13231]: MemoryError: [Errno 12] Cannot allocate memory [14:31:28] where did you see thing ? [14:31:31] (03PS1) 10Awight: Increase celery verbosity; use message format including timestamp [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394060 (https://phabricator.wikimedia.org/T181538) [14:31:33] which file I mean [14:32:33] akosiaris: sudo service celery-ores-worker status -l [14:32:34] I am guessing journalctl [14:32:38] yep [14:32:38] yeah that's it [14:33:01] Would be nice if that stuff went into logstash, eh? [14:33:10] yes [14:33:12] we need to fix that too [14:33:22] so that happened around the same time I restart redis [14:33:32] 6 mins later [14:33:51] so it's related [14:34:15] 10Scoring-platform-team, 10Operations, 10monitoring, 10Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3796535 (10awight) [14:34:20] but it's not currently happening [14:34:29] akosiaris: I’m going to restart services on all nodes, one moment please. [14:34:36] ok [14:36:13] I don’t think that worked. [14:36:22] need -f or some shit [14:37:24] all of this point to not being able to survive the oncoming load of requests [14:37:56] may I merge https://gerrit.wikimedia.org/r/#/c/394047/ ? [14:38:09] well… can’t survive with celery only running on 3/9 machines [14:38:20] 9 machines ? [14:38:25] aaa you mean the new cluster [14:38:42] no I mean that celery is down on most machines cos of the redis restart [14:38:57] you think so ? 
it shouldn't be true [14:38:59] maybe hold off a few minutes on increasing workers, until we see what existing workers can do [14:39:04] akosiaris: wait [14:39:05] but lemme restart all of them [14:39:10] I just restarted [14:39:13] ah ok [14:39:16] or mid-restart [14:39:21] sorry, I’m confused. [14:39:40] IMO we demonstrated that the celery manager on most nodes died during the redis restart, and never recovered. [14:40:06] Do you have reason to think otherwise? [14:40:14] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 6.527 second response time [14:41:11] kind of [14:41:13] this https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All&from=1511951282292&to=1511951830300&panelId=85&fullscreen [14:41:32] if the manager was dead if would not spawn new children [14:41:46] but the number of celery workers is ok process wise [14:41:51] but lemme check per box [14:42:16] btw codfw is fine, right ? [14:42:19] this is eqiad only [14:42:25] akosiaris: yes, I think so [14:42:26] at least that's what I read from the graphs [14:42:28] ah no [14:42:39] scb2005-6 have been down too [14:43:14] akosiaris@scb1001:~$ ps auxw | grep celery |wc -l [14:43:14] 22 [14:43:19] akosiaris@scb1003:~$ ps auxw | grep celery |wc -l [14:43:19] 46 [14:43:21] as expected [14:43:29] I’ve been restarting... [14:43:34] destroying evidence :) [14:43:37] lol [14:43:42] Try scb2005- [14:43:43] 6 [14:43:45] ok looking into codfw then [14:44:16] This graph shows that scb2005 has been down since Nov 28, 19:24 [14:44:17] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511880233269&to=1511966573269&panelId=3&fullscreen [14:44:47] and scb2006 died at the same time, came back online, then was dead since 20:24 [14:45:08] so not related to the current problems [14:45:13] but rather yesterday's [14:45:27] anyway investigating that while you restart stuff in eqiad [14:46:03] Another important thing I just learned… the “* scores returned” graphs show *wsgi* activity, while “scores processed” shows celery activity. That threw me for a loop. [14:46:14] All services should be restarted now. [14:46:27] yes and overload is way before a job is even submitted [14:46:30] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:16] yep, I hadn’t internalized that info, though I probably heard it before. [14:47:18] celery on scb2005, scb2006 seems to be running... now why isn't it not doing anything [14:47:24] right? [14:47:26] argh [14:47:34] It does take a while to warm up [14:47:38] a few minutes. [14:48:00] ah, zombie processes [14:48:21] akosiaris: Want to bless this btw? https://gerrit.wikimedia.org/r/#/c/394060/ [14:48:28] I [14:48:36] I’m okay self-merging if you don’t have +2 [14:48:39] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 9.536 second response time [14:49:06] (03CR) 10Alexandros Kosiaris: [C: 031] Increase celery verbosity; use message format including timestamp [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394060 (https://phabricator.wikimedia.org/T181538) (owner: 10Awight) [14:49:12] go for it [14:49:56] so scb2005, scb2006 are fully of zombie process [14:49:56] akosiaris: So, the celery service restart failed due to zombies? 
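Annotation: the Gerrit change blessed just above (r394060, "Increase celery verbosity; use message format including timestamp") is aimed at the untimestamped "Connection to Redis lost: Retry (n/20)" lines noted earlier. The actual patch lives in the ores deploy repo; as an illustration only, a minimal Python logging setup in the same spirit looks like this.

```python
import logging

# Timestamped format so lines like "Connection to Redis lost: Retry (3/20) ..."
# can be correlated with Redis restarts; INFO level instead of the default WARNING.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s [%(name)s] %(message)s',
)
logging.getLogger('celery').setLevel(logging.INFO)
```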
[14:50:03] and the celery master has indeed failed there [14:50:17] it's dead and not reaping children I think [14:50:31] is that stat=S? [14:50:38] Z [14:50:44] Z Nov28 0:26 [celery] [14:50:47] that's a zombie ^ [14:51:29] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.017 second response time [14:51:36] Hmmm, on stat1003 ps auxxww shows stat=S, but the start time is 10:30 so they weren’t restarted. [14:51:45] stat1003 ? [14:51:52] oops *scb1003 [14:51:58] Seeing what happens if I restart manually [14:52:39] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 847 bytes in 9.032 second response time [14:53:19] now the question is why the master while still being alive it does not reap children [14:53:31] it has probably croaked internally on its own or something [14:53:33] stracing [14:53:51] akosiaris: That’s really creepy, `sudo service celery-ores-worker restart` works but scap —service-restart did not [14:53:56] it's stuck at trying to read from fd 138 [14:54:02] aha [14:54:13] need to see what scap does [14:54:26] but lemme finish the master being stuck thing [14:54:31] I’ll restart by hand where I can [14:54:55] ah it's one of those pipes [14:55:03] the ones we were seeing the other day [14:55:24] so it's stuck waiting to read from a probably dead child [14:55:35] yeah that should not happen [14:56:14] Only developers can get away with talking about dead children. [14:58:17] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#3796581 (10awight) [14:58:50] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#3796595 (10awight) [14:59:04] Why are we having so so many issues lately? [14:59:28] Zppix: we weren’t hugged enough as kids. [14:59:44] awight: ig ores needs a hug? [15:02:04] Too late, piss off :p [15:03:59] :( [15:05:39] awight: I 'll restart celery all over codfw btw [15:05:58] akosiaris: I’m doing that ATM [15:06:02] ah ok [15:06:30] Happy to stop, if you want to try anything specific [15:06:35] no no go ahead [15:06:45] I was debugging the stuck celery [15:06:54] but I think I got enough to reproduce [15:07:32] btw, overload errors are 0 now [15:07:38] seems like the celery restart worked [15:08:06] akosiaris: o/5 ! [15:08:17] I think we’re fine now. That was silly. [15:08:20] and we are serving more req/s than any other time right now [15:08:32] so we had indeed increases in traffic that caused issues [15:08:40] and with celery half-dead [15:08:44] we could not serve it [15:08:53] now, we need to figure out why... [15:08:59] I got a pretty good idea [15:09:25] which is that when the child (the worker) dies due to an exception [15:09:41] for some reason the master (the father) does not successfully reap it [15:09:51] and stays as a zombie process [15:10:25] that issue seems a bit silly though. as in celery should be able to handle zombie children [15:10:37] I need to review the code that spawns worker I guess [15:11:56] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Investigate overload condition, seems that we lose nodes - https://phabricator.wikimedia.org/T181634#3796634 (10awight) [15:12:38] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? 
- https://phabricator.wikimedia.org/T181621#3796649 (10akosiaris) https://gerrit.wikimedia.org/r/394060 [15:12:56] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796651 (10akosiaris) And yes let's send these logs to logstash!!! [15:13:05] akosiaris: if you want to dump notes about zombies into T181634, that’s probably our highest-prio takeaway. [15:13:05] T181634: Investigate overload condition, seems that we lose nodes - https://phabricator.wikimedia.org/T181634 [15:13:22] akosiaris: Also, we’re having a staff meeting now, to explain our silence. Feel free to join if you’re interested. [15:15:29] I got an already delayed meeting with my manager though [15:15:39] I 'll show up if I manage to finish in time [15:16:14] +1 no worries, we’re not talking about the outage until the end, if at all. [15:42:36] 10Scoring-platform-team (Current), 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796764 (10akosiaris) Now that I had some time to view those graphs (that is https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&fr... [15:59:45] Thanks for a great meeting all. I really like having y'all participate in discussions of how the team is managed and to drive proposals for what we should be doing :) [15:59:58] halfak: Mind taking the docs meeting w/o me? [16:00:05] I desperately need to relocate [16:00:06] I could. [16:00:08] OK [16:00:18] great. Back in 30 min [16:00:27] Roger :) [16:04:55] * halfak sits all alone in the docs meeting [16:05:01] * halfak twiddles thumbs [16:05:04] j/k working on docs :D [16:06:24] o/ srrodlund [16:07:36] https://www.mediawiki.org/w/index.php?title=Topic:U1vvmc0oparwh4dd&topic_showPostId=u2um3hg75v31sq73#flow-post-u2um3hg75v31sq73 [16:07:49] * halfak gets documentation work done while he hangs out. [16:11:02] brt [16:22:03] While I'm waiting for a colectivo, wanted to mention there's a logging patch on ores-prod to review [16:22:15] linky [16:22:21] boo [16:25:20] (03PS6) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) [16:25:51] halfak: I can't install aspell-is on ores-misc, what should I do? [16:26:03] Do we need to clean up sotrage? [16:26:49] what's the error? [16:27:59] (03CR) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation (034 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [16:28:08] Amir1, ^ [16:28:17] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3559974 (10Deniskamazur) Do you mean machine learning by saying artificial intellig... [16:28:37] halfak: dpkg is locked [16:28:55] when I deleted the dpkg lock files and run dpkg -a --configure, it stucks [16:29:16] got it. 
Will have a look [16:33:57] Amir1, see /srv/ores-compute-01-20170711 [16:34:02] Clean up your stuff there :) [16:38:38] done [16:39:58] halfak: cleaned everything and the problem still persists [16:40:05] https://www.irccloud.com/pastebin/1fFUWk3p/ [16:40:07] OK checking stuff [16:40:30] it get stuck at this place all the time [16:41:20] Is it a problem with the package? [16:41:36] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3796963 (10Ladsgroup) >>! In T174384#3796908, @Deniskamazur wrote: > Do you mean ma... [16:42:32] halfak: no, it's about something else I think [16:42:45] it would be a good idea to tear down everything and set it up again [16:42:53] but we would lose too much [16:43:29] maybe I should do it my laptop [16:43:32] Amir1, did you do some googling of the error? [16:43:42] We need to check with the exact same version of linux [16:43:44] yup, no result :( [16:43:45] Try on staging :) [16:44:04] oh yes, I forgot [16:44:07] thanks [16:46:57] OMG you can edit your favorites menu in phab [16:47:02] (03CR) 10Thiemo Mättig (WMDE): [C: 032] Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [16:47:02] * halfak adds "create paste" [16:48:31] 10Scoring-platform-team, 10Analytics, 10ORES: Enable ores::base on stat1006 - https://phabricator.wikimedia.org/T181646#3796974 (10Halfak) [16:48:44] (03Merged) 10jenkins-bot: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [16:48:49] halfak: ^ That's fancy [16:49:04] let me know when it's done, I want to do things (mwhahahaha) [16:49:47] I want to move all of my model building work to stat1006 so that we can free up more space on ores-misc :) [16:55:12] wiki-ai/editquality#13 (iswiki_reverted - b8a5595 : Ladsgroup): The build passed. https://travis-ci.org/wiki-ai/editquality/builds/309072360 [16:57:41] Amir1, looks like stat1005 works just fine. [16:57:49] Make sure to "nice" any long running processes there. [17:01:27] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3797048 (10Deniskamazur) Are you sure you need ml for this kind of task? I mean we... [17:18:41] back. ores seems to be holding quite well.. although score requests have dropped considerably [17:19:18] 2 hours ago we have peaks of 1.5k but now it's around 600 [17:19:36] please let this hold ... [17:21:53] Evil spirits. [17:22:03] I waited the 30 min without any vans stopping for me! [17:22:08] So I walked to town [17:38:50] akosiaris: If you’re around, could you do some rm’ing on the new cluster? [17:39:00] Per this workaround: https://phabricator.wikimedia.org/T181552#3796889 [17:39:17] halfak: This is the review I was asking for, https://gerrit.wikimedia.org/r/#/c/394060/ [17:39:34] awight: I am in a meeting, can it wait ? 
[17:39:40] akosiaris: yes [17:39:43] ok, thanks [17:40:26] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3797253 (10awight) @thcipriani OK thank you for the workaround. I'll note that I don't have permissions to do that mysel... [17:40:34] akosiaris: I realized I can work around myself by deploying a new revision... [17:41:39] nice [17:43:44] Amir1: Want to review either of these? https://gerrit.wikimedia.org/r/#/c/393822/ https://gerrit.wikimedia.org/r/#/c/394060/ [17:48:55] 10Scoring-platform-team, 10Wikilabels, 10Easy, 10Google-Code-in-2017: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015#3797305 (10Phantom42) a:03Phantom42 I will work on this [17:49:52] was in meeting sorry! [17:50:03] Now heading to lunch. Will review when I get back [18:01:36] (03CR) 10Awight: [C: 032] "Self-merging." [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/393822 (owner: 10Awight) [18:03:24] (03CR) 10Awight: [V: 032 C: 032] Remove unprovisioned servers [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/393822 (owner: 10Awight) [18:03:56] (03CR) 10Ladsgroup: [V: 032 C: 032] Increase celery verbosity; use message format including timestamp [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394060 (https://phabricator.wikimedia.org/T181538) (owner: 10Awight) [18:04:11] Amir1: thanks! [18:04:49] I’ll keep that one in my pocket cos I only have one chance to deploy per revision due to a scap bug :-/ [18:04:51] awight: halfak|Lunch: I have to leave for today unfortunately, I worked only four hours, will make it in the next days [18:05:00] Sorry [18:05:04] Enjoy the bitter cold :) [18:05:26] I'm turning into a white walker, it's bearable now :D [18:05:29] o/ [18:05:39] Being… undead solves a lot of things I’m sure [18:07:50] Amir1: see -dev please [18:09:56] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10awight) [18:10:00] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10awight) a:05awight>03None [18:10:19] akosiaris: Another fun blocker for when you’re unshackled ^ [18:14:05] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3797381 (10awight) a:05awight>03None [18:21:02] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review, 10Wikimedia-log-errors: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 - https://phabricator.wikimedia.org/T179830#3797410 (10awight) Confirmed th... [18:40:14] o/ [18:40:58] awight, anything left for me to review? [18:41:08] Naw [18:42:18] OK cool. I think I might push on some JADE stuff unless there's something else that is pressing. [18:42:55] Oh BTW, the meeting with srrodlun(d) was mostly about next steps for her and SIGDOCS. We're going to keep the weekly checkin but she's going to focus more on cloud for a while. 
[18:45:01] I proposed that the next thing that she does for us is a new audit where she helps us figure out what the next big impact item is. [18:45:04] awight, ^ [18:45:22] kk [18:45:29] I’m just whining in SoS [18:48:04] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10mmodell) This is very strange. I can't tell exactly what would be causing this. [18:48:55] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3797532 (10Halfak) We already have a machine learning model for predicting article... [18:51:17] * halfak read that as "winning SoS" [18:51:30] You'd need to Scrum all of the Scrums -- more than anyone else. [19:03:39] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review, 10Wikimedia-log-errors: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 - https://phabricator.wikimedia.org/T179830#3797629 (10awight) Strange, whe... [19:12:49] (03PS1) 10Awight: Protect Special:Contributions code from missing threshold [extensions/ORES] - 10https://gerrit.wikimedia.org/r/394109 (https://phabricator.wikimedia.org/T179830) [19:12:55] 10Scoring-platform-team, 10Global-Collaboration, 10MediaWiki-extensions-ORES: Hide ORES filters from Special:Contributions when thresholds aren't available - https://phabricator.wikimedia.org/T181666#3797654 (10awight) [19:13:30] halfak: Here’s some CR if you still want it, https://github.com/wiki-ai/ores/pull/236 [19:13:38] It’s just a cherry-pick from the CELERY_4 branch. [19:17:23] Will check it out [19:18:22] ORES still has the 'testwiki' hack/feature thing right? [19:18:37] which hack/feature thing? [19:18:40] legoktm: soort of—what are you trying to do? [19:18:43] We have a vagrant role [19:19:21] fully set up the ORES extension on my test wiki again, ideally without setting up all of ORES [19:19:38] legoktm: use the vagrant role [19:19:40] halfak: I think it took the rev id you passed it and returned it back flipped? [19:19:44] I don't use vagrant :/ [19:20:03] there’s a separate role for the service, if you want that local, but it’s not required. Without the ores-service role, you’ll be pointing to the production service. [19:20:38] If you do enable the ores-service role, you get “testwiki” although it won’t be called that. [19:20:43] ok, I'll just look at the role as documentatoin [19:20:53] legoktm: ok well at least use vagrant as a guideline for setting up your config, yeah [19:20:58] lemme point out the key piece [19:22:01] legoktm: https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/role/manifests/ores.pp#L10 [19:23:01] legoktm: Thanks, I didn’t realize that we actually do have testwiki enabled in production. e.g. https://ores.wikimedia.org/v3/scores/testwiki/123 [19:23:13] thanks [19:23:48] I will work on that after lunch :) [19:25:29] I recently added statistics to the RevIdScorer used in testwiki, not sure why I can’t get that to show up on production. [19:25:56] Without that, you won’t be able to do much with thresholds. You can hardcode them if you don’t care about the API fetching code. [19:25:59] awight, did we deploy that change? I think we intentionally did not just to be safe. 
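A quick way to settle the "did we deploy that change" question is to ask each environment for the model info directly. A sketch using the endpoints already mentioned in this log, assuming the usual v3 response layout (jq is only there for readability):

    # Compare beta and production: if "statistics" comes back empty or as an
    # error for testwiki, the RevIdScorer statistics change isn't deployed there.
    for host in ores-beta.wmflabs.org ores.wikimedia.org; do
      echo "== $host =="
      curl -s "https://$host/v3/scores/testwiki?models=damaging&model_info=statistics" \
        | jq '.testwiki.models.damaging'
    done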
[19:26:21] halfak: I thought it would at least be on beta… I guess not though. [19:26:33] Let's get it up on beta :) [19:26:38] halfak: Wouldn’t this do the job? https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info=statistics [19:27:17] Yup [19:27:21] halfak: Putting it on beta means merging to master, though, unless you think this is worth messing with branches? [19:27:44] ah which [19:27:50] Oh... uh... awight something is broken there [19:27:55] /o\ [19:28:16] Given https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info= [19:28:24] We should see something reasonable for https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info=statistics [19:28:37] Agreed [19:28:50] Also https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info=environment should return a 40X response. [19:29:31] Yeah... very broken for all models :) [19:29:52] Oh wait. this is weird. [19:29:56] models=testwiki :P [19:30:06] ymm https://ores-beta.wmflabs.org/v3/scores/?models=damaging&model_info=statistics [19:30:10] yeah like you just said [19:30:27] * awight wanders off, whistling [19:30:30] https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info= [19:30:35] got to relocate [19:31:59] Okay so all that happened is that beta has old code. [19:32:00] https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info=statistics [19:32:10] or should I say, https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info= [19:32:14] back in 15 [19:42:24] (03PS10) 10Catrope: Split WL and RC prefs for ORES [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392452 (https://phabricator.wikimedia.org/T180866) (owner: 10Petar.petkovic) [19:44:28] (03CR) 10Catrope: [C: 032] Split WL and RC prefs for ORES [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392452 (https://phabricator.wikimedia.org/T180866) (owner: 10Petar.petkovic) [19:50:41] (03Merged) 10jenkins-bot: Split WL and RC prefs for ORES [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392452 (https://phabricator.wikimedia.org/T180866) (owner: 10Petar.petkovic) [19:54:45] halfak: you’re right, the RevIdScorer enhancements were in ores and we declined to update. [19:54:50] Might as well do it though :) [19:56:43] (03PS1) 10Awight: Bump ores submodule [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394122 [19:57:03] (03CR) 10Catrope: [C: 032] Make it a bit easier to figure out that Range::combineWith() is used [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392910 (owner: 10Legoktm) [20:02:56] Logged on from my phone to say I'm AFK for ~ an hour for coffee with some profs [20:03:29] legoktm: RoanKattouw: This was merged prematurely, do either of you have time to confirm that this example code is doing what I think it is? https://gerrit.wikimedia.org/r/#/c/393945/ [20:03:41] (03Merged) 10jenkins-bot: Make it a bit easier to figure out that Range::combineWith() is used [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392910 (owner: 10Legoktm) [20:07:05] From what I understand, the lockTSE causes a mutex around the get and set within getWithSetCallback, so multiple threads don’t try to fetch new thresholds at the same time. [20:08:11] and pcTTL keeps a cached version in-memory so that we don’t recalculate multiple times in one request, due to cache fetch coming from a replica. 
[20:09:56] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3797904 (10Paladox) git-lfs is now supported :). See https://gerrit.wikimedia.org/r/#/c/394125/ [20:10:39] I’d also like to do something Krinkle suggested, where we check the cached value before it expires, and if the service is unreachable, put the old value back into the cache for some shorter TTL. [20:11:25] awight git-lfs is now available in gerrit :) [20:12:05] paladox: oh hey, that’s a game-changer! [20:12:13] heh :) [20:12:39] awight i can enable it on your repo if you want. But need the name and it will have to be merged by someone who has +2 on All-Projects. [20:13:12] paladox: Just musing about how to migrate onto that. I think we should start with a junk repo, if that’s not too annyoing. [20:13:21] ok [20:13:23] yeh [20:13:26] which one? [20:13:58] Are there already any infrastructure-testing repos laying about? [20:14:08] yep, we have gerrit-ping [20:14:12] we have been testing on [20:14:16] https://gerrit.wikimedia.org/r/#/admin/projects/test/gerrit-ping [20:14:46] awight here's my change https://gerrit.wikimedia.org/r/#/c/394125/ [20:15:59] 10Scoring-platform-team, 10MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), 10Patch-For-Review: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3797934 (10awight) So, my understanding is that the lockTSE causes a mutex around the get and set w... [20:20:09] paladox: trying to figure out how to clone the LFS files… interesting [20:21:18] ok [20:21:20] you clone it normally [20:21:24] awight ^^ [20:21:45] ah the change isn’t merged, is all. [20:21:59] (even better!) [20:22:30] awight the push command is the same [20:22:50] awight this https://github.com/git-lfs/git-lfs/wiki/Tutorial will help with getting it into lfs [20:23:00] but if you use git clone over ssh in gerrit [20:23:03] you will have to do [20:23:31] vi .git/config and add [20:23:55] [lfs "https://@gerrit.wikimedia.org/r/a/test/gerrit-ping.git/info/lfs"] [20:23:56] access = basic [20:23:56] locksverify = false [20:29:24] paladox: Only cat.bin has content, right? [20:29:35] yes, i only did it on content [20:30:24] That’s fantastic, it’s working! [20:30:42] :) [20:30:58] awight i can switch it on some of your repos if you want? [20:31:03] i can use regex too [20:31:27] 10Scoring-platform-team, 10ORES, 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3797980 (10demon) [20:31:35] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3797978 (10demon) 05Open>03Resolved a:03demon [20:37:07] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798012 (10awight) [20:38:18] paladox: I made a task for us to decide what to do :) ^ [20:38:44] The only big question is probably whether we should rewrite the history or create new repos. 
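For the "rewrite the history" option in T181678, git-lfs ships a migrate subcommand that rewrites existing commits so the large binaries become LFS pointer files. A sketch only: the include patterns are a guess at what the model repos actually contain, and --everything rewrites every local ref, so it should be run on a throwaway clone.

    # Work on a disposable clone; the original stays untouched until the
    # rewritten history has been verified.
    git clone https://github.com/wiki-ai/editquality.git editquality-lfs
    cd editquality-lfs

    # Rewrite all refs, converting the matching binaries into LFS pointers.
    git lfs migrate import --everything --include="*.model,*.bin"

    # Sanity checks: LFS now tracks the converted paths, and every commit
    # SHA has changed.
    git lfs ls-files | head
    git log --oneline -5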
[20:39:57] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798028 (10awight) [20:42:29] awight: rewrite so we can know who to blame for bugs [20:43:03] Zppix: good point, actually we’d be keeping a copy of the repos in either case [20:43:38] awight: be easier to keep it all in one place [20:45:39] ok [20:46:17] awight we can enable it, but dosen't mean you have to use it on the repo :) [20:46:39] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3798085 (10awight) [20:46:45] 10Scoring-platform-team, 10ORES, 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3798086 (10awight) [20:46:49] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798084 (10awight) [20:49:26] 10Scoring-platform-team, 10ORES, 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3798097 (10Paladox) 05stalled>03Open [20:51:57] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798104 (10awight) I'm guessing we want to do something like, # Copy repos to a read-only location. # Set LFS flags and metadata on repo (unknown) # git... [20:56:36] akosiaris: Seems that some of our servers didn’t come back after all: https://grafana-admin.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=1511879596857&to=1511925593659 [20:57:04] scb1004, scb2001, scb2005, scb2006 [20:57:07] halfak: Amir1: ^ [20:57:46] service status looks fine. [20:58:31] CPU is at our 20% baseline for all of those! [21:00:55] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3798137 (10thcipriani) Hrm. I think this error probably has something to do with ssh client timeout. I'm not sure if anything rece... [21:01:47] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798142 (10demon) They'll mirror just fine since Phabricator just observes upstream. [21:02:02] awight: yes, that looks fine to me [21:02:03] Celery workers are… working. But nothing in the metrics. [21:02:07] I wish admin links for granfana would auto redirect to non-admin link for non admins :/ [21:02:11] Grafana* [21:02:17] legoktm: Cache patch? Awesome, ty. [21:02:28] Zppix: oops, ty for the reminder [21:02:44] Oh i didnt realise you sent an admin link lol [21:02:57] I was just thinking out loud [21:03:12] I’m definitely a culprit. There in the backscroll I did exactly that. [21:03:18] redirecting would be cool. [21:03:20] I see that now [21:03:28] Hmm i wonder if its possible [21:03:45] I mean we probably have to hack it on our grafana install [21:04:07] Probably would be a traffic question... [21:04:15] Who is the traffic team? [21:05:37] halfak: Ping when you’re back from the meeting, I don’t understand WTF. 
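When a node looks dead in Grafana but "service status looks fine", the host-side checks being described here boil down to roughly the following; the systemd unit name is an assumption about how puppet names the ORES celery service on the scb hosts, so adjust it to whatever systemctl actually lists:

    # Pick one of the quiet nodes and look at it directly; individual celery
    # workers cycling up to ~1 CPU and back down is the normal pattern.
    ssh scb1004.eqiad.wmnet "systemctl status celery-ores-worker --no-pager; ps aux | grep '[c]elery' | head"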
[21:06:01] and, an epic rainstorm is drifting in my direction so if I slam the computer closed I’ll bbiab [21:06:34] I wonder if i should open a task asking for that redirect or just straight up ask in IRC... awight [21:06:51] Zppix: A task is probably best, since it’s not trivial. [21:07:11] Ugh, ill do it l8r [21:07:38] lol taskophobia, I understand [21:08:03] back! [21:08:07] awight: its an traffic/ops task if its not perfect no one will care xD [21:08:46] Plus gci has my attn atm [21:08:47] halfak: Celery metrics are zero on 4 of our servers again. [21:08:59] halfak: But the workers are doing things. [21:09:02] We need that logging... [21:09:08] I’ll deploy that now [21:09:09] awight, 100% cpu on them? [21:09:12] +1 [21:09:18] no, totally normal CPU. 20% [21:09:23] wtf [21:09:40] individual workers spin up to one CPU and back down, as expected. [21:09:40] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3798203 (10Deniskamazur) Oh, sorry, didn't get the task. Sry for the dumb question [21:09:50] K I’ll deploy the logging [21:09:53] beta-fitrst [21:09:55] *first [21:10:42] awight: the link you pasted , aka https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=1511879596857&to=1511925593659 is in the past [21:10:52] OMG [21:10:52] to Nov 29, 03:1:53 UTC [21:10:57] Lmao [21:11:00] Thanks, I was refreshing like crazy [21:11:29] and with that, I am going to sleep [21:11:31] halfak: akosiaris: Confirmed, workers are just fine. [21:11:32] byez [21:11:37] o/ [21:11:41] lol [21:11:49] was just looking and wondering :) [21:11:59] ah rain storm gimme 5 [21:12:03] That pattern of timespans is a little counter-intuitive. [21:12:18] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3798206 (10Deniskamazur) I couldn't find the current accuracy of the models, where... [21:12:49] akosiaris: night :) [21:13:09] Ah the GCI students are picking up our tasks [21:13:47] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3798212 (10Halfak) https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&model_i... [21:16:28] awight, rain storm cuts the internets? [21:17:47] I was chilling on a sunny lawn :) [21:17:54] no quarter [21:18:04] ha gotcha [21:18:31] halfak: Hey paladox just showed me that git-lfs is ready to rock! [21:18:37] :) [21:18:49] Nice was just looking at that. [21:18:54] You want it or can I take a pass on it? [21:18:59] All you! [21:19:00] T181678 [21:19:00] T181678: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678 [21:19:20] lol I think “you want it” == “can I take a pass on it" [21:19:26] git lfs migrate is going to make this way easier than re-writing the git tree used to be. [21:19:30] prepositions... [21:19:47] yeah, glad they thought to include such an essential function [21:19:56] Oh yeah... taking a pass could mean making an attempt or skipping an attempt. [21:20:29] I spent a couple of hours developing a filter branch call last time. 
[21:20:58] “pass at” I think [21:21:14] not actually being didactic… I just thought it was funny [21:21:38] note it has to be switched on this repo [21:22:20] halfak: Logging works! [21:22:40] \o/ nice! [21:24:35] Cool, I’ll roll that to production for fun. Want to throw the ores submodule bump in for good measure? [21:24:51] +1 [21:25:33] (03CR) 10Awight: [V: 032 C: 032] Bump ores submodule [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394122 (owner: 10Awight) [21:34:40] halfak: https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info=statistics [21:35:01] \o/ [21:35:09] legoktm: If you point your E:ORES at ores-beta testwiki, you’ll get normal thresholds behavior now. [21:35:27] https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info=statistics.thresholds.true.%22maximum%20precision%20@%20recall%20%3E=%201.1%22 [21:35:30] :DDD [21:35:35] lol [21:35:45] It doesn't work in the exact right way :D [21:36:35] halfak: Good point, well it does but you have to give preposterous constraints. [21:36:53] Right. That's good and useful though :) [21:37:05] For testing, you'll want to have it [21:37:26] $wgOresFiltersThresholds = array( "damaging" => array( "likelybad" => array( "max" => 1, "min" => "recall_at_precision(min_precision=0.45)" ), "likelygood" => array( "max" => "recall_at_precision(min_precision=0.99)", "min" => 0 ), "maybebad" => array( "max" => 1, "min" => "recall_at_precision(min_precision=0.15)" ) ) ); [21:37:47] I had to flip some things around in unrealistic ways. [21:37:56] legoktm: ^ you’ll need that secret decoder setting as well. [21:38:06] $wgOresModels = array( "damaging" => true, "goodfaith" => false, "reverted" => false, "wp10" => false ); [21:38:12] $wgOresWikiId = “testwiki” [21:38:58] We could change RevIdScorer to look more like a normal curve I suppose, but that would probably ruin the nice deterministicicity [21:39:56] * awight curses English for not allowing infinite suffixation [21:40:03] determinism [21:41:31] "migrate: Rewriting commits: 64% (258/399)" [21:41:33] :DDD [21:41:39] holy cow [21:41:41] * halfak destroys history [21:41:55] You can’t! I remember it alllll.... [21:42:14] git lfs migrate awight --everything [21:42:22] lol [21:42:28] * awight replaces self with an unlinked local clone [21:43:19] halfak: fyi paladox needs to flip some bits before you push any LFS repos. Also, I’d suggest we snapshot the repos into an archive. [21:43:26] Your branch and 'origin/master' have diverged, [21:43:27] and have 399 and 399 different commits each, respectively. [21:43:31] lol [21:43:35] /o\ [21:43:40] awight which repos do you want it enabled on? [21:44:01] halfak: You agree with the list in https://phabricator.wikimedia.org/T181678 ? [21:44:05] editquality, draftquality, wikiclass, ores-deploy-wheels [21:44:07] * halfak looks there [21:44:11] that was it. [21:44:35] Cool. Need the phab links? [21:44:53] The internet thinks that I just need to push this to github and we're done. [21:44:55] paladox: ^ ? Note that only one of those uses gerrit as the master [21:45:07] halfak: Where are you archiving the unrewritten repos? [21:45:08] ah thanks. [21:45:14] i guess research/ores/wheels [21:45:20] Locally on my disk. [21:45:21] yep [21:45:25] harrr [21:45:36] Just went for coffee, didya [21:45:38] Want a backup somewhere on WMF servers? 
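On the backup question: a bare mirror clone keeps every ref and compresses well, so snapshotting the pre-rewrite history can be as simple as the sketch below. The repo names and the bz2 compression come from this log; the output paths are placeholders.

    # Archive the pre-rewrite history of each repo before pushing anything.
    for repo in editquality draftquality wikiclass; do
      git clone --mirror "https://github.com/wiki-ai/${repo}.git" "${repo}-pre-lfs.git"
      tar -cjf "${repo}-pre-lfs.git.tar.bz2" "${repo}-pre-lfs.git"
    done
    # Copy the tarballs somewhere durable, e.g. the public-datasets area that
    # is published at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/ores/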
[21:45:45] yes pls [21:46:01] I'll put it on stat1006 :) [21:46:01] awight: Re lockTSE and stuff, Aaron Schulz and Krinkle are the experts, I suggest asking them [21:46:08] In the public datasets repo [21:46:09] awight https://gerrit.wikimedia.org/r/#/c/394179/ [21:47:23] * halfak compresses a massive file for upload [21:47:24] paladox: So what happens with the mirrored repos? How is the LFS mirrored over? Do we have to do any configuration, you think? [21:47:33] halfak: You could just clone directly there [21:47:34] University intertubes FTW [21:47:41] Damn... that's a good idea. [21:47:41] haha I see [21:47:47] it should work with a normal git push i think. not sure though [21:48:00] i only know git-lfs works with local git pull, not sure about mirrors [21:48:09] What could possible go wrong 8D [21:48:13] lol [21:49:50] * awight jumps in fright [21:52:02] * halfak waits for the old repo and history to compress for later before trying anything. [21:52:15] I'll be pushing something to github shortly though [21:54:29] That’s awesome. [21:57:41] halfak awight lfs is now enabled [21:57:50] Great! [21:57:53] https://gerrit.wikimedia.org/r/#/admin/projects/research/ores/wheels [21:58:55] The editquality repo is only 2.2GB compressed! [21:59:22] If someone wants to review phantoms' pr and let me know that would be great (hes a gci student) [22:00:06] My phone was annoying me [22:00:10] i gave the repo 3gb of lfs. we can increase it if you need to me. [22:00:14] Zppix, had a look at it. [22:00:23] paladox, we'll need more than that [22:00:34] halfak ok, how much more? [22:00:37] Oh wait... not for wheels [22:00:45] But for editquality we will need more -- like 20GB [22:01:03] ah ok [22:01:30] And? [22:01:35] ok i will enable lfs on [22:01:42] mediawiki/services/ores/editquality [22:01:58] Hmm... that is the gerrit repo but we don't use that at all. [22:02:02] Just the phab one [22:02:16] 10Scoring-platform-team, 10Operations, 10monitoring, 10Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3798373 (10awight) A slightly related request--it looks like /srv/log/ores/main.log is created by modules/service/manifests/uwsgi.pp, it would... [22:02:20] https://phabricator.wikimedia.org/source/editquality/ [22:02:49] 10Scoring-platform-team, 10Wikilabels, 10Easy, 10Google-Code-in-2017, 10Patch-For-Review: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015#3798374 (10Phantom42) I just published [[ https://github.com/wiki-ai/wikilabels/pull/212 | g... [22:02:57] awight hfalfak https://gerrit.wikimedia.org/r/#/c/394198/ [22:03:02] halfak ^^ [22:03:03] Signing off for the day. [22:03:50] o/ awight [22:03:55] Have a good evening [22:04:19] paladox, will it matter that we don't use gerrit there? [22:04:44] I think so. as the lfs objects are stored in a seperate repo in gerrit. [22:04:49] though im not sure [22:06:37] So it all gets mirrored from github. Is it going to be a problem if I push my lfs rewrite now? [22:08:28] paladox, ^ [22:08:30] I could wait [22:08:39] it should not be a problem :) [22:08:47] though i only tested if lfs pushed correctly [22:08:55] you should try :) [22:09:02] git push origin HEAD:refs/for/master [22:09:22] https://github.com/git-lfs/git-lfs/wiki/Tutorial [22:15:06] OK trying it out! [22:16:19] Looks like I'm uploading 5.5GB for editquality -._o_.- [22:16:27] Not sure why the repo is way bigger when cloned. [22:18:27] heh editquality ? 
[22:18:35] the change for that repo needs to be merged [22:18:46] you can do it in the research repo though. [22:19:01] 3gb limit. though can be increased if needed. [22:19:10] halfak ^^ [22:20:32] "research repo" [22:20:33] ? [22:20:58] The wheels one is gerrit only. [22:21:03] No github involvement there. [22:21:26] I'll get a patchset for wheels ready [22:21:28] wheels [22:36:52] * halfak submits an N commit patchset [22:36:57] where N is large [22:38:46] This might not finish before I must leave. [22:38:47] :| [22:42:32] I have three terminal windows open. One is pushing a huge patchset for wheels. One is uploading an even more huge history to the wiki-ai/editquality and another is compressing the huge history of wiki-ai/wikiclass :) [22:44:12] Halfak your cpu dead yet? [22:44:23] Na man. Still turning. [22:44:30] SSD i hope? [22:44:58] yup [22:45:15] 10Scoring-platform-team, 10Wikilabels, 10Easy, 10Google-Code-in-2017, 10Patch-For-Review: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015#3798513 (10Phantom42) However, right now CI build is failing. That's because some of tests I... [22:45:59] Almost done with wiki-ai/editquality :) [22:46:38] Halfak on T179015 needs your opinion (they want a DB accessiable for CI testing) [22:46:39] T179015: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015 [22:47:01] * paladox has a ssd [22:47:03] Thanks. Not going to be able to respond today :\ [22:47:07] it's a flash drive they call it :) [22:47:56] https://github.com/wiki-ai/editquality is updated [22:48:07] paladox, now we get to see what diffusion does [22:48:12] heh [22:57:29] gerrit errors "no common ancestry" [22:57:30] :D [22:57:39] Looks like we might have to have someone push this manually. [22:57:43] I'll look into that tomorrow. [22:58:33] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798554 (10Halfak) Trying start a gerrit review for wheels. Got this: ``` Do you really want to submit the above commits? Type 'yes' to confirm, other... [22:59:47] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798556 (10Halfak) Putting repo backups here: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/ores/ I'm editquality and draftquali... [23:00:17] Wikiclass is at 13GB so far O_O [23:00:34] bz2 compressed. [23:00:34] Still growing [23:00:56] I really want to close my laptop and bike away but I didn't put this into a screen >:( [23:02:55] halfak git push origin HEAD:refs/for/master [23:03:22] In gerrit? I could try that [23:03:28] I'm guessing I don't have the permissions though [23:05:59] Damn. Looks like the migration script doesn't work quite right [23:07:48] paladox, same error [23:11:46] oh [23:12:14] oh i see halfak ! [remote rejected] HEAD -> refs/publish/master/git-lfs-migration (no common ancestry) [23:12:25] yeh probaly needs force push or do it for a different branch [23:14:28] OK. I'm going to run away now. 
Will try to find someone who can force-push tomorrow ^_^ [23:14:35] Thanks for your help, paladox [23:15:00] you're welcome :) [23:17:56] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798581 (10Halfak) https://github.com/wiki-ai/draftquality is fully updated.
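For the record, the force-push being punted to tomorrow would look roughly like this. A sketch that assumes a local clone containing the rewritten wheels history; it needs the Gerrit "Force Push" permission on refs/heads/* because it bypasses review entirely.

    # The rewritten LFS history shares no ancestor with what Gerrit has, so a
    # review push to refs/for/master is rejected with "no common ancestry".
    # Someone with direct push rights pushes the branch itself instead:
    git push --force origin HEAD:refs/heads/master

    # Everyone else then has to re-fetch and hard-reset, since all commit
    # SHAs changed in the rewrite.
    git fetch origin
    git reset --hard origin/master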