[00:00:03] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 9.035 second response time [00:00:13] > awight: stopped beta ores. [00:01:00] One more test. [00:01:35] * halfak logs [00:01:48] yep this is working. [00:02:05] let me know when you re-enable :) [00:02:07] I checked the requests by restarting the service right away. The S:RC page doesn’t recover for 1 minute [00:02:09] I did [00:02:21] kk [00:02:22] ty :) [00:02:25] n/p [00:02:36] YAS [00:02:38] * halfak is running backup wherever he can :) [00:02:42] S:RC recovered. [00:02:46] Its !log deployment-prep fyi halfak [00:02:55] So now I really need to relocate, back in 5 [00:03:07] yarg [00:03:41] Get it right for once geeez xD halfak [00:03:52] :P Zppix [00:03:54] Still around [00:03:56] This is a long day [00:04:01] I'm so ready to be done with it. [00:04:29] Ores is been such a whiny b#### lately [00:04:46] Na. ORES is OK. We need to spend more focus on ORES Ext. [00:05:20] I neglect it with scheduling work because I am not much of a contributor to the ORES ext [00:05:22] actually ORES infra needs some love as well [00:05:30] Well that isn't wrong. [00:05:32] Yea [00:05:44] things like uwsgi logging at various levels and not just INFO at logstash [00:05:51] akosiaris, if you have some to-do's that's something I'll tackle. [00:05:55] or splitting the workers for the HTTP hosts [00:06:10] web vs. celery? [00:06:18] yeah, that's a ball in my court [00:06:18] I'd love that. I need $$ :) [00:06:22] We do that in labs :D [00:06:30] s/labs/cloud/ [00:06:37] I know. And hardware wise we are ready to do that in production as well [00:06:45] as soon as the stress tests are done [00:06:55] akosiaris, next FY, let's collab on a request for a few dedicated web nodes. [00:07:06] I had no idea these stress tests would be so involved! [00:07:18] actually we are gonna collab for moving to kubernetes :P [00:07:20] But I'm stoked about the boost in capacity that it looks like we can get :) [00:07:22] Oh yeah [00:07:23] That [00:07:25] Sure :) [00:07:44] Just so long as you're down for carrying some of that weight. :D [00:07:53] a lot of the weight probably [00:08:01] How long until we start experimenting with that? [00:08:13] but at least it will be a goal for department & team so I will be able to justify the time [00:08:27] cause that's one resource I don't have a lot of unfortunately [00:08:44] so, next Q is the first application running on top of kubernetes [00:08:56] whoa I missed something fun. [00:08:57] with the point of learning as much as possible from the migration [00:09:05] kubernetes! [00:09:10] We're going to migrate right now ;) [00:09:12] infrastructure + development wise [00:09:13] loool [00:09:34] We decided that none of us sleep until k723642s is working [00:09:47] As long as we can do WMF budgeting in PAWS [00:10:01] OMG PAWS. I need to write that proposal. [00:10:02] so, second quarter of next year we can probably have a real shot at a migration [00:10:15] OK that sounds good to me. I'll keep in mind :) [00:10:31] but anyway, should we deploy the patch to ores:ext ? [00:10:32] In the short term, make ores work, medium, make ORES cluster work, [00:10:38] from what I gather it worked on beta ? [00:10:44] Right so awight has something mostly ready [00:10:48] I just have a couple questiosn. [00:11:09] Why lockTSE of 10? 
[00:11:16] The examples all set it to 30 [00:11:17] (03PS1) 10Awight: Revert "Fallback to old thresholds API as necessary (take 2)" [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393959 (https://phabricator.wikimedia.org/T179602) [00:11:20] (03PS1) 10Awight: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393960 (https://phabricator.wikimedia.org/T181567) [00:11:21] Is this some sort of wait>? [00:11:40] I think those are seconds. [00:11:48] yeah I cribbed directly from examples, I’m not gonna try that tonight. [00:11:49] OK I was guessing that. [00:12:11] OK I guess I just didn't see the examples with 10. [00:12:13] I’d prefer to read the code, etc. Feel free to try to decode what the heck is going on in WANObjectCache, though! [00:12:45] Yeah. That's a big mess :| [00:12:53] Well, not a mess. Just deep [00:12:56] Maybe a mess too [00:13:11] Smoke test says a lot. [00:13:16] (03CR) 10Awight: [C: 032] "Self-merge backport." [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393959 (https://phabricator.wikimedia.org/T179602) (owner: 10Awight) [00:13:16] What did you confirm in the smoke test? [00:13:21] (03CR) 10Awight: [C: 032] "Self-merge backport." [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393960 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:13:42] https://gerrit.wikimedia.org/r/#/c/393945 BTW [00:13:50] for akosiaris if you speak MW PHP [00:14:09] jenkins doesn't like something [00:14:39] oh my MW PHP is very bad [00:14:46] but I 'll have a look [00:14:57] Probably about as good as mine [00:15:17] I used to write a lot of PHP back in... 4? When did PHP get new style classes? [00:15:22] halfak: So what I confirmed was * code doesn’t blow up in either sucess or failure branches, * when the service goes down and we want a threshold, we correctly put [] in the cache with TTL 60s. * Requesting the page within that window doesn’t cause any more service requests, and * Once the 60s is up, the extension tries again. If it fails, it caches another [], if it succeeds then we’re back to 1 day caching of good thresholds. [00:15:24] 5 [00:15:37] lol /me wonders if halfak is thinking Perl [00:15:38] Right at the beginning of 5 is when I checked out. [00:15:41] nope [00:15:47] lol [00:15:52] other $var lang. [00:16:06] Fatal error: Cls: Expected string or object in /home/jenkins/workspace/mwext-testextension-hhvm-jessie/src/extensions/ORES/includes/Stats.php on line 121 [00:16:43] Sooo much useless punctuation in PHP [00:16:57] huh [00:17:37] => ? [00:17:38] akosiaris: where’s that from? 
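Annotation: the smoke test awight describes above is the extension's threshold fallback behavior: a successful thresholds fetch is cached for a day, while a failure caches an empty result for 60 seconds, so at most one upstream request per (minute x model x wiki) is retried, matching the patch title. The real code is PHP built on MediaWiki's WANObjectCache; the following is only a minimal Python sketch of that negative-caching pattern, with every name (fetch, _cache, get_thresholds) hypothetical.

```python
import time

GOOD_TTL = 24 * 3600   # successful thresholds cached for a day
ERROR_TTL = 60         # failures cached for 60s: at most one retry per minute

_cache = {}  # key -> (expires_at, value); stand-in for memcached/WANObjectCache

def get_thresholds(wiki, model, fetch):
    """Return thresholds for (wiki, model), rate-limiting upstream failures.

    `fetch` is a callable that talks to the ORES service and may raise.
    """
    key = ('thresholds', wiki, model)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]            # cached value, possibly the empty "failure" marker
    try:
        value = fetch(wiki, model)
        _cache[key] = (time.time() + GOOD_TTL, value)
    except Exception:
        # Negative caching: remember the failure briefly so page views within
        # the next minute do not hammer the service again.
        value = []
        _cache[key] = (time.time() + ERROR_TTL, value)
    return value
```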
[00:17:57] awight: last few lines of https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/24606/console [00:19:24] (03Merged) 10jenkins-bot: Revert "Fallback to old thresholds API as necessary (take 2)" [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393959 (https://phabricator.wikimedia.org/T179602) (owner: 10Awight) [00:20:28] (03Merged) 10jenkins-bot: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.8) - 10https://gerrit.wikimedia.org/r/393960 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:21:51] omg I copypasta’d straight out of an example, smh [00:22:28] (03PS2) 10Awight: Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) [00:23:06] halfak: Thanks for the company. The extension backports are ready to deploy. [00:23:50] Ha! OK [00:24:18] * halfak waits on Jenkins [00:24:32] awight, did that code run in the smoke test? [00:24:46] No, we ran master. [00:25:00] I don’t think we can easily put beta on this branch. [00:25:48] the… tests passed is all I can say about the branch,. [00:26:37] OK so we need to make a call. Gotcha. Once we merge you can smoke test? [00:26:55] I could just run this branch locally... [00:27:12] Lemme to an abbreviated test, just general smoke rather than the specific cache thing. [00:27:14] *do [00:27:19] OK [00:28:43] hooks are firing correctly. [00:29:51] “scap sync-vile” [00:29:57] ARISE [00:30:35] jenkins is taking it's sweet time on https://gerrit.wikimedia.org/r/#/c/393945/ [00:30:53] It really is [00:31:04] Here goes. [00:32:54] (03CR) 10Halfak: [C: 032] Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:32:58] Done [00:33:03] oh shit [00:33:10] k we can look at that on beta [00:33:19] I just deployed the parent commit though [00:33:26] Sorry I misunderstood. Might be better? [00:34:14] Sure, we can try that after some local testing [00:34:25] so far, the more basic patch is holding its own though [00:36:27] Actually, there haven’t been any ORES 400s since 00:08, so we wouldn’t know. [00:36:52] We need the error window. [00:37:10] Meanwhile, I need to backport to wmf.10 to say I did. [00:37:11] oh... huh [00:37:31] Why not. Weird! [00:38:11] (03PS1) 10Awight: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.10) - 10https://gerrit.wikimedia.org/r/393966 (https://phabricator.wikimedia.org/T181567) [00:38:21] just to screw with us, apparently. [00:38:30] I think ORES may be dating SKYNET [00:38:43] But this is MW :P [00:38:51] (03Merged) 10jenkins-bot: Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:39:09] (03CR) 10Awight: [C: 032] "Self-merging backport." [extensions/ORES] (wmf/1.31.0-wmf.10) - 10https://gerrit.wikimedia.org/r/393966 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:39:19] I think the error window is starting [00:39:42] per https://grafana-admin.wikimedia.org/dashboard/db/ores?panelId=2&fullscreen&orgId=1&from=now-30m&to=now-1m [00:39:51] wheee! [00:39:57] throw your hands in the air [00:40:09] the rest of the graphs in that dashboard are for the first time at close to 1k scores btw [00:40:34] No 400s [00:41:52] akosiaris: Can you pull user agents easily? 
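Annotation: the user-agent tallies akosiaris posts a few minutes later ("1767 MediaWiki/1.31.0-wmf.8", "1640 ChangePropagation/WMF") answer the question just asked; the log does not show how they were produced. As an illustration only, a tally like that can come from any access log that records the User-Agent field. The file path and combined-log-format assumption below are hypothetical, not the actual production pipeline.

```python
import re
from collections import Counter

# Last two quoted fields of a combined-format line are "referer" "user-agent".
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

counts = Counter()
with open('access.log') as log:        # hypothetical path
    for line in log:
        m = UA_RE.search(line)
        if m:
            counts[m.group('ua')] += 1

for ua, n in counts.most_common(10):
    print(f'{n:6d} user agent "{ua}"')
```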
[00:41:54] Looks like we have a big increase in requests. [00:42:11] Not external. This is precaching [00:42:14] (03Merged) 10jenkins-bot: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] (wmf/1.31.0-wmf.10) - 10https://gerrit.wikimedia.org/r/393966 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [00:42:21] halfak: yes [00:42:31] WThorridF [00:42:48] Did some continent just wake up? [00:43:30] awight: yeah give a min [00:43:46] Europe is starting to get very early morn (~midnight) halfak [00:43:50] akosiaris: not necessary, thanks [00:44:04] Maybe east asia [00:44:07] akosiaris: The graphs are showing that it’s legitimate internally-generated requests. [00:45:05] 1767 user agent "MediaWiki/1.31.0-wmf.8" [00:45:05] 1640 user agent "ChangePropagation/WMF" [00:45:10] halfak: waidaminute. These are scores errored, but nothing is overloaded [00:45:14] that's the 2 top since 00:30 [00:45:16] what does that mean? [00:45:18] akosiaris: ty! [00:46:04] we have someone who sends just "Mozilla" as a user agent [00:46:08] halfak: Not timed out either?? [00:46:12] cause that's believable.. lol [00:46:30] akosiaris: that could be any amnt of browsers/clients [00:46:31] akosiaris: We should email Mozilla and give them a piece of our minds [00:46:55] awight: be nice... now Mozilla didnt mean jt xD [00:47:16] Zppix: no not really. It's just Mozilla nothing like https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent/Firefox [00:47:52] instead of "Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" it's .... "Mozilla" [00:48:03] so it's 99% phabricated and not real [00:48:15] I am leaving a 1% just to play it safe [00:48:38] anyway halfak awight things are finally looking good ? [00:49:10] I think so. halfak just realized, the “scores errored” is incredibly low, barely a handful. [00:49:14] I'm looking into it. [00:49:35] Oh damn. KK [00:49:56] 400s are gone so I’m not going to rush that last cache patch. [00:51:32] 10Scoring-platform-team, 10MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), 10Patch-For-Review: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3795157 (10awight) Still need to better understand, and test, https://gerrit.wikimedia.org/r/#/c/39... [00:51:47] ok, I am going to re-enable celery on scb1002 and scb1001 [00:52:21] OK [00:53:03] Might hit memory errors again though :/ [00:53:22] yeah that's what I want to see [00:53:26] This is random, but if there was some underlying thing which was causing off-schedule memcache purges for our thresholds data (I’m suspicious of that), then we’ll start hearing complaints about enwiki etc. not having ORES features for 1min at a time now. [00:53:41] akosiaris: What time zone are you in? [00:53:47] UTC+2 [00:53:59] Man. Late night for you [00:54:05] yes [00:54:09] I'm tired and I'm UTC-5 [00:54:10] ACK [00:54:15] I don't know what tired is [00:54:16] I say good night, sir! [00:54:22] Right on. [00:54:31] I could go for a few more hours lol [00:55:10] I 'll watch memory for a few more mins on those 2 hosts and go to bed if all is ok [00:55:35] +1 [00:55:50] awight, we are due for another cycle if the hourly cadence holds. [00:56:19] No problem, I can lurk for a bit. [00:56:44] 4(noping)pergos gave me some homework I need to get started on :D [01:03:27] Still nothing [01:03:36] * halfak squints [01:05:04] memory at scb1001, scb1002 looks stable at ~23GB [01:06:03] so here's a good question. 
Should we stick with the no persistence for queue redis or not ? [01:06:57] it does look like ORES can survive just fine losing the queue [01:06:59] IMO persistence is helpful for the score cache, but unwanted for the celery cache [01:07:18] +1 [01:07:41] But all we lose if we have no persistence for either is /somwhat/ (unknown) higher load until the cache warms up again [01:08:15] Right. Not a huge problem [01:08:52] ok then. I 'll upload a patch tomorrow to remove the queue persistence [01:08:54] Oh I guess we can guess the extra load just from the cache hit ratio. [01:10:00] Precaching (which obviously does not involve cache hits) is the dominant request load. [01:10:02] halfak: cache hit rate averages as high as 75%. I think that’s 4x the load if we lose the cache. [01:10:22] Na. Changeprop isn't include in cache hit rates :) [01:10:29] ooh [01:10:33] changeprop is about 10x requests themselves [01:10:33] sneaky [01:10:51] Also mediawiki maintains its own cache [01:11:39] So at 7% cache hit rate, the cache barely makes any improvement. [01:11:54] It’s just useful for research end-users. [01:12:43] While you’re chasing mirages: I think this is a book I saw once. Really bizarre but fun, https://www.amazon.com/Mirages-Anomalous-Rainbows-Electromagnetic-Phenomena/dp/0915554127 [01:14:14] awight, it's speed for patrollers :) [01:14:18] speed is really importanty [01:14:29] ah right so Huggle? [01:14:32] Right [01:14:36] neat. [01:14:44] You load up huggle and it's wants the score for the last N revisions [01:14:45] BAM [01:14:47] you've got it. [01:15:12] I think the window passed... score rates have dropped and we suffered no issue [01:15:32] akosiaris: that was above and beyond… I hope you get to sleep in. [01:15:42] and that's my queue :) [01:15:48] good night and good luck [01:15:49] Right thanks all. [01:17:17] I'll keep some light monitoring going. [01:32:10] (03CR) 10Krinkle: Rate limit thresholds failures to once per (minute x model x wiki) (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [01:38:11] (03CR) 10Krinkle: Rate limit thresholds failures to once per (minute x model x wiki) (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [02:59:11] hi again [02:59:18] I'll try and work on the poolcounter stuff tonight [04:41:45] (03PS1) 10Prtksxna: Unify BetaFeatures SVG screenshot markup [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393989 (https://phabricator.wikimedia.org/T180427) [04:54:23] Checking in. ORES looks good. [07:18:32] (03CR) 10VolkerE: [C: 04-1] Unify BetaFeatures SVG screenshot markup (033 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393989 (https://phabricator.wikimedia.org/T180427) (owner: 10Prtksxna) [08:02:42] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3795673 (10akosiaris) [08:02:45] 10Scoring-platform-team (Current), 10Operations, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3795671 (10akosiaris) 05Resolved>03Open Re-opening per the following: After a brief discussion in #wikimedia-ai at ~01... 
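Annotation: the cache-loss arithmetic discussed around 01:08-01:12 follows from a simple formula: if the hit rate is h, losing the cache multiplies backend scoring load by 1/(1 - h). That reproduces both figures quoted above: the "4x" at a 75% hit rate, and why the ~7% effective rate (once ChangeProp precaching and MediaWiki's own cache are accounted for) "barely makes any improvement".

```python
def load_multiplier(hit_rate):
    """Backend load factor if the cache disappears: every request becomes a miss
    instead of only the (1 - hit_rate) fraction that misses today."""
    return 1 / (1 - hit_rate)

print(load_multiplier(0.75))  # 4.0   -> the "4x" figure
print(load_multiplier(0.07))  # ~1.08 -> why 7% "barely makes any improvement"
```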
[10:10:55] (03PS4) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) [10:11:03] (03CR) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation (037 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [10:11:59] (03CR) 10jerkins-bot: [V: 04-1] Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [10:20:23] heads up we are going to have a minor downtime in scorings as I deploy https://gerrit.wikimedia.org/r/394022 [10:20:25] (03PS5) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) [10:20:29] Amir1: ^ [10:20:50] I am guessing aaron is not yet around so I 'll avoid pinging [10:20:51] akosiaris: Okay, I'm around to monitor [10:20:57] yeah [10:20:59] :) [10:21:16] we are going to lose all the queue jobs btw [10:21:25] but that's the entire idea behind this [10:22:50] ok I am seeing clients reconnected quickly on codfw [10:22:56] connected_clients:336 [10:24:37] some minor overload errors as well on codfw (~10) [10:24:42] have subsided already [10:26:07] scorings have returned to previous levels, this looks ok. Moving on to eqiad [10:29:14] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:37] ack'd [10:31:08] connected_clients:602 [10:31:10] I see errors are going up [10:31:16] eqiad clients are going up as well [10:33:37] hmmm it's not converging yet [10:33:49] it should have recovered by now [10:34:07] # Clients [10:34:07] connected_clients:768 [10:34:07] client_longest_output_list:0 [10:34:07] client_biggest_input_buf:260640 [10:34:07] blocked_clients:0 [10:34:18] we got a ton of clients already reconnected to redis [10:36:55] akosiaris: The jobs are recovering [10:37:04] maybe it's just too slow [10:37:23] https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1 [10:37:31] yeah.. overloads are now zero, but we have a steady flow of scores errored at around 12,5 [10:37:48] which is higher than normal [10:38:24] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 8.030 second response time [10:41:08] I think it's because of the "moving average" thing [10:41:28] akosiaris: it's six now, and it's dropping [10:41:57] yeah looks like it's recovering [10:42:23] but this window was rather large.. some 6-7 minutes of overloads [10:42:46] maybe killing persistency wasn't that good an idea after all [10:43:01] now every time redis gets restarted we will probably have a similar event [10:44:21] maybe we should have a fallback [10:44:34] I remember some discussions about twmproxy [10:44:34] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:45:13] WTF [10:45:29] graphs are all fine [10:45:44] akosiaris: ^ [10:46:07] Amir1: yeah we enabled it. 
And then had to rollback immediately [10:46:19] something in ores (celery probably) uses the MULTI redis command [10:46:26] and it's not twmproxy compliant [10:46:34] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 5.528 second response time [10:47:27] Amir1: so we did have a set of overload errors again... on 10:42 [10:47:32] it has subsided again [10:47:42] I am going to lower the celery concurrency on scb1001, scb1002 [10:47:46] these are less powerful boxes [10:48:00] they shouldn't be having the same amount of workers as the other too [10:48:14] okay [10:48:18] that will lower overall capacity a bit, but better to have lower capacity than that ^ [10:49:24] We should finish fixing the ores for the new boxes that should be the most robust solution I guess [10:49:29] yes [11:00:04] yeah, lowering the concurrency on scb1001, scb1002 seems to have helped the boxes themselves... load is dropping [11:00:16] so does memory usage [11:00:20] let's see if it lasts [11:01:45] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:01:53] oh for the love of [11:02:27] 1k overload errors the moment I restarted celery ? [11:02:44] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 7.035 second response time [11:03:02] well it's recovering faster this time around [11:06:47] so this feel more fragile than I would like it to be [11:11:04] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:11:30] akosiaris: this looks worrying [11:11:45] damn [11:11:54] yeah, it's overloading again [11:12:27] scb1001 had a minor overload spike [11:13:02] it's weird btw this check is reporting but others are not [11:13:19] it's clearly unearthing some underlying issue [11:14:04] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 3.527 second response time [11:14:36] all scb boxes had overload spikes... sigh [11:19:12] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796085 (10akosiaris) [11:19:16] 10Scoring-platform-team (Current), 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3796083 (10akosiaris) 05Open>03Resolved Re-resolving. This has been deployed. The deploy in codfw... [11:20:20] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Introduce ModelLookup interface and its SQL implementation (035 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [11:35:24] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.015 second response time [11:36:22] damn [11:36:25] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 2.022 second response time [11:36:34] yeah so the service is suffering [11:36:50] the job rate seems to be unstainable ? 
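Annotation: the twemproxy rollback mentioned above is attributed (the chat's conjecture) to something on the ORES side issuing Redis MULTI/EXEC transactions, which twemproxy does not support. In redis-py, MULTI comes from transactional pipelines; the sketch below only illustrates where MULTI appears, it is not the actual celery broker code.

```python
import redis

r = redis.Redis(host='localhost', port=6379)   # assumed local instance

# transaction=True wraps the queued commands in MULTI ... EXEC,
# which a twemproxy frontend cannot proxy.
with r.pipeline(transaction=True) as pipe:
    pipe.lpush('queue', 'job-1')
    pipe.expire('queue', 3600)
    pipe.execute()

# transaction=False batches the same commands without MULTI/EXEC,
# which a proxy can handle.
with r.pipeline(transaction=False) as pipe:
    pipe.lpush('queue', 'job-1')
    pipe.expire('queue', 3600)
    pipe.execute()
```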
[11:39:35] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:40:44] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 7.033 second response time [11:49:02] no real useful logs into what's going on [11:50:22] Amir1: what does overload mean exactly ? it's just that there are no available celery workers to work on the jobs ? [11:50:32] I am guessing that, but want to make sure [11:58:05] akosiaris: when the celery queue is 100 it stops accepting scores [11:58:31] that's per host or globally ? [11:58:40] globally AFAIK [11:59:48] hmm maybe I should increase that [11:59:57] do we have any good reason it's at 100 ? [12:00:03] and not.. say 150 ? [12:00:58] it does look to be per host however [12:01:09] it's definitely a setting per host [12:01:26] the queue itself is global but the setting is per host [12:10:24] someone external is hammering codfw ORES pretty good. https://grafana.wikimedia.org/dashboard/db/ores?panelId=1&fullscreen&orgId=1&from=1511947481662&to=1511956735700 [12:10:36] but I don't think they are causing problems yet. [12:14:41] I think I 'd like to conservatively increase it [12:14:47] the celery queue limit that is [12:14:51] for 100 to say 120 [12:14:54] from* [12:26:16] Amir1: so https://gerrit.wikimedia.org/r/#/c/394047/ is up, lemme know what you think [12:26:44] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.025 second response time [12:30:08] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796190 (10awight) Looks like we'll be doing the same thing today. There have been intermittent overload incidents for the last two hours, during... [12:35:33] 10Scoring-platform-team, 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796193 (10awight) @akosiaris And thoughts about how we would troubleshoot this metrics problem? I'd like to review the modules responsible for... [12:35:42] 10Scoring-platform-team (Current), 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796195 (10awight) [12:38:21] oboy [12:38:33] It’s been raining errors again. [12:47:04] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 4.029 second response time [13:01:24] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.024 second response time [13:03:34] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 849 bytes in 7.036 second response time [13:03:52] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796259 (10awight) [13:09:44] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:35] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796299 (10awight) @akosiaris I'd like to get our celery logs routed to logstash, at INFO level. 
We could just pipe into a file too, per... [13:16:45] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 6.533 second response time [13:20:54] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:54] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 6.557 second response time [13:23:29] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3796325 (10awight) @thcipriani @mmodell Is the fix for T179013 deployed to production? I'm hoping the fix will be that s... [13:24:54] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.018 second response time [13:26:35] halfak: solid overload errors from eqiad [13:30:53] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3796332 (10Aklapper) [13:34:06] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3796335 (10Aklapper) @Ladsgroup: Thanks! Assuming this is about Python, imported as... [13:35:21] o/ [13:35:23] Damn [13:35:26] halfak: yep [13:36:07] LMK when you’re up to speed, I need to relocate so I’m not entangled with family [13:36:14] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 8.039 second response time [13:36:31] I’d say we should failover to codfw again, but that didn’t work out so well yesterday. [13:37:15] I want to turn on celery logging at INFO level we we can see what’s happening during high-cpu usage. It’s gotta be something unusual, cos baseline CPU usage is so steady. [13:37:53] Also pushing for releng to fix deployment to the new cluster. I haven’t quite got the heart to make that UBN [13:37:58] but maybe we should. [13:38:10] I think it just involves deploying a patch they’ve already written. [13:38:11] We are failing to CODFW [13:38:25] Looks like three nodes in CODFW and one node in EQIAD are responding [13:38:40] Why would the cache rate be falling btw? [13:38:55] *cache hit rate [13:39:01] Only guess is that we're getting requests for non-recent revisions. [13:39:18] There was that user in here. Victor --> x.... [13:39:25] Was just about to run a big analysis. [13:39:30] Might be hammering us to death :) [13:39:37] xinblev [13:40:17] lol [13:40:29] A scapegoat would be helpful right about now [13:40:54] uwsgi and celery are doing *something* on scb1003 but according to Grafana, they aren't scoring revisions. [13:41:25] This doesn’t make sense to me: overload errors are 1k/min on eqiad, but scores errored only 10/minute [13:41:47] Scores errored != Overload errors. [13:41:55] An overload error happens before scoring can start [13:42:08] k. I’m reading scoring_system metrics code [13:42:26] What makes you say we’re failing over to codfw? 
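Annotation: the "overload" errors being counted here are, as explained above, rejections that happen before any scoring starts: when the Celery queue length reaches the configured maximum (100 at this point, a per-host setting against a shared queue), the web layer refuses new work. A minimal sketch of that kind of guard, assuming a Redis list as the broker queue; the names (QUEUE_KEY, OverloadError, submit_score_job) are hypothetical, not the actual ores implementation.

```python
import redis

QUEUE_MAXSIZE = 100      # the per-host limit discussed above
QUEUE_KEY = 'celery'     # assumed name of the broker's Redis list

class OverloadError(Exception):
    """Returned to the client before any scoring work is queued."""

def submit_score_job(r: redis.Redis, enqueue, rev_id):
    # The queue itself is shared (global), but each web host applies its own
    # configured maximum before accepting more work.
    if r.llen(QUEUE_KEY) >= QUEUE_MAXSIZE:
        raise OverloadError(f'queue >= {QUEUE_MAXSIZE}, refusing rev {rev_id}')
    return enqueue(rev_id)
```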
[13:42:27] We got hammered at 13:04 [13:42:36] codfw is handling a lot of requests [13:42:38] Looks like we’re getting 1k/min failed requests [13:42:42] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-3h&to=now-1m [13:43:03] We're returning scores for 1000 per minute. [13:43:09] (not cached) [13:43:52] Also, “all scores returned” shows all machines are returning scores, why do you say that only four nodes are responding? [13:44:04] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.015 second response time [13:44:23] awight, only 4 nodes are generating scores. [13:44:36] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-3h&to=now-1m&panelId=3&fullscreen [13:44:39] Is “scores returned” at the wsgi level? [13:44:52] Yes [13:45:02] ok [13:45:14] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 2.536 second response time [13:45:35] halfak: What do you think about logging celery at INFO level? [13:45:55] Bad [13:45:58] https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/scoring_system.py#L165 [13:46:25] https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/celery_queue.py#L68 [13:47:11] Arg I need to change locations too. [13:47:42] halfak: Sorry, is your point that we get a metrics point for everythign that would be celery logging? [13:48:14] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 871 bytes in 0.020 second response time [13:51:03] halfak: I’m thinking, celery=INFO will let us see anything weird in lookup vs process score, and warnings about the redis queue. [13:51:12] currently those go to /dev/null [13:51:38] What was “Bad” in response to? [14:00:56] awight, OH i see I misread your question. I think that logging celery to see what is up is a good idea. [14:01:18] Want to sign here: https://phabricator.wikimedia.org/T181621#3796299 [14:01:20] I'm not sure anything weird is happening beyond celery getting jammed because of OOM and a *huge* rate of external requests. [14:01:36] It’s the OOM, yeah. [14:01:55] awight: [14:01:56] halfak: akosiaris: Nov 29 10:32:59 scb1003 celery-ores-worker[13231]: MemoryError: [Errno 12] Cannot allocate memory [14:03:20] awight, why are we not addressing this obvious problem and cutting the # of workers? [14:03:25] akosiaris: halfak: oh hey. This is the entire contents of scb1003 ores/app.log, [14:03:28] Connection to Redis lost: Retry (0/20) now. [14:03:29] Connection to Redis lost: Retry (1/20) in 1.00 second. [14:03:30] Connection to Redis lost: Retry (2/20) in 1.00 second. [14:03:31] Connection to Redis lost: Retry (3/20) in 1.00 second. [14:03:32] Connection to Redis lost: Retry (4/20) in 1.00 second. [14:03:33] Connection to Redis lost: Retry (5/20) in 1.00 second. [14:03:37] woah [14:03:41] um. [14:03:46] wtf [14:03:52] Why is everything going wrong at once? [14:04:06] I think akosiaris might be recovering from last night, we should grab another opsen? [14:04:25] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 2.036 second response time [14:05:12] back [14:05:20] what on earth is going on ? [14:05:25] hey sorry to shout [14:05:27] IONO [14:05:35] akosiaris, same shit, more confusion [14:05:39] But see scb1003:/srv/log/ores/app.log [14:05:43] The entire contents are... 
[14:06:07] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796413 (10awight) Most of our worker boxes are down. Here's the app.log from one that's down, last written to 2 hours ago: ``` Connection to Red... [14:06:13] ^ [14:06:28] Also, a shitton of OOM deaths. [14:06:44] not in that file though. [14:06:56] akosiaris: Do I have privs to read oom_kill logs? [14:07:21] I can’t find that anywhere. syslog and kern.log are read group adm [14:07:34] I propose we reduce the # of workers and restart everything. I want to get these OOM out of the way before we start trying to figure anything out. [14:08:01] akosiaris dropped #workers to 20 already, on scb1001-2 [14:08:09] And still OOM? [14:08:16] Then it's not us for sure! [14:08:21] mmm [14:08:37] I’m thinking of the OOMs we saw on beta... [14:08:53] Beta? What? [14:09:05] This 100% celery behavior reminds me of when we had a rogue revid and an regex problem. [14:09:09] Maybe we have another one of those. [14:09:27] Also possible is that scb1001 is handling a ridiculous load. [14:09:34] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.015 second response time [14:09:44] Looks like we're not 100% on a single celery process for long [14:09:52] So it's not like the roque revid event. [14:10:00] akosiaris: Can we set up INFO logging btw? https://phabricator.wikimedia.org/T181621#3796299 [14:10:15] I can make the puppet patch if you agree. [14:10:27] ok backlog read, starting to look at what's happening [14:11:05] there was no OOM on scb1003 today [14:11:24] oh good. [14:11:51] in fact memory has been pretty stable [14:11:55] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All [14:11:59] akosiaris: I don’t understand. > Nov 29 10:32:59 scb1003 celery-ores-worker[13231]: MemoryError: [Errno 12] Cannot allocate memory [14:12:29] when I say OOM, I mean the kernel OOM killer [14:12:44] that's a process the kernel spawns when the box is under intense memory pressure [14:12:54] right [14:12:58] akosiaris: BTW I’m pretty sure that the memory graphs for scb1001-2 are wrong, https://phabricator.wikimedia.org/T181544 [14:13:04] it kills processes until it has managed to stabilize the situation [14:13:34] OK I’ll believe that OOM killer wasn’t involved, but how would it *not* get involved if we’re seeing errors like “cannot allocate memory"? [14:13:47] I need to leave my desk for 45 mins. I'll be back on as soon as I can. [14:14:10] awight: that's a good question [14:14:24] awight: which graphs is that task referring to ? [14:15:31] cause this https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=14&fullscreen&orgId=1&var-server=scb1001&var-network=eth0 and this https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=88&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All [14:15:33] akosiaris: This one, https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511954088176&to=1511964828177&panelId=6&fullscreen [14:15:37] are in agreement [14:16:17] Do these graphs show the OOM conditions yesterday? 
[14:16:36] yes [14:16:38] for example [14:16:45] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 8.036 second response time [14:16:46] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=14&fullscreen&orgId=1&var-server=scb1001&var-network=eth0&from=1511951404293&to=1511951681817 [14:16:57] you can see it's hitting the ceiling very very clearly [14:17:08] ah nice [14:17:21] this does to https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=88&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All&from=1511951282292&to=1511951830300 [14:17:29] but we need to graph the ceiling there as well [14:18:18] This graph never drops below 19GB free, https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511792288603&to=1511965028603&panelId=6&fullscreen [14:18:29] this https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511954088176&to=1511964828177&panelId=6&fullscreen&edit is wrong though [14:18:40] no way that box has 21 G free currently [14:19:03] 10Scoring-platform-team (Current), 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796458 (10awight) Here's a graph that shows the OOM ceiling correctly, for comparison: https://grafana.wikimedia.org/dashboard/file/se... [14:19:20] ok cool, thanks for convincing me that we have something that shows the right number. [14:20:03] so back to what's happening [14:20:12] we are having huge overload errors which means 1 thing [14:20:35] we are pushing way too many jobs and the max_queue limit built into ores is stopping us [14:21:06] stopping us? It seems that OOM and possibly redis errors are stopping us. [14:21:34] yeah that oom thing is something I need to look into better [14:21:35] I think we’re all in agreement that it’s fine to turn down the celery worker count until the machines can handle max load, if that’s what you mean. [14:21:41] let me though first look at redis [14:21:45] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.029 second response time [14:21:46] kk [14:22:15] akosiaris: What do you think about trying to pool the new cluster? Is that pointless until we clean the old house? [14:22:44] I 'd like us to pool it, just not sure it should be a kneejerk reaction [14:22:55] [761 | signal handler] (1511951211) Received SIGTERM scheduling shutdown... [14:22:56] aha [14:22:58] k that’s how I feel. Also, we’re blocked on releng [14:23:01] wat [14:23:02] so that's oresrdb1001 [14:23:09] yikes. [14:23:29] BTW it’s fine to reduce the number of keys cached in redis, I’m not sure if Redis sets that limit itself or we do. [14:23:42] it's memory based [14:23:51] we have it capped at 1G for the queue and 6G for the cache currently [14:23:55] we can increase both [14:23:56] ok. [14:24:00] Low is fine. [14:24:35] so the other thing that happened is that the cache redis also exhibited [14:24:41] [754] 29 Nov 10:27:06.100 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis. 
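Annotation: the "capped at 1G for the queue and 6G for the cache" figures above are Redis maxmemory settings on the two oresrdb instances. A quick way to confirm them with redis-py is sketched below; the ports and the full hostname are assumptions (the log only names "oresrdb1001"), so treat this as illustrative.

```python
import redis

# Assumed layout: one instance for the Celery queue, one for the score cache.
instances = {'queue (1G cap)': 6379, 'cache (6G cap)': 6380}   # ports are a guess

for name, port in instances.items():
    r = redis.Redis(host='oresrdb1001.eqiad.wmnet', port=port)  # FQDN assumed
    maxmemory = int(r.config_get('maxmemory')['maxmemory'])
    used = r.info('memory')['used_memory']
    print(f'{name}: {used / 2**30:.2f} GiB used of {maxmemory / 2**30:.2f} GiB cap')
```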
[14:25:18] ah the redis restarting was me [14:25:27] for enabling the non-persistency setting [14:25:28] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796496 (10awight) oresrdb1001 is dying: ``` [09:22am] akosiaris: [761 | signal handler] (1511951211) Received SIGTERM scheduling shutdown... [09:2... [14:25:33] Oops. Let me remove my uninformed comment. [14:25:38] ok that's expected [14:25:47] Removed. [14:26:54] So, it could be as simple as, several of the celery managers never recovering from redis shutdown. [14:26:55] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 847 bytes in 6.533 second response time [14:26:57] so the logs about [14:27:01] Connection to Redis lost: Retry (3/20) in 1.00 second. [14:27:05] are untimestamped [14:27:09] we need to fix that [14:27:13] The file has a timestamp though [14:27:17] yes +1 [14:27:20] yeah 10:26 [14:27:25] so it's the same thing [14:27:30] I think we’re running with default logging config. [14:27:32] the restart I did for the persistency thing [14:27:46] so redis misbehaving is a red herring up to now I think [14:27:52] I’ll put that in our wikitech notes, that celery workers have to be restarted after redis [14:29:14] akosiaris: Argh, I just realized that our logging config is checked in with the source, and there’s no mechanism to look in another place that can be managed by puppet. I’ll modify in source for now. [14:30:08] so one thing that changed is that I lowered celery consistency [14:30:10] eer [14:30:12] concurrency [14:30:26] maybe I could bump that [14:30:31] that was a good idea to lower [14:31:05] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:24] so about that (04:11:59 μμ) awight: akosiaris: I don’t understand. > Nov 29 10:32:59 scb1003 celery-ores-worker[13231]: MemoryError: [Errno 12] Cannot allocate memory [14:31:28] where did you see thing ? [14:31:31] (03PS1) 10Awight: Increase celery verbosity; use message format including timestamp [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394060 (https://phabricator.wikimedia.org/T181538) [14:31:33] which file I mean [14:32:33] akosiaris: sudo service celery-ores-worker status -l [14:32:34] I am guessing journalctl [14:32:38] yep [14:32:38] yeah that's it [14:33:01] Would be nice if that stuff went into logstash, eh? [14:33:10] yes [14:33:12] we need to fix that too [14:33:22] so that happened around the same time I restart redis [14:33:32] 6 mins later [14:33:51] so it's related [14:34:15] 10Scoring-platform-team, 10Operations, 10monitoring, 10Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3796535 (10awight) [14:34:20] but it's not currently happening [14:34:29] akosiaris: I’m going to restart services on all nodes, one moment please. [14:34:36] ok [14:36:13] I don’t think that worked. [14:36:22] need -f or some shit [14:37:24] all of this point to not being able to survive the oncoming load of requests [14:37:56] may I merge https://gerrit.wikimedia.org/r/#/c/394047/ ? [14:38:09] well… can’t survive with celery only running on 3/9 machines [14:38:20] 9 machines ? [14:38:25] aaa you mean the new cluster [14:38:42] no I mean that celery is down on most machines cos of the redis restart [14:38:57] you think so ? 
it shouldn't be true [14:38:59] maybe hold off a few minutes on increasing workers, until we see what existing workers can do [14:39:04] akosiaris: wait [14:39:05] but lemme restart all of them [14:39:10] I just restarted [14:39:13] ah ok [14:39:16] or mid-restart [14:39:21] sorry, I’m confused. [14:39:40] IMO we demonstrated that the celery manager on most nodes died during the redis restart, and never recovered. [14:40:06] Do you have reason to think otherwise? [14:40:14] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 6.527 second response time [14:41:11] kind of [14:41:13] this https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All&from=1511951282292&to=1511951830300&panelId=85&fullscreen [14:41:32] if the manager was dead if would not spawn new children [14:41:46] but the number of celery workers is ok process wise [14:41:51] but lemme check per box [14:42:16] btw codfw is fine, right ? [14:42:19] this is eqiad only [14:42:25] akosiaris: yes, I think so [14:42:26] at least that's what I read from the graphs [14:42:28] ah no [14:42:39] scb2005-6 have been down too [14:43:14] akosiaris@scb1001:~$ ps auxw | grep celery |wc -l [14:43:14] 22 [14:43:19] akosiaris@scb1003:~$ ps auxw | grep celery |wc -l [14:43:19] 46 [14:43:21] as expected [14:43:29] I’ve been restarting... [14:43:34] destroying evidence :) [14:43:37] lol [14:43:42] Try scb2005- [14:43:43] 6 [14:43:45] ok looking into codfw then [14:44:16] This graph shows that scb2005 has been down since Nov 28, 19:24 [14:44:17] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511880233269&to=1511966573269&panelId=3&fullscreen [14:44:47] and scb2006 died at the same time, came back online, then was dead since 20:24 [14:45:08] so not related to the current problems [14:45:13] but rather yesterday's [14:45:27] anyway investigating that while you restart stuff in eqiad [14:46:03] Another important thing I just learned… the “* scores returned” graphs show *wsgi* activity, while “scores processed” shows celery activity. That threw me for a loop. [14:46:14] All services should be restarted now. [14:46:27] yes and overload is way before a job is even submitted [14:46:30] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:16] yep, I hadn’t internalized that info, though I probably heard it before. [14:47:18] celery on scb2005, scb2006 seems to be running... now why isn't it not doing anything [14:47:24] right? [14:47:26] argh [14:47:34] It does take a while to warm up [14:47:38] a few minutes. [14:48:00] ah, zombie processes [14:48:21] akosiaris: Want to bless this btw? https://gerrit.wikimedia.org/r/#/c/394060/ [14:48:28] I [14:48:36] I’m okay self-merging if you don’t have +2 [14:48:39] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 832 bytes in 9.536 second response time [14:49:06] (03CR) 10Alexandros Kosiaris: [C: 031] Increase celery verbosity; use message format including timestamp [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394060 (https://phabricator.wikimedia.org/T181538) (owner: 10Awight) [14:49:12] go for it [14:49:56] so scb2005, scb2006 are fully of zombie process [14:49:56] akosiaris: So, the celery service restart failed due to zombies? 
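Annotation: the Gerrit change blessed just above (r394060, "Increase celery verbosity; use message format including timestamp") is aimed at the untimestamped "Connection to Redis lost: Retry (n/20)" lines noted earlier. The actual patch lives in the ores deploy repo; as an illustration only, a minimal Python logging setup in the same spirit looks like this.

```python
import logging

# Timestamped format so lines like "Connection to Redis lost: Retry (3/20) ..."
# can be correlated with Redis restarts; INFO level instead of the default WARNING.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s [%(name)s] %(message)s',
)
logging.getLogger('celery').setLevel(logging.INFO)
```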
[14:50:03] and the celery master has indeed failed there [14:50:17] it's dead and not reaping children I think [14:50:31] is that stat=S? [14:50:38] Z [14:50:44] Z Nov28 0:26 [celery] [14:50:47] that's a zombie ^ [14:51:29] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.017 second response time [14:51:36] Hmmm, on stat1003 ps auxxww shows stat=S, but the start time is 10:30 so they weren’t restarted. [14:51:45] stat1003 ? [14:51:52] oops *scb1003 [14:51:58] Seeing what happens if I restart manually [14:52:39] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 847 bytes in 9.032 second response time [14:53:19] now the question is why the master while still being alive it does not reap children [14:53:31] it has probably croaked internally on its own or something [14:53:33] stracing [14:53:51] akosiaris: That’s really creepy, `sudo service celery-ores-worker restart` works but scap —service-restart did not [14:53:56] it's stuck at trying to read from fd 138 [14:54:02] aha [14:54:13] need to see what scap does [14:54:26] but lemme finish the master being stuck thing [14:54:31] I’ll restart by hand where I can [14:54:55] ah it's one of those pipes [14:55:03] the ones we were seeing the other day [14:55:24] so it's stuck waiting to read from a probably dead child [14:55:35] yeah that should not happen [14:56:14] Only developers can get away with talking about dead children. [14:58:17] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#3796581 (10awight) [14:58:50] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#3796595 (10awight) [14:59:04] Why are we having so so many issues lately? [14:59:28] Zppix: we weren’t hugged enough as kids. [14:59:44] awight: ig ores needs a hug? [15:02:04] Too late, piss off :p [15:03:59] :( [15:05:39] awight: I 'll restart celery all over codfw btw [15:05:58] akosiaris: I’m doing that ATM [15:06:02] ah ok [15:06:30] Happy to stop, if you want to try anything specific [15:06:35] no no go ahead [15:06:45] I was debugging the stuck celery [15:06:54] but I think I got enough to reproduce [15:07:32] btw, overload errors are 0 now [15:07:38] seems like the celery restart worked [15:08:06] akosiaris: o/5 ! [15:08:17] I think we’re fine now. That was silly. [15:08:20] and we are serving more req/s than any other time right now [15:08:32] so we had indeed increases in traffic that caused issues [15:08:40] and with celery half-dead [15:08:44] we could not serve it [15:08:53] now, we need to figure out why... [15:08:59] I got a pretty good idea [15:09:25] which is that when the child (the worker) dies due to an exception [15:09:41] for some reason the master (the father) does not successfully reap it [15:09:51] and stays as a zombie process [15:10:25] that issue seems a bit silly though. as in celery should be able to handle zombie children [15:10:37] I need to review the code that spawns worker I guess [15:11:56] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Investigate overload condition, seems that we lose nodes - https://phabricator.wikimedia.org/T181634#3796634 (10awight) [15:12:38] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? 
- https://phabricator.wikimedia.org/T181621#3796649 (10akosiaris) https://gerrit.wikimedia.org/r/394060 [15:12:56] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796651 (10akosiaris) And yes let's send these logs to logstash!!! [15:13:05] akosiaris: if you want to dump notes about zombies into T181634, that’s probably our highest-prio takeaway. [15:13:05] T181634: Investigate overload condition, seems that we lose nodes - https://phabricator.wikimedia.org/T181634 [15:13:22] akosiaris: Also, we’re having a staff meeting now, to explain our silence. Feel free to join if you’re interested. [15:15:29] I got an already delayed meeting with my manager though [15:15:39] I 'll show up if I manage to finish in time [15:16:14] +1 no worries, we’re not talking about the outage until the end, if at all. [15:42:36] 10Scoring-platform-team (Current), 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796764 (10akosiaris) Now that I had some time to view those graphs (that is https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&fr... [15:59:45] Thanks for a great meeting all. I really like having y'all participate in discussions of how the team is managed and to drive proposals for what we should be doing :) [15:59:58] halfak: Mind taking the docs meeting w/o me? [16:00:05] I desperately need to relocate [16:00:06] I could. [16:00:08] OK [16:00:18] great. Back in 30 min [16:00:27] Roger :) [16:04:55] * halfak sits all alone in the docs meeting [16:05:01] * halfak twiddles thumbs [16:05:04] j/k working on docs :D [16:06:24] o/ srrodlund [16:07:36] https://www.mediawiki.org/w/index.php?title=Topic:U1vvmc0oparwh4dd&topic_showPostId=u2um3hg75v31sq73#flow-post-u2um3hg75v31sq73 [16:07:49] * halfak gets documentation work done while he hangs out. [16:11:02] brt [16:22:03] While I'm waiting for a colectivo, wanted to mention there's a logging patch on ores-prod to review [16:22:15] linky [16:22:21] boo [16:25:20] (03PS6) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) [16:25:51] halfak: I can't install aspell-is on ores-misc, what should I do? [16:26:03] Do we need to clean up sotrage? [16:26:49] what's the error? [16:27:59] (03CR) 10Ladsgroup: Introduce ModelLookup interface and its SQL implementation (034 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [16:28:08] Amir1, ^ [16:28:17] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3559974 (10Deniskamazur) Do you mean machine learning by saying artificial intellig... [16:28:37] halfak: dpkg is locked [16:28:55] when I deleted the dpkg lock files and run dpkg -a --configure, it stucks [16:29:16] got it. 
Will have a look [16:33:57] Amir1, see /srv/ores-compute-01-20170711 [16:34:02] Clean up your stuff there :) [16:38:38] done [16:39:58] halfak: cleaned everything and the problem still persists [16:40:05] https://www.irccloud.com/pastebin/1fFUWk3p/ [16:40:07] OK checking stuff [16:40:30] it get stuck at this place all the time [16:41:20] Is it a problem with the package? [16:41:36] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3796963 (10Ladsgroup) >>! In T174384#3796908, @Deniskamazur wrote: > Do you mean ma... [16:42:32] halfak: no, it's about something else I think [16:42:45] it would be a good idea to tear down everything and set it up again [16:42:53] but we would lose too much [16:43:29] maybe I should do it my laptop [16:43:32] Amir1, did you do some googling of the error? [16:43:42] We need to check with the exact same version of linux [16:43:44] yup, no result :( [16:43:45] Try on staging :) [16:44:04] oh yes, I forgot [16:44:07] thanks [16:46:57] OMG you can edit your favorites menu in phab [16:47:02] (03CR) 10Thiemo Mättig (WMDE): [C: 032] Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [16:47:02] * halfak adds "create paste" [16:48:31] 10Scoring-platform-team, 10Analytics, 10ORES: Enable ores::base on stat1006 - https://phabricator.wikimedia.org/T181646#3796974 (10Halfak) [16:48:44] (03Merged) 10jenkins-bot: Introduce ModelLookup interface and its SQL implementation [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [16:48:49] halfak: ^ That's fancy [16:49:04] let me know when it's done, I want to do things (mwhahahaha) [16:49:47] I want to move all of my model building work to stat1006 so that we can free up more space on ores-misc :) [16:55:12] wiki-ai/editquality#13 (iswiki_reverted - b8a5595 : Ladsgroup): The build passed. https://travis-ci.org/wiki-ai/editquality/builds/309072360 [16:57:41] Amir1, looks like stat1005 works just fine. [16:57:49] Make sure to "nice" any long running processes there. [17:01:27] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3797048 (10Deniskamazur) Are you sure you need ml for this kind of task? I mean we... [17:18:41] back. ores seems to be holding quite well.. although score requests have dropped considerably [17:19:18] 2 hours ago we have peaks of 1.5k but now it's around 600 [17:19:36] please let this hold ... [17:21:53] Evil spirits. [17:22:03] I waited the 30 min without any vans stopping for me! [17:22:08] So I walked to town [17:38:50] akosiaris: If you’re around, could you do some rm’ing on the new cluster? [17:39:00] Per this workaround: https://phabricator.wikimedia.org/T181552#3796889 [17:39:17] halfak: This is the review I was asking for, https://gerrit.wikimedia.org/r/#/c/394060/ [17:39:34] awight: I am in a meeting, can it wait ? 
[17:39:40] akosiaris: yes [17:39:43] ok, thanks [17:40:26] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3797253 (10awight) @thcipriani OK thank you for the workaround. I'll note that I don't have permissions to do that mysel... [17:40:34] akosiaris: I realized I can work around myself by deploying a new revision... [17:41:39] nice [17:43:44] Amir1: Want to review either of these? https://gerrit.wikimedia.org/r/#/c/393822/ https://gerrit.wikimedia.org/r/#/c/394060/ [17:48:55] 10Scoring-platform-team, 10Wikilabels, 10Easy, 10Google-Code-in-2017: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015#3797305 (10Phantom42) a:03Phantom42 I will work on this [17:49:52] was in meeting sorry! [17:50:03] Now heading to lunch. Will review when I get back [18:01:36] (03CR) 10Awight: [C: 032] "Self-merging." [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/393822 (owner: 10Awight) [18:03:24] (03CR) 10Awight: [V: 032 C: 032] Remove unprovisioned servers [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/393822 (owner: 10Awight) [18:03:56] (03CR) 10Ladsgroup: [V: 032 C: 032] Increase celery verbosity; use message format including timestamp [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394060 (https://phabricator.wikimedia.org/T181538) (owner: 10Awight) [18:04:11] Amir1: thanks! [18:04:49] I’ll keep that one in my pocket cos I only have one chance to deploy per revision due to a scap bug :-/ [18:04:51] awight: halfak|Lunch: I have to leave for today unfortunately, I worked only four hours, will make it in the next days [18:05:00] Sorry [18:05:04] Enjoy the bitter cold :) [18:05:26] I'm turning into a white walker, it's bearable now :D [18:05:29] o/ [18:05:39] Being… undead solves a lot of things I’m sure [18:07:50] Amir1: see -dev please [18:09:56] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10awight) [18:10:00] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10awight) a:05awight>03None [18:10:19] akosiaris: Another fun blocker for when you’re unshackled ^ [18:14:05] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3797381 (10awight) a:05awight>03None [18:21:02] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review, 10Wikimedia-log-errors: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 - https://phabricator.wikimedia.org/T179830#3797410 (10awight) Confirmed th... [18:40:14] o/ [18:40:58] awight, anything left for me to review? [18:41:08] Naw [18:42:18] OK cool. I think I might push on some JADE stuff unless there's something else that is pressing. [18:42:55] Oh BTW, the meeting with srrodlun(d) was mostly about next steps for her and SIGDOCS. We're going to keep the weekly checkin but she's going to focus more on cloud for a while. 
[18:45:01] I proposed that the next thing that she does for us is a new audit where she helps us figure out what the next big impact item is. [18:45:04] awight, ^ [18:45:22] kk [18:45:29] I’m just whining in SoS [18:48:04] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10mmodell) This is very strange. I can't tell exactly what would be causing this. [18:48:55] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3797532 (10Halfak) We already have a machine learning model for predicting article... [18:51:17] * halfak read that as "winning SoS" [18:51:30] You'd need to Scrum all of the Scrums -- more than anyone else. [19:03:39] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review, 10Wikimedia-log-errors: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 - https://phabricator.wikimedia.org/T179830#3797629 (10awight) Strange, whe... [19:12:49] (03PS1) 10Awight: Protect Special:Contributions code from missing threshold [extensions/ORES] - 10https://gerrit.wikimedia.org/r/394109 (https://phabricator.wikimedia.org/T179830) [19:12:55] 10Scoring-platform-team, 10Global-Collaboration, 10MediaWiki-extensions-ORES: Hide ORES filters from Special:Contributions when thresholds aren't available - https://phabricator.wikimedia.org/T181666#3797654 (10awight) [19:13:30] halfak: Here’s some CR if you still want it, https://github.com/wiki-ai/ores/pull/236 [19:13:38] It’s just a cherry-pick from the CELERY_4 branch. [19:17:23] Will check it out [19:18:22] ORES still has the 'testwiki' hack/feature thing right? [19:18:37] which hack/feature thing? [19:18:40] legoktm: soort of—what are you trying to do? [19:18:43] We have a vagrant role [19:19:21] fully set up the ORES extension on my test wiki again, ideally without setting up all of ORES [19:19:38] legoktm: use the vagrant role [19:19:40] halfak: I think it took the rev id you passed it and returned it back flipped? [19:19:44] I don't use vagrant :/ [19:20:03] there’s a separate role for the service, if you want that local, but it’s not required. Without the ores-service role, you’ll be pointing to the production service. [19:20:38] If you do enable the ores-service role, you get “testwiki” although it won’t be called that. [19:20:43] ok, I'll just look at the role as documentatoin [19:20:53] legoktm: ok well at least use vagrant as a guideline for setting up your config, yeah [19:20:58] lemme point out the key piece [19:22:01] legoktm: https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/role/manifests/ores.pp#L10 [19:23:01] legoktm: Thanks, I didn’t realize that we actually do have testwiki enabled in production. e.g. https://ores.wikimedia.org/v3/scores/testwiki/123 [19:23:13] thanks [19:23:48] I will work on that after lunch :) [19:25:29] I recently added statistics to the RevIdScorer used in testwiki, not sure why I can’t get that to show up on production. [19:25:56] Without that, you won’t be able to do much with thresholds. You can hardcode them if you don’t care about the API fetching code. [19:25:59] awight, did we deploy that change? I think we intentionally did not just to be safe. 
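A quick way to settle the "did we deploy that change" question is to ask each environment for the model info directly. A sketch using the endpoints already mentioned in this log, assuming the usual v3 response layout (jq is only there for readability):

    # Compare beta and production: if "statistics" comes back empty or as an
    # error for testwiki, the RevIdScorer statistics change isn't deployed there.
    for host in ores-beta.wmflabs.org ores.wikimedia.org; do
      echo "== $host =="
      curl -s "https://$host/v3/scores/testwiki?models=damaging&model_info=statistics" \
        | jq '.testwiki.models.damaging'
    done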
[19:26:21] halfak: I thought it would at least be on beta… I guess not though. [19:26:33] Let's get it up on beta :) [19:26:38] halfak: Wouldn’t this do the job? https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info=statistics [19:27:17] Yup [19:27:21] halfak: Putting it on beta means merging to master, though, unless you think this is worth messing with branches? [19:27:44] ah which [19:27:50] Oh... uh... awight something is broken there [19:27:55] /o\ [19:28:16] Given https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info= [19:28:24] We should see something reasonable for https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info=statistics [19:28:37] Agreed [19:28:50] Also https://ores-beta.wmflabs.org/v3/scores/?models=testwiki&model_info=environment should return a 40X response. [19:29:31] Yeah... very broken for all models :) [19:29:52] Oh wait. this is weird. [19:29:56] models=testwiki :P [19:30:06] ymm https://ores-beta.wmflabs.org/v3/scores/?models=damaging&model_info=statistics [19:30:10] yeah like you just said [19:30:27] * awight wanders off, whistling [19:30:30] https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info= [19:30:35] got to relocate [19:31:59] Okay so all that happened is that beta has old code. [19:32:00] https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info=statistics [19:32:10] or should I say, https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info= [19:32:14] back in 15 [19:42:24] (03PS10) 10Catrope: Split WL and RC prefs for ORES [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392452 (https://phabricator.wikimedia.org/T180866) (owner: 10Petar.petkovic) [19:44:28] (03CR) 10Catrope: [C: 032] Split WL and RC prefs for ORES [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392452 (https://phabricator.wikimedia.org/T180866) (owner: 10Petar.petkovic) [19:50:41] (03Merged) 10jenkins-bot: Split WL and RC prefs for ORES [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392452 (https://phabricator.wikimedia.org/T180866) (owner: 10Petar.petkovic) [19:54:45] halfak: you’re right, the RevIdScorer enhancements were in ores and we declined to update. [19:54:50] Might as well do it though :) [19:56:43] (03PS1) 10Awight: Bump ores submodule [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394122 [19:57:03] (03CR) 10Catrope: [C: 032] Make it a bit easier to figure out that Range::combineWith() is used [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392910 (owner: 10Legoktm) [20:02:56] Logged on from my phone to say I'm AFK for ~ an hour for coffee with some profs [20:03:29] legoktm: RoanKattouw: This was merged prematurely, do either of you have time to confirm that this example code is doing what I think it is? https://gerrit.wikimedia.org/r/#/c/393945/ [20:03:41] (03Merged) 10jenkins-bot: Make it a bit easier to figure out that Range::combineWith() is used [extensions/ORES] - 10https://gerrit.wikimedia.org/r/392910 (owner: 10Legoktm) [20:07:05] From what I understand, the lockTSE causes a mutex around the get and set within getWithSetCallback, so multiple threads don’t try to fetch new thresholds at the same time. [20:08:11] and pcTTL keeps a cached version in-memory so that we don’t recalculate multiple times in one request, due to cache fetch coming from a replica. 
[20:09:56] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3797904 (10Paladox) git-lfs is now supported :). See https://gerrit.wikimedia.org/r/#/c/394125/ [20:10:39] I’d also like to do something Krinkle suggested, where we check the cached value before it expires, and if the service is unreachable, put the old value back into the cache for some shorter TTL. [20:11:25] awight git-lfs is now available in gerrit :) [20:12:05] paladox: oh hey, that’s a game-changer! [20:12:13] heh :) [20:12:39] awight i can enable it on your repo if you want. But need the name and it will have to be merged by someone who has +2 on All-Projects. [20:13:12] paladox: Just musing about how to migrate onto that. I think we should start with a junk repo, if that’s not too annyoing. [20:13:21] ok [20:13:23] yeh [20:13:26] which one? [20:13:58] Are there already any infrastructure-testing repos laying about? [20:14:08] yep, we have gerrit-ping [20:14:12] we have been testing on [20:14:16] https://gerrit.wikimedia.org/r/#/admin/projects/test/gerrit-ping [20:14:46] awight here's my change https://gerrit.wikimedia.org/r/#/c/394125/ [20:15:59] 10Scoring-platform-team, 10MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), 10Patch-For-Review: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3797934 (10awight) So, my understanding is that the lockTSE causes a mutex around the get and set w... [20:20:09] paladox: trying to figure out how to clone the LFS files… interesting [20:21:18] ok [20:21:20] you clone it normally [20:21:24] awight ^^ [20:21:45] ah the change isn’t merged, is all. [20:21:59] (even better!) [20:22:30] awight the push command is the same [20:22:50] awight this https://github.com/git-lfs/git-lfs/wiki/Tutorial will help with getting it into lfs [20:23:00] but if you use git clone over ssh in gerrit [20:23:03] you will have to do [20:23:31] vi .git/config and add [20:23:55] [lfs "https://@gerrit.wikimedia.org/r/a/test/gerrit-ping.git/info/lfs"] [20:23:56] access = basic [20:23:56] locksverify = false [20:29:24] paladox: Only cat.bin has content, right? [20:29:35] yes, i only did it on content [20:30:24] That’s fantastic, it’s working! [20:30:42] :) [20:30:58] awight i can switch it on some of your repos if you want? [20:31:03] i can use regex too [20:31:27] 10Scoring-platform-team, 10ORES, 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3797980 (10demon) [20:31:35] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3797978 (10demon) 05Open>03Resolved a:03demon [20:37:07] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798012 (10awight) [20:38:18] paladox: I made a task for us to decide what to do :) ^ [20:38:44] The only big question is probably whether we should rewrite the history or create new repos. 
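For the "rewrite the history" option in T181678, git-lfs ships a migrate subcommand that rewrites existing commits so the large binaries become LFS pointer files. A sketch only: the include patterns are a guess at what the model repos actually contain, and --everything rewrites every local ref, so it should be run on a throwaway clone.

    # Work on a disposable clone; the original stays untouched until the
    # rewritten history has been verified.
    git clone https://github.com/wiki-ai/editquality.git editquality-lfs
    cd editquality-lfs

    # Rewrite all refs, converting the matching binaries into LFS pointers.
    git lfs migrate import --everything --include="*.model,*.bin"

    # Sanity checks: LFS now tracks the converted paths, and every commit
    # SHA has changed.
    git lfs ls-files | head
    git log --oneline -5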
[20:39:57] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798028 (10awight) [20:42:29] awight: rewrite so we can know who to blame for bugs [20:43:03] Zppix: good point, actually we’d be keeping a copy of the repos in either case [20:43:38] awight: be easier to keep it all in one place [20:45:39] ok [20:46:17] awight we can enable it, but dosen't mean you have to use it on the repo :) [20:46:39] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3798085 (10awight) [20:46:45] 10Scoring-platform-team, 10ORES, 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3798086 (10awight) [20:46:49] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798084 (10awight) [20:49:26] 10Scoring-platform-team, 10ORES, 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3798097 (10Paladox) 05stalled>03Open [20:51:57] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798104 (10awight) I'm guessing we want to do something like, # Copy repos to a read-only location. # Set LFS flags and metadata on repo (unknown) # git... [20:56:36] akosiaris: Seems that some of our servers didn’t come back after all: https://grafana-admin.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=1511879596857&to=1511925593659 [20:57:04] scb1004, scb2001, scb2005, scb2006 [20:57:07] halfak: Amir1: ^ [20:57:46] service status looks fine. [20:58:31] CPU is at our 20% baseline for all of those! [21:00:55] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3798137 (10thcipriani) Hrm. I think this error probably has something to do with ssh client timeout. I'm not sure if anything rece... [21:01:47] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798142 (10demon) They'll mirror just fine since Phabricator just observes upstream. [21:02:02] awight: yes, that looks fine to me [21:02:03] Celery workers are… working. But nothing in the metrics. [21:02:07] I wish admin links for granfana would auto redirect to non-admin link for non admins :/ [21:02:11] Grafana* [21:02:17] legoktm: Cache patch? Awesome, ty. [21:02:28] Zppix: oops, ty for the reminder [21:02:44] Oh i didnt realise you sent an admin link lol [21:02:57] I was just thinking out loud [21:03:12] I’m definitely a culprit. There in the backscroll I did exactly that. [21:03:18] redirecting would be cool. [21:03:20] I see that now [21:03:28] Hmm i wonder if its possible [21:03:45] I mean we probably have to hack it on our grafana install [21:04:07] Probably would be a traffic question... [21:04:15] Who is the traffic team? [21:05:37] halfak: Ping when you’re back from the meeting, I don’t understand WTF. 
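When a node looks dead in Grafana but "service status looks fine", the host-side checks being described here boil down to roughly the following; the systemd unit name is an assumption about how puppet names the ORES celery service on the scb hosts, so adjust it to whatever systemctl actually lists:

    # Pick one of the quiet nodes and look at it directly; individual celery
    # workers cycling up to ~1 CPU and back down is the normal pattern.
    ssh scb1004.eqiad.wmnet "systemctl status celery-ores-worker --no-pager; ps aux | grep '[c]elery' | head"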
[21:06:01] and, an epic rainstorm is drifting in my direction so if I slam the computer closed I’ll bbiab [21:06:34] I wonder if i should open a task asking for that redirect or just straight up ask in IRC... awight [21:06:51] Zppix: A task is probably best, since it’s not trivial. [21:07:11] Ugh, ill do it l8r [21:07:38] lol taskophobia, I understand [21:08:03] back! [21:08:07] awight: its an traffic/ops task if its not perfect no one will care xD [21:08:46] Plus gci has my attn atm [21:08:47] halfak: Celery metrics are zero on 4 of our servers again. [21:08:59] halfak: But the workers are doing things. [21:09:02] We need that logging... [21:09:08] I’ll deploy that now [21:09:09] awight, 100% cpu on them? [21:09:12] +1 [21:09:18] no, totally normal CPU. 20% [21:09:23] wtf [21:09:40] individual workers spin up to one CPU and back down, as expected. [21:09:40] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3798203 (10Deniskamazur) Oh, sorry, didn't get the task. Sry for the dumb question [21:09:50] K I’ll deploy the logging [21:09:53] beta-fitrst [21:09:55] *first [21:10:42] awight: the link you pasted , aka https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=1511879596857&to=1511925593659 is in the past [21:10:52] OMG [21:10:52] to Nov 29, 03:1:53 UTC [21:10:57] Lmao [21:11:00] Thanks, I was refreshing like crazy [21:11:29] and with that, I am going to sleep [21:11:31] halfak: akosiaris: Confirmed, workers are just fine. [21:11:32] byez [21:11:37] o/ [21:11:41] lol [21:11:49] was just looking and wondering :) [21:11:59] ah rain storm gimme 5 [21:12:03] That pattern of timespans is a little counter-intuitive. [21:12:18] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3798206 (10Deniskamazur) I couldn't find the current accuracy of the models, where... [21:12:49] akosiaris: night :) [21:13:09] Ah the GCI students are picking up our tasks [21:13:47] 10Scoring-platform-team, 10articlequality-modeling, 10Easy, 10Google-Code-in-2017, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3798212 (10Halfak) https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&model_i... [21:16:28] awight, rain storm cuts the internets? [21:17:47] I was chilling on a sunny lawn :) [21:17:54] no quarter [21:18:04] ha gotcha [21:18:31] halfak: Hey paladox just showed me that git-lfs is ready to rock! [21:18:37] :) [21:18:49] Nice was just looking at that. [21:18:54] You want it or can I take a pass on it? [21:18:59] All you! [21:19:00] T181678 [21:19:00] T181678: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678 [21:19:20] lol I think “you want it” == “can I take a pass on it" [21:19:26] git lfs migrate is going to make this way easier than re-writing the git tree used to be. [21:19:30] prepositions... [21:19:47] yeah, glad they thought to include such an essential function [21:19:56] Oh yeah... taking a pass could mean making an attempt or skipping an attempt. [21:20:29] I spent a couple of hours developing a filter branch call last time. 
[21:20:58] “pass at” I think [21:21:14] not actually being didactic… I just thought it was funny [21:21:38] note it has to be switched on this repo [21:22:20] halfak: Logging works! [21:22:40] \o/ nice! [21:24:35] Cool, I’ll roll that to production for fun. Want to throw the ores submodule bump in for good measure? [21:24:51] +1 [21:25:33] (03CR) 10Awight: [V: 032 C: 032] Bump ores submodule [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/394122 (owner: 10Awight) [21:34:40] halfak: https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info=statistics [21:35:01] \o/ [21:35:09] legoktm: If you point your E:ORES at ores-beta testwiki, you’ll get normal thresholds behavior now. [21:35:27] https://ores-beta.wmflabs.org/v3/scores/testwiki?models=damaging&model_info=statistics.thresholds.true.%22maximum%20precision%20@%20recall%20%3E=%201.1%22 [21:35:30] :DDD [21:35:35] lol [21:35:45] It doesn't work in the exact right way :D [21:36:35] halfak: Good point, well it does but you have to give preposterous constraints. [21:36:53] Right. That's good and useful though :) [21:37:05] For testing, you'll want to have it [21:37:26] $wgOresFiltersThresholds = array( "damaging" => array( "likelybad" => array( "max" => 1, "min" => "recall_at_precision(min_precision=0.45)" ), "likelygood" => array( "max" => "recall_at_precision(min_precision=0.99)", "min" => 0 ), "maybebad" => array( "max" => 1, "min" => "recall_at_precision(min_precision=0.15)" ) ) ); [21:37:47] I had to flip some things around in unrealistic ways. [21:37:56] legoktm: ^ you’ll need that secret decoder setting as well. [21:38:06] $wgOresModels = array( "damaging" => true, "goodfaith" => false, "reverted" => false, "wp10" => false ); [21:38:12] $wgOresWikiId = “testwiki” [21:38:58] We could change RevIdScorer to look more like a normal curve I suppose, but that would probably ruin the nice deterministicicity [21:39:56] * awight curses English for not allowing infinite suffixation [21:40:03] determinism [21:41:31] "migrate: Rewriting commits: 64% (258/399)" [21:41:33] :DDD [21:41:39] holy cow [21:41:41] * halfak destroys history [21:41:55] You can’t! I remember it alllll.... [21:42:14] git lfs migrate awight --everything [21:42:22] lol [21:42:28] * awight replaces self with an unlinked local clone [21:43:19] halfak: fyi paladox needs to flip some bits before you push any LFS repos. Also, I’d suggest we snapshot the repos into an archive. [21:43:26] Your branch and 'origin/master' have diverged, [21:43:27] and have 399 and 399 different commits each, respectively. [21:43:31] lol [21:43:35] /o\ [21:43:40] awight which repos do you want it enabled on? [21:44:01] halfak: You agree with the list in https://phabricator.wikimedia.org/T181678 ? [21:44:05] editquality, draftquality, wikiclass, ores-deploy-wheels [21:44:07] * halfak looks there [21:44:11] that was it. [21:44:35] Cool. Need the phab links? [21:44:53] The internet thinks that I just need to push this to github and we're done. [21:44:55] paladox: ^ ? Note that only one of those uses gerrit as the master [21:45:07] halfak: Where are you archiving the unrewritten repos? [21:45:08] ah thanks. [21:45:14] i guess research/ores/wheels [21:45:20] Locally on my disk. [21:45:21] yep [21:45:25] harrr [21:45:36] Just went for coffee, didya [21:45:38] Want a backup somewhere on WMF servers? 
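On the backup question: a bare mirror clone keeps every ref and compresses well, so snapshotting the pre-rewrite history can be as simple as the sketch below. The repo names and the bz2 compression come from this log; the output paths are placeholders.

    # Archive the pre-rewrite history of each repo before pushing anything.
    for repo in editquality draftquality wikiclass; do
      git clone --mirror "https://github.com/wiki-ai/${repo}.git" "${repo}-pre-lfs.git"
      tar -cjf "${repo}-pre-lfs.git.tar.bz2" "${repo}-pre-lfs.git"
    done
    # Copy the tarballs somewhere durable, e.g. the public-datasets area that
    # is published at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/ores/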
[21:45:45] yes pls [21:46:01] I'll put it on stat1006 :) [21:46:01] awight: Re lockTSE and stuff, Aaron Schulz and Krinkle are the experts, I suggest asking them [21:46:08] In the public datasets repo [21:46:09] awight https://gerrit.wikimedia.org/r/#/c/394179/ [21:47:23] * halfak compresses a massive file for upload [21:47:24] paladox: So what happens with the mirrored repos? How is the LFS mirrored over? Do we have to do any configuration, you think? [21:47:33] halfak: You could just clone directly there [21:47:34] University intertubes FTW [21:47:41] Damn... that's a good idea. [21:47:41] haha I see [21:47:47] it should work with a normal git push i think. not sure though [21:48:00] i only know git-lfs works with local git pull, not sure about mirrors [21:48:09] What could possible go wrong 8D [21:48:13] lol [21:49:50] * awight jumps in fright [21:52:02] * halfak waits for the old repo and history to compress for later before trying anything. [21:52:15] I'll be pushing something to github shortly though [21:54:29] That’s awesome. [21:57:41] halfak awight lfs is now enabled [21:57:50] Great! [21:57:53] https://gerrit.wikimedia.org/r/#/admin/projects/research/ores/wheels [21:58:55] The editquality repo is only 2.2GB compressed! [21:59:22] If someone wants to review phantoms' pr and let me know that would be great (hes a gci student) [22:00:06] My phone was annoying me [22:00:10] i gave the repo 3gb of lfs. we can increase it if you need to me. [22:00:14] Zppix, had a look at it. [22:00:23] paladox, we'll need more than that [22:00:34] halfak ok, how much more? [22:00:37] Oh wait... not for wheels [22:00:45] But for editquality we will need more -- like 20GB [22:01:03] ah ok [22:01:30] And? [22:01:35] ok i will enable lfs on [22:01:42] mediawiki/services/ores/editquality [22:01:58] Hmm... that is the gerrit repo but we don't use that at all. [22:02:02] Just the phab one [22:02:16] 10Scoring-platform-team, 10Operations, 10monitoring, 10Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3798373 (10awight) A slightly related request--it looks like /srv/log/ores/main.log is created by modules/service/manifests/uwsgi.pp, it would... [22:02:20] https://phabricator.wikimedia.org/source/editquality/ [22:02:49] 10Scoring-platform-team, 10Wikilabels, 10Easy, 10Google-Code-in-2017, 10Patch-For-Review: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015#3798374 (10Phantom42) I just published [[ https://github.com/wiki-ai/wikilabels/pull/212 | g... [22:02:57] awight hfalfak https://gerrit.wikimedia.org/r/#/c/394198/ [22:03:02] halfak ^^ [22:03:03] Signing off for the day. [22:03:50] o/ awight [22:03:55] Have a good evening [22:04:19] paladox, will it matter that we don't use gerrit there? [22:04:44] I think so. as the lfs objects are stored in a seperate repo in gerrit. [22:04:49] though im not sure [22:06:37] So it all gets mirrored from github. Is it going to be a problem if I push my lfs rewrite now? [22:08:28] paladox, ^ [22:08:30] I could wait [22:08:39] it should not be a problem :) [22:08:47] though i only tested if lfs pushed correctly [22:08:55] you should try :) [22:09:02] git push origin HEAD:refs/for/master [22:09:22] https://github.com/git-lfs/git-lfs/wiki/Tutorial [22:15:06] OK trying it out! [22:16:19] Looks like I'm uploading 5.5GB for editquality -._o_.- [22:16:27] Not sure why the repo is way bigger when cloned. [22:18:27] heh editquality ? 
[22:18:35] the change for that repo needs to be merged [22:18:46] you can do it in the research repo though. [22:19:01] 3gb limit. though can be increased if needed. [22:19:10] halfak ^^ [22:20:32] "research repo" [22:20:33] ? [22:20:58] The wheels one is gerrit only. [22:21:03] No github involvement there. [22:21:26] I'll get a patchset for wheels ready [22:21:28] wheels [22:36:52] * halfak submits an N commit patchset [22:36:57] where N is large [22:38:46] This might not finish before I must leave. [22:38:47] :| [22:42:32] I have three terminal windows open. One is pushing a huge patchset for wheels. One is uploading an even more huge history to the wiki-ai/editquality and another is compressing the huge history of wiki-ai/wikiclass :) [22:44:12] Halfak your cpu dead yet? [22:44:23] Na man. Still turning. [22:44:30] SSD i hope? [22:44:58] yup [22:45:15] 10Scoring-platform-team, 10Wikilabels, 10Easy, 10Google-Code-in-2017, 10Patch-For-Review: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015#3798513 (10Phantom42) However, right now CI build is failing. That's because some of tests I... [22:45:59] Almost done with wiki-ai/editquality :) [22:46:38] Halfak on T179015 needs your opinion (they want a DB accessiable for CI testing) [22:46:39] T179015: Introduce and create pytest for flask application of the wikilabels AI service - https://phabricator.wikimedia.org/T179015 [22:47:01] * paladox has a ssd [22:47:03] Thanks. Not going to be able to respond today :\ [22:47:07] it's a flash drive they call it :) [22:47:56] https://github.com/wiki-ai/editquality is updated [22:48:07] paladox, now we get to see what diffusion does [22:48:12] heh [22:57:29] gerrit errors "no common ancestry" [22:57:30] :D [22:57:39] Looks like we might have to have someone push this manually. [22:57:43] I'll look into that tomorrow. [22:58:33] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798554 (10Halfak) Trying start a gerrit review for wheels. Got this: ``` Do you really want to submit the above commits? Type 'yes' to confirm, other... [22:59:47] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798556 (10Halfak) Putting repo backups here: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/ores/ I'm editquality and draftquali... [23:00:17] Wikiclass is at 13GB so far O_O [23:00:34] bz2 compressed. [23:00:34] Still growing [23:00:56] I really want to close my laptop and bike away but I didn't put this into a screen >:( [23:02:55] halfak git push origin HEAD:refs/for/master [23:03:22] In gerrit? I could try that [23:03:28] I'm guessing I don't have the permissions though [23:05:59] Damn. Looks like the migration script doesn't work quite right [23:07:48] paladox, same error [23:11:46] oh [23:12:14] oh i see halfak ! [remote rejected] HEAD -> refs/publish/master/git-lfs-migration (no common ancestry) [23:12:25] yeh probaly needs force push or do it for a different branch [23:14:28] OK. I'm going to run away now. 
Will try to find someone who can force-push tomorrow ^_^ [23:14:35] Thanks for your help, paladox [23:15:00] you're welcome :) [23:17:56] 10Scoring-platform-team, 10Gerrit, 10ORES, 10Operations, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798581 (10Halfak) https://github.com/wiki-ai/draftquality is fully updated.
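For the record, the force-push being punted to tomorrow would look roughly like this. A sketch that assumes a local clone containing the rewritten wheels history; it needs the Gerrit "Force Push" permission on refs/heads/* because it bypasses review entirely.

    # The rewritten LFS history shares no ancestor with what Gerrit has, so a
    # review push to refs/for/master is rejected with "no common ancestry".
    # Someone with direct push rights pushes the branch itself instead:
    git push --force origin HEAD:refs/heads/master

    # Everyone else then has to re-fetch and hard-reset, since all commit
    # SHAs changed in the rewrite.
    git fetch origin
    git reset --hard origin/master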