[08:05:05] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3819169 (10akosiaris) >>! In T169246#3817883, @awight wrote: > Point well taken. What if we temporarily depool some of the servers f... [08:13:50] 10Scoring-platform-team, 10ORES, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3819202 (10akosiaris) removing uWSGI from the tests is very easy. Just submit directly to the celery queue the jobs/min you 'd like and see if the scores proces... [11:27:41] (03PS1) 10Ladsgroup: Fix name of class in docs [extensions/ORES] - 10https://gerrit.wikimedia.org/r/395978 [11:29:12] (03PS2) 10Ladsgroup: Join decomposition of ores_model table queries [extensions/ORES] - 10https://gerrit.wikimedia.org/r/395811 (https://phabricator.wikimedia.org/T181334) [14:18:14] (03CR) 10Umherirrender: Fix name of class in docs (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/395978 (owner: 10Ladsgroup) [14:22:26] Good morning! [14:22:35] awight: hows you? [14:22:57] (03CR) 10Thiemo Mättig (WMDE): [C: 032] Fix name of class in docs (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/395978 (owner: 10Ladsgroup) [14:23:09] Good morning! [14:23:14] refeed[m]: o/ [14:23:34] *activity* - is there a phab task for adding ORES scores to AbuseFilter? [14:24:01] I dont think so [14:24:22] Our workboard is at scoring-platform-team feel free to check there TheresNoTime [14:24:40] (03Merged) 10jenkins-bot: Fix name of class in docs [extensions/ORES] - 10https://gerrit.wikimedia.org/r/395978 (owner: 10Ladsgroup) [14:26:52] It's already night here tbh ww [14:34:05] TheresNoTime: hi! 
There are some interesting comments about AbuseFilter integration on https://phabricator.wikimedia.org/T123178 [15:27:40] 10Scoring-platform-team, 10ORES, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3820270 (10Halfak) We do have graphite logging in celery workers every time a score is processed. I'm not sure how that helps us in this situation. We're curr... [15:28:22] halfak: I thought akosiaris’s suggestion was great, to inject straight into the celery queue. [15:28:37] awight, I don't know. That sounds painful and weird. [15:28:42] hehe [15:28:45] Does celery *just* put it at the end of the queue? [15:29:02] Or do we need to write a new stress tester to connect to celery and submit the jobs directly. [15:29:23] How do we make sure we send the data in the same way that uwsgi does? [15:29:38] I think the latter, we would need to tap into our code to inject celery [15:29:48] +1 that increasing uwsgi workers is also a perfectly decent approach [15:29:54] Why don't we just use the uwsgi client to send it to the celery queue? It seems perfectly suited to the task of being a stress tester. [15:29:56] what does uwsgi have to do with the code that does the job insertion ? [15:30:10] uwsgi is just an app server.. it's the ores code anyway that does the job insertion [15:30:19] * halfak sighs [15:30:35] from the sigh I guess I might be wrong ? [15:30:36] ores client that lives in uwsgi == "uwsgi client" [15:30:54] What would be nice is to find a way to add instrumentation that can diagnose this bottleneck [15:31:06] We'd need to instrument the uwsgi queue [15:31:15] We have instrumentation of the # of active web workers. [15:31:18] We already have some metrics coming out of that [15:31:30] yeah, and # of requests served [15:31:48] btw I am not saying that uwsgi is not the bottleneck here, it might very well be.
And given the ~1s request time it's entirely plausible [15:32:04] now if we only could drop the request time from ~1s ... [15:32:13] The web worker count instrumentation seems to be broken for ores* [15:32:21] ooh? [15:32:32] ~1s request time? [15:32:47] https://grafana.wikimedia.org/dashboard/db/ores?panelId=13&fullscreen&orgId=1&from=now-24h&to=now-1m [15:32:47] The mean 1.17s to process a score? [15:33:00] yup, that's what I got from the graphs... ~1s [15:33:07] I don't see how that is relevant [15:33:22] what does the uwsgi worker do in that ~1s ? [15:33:28] I think it’s relevant because uWSGI needs to maintain an open socket for that time [15:33:40] even if it’s not working, it has a limited number of parallel requests it can handle [15:34:10] awight, right. But we can't make it much faster and regardless we need to handle even longer requests. [15:34:28] Right! That's why we need to bump up the number of parallel requests that uwsgi can handle. [15:34:41] ok, just curious here. with just 1 worker, how many req/s would we handle ? [15:34:51] what kind of worker? [15:34:59] right, sorry about that. uwsgi process [15:35:20] We could handle 1 / 1.1 reqs/sec [15:35:28] 1 request per 1.1 seconds [15:35:48] And it's not really requests [15:35:50] Mind. [15:35:57] ? [15:36:00] When we have a cache hit, it's fast and cheap. [15:36:09] These are processed scores we are talking about. [15:36:14] A cache miss. [15:36:14] ah yes, but that's irrelevant for what I am about to say [15:36:20] And it's a request for a *score* [15:36:27] We get other requests for model info and stuff. [15:36:42] My guesstimate is that we should increase the uWSGI pool by x5, simple because we’re at 20% CPU. [15:36:47] So we should use the language of scores/sec [15:36:48] so in the case of a cache miss for a scoring, the uwsgi worker is blocked for the entirety of 1.1 secs, doing nothing else, right ? [15:37:02] awight, why would we increase by 5x? 
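[Editor's note: the direct-to-celery stress-test idea floated above could be prototyped along these lines. This is a hypothetical sketch — the task arguments and rate are illustrative, and the `submit` callable is pluggable so the pacing logic runs without a broker; against a real deployment it would be something like Celery's `app.send_task(...)`, which needs a live broker.]

```python
import time

def inject_jobs(submit, make_args, total_jobs, jobs_per_min, sleep=time.sleep):
    """Submit `total_jobs` scoring jobs at roughly `jobs_per_min`,
    bypassing the web tier entirely. `submit` is whatever enqueues
    one job -- with Celery this would be app.send_task(...); here it
    is injected so the rate control can be exercised standalone."""
    interval = 60.0 / jobs_per_min
    for i in range(total_jobs):
        submit(make_args(i))
        sleep(interval)
    return total_jobs

# Exercise the injector with a stub queue and a no-op sleep.
queue = []
sent = inject_jobs(queue.append,
                   lambda i: ("enwiki", "damaging", 1000 + i),
                   total_jobs=5, jobs_per_min=6000,
                   sleep=lambda s: None)
```

As akosiaris notes, the open question is whether jobs injected this way look the same as jobs enqueued by the ores code running under uwsgi.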
[15:37:19] It’s looking like our celery worker count will have available workers even when the CPUs are maxed-out, so that won’t be a limiting factor. [15:37:23] akosiaris, not entirely, no [15:37:29] halfak: because that would give us 100% CPU utilization. [15:37:29] The web worker does some stuff. [15:37:43] awight, why not just have enough web workers to feed the celery workers? [15:37:48] awight: you DO NOT WANT 100% utilization [15:37:49] Like I proposed. [15:38:01] akosiaris: +1 — what’s the correct target? [15:38:14] I’m only suggesting this for stress testing, not a production level. [15:38:48] Yes. So this is an old conversation -- that we need to have slightly more web workers than celery workers in order to saturate the capacity we have on the celery side. [15:38:59] I'd be OK with having twice as many web workers too. [15:39:04] production wise, the correct target is what allows you to serve your guesstimation of incoming traffic + a % for a buffer against spikes [15:39:05] But more than that is unnecessary [15:39:21] which is not that issue to guesstimate ;-) [15:39:26] Right. akosiaris speaks to a mixture of uwsgi and celery workers. [15:39:26] s/issue/easy/ [15:39:42] Basically we need 1:1 uwsgi and celery for whatever capacity meets akosiaris' constraints. [15:39:54] Between 1:1 and 2:1 is fine. [15:39:56] akosiaris: fwiw, we have an average traffic level of 500 req/min over the past year. [15:40:18] so ~9 r/s (rounding up) [15:40:30] The largest recorded spike was in February, when it hit 4.5k req/min for a day or two, but I think that was being throttled by the uWSGI bottleneck so we don’t know how high a real spike might be. [15:40:47] awight, not possible [15:40:53] We had more web workers than celery workers. [15:41:05] halfak: I don’t think it’s relevant to match uWSGI workers to celery workers, because the current number of celery workers can oversaturate the CPUs. [15:41:21] Oh then we should cut down celery workers. 
[15:41:51] awight, can or *will* if they were all active at the same time? [15:41:52] halfak: ok either way, the peak has a flat plateau, which suggest that something was throttling the incoming requests at a steady level, and we don’t know what the real demand would have been. [15:42:06] awight, did we overload? [15:42:07] awight: probably less [15:42:10] Because we've overloaded with less. [15:42:12] +1 that my napkin indicates that we’ll end up turning down the celery worker count [15:42:44] We can't operate effectively without cranking up uwsgi worker count regardless. [15:42:45] halfak: no but remember, overload conditions seem to be caused by OOM [15:42:55] halfak: +1! [15:42:56] What? [15:42:57] No [15:43:03] I mean yeah if it kills workers. [15:43:20] I found that our nodes were dying intermittently forever now [15:43:23] Overload is just when celery's queue gets too big -- there are too many incoming requests for the celery pool to handle. [15:43:39] Right. If a node dies, the celery pool is smaller all of a sudden. [15:44:31] yes but it’s looking like our celery pool can actually handle all the requests we receive—this is the scenario you’ve been worried about, that we’re silently overloading at the uWSGI entry point [15:45:47] awight, we got overloads when google was hammering us. [15:45:58] If my estimate was right, then 50 workers x 4 machines should be able to handle almost 9k req / min [15:46:03] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1484268570210&to=1486633765710 [15:46:08] halfak: Can you … ah [15:46:27] yeah that’s the window I was thinking of [15:46:37] awight, *scores/min [15:46:42] ty [15:46:57] I don’t see any hint of the workers dying [15:47:02] right [15:47:24] Because we didn't need workers to die to get overloaded [15:47:48] I believe we might have bumped worker count after this point. 
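[Editor's note: the back-of-the-napkin throughput math in this exchange — one blocked web worker per in-flight score, ~1.1 s per score — reduces to a one-liner. The numbers below are the ones quoted in the chat; the formula gives a slightly more optimistic ceiling than halfak's "almost 9k req/min" because it ignores all overhead and cache-hit traffic.]

```python
def scores_per_min(workers_per_host, hosts, mean_score_time_s):
    """Upper bound on processed scores/min when each worker is fully
    blocked for the duration of one score (no queueing, no cache hits)."""
    return workers_per_host * hosts * 60.0 / mean_score_time_s

# One uwsgi process at ~1.1 s/score: "1 request per 1.1 seconds".
single = scores_per_min(1, 1, 1.1)    # ~54.5 scores/min

# halfak's estimate: 50 workers x 4 machines.
cluster = scores_per_min(50, 4, 1.1)  # ~10,900 scores/min ceiling
```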
[15:47:57] k, that demonstrates that the uWSGI / celery ratio was good then [15:48:11] We didn't have changeprop response speed tracking at the time it seems. [15:50:03] Is it possible to instrument average celery worker idle time, I wonder? [15:50:19] I’d think that celery would support some of that natively. [15:52:24] In the meantime, we should probably try the obvious solution of bringing out uwsgi workers up to spec. [15:52:53] fwiw, on Feb 5 we increased the celery worker count 40 -> 45 [15:54:07] I'm looking for the right way to change uwsgi workers_per_core on ores* specifically. [15:54:30] also fyi, workers_per_core was == 2 in Feb [15:54:48] https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/ores/stresstest.yaml [15:54:56] halfak: search operations-puppet for workers_per_core [15:54:57] Does that seem like the right place? [15:55:18] Right awight. I don't want to change it for scb* [15:55:28] yes, but I have to admit I never understood why we have workers_per_core and not just plain "workers" [15:55:39] akosiaris, I'm OK with "workers". [15:55:49] That’s the right place for our stress boxes, yes [15:56:10] yeah, overall I think we should not be tuning too much the software to the hardware [15:56:23] kubernetes is coming and that way is not going to work [15:56:34] ores::web::workers_per_core [15:56:43] akosiaris: What’s the alternative? [15:56:53] In celery we just set the number of workers [15:57:13] profile::ores::celery::workers vs profile::ores::web::workers_per_core [15:57:25] akosiaris: ah, I see what you’re saying. [15:57:50] honestly I don’t understand when we add “profile::"- [15:57:50] in kubernetes land ? 
just decide how many req/s you want to serve, then define the quantum execution unit and allow kubernetes to autoscale [15:58:07] akosiaris, +1 [15:58:12] But we're not there now [15:58:15] akosiaris: halfak: oh hey, $processes = $::processorcount * $workers_per_core [15:58:21] right [15:58:24] so we’re free to specify $processes directly instead. [15:58:35] nope, but keep it in mind it's coming [15:58:41] awight, not sure that helps anything [15:58:44] akosiaris, next year [15:58:55] next quarter is mathoid, the quarter after that I would like to propose ORES [15:59:00] awight, setting $processes that is [15:59:11] akosiaris, sure I'm happy to do that. [15:59:34] This quarter, we're trying to deploy to our new cluster without breaking our current cluster :) [15:59:37] Sorry to say, a tidal wave of meeting is headed my way. [15:59:46] lol [16:00:04] I'm going to work on a workers_per_core patch for stresstest.yaml [16:03:24] * halfak adds a task for refactoring ORES puppet stuff for kubernetes. [16:04:33] 10Scoring-platform-team, 10ORES, 10Operations: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331#3820428 (10Halfak) [16:05:49] 10Scoring-platform-team, 10ORES, 10Operations: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331#3820428 (10awight) One thing that @akosiaris pointed out, we'll want to replace this puppet formula: > $processes = $::processorcount * $workers_per_core and specify the num... 
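[Editor's note: the puppet formula quoted in the task comment, `$processes = $::processorcount * $workers_per_core`, and the proposed replacement (specify the count directly, decoupled from the hardware) can be sketched as follows. This mirrors the logic under discussion, not the actual puppet code.]

```python
def uwsgi_processes(processor_count, workers_per_core=None, explicit=None):
    """An explicitly configured worker count wins; otherwise fall back
    to the hardware-derived cores * workers_per_core formula."""
    if explicit is not None:
        return explicit
    return processor_count * workers_per_core

# workers_per_core == 2 on a 24-core scb host -> the 48 seen in production
legacy = uwsgi_processes(24, workers_per_core=2)
# decoupled from the hardware, as suggested for the kubernetes future
fixed = uwsgi_processes(24, explicit=230)
```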
[16:06:28] awight, damn wrong task :P [16:06:31] I'm working on that one [16:06:37] lol [16:06:42] 10Scoring-platform-team, 10ORES: Refactor ORES puppet for Kubernetes - https://phabricator.wikimedia.org/T182332#3820450 (10Halfak) [16:06:44] There ^ [16:07:14] note the point in the description where I thought "Oh crap just post it and edit later" [16:08:13] 10Scoring-platform-team, 10ORES: Refactor ORES puppet for Kubernetes - https://phabricator.wikimedia.org/T182332#3820450 (10Halfak) [16:12:23] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3820532 (10Halfak) In this case, it's advanced smoke testing for the cluster. I'm hesitant to deploy in production until we've thoro... [16:27:10] halfak: what would be the multiclass equivalent of counting class here - https://github.com/wiki-ai/revscoring/blob/master/revscoring/scoring/statistics/classification/counts.py#L23 [16:27:43] i'm not able to extend "self['predictions'][label][predicted] += 1" to the multiclass case [16:31:33] codezee, oh damn. [16:31:38] For label in label [16:31:45] :) [16:31:49] * halfak thinks. [16:32:29] I wonder if we should change the default for "label" to be a list of labels and multilabel would be the only case where "labels" contains more than one label. [16:32:34] This is a painful generalization. [16:32:41] halfak: i thought abt that but given [A,B] in predicted and [A,B,C] in true, does it make sense to increment each of [A][A], [A][B], [A][C] [16:32:57] no [16:33:08] good point. [16:34:18] Counts only seems to make sense in the single-label case as it is defined. [16:34:23] * halfak thinks more [16:35:05] Actually, wait. [16:35:14] You could just leave it as is and it will make a HUGE table. [16:35:22] Pairs of sets. [16:36:03] increment [(A,B,C)][(A,B)] [16:36:08] I don't know if I like that idea. [16:36:20] It doesn't convey what I really want. 
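[Editor's note: the usual way out of the pair-of-sets explosion discussed above is to binarize per label — each label gets its own small TP/FP/FN/TN tally, in the spirit of sklearn's `multilabel_confusion_matrix`. A minimal sketch:]

```python
from collections import Counter

def per_label_counts(labels, pairs):
    """Per-label binarized confusion counts for multilabel data:
    instead of one giant table keyed by (true-set, predicted-set),
    keep an independent TP/FP/FN/TN tally for each label."""
    counts = {label: Counter() for label in labels}
    for true_set, pred_set in pairs:
        for label in labels:
            in_true, in_pred = label in true_set, label in pred_set
            if in_true and in_pred:
                counts[label]["tp"] += 1
            elif in_pred:
                counts[label]["fp"] += 1
            elif in_true:
                counts[label]["fn"] += 1
            else:
                counts[label]["tn"] += 1
    return counts

# codezee's example: true labels {A, B, C}, predicted {A, B}
c = per_label_counts(["A", "B", "C"], [({"A", "B", "C"}, {"A", "B"})])
```

This loses the cross-label correlations a full joint table would capture, which is the trade-off being weighed in the chat.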
[16:37:24] yeah, that hardly seems useful [16:39:18] awight, when you said we can change $processes directly, I was mis-reading the code. I see what you mean now. This could work :) [16:39:31] * halfak works that into his patch [16:39:39] patch for review already [16:39:51] :| I've been wasting my time. [16:39:53] lol [16:40:10] That's why I'd announced I was working on it. [16:40:14] Oh well. [16:40:17] I need to relocate, this connection is too slow to even upload [16:40:23] edit conflicts happen [16:40:34] I thought you were just bumping the number [16:41:06] On the down side, this patch affects production. [16:41:08] i see now why sklearn doesn't support confusion matrix for multi-label :/ [16:41:15] and… decreases workers perhaps [16:41:18] Working out the right way to do it without affecting scb* [16:41:26] * awight taps fingers waiting for the upload [16:41:27] haha codezee [16:41:35] You can have this if it’s useful… and if it ever uploads [16:41:40] Was just going to suggest you look around online to see what other people do. [16:42:16] https://gerrit.wikimedia.org/r/396055 [16:42:17] All yours [16:42:51] Icinga died ill tell -operations [16:43:08] halfak: different numbers of CPUs on scb1001-2, vs scb1003-4 I see [16:43:43] 24 vs 32 CPUs [16:44:10] So a really cautious patch would set different numbers for each machine. [16:44:20] How about we just do 1:1? [16:44:58] (that’s what I’ve done in that patch) [16:45:26] it also makes sense cos celery_workers is that way already. Are we concerned that this will OOM? [16:51:20] * awight throws knife I used to kill icinga-wm off a short bridge [16:54:38] awight: ^ i had it fixed [17:01:10] relocating to avoid extradition…. back in 10 [17:11:13] halfak: You want to own that patch or should I fix it up? [17:12:38] awight, I'm working on my own. [17:12:46] It looks like we had different ideas. [17:13:04] ok [17:13:28] eisenhaus335/wikilabels#2 (master - a0bcb0a : eisenhaus335): The build has errored. 
https://travis-ci.org/eisenhaus335/wikilabels/builds/313065894 [17:13:56] akosiaris: Is there some magic Puppet glue I’m overlooking, or is profile::ores::celery::workers a typo for profile::ores::web::celery_workers ? [17:14:17] halfak: want to share the WIP? [17:14:55] akosiaris: It… seems to work which scares me. [17:15:24] nvm, I see the explicit hiera call, ./modules/profile/manifests/ores/web.pp: $celery_workers = hiera('profile::ores::celery::workers', 45), [17:15:26] awight, documenting math [17:15:45] we shouldn't have that in web.pp [17:15:50] akosiaris: Why is it like that? I thought direct hiera calls were evil? [17:15:59] Not in profile apparently [17:16:01] ? [17:16:06] O_o [17:16:39] halfak: Looks like you’re right, grep -rw hiera modules/profile/ | wc -l => 1361 [17:17:44] I hate puppet, and actively wish for something better to eclipse it. [17:17:58] Probably all of ops feels the same way [17:19:44] I’m going to try to pay attention to JADE like I threatened on the calendar... [17:20:14] 10Scoring-platform-team, 10ORES, 10Patch-For-Review, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3820695 (10Halfak) OK so, I think we actually need to bump the worker count to `celery_workers` + `queue_size`. Since `queue_size` is 600... [17:20:27] awight, ^ [17:20:33] https://gerrit.wikimedia.org/r/396064 [17:20:49] I'm looking into a better way to overwrite $processes directly. [17:21:15] Will submit that as a followup if I can get it. [17:21:46] halfak: I would just mash our patches together. [17:21:55] set $uwsgi_workers to 64, so scb1003-4 are unharmed, then specify $uwsgi_workers directly for scb1001 and scb1002 [17:22:15] c.f. 
hieradata/hosts/scb1001.yaml [17:23:20] 64 * 9 isn't going to cut it for stresstest [17:23:28] rather ores* [17:23:48] Sure, specifiy as 2200 / 9 or something in the stresstest.yaml [17:24:53] +1 for putting all the ducks in a row, so we don’t have to ask for multiple ops CRs [17:25:54] halfak: whats exactly the function of population_rates parameter? [17:26:00] stresstest workers + queue_size = 750 fwiw [17:26:13] oh hmm that’s queue_size over all machines [17:26:19] right [17:26:26] so 150 + (600 / 9) [17:26:33] 220 [17:27:51] 600 + 1 [17:28:08] Note that it's good to have more than that because we get requests that do not ask for a score. [17:28:36] * awight tiptoes away [17:29:42] * Zppix drags awight back [17:31:48] * awight throws icinga-wm off another precipice and Zppix must decide whose life to save [17:32:19] maybe 150 + (600 / 9) + 50 fudge [17:32:20] Neither [17:32:20] icinga-wm: is outdated anyway [17:32:25] 266 [17:32:57] Maybe just 10 fudge since non-scoring requests are so fast. [17:33:20] ~230 [17:34:24] halfak: factor in cached reqs [17:34:40] Zppix, same story [17:36:01] I wonder if we move celery to k8s or something if we could get more bang for our buck [17:37:16] Zppix, that's the plan [17:37:30] Oh sweet [17:48:59] awight, I agree re. merging. I think that your change is good though I don't know where the "45" number came from. [17:49:16] I see 48 on scb1* nodes [17:49:18] hehe 1:1 with celery. But it turns out that’s not a number we use [17:49:24] and 32 on scb2* nodes. [17:49:35] Oh I'm talking about current production. [17:49:41] I’m happy to rebase and tweak it, shall I? [17:49:50] I'm thinking it's safe to just set everything to 48 [17:49:56] For current scb nodes [17:50:02] and then have something in stresstest for ores* nodes. [17:50:20] sure, let’s do that. [17:50:34] The only thing bothering me is that memory is right up against the wall already. [17:50:56] Was looking at that. 
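[Editor's note: the sizing arithmetic being worked out above — per-host web workers ≈ celery workers + this host's share of the global queue + a small fudge for cheap non-scoring requests — as a sketch, using the stresstest numbers quoted in the chat:]

```python
import math

def web_workers_per_host(celery_workers, queue_size, hosts, fudge=10):
    """Per-host uwsgi worker target so the web tier can keep the
    celery pool saturated: local celery workers, plus this host's
    share of the global queue, plus slack for non-scoring requests."""
    return celery_workers + math.ceil(queue_size / hosts) + fudge

# 150 stresstest celery workers, 600 queue slots, 9 ores* hosts
target = web_workers_per_host(150, 600, 9)  # 227, the ~230 ballpark above
```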
We have fewer CPUs per host in scb2* [17:51:03] BUT the same amount of memory [17:51:17] So theoretically it would be safe to bump the web worker count. [17:52:16] RES is 800M for these, and I doubt they’re even close to that, cos of the copy-on-write thing. [17:52:47] actual memory per celery worker is 310M according to my napkin, and those do much more data stuff. [17:53:05] right [17:53:19] We'd be going up from 32 workers to 48 workers. [17:53:47] here, lemme just rebase and tweak [17:53:51] so it’s safe [17:54:39] awight, https://gerrit.wikimedia.org/r/396064 [17:55:14] Woops. I have a mistake in there ^_^ [17:55:57] No patchset uploaded. [17:56:03] *new [17:56:51] Mind if I do? [17:57:13] do what [17:57:21] nvm [17:57:43] I was going to tweak it to not change production [17:57:53] but it looks like you’re still hacking [17:58:05] Na. How would we fully not change production? [17:58:27] In this case it shouldn't change scb1*, but it will have an effect on scb2* [17:58:58] It is very annoying that scb is different between CODFW and EQIAD. [17:59:01] I would set the default to 32, then add entries to hosts/scb100* to set to current values. [17:59:08] awight: direct hiera calls are not evil, implicit ones are [17:59:33] akosiaris: Like, class params? [17:59:34] and any kind of hiera call outside of a profile is [17:59:38] https://gerrit.wikimedia.org/r/#/c/396064/4/modules/profile/manifests/ores/web.pp is an example of explicit? 
[17:59:42] is evil that is [17:59:51] yes, that's the good call [17:59:58] cool I think I see what you mean then [18:00:34] * akosiaris searches search for the wmf doc on this [18:00:46] there you go https://wikitech.wikimedia.org/wiki/Puppet_coding#Organization [18:00:52] that's WMF guidelines on these things [18:00:56] akosiaris: I think future computer historians are going to look back on puppet and go “OMFG” [18:01:19] I definitely agree [18:01:27] awight, I agree with setting up specific hiera for scb1* [18:01:32] You want to do it or should I? [18:01:52] anyway, got to run.. send reviews my way and I 'll review :D [18:01:56] :) [18:02:05] halfak: Happy to. One moment, please [18:02:46] OK I'll leave it to you. [18:02:56] kk thanks akosiaris [18:04:00] halfak: I won’t do it this time, but FYI my favorite way to co-author stuff like this is to do a followup patch in real-time, and the first author can cherry-pick -p in as desired. [18:04:58] halfak: I see a few of the existing lines are violating “explicit hiera calls with no fallback value” [18:05:27] awight, not sure what you mean there. [18:05:38] AIUI, this is about the “, 48)” default value [18:05:49] So that's good or bad? [18:05:59] Looks like that is a "fallback value" [18:06:16] not gonna change that though. [18:06:41] I think the style guide is saying that fallback value is bad [18:06:42] Is that good or bad [18:07:09] Both [18:07:17] But it’s not clear to me where the defaults should go, so I’ll leave it alone. [18:07:21] Ahh yeah. I agree. There are two fallbacks in place. [18:07:27] In the module itself [18:07:29] Check it out [18:07:31] At the top [18:07:38] the fallback in the module will never come into play though [18:07:56] modules/ores/manifests/web.pp [18:08:03] Why not? [18:08:58] I think we only um instantiate (donno what it’s called in puppetville) those modules from the profile code. 
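[Editor's note: the precedence question being debated — explicit hiera value vs the profile's `hiera(key, fallback)` argument vs the module's own parameter default — behaves like a layered lookup. A hypothetical sketch of that resolution order, with the values from the chat:]

```python
def lookup(key, hiera, profile_fallback=None, module_default=None):
    """A host/role hiera value wins; else the profile's fallback
    argument to hiera(); else the module's parameter default. When
    the profile always supplies a fallback, the module default can
    never win -- which is why the fallback in
    modules/ores/manifests/web.pp 'never comes into play' as long
    as the profile is what instantiates the module."""
    if key in hiera:
        return hiera[key]
    if profile_fallback is not None:
        return profile_fallback
    return module_default

# scb1001 pins the value in hieradata/hosts, so hiera wins:
v1 = lookup("profile::ores::web::workers",
            {"profile::ores::web::workers": 48},
            profile_fallback=32, module_default=16)
# a host with no hiera entry falls back to the profile's value:
v2 = lookup("profile::ores::web::workers", {},
            profile_fallback=32, module_default=16)
```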
[18:10:50] There’s no hiera.yaml, so I have no clue how to reason about which config files are read for each node. [18:11:15] awight, but you changed stuff in modules/ores/manifests/web.pp before [18:11:25] Why did you change it there if you don't think it matters? [18:11:27] sure I’m just following the trail [18:11:51] I can’t blaze new trails cos I don’t know what’s up [18:12:47] Ahh yeah. I'm in the same boat. Gonna rely on ako* to help us out in review. [18:13:14] I think we should drop the fallback in profile, keep it in the module and set the hiera in all of the scb1* [18:13:16] And we're done. [18:13:32] Then a follow-up can clean up all of the other fallbacks in profile. [18:14:07] awight, ^ [18:14:07] Other way around maybe but ok [18:14:11] ? [18:14:17] What would be the other way around? [18:14:24] If the profile always instantiates the module, then the fallback in the module is never used. [18:14:35] I don't think that is the case. [18:14:54] ummm [18:14:55] I think if hiera doesn't have anything (and we don't do the fallback in profile) then the module's value will be used. [18:15:03] ok I’m keeping both fallbacks cos this is spooky [18:15:08] I’ll comment in the commit message. [18:15:10] kk [18:15:38] That's fine with me. Again, can be cleaned up in a follow-up [18:19:01] Pushed. [18:24:19] +1'd [18:24:23] I'm running off to lunch [18:24:30] back for post morten in ~30 mins [18:24:34] *mortem [18:24:37] o/ Nettrom [18:24:38] :D [18:24:51] Typo reminded me to say "Hi" but now I'm running away for lunch [18:24:55] have a good lunch, halfak :) [18:53:19] akosiaris: your one of the Datacenter team people right? [18:55:46] dc ops? no he's not [18:56:34] (also bear in mind it's 9 pm our tz so he's hopefully away) [18:57:41] back [18:57:49] In the Post Mortem call [18:57:59] (not started yet) [18:58:23] o/ awight [18:58:29] have you ever done one of these before. 
[18:58:35] nope [18:58:38] I just realized I have no idea what they are actually like [18:58:46] I guess we let Greg drive it? [18:58:53] Only in fundraising, which is unique I’m sure [18:59:10] I actually didn’t realize this was a standard thing. [19:04:29] (03CR) 10Catrope: [C: 04-1] Join decomposition of ores_model table queries (033 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/395811 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [19:09:17] Ugh anyone an networking expert here? [19:36:19] hey, do you know the status of ores200*.codfw.wmnet? [19:36:33] they seem to exist but not in use yet? [19:36:48] are they going to use role(ores::stresstest) maybe? [19:37:28] i just need all nodes to have _a_ role, if in doubt i will give it "role(test)" for now, just cant be without any role as it seems now [19:38:29] found it, status: stalled ok https://phabricator.wikimedia.org/T165170 [19:38:59] but the reason to stall it was "while https://phabricator.wikimedia.org/T169246 is ongoing" and that ticket is resolved [19:39:10] so might be unstalled [19:40:42] 10Scoring-platform-team, 10ORES, 10Operations, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3821059 (10Dzahn) Is this unstalled now? The reason was while T168246 is ongoing but that ticket is resolved. Is it really resolved though? [19:41:41] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3414945 (10Dzahn) Is the stress test over? Then T165170 is probably unstalled now. Is it not over yet? Then maybe this ticket shou... 
[19:45:15] *note, reopen stress task [19:45:37] ^ lol [19:46:08] awight: ive never heard of a dev to want to do that but ok [20:02:45] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3821108 (10awight) @Dzahn sorry--we decided to test some more, to overcome a suspiciously low performance ceiling. I'll make the fol... [20:05:45] 10Scoring-platform-team, 10ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3821137 (10awight) [20:05:47] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3821132 (10awight) 05Resolved>03Open Reopening until we finish with {T182249}. [20:15:02] 10Scoring-platform-team, 10ORES, 10Patch-For-Review: Refactor ORES puppet for Kubernetes - https://phabricator.wikimedia.org/T182332#3821154 (10Halfak) Right now, it seems like we want to have one uwsgi worker per celery worker because a uwsgi worker will block while a celery worker generates a score. We'll... [20:16:41] 10Scoring-platform-team, 10ORES, 10Patch-For-Review: Refactor ORES puppet for Kubernetes - https://phabricator.wikimedia.org/T182332#3821157 (10Halfak) [20:17:27] 10Scoring-platform-team, 10ORES, 10Performance: Profile ORES code memory use - https://phabricator.wikimedia.org/T182350#3821158 (10awight) [20:22:04] back in 30 minutes [20:52:59] 10Scoring-platform-team, 10MediaWiki-extensions-ORES: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354#3821279 (10awight) [21:31:45] awight, I have growing skepticism about using Flow for JADE comments [21:49:16] It comes down to, are we providing a way to spell, I think. [21:49:16] slowdown: ^ want to jump into this chat? 
[21:49:21] * slowdown reads [21:49:40] Do we have to provide admin suppression tools just in case abusers figure out how to spell something using the inputs other than free-form text. [21:49:51] e.g. a string of first letters of articles. [21:51:21] I would say yes, because that will happen [21:51:22] slowdown: We’re using Risker’s guidelines, https://en.wikipedia.org/wiki/User:Risker/Risker's_checklist_for_content-creation_extensions [21:52:34] I've seen abuse pushed through in similar form in the old Education Extension, and it didn't have suppression built in so the abuse had to be manually redacted from the database [21:52:55] * awight mops beads of sweat [21:53:33] Risker defines “content” as: Any material added, removed, altered, revised, edited, deleted, or otherwise manipulated by a registered or unregistered user using any user interface that creates a change to any aspect of the Wikimedia project. [21:53:37] that would be us. [21:54:05] awight, I think you're getting hung up on suppression [21:54:18] When I think we need to support basic curation [21:54:19] halfak: well, thanks for thinking of this. It’s a whole lot of architectural and roadmap change, though. [21:54:25] and suppression is a small part of that [21:54:32] I don't think so awight [21:54:34] I still like Flow because it gives us real discussions [21:54:39] we knew we needed mediawiki integration [21:54:51] We've been discussing Risker's articles [21:55:14] I like Flow too -- for discussions [21:55:14] We should not re-implement discussion systems [21:55:28] But comments != discussions or posts. [21:55:38] halfak: Not sure what you mean wrt suppression—we’ve been designing for editing judgments all along. [21:55:51] And the benefits we get for cramming comments into "discussion" don't add up to me. [21:56:01] awight, right but judgements must show up in recentchanges/logging/etc. [21:56:09] Or curation can't really happen [21:56:11] yes that’s new news [21:56:14] Oh and watchlists. 
[21:56:23] Not to me. Sorry we didn't discuss this before. [21:56:26] IMO we were only going to allow the author to “edit” their own judgment. [21:56:28] I've had it in mind the whole time. [21:57:01] awight, not sure what you mean by that. [21:57:09] no worries, I’m just trying to communicate how dramatically this altered my own understanding [21:57:11] People are going to switch preference bits. [21:57:25] I'm struggling to imagine your previous understanding. [21:57:39] exactly, that people can edit their own judgments, although it’s really adding more judgments and deprecating the old ones. [21:57:58] Sounds like you are talking about "endorsements" [21:58:38] no, people changing their mind about a judgment. [21:58:38] that is deprecating an old judgment and adding a new one. [21:58:51] I still don’t like "endorsements" [21:59:50] awight, hmm. Do you know what a !vote is? [22:00:00] Here’s the crux of my problem, > What is an endorsement without a judgment, if not a discussion thread [22:00:00] from https://www.mediawiki.org/wiki/Topic:Tzw0uv2bucrdprm4 [22:00:15] And endorsement doesn't exist without a judgement [22:00:18] No [22:00:34] A !vote is a common wiki pattern for discussion subjective decisions. [22:01:17] E.g. "'''support''' I've seen proposals like this work in the past" [22:01:17] or "'''oppose''' this proposal would break something else" [22:01:32] Eventually consensus happens (or not!) and it's recorded. [22:01:34] I’d prefer that we replace “endorsement” with the author just providing their own judgment. [22:01:50] awight, right but you're not discussing the common behavior here. [22:01:55] Surely from a user [22:01:56] (on another topic, I just found email notification for “structured discussions") [22:02:17] Please no more topics [22:02:28] I’m happy to learn and adapt to on-wiki behavior [22:02:56] But support/oppose discussions are in response to a proposal. 
[22:03:05] In this case, you’re saying they are in response to a judgment and comment. [22:04:28] Well, from a user's point of view, when they show up and provide the first !vote, they are creating a judgement and providing their endorsement of that judgement. [22:04:47] It would likely be one action for any user of JADE. [22:04:59] But if the judgement already exists, they'd just be endorsing the old judgement. [22:05:22] judgement might be "delete", "merge", or "keep" in a deletion discussion. [22:05:26] IMO, this only works if we separate judgments by schema, i.e. voters are endorsing only “damaging: true” and not “{damaging: true, goodfaith: true}" [22:05:47] Right. I think so too. [22:05:51] It’s not right to have a yes/no vote on something that you can respond to with multiple dimensions [22:06:20] huh? [22:06:20] That's not the problem I see there. [22:06:31] but separating the judgment is nasty [22:06:46] What does that have to do with endorsements? [22:06:55] I think that problem is orthogonal. [22:07:24] You’re making me question it, but I’ve been thinking that (judgement_schema_A, judgment_schema_B, comment) is a single decision if it happens in one session. [22:07:34] It’s all information that goes together [22:07:44] Just to the user. [22:07:57] especially the comment+judgment [22:08:12] now we’re talking about an interface in which we would have to get a new comment for each judgment-schema [22:08:16] But from a schema point of view, what would we do when we add a new schema? [22:08:23] hm? [22:09:02] Well, we could combine and just multiply the comment across the set. [22:09:06] Well, if you have a comment "looks bad" with {"damaging": true} [22:09:06] then later we add "goodfaith" [22:09:08] What then? [22:09:20] Do you rewrite that judgement to include a "null" goodfaith judgement? [22:09:20] There are lots of schemas available, we only present some, and the user is free to use all or none of the available ones [22:09:37] Right.
Or copy their summary to all of 'em like in a commons upload. [22:10:10] I wish I had a more rigorous framework for this [22:10:42] I don’t quite get what you mean bu [22:10:44] by [22:10:47] argh [22:10:51] by “add goodfaith” [22:11:02] like, the user wants to add one more piece of data to their judgment? [22:11:11] or the developers deploy a new judgment schema? [22:11:54] Hey whats the api link for querying ores from enwiki? [22:12:18] https://ores.wikimedia.org/v3/enwiki [22:12:32] K [22:12:43] awight, add goodfaith as in imagine we deploy and realize we want to add a "goodfaith" question to "edits" [22:12:46] Zppix: go to https://ores.wikimedia.org/v3/ for docs [22:13:30] nevermind "goodfaith" -- let's say we have that. And we want to add a new question called "spam". [22:13:31] halfak: oh no problem, schemas are filled in for a judgment using a join table [22:13:44] Right. But if a group has a single comment. [22:14:02] Then the old comments don't account for spam. [22:14:06] And spam would be newly added to the group. [22:14:15] I think it makes more sense that all judgements have their own comments. [22:14:23] That's what's in the current version of our schema. [22:14:24] for exactly the reason you’re describing [22:14:24] that’s a new schema version, and old judgments use the old schema [22:14:25] I think that’s important cos we need to know exactly what questions we asked. [22:14:30] e.g. if we reverse the order of questions, that’s very significant to the data [22:14:38] It's not a new schema. It's an additional schema. [22:14:55] We won't set the question order. [22:14:55] cool. [22:15:03] E.g. in huggle, they may only be concerned with "damaging" [22:15:09] lemme look at the schema code cos it was “correct” in the etherpad [22:15:11] And so users will only answer that question. [22:15:29] IMO the wiki page is the source of truth [22:15:34] +1 ok that’s fine. [22:15:41] well maybe not [22:15:52] but I see your point that we can’t dictate what tools do. 
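For reference on the ORES API pointer above, here is a minimal sketch of parsing a v3 scores response. The payload below is invented for illustration (the revision ID and probabilities are not real data); the request shape in the comment follows the v3 scores route linked in the chat.

```python
import json

# Abridged example of an ORES v3 scores response. Values are invented
# for illustration; a real request would look something like
# GET https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&revids=123456
sample = json.loads("""
{
  "enwiki": {
    "scores": {
      "123456": {
        "damaging": {
          "score": {
            "prediction": false,
            "probability": {"false": 0.93, "true": 0.07}
          }
        }
      }
    }
  }
}
""")

def get_score(response, wiki, rev_id, model):
    """Pull one model's score for one revision out of a v3 response."""
    return response[wiki]["scores"][str(rev_id)][model]["score"]

score = get_score(sample, "enwiki", 123456, "damaging")
print(score["prediction"], score["probability"]["true"])
```

The `get_score` helper is hypothetical, just to show where the prediction and probabilities sit in the nested response.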
[22:16:19] And we shouldn't! [22:16:19] I think you're thinking of the wikilabels use case. [22:16:22] Where we want the cleanest judgements we can get. [22:16:26] we need to collect the tool user-agent [22:16:32] I’ve been imagining that schemas come with a reference message and help text we use to ask the question. Probably points to an onwiki message. [22:16:47] awight, agreed re tool/source. that's not in the schema and should be. [22:17:03] +1 for each schema including a message. [22:17:38] E.g. "damaging" -- "Did the edit cause damage? Should it be reverted?" [22:18:04] And "goodfaith" -- "Does it appear that the editor who saved this edit was trying to contribute productively." [22:18:12] We have these strings in the wikilabels i18n [22:18:15] oh dang—more than we thought: name of the label, help text describing what we want to know and guidelines, and text for each label if it’s binary or multiple-choice. [22:18:29] great [22:18:29] Again in the wikilabels i18n [22:18:36] So we can copy that over [22:18:39] I might have some lag [22:18:43] What about “Yes” / “No” or “Good faith” vs “Bad faith" [22:19:05] Moving closer to the router... [22:19:08] see wikilabels? [22:19:42] (working on it) [22:19:58] yeah we need messages for each binary choice. [22:19:59] So, I'm concerned that we started talking about using Flow for comments and how we would enable curation and now we're talking about i18n for labels. [22:20:03] as part of the schema. [22:20:44] I’m not concerned :p. there’s a lot we need to tighten up [22:20:45] we have 6 or so Talk:JADE threads [22:21:29] I do not believe that messages (i18n) should be part of any schema. [22:21:51] But they should be related to the schema. [22:21:51] We should not cut a new version of the schema for every update in translation. [22:22:07] k schema_definition is a blob so we’re good there. [22:22:16] noo it would be message keys. 
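The i18n point above, schemas carrying message keys rather than translated strings so translations can change without cutting a new schema version, could look roughly like this sketch. All key names, message text, and the toy message store are invented for illustration:

```python
# Hypothetical judgment schema definition that stores i18n message
# *keys* (not translated strings). Every identifier here is invented.
DAMAGING_SCHEMA = {
    "name": "damaging",
    "type": "boolean",
    # Keys resolved against the wiki's message store at render time:
    "label_key": "jade-damaging-label",
    "help_key": "jade-damaging-help",
    "option_keys": {True: "jade-damaging-yes", False: "jade-damaging-no"},
}

# Toy in-memory stand-in for the real i18n layer.
MESSAGES = {
    "en": {
        "jade-damaging-label": "Damaging",
        "jade-damaging-help": "Did the edit cause damage? Should it be reverted?",
        "jade-damaging-yes": "Yes",
        "jade-damaging-no": "No",
    },
}

def render_question(schema, lang="en"):
    """Resolve a schema's message keys into display strings."""
    msgs = MESSAGES[lang]
    return msgs[schema["label_key"]], msgs[schema["help_key"]]

label, help_text = render_question(DAMAGING_SCHEMA)
```

Because only the keys live in the schema, updating a translation (or adding a language) never touches `schema_definition`, which matches the "noo it would be message keys" conclusion below.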
[22:22:21] Right [22:22:25] agreed [22:22:40] We also need to collect what language the user’s interface was in [22:22:46] in case there are problems with the i18n [22:23:29] okay, judgment_score is already a join table as I was hoping [22:23:29] so we’re free to add or update schemas [22:24:04] I’m noting stuff on our sync etherpad for lack of a more obvious place [22:24:12] judgement_score? [22:24:43] How about the wiki page and discussions on /Implementation [22:24:55] https://github.com/wiki-ai/jade/blob/master/schema.sql#L107 [22:25:01] k [22:25:35] You guys dont mind if i do a quick restart of icinga2-wm do you? [22:43:26] awight: you missed your chance the netsplit already started :P [22:44:33] halfak: Back to the initial topic… Is it unwiki of us to say that the author of a judgment owns it and other people can’t edit that? [22:44:57] Do we have to let admins edit rather than just suppress? [22:45:02] awight, well I think a judgement can only change by being set through consensus. [22:45:20] I think just suppress. [22:45:21] I don't think a judgement should have an author [22:45:30] But there is the person who first endorsed it. [22:45:40] you mean the rank? [22:45:53] I'm not sure rank makes sense here. but it could [22:46:09] ooh I can edit your Flow comments, for an example of extreme wikiness. [22:46:10] It's rank in Wikidata but boolean everywhere else. [22:46:35] awight, right. I don't think that makes sense for comments. [22:46:51] ew, Flow :/ [22:46:57] I was planning on rank having the same meaning as in Wikidata [22:46:59] when you change your judgment, the old one is rank=deprecated. [22:47:02] when consensus elevates a judgment to subjective truth, it has rank=preferred [22:47:03] oh [22:47:05] +1 [22:47:23] * awight squints at TheresNoTime [22:47:29] * halfak does too [22:47:50] big fans? [22:48:07] Nope. I do like discussion systems that anyone can figure out. Ones that look like the rest of the internet.
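The Wikidata-style rank semantics agreed above (changing your mind deprecates the old judgment and adds a new one; consensus can promote one to preferred) could be modeled roughly like this. Class, field, and rank names are invented for illustration:

```python
from dataclasses import dataclass, field

# Hedged sketch of the rank lifecycle discussed above. Nothing here is
# the real JADE schema; it only demonstrates deprecate-and-replace.
@dataclass
class Judgment:
    data: dict                 # e.g. {"damaging": True}
    rank: str = "normal"       # "preferred" | "normal" | "deprecated"

@dataclass
class JudgmentSet:
    judgments: list = field(default_factory=list)

    def revise(self, new_data):
        """A user changes their mind: deprecate the live judgment,
        then append a replacement (old rows are kept, never edited)."""
        for j in self.judgments:
            if j.rank == "normal":
                j.rank = "deprecated"
        self.judgments.append(Judgment(new_data))

    def current(self):
        """Everything not deprecated, i.e. what curation acts on."""
        return [j for j in self.judgments if j.rank != "deprecated"]

js = JudgmentSet()
js.revise({"damaging": True})
js.revise({"damaging": False})   # changed their mind
```

Keeping deprecated rows around (rather than editing in place) is what makes the "author can't really edit, only supersede" position workable alongside recentchanges-style curation.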
[22:48:17] However, I don't think it's a great idea to convert enwiki. [22:48:24] \o/ [22:48:33] But for new spaces, it can certainly make sense to just have a decent discussion system. [22:48:37] E.g. in JADE :) [22:48:59] then again, I'm still very much wikitext editor over VE. I imagine Flow appeals to the VE folks [22:49:02] haha, halfak found TheresNoTime’s compromise button [22:49:10] I'm a wikitext guy too. [22:49:15] :) [22:49:22] I love VE except when it hangs and erases an hour of my work, which I have to recover using screenshots. [22:49:42] and to those who like VE, well, no one is perfect :-) [22:49:43] Though sometimes (sometimes) I'll use VE to edit a large table. [22:49:43] Otherwise, I just want wikitext :) [22:52:02] Oh, also, not sure if you both saw my words of thanks the other day - the awesomeness which is ORES helped me create https://en.wikipedia.org/wiki/User:There%27sNoTime/AfC_very_old_draft_scores [22:52:39] Which will hopefully go towards finding quick-decline drafts, and free up the enwiki AfC backlog \o/ [22:52:53] Hey all -- currently working on this and awight and I are discussing whether it should be moved over to MediaWiki: https://wikitech.wikimedia.org/wiki/ORES/New_model_checklist#Step_9:_Deploy_the_new_model [22:53:19] TheresNoTime, I did see that. I didn't have much time to click through. Does it look like the draftquality model is proving useful? [22:53:29] https://wikitech.wikimedia.org/wiki/ORES/New_model_checklist (the whole page actually) :-) [22:53:57] srrodlund, I think it's wikitech material. [22:54:29] Seems wikimedia specific to me. [22:54:57] Then again, I don't think it would be weird to have this at MediaWiki.org [22:54:57] halfak: mostly, some are less "spam" and more "generally not notable", but so far AfC reviewers agree that a prediction of "spam" normally means they would decline the draft [22:55:14] Oh good. That's useful to know TheresNoTime [22:55:24] It's trained based on CSD deletion decisions. 
[22:55:54] So maybe people flag things "Spam" for non-notability often. [22:55:54] Trained on G11s for "spam" iirc? [22:56:03] ^ was just typing that :D [22:56:53] But on the whole, useful, and likely to become a bot [22:56:58] I'd love to see a short summary of what you and your collaborators are seeing with the model's predictions somewhere so that I can reference it when making improvements. [22:57:19] And maybe when advocating for more resources. [22:59:06] Definitely will do! [23:21:18] Arg. I still haven't made it to look at codezee's code. I'll do that tomorrow. [23:21:44] Have a good night or whatever is going on in your timezone, folks! [23:21:44] o/