[09:03:58] Scoring-platform-team, ORES, editquality-modeling, Collaboration-Team-Triage (Collab-Team-This-Quarter), and 4 others: Enable ORES filters for svwiki - https://phabricator.wikimedia.org/T174560#3962379 (Sebastian_Berlin-WMSE)
[10:04:39] Scoring-platform-team, ORES, Operations: Clean up redundant ORES celery_workers defaults - https://phabricator.wikimedia.org/T186734#3962586 (fgiunchedi) p:Triage>Low
[12:45:48] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3963176 (akosiaris) >>! In T171851#3959677, @Halfak wrote: > I found this in our deploy repo. > > {P6677} > > Not sure what is going on as th...
[13:35:30] Scoring-platform-team (Current), ORES, Patch-For-Review: Rebuild ORES wheels on Stretch - https://phabricator.wikimedia.org/T184135#3963324 (akosiaris) >>! In T184135#3952838, @awight wrote: > @akosiaris Our Stretch patches are available on the `stretch_conversion` branch. That includes Python 3.5 w...
[14:33:30] Fawlty wifi day
[14:33:30] I’m staring at myself in Hangouts
[14:36:28] It’s strange, websites work and hangouts sees you
[14:36:38] halfak: fyi ^
[14:37:10] Weird. I see you joining and leaving, but no audio
[14:42:02] (PS1) Zoranzoki21: Removed deprecated position statements from resource loader module [extensions/ORES] - https://gerrit.wikimedia.org/r/409900 (https://phabricator.wikimedia.org/T184257)
[14:46:52] (CR) Jayprakash12345: [C: 1] Removed deprecated position statements from resource loader module [extensions/ORES] - https://gerrit.wikimedia.org/r/409900 (https://phabricator.wikimedia.org/T184257) (owner: Zoranzoki21)
[15:55:03] halfak: whoa, Phabricator won’t open for me:
[15:55:04] Too many tasks (250).
[15:55:34] maybe it’s just a network thing
[16:04:25] yep.
[16:34:25] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3964104 (awight) @akosiaris Do you know whether that group_size change will apply to rollback as well? We want the rollback to be as fast as po...
[16:39:48] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3964154 (akosiaris) Yes it will. Which is why I am experimenting already with `fetch_batch_size` [1] [1] https://github.com/wikimedia/scap/blob...
[16:44:41] (Draft2) Alexandros Kosiaris: Remove the cluster server group and related stuff Add a new scb server group, to be removed soon Remove the ores-worker server group from production Move the production configs under [wmnet] stanza Add checks for the default group Pass --no-download to vi [services/ores/deploy] - https://gerrit.wikimedia.org/r/409932 (https://phabricator.wikimedia.org/T171851)
[16:48:38] akosiaris: Just wondering, I have some time today to help with the ores* deployment. Are you actively messing around with that? Anything I can do without being in your way?
[16:49:36] awight: An answer to https://phabricator.wikimedia.org/T184135#3963324 would be greatly appreciated
[16:49:49] on it
[16:49:49] even if it's just a "go ahead with master"
[16:50:21] I think your analysis is correct; also, I’d like to leave master untouched until we’re fully migrated.
[16:50:23] I’m scared.
[16:50:34] Lemme prepare the missing patch.
[16:51:55] ok thanks
[16:52:22] awight, looks like wheels needs a bump in the stretch_conversion branch
[16:52:25] Doing that now.
[16:53:02] (PS3) Halfak: Bumps models and requirements for conversion to Debian Stretch. [services/ores/deploy] - https://gerrit.wikimedia.org/r/404886 (https://phabricator.wikimedia.org/T182799)
[16:53:17] Was just sitting there waiting for the bump :|
[16:53:36] Make sure to do "git submodule update --init" when testing out the deployment
[16:53:44] WAIT!
[16:53:48] DNM
[16:53:53] That needs to be on the new branch
[16:53:59] Uh?
[16:54:00] It is
[16:54:13] no, that’s the ”topic”
[16:54:17] it’ll merge to master.
[16:54:18] It's part of stretch_conversion
[16:54:26] I’m picking over now
[16:54:37] * halfak is not planning to merge it.
[16:54:41] grr, we have no branch yet, one moment
[16:54:42] :) tty
[16:54:46] *ty
[17:00:19] This is odd. The branch stretch_conversion already has that commit.
[17:01:17] huh. it’s a different version of the same patch.
[17:03:18] (Abandoned) Awight: Bumps models and requirements for conversion to Debian Stretch. [services/ores/deploy] - https://gerrit.wikimedia.org/r/404886 (https://phabricator.wikimedia.org/T182799) (owner: Halfak)
[17:03:33] (PS1) Awight: Update wheels for Stretch [services/ores/deploy] (stretch_conversion) - https://gerrit.wikimedia.org/r/409950
[17:04:00] halfak: if you have a second to merge? ^
[17:04:31] (CR) Halfak: [V: 2 C: 2] Update wheels for Stretch [services/ores/deploy] (stretch_conversion) - https://gerrit.wikimedia.org/r/409950 (owner: Awight)
[17:04:37] akosiaris: ^ :)
[17:05:02] ah nice
[17:05:19] Sorry to slow down your work by bluffing earlier ;-)
[17:05:23] no worries
[17:05:40] let me see if this will work
[17:05:45] doing a test deploy on ores1001 now
[17:08:43] (PS3) Alexandros Kosiaris: Remove the cluster server group and related stuff [services/ores/deploy] - https://gerrit.wikimedia.org/r/409932 (https://phabricator.wikimedia.org/T171851)
[17:09:35] K I’ll space out on other stuff but my highest priority this week is to support the migration.
[17:10:21] btw, the "Timeout, server ores100X.eqiad.wmnet not responding" messages are totally misleading
[17:10:47] what's happening after all in the background is that tin reaches max children and refuses git clone, git fetch etc
[17:11:39] I am not sure why those totally misleading messages get printed though
[17:11:59] Cool, that’s vaguely what I had remembered.
[17:21:04] wooohooo
[17:21:06] success
[17:21:09] awight: halfak ^
[17:21:30] ok ores1001 is now having both uwsgi-ores and celery-ores-worker working fine
[17:21:54] it's not currently receiving requests but it would be scoring stuff already
[17:23:28] right on!
[17:23:48] I can run a wee stress test, if this is a good time?
[17:24:32] yeah sure
[17:24:42] remember it's only ores1001 currently up and running
[17:25:03] I am starting deploys on all the others, slowly trying to figure out the optimal value for fetch_batch_size
[17:25:05] ah okay well, I’ll just make internal queries.
[17:27:15] akosiaris: Responses are healthy.
[17:27:26] :)
[17:59:43] halfak: https://etherpad.wikimedia.org/p/JADE_extension_implementation
[18:01:26] all ores100* boxes are now successfully deployed
[18:01:40] fetch_batch_size does what you'd expect
[18:02:15] takes quite a bit of time to fetch the git objects but at least the promote and check stages proceed way faster
[18:02:52] but overall a deploy takes around 13m with a concurrency value of 1
[18:03:09] I think I can bump this to 3 relatively harmlessly
[18:06:42] akosiaris: Thanks for the incremental fix, I appreciate that we aren’t blocked on something that might not be corrected :D
[18:07:12] And the new variable means that we have parallel rollback?
[18:07:29] it should
[18:07:37] note the *should*
[18:07:46] I guess some testing will verify that
[18:13:44] scores have dropped in the graphs in https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&orgId=1
[18:14:06] expected .. since the new ores clusters are scoring stuff
[18:16:24] That’s great news. Should we be so bold as to pool temporarily?
[18:22:01] we can
[18:22:22] let's pool a couple of hosts and see what's happening
[18:22:26] halfak ^
[18:23:00] In meeting. Looks exciting. Can't really make judgments about what is reasonable right now, but I don't see any concerns with giving it a quick test.
[18:23:18] just pooled ores1001
[18:23:59] Great, I’ll watch graphs and what logs I have access to.
[18:24:57] akosiaris: umm… jfyi it seems that all 18 hosts have been scoring since 17:35
[18:25:29] I think the celery config is pointing to eqiad’s Redis?
[18:25:41] err eqiad == scb1* sorry
[18:25:49] No harm done, it seems.
[18:26:37] no, ores200* config points to codfw celery
[18:26:40] not just eqiad
[18:26:41] yep
[18:26:45] 'w
[18:26:58] Well, that means the machines have been aggressively scoring :)
[18:28:09] I am trying to see if fetch_batch_size=3 actually changes anything now
[18:28:14] in codfw only that is
[18:33:45] akosiaris: Note for later, I think we can increase the celery worker count quite a bit. Those machines are sitting with 36GB of free memory and <5% CPU usage.
[18:33:53] I forget why we chose such conservative numbers
[18:34:22] ah, in part this is because of my puppet change last week which dialed down worker count on existing machines due to OOM.
[18:34:39] yup
[18:34:44] and agreed ofc
[18:35:07] (duration: 06m 47s) for a deploy with fetch_batch_size: 3
[18:36:12] Cool, that’s livable.
[18:36:15] ok doing one final round with a dummy comment change in scap/scap.cfg and then a rollback test
[18:36:19] TY!
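Note on the fetch_batch_size discussion above: the setting appears to control how many hosts run the fetch stage at once, so raising it from 1 to 3 cuts the number of sequential fetch rounds. A minimal sketch of that batching idea follows. It is not scap's actual implementation; the helper name `split_into_batches` and the nine-host list are assumptions made for illustration (the log only confirms ores1001 and a total of 18 hosts across both clusters).

```python
def split_into_batches(hosts, batch_size):
    """Group hosts into consecutive batches of at most batch_size.

    Hypothetical helper: it only illustrates the idea of fetching in
    batches and is not taken from scap's source.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Illustrative host list (assumed): ores1001 through ores1009.
hosts = [f"ores10{n:02d}" for n in range(1, 10)]

serial_rounds = split_into_batches(hosts, 1)   # one host per fetch round
batched_rounds = split_into_batches(hosts, 3)  # three hosts per fetch round

print(len(serial_rounds), "rounds with batch size 1")
print(len(batched_rounds), "rounds with batch size 3")
```

Fewer rounds does not translate linearly into wall-clock time, since the deploy host still has to serve each concurrent git fetch; that is consistent with the measured drop from about 13m to 06m 47s rather than a threefold speedup.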
[18:39:13] (CR) WMDE-Fisch: [C: 2] Removed deprecated position statements from resource loader module [extensions/ORES] - https://gerrit.wikimedia.org/r/409900 (https://phabricator.wikimedia.org/T184257) (owner: Zoranzoki21)
[18:44:12] (Merged) jenkins-bot: Removed deprecated position statements from resource loader module [extensions/ORES] - https://gerrit.wikimedia.org/r/409900 (https://phabricator.wikimedia.org/T184257) (owner: Zoranzoki21)
[18:45:10] (CR) jenkins-bot: Removed deprecated position statements from resource loader module [extensions/ORES] - https://gerrit.wikimedia.org/r/409900 (https://phabricator.wikimedia.org/T184257) (owner: Zoranzoki21)
[18:50:05] 18:48:08 Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 12m 14s) for all 18 boxes
[18:50:05] T171851: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851
[18:50:28] rolling back now
[18:50:49] * awight wrings hands
[18:58:19] didn't work
[18:58:25] it actually failed ...
[18:58:29] what on earth...
[18:59:49] Glad this was only a test!
[19:01:06] so... fetch on an already fetched revision failed
[19:01:22] /o\
[19:03:12] * akosiaris goes for dinner and bbl to debug this more
[19:04:38] haarrr
[19:04:53] Maybe this has to do with the new variable...
[19:05:01] * awight relocates too
[20:07:30] halfak: These are both ready for re-review, https://github.com/adamwight/mw-ext-JADE/pulls
[20:23:49] halfak: https://snag.gy/oRfhTj.jpg
[20:23:57] Validation is solid.
[20:24:06] OMG
[20:24:08] Nice
[20:24:58] Just got out of mega meeting.
[20:25:03] I'm gonna eat a mega lunch
[20:25:07] And then go to more mega meetings
[20:25:54] Sorry awight. Will need to review tomorrow.
[20:26:09] yup see you in a minute
[20:26:20] Should be AFK for 35 mins now.
[20:59:29] o/
[20:59:39] Got back just in time
[21:21:19] yay! Welcome back ChanServ
[21:31:08] halfak: Tomorrow is great, but I wanted to point you to this: https://etherpad.wikimedia.org/p/JADE_extension_implementation
[22:55:31] OMG MEETINGS.
[22:55:37] Just finished up.
[22:55:41] * halfak clicks on etherpad
[23:00:50] awight, I like it. The order makes sense.
[23:01:05] Cool, thanks for taking a look.