[12:50:22] akosiaris: o/
[12:55:59] o/
[13:05:08] akosiaris: Can we start the process of deployment for ores? what else is needed?
[13:05:12] halfak: o/
[13:05:39] I just logged in securely in bastian3001
[13:12:27] o/
[13:12:36] In meeting will be back for updates in ~50 minutes :)
[13:19:28] Amir1: just merging https://gerrit.wikimedia.org/r/280403 and https://gerrit.wikimedia.org/r/291751
[13:21:37] \خ/
[13:21:41] sorry, \o/
[13:33:50] 06Revision-Scoring-As-A-Service, 10ORES, 07Puppet: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2340955 (10Ladsgroup) Got a workaround with adding deployment-tin.deployment-prep.eqiad.wmflabs' IP as tin.eqiad.wmnet in /etc/hosts
[14:12:22] Back!
[14:13:40] hey halfak, what software did you use to create the diagrams at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores?
[14:20:15] Google drawing
[14:20:21] :)
[14:20:27] schana, ^
[14:20:39] BRB dog fetch time
[14:45:08] 06Revision-Scoring-As-A-Service, 10Wikilabels: Write post-mortem for wikilabels downtime (2016-05-29) - https://phabricator.wikimedia.org/T136523#2341055 (10Halfak) Can you use the incident reporting template on Wikitech. See https://wikitech.wikimedia.org/wiki/Incident_documentation and example https://wikit...
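[Editor's note: the /etc/hosts workaround Ladsgroup describes in T136488 amounts to one extra hosts entry on the staging instance, so that puppet/scap code hardcoding the production deploy host resolves to the labs deployment server instead. A sketch of the idea; the IP below is a placeholder, not the instance's real address:]

```
# /etc/hosts on the ORES staging instance
# 10.68.x.x stands in for deployment-tin.deployment-prep.eqiad.wmflabs' actual IP
10.68.x.x    tin.eqiad.wmnet
```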
[14:45:23] Amir1, https://phabricator.wikimedia.org/T136523#2341055
[14:46:17] sure
[15:22:57] Amir1: one last question before merging https://gerrit.wikimedia.org/r/#/c/280403/
[15:23:16] I see redis_host defaulting to 127.0.0.1 but I don't see it overrideable in any hiera for ores
[15:23:32] er, overridden*
[15:23:36] akosiaris: it's overridden in beta setup
[15:23:38] let me check
[15:24:05] https://wikitech.wikimedia.org/wiki/Hiera:Ores-staging
[15:24:15] https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep
[15:24:44] "ores::scapdeploy::redis_host": deployment-ores-redis.deployment-prep.eqiad.wmflabs
[15:24:53] oh, we need to change this
[15:24:58] ah, it's the mess with half of deployment-prep's hiera being in ops/puppet and half not
[15:25:00] ok
[15:25:26] ok, lemme add a hiera yaml file for that to override it for production as well
[15:26:35] thanks akosiaris
[15:35:36] 06Revision-Scoring-As-A-Service, 10Wikilabels: Write post-mortem for wikilabels downtime (2016-05-29) - https://phabricator.wikimedia.org/T136523#2341293 (10Ladsgroup) @halfak what do you think of this? https://wikitech.wikimedia.org/wiki/Incident_documentation/20160531-wikilabels
[15:41:42] Amir1: merged https://gerrit.wikimedia.org/r/#/c/280403/. /srv/ores/config/config/99-main.yaml got generated on ores-web-0X
[15:42:13] will it cause any problems ? or does the 99 mean it's lesser priority so the previous ones are more significant ?
[15:42:18] Yes
[15:42:20] or vice versa ?
[15:42:21] That's backwards
[15:42:34] so > number, greater priority
[15:42:40] 00-main.yaml gets read first and overwritten
[15:42:51] Yeah. Reading happens in alphabetical order.
[15:44:05] ok. good thing uwsgi did not get restarted then. lemme see what needs fixing then.
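[Editor's note: the precedence rule halfak describes — config files read in alphabetical order, later files overwriting earlier ones, so 99-main.yaml wins over 00-main.yaml — can be sketched with a recursive dict merge. This is an illustrative sketch, not the actual ORES config loader; the file names mirror the conversation, and the oresrdb hostname is an example value:]

```python
def deep_update(base, overrides):
    """Recursively merge `overrides` into `base`; on conflicts, overrides win."""
    out = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_update(out[key], value)
        else:
            out[key] = value
    return out

# Stand-ins for parsed YAML files. Files are applied in sorted
# (alphabetical) order, so 99-main.yaml is applied last and its
# values override 00-main.yaml.
files = {
    "00-main.yaml": {"ores": {"redis": {"host": "127.0.0.1"}, "port": 8080}},
    "99-main.yaml": {"ores": {"redis": {"host": "oresrdb.svc.eqiad.wmnet"}}},
}

merged = {}
for name in sorted(files):
    merged = deep_update(merged, files[name])

print(merged["ores"]["redis"]["host"])  # -> oresrdb.svc.eqiad.wmnet (99 wins)
print(merged["ores"]["port"])           # -> 8080 (untouched key survives from 00)
```

So "greater number, greater priority" follows directly from plain alphabetical ordering of the file names.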
I am thinking only the redis host
[15:50:34] ok fixed
[15:50:59] I 'll restart ores-web-03 to make sure everything works as expected
[15:55:49] everything seems to be fine
[15:57:27] * halfak checks https://grafana.wikimedia.org/dashboard/db/ores
[15:57:48] Yeah... can see the sudden dip in precaching and then recover.
[15:58:22] Amir1, "Fix access issue with halfak's subpage."?
[15:59:59] Amir1: halfak, port 8080 is already allocated in production. I 'll bump it to 8081
[16:00:24] OK. makes sense.
[16:01:17] akosiaris: \o/
[16:36:50] Amir1, lost you. Seems like the connection dropped.
[16:37:21] in one sec
[16:37:21] trying to reconnect
[16:59:03] 06Revision-Scoring-As-A-Service, 10ORES: [spike] Find out if we can still get health check warnings after lb rebalance - https://phabricator.wikimedia.org/T134782#2341630 (10Halfak) +1 for something like /nginx_status
[17:00:34] BRB dog tiem
[17:00:36] halfak: it's 10 PM here
[17:00:41] :)
[17:00:42] got to go
[17:00:57] be back in a few minutes
[17:15:19] Back!
[17:22:04] I'm back now
[17:26:19] akosiaris: it's awesome. Do you want to try deploying ores? I think we are ready (or we need something else and I'm wrong)
[17:26:46] I'm not talking about LVS, varnish, etc. just the ores itself
[18:04:48] o/ YuviPanda
[18:04:57] Was hoping to get some thoughts from you about deployment systems
[18:05:03] hello halfak
[18:05:07] fun topic :)
[18:05:19] We're considering continuing to use Fabric in our WMFlabs environments and use scap with WMF production deployments
[18:05:36] Given that Scap is way more complicated and totally unsupported in labs
[18:06:03] I agree
[18:06:07] that seems like the thing to do
[18:06:12] but this is part of a larger question of
[18:06:13] We are getting a little push-back on trying to keep things configurable in this way. See https://gerrit.wikimedia.org/r/#/c/291527/
[18:06:43] halfak: was that actual pushback recorded elsewhere or just the question?
[18:06:52] E.g.
not associating uwsgi services in general too strongly with scap deployment
[18:06:56] halfak: I think the bigger question is what is the labs cluster going to be used for once the production one is?
[18:07:04] is available, that is
[18:07:04] YuviPanda: we talked in here
[18:07:24] on the day of the question
[18:07:29] let me find the log
[18:07:32] YuviPanda, experimental models mostly. For the changes that we want to experiment with that will need a compute cluster and real users.
[18:07:46] ^ WMFlabs continued use
[18:07:56] halfak: right, take a wild guess at % usage split?
[18:08:08] halfak: also why can the experimental models not be deployed in prod?
[18:08:14] what's preventing that from happening?
[18:08:16] Probably 90/10 after we get people to convert to using ores.wikimedia.org
[18:08:30] YuviPanda, they will likely have different performance characteristics
[18:08:38] I want to keep ores.wmflabs.org pretty close to HEAD
[18:08:44] and ores.wikimedia.org pretty far from HEAD.
[18:08:51] right
[18:09:02] so you're going to use it as some sort of staging environment as well
[18:09:10] Kinda yea. Pre-staging though
[18:09:19] hmm
[18:09:28] would you deploy unmerged patches to ores/revscoring in this setup?
[18:09:30] to test?
[18:09:37] I see ores.wikimedia.org as critical infra that would be very bad were we to break something.
[18:09:43] And I want the flexibility to sometimes break things.
[18:09:50] right, but there's also ores in beta cluster
[18:09:52] YuviPanda, na.
[18:10:07] YuviPanda, fair point. Can we arbitrarily stand up new workers in beta?
[18:10:10] so what's the difference between ores in beta cluster and ores.wmflabs.org?
[18:10:12] halfak: yes
[18:10:24] not sure there needs to be a difference.
[18:10:34] I don't want to give up the domain, but I suppose that is solvable.
[18:10:54] we can set up an HTTP redirect, or if you feel strongly enough about it move the domain as well
[18:10:58] I guess I want a wmflabs staging server for testing deployments before they go out to the endpoint that people will use.
[18:11:09] right, and if .wikimedia.org is the endpoint people would use
[18:11:13] you can use beta cluster for staging
[18:11:15] I don't want to have tools hitting beta expecting it to work
[18:11:25] But I'd like tools hitting ores.wmflabs.org expecting it to work.
[18:11:26] YuviPanda: beta setup is smaller, sits with some other services in sca-01 so if we break sca-01 other things break too (citoid, mathoid)
[18:11:29] which is the same thing other people do
[18:12:08] So, in this situation, I would want staging to exist before ores.wmflabs.org deployments
[18:12:18] So when I imagine beta, that doesn't make sense.
[18:12:21] so in general I think most services are auto deployed from master
[18:12:38] Master of a configuration repo sounds fine to me.
[18:12:42] err
[18:12:45] I mean, in beta cluster
[18:12:49] I'd have separate configuration repos for wikimedia.org and wmflabs.org
[18:12:49] they are auto deployed from master
[18:13:04] e.g. wiki-ai/ores-wmflabs-config
[18:13:06] gimme a sec, let me gather my thoughts since I seem to have a lot on this topic
[18:13:08] YuviPanda: not yet. let me find the phab card
[18:13:15] kk thanks YuviPanda :)
[18:13:19] https://phabricator.wikimedia.org/T131857
[18:15:02] halfak: Amir1 I've decided to write this down collaboratively instead, since I think that's a better medium than chat
[18:15:04] https://etherpad.wikimedia.org/p/environments-ores
[18:15:06] please join me
[18:15:20] of course
[18:15:21] thanks
[18:15:25] * halfak joins
[18:16:51] Amir1: halfak can you name yourself on the etherpad?
[18:17:19] YuviPanda: I'm Batman
[18:17:28] I usually use it in etherpad
[18:18:54] halfak: I added a qualifier to 'who gets paged'
[18:20:15] +1 I think this will be a requirement for wmflabs ores. Want to deploy? Set up for paging.
[18:20:22] halfak: I added a question section under the wmflabs ores
[18:27:30] halfak: I'm thinking some more right now.
[18:29:18] halfak: Amir1 I wrote more things
[18:30:39] halfak: Amir1 let me know what you think of that proposal
[18:30:58] * Amir1 is reading
[18:37:25] Amir1: halfak I wrote a proposal at the end
[18:42:24] halfak: Amir1 I think we've sorted out the question of environments now, and now we can come back to talking about deployment systems more clearly I think
[18:42:39] let me know when you guys are done with that etherpad (I am now) and we can resume discussion here
[18:43:53] All done.
[18:44:00] This is exciting :)
[18:44:10] halfak: :)
[18:44:15] halfak: so now wrt deployment systems
[18:44:22] thanks YuviPanda for putting this up
[18:44:31] halfak: scap3 is the obvious choice for wmf prod and beta cluster staging
[18:44:36] +1
[18:44:44] No choice there. happy to conform.
[18:44:52] halfak: and in an ideal world the choice for experimental as well, but since we do not live in an ideal world we'll use fabric
[18:44:55] :D
[18:45:00] and that's ok since it isn't meant to match prod
[18:45:05] or be used for staging
[18:45:08] but is totally experimental
[18:45:12] and tracks a totally different repo
[18:45:13] * halfak just wishes that scap3 was better scoped
[18:45:21] * YuviPanda pulls halfak out of rabbit hole
[18:47:15] a stupid question: was Monday a holiday?
IIRC it was Memorial Day
[18:47:20] Amir1: in the US it was
[18:47:30] okay
[18:47:45] halfak: Amir1 so we have to keep fabric working as is until we announce and ask people to switch to ores.wikimedia.org, which can take a while
[18:48:08] YuviPanda: the thing is that puppet configs are going in a direction that leaves no choice except scap3
[18:48:30] halfak: Amir1 and then retool to make it work on the experimental cluster (I'd highly appreciate it if we call it experimental rather than 'labs' to avoid confusion, given how overloaded that term is)
[18:48:45] Amir1: I think with this etherpad we can convince akosiaris :) can you also provide me logs of your conversation?
[18:49:06] YuviPanda: yup, where should I paste it?
[18:49:28] Amir1: anywhere is fine :)
[18:49:41] https://www.irccloud.com/pastebin/nReJITRd/
[18:49:49] the easiest way
[18:49:50] :D
[18:50:01] yeah
[18:50:39] Amir1: halfak right, so I think what akosiaris said made sense if ores.wmflabs.org was going to 'go away' afterwards, but given that you want to use it for experimental setups we can refactor our puppet to be good enough to work without it I think
[18:51:12] yeah
[18:52:01] thanks for listening
[18:52:05] Amir1: halfak worst case, we can have a separate role that doesn't include the scap target for experimental and refactor common bits to a common role. This is fine since it isn't going to be used as staging
[18:52:29] but that's a different story than *right now* where we have to make sure fabric can still deploy
[18:52:43] actually that was my proposal to halfak earlier today
[18:52:44] unless you are ok with a deployment freeze
[18:52:52] until we move everything to wikimedia.org
[18:52:55] are you?
[18:53:18] YuviPanda, probably not given the current pace
[18:53:34] halfak: yeah, I agree.
[18:53:45] Amir1: what was breaking fabric deploys?
[18:54:05] the scap::target invoke in service::uwsgi
[18:54:05] was the scap::target just breaking puppet?
[18:54:14] the latter
[18:54:17] right
[18:54:39] * YuviPanda ponders
[18:56:18] halfak: Amir1 here's my proposal then:
[18:56:28] a transitionary 'use_scap' flag that we can disable in ores.wmflabs.org
[18:56:34] for service::uwsgi
[18:56:44] it is transitionary since we'll kill it after people move to prod
[18:56:51] * halfak needs to run away.
[18:56:52] and then we'll just use a different role for experimental
[18:57:03] Amir1: if you can open a bug about this I'll comment there
[18:57:03] Sorry. Will be back in a couple of hours.
[18:57:06] o/
[18:57:10] halfak: np! I'll work with Amir1 and figure this out
[18:57:10] o/
[18:57:16] YuviPanda: yeah sure
[18:57:19] I had a phab card
[18:57:34] YuviPanda: https://phabricator.wikimedia.org/T136488
[18:58:10] so if I get you right, we need to do it in service::uwsgi?
[18:59:00] Amir1: yup, I'll make a patch
[18:59:10] awesome
[18:59:11] thanks
[19:03:38] 06Revision-Scoring-As-A-Service, 10ORES, 07Puppet: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2336803 (10yuvipanda) List of reasons why this is a problem: 1. Setting up a standalone scap3 master in a project that is not deployment-p...
[19:03:45] Amir1: ^ does this sound ok to you?
[19:04:06] * Amir1 looks
[19:06:54] 06Revision-Scoring-As-A-Service, 10ORES, 07Puppet: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2342181 (10Ladsgroup) {meme, src="southparkfan-approves", below=Great!}
[19:07:22] I don't know who made that meme, it's funny
[19:08:48] (edited the comment)
[19:08:59] YuviPanda: It is great reasoning.
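[Editor's note: the `use_scap` flag YuviPanda proposes would presumably take roughly the following shape inside `service::uwsgi` — a parameter that gates the `scap::target` resource so fabric-managed installs like ores.wmflabs.org can opt out via hiera. This is a hypothetical sketch of the idea, not the actual patch (which, per T136488#2342223, took a slightly different approach); the parameter list is invented for illustration:]

```puppet
# Hypothetical sketch only -- not the merged change.
define service::uwsgi (
    $port,
    $use_scap = true,  # transitionary: set false on the experimental cluster
) {
    if $use_scap {
        # Only prod/beta targets get the scap3 wiring that was breaking
        # puppet runs on ORES staging.
        scap::target { "ores/${title}":
        }
    }
    # ... uwsgi app config and service resources unchanged ...
}
```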
Thanks
[19:12:01] Amir1: I uploaded the patch
[19:12:17] 06Revision-Scoring-As-A-Service, 10ORES, 13Patch-For-Review, 07Puppet: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2342223 (10yuvipanda) Patch takes slightly different approach, but same thing.
[19:13:21] Thanks :)
[19:13:59] Amir1: yw! I'll wait for akosiaris to chime in before merging it
[19:14:40] yeah, probably tomorrow noon (my time)
[19:14:46] in 12 hours
[19:15:14] Amir1: :) ok! are things stable till then? can you still deploy? is puppet working?
[19:15:35] YuviPanda: yeah, we made some workarounds for now
[19:15:47] kkk
[19:15:50] so it's probably working, though not very robust
[19:16:30] * YuviPanda nods
[19:16:43] Amir1: I'll ping akosiaris tonight and see if I can sort it out before you come by
[19:16:52] let's take this off halfak's plate :D
[19:16:58] awesome
[19:17:08] yeah, he needs his plate
[19:17:35] Amir1: what did you think of us switching from IRC to the etherpad to hash that one out?
[19:18:11] it was great, mostly it lets us keep track of things
[19:18:32] Amir1: yeah, and also was more appropriate to what we were talking about I thought
[19:18:36] so later when akosiaris comes, he can read our reasoning. But reading chat logs is not nice
[19:18:42] yeah
[19:19:10] Amir1: am going to go afk now
[19:19:17] me too
[19:19:29] thanks for the help YuviPanda
[19:19:29] Amir1: have a good night/day!
[19:19:32] yw Amir1
[19:19:37] you too
[19:19:40] o/
[21:44:38] o/ Amir1
[21:44:45] Have a minute to review https://etherpad.wikimedia.org/p/ores_weekly_update ?
[21:45:31] Also I'd like to set up a goal for Fiscal year 2017, quarter 1. (AKA July, August and September)
[21:45:50] halfak: as an update on the fabric / scap situation, we have it sorted. I've a patch that'll make scap3 transitionally optional, and I've a transition plan for experimental in https://phabricator.wikimedia.org/T136488.
I'll get that sorted / merged with akosiaris tonight when he's up
[21:46:11] I think that we should target (1) a deployment of the ORES extension to N wikis and (2) a new communications push about ORES and the availability of the extension.
[21:46:17] (where N is some large number)
[21:46:32] <3 YuviPanda
[21:46:40] Thanks for your help (and push-back) with this.
[21:47:38] np halfak! when you have time, I want to hear your thoughts on the technique of switching mediums to make this easier (IRC -> Etherpad -> back), since I feel like we'd have gone in circles a lot more otherwise
[21:48:32] +1 I think it's awesome.
[21:48:42] I like that we have an artifact to reference
[21:49:04] Rather than just IRC logs -- or worse: just the words "We talked about it on IRC"
[21:49:06] yup. and the artifact also shows progress on us hashing out the issues.
[22:17:42] OK time to post weekly update.