[00:04:01] Krenair: yeah, maybe so. I guess I should remove that "this is probably uncontroversial" part ;)
[00:04:12] greg-g, I'm not saying it's controversial
[00:04:26] I'm just saying I don't know if this will achieve anything
[00:07:06] greg-g: ultimately I think that issue is just about SWAT deployers saying no
[00:08:01] * greg-g nods
[00:08:17] ultimately if the deployer is cool with pushing the change then nobody has been hurt
[00:08:34] but when I see swats that need a scap I'm pretty supsicious
[00:08:52] suspicious even :)
[00:09:24] when I did swat, I very rarely had to run scap
[00:09:43] *nod* it should be pretty exceptional
[00:10:12] as in, a full scap
[00:10:45] I guess they are all scap ... now :)
[00:10:53] heh "needs a scap" will have to be deprecated ;)
[00:11:08] "needs a slow ass l10nupdate build"
[00:11:14] hah
[00:11:28] unless it's fixing a message issue, of course
[00:11:48] If SAL is to be believed, I've run scap about three times
[00:11:50] even then I think that's beyond what SWAT was originally intended for
[00:12:06] someday: one command that does the right thing
[00:12:09] * greg-g nods
[00:12:34] I'm sure that you've noticed that hardly anyone does MW patch deploys outside of swat anymore
[00:12:47] which is maybe good, but it's bad if people aren't learning how to deploy
[00:13:05] PROBLEM - Puppet run on deployment-db1 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[00:13:34] I don't want things to revert to the old "only X can do that" just because people got used to it being easy
[00:14:04] * bd808 should really volunteer to run the train some week to remember all the pain
[00:14:21] it's better! slightly!
[00:14:33] eh, I think since I've been here I've seen more people learn to deploy recently than when I first started.
[00:15:00] this may be something to do with concentration bias
[00:15:21] thcipriani: awesome. I know I've seen a lot of scap3 users
[00:15:36] I still need to move some things off of trebuchet
[00:15:46] scholarships and iegreview for sure
[00:15:49] yeah, that's coming along. Deep into the long tail.
[00:16:10] mostly things that are getting deployed frequently use scap3
[00:16:20] with a few notable exceptions
[00:16:42] parsoid got changed over right?
[00:16:53] yeah! 40-some-odd servers
[00:17:04] nice
[00:17:34] so we know scap3 can work within an order of magnitude of what MW needs :)
[00:17:42] :D
[00:17:51] we have 10x that number of servers though
[00:18:04] we were talking about how to get mw to scap3 today, game plan.
[00:18:36] 1) lock some people in a room, 2) give them a root to help, 3) do it
[00:18:44] kinda :Pj
[00:18:46] -j
[00:18:53] I want my pjs on now
[00:19:02] honestly my feeling is: end of fiscal year seems like a reasonable goal. There are probably lots of unknown unknowns here.
[00:19:05] Rowan has an awesome pair of footie pjs that I want
[00:19:29] (I think it's quitting time on that note)
[00:19:29] I think a good first step may be to: find a number to represent what's deployed
[00:19:43] thcipriani: it's going to break in a zillion small ways. You just need to be ready to power through that with fixes
[00:19:46] like we have no real overall version number for mediawiki deployments
[00:20:46] I still think it should just be a git hash. use /srv/mw-staging as staging; copy into /srv/deployment/mw/...; commit; push
[00:20:52] svn-style versioning?
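As a rough illustration of the git-hash idea above, a minimal sketch of the staging → commit → push flow on the deploy host. The destination under /srv/deployment/mw/ is left unnamed in the chat, so the /srv/deployment/mw/current path, the "origin" remote, and the commit message here are assumptions for illustration, not the actual scap implementation:

    #!/bin/bash
    # Sketch only: version the deployed tree by its git hash, per the idea above.
    # /srv/deployment/mw/current and the "origin" remote are hypothetical names.
    set -e
    STAGING=/srv/mw-staging
    DEPLOY=/srv/deployment/mw/current

    # Flatten the prepared tree into the deploy directory.
    rsync -a --delete "$STAGING/" "$DEPLOY/"

    cd "$DEPLOY"
    git add -A
    git commit -m "deploy $(date -u +%Y-%m-%dT%H:%M:%SZ)"

    # The resulting commit hash is the overall version number for this deployment.
    git rev-parse HEAD

    # Publish so the targets can fetch it.
    git push origin HEAD

The hash printed by `git rev-parse HEAD` would be the value the per-server updates later fetch and rebase onto.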
[00:20:53] I feel like it's probably reasonable (and I think you suggested this before, bd808) to move ...yeah
[00:21:23] I would have done that a year ago if it was my project :)
[00:21:44] I think use rsync on tin/mira locally: flatten the whole thing out, then add to git, and you can (mostly) use scap3 from there
[00:21:48] actually 2 years ago if Erik hadn't pulled me off for SUL
[00:22:05] * greg-g shakes his fist
[00:22:09] it needs some l10nupdate integration somewhere right?
[00:22:23] that would be part of what gets versioned
[00:22:29] right
[00:22:46] yeah, that's tricky, this is one of the unknown unknowns, but my feeling is that you get more info about the json files that are generated as part of l10n than you otherwise would.
[00:23:06] and gitignore cdbs, like they are now (effectively)
[00:23:06] it's really exactly what we do now, except you replace rsync with a git fetch
[00:24:01] hrm. symlink swapping fanciness? This is something we were talking about.
[00:24:10] and it will make it soooo much faster
[00:24:14] Deployment-Systems, Release-Engineering-Team, User-greg: Require an associated task with each SWAT item - https://phabricator.wikimedia.org/T145255#2624906 (Dereckson) Changes without associated tasks seem to generally be fixes to errors in production (generally triggered by JS code, which seems les...
[00:24:37] I wouldn't get fancy at first honestly
[00:25:13] git fetch + git rebase $HASH will be just as fast as the current rsync for being "atomic"
[00:25:24] indeed.
[00:25:38] and I know we want that to be better, but changing one thing at a time is easier to debug
[00:25:52] Deployment-Systems, Release-Engineering-Team, User-greg: Require an associated task with each SWAT item - https://phabricator.wikimedia.org/T145255#2624914 (greg) >>! In T145255#2624906, @Dereckson wrote: > The existence or not of a task "Deploy CentralNotice at ", task welcome for better depl...
[00:26:13] adding in depooling via etcd first would actually make atomic change mostly moot
[00:26:20] yup
[00:26:32] we're trying to think about how to divide this out into quarters right now. That might be a good q2 goal: move to git for transport.
[00:26:32] I like the depooling/repooling solution to atomicness
[00:26:52] I can't think of a smaller chunk, that's a huge one.
[00:27:04] scap itself is all ready for that. It just needs an actual working depool/repool in place of the shitty hack I tried
[00:27:19] so we'd wait for traffic on a server to entirely drain before we apply the changes and repool?
[00:27:33] yeah
[00:27:40] I don't know if depooling/repool == atomic deploys.
[00:27:41] might that slow things down a bit?
[00:28:03] it would probably require tuning the batch size a bit
[00:28:06] just got the "come pick us up" text, have a good weekend y'all
[00:28:20] o/ see ya greg-g
[00:28:49] like you can see a scenario where it might be a long time where half the servers have one version of the code, half have another version.
[00:28:52] o/ greg-g
[00:29:12] Krenair: it probably would at first, yes. And it would slow everything down a bit because I think we would also kill sync-file/sync-dir at the same time
[00:29:30] in the depooling/repooling world and that still wouldn't be super atomic. Like per-server git might make it atomic, but not per deployment
[00:29:33] so deploys would be safer
[00:29:36] but it would speed up a full scap significantly I think
[00:29:37] but slower
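To make the depool → update → repool idea above concrete, a rough per-server sketch. The hostnames, the /srv/mediawiki path, and the depool/pool helper commands are placeholders for illustration, not scap's actual implementation; a real run would update a tunable batch of servers at a time rather than strictly one by one:

    #!/bin/bash
    # Sketch only: drain a server, flip its tree to the deployed hash, repool it.
    # Hostnames, paths, and the depool/pool helpers are assumptions.
    set -e
    DEPLOY_HASH="$1"                        # git hash that represents this deployment
    SERVERS="appserver01.example appserver02.example appserver03.example"

    for server in $SERVERS; do
      ssh "$server" "
        depool &&                           # stop sending traffic before touching code
        cd /srv/mediawiki &&
        git fetch origin &&
        git rebase $DEPLOY_HASH &&          # the server jumps to the new tree in one step
        pool                                # only repool once the new code is in place
      "
    done

Going a few servers at a time trades some deploy speed for never serving traffic from a half-updated server, which is the point of the depooling/repooling approach discussed above.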
[00:29:47] Hello. I confirm bd808/Krenair's point, the only times I needed a scap were exceptional and because an l10n key changed.
[00:30:50] thcipriani: yeah. the mixed cluster problem is a whole different thing
[00:31:20] but honestly I think we should deal with that differently, because we should be saling on % of traffic and not named wikis
[00:31:22] thcipriani, you're talking about the problem of different servers in the cluster having different versions at the same time?
[00:31:24] *scaling
[00:31:53] Krenair: yeah, it's a related but possibly tangential problem that we have now to some degree
[00:32:04] yeah that's a whole different thing
[00:32:32] when we discuss atomic deploys we generally mean dealing with the problem of one server having an old version of one file but a new version of another
[00:32:34] the progression really should be group0, 10% traffic, 50% traffic, all traffic
[00:32:50] maybe those numbers aren't quite right, but I think it gives the point
[00:33:16] hrm, that would be nice.
[00:33:50] and the time between g0 and N% should be pretty short, like an hour or two
[00:34:24] you think a deployment should take an hour or two to fully affect everything?
[00:34:24] the bugs found in group0 are typically OMG broken. the 24 hour soak doesn't change much
[00:34:44] anyway, moving to git in a quarter I think mainly gives us a version that represents the deployed code: which has a lot of neat side-effects (like a server can tell it has the wrong version!) plus we trade disk-io on proxies and deployment servers for network traffic.
[00:35:04] Krenair: I don't think we are ready for full rolling deploy in a few hours yet, no
[00:35:18] but I think we should get from almost no traffic to a reasonable traffic sooner
[00:35:40] I'm not entirely sure it'll be any more or less atomic than it is currently considering rsync has all the --delay type flags added
[00:36:06] I think it will be similar, not a real gain or loss there
[00:36:14] agreed
[00:36:20] but the iops gain on rsync will be huge
[00:36:52] yeah, but the network traffic difference is also likely to be sizable
[00:37:02] hard to beat rsync on transfer
[00:37:19] you think? git is basically precomputed diffs
[00:37:43] I'm not talking fresh clones
[00:37:50] literally git fetch + git rebase
[00:38:03] I think it'll be comparable after we get past the initial setup
[00:38:34] Oh, sure. the first sync is a doozie
[00:38:36] the fresh clone thing...not looking forward to that. Was actually thinking about: if you don't have a /srv/mediawiki/.git: rsync it
[00:39:07] should be able to sparse clone I think to bootstrap a new server
[00:39:23] these aren't working copies for push, just replicas
[00:39:33] yarp
[00:39:34] but it would need a bit of testing
[00:40:34] I know you folks are on it. I don't mean to armchair quarterback
[00:40:35] well, I think we're still on that page in terms of a bare-bones plan. Once we have some kind of git transport setup, I think that buys us a lot of interesting stuff.
[00:40:42] just be bold and get shit done :)
[00:40:45] :D
[00:41:26] Nah, we *just* had a meeting today about it. I said in the meeting: someone ought to run all this stuff by bd808 since he's spent as much time thinking about it as we have, likely
[00:42:14] I feel like we're mostly lined up on our thinking here, so that's good
[00:42:17] I hope :P
[00:42:28] or we're all very wrong about this.
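A rough sketch of the target-side bootstrap floated above: if a replica has no .git yet, seed it first (rsync from the deploy host, or the sparse clone mentioned in the chat), otherwise just fetch and rebase. The deploy-host name, the /srv/mediawiki path, and the "origin" remote are assumptions, not how scap3 actually does it:

    #!/bin/bash
    # Sketch only: bring one replica to the deployed hash, seeding it first if needed.
    # deploy.example, /srv/mediawiki, and "origin" are placeholder names.
    set -e
    DEPLOY_HOST=deploy.example              # stand-in for the deployment server (tin/mira)
    TREE=/srv/mediawiki
    DEPLOY_HASH="$1"

    if [ ! -d "$TREE/.git" ]; then
      # The first sync is the expensive one: seed the replica from the deploy host.
      rsync -a "$DEPLOY_HOST:$TREE/" "$TREE/"
    fi

    cd "$TREE"
    git fetch origin
    git rebase "$DEPLOY_HASH"               # replicas carry no local work, so this is effectively a fast-forward

Since these trees are replicas rather than working copies for push, the incremental fetch should be comparable in transfer cost to the current rsync once the initial setup is out of the way.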
[00:43:03] we all could be honestly, but then we would learn and do something more awesome
[00:43:58] Or just become wandering vagrants :)
[00:44:27] ranting incoherently about disk io
[00:44:41] the only other direction I would be interested in would be something like FB's tupperware system with torrents of squashfs files
[00:45:17] but at that point we might be better off looking at Docker/rkt containers
[00:45:51] yeah. There are a lot more steps between here and there. I'd like to see if we can get all the niceties without all those steps.
[00:46:29] and if we can't, then we'll have to take a few extra quarters I guess.
[00:46:50] thcipriani: there is no deadline...
[00:47:03] except my pizza just showed up :)
[00:47:10] ha
[00:47:34] ttyl
[00:48:49] o/
[00:53:05] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:07:24] Continuous-Integration-Config, OOjs-UI, Patch-For-Review: Move OOUI out of the MediaWiki gate-and-submit queue - https://phabricator.wikimedia.org/T134946#2625014 (Jdforrester-WMF) This seems to now be done somehow?
[01:25:14] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[02:00:14] RECOVERY - Puppet run on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [0.0]
[02:26:20] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[03:01:14] RECOVERY - Puppet run on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [0.0]
[03:22:58] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[03:56:44] Release-Engineering-Team (Long-Lived-Branches), Scap3, Patch-For-Review: Create `scap swat` command to automate patch merging & testing during a swat deployment - https://phabricator.wikimedia.org/T142880#2625080 (mmodell)
[04:02:58] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:14:28] (CR) Hashar: [C: 1] Add tox checks to pywikibot/bots/FLOSSbot [integration/config] - https://gerrit.wikimedia.org/r/309715 (https://phabricator.wikimedia.org/T145179) (owner: Dachary)
[08:16:10] (CR) Jean-Frédéric: [C: 1] Add tox checks to pywikibot/bots/FLOSSbot [integration/config] - https://gerrit.wikimedia.org/r/309715 (https://phabricator.wikimedia.org/T145179) (owner: Dachary)
[08:16:56] (CR) Hashar: [C: 2] "Moaaar bots everywhere :]" [integration/config] - https://gerrit.wikimedia.org/r/309715 (https://phabricator.wikimedia.org/T145179) (owner: Dachary)
[08:18:00] (Merged) jenkins-bot: Add tox checks to pywikibot/bots/FLOSSbot [integration/config] - https://gerrit.wikimedia.org/r/309715 (https://phabricator.wikimedia.org/T145179) (owner: Dachary)
[08:21:37] (PS1) Hashar: Whitelist Loic Dachary [integration/config] - https://gerrit.wikimedia.org/r/309721 (https://phabricator.wikimedia.org/T145179)
[08:21:59] (PS2) Hashar: Whitelist Loic Dachary [integration/config] - https://gerrit.wikimedia.org/r/309721 (https://phabricator.wikimedia.org/T145179)
[08:23:27] (CR) Hashar: [C: 2] Whitelist Loic Dachary [integration/config] - https://gerrit.wikimedia.org/r/309721 (https://phabricator.wikimedia.org/T145179) (owner: Hashar)
[08:24:31] (Merged) jenkins-bot: Whitelist Loic Dachary [integration/config] - https://gerrit.wikimedia.org/r/309721 (https://phabricator.wikimedia.org/T145179) (owner: Hashar)
[08:35:39] g'morning hashar
[08:35:44] g'night all
[09:36:25] (CR) Dachary: "Thanks !" [integration/config] - https://gerrit.wikimedia.org/r/309721 (https://phabricator.wikimedia.org/T145179) (owner: Hashar)
[09:38:43] Continuous-Integration-Config, Patch-For-Review: FLOSSbot continuous integration - https://phabricator.wikimedia.org/T145179#2625274 (dachary) @hashar thanks for the help, it is much appreciated :-) From my point of view this task is resolved. Do you see anything else requiring our attention ?
[09:44:04] PROBLEM - Puppet run on deployment-db1 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[10:07:44] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[10:08:22] PROBLEM - Puppet run on deployment-ms-be02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[10:10:10] PROBLEM - Puppet run on integration-slave-trusty-1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:24:05] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:42:43] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:43:21] RECOVERY - Puppet run on deployment-ms-be02 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:45:09] RECOVERY - Puppet run on integration-slave-trusty-1018 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:58:34] Continuous-Integration-Config, Patch-For-Review: FLOSSbot continuous integration - https://phabricator.wikimedia.org/T145179#2625379 (hashar) Yes one less thing: start the bot and populate wikidata! :] Whenever you need the binary packages, either reopen this task or fill a new one. Adding them will b...
[12:07:18] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[12:42:20] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:44:03] Continuous-Integration-Config, Patch-For-Review: FLOSSbot continuous integration - https://phabricator.wikimedia.org/T145179#2625413 (dachary) > start the bot and populate wikidata! :] Will do :-)
[12:44:23] Continuous-Integration-Config, Patch-For-Review: FLOSSbot continuous integration - https://phabricator.wikimedia.org/T145179#2625414 (dachary) Open>Resolved p:Triage>Lowest
[13:24:40] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[13:48:02] Yippee, build fixed!
[13:48:03] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #141: FIXED in 4 min 1 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/141/
[14:04:38] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:28:28] PROBLEM - Puppet staleness on deployment-salt02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[15:12:43] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:52:43] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:38] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[16:30:37] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:52:33] PROBLEM - Puppet run on deployment-mediawiki02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:32:33] RECOVERY - Puppet run on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:13:42] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[20:53:42] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[21:23:59] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[22:03:44] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[22:03:58] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:43:45] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:44:41] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[23:24:41] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]