[00:52:10] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3991583 (10bd808) If it makes anything any easier, I think we can easily tolerate a multi-hour (probably up to 24?) read-only period for wikitech while t...
[00:55:09] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3991586 (10bd808) > Figure out about wikitech-static syncing. Can the script that dumps the whole database still run on silver? If so then this should be...
[07:13:29] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3991942 (10Marostegui) First attempt to deploy puppet was reverted. First there was a typo nothing big: https://gerrit.wikimedia.org/r/#/c/413302/ And once tha...
[07:33:16] https://gerrit.wikimedia.org/r/#/c/413304/ ?
[07:33:44] check: https://phabricator.wikimedia.org/T184704#3991942
[07:40:00] so, we don't deploy it?
[07:40:32] I reverted it because I didn't think it was a good idea to debug on the fly
[07:40:39] Didn't want to rush things
[07:40:51] no, no problem with that
[07:41:12] just asking what you wanted to do next, as you seem to know the issue
[07:41:41] I guess we'd need to fix that puppet thingy and then we can deploy anytime I would say, I don't mind having tendril stopped for 2-3 hours during the day really
[07:42:24] I think I converted the ferm thing from a profile to a resource after I proposed the patch
[07:45:11] I am going to take a look at db2037 with broken puppet
[07:48:24] would this work https://gerrit.wikimedia.org/r/#/c/413309/ ?
[07:51:01] From what I saw the issue was with profile::mariadb::ferm, not with firewall
[07:51:29] The firewall was just a typo that was fixed with my commit: https://gerrit.wikimedia.org/r/#/c/413302/
[07:53:21] yeah, that new one I guess should work
[07:53:28] wanna deploy?
[07:55:27] let's run the puppet compiler
[07:55:45] oki
[07:57:33] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/10077/console
[07:58:11] ERROR: Unable to find facts for host db1115.eqiad.wmnet, skipping
[07:58:16] so not very useful
[07:59:03] oh yeah, because it is a new host...
[08:01:20] is puppet disabled on db1011?
[08:04:44] no, I enabled it after reverting
[08:04:45] let me disable
[08:05:34] done
[08:05:49] you can merge if you like
[08:06:11] ok
[08:14:50] db1115 looking good
[08:14:51] no?
[08:14:53] so far
[08:15:17] seems stuck on Notice: /Stage[main]/Packages::Libaio1/Package[libaio1]/ensure: created ?
[08:15:52] percona-toolkit installation failed
[08:16:15] I saw it is now on percona-xtrabackup
[08:16:17] I think it is the repo
[08:16:31] xtrabackup is not installed on stretch because it was removed from debian
[08:16:48] another great idea because of the mysql removal
[08:17:59] it is not puppet, apt-upgrade has problems
[08:18:05] *apt-update
[08:27:48] so, do we do the copy now?
[08:29:50] sure
[08:30:12] do you want to do it?
[08:30:15] let me prepare it and stop mysql on db1011
[08:30:19] sure, I will do it
[08:30:23] let's announce it
[08:30:27] yeah
[08:47:12] I wonder if dbmonitor will fail because of that?
[08:47:28] no, we still get a 200 ok
[08:49:40] I wonder how we should do this - do we set up a replica and disable events and/or set up row based replication?
[08:49:53] should we set up an independent master?
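As a rough illustration of the replica-vs-independent-master question above: before deciding, one would typically check whether the old tendril host still has the event scheduler and binary log enabled, and what binlog format it would replicate with. A minimal sketch assuming a root-capable mysql client on the host; the idea of disabling events comes from the discussion, but this is not the exact procedure that was used:
```
# Check the settings that matter for the replica-vs-independent-master decision
sudo mysql -e "SELECT @@event_scheduler, @@log_bin, @@binlog_format, @@server_id\G"

# If events would double-write on a replica, they can be switched off globally
# (this does not survive a restart unless also set in the config file)
sudo mysql -e "SET GLOBAL event_scheduler = OFF;"
```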
[08:50:34] should we repoint to the new one right away?
[08:50:55] I was wondering about that too
[08:51:03] I am more inclined to say, let's point to the new one
[08:51:16] if things fail, we can repoint back
[08:51:21] agree
[08:51:43] we could need some trials, db1011 was quite old
[08:52:18] BTW, did you see my comments on tendril yesterday?
[08:52:30] yeah :(
[08:52:35] I spent most of the time trying to back it up consistently, and failed
[08:52:54] I was not entirely surprised, but I was convinced I was wrong XD
[08:53:08] I think we should focus right away on prometheus-private
[08:53:46] basically building blocks based on standardized solutions
[08:54:12] and doing on our own only the minimal things - a dashboard, a tree
[08:55:02] 123G copied in 12 minutes, so not bad
[08:56:16] the disks were HDs, right?
[08:56:43] yep
[09:05:47] 10DBA, 10Operations, 10ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3992082 (10Marostegui)
[09:05:58] 10DBA, 10Operations, 10ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3992094 (10Marostegui) p:05Triage>03Normal
[09:19:31] so I would add db2090 to s4 codfw
[09:19:42] that is one of the new large ones, no?
[09:19:51] yes, unused for now
[09:20:23] and at some point, maybe make db2078 a multi-instance failover for all m* hosts
[09:21:04] yeah, that's a good use of it
[09:21:30] how many do we still have unused?
[09:21:30] with that, there would be 1 left
[09:21:33] right
[09:21:47] we should check for gaps on load
[09:21:57] I am a bit worried about enwiki growth
[09:22:09] And I am a bit worried about the enwiki codfw master
[09:22:11] but we should compare current states
[09:22:14] ha
[09:22:51] we can upgrade all of enwiki codfw to stretch and test a stretch master
[09:23:00] :|
[09:23:00] with a switchover, I mean
[09:23:01] XDD
[09:23:16] I would say we do that with s5 XD
[09:23:24] why s5?
[09:23:34] I was thinking s1 because I think there are some already done
[09:23:56] yeah, but we will do a DC switchover "soon", do we want to try a 10.1 master on enwiki as the first shard?
[09:24:19] s8 wouldn't be better, honestly
[09:24:31] if you are worried about rollback, I would go for s3
[09:24:35] I said s5 because it has less traffic and would have less impact
[09:24:41] fewer nodes and less probability of failure
[09:24:49] ah, s5
[09:24:56] I mixed it up with the wikidata one
[09:25:03] s5 is dewiki sorry
[09:25:22] no sorry, you said it well, it was me who made the mistake
[09:30:29] btw: https://gerrit.wikimedia.org/r/#/c/413317/
[09:32:13] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3992141 (10Marostegui)
[09:32:40] yeah, I saw it
[09:32:48] apparently dbtree uses it
[09:32:51] not sure about dbmonitor
[09:36:13] it does
[09:36:22] $_ENV['db_host'] = 'tendril-backend.eqiad.wmnet';
[09:36:43] on modules/tendril/templates/config.php.erb
[09:37:01] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3992142 (10Marostegui) For s2 I suggest db1076. The only non big server is db1060 (which will go away soon (T186320)) but is is sanitarium maste...
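A quick way to confirm which services point at the tendril backend (the grep both of them were doing), sketched here against a local checkout of operations/puppet; the file path is from the log, the exact directories and grep flags are just an example:
```
# Find every reference to the tendril backend CNAME in the puppet repo
grep -rn 'tendril-backend' modules/ hieradata/ 2>/dev/null
# e.g. modules/tendril/templates/config.php.erb:
#   $_ENV['db_host'] = 'tendril-backend.eqiad.wmnet';
```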
[09:37:11] yeah, I was grepping too haha
[09:57:11] <_joe_> marostegui, jynus if you see high load on some s1 slaves in the next hour or so, let me know
[09:57:25] <_joe_> I'm running load testing on mwdebug1001, nothing should happen at all
[09:57:30] <_joe_> but in case, lemme know
[09:57:37] will do - thanks
[09:58:16] could that wait?
[09:58:29] we have degraded monitoring right now
[09:58:51] marostegui: how much longer would you say it will take?
[09:59:02] it is now at 613G
[09:59:10] of 900?
[09:59:12] so I guess around 30 minutes
[09:59:16] yep
[09:59:27] _joe_: could you wait 30 minutes just in case?
[09:59:44] <_joe_> ok
[09:59:54] <_joe_> I'll just run the low-concurrency tests for now
[10:00:01] thanks
[10:00:10] <_joe_> they are useful anyways, and should be less than 0.1% of enwiki requests anyways
[10:00:44] <_joe_> np, I'm not really in a rush, I will still have to wait until next week for a full deploy of etcdconfig
[10:01:09] <_joe_> I *think* we could start testing database configs via etcd in like a couple of weeks
[10:01:19] <3<3<3<3<3<3<3<3
[10:24:16] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3992241 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on sarin.codfw.wmnet for hosts: ``` ['db2090.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20180...
[10:31:47] 50G to go!
[10:42:55] mmm
[10:43:00] Failed to start mariadb.service: Unit mariadb.service failed to load: No such file or directory.
[10:43:57] ah it is jessie
[10:44:19] jessie?
[10:44:26] db1115 is jessie
[10:44:33] with 10.0.34 apparently
[10:44:34] we should reimage it
[10:44:55] I don't want to do this migration again
[10:45:03] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3992295 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2090.codfw.wmnet'] ``` and were **ALL** successful.
[10:45:11] there is also something wrong, because there is no init.d/mysql
[10:45:40] well, it will be missing because I created puppet for stretch, not for jessie
[10:45:49] ah right :)
[10:46:03] then we can start db1011, copy the data from db1115 somewhere, reimage db1115 with stretch and copy it back
[10:46:09] no
[10:46:19] reimage without deleting /srv
[10:46:38] ah true, you tested it and it worked fine
[10:47:03] I just set up db2073 to do that
[10:48:04] do you patch it or do I?
[10:48:18] you can do it, I am with db2090
[10:48:21] oki
[10:49:16] reimage the codfw one too, BTW
[10:49:22] yep
[10:49:36] oh, that is stretch already
[10:49:49] how can one be on stretch and not the other?
[10:50:03] oh
[10:50:14] chris vs papaul, I guess?
[10:50:18] not set up by us
[10:50:22] yep, probably
[10:50:48] we should have validated that on handover
[10:51:36] can you take a quick look?: https://gerrit.wikimedia.org/r/#/c/413342/
[10:52:02] apparently, that didn't work for me
[10:52:07] oh it didn't?
[10:52:20] mmm these hosts have raid1 so, maybe it is different?
[10:52:25] aside from that, it is a different recipe
[10:52:41] you have to remove all recipes to reimage with custom formatting
[10:53:28] db recipes are for hw raid
[10:53:53] ah, I get what you mean
[10:54:06] look at my patch for 2073
[10:54:12] it is there
[10:54:57] you can maybe allow all db1*** hosts, for example
[10:55:04] yeah, I was thinking about that
[10:55:22] and I don't want to say it, but you know what I am thinking
[10:55:30] haha
[10:57:19] mark: are you joining us in the meeting?
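The copy-time estimate earlier in this stretch (123G in 12 minutes, then 613G of ~900G with about 30 minutes left) is internally consistent; a back-of-the-envelope check, with the sizes taken from the log and GiB assumed as the unit:
```
# 123 GiB in 12 min  -> 123*1024/720 ≈ 175 MiB/s sustained
# (900-613) GiB left -> 287*1024/175 ≈ 1680 s ≈ 28 min, matching the ~30 min guess
echo "$((123*1024/720)) MiB/s"
echo "$((287*1024/(123*1024/720)/60)) min remaining"
```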
[10:57:24] if not I would say we can cancel it
[11:00:48] I think I am going to edit ./modules/install_server/files/autoinstall/partman/raid1-gpt.cfg and that's it
[11:00:59] temporarily
[11:01:04] because otherwise I have doubts it will work
[11:01:26] that doesn't work
[11:01:33] it needs no recipe
[11:01:48] you sure?
[11:38:55] yep, I tested it for the es200* hosts
[11:39:00] needs no recipe
[11:39:02] so you are saying I have to completely remove the db1115 line? (You have done this before and I haven't, so that's why I am unsure :-) )
[11:39:05] cool
[11:39:23] that is in theory
[11:39:31] the problem is the regex behind it
[11:39:41] it will match there
[11:40:19] look how I had to remove db207* (assuming my regex is correct, which I am not 100% sure of)
[11:40:28] anyway, if you have to go, I can continue
[11:40:36] I do not want to leave tendril down for long
[11:40:54] Yeah, I have to go in 20 mins
[11:40:58] https://gerrit.wikimedia.org/r/#/c/413342/4/modules/install_server/files/autoinstall/netboot.cfg
[11:41:04] I believe that will exclude db1115
[11:41:17] I agree
[11:41:25] I will deploy that and test?
[11:41:30] sure
[11:41:36] I am still here till 13:00
[11:41:51] yes, worst case scenario, it will fail, not delete data
[11:41:58] so it should be fine
[11:42:01] yeah, and "only" db1115
[11:42:01] XD
[11:42:39] then disable notifications if it is not done, reimage and get on console
[11:42:51] I downtimed it for 24h
[11:42:51] I normally change / to ext4
[11:42:52] but better to disable, yeah
[11:43:10] doing it now
[11:43:17] we should have a proper recipe for hw raid
[11:43:43] notifications disabled
[11:43:56] but it would be useless only for software raid 1
[11:44:46] we have to make all of that better, but probably not worth it until buster happens
[11:44:53] going to merge: https://gerrit.wikimedia.org/r/#/c/413342/
[11:44:54] as partman will change again
[11:45:51] we will revert all changes once our reimages happen correctly
[11:46:01] yeah
[11:46:35] let me know the state in which you leave stuff so I can continue with it, meanwhile I will take care of db2090/2073
[11:46:46] yeah, will do
[11:48:03] we should also ask chris to set up defaults for new hosts
[11:48:10] as papaul did
[11:53:19] ok so, to sum up
[11:53:34] just reimage :-)
[11:53:40] the db1115 partman change is merged and I have run puppet on install1002, the change is there, so you can reimage it when you have a chance
[11:53:44] Oh, I can do that then
[11:53:55] connect to console, too
[11:53:59] db1089 low weight, slowly repooling it, you can forget about it
[11:54:10] db1076 s2 low weight after restarting it, you can forget about it
[11:54:13] or let me do it
[11:54:16] db1067 s1, depooled for data checks
[11:54:26] yeah, better if you take care of db1115 because I will leave in 5 mins :)
[11:54:38] ok, not launched the reimage yet?
[11:54:43] nope
[11:54:44] what about the tendril hosts, summary?
[11:54:49] summary for that
[11:54:54] db1011 mysql is stopped
[11:54:58] db2093 nothing has been done
[11:55:06] db1115 the copy is there, but it needs a reimage
[11:55:11] any downtime that could expire?
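Since the netboot.cfg merge above relied on "assuming my regex is correct", a cheap sanity check is to evaluate the candidate hostname pattern locally before deploying. This is only a hypothetical illustration: the pattern below is invented for the example and is not the one in the actual change:
```
# Hypothetical pattern meant to cover db1* hosts except db1115, so db1115
# falls through with no partman recipe and keeps its /srv untouched.
pattern='^db1(0[0-9][0-9]|1(0[0-9]|1[0-46-9]))$'
for h in db1089 db1114 db1115 db1116 db2073; do
  if [[ $h =~ $pattern ]]; then echo "$h: gets the db recipe"; else echo "$h: no recipe"; fi
done
```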
[11:55:29] db1115 - notifications disabled + downtimed till tomorrow
[11:55:48] db1011 downtimed till tomorrow
[11:56:08] db2093 - notifications disabled
[11:56:11] I will try to make the new hosts work
[11:56:20] I will come back online when I get back
[11:56:22] if they don't, I will just restart db1011
[11:56:32] yeah, db1011 is intact, just mysql stopped
[11:56:43] and puppet stopped
[11:57:12] I don't claim I will be able to make things work there for sure, with so much code logic in stored procedures
[11:57:26] if you get db1115 to work - https://gerrit.wikimedia.org/r/#/c/413317/ that is the DNS change patch
[11:57:44] cool
[11:58:03] Now, I gotta go, thanks a lot!!
[12:02:42] 10DBA, 10Operations, 10Patch-For-Review: Puppetize tendril web user creation - https://phabricator.wikimedia.org/T148955#3992515 (10jcrespo) 05Open>03Resolved a:03akosiaris I'll consider this done- at least tracked on puppet. Better handling should be a goal on itself, and solved for all services.
[12:05:35] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3992519 (10jcrespo)
[12:06:54] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3892870 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1115.eqiad.wmnet'] ``` The log can be f...
[12:21:15] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3992542 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on sarin.codfw.wmnet for hosts: ``` ['db2073.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20180...
[12:34:16] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3992554 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1115.eqiad.wmnet'] ``` and were **ALL** successful.
[12:40:32] Unknown storage engine 'CONNECT'
[12:44:00] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3992587 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2073.codfw.wmnet'] ``` and were **ALL** successful.
[12:45:41] installing libodbc1 fixes it
[12:45:46] need to puppetize it
[12:46:02] also we have a wrong basedir, adding it to the list of pending stuff
[13:20:49] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3992630 (10jcrespo) So things pending to fix, aside from the above commits: * Fix package required for CONNECT- sudo apt-get install libodbc1 * Fix automonting...
[15:41:14] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3992904 (10Papaul) @Marostegui Disk in slot 5 is blinking.
[15:50:04] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3992982 (10jcrespo) db2030 or db2037?
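For the "Unknown storage engine 'CONNECT'" error above: CONNECT is shipped as a separate MariaDB plugin and needs libodbc1 at load time, which is what "installing libodbc1 fixes it" refers to. A minimal sketch of verifying the fix by hand; on this setup the plugin may of course already be loaded via the server config rather than INSTALL SONAME:
```
sudo apt-get install -y libodbc1
# Load the CONNECT plugin (ignore the error if it is already installed)
# and confirm the engine is now available
sudo mysql -e "INSTALL SONAME 'ha_connect';" 2>/dev/null || true
sudo mysql -e "SHOW ENGINES;" | grep -i connect
```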
[15:58:25] 10DBA, 10MediaWiki-extensions-Renameuser: Fix use of DB schema so RenameUser is trivial - https://phabricator.wikimedia.org/T33863#3992993 (10Huji)
[16:07:04] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993031 (10Papaul) No blinking disk on db2037 all looks good. And ILO is working on my end
[16:32:37] just reading the updates on db1115
[16:33:02] anything I can help with now?
[16:37:56] yeah
[16:38:29] we need to deploy https://gerrit.wikimedia.org/r/#/c/413375/ to open a hole on cloud dbs
[16:38:39] but it needs cloud review
[16:39:26] then there is the copy to db2093
[16:40:00] I can do the copy tomorrow morning
[16:40:40] other than that, the many fixes I have deployed should be enough
[16:40:50] 10DBA, 10Operations, 10ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3993152 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete.
[16:40:59] Nice work, really
[16:41:00] I have changed the config a bit, increasing available memory
[16:41:07] and enabled the binary log
[16:41:14] with non-consistent defaults
[16:41:17] Yeah, saw that
[16:41:26] that should be enough to set up replication
[16:41:27] Should I stop db1115 and copy it to db2093?
[16:41:31] tomorrow I mean
[16:41:34] we can do mydumper too
[16:41:46] no, mydumper would be too slow
[16:41:53] and I could not even use it
[16:41:58] ah toku
[16:41:59] no?
[16:42:04] because of locking and tokudb and events and stuff
[16:42:18] believe me, I tried it
[16:42:21] haha
[16:42:25] I totally believe you
[16:42:27] metadata locking and other stuff
[16:42:40] I ended up doing an inconsistent copy table by table
[16:43:01] the main thing to take into account is to not start it with the event logging enabled
[16:43:43] I would even go as far as disabling networking or deleting the federated tables or something radical
[16:44:12] the gathering of statistics doesn't really scale to many hosts
[16:44:28] I even have problems making it work properly on a single host
[16:45:44] I restarted the whole thing to check all changes worked
[16:45:50] and now it is getting overloaded
[16:45:51] Oh the events giving problems again?
[16:46:00] Like with dbstore1001?
[16:46:29] it could be just that it is slow at first
[16:47:15] do you want me to copy it to db2093 tomorrow then?
[16:47:31] let me see if I can make it work again
[16:47:51] sure
[16:49:25] maybe the binary log makes it too slow?
[16:49:44] what is slow exactly, the start?
[16:50:04] the inserts into common tables
[16:50:55] they are taking hundreds of seconds
[16:51:35] maybe the binary logging adds extra locking
[16:51:53] yeah, that sounds very plausible
[16:52:14] let me disable the event scheduler
[16:52:24] and start from 0
[16:52:30] ok
[16:52:33] 10DBA, 10Operations, 10ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3993182 (10Marostegui) Thanks a lot! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 9% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) physicaldriv...
[16:53:27] yeah, as soon as I re-enable it, it gets overloaded
[16:53:39] the binlog, right?
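To see the "inserts taking hundreds of seconds" mentioned above without watching SHOW PROCESSLIST scroll by, one can ask information_schema directly. A small sketch; the 100-second threshold is arbitrary:
```
# List long-running statements, longest first
sudo mysql -e "
  SELECT id, user, time, state, LEFT(info, 80) AS query
  FROM information_schema.processlist
  WHERE command = 'Query' AND time > 100
  ORDER BY time DESC;"
```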
[16:53:50] I am going to try to restart without it
[16:53:59] we may not be able to set up replication
[16:54:22] yeah, I was thinking that maybe we just need to have a tendril server on codfw but without replication, just ready to take over in case of need
[16:54:31] sure
[16:54:31] and manually add/drop hosts from it
[16:54:39] better than nothing
[16:54:40] modifying the add/drop scripts
[16:54:45] yeah, way better than nothing
[16:54:55] in which case, there is no need for a copy
[16:55:00] let me try to see if I am right
[16:55:03] indeed
[16:55:10] maybe it is another variable I changed
[16:57:22] we can still do "backups" of everything except global_status and global_status_5m
[16:57:31] and load it on the passive host
[16:58:15] yeah, but if we are going to do the copy anyway, we can just as well copy all the stuff
[16:58:18] but yeah, I agree
[16:58:52] well, "all" is a problem
[16:59:00] those tables are 900GB
[16:59:08] oh those are the big ones
[16:59:09] :|
[16:59:12] the "real" data is only 1.2GB
[16:59:21] the rest is what prometheus does
[16:59:23] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993212 (10Papaul) a:05Papaul>03Marostegui 1- Power drain 2- Update all firmware on the system.
[16:59:27] alternatively, we can remove that part
[16:59:34] and probably all problems gone?
[16:59:38] yeah, most likely
[16:59:43] not sure how easy it will be to remove all that
[17:00:35] it is still happening with the binlog off
[17:00:40] oh really?
[17:01:27] let's wait a bit, it is going down
[17:01:56] maybe it is only on cold restart
[17:02:08] plus toku does not preload anything
[17:02:45] I will also disable the inaccessible hosts
[17:03:00] processlist is going down at least
[17:05:46] marostegui: I'd say to not copy it, at least not tomorrow
[17:05:54] let's fix the issues
[17:06:01] and then think about what to do
[17:06:20] agreed
[17:06:22] don't worry about the QPS 0
[17:06:27] we didn't have a copy in codfw anyway
[17:06:29] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993231 (10Marostegui) 1- The ipmi is still not working :-( 2- Thanks! Also thanks for checking the disks, the system boot up finely, wh...
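On the "backups of everything except global_status and global_status_5m" idea above: since the "real" data is only ~1.2GB, a plain mysqldump that skips the two huge status tables looks feasible even where mydumper was not. A sketch only, assuming the schema is called `tendril` (the schema name is not stated in the log) and that the inconsistency between engines is acceptable, as it was for the copy:
```
# Dump the small "real" data, skipping the ~900GB status history tables
sudo mysqldump --single-transaction \
  --ignore-table=tendril.global_status \
  --ignore-table=tendril.global_status_5m \
  tendril | gzip > /srv/tendril_nostats.sql.gz
```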
[17:06:32] so better to have eqiad happy
[17:06:33] it takes 5 clock minutes
[17:06:36] first
[17:06:41] for calculations to give a number
[17:07:07] maybe replication is possible and it just needs time to cache into memory
[17:07:21] yeah, it would be a great win if we can manage to replicate it
[17:07:23] memory usage is at 5G
[17:07:53] and only slowly growing
[17:08:17] yeah, let's see how we find it tomorrow morning
[17:08:22] while it settles
[17:09:02] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3993252 (10Marostegui)
[17:09:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993251 (10Marostegui) 05Open>03Resolved
[17:11:58] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3983430 (10Marostegui) Tracking task for the IPMI issue: T188016
[17:13:01] maybe you can consider a db2037 stretch upgrade
[17:13:41] yeah, we have to do that
[17:13:50] we could reimage it without srv formatting
[17:19:31] I set InnoDB as the default engine
[17:19:35] that could be another problem
[17:19:53] I am going to revert all config changes
[17:20:46] ok
[17:28:28] I think it was Aria
[17:28:33] but I reverted everything
[17:29:03] oh aria…
[17:29:05] what black magic makes that happen, I don't know, and I am not sure if I want to know
[17:29:07] our lovely friend
[17:29:11] no, no
[17:29:23] I mean I reverted *back* to Aria
[17:29:45] it must be using some optimization or something
[17:29:58] aaah right right
[17:30:20] now the processlist is at 40 rather than 900
[17:30:26] oh crap
[17:30:35] maybe it does heavy create and drop of tables
[17:30:35] maybe we need to start liking aria?
[17:30:46] nope, we need to dislike the whole setup
[17:30:51] that too XDDD
[17:31:20] this should be back to normal https://tendril.wikimedia.org/host
[17:31:32] we can try increasing buffers again tomorrow
[17:31:40] I honestly would not copy anything for now
[17:31:44] yeah, it looks like our normal tendril
[17:31:49] yeah, let's not rush
[17:32:00] we haven't had a codfw host for years, so it is ok to wait to make sure we are happy with eqiad
[17:32:07] https://tendril.wikimedia.org/host/view/db1115.eqiad.wmnet/3306
[17:32:21] that is going back to aria?
[17:32:23] :o
[17:32:23] if it uses aria and toku
[17:32:46] and non-reliable settings, I worry about replication
[17:38:19] Yeah, I guess it is a patch over a patch and that for years...
[18:37:28] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3993850 (10bd808)
[18:38:55] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3351594 (10bd808) See {T188029} for near term plan to move off of silver. This will satisfy the immediate needs for FY17/18 Q3 goals.
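A quick way to see the 900GB-vs-1.2GB split and the Aria/TokuDB mix discussed above, again assuming the schema is called `tendril` (sizes for TokuDB tables in information_schema are only approximate):
```
# Biggest tables with their storage engines
sudo mysql -e "
  SELECT table_name, engine,
         ROUND((data_length + index_length)/1024/1024/1024, 1) AS size_gb
  FROM information_schema.tables
  WHERE table_schema = 'tendril'
  ORDER BY size_gb DESC
  LIMIT 15;"
```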
[18:39:16] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3993856 (10bd808)
[19:08:21] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3994003 (10jcrespo)
[21:44:45] 10Blocked-on-schema-change, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Schema-change: Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994420 (10Tgr)
[21:45:05] 10DBA, 10Operations, 10ops-eqiad: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#3994435 (10Cmjohnson) A ticket has been created with Dell . You have successfully submitted request SR961176970.
[21:49:23] 10Blocked-on-schema-change, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Schema-change: Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994447 (10Tgr)
[21:50:10] 10Blocked-on-schema-change, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Schema-change: Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994420 (10Tgr) Adding @Dbrant, @JoeWalsh for visibility.
[21:50:36] 10Blocked-on-schema-change, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Schema-change: Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994452 (10Tgr)
[21:51:10] 10Blocked-on-schema-change, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994420 (10Tgr)
[21:53:37] 10Blocked-on-schema-change, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994471 (10Tgr)