[05:36:00] 10DBA, 10Patch-For-Review: Migrate s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996#3540181 (10Marostegui) db1102 is now replicating s4 and catching up. The triggers are in place and so are the filters. However I will run a check_private_data once it has caught up, to be on the safe side.
[06:41:23] 10DBA, 10Patch-For-Review: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3540214 (10Marostegui) s7 is an interesting case. All the slaves look consistent among them. However, for the differences existing on db1041 and db1079 ie: (arwiki.tag_summary) all the slaves look the same, but d...
[07:29:39] jynus: you have 5 times for me?
[07:29:42] 5 minutes
[07:30:55] yes
[07:33:17] Ok, I would like another pair of eyes for the change master on the labsdb servers basically
[07:33:42] Going to stop replication on db1064 (sanitarium2 and sanitarium3 s4 master)
[07:34:01] ok, I thought it was 5 minutes to tell me how much I sucked :-)
[07:34:22] xdddddddd
[07:34:34] Not today!
[07:34:34] :p
[07:35:36] ok, db1064 stopped and db1102 and db1095 stopped at the same position
[07:35:56] stopping and resetting the s4 thread on db1095 (noting down the position first)
[07:37:36] but if you stop db1064
[07:37:51] that is trivial, because it should be the same value on both hosts
[07:37:57] yeah it is :)
[07:38:25] I just reset the s4 thread on db1095 to avoid any issues (remember that db1095 has a unique binlog)
[07:38:41] so if replication gets started by mistake on s4 on db1095, s4 stuff will get into the binlog
[07:38:55] and would collide with the other thread coming from db1102 once it is set up
[07:38:55] sure
[07:39:07] ok, so let me build the change master command
[07:40:32] change master to master_host='db1102.eqiad.wmnet', master_user='repl', master_password='xxxx', master_port=3314, master_log_file='db1102-bin.000010', master_log_pos=784914483
[07:40:40] that would be for labsdb1009, 10 and 11
[07:40:42] can you double check?
[07:41:23] actually
[07:41:32] change master 's4' to master_host='db1102.eqiad.wmnet', master_user='repl', master_password='xxxx', master_port=3314, master_log_file='db1102-bin.000010', master_log_pos=784914483
[07:43:17] did you run the check_private_data script on s4?
[07:43:21] yup
[07:43:33] and also manually checked the user table for the new users
[07:43:39] and they were created correctly sanitized
[07:44:45] then it looks good
[07:45:13] ok, going to run it then
[07:46:46] ok, labsdb servers changed
[07:46:58] going to recheck that replication is reset on db1095 s4 and start replication on db1064
[07:47:56] next step is to set sql log bin 0
[07:48:02] and drop the database
[07:48:10] yep
[07:48:22] let's do that after we have confirmed replication works fine
[07:48:53] ok, s4 is confirmed gone on db1095
[07:48:57] going to start replication on db1064
[07:49:26] done
[07:50:11] labsdb servers catching up fine
[07:50:12] strangely, s1 got delayed too
[07:50:49] on the labs hosts?
[07:51:36] yes
[07:52:54] that is strange
[07:53:04] did you check it via pt-heartbeat or show slave status?
[07:53:25] https://tools.wmflabs.org/replag/
[07:53:37] ah
[07:53:55] it could be labs-only lag, not its parents
[07:54:19] why is the lag on s4 increasing if it is actually 0
[07:55:10] not on labsdb1001
[07:55:19] Ah right
[07:57:53] i think we can drop s4 from db1095
[08:18:05] marostegui: I do not see an s4 connection on labsdb1001
[08:18:32] I see s2, s6, s7 and db1095
[08:18:36] 1001?
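(For reference, the repointing discussed above amounts to roughly the following on each labsdb replica. This is a sketch only, assuming the MariaDB multi-source / named-connection syntax settled on at 07:41; the replication password is redacted exactly as it was in the log.)

    -- On each labsdb replica being repointed to db1102 (s4):
    STOP SLAVE 's4';

    CHANGE MASTER 's4' TO
      master_host='db1102.eqiad.wmnet',
      master_user='repl',
      master_password='xxxx',            -- redacted, as in the log
      master_port=3314,
      master_log_file='db1102-bin.000010',
      master_log_pos=784914483;

    START SLAVE 's4';

    -- Check the new connection before starting replication again on db1064:
    SHOW SLAVE 's4' STATUS\G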
[08:18:54] Right
[08:19:01] I forgot 1001 and 1003
[08:19:06] Let me do those
[08:19:14] what do you mean?
[08:19:18] I forgot they already replicate from the new sanitariums
[08:19:23] I didn't issue the change master there
[08:19:30] ah, ok
[08:19:42] so you did 9/10/11
[08:19:46] yep
[08:19:47] but not 1 and 3
[08:19:55] correct
[08:20:11] 1001 done
[08:20:36] and 1003 done
[08:20:44] I basically forgot that we moved those already to the new sanitariums
[08:22:44] I am going to remove db1015 from tendril
[08:22:50] thanks
[08:23:00] I am going to drop s4 from db1095
[08:23:05] with log-bin 0
[08:23:20] yeah, careful here
[08:24:06] you can even stop labsdb1010 just in case
[08:24:18] yeah, i was doing that now XD
[08:24:25] 1010 is not in use
[08:25:13] root@db1095[mysql]> set session sql_log_bin=0;
[08:25:14] Query OK, 0 rows affected (0.00 sec)
[08:25:19] here we go with the drop
[08:25:35] drop database if exists commonswiki;
[08:28:25] done
[08:28:41] disk space alert gone
[08:28:59] replication working fine
[08:30:23] I am thinking of deleting s1 and s4 from dbstore2001
[08:30:32] they are on dbstore2002
[08:30:41] or maybe s2?
[08:30:53] and leave s1 with redundancy?
[08:30:54] s2 and s3 have very low io requirements
[08:31:22] s1 is more important, but I think the issue is between s1, s4, s5 and s7
[08:31:41] Yeah, I don't think we thought s7 had that many iops
[08:31:44] (or at least i didn't)
[08:32:18] I don't think it does, it is just the last one to arrive
[08:32:31] plus we need extra for backups
[08:33:06] alternatively, we can try ROW based replication
[08:33:15] and see if that helps reading less
[08:34:42] I mean, my proposal is not something I would like to do
[08:34:52] but it is something I think we have to do
[08:35:00] yeah, we have to drop stuff
[08:35:01] definitely
[08:35:47] s7 may be recovering, but 2 hours per day, and it may be a temporary status
[08:35:56] and that is with s4 down
[08:36:03] yeah
[08:36:06] which is now 24 hours behind
[08:36:15] let's get rid of s4
[08:37:22] maybe s4 is enough?
[08:37:29] it is not right now
[08:37:36] oh yeah it is stopped
[08:37:36] except for the memory usage
[08:38:04] did you split its memory across the others? especially s7? or is it just stopped?
[08:38:18] it is stopped
[08:39:10] but I do not think 2 GB of extra memory will make a difference
[08:40:35] i guess not, no
[08:40:43] let's move another of the big ones then
[08:40:46] as you said
[08:40:57] I can do a last try at changing innodb parameters
[08:41:01] to flush less often
[08:41:11] sure, worth trying i would say
[08:49:52] 10DBA, 10Data-Services: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#3540367 (10Marostegui)
[08:49:54] 10DBA, 10Patch-For-Review: Migrate s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996#3540365 (10Marostegui) 05Open>03Resolved This has all been done and commonswiki dropped from db1095 and disk space is back to normal values. Replication remains stopped on labsdb1010 (not in use) and I h...
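(Put together, the cleanup quoted above is roughly the following sequence on db1095; a sketch only. The session-scoped sql_log_bin=0 keeps the DROP out of db1095's binlog so it never reaches the labsdb hosts downstream, and labsdb1010 was additionally stopped beforehand just in case.)

    -- On db1095 only, in a single session so the setting applies to the DROP:
    SET SESSION sql_log_bin = 0;

    -- commonswiki (s4) is now served from db1102, so it can be removed here.
    DROP DATABASE IF EXISTS commonswiki;

    -- sql_log_bin is session-scoped; other sessions keep logging normally.
    SET SESSION sql_log_bin = 1;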
[08:54:37] 10DBA, 10Wikidata, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Sprint: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460#3540391 (10Ladsgroup)
[08:56:49] 10DBA, 10Data-Services: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#3540393 (10jcrespo) p:05Triage>03Low
[08:56:55] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Operations, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3540394 (10Marostegui) Maybe we should consider this fixed? it has not happened again since Thursda...
[08:58:40] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Operations, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3540397 (10Ladsgroup) 05Open>03Resolved
[08:58:58] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Operations, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276584 (10Ladsgroup) Thanks! Feel free to reopen in case it started to happen again.
[09:00:13] 10DBA, 10Operations, 10Wikidata, 10Wikidata.org: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3540402 (10Marostegui) This has not happened since T164173 got fixed I believe, so maybe it was indeed a direct cause.
[09:03:30] 10DBA, 10Operations, 10Wikidata, 10Wikidata.org: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3540404 (10jcrespo) 05Open>03Resolved a:03jcrespo Resolving for now.
[09:03:35] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540407 (10Marostegui) @RuyP let me know when you start
[09:04:02] 10DBA, 10Operations, 10Wikidata, 10Wikidata.org: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3540408 (10jcrespo) a:05jcrespo>03daniel
[09:38:56] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540507 (10RuyP) >>! In T173159#3540407, @Marostegui wrote: > @RuyP let me know when you start [[ https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/DarwIn | Special:...
[09:39:45] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540510 (10Marostegui) >>! In T173159#3540507, @RuyP wrote: >>>! In T173159#3540407, @Marostegui wrote: >> @RuyP let me know when you start > > [[ https://meta.wikimedia.org/wi...
[09:44:20] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540535 (10Marostegui) Commons is done, and some lag is being generated. I have eased the transactional options a bit for the slowest hosts. Next big one is ptwiki on s2
[10:08:05] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540597 (10Marostegui) And s2 also got a lag spike on the slower slaves when ptwiki was done. But it is now gone. The pending wikis are pretty small, so we should be good.
[10:18:29] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540606 (10Marostegui) All done - feel free to close this whenever you like. Thanks for taking the time to do this in a different timezone from yours!
[10:20:54] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540608 (10RuyP) >>! In T173159#3540606, @Marostegui wrote: > All done - feel free to close this whenever you like. > Thanks for taking the time to do this in a different timezo...
[10:21:16] 10DBA, 10Wikimedia-Site-requests: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159#3540609 (10RuyP) 05Open>03Resolved
[10:26:11] 10DBA, 10Analytics, 10Contributors-Analysis, 10Chinese-Sites, 10Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3540619 (10Marostegui) Hi, Is there anything pending here?
[10:28:32] 10DBA, 10Analytics, 10Contributors-Analysis, 10Chinese-Sites, 10Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3540631 (10JAllemandou) @Marostegui : We're waiting for august run to happen (at the first days of september) before closing, but we exp...
[10:30:22] 10DBA, 10Analytics, 10Contributors-Analysis, 10Chinese-Sites, 10Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3540632 (10Marostegui) Excellent! Thanks for the update!
[10:36:27] 10DBA, 10Operations, 10ops-eqiad: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3540652 (10Marostegui) I would like to propose db1076 (s2) as a candidate host to do the test once db1078 is back in the pool with the new disk. db1076 belongs to s2 and there are two more powerful hosts ther...
[10:38:34] 10DBA, 10Cloud-Services, 10Cloud-VPS, 10Striker: Investigate moving labsdb (replicas) user credential management to 'Striker' (codename) - https://phabricator.wikimedia.org/T140832#3540654 (10jcrespo) @chasemp I do not want to push for this, but I suspect this may already be done and should be resolved? Ca...
[11:25:39] 10DBA, 10Operations, 10ops-eqiad: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3540766 (10Cmjohnson) The disk was finally sent. HP added another report they wanted in addition to the AHS log. That report would have required powering the server off which is ridiculous for a failed disk....
[11:27:21] 10DBA, 10Operations, 10ops-eqiad: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3540772 (10Marostegui) >>! In T173365#3540766, @Cmjohnson wrote: > The disk was finally sent. HP added another report they wanted in addition > to the AHS log. That report would have required powering the se...
[12:55:23] marostegui: around?
[12:55:34] o/
[12:59:17] can you take a look at the backups document
[12:59:21] yes
[12:59:23] I have written quite a log
[12:59:26] *lot
[12:59:30] not removed anything
[12:59:32] great!
[12:59:33] ok
[12:59:37] let me see
[12:59:45] but changed the structure of what you wrote
[12:59:49] moving it to architecture
[13:00:08] and focusing on methodology on one side and architecture on another
[13:00:18] I think most of what you wrote was about architecture
[13:00:22] yeah
[13:00:38] it may need more fixes, because there was some overlap
[13:00:51] "methods of backing up"
[13:00:53] vs
[13:01:04] "architecture of how to do those"
[13:01:06] vs.
[13:01:09] purchases
[13:02:20] it doesn't have to be perfect, however
[13:02:28] just good enough to send it to the list
[13:03:33] the images may need a fix, as only you may have the sources
[13:04:52] hopefully you like at least the general idea
[13:05:05] yeah, i like it
[13:05:07] still reading :)
[13:05:19] yeah, I have not even read it myself
[13:05:44] haha
[13:05:47] the idea is that methodology and hosts
[13:05:56] can be independent
[13:06:01] so a single storage host
[13:06:06] I added more stuff today by the way
[13:06:14] can be used to both store snapshots
[13:06:23] and xtrabackup
[13:06:31] so I separated the logic from the architecture
[13:06:45] although it may need more work as some things may be duplicated
[13:06:59] feel free to alter as you go
[13:07:12] I think it makes sense to have the proposal and then each architecture
[13:07:14] instead of mixing it
[13:07:28] they are related
[13:07:33] but not 100%
[13:07:53] my biggest concern is that after learning more about requirements
[13:07:59] and future requirements
[13:08:06] we have those in the table
[13:15:44] What do you think about adding a section with the proposal we'd like to have if we had no resource limits?
[13:15:58] pfff
[13:16:04] xddddd
[13:16:08] not sure if worth it
[13:16:22] so, which proposal would you go for now?
[13:16:25] after reading that doc
[13:16:35] plus it is already there, if you choose the most expensive option
[13:16:49] I only like snapshots because they are cheap
[13:17:11] if space wasn't a problem, I would like to have pre-compressed tar.gz
[13:17:24] so provisioning is solved automatically and TTR is lower
[13:17:31] so which one is that?
[13:17:36] the thing is I do not think that is possible
[13:17:44] that would be a combination of
[13:17:59] binary tarballs
[13:18:22] and multiple hosts with multiple block storages
[13:18:31] (I am just being annoying, to see if we both are ready for the meeting)
[13:18:47] but where mysql and disks are separated
[13:19:06] so, let me see if I am getting you right here
[13:19:17] so copies are done regularly to one or several remote disk-hosts
[13:19:19] We are talking about #2 but getting binary tarballs somewhere else?
[13:19:29] and from there provisioned to the production hosts
[13:19:47] but I do not think that is possible for several reasons
[13:19:51] disk waste
[13:19:54] and technology
[13:20:17] snapshotting would be the compromise
[13:20:33] but I do not like the idea of starting and stopping hosts every 8 hours?
[13:21:10] Yeah, that is not ideal
[13:21:15] so I really do not come with predefined options
[13:21:17] if I did
[13:21:22] I wouldn't ask for a meeting
[13:21:33] the only one I do not like is the delayed slave
[13:21:44] but I leave it open in case no hardware is available
[13:21:49] at all
[13:22:02] all the others depend on purchases
[13:22:13] so I just did the numbers of space needed
[13:22:23] and hosts available now
[13:23:15] For me, I would like option #2 + a server with just disk (maybe re-use dbstore servers) to hold the snapshots (or tar.gz, whichever way you want to call it) for X weeks
[13:23:34] #2?
[13:23:44] let's rename the options with letter
[13:23:45] s
[13:23:47] haha
[13:24:00] The option of having one host per shard replicating
[13:24:11] and being the source of backups or snapshots
[13:24:22] writing to its own disk
[13:26:26] but you are saying that only because you are compromising on disk
[13:26:51] I was assuming there were no resource limits
[13:27:01] so multiple copies are ok
[13:27:58] that is true
[13:28:01] I am cheating a bit
[13:28:02] :)
[13:28:11] if there were no disk constraints
[13:28:21] a pre-compressed tar would be faster
[13:28:29] oh, indeed
[13:28:30] totally
[13:28:38] there could be a compromise between both
[13:28:54] like snapshots for the latest generated one
[13:28:57] sorry
[13:29:05] tar.gz of the latest one
[13:29:11] for provisioning purposes
[13:29:14] yeah
[13:29:17] and snapshots for the past
[13:29:27] that would cover the most common usages
[13:29:36] in all cases
[13:29:42] but for the precompressed tar, we'd need to either stop mysql or go through innodb recovery on the new host that would use that tar.gz to get provisioned
[13:29:44] what I do not like
[13:29:52] is stopping mysql
[13:29:55] exactly
[13:30:00] which would be required in all cases
[13:30:07] well, there is one way to avoid it
[13:30:09] that I do not like
[13:30:13] migrate to mysql and use xtrabackup :)
[13:30:24] or test mariabackup
[13:30:30] that too
[13:30:37] which is on the latest 10.1 wmf package
[13:30:50] the thing is
[13:31:08] sometimes the most optimized way is not necessarily the best
[13:31:19] see multisource vs. multiinstance
[13:31:57] yeah, that was a good lesson
[13:32:14] do you want to give it a pass?
[13:32:19] to the document
[13:32:22] let me read it again
[13:32:25] before I send it?
[13:32:28] ok
[13:32:40] I will read the whole document again XD
[13:32:53] I did not touch much of the other stuff
[13:33:06] just a couple of typos or errors
[13:33:33] we didn't discuss much about the bacula vs failesystem vs. something else
[13:33:34] *file
[13:33:44] yeah, but I think that is all blocked
[13:40:40] I am done
[13:40:45] I think it can be sent
[14:38:07] 10DBA, 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Global rename: Opdire657 → Sakiv; supervision needed - https://phabricator.wikimedia.org/T173834#3541365 (10MarcoAurelio)
[14:48:04] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 10TCB-Team, and 5 others: Allow setting the watchlist table to read-only on a per-wiki basis - https://phabricator.wikimedia.org/T160062#3541444 (10Tobi_WMDE_SW)
[15:20:25] 10DBA, 10Cloud-Services, 10Cloud-VPS, 10Striker: Investigate moving labsdb (replicas) user credential management to 'Striker' (codename) - https://phabricator.wikimedia.org/T140832#3541630 (10bd808) This is still wishlist status on the #striker implementation side. The idea is to replace the current servic...
[15:21:22] 10DBA, 10Cloud-Services, 10Cloud-VPS, 10Striker: Investigate moving labsdb (replicas) user credential management to 'Striker' (codename) - https://phabricator.wikimedia.org/T140832#3541651 (10jcrespo) Oh, sorry. So it is done, but not by striker. Sorry for the confusion.
[18:31:46] 10DBA, 10Operations, 10Wikimedia-Site-requests: Papa1234 → Karl-Heinz JansenPapa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3542365 (10Steinsplitter)
[18:36:31] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3542408 (10Steinsplitter)