[05:23:43] 10DBA, 10Operations: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385 (10Marostegui) 05Open→03Resolved >>! In T57385#5310801, @ArielGlenn wrote: > Um, it has? I just found it on meta, though empty. > > wikiadmin@1...
[05:23:49] 10DBA, 10Epic, 10Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui)
[05:34:38] 10DBA, 10Data-Services: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (10Marostegui) After this big batch of wiki compression only 3555 tables were left to be compressed - I am now trying to compress medium size wikis, between 20GB and 100GB (a total of 3000 tables)...
[05:48:43] 10DBA, 10Patch-For-Review: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062 (10Marostegui) I have restarted db1109 to pick up STATEMENT as a binlog format. db1109 will be the candidate master once db1104 (current candidate master) get...
[08:03:28] performance_schema | OFF on all pc* hosts, I just realized
[08:05:57] I wonder if it was ever enabled on the old ones
[08:15:59] I am not touching that for the moment, but I am installing sys on all hosts
[08:16:07] for consistency
[08:16:10] sounds good
[08:16:24] even if sys.processlist will return 0 rows with it disabled
[08:16:39] because now there are hosts on the same section, some with it installed
[08:16:42] and some without it
[08:17:18] or even instances on the same hosts
[08:23:31] db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0993
[08:24:24] \o/
[08:24:29] what was the issue with the snapshots btw?
[08:24:31] I didn't check
[08:24:36] was it a metadata issue?
[08:25:06] I think backups get stuck on db2097
[08:25:10] the host with memory issues
[08:25:17] Ah right
[08:25:28] The DIMM is supposed to arrive today as per the last message from papaul
[08:25:33] and that creates an exception
[08:26:12] maybe I should move s3 backups higher
[08:26:14] as it takes more time
[08:26:21] higher == earlier
[08:26:41] so it doesn't "collide"?
[08:27:10] also because the order is not perfect, and it may end up executing that on a single thread
[13:28:11] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/521266/
[13:36:53] and for the x1 codfw failover: https://gerrit.wikimedia.org/r/#/c/521273/
[14:18:33] let's do x1?
[14:18:39] when you are around
[14:19:03] one sec
[14:21:29] sure!
[14:36:03] see my amend, it is the best way to explain :-)
[14:36:36] Aaaaah!!!
[14:36:38] yes!
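For context, here is a minimal sketch of the per-host audit being described: checking whether performance_schema is enabled and whether the sys schema is installed on each instance. It assumes pymysql and a read-only account; the host list and credentials are illustrative, not the real inventory.

    # Sketch: audit performance_schema and the sys schema across hosts.
    # Host names and credentials below are placeholders, not real inventory.
    import pymysql

    HOSTS = ['pc1007', 'pc1008', 'pc1009']  # hypothetical pc* host names

    for host in HOSTS:
        conn = pymysql.connect(host=host, user='readonly', password='...',
                               connect_timeout=5)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT @@performance_schema")
                p_s = cur.fetchone()[0]
                cur.execute("SHOW DATABASES LIKE 'sys'")
                has_sys = cur.fetchone() is not None
            # sys can be installed with p_s disabled, but views such as
            # sys.processlist will then return 0 rows.
            print(f"{host}: performance_schema={'ON' if p_s else 'OFF'} "
                  f"sys_installed={has_sys}")
        finally:
            conn.close()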
[14:36:41] I see :)
[14:36:53] there were still some mistakes in the config
[14:36:54] thanks :)
[14:36:58] I haven't checked all
[14:37:15] I've also left m4 with it disabled
[14:37:20] yeah
[14:37:21] as it was explicitly like that
[14:37:26] it will go away anyway
[14:37:43] see now if you get what I meant
[14:38:00] Yeah
[14:38:02] sometimes it is easier to code than to try to explain
[14:38:04] Now I do :)
[14:38:07] definitely
[14:38:11] the template has an if-then-else
[14:38:16] on purpose
[14:38:20] I am running a puppet compiler
[14:38:23] that enables user_stats
[14:38:28] when it disables p_s
[14:38:35] yeah, I wasn't sure if that was actually intended or what
[14:39:02] the "# we are testing impact of performance schema" comment
[14:39:07] wasn't intended
[14:39:39] I am going to merge, puppet compiler looks good
[14:39:43] ok?
[14:40:24] up to you, it just needs your review
[14:40:32] yeah, I just checked :)
[14:40:33] and it should be a safe change
[14:41:04] Going to apply it, run puppet and restart db1132 and then we can move to x1
[14:43:21] when you finish, I thought a bit about https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/521226
[14:43:41] and I think we should apply it (although it doesn't have to be today)
[14:43:49] because this is a switchover script
[14:43:59] not a failover one (in my way of speaking)
[14:44:19] when there is a failover one, flexibility should be there indeed
[14:44:34] but not for a "best practices" scheduled script
[14:47:26] Yeah, I know
[14:47:41] I guess the cases where we'll go for a master with a different binlog format are kinda reduced
[14:49:20] an option could be added
[14:49:29] but I hate too many options
[14:49:33] yeah, I think we need the --force at some point
[14:49:37] but probably not needed now
[14:49:44] let's play with x1 codfw then? :)
[14:49:47] a single --force I would be ok with
[14:49:54] also maybe a --test-only
[14:50:04] yeah maybe
[14:50:07] that's a good one
[14:51:22] so, db2069 will be the new codfw master for x1
[14:51:29] so I will tell you what I want to do
[14:51:33] and you tell me if it is correct
[14:51:38] 10DBA, 10MediaWiki-General-or-Unknown, 10Core Platform Team Kanban (Clinic Duty Team): Investigate query planning in MariaDB 10 - https://phabricator.wikimedia.org/T85000 (10Anomie) 05Open→03Declined > which may kill performance for a lot of API queries. We should figure out if that's true I guess it wa...
[14:52:12] I will do it in the same way I will be doing it tomorrow with m2 in eqiad
[14:52:38] so first: move_replica.py db2115 db2069
[14:52:54] and that with all the codfw slaves to put them under db2069
[14:53:57] once that is done and all the slaves are under db2069 I plan to do: switchover.py --replicating-master --read-only-master db2045 db2069
[14:54:04] db2045 is the current master
[14:54:30] please do, downtime if you need something
[14:54:37] yeah of course
[14:54:39] and let's check coords and compare later
[14:54:47] but the above sounds good, right?
[14:55:16] yes
[14:55:19] cool
[14:55:27] I will let you know how it goes :)
[14:55:28] wait
[14:55:32] waiting
[14:55:37] if it is in 2 steps
[14:55:44] it is --only-slave-move
[14:55:48] and later --skip-slave-move
[14:56:18] --only-slave-move?
[14:56:22] move_replica doesn't have that
[14:56:32] or do you want to use switchover for both steps?
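A rough sketch of the --test-only / --force pattern floated above, in argparse form; the flag names and the check function here are assumptions for illustration, not wmfmariadbpy's actual CLI:

    # Sketch of a "check by default, override explicitly" switchover CLI.
    # Flags and checks are hypothetical, not the real wmfmariadbpy interface.
    import argparse
    import sys

    def run_checks(master, replica):
        """Return a list of problems; empty means it is safe to proceed."""
        problems = []
        # e.g. compare binlog_format, server versions, replication lag here
        return problems

    def main():
        parser = argparse.ArgumentParser(description='switchover sketch')
        parser.add_argument('master')
        parser.add_argument('replica')
        parser.add_argument('--test-only', action='store_true',
                            help='run all checks, then exit without switching')
        parser.add_argument('--force', action='store_true',
                            help='proceed even if checks fail (use with care)')
        args = parser.parse_args()

        problems = run_checks(args.master, args.replica)
        for p in problems:
            print(f'[WARNING] {p}')
        if args.test_only:
            sys.exit(1 if problems else 0)
        if problems and not args.force:
            print('[ERROR] checks failed; rerun with --force to override')
            sys.exit(1)
        # ... the actual switchover steps would follow here ...

    if __name__ == '__main__':
        main()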
[14:56:36] --only-slave-move is for switchover
[14:56:44] it basically calls move()
[14:56:48] Ah ok
[14:56:59] it just handles the "what" on its own
[14:57:12] so I can do switchover.py --only-slave-move db2115 db2069
[14:57:15] they call essentially the same method in the end
[14:57:19] cool
[14:57:54] I will add these commands to the etherpad checklist for tomorrow too
[14:58:00] but switchover does all checks
[14:58:09] great - I will use that then
[14:58:10] just to be sure
[14:58:28] but it wouldn't really be a huge difference
[14:58:42] although I think using switchover may be easier
[14:58:45] totally
[14:58:48] better to use just one
[15:00:14] although --only-slave-move doesn't ask for confirmation
[15:01:03] I think I am not understanding the essence of this correctly
[15:01:10] Let's see if you can help me!
[15:01:19] So, I want to move db2115 under db2069
[15:01:34] As per the usage, I understand I need to use: switchover.py --replicating-master --read-only-master --only-slave-move db2069.codfw.wmnet db2115.codfw.wmnet
[15:01:52] no, switchover always takes the current master and replica
[15:02:01] and automatically searches for the current master's replicas
[15:02:21] it is like step 1/2 and step 2/2
[15:03:06] So how do I combine switchover.py with a topology change?
[15:03:18] Ah, I think I get it
[15:03:24] db2045 db2115
[15:03:27] I think it is
[15:03:37] No, I guess it is db2045 db2069
[15:03:40] db2069 is the new master
[15:03:43] and db2045 is the current
[15:03:44] sorry
[15:03:56] then that, and it will move the other 2 automatically
[15:04:00] And it will move everything under db2069
[15:04:02] yeah
[15:04:02] right
[15:04:04] lovely!
[15:04:11] it is "easier"
[15:04:22] but maybe not obvious
[15:04:28] it is a lot easier
[15:04:34] I am just used to repl.pl :)
[15:04:48] with move you would have to do every one individually
[15:04:56] yeah indeed
[15:04:58] so, here we go!
[15:05:01] and it doesn't do the semisync and other stuff
[15:05:38] (and that is why the strict checks are handy :-)
[15:06:55] SUCCESS: All slaves moved correctly, but not continuing further because --only-slave-move
[15:06:56] I was running ./replication_tree.py db2045
[15:07:05] topology looks good
[15:07:05] and got the whole process printed
[15:07:08] going to do some checks
[15:08:24] it looks good :)
[15:09:26] ok, so going for: switchover.py --skip-slave-move db2045 db2069 then
[15:09:36] look:
[15:09:57] https://phabricator.wikimedia.org/P8723
[15:10:07] probably it would be nice to print that on every step :-)
[15:10:24] hahaha nice progress bar
[15:10:39] so the move is the actual dangerous new thing
[15:10:51] the other steps have been virtually untouched
[15:11:00] and it worked nicely!
[15:11:03] well done!
[15:11:07] default timeout?
[15:11:10] yeah
[15:11:13] I didn't change it
[15:11:22] reducing it and the parallel stuff should get better times
[15:11:38] yeah, but I wanted to see all the checks slowly as we had no rush really
[15:11:41] +1
[15:11:49] just as an FYI
[15:12:04] doing the failover itself now
[15:12:35] switchover :-D
[15:12:40] :p
[15:12:51] ok, done, got an exception
[15:12:56] oh
[15:13:00] but looks like it worked, let me check
[15:13:24] at the very end I got: https://phabricator.wikimedia.org/P8724
[15:13:54] interesting
[15:13:59] is replication set up?
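As described above ("it basically calls move()"), the --only-slave-move step amounts to iterating over the current master's replicas and moving each one under the candidate. A rough sketch with guessed function and attribute names, not the real wmfmariadbpy API:

    # Sketch of the --only-slave-move step (step 1/2 of the switchover).
    # move() stands in for the shared helper that move_replica.py also uses;
    # all names here are illustrative guesses, not the real code.
    def move(replica, new_master, timeout=60):
        """Stand-in for the real topology-change helper."""
        raise NotImplementedError

    def only_slave_move(old_master, new_master, timeout=60):
        for replica in old_master.replicas():  # hypothetical accessor
            if replica == new_master:
                # the candidate keeps replicating from the old master
                # until the second step (--skip-slave-move) promotes it
                continue
            move(replica, new_master, timeout=timeout)
        print('SUCCESS: All slaves moved correctly, '
              'but not continuing further because --only-slave-move')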
[15:14:03] yeah
[15:14:05] I ran this: ./switchover.py --replicating-master --read-only-master --skip-slave-move db2045.codfw.wmnet db2069.codfw.wmnet
[15:14:20] replication was set up correctly from what I can see
[15:14:21] on both old and new master?
[15:14:38] yeap
[15:14:42] no GTID enabled though
[15:14:49] on the old master
[15:15:02] yeah, that happens later
[15:15:12] yeah
[15:15:19] Same for zarcillo, and all that happens later too
[15:16:25] will look at it when I finish the stop of the host
[15:16:32] no worries
[15:16:42] I will update zarcillo and tendril and change the events
[15:18:24] db2101:3320, version: 10.1.39, up: 31d, RO: ON, binlog: STATEMENT
[15:18:34] although that is not new
[15:18:52] (the others are ROW)
[15:19:16] probably a leftover from all the servers setup
[15:19:21] I will get that fixed
[15:19:24] thanks for reporting it
[15:20:03] so I think it could be something that is expected to be set
[15:20:15] when not run with --skip
[15:22:52] so, for tomorrow, how do you feel we should operate?
[15:23:02] I think we can still use the new switchover
[15:23:57] could you paste line 271 of the version you used?
[15:24:39] sure
[15:24:51] if not result['success']:
[15:24:51] print('[ERROR]: Original GTID mode was not recovered on the new master')
[15:24:51] return 0
[15:25:11] oh, I see
[15:25:14] silly mistake
[15:25:20] that function returns a bool
[15:25:25] not a result set
[15:25:28] easy fix
[15:26:47] sadly, I had not tested that combination
[15:28:38] because normally a real master does not have gtid enabled
[15:29:08] indeed
[15:29:09] you were using 58efc9201945e63b5ea21a2, right?
[15:29:22] so to rebase the patch on top of that
[15:29:59] commit 58efc9201945e63b5ea21a2fab3d0be72ca93771
[15:30:02] so yeah
[15:32:52] https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/521300/1/wmfmariadbpy/switchover.py
[15:32:55] ^fixed
[15:33:07] we can try it twice again :-D
[15:33:22] haha nice fix!
[15:35:22] funnily, I had been bitten by this on move_replica.py
[15:39:11] 10DBA, 10Operations, 10ops-codfw: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10Papaul) 05Open→03Resolved Before {F29709098} After {F29709100} This is complete return tracking information {F29709113}
[15:39:15] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10Papaul)
[15:39:36] 10DBA, 10Operations, 10ops-codfw: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) 05Resolved→03Open a:05Papaul→03jcrespo ` Mem: 515690 ` HW seems to be fixed, owning for the followup (software) st...
[15:39:40] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10jcrespo)
[15:40:14] 10DBA, 10Operations: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo)
[15:41:00] marostegui: if you update to cff87cd we should be good to go
[15:41:16] doing it now
[15:41:45] good
[15:41:47] done
[15:42:32] 10DBA, 10Operations, 10ops-codfw, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) @Marostegui Thanks
[15:44:11] want to do another try?
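The line-271 failure discussed above, reduced to a self-contained snippet; set_gtid_mode here is a stand-in for the real helper, which returns a bool rather than a result dict:

    # Minimal reproduction of the bug: the helper returns a bool,
    # but the caller indexed it as if it were a result dict.
    def set_gtid_mode(enabled):
        # ... talk to the server here ...
        return True  # bool, not {'success': ...}

    result = set_gtid_mode(False)

    # Buggy check: indexing a bool raises
    # "TypeError: 'bool' object is not subscriptable" at runtime.
    # if not result['success']:
    #     print('[ERROR]: Original GTID mode was not recovered on the new master')

    # Fixed check: treat the return value as the boolean it is.
    if not result:
        print('[ERROR]: Original GTID mode was not recovered on the new master')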
[15:44:23] sure
[15:44:33] I will do the same 2 steps
[15:44:38] ok
[15:45:47] moving slaves now
[15:47:29] slaves moved
[15:47:31] doing some checks
[15:47:54] at least the new thing seems stable
[15:48:09] handling gtid was also a new thing
[15:48:29] ok, let's go for the switchover
[15:48:35] the topology looks good
[15:48:57] worked like a charm!
[15:49:21] did it update zarcillo and tendril?
[15:49:26] 10DBA, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata.org, 10Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (10ArielGlenn) We don't really have daily stats of any sort in grafana but maybe we should (or have a place for it to live in any case...
[15:49:31] It said it did, I am manually checking
[15:49:38] he he
[15:49:44] do not trust the code
[15:49:53] haha just checking! it is the first time we use it!
[15:49:59] indeed
[15:50:25] worked fine!
[15:50:26] :)
[15:50:30] congratulations :*
[15:50:33] :-D
[15:50:57] we should leave compare.py running though
[15:51:14] and saving the binlogs (although that will also be in the logs)
[15:51:36] although with read only mode it is "easy"
[15:52:59] can I edit the etherpad then?
[15:53:06] yeah
[15:53:11] I have moved stuff to manual steps
[15:53:18] but please, take a look and check it
[15:53:26] and same for the patch
[15:53:45] I have removed the disable gtid
[15:53:50] as it is done automatically
[15:53:50] great
[15:55:37] So I can see how things like row format and version checks could be annoying
[15:55:54] but then it is the question of whether we could trust a --force
[15:56:28] let's leave it aside for now
[15:56:31] yeah
[15:56:39] I think it is useful in some situations
[15:56:40] and we should find a way to make things both safe and flexible
[15:56:42] at least to have it available
[15:56:55] yes, but on the other side, not checking could lead to human error
[15:57:03] not an easy compromise
[15:57:13] yeah, by default it should check and complain
[15:57:20] but something like --i-know-what-i-am-doing!
[15:57:39] but if we use force all the time, checks are worthless
[15:57:53] no, it will be the default
[15:58:17] I don't know, I personally need to think more about the best way to proceed
[15:58:34] yeah
[15:58:38] we can continue discussing
[16:22:42] 10DBA, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata.org, 10Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (10Addshore) So we (WMDE) actually made a daily namespace in graphite for tracking data daily instead of minutely. We have been using...
[17:04:43] 10DBA, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata.org, 10Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (10ArielGlenn) Are there visible graphs for these? Looking a bit at the repo scripts now. I guess we'd want something in src/commons/c...
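On the T68025 thread, a minimal sketch of pushing a daily table-size metric into a Graphite "daily" namespace, in the spirit of what Addshore describes; the host names, metric path, and credentials are all assumptions:

    # Sketch: send per-table sizes to Graphite's plaintext port once a day.
    # All host names, credentials and metric paths here are placeholders.
    import socket
    import time

    import pymysql

    GRAPHITE_HOST = 'graphite.example.org'  # placeholder
    GRAPHITE_PORT = 2003                    # Graphite plaintext protocol

    conn = pymysql.connect(host='db-replica.example.org', user='readonly',
                           password='...', database='information_schema')
    with conn.cursor() as cur:
        cur.execute("""SELECT table_name, data_length + index_length
                       FROM tables WHERE table_schema = %s""", ('wikidatawiki',))
        rows = cur.fetchall()
    conn.close()

    now = int(time.time())
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        for table, size in rows:
            # Graphite plaintext format: "<metric.path> <value> <timestamp>\n"
            sock.sendall(f'daily.wikidata.db.{table}.size {size} {now}\n'.encode())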