[04:43:53] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui)
[04:44:09] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui)
[04:44:11] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui)
[05:05:31] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui)
[05:05:45] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui)
[05:05:59] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui)
[05:54:55] 10DBA, 10wikitech.wikimedia.org: Ensure labswiki and labtestwiki are up to date with MW schema changes - https://phabricator.wikimedia.org/T200140 (10Marostegui) 05Open>03Resolved I am going to close this as resolved as I have applied most of the recent/major schema changes we have done lately - T200140#44...
[06:10:30] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui) s4 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1125 [] db1121 [] db1103 [] db1097 [] db1091 [] db1084 [] db1081 [] db1068
[06:10:43] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui) s4 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1125 [] db1121 [] db1103 [] db1097 [] db1091 [] db1084 [] db1081 [] db1068
[06:10:45] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui) s4 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1125 [] db1121 [] db1103 [] db1097 [] db1091 [] db1...
[06:11:11] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui)
[06:11:25] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui)
[06:11:41] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui)
[07:22:43] thinking of creating a bash script to do all the steps of the critical failover phase except the scap
[07:23:18] to do all the kill, puppet, etc.?
[07:23:38] yeah, with cumin
[07:23:49] that'd be nice
[09:56:55] [ERROR]: The replica is too lagged: 49315 seconds, please allow it to catch up first
[09:57:11] \o/
[09:57:50] (hey, I have to test all possible scenarios)
[09:58:31] Of course!
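The "replica is too lagged" error above comes from the switchover script's preflight checks. A minimal sketch of what such a check might look like follows; the threshold, function name, and exact logic are assumptions for illustration, not the actual wmfmariadbpy code:

```python
# Hypothetical sketch of a replication-lag preflight check, mirroring the
# "[ERROR]: The replica is too lagged" message seen in the log. The threshold
# and function name are invented for illustration.

MAX_ALLOWED_LAG = 5  # seconds; assumed threshold, not the real value


def check_replica_lag(seconds_behind_master, max_lag=MAX_ALLOWED_LAG):
    """Return None if the replica is close enough to switch over,
    otherwise an error string in the style of the log message."""
    if seconds_behind_master is None:
        # Seconds_Behind_Master is NULL when replication is not running
        return "[ERROR]: Replication is not running"
    if seconds_behind_master > max_lag:
        return ("[ERROR]: The replica is too lagged: %d seconds, "
                "please allow it to catch up first" % seconds_behind_master)
    return None
```

In practice the lag value would come from `SHOW SLAVE STATUS` (or a pt-heartbeat table) on the candidate master's replica.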
[09:59:33] :-)
[10:01:08] I thought about a new method of implementing move(), but not enough time to do it now
[10:01:21] I took a look at how orchestrator does it
[10:01:50] and they store on the orchestrator database the equivalency between binlog positions
[10:02:26] I don't want to do that, but I think I can use a similar approach by syncing on pt-heartbeat beats
[10:02:59] yeah, as far as I remember they grab the pseudo-GTID pos and then do all sorts of things with it on the DB
[10:03:10] Things might have changed, that was two years ago :)
[10:04:59] the good thing with pt-heartbeat is that, as long as things are not too lagged, it is easy to find- very few transactions away from the whole second
[10:05:21] you mean between heartbeat ids?
[10:05:49] no, I mean to find the actual file (binlog pos) offset of a particular heartbeat
[10:06:02] it cannot happen before the second it updates
[10:06:02] ah yeah yeah
[10:06:22] so the plan (not now) is to sync based on the latest heartbeat received
[10:06:47] then calculate the last edit as an offset of that
[10:07:28] the problem is different binlog formats- lag at the time of the last update...
[10:07:59] Oh yeah, binlog formats...
[10:08:00] good point
[10:08:25] it is not as easy to solve, so it will require more time at a later date
[10:12:29] I've merged --skip-slave-move, so that part is optional
[10:12:54] documented the specific line to run
[10:13:10] I was checking the etherpad and I saw that, yes :)
[10:14:00] the move was not in my original scope, and now I can see it is not trivial- but at least I learned a lot of things
[10:14:17] I will now try to make everything except mediawiki a single step
[10:15:24] nice!
[11:27:36] Running the maintenance script gives me the slow timer warning (for read, not write) for s1. I already put a ten-second sleep between each batch. Should I put more?
[11:38:16] do you read from the master?
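The heartbeat-sync idea sketched above (find the binlog position of the most recent pt-heartbeat beat, then treat later writes as a small offset from that known point) could look roughly like this. The event-tuple shape and field names are hypothetical; real code would parse `SHOW BINLOG EVENTS` or the binlog itself:

```python
# Illustrative sketch of the pt-heartbeat sync approach discussed in the log.
# Events are assumed tuples of (binlog_file, offset, is_heartbeat, timestamp);
# this shape is invented for the example, not a real wmfmariadbpy structure.

def position_of_latest_heartbeat(events):
    """Return the (binlog_file, offset) of the most recent heartbeat
    event, or None if no heartbeat was seen."""
    latest = None
    for binlog_file, offset, is_heartbeat, ts in events:
        if is_heartbeat and (latest is None or ts >= latest[2]):
            latest = (binlog_file, offset, ts)
    return (latest[0], latest[1]) if latest else None
```

As the conversation notes, the remaining transactions between that beat and the last edit still have to be matched, and differing binlog formats (plus lag at the time of the last update) make that step the hard part.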
[11:38:45] It is ok to read from the master as long as it is done very slowly
[11:39:17] slow here means == let other connections take most of the resources, like the sleeps you mention
[11:55:39] jynus_: it reads from a replica
[12:03:39] then slow reads are normally ok
[12:04:21] however, take into account that from time to time servers are depooled, so it could start failing at any time if the configuration is not reloaded
[12:27:03] jynus: so it would be great if you let me know when you're depooling a node from s1 or s8
[12:27:22] Basically it's here: https://phabricator.wikimedia.org/T193873
[13:29:53] Amir1: I can coordinate with you on that, as I will soon start with either s8 or s1 for other things and I will need to depool hosts
[15:18:20] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[15:20:51] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[15:26:38] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[15:26:52] 10DBA, 10Patch-For-Review: Test database master switchover script on codfw - https://phabricator.wikimedia.org/T199224 (10jcrespo) It works now fully, including pt-heartbeat handling: ``` root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move db1102 db1095 Starting preflight checks......
[15:27:17] marostegui: https://phabricator.wikimedia.org/T199224#4448012
[15:27:22] I am reading it
[15:27:27] * marostegui wants to hug jynus
[15:27:49] how long did it take?
[15:28:14] I got stuck with the heartbeat execution
[15:28:41] because cumin got stuck waiting to close output, etc.
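The "slow reads with sleeps between batches" pattern Amir1 and jynus discuss above can be sketched as follows. The function name and parameters are invented for illustration; the real maintenance script lives in MediaWiki, and the ten-second sleep is the value mentioned in the chat:

```python
import time

# Sketch of the batched-read pattern from the conversation: process rows in
# small batches and sleep between batches so that other connections get most
# of the server's resources. Names and defaults are hypothetical.

def process_in_batches(rows, batch_size=1000, sleep_seconds=10, handle=print):
    """Process `rows` batch by batch, pausing between batches to stay slow."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        handle(batch)
        # Only sleep if there is more work; the pause is what keeps the
        # read load gentle on the (possibly depooled-at-any-time) replica.
        if start + batch_size < len(rows):
            time.sleep(sleep_seconds)
```

As noted in the log, this only stays safe if the script also reacts to configuration reloads, since the replica it reads from can be depooled at any time.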
[15:28:47] Yeah, I read the other channel
[15:29:28] now cumin takes most of the time
[15:29:34] even if it is a very little step
[15:29:36] so it takes like 5 seconds or less?
[15:30:00] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[15:30:38] 2.794s
[15:30:43] :___)
[15:30:53] most of it on the cumin verbose output
[15:31:09] I will check how to reduce that at another time
[15:31:57] I think the main issue is that our arch is not very good
[15:32:18] 10DBA, 10Epic: [META ticket] Automation for our DBs tracking task - https://phabricator.wikimedia.org/T156461 (10Marostegui)
[15:32:20] 10DBA, 10Patch-For-Review: Test database master switchover script on codfw - https://phabricator.wikimedia.org/T199224 (10Marostegui)
[15:32:20] we should improve many things in how heartbeat works, and the mediawiki config, assuming more automation
[15:32:30] Yeah, but this is a great step
[15:32:33] 10DBA, 10Patch-For-Review: Test database master switchover script on codfw - https://phabricator.wikimedia.org/T199224 (10Marostegui) This is brilliant. A great step towards getting the failover automation we have always talked about. Excellent job! Looking forward to seeing this working tomorrow.
[15:33:06] I want to do a last test on codfw- cutting replication from eqiad and setting the master in read-write
[15:33:22] there is a chance for corruption, but I prefer to know it now
[15:33:26] than tomorrow
[15:33:58] It is very unlikely there will be corruption
[15:34:03] Go ahead!
[15:35:06] it is true we rarely see read only errors on kibana
[15:35:15] so it should be very unlikely
[15:36:34] go for it
[15:37:09] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[15:39:41] I am going to do an es2017 -> es2018 failover
[15:39:55] sure
[15:40:00] I am here if you need help
[15:40:24] it will disappear from tendril because I have to disconnect the replica
[15:40:34] (it does not yet handle non-true masters)
[15:40:38] (work for another time)
[15:40:53] sure
[15:42:36] es1014-bin.002506:745885174
[15:43:02] those are the coordinates to later restart the cross-dc replication
[15:43:08] got it
[15:43:38] I don't know if alerts will fire now
[15:44:13] I will wait a second
[15:45:14] they do
[15:45:18] so it works as intended
[15:45:37] \o/
[15:46:26] do you want me to prepare the mediawiki patch and puppet patch for you?
[15:47:02] oh, you want to do the failover for real?
[15:47:16] I wanted to do a switch, and then switch back :-)
[15:47:27] I wasn't sure if you were going to switch back :)
[15:47:30] and not spend much time on it
[15:47:36] also not touch the replica
[15:47:44] coooool
[15:47:44] move it with the current master
[15:48:04] note I am still not happy with move()
[15:48:17] so we will probably do that tomorrow the usual way
[15:48:44] sounds good
[15:51:50] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[15:51:52] so I am going to run ./switchover.py --skip-slave-move es2017 es2018
[15:52:34] I need to set es2017 in read-write just before it
[15:52:41] cool
[15:54:22] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[15:54:31] mysql.py -h es2017 -e "SET GLOBAL read_only=0" && ./switchover.py --skip-slave-move es2017 es2018
[15:54:43] ready?
[15:55:23] go for it
[15:55:29] do a time
[15:55:30] :)
[15:56:09] [ERROR]: We could not start replicating towards the original master
[15:56:22] es2017-bin.002103:916095865 slave: es2018-bin.002107:515019552
[15:56:42] ^that is the sync point
[15:56:50] so it failed to configure replication to es2018?
[15:56:56] I am checking why
[15:57:21] Slave_IO_Running: Connecting
[15:57:31] root@es2018.codfw.wmnet[(none)]> show slave hosts;
[15:57:31] Empty set (0.03 sec)
[15:57:32] lol
[15:57:47] see why?
[15:57:49] lol
[15:58:01] interesting
[15:58:02] XD
[15:58:12] at least we have something to check for the next failovers :)
[16:00:31] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[16:01:16] fixed, and replication should be back working, consistently
[16:01:43] \o/
[16:02:25] it didn't do the reset slave, was that expected?
[16:02:54] on the old master?
[16:03:07] mmm
[16:03:34] yes, it is expected
[16:03:43] it stops, then starts the other, then resets
[16:03:56] we could recover a bit better
[16:03:59] I guess
[16:04:01] But it doesn't configure the old master to replicate from the new master?
[16:04:06] it does
[16:04:11] but it failed to connect
[16:04:15] so it aborts
[16:04:16] Ah right right
[16:04:32] resetting manually
[16:04:42] and trying the inverse thing
[16:04:47] cool
[16:04:58] I applied the fix to the other codfw host, too
[16:05:01] just in case
[16:05:23] we need to review eqiad for tomorrow
[16:05:27] we can do it tomorrow morning
[16:05:42] I added it to the etherpad
[16:05:43] oh, I can just apply the fix after we are happy
[16:05:48] sure
[16:05:55] I want to have a successful run first
[16:06:11] I think I could add this to a sanity check, too
[16:06:13] yeah of course
[16:06:45] ./switchover.py --skip-slave-move es2018 es2017
[16:06:50] ^I will now run this
[16:06:55] cool!
[16:09:12] want to check something? it looked good this time
[16:10:02] https://phabricator.wikimedia.org/T199224#4448134
[16:10:09] 10DBA, 10Patch-For-Review: Test database master switchover script on codfw - https://phabricator.wikimedia.org/T199224 (10jcrespo) Codfw testing: ``` root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2018 es2017 Starting preflight checks... * Original read only values are as exp...
[16:10:44] I will check that heartbeat is working as intended
[16:11:00] as right now codfw heartbeats are ignored by icinga
[16:11:25] yeah
[16:11:33] also, it does not enable gtid
[16:11:39] but I am not sure if it should
[16:11:45] no
[16:11:47] (at least at the moment)
[16:11:48] I wouldn't do that
[16:12:00] We should probably do that manually once we are happy and see everything stable
[16:12:05] As we normally do after a failover
[16:12:06] I have the api ready in case we want to do it at some point
[16:12:22] also, it doesn't touch anything about semisync, etc.
[16:12:28] I will add those to the checklist
[16:12:33] yeah
[16:12:38] all that is for a second phase, I would say
[16:13:25] want to do another bounce?
[16:13:30] or are we good?
[16:13:51] (this should be trivial, now)
[16:14:12] and the tool prevents doing stupid things, too
[16:14:14] let's do one more and then back?
[16:14:17] cool
[16:14:50] 17 -> 18 now
[16:15:11] go!
[16:15:17] SUCCESS: Master switch completed successfully
[16:15:21] 6.193
[16:15:27] (that is in seconds)
[16:15:31] jeeeez
[16:15:33] :)
[16:15:38] and again, a lot of roundtrips
[16:16:18] 18 -> 17 (do you want a go yourself?)
[16:16:34] yeah
[16:16:40] I was going to say that :)
[16:16:45] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[16:16:56] time /home/jynus/wmfmariadbpy/wmfmariadbpy/switchover.py --skip-slave-move es2018 es2017
[16:17:00] requires sudo
[16:17:12] going for it!
[16:17:15] you pilot this time
[16:17:32] [ERROR]: pt-heartbeat execution was not successful- it could not be detected running
[16:18:00] was that after the change?
[16:18:15] after this
[16:18:16] or on kill?
[16:18:16] Trying to invert replication direction
[16:18:16] Starting heartbeat section bes3 at es2017.codfw.wmnet
[16:18:26] yeah, so on the new start
[16:18:47] but it is running on 2017
[16:19:12] so it started
[16:19:19] but it was not detected as running
[16:19:24] everything else went ok
[16:19:41] yeah, es2017 has two slaves
[16:19:42] so yes
[16:20:00] actually
[16:20:01] let me check the code
[16:20:04] I think it was never killed on es2017
[16:20:10] Apr16 105:48 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia
[16:20:22] mmm
[16:21:32] I think I know the bug
[16:22:43] let me try the process again
[16:22:48] sure
[16:23:04] 17 -> 18
[16:23:10] yep
[16:23:44] Could not find a pt-heartbeat process to kill
[16:23:52] yeah, it is still running
[16:23:53] but it is there
[16:24:05] it is being shown in the output
[16:24:15] let's blame cumin :p
[16:24:35] maybe the way it is executed
[16:24:45] changes the detection slightly
[16:24:50] so it does not match the regex
[16:24:56] yeah, could be
[16:26:09] are you ok if I log off? I need to do some stuff
[16:26:12] 10DBA, 10Data-Services, 10MediaWiki-Change-tagging, 10Patch-For-Review: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061 (10Marostegui)
[16:26:16] sure
[16:26:22] I will fix this, which is almost there
[16:26:25] Thanks
[16:26:27] and it will be ready tomorrow
[16:26:29] don't worry
[16:26:32] it will be nothing
[16:26:32] You're doing a great job
[16:26:42] I am really happy to see this moving forward :)
[16:27:11] I will see you tomorrow for the failover!
[16:32:49] apparently there is a --defaults-file=/dev/null
[16:32:57] which makes it not detect it
[16:33:01] strange
[16:33:07] but very easy to solve
[16:35:40] so there is what puppet runs, and what is actually running
[16:35:58] and that is a problem throughout our infra :-)
[17:09:01] 10DBA, 10Patch-For-Review: switchover es1014 to es1017 - https://phabricator.wikimedia.org/T197073 (10jcrespo)
[17:09:03] 10DBA, 10Patch-For-Review: Failover db1052 (s1) db primary master - https://phabricator.wikimedia.org/T197069 (10jcrespo)
[17:09:05] 10DBA, 10Epic: [META ticket] Automation for our DBs tracking task - https://phabricator.wikimedia.org/T156461 (10jcrespo)
[17:09:18] 10DBA, 10Patch-For-Review: Test database master switchover script on codfw - https://phabricator.wikimedia.org/T199224 (10jcrespo) 05Open>03Resolved a:03jcrespo So the issue we had with pt-heartbeat was that the code ran by puppet wasn't actually running there- it doesn't get killed and restarted if it i...
[17:25:55] 10DBA, 10Patch-For-Review: Test database master switchover script on codfw - https://phabricator.wikimedia.org/T199224 (10Marostegui) Thanks for the work you've put in here. This is nice work towards getting less downtime and fewer human errors while doing critical failovers.
[19:48:54] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T200287 (10Marostegui) 05Open>03Resolved a:03Cmjohnson This is all good now Thank you! ``` root@db1069:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
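The detection bug resolved above (a live pt-heartbeat process carrying an unexpected --defaults-file=/dev/null argument that the script's regex did not match) can be reproduced with a small example. Both patterns and the command line below are illustrative guesses, not the actual wmfmariadbpy regexes or flags:

```python
import re

# Hypothetical command line of the running process, including the extra
# --defaults-file=/dev/null argument that was not expected by the matcher.
# The --shard/--datacenter flags are invented for the example.
cmdline = ("/usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia "
           "--defaults-file=/dev/null --shard=es3 --datacenter=codfw")

# A strict pattern that expects --shard to come right after the script name
# fails once the unexpected flag is inserted in between:
strict = re.compile(r"pt-heartbeat-wikimedia --shard=\S+")

# A looser pattern matching only the script name still finds the process:
loose = re.compile(r"pt-heartbeat-wikimedia\b")
```

This is the shape of the "what puppet runs vs. what is actually running" problem jynus describes: the matcher encodes assumptions about the command line that drift out of sync with reality.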
[19:49:45] 10DBA, 10Operations, 10ops-eqiad: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) 05Open>03Resolved The disk got replaced and this is all good now: T200287#4448846
[22:21:33] 10DBA, 10JADE, 10Operations, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight)
[22:22:35] 10DBA, 10JADE, 10Operations, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight)
[22:23:13] 10DBA, 10JADE, 10Operations, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10awight) Creating a separate task presenting our questions as an RFC: {T200297}
[22:29:58] 10DBA, 10JADE, 10Operations, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight)
[22:41:02] 10DBA, 10JADE, 10Operations, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight)
[23:17:24] 10DBA, 10JADE, 10Operations, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight)