[06:16:26] 10DBA, 10Operations: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Don't worry, as soon as we arrange a date/time, I will stop it, so we are sure that no lag will happen before the failover. I will leave the screen running and just kill the process so you can...
[06:28:30] 10DBA, 10Operations, 10ops-eqiad: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Marostegui) Chris, can we do this today as this host will be the future s3 primary master?
[07:44:26] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui)
[08:25:02] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui)
[08:50:28] hey jynus we've got a bunch of things ahead :(
[08:50:42] Can you take a look at: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/484611/
[08:51:59] thanks!
[08:52:35] next question…. tomorrow 8AM (7AM UTC) for s3 failover would work for you?
[08:52:40] I have made all the patches for db1078 already
[08:52:48] https://phabricator.wikimedia.org/T213858
[08:54:42] are managers ok with that?
[08:55:05] they were yesterday on the hangouts at around 10pm
[08:55:11] let me ask them
[08:55:40] I asked them but got no definitive answer
[08:56:33] just asked
[09:01:50] for some reason I got disconnected
[09:03:40] should I restart and upgrade the replicas?
[09:03:48] I finished codfw just now
[09:03:54] if you want to take the eqiad ones
[09:03:57] I think there are not many left
[09:04:03] I did some during christmas
[09:04:14] And later re-check our etherpad and the patches
[09:04:18] sorry for all the homework :(
[09:05:08] I think I have connection issues
[09:08:38] 10DBA, 10Operations, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) a:03Marostegui This has been done
[09:10:05] check ongoing connections through the proxy, some services may need a restart
[09:10:20] yeah, I was wondering if phab needs one
[09:10:54] let them die naturally first, but if it still has tcp connections, it has to be restarted
[09:11:01] yep
[09:11:35] there is nothing from what I can see
[09:12:56] on mysql itself I see just 2 connections
[09:13:00] coming from dbproxy1003
[09:13:05] let's give it some time
[09:13:08] they are sleeping
[09:14:11] backups may have had issues because of the tendril restart
[09:14:27] this is connected to the restart of the backup sources
[09:14:44] yeah
[09:16:04] I will later test the switchover script on test-s4
[09:16:20] thanks
[09:20:46] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) >>! In T212487#4867178, @Marostegui wrote: > @elukey For those databases that...
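The "let them die naturally first, but if it still has tcp connections, it has to be restarted" check above can be sketched in code. This is an illustrative Python sketch, not WMF tooling: the `needs_restart` helper and the row format are assumptions modelled on `SHOW PROCESSLIST` output; only the proxy hostnames come from the log.

```python
# Illustrative sketch (not actual tooling): decide whether a service
# still holds connections through the old proxy after a failover,
# based on rows from SHOW PROCESSLIST on the backend database.

def needs_restart(processlist_rows, old_proxy="dbproxy1003"):
    """Return True if any connection still arrives via old_proxy.

    processlist_rows: iterable of dicts with a 'Host' key formatted
    as 'hostname:port', as a SHOW PROCESSLIST query would return.
    """
    return any(
        row["Host"].split(":")[0] == old_proxy
        for row in processlist_rows
    )

# The two sleeping connections seen in the log would keep this True
# until they die naturally or the service (phabricator) is restarted.
rows = [
    {"Host": "dbproxy1003:43210", "Command": "Sleep"},
    {"Host": "dbproxy1003:43211", "Command": "Sleep"},
    {"Host": "dbproxy1008:50100", "Command": "Query"},
]
print(needs_restart(rows))  # True
```

In practice the sleeping connections were left to expire on their own, and the proxy task was resolved once they were gone.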
[09:31:14] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) @Anomie I have stopped the script as we are most likely going to go ahead with the failover in EU morning (still waiting for the managers to confirm)
[10:35:21] I think I need more coffee to run mysql_upgrade on s3
[10:35:43] haha
[10:36:13] db1077 is partially repooled, will go with db1095 next while I wait for it to warm up
[10:36:33] and later db1123
[10:36:47] ok
[10:37:16] I will wait for cmjohnson1 for https://phabricator.wikimedia.org/T209815 to be done later when he is onsite
[10:37:58] I was wondering whether to reboot it
[10:38:22] let's wait for the firmware upgrade
[10:38:27] so we can do it all at once
[10:39:26] I will try to have all other hosts in a good state by then
[10:39:53] I didn't start es1019
[10:40:01] not sure if we should
[10:40:36] 10DBA, 10Operations, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) @mmodell ` ˜/marostegui 11:39> twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still see two connections (they have been there for ho...
[10:40:39] I think chris is done with it
[10:41:09] I will renew the downtime at least until friday
[10:41:16] oki
[10:52:09] 10DBA, 10Operations, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) >>! In T213865#4883806, @Marostegui wrote: > @mmodell > ` > ˜/marostegui 11:39> twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still...
[11:20:21] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[11:24:16] db1068 correctable errors seem to have gone down, so I guess that is good?
[11:24:33] yeah, same behaviour happened in decemeber
[11:24:34] december
[11:24:39] and one day they were fully gone
[11:25:05] I created the task to keep it in mind, will close it if they go to 0
[11:28:42] I am done with es1019. I forgot to close the task. I will update it in a few minutes.
[11:29:10] np, cmjohnson1 we thought you were busy enough
[11:29:25] but didn't want to do something without your green light
[11:40:13] going for db1095 next
[12:01:58] 10DBA, 10Operations, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10jcrespo) So this is solved?
[12:25:03] 10DBA, 10Operations, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) Yes
[12:25:26] 10DBA, 10Operations, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) 05Open→03Resolved
[12:52:01] jynus: let me know when I can depool db1078
[12:53:26] if you are going to deploy to eqiad-php
[12:53:35] we need to merge the repool first
[12:53:50] I was waiting for swat to finish
[12:53:57] sure, no worries
[12:54:03] I need to coordinate with cmjohnson1 still
[12:54:27] so should I depool db1123 or do I wait?
[12:54:35] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Just to confirm: Date: Thursday 17th January Time: 07:00 AM UTC - 07:30 AM UTC (we expect not to use the full 30-minute window) **Impact: All those wikis will go read-o...
[12:54:47] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui)
[12:55:17] I think you can go with db1123 as I haven't arranged a time with Chris yet
[12:55:33] Marostegui: what do we need to coordinate?
[12:56:36] cmjohnson1: db1078 will be the new db master for s3 tomorrow, so I would like to get its firmware upgraded today if you have time: https://phabricator.wikimedia.org/T209815
[12:57:06] Sure. Take it offline and I will do it now
[12:57:26] ok, let me coordinate with jynus XD
[12:57:29] jynus: ^
[13:00:17] cmjohnson1: we are going to depool db1078 and I will ping you once it is off
[13:00:32] Ok
[13:17:12] cmjohnson1: db1078 is now off, you can proceed
[13:21:33] I will give es1019 a kick
[13:22:01] Great!
[13:22:21] cmjohnson1: when done turn db1078 off and I will take it from there :)
[13:22:25] *on
[13:22:35] marostegui: did you do a full-upgrade?
[13:22:37] yep
[13:24:03] we will see if mgmt lasts this time...
[13:32:09] marostegui and jynus...sorry this is going to take a little longer than I thought. I forgot it was an HP server. I have to run the service pack but I need to be there to do it.....I'm on my way now but will be about 1 hour w/traffic
[13:32:44] cmjohnson1: no worries
[13:32:51] I will extend the downtime
[13:33:16] I can do db1124
[13:33:38] cool
[13:33:51] it will affect several hosts but we have to do it at some point
[13:34:02] yeah
[13:34:09] and better now without s3 updates
[14:08:42] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10elukey) @Marostegui I have now everything in my home gzipped, I'll move it to stat1007 and...
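The depool → upgrade → repool cycle coordinated above (for db1078 and the other replicas) can be summarized as an ordered checklist. A minimal sketch, assuming hypothetical step names: the function and the command strings are illustrative only; pooling at the time was done via mediawiki-config patches, and the exact procedure differs per host.

```python
# Illustrative checklist (assumed step names, not real commands) for
# the depool / upgrade / repool cycle applied to hosts like db1078.

def maintenance_steps(host, firmware=False):
    """Return the ordered steps to take one replica through maintenance."""
    steps = [
        f"depool {host} (merge mediawiki-config patch)",
        f"downtime {host} in icinga",
        f"stop mariadb on {host}",
    ]
    if firmware:
        # HP hosts need the Service Pack run on site, as in the log
        steps.append(f"power off {host} and run firmware upgrade")
    steps += [
        f"apt full-upgrade and reboot {host}",
        f"start mariadb and run mysql_upgrade on {host}",
        f"repool {host} gradually while the buffer pool warms up",
    ]
    return steps

for step in maintenance_steps("db1078", firmware=True):
    print("-", step)
```

The gradual repool at the end matches the "partially repooled ... while I wait for it to warm up" comment earlier in the log: a freshly restarted replica has a cold buffer pool and cannot take full read load immediately.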
[14:09:27] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) cool, let me know when you think it is safe for me to delete my files
[14:09:52] you can delete anytime --^
[14:09:57] ok thanks
[14:13:31] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui) Just for the record, I am spending an incredible amount of time fixing dbstore1002 in the last few days si...
[14:13:41] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[14:14:27] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) As spoken on IRC...I have deleted my files.
[15:10:44] 10DBA, 10Operations, 10ops-eqiad: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10jcrespo) 05Open→03Stalled stalling, no errors so far, but I doubt this is the last time we hear about this. Backups are on dbstore1001 just in case.
[15:12:03] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Cmjohnson) I ran the Service Pack on db1078, all firmware is up to date including BIOS and raid controller. The server is currently powered off
[15:12:56] cmjohnson1: will you power it on for us?
[15:12:59] or do you want me to?
[15:13:10] i can power it on if you are ready for it
[15:13:16] you asked to leave it powered off
[15:13:20] yep!
go for it
[15:13:23] ok
[15:13:26] yeah, it was a typo, i wanted to say "on" XD
[15:13:28] Sorry!
[15:13:52] hah..no worries it's coming up now
[15:13:55] thank you!
[15:16:29] marostegui are you able to ssh?
[15:17:11] yes! it works
[15:17:17] I will take it from here
[15:17:20] Thanks a lot
[15:17:20] cool
[15:17:47] I will resolve the task once I repool it
[15:37:37] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Marostegui) 05Open→03Resolved Thank you so much! The server is back in the mix.
[17:29:21] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10jcrespo) switchover script works as expected (tested on db1111/db1112): `lang=sh, lines=10 ./switchover.py --skip-slave-move db1111 db1112 Starting preflight checks... * Original rea...
[17:35:59] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Awesome news! We have to include it on the steps list on our etherpad, which I wrote yesterday evening and needs to be reviewed by you, as it was late in the day, so error...
[17:38:26] ^I just did that
[17:38:49] put the things not needed on "Equivalent manual steps" to skip
[17:41:01] I will also try for next time to do the topology changes automatically, integrating the repl functionality
[17:43:51] thank you!
[17:43:55] i have a question
[17:44:00] about your line 33
[17:44:06] check my line 34 :)
[17:47:08] I want to assume that either we abort or go through
[17:47:34] so we leave gtid and semisync enabled before the actual failover
[17:47:53] well, I don't know if I want it
[17:47:56] I am proposing it
[17:48:20] But we always disable gtid
[17:48:24] You want to try not doing so?
[17:48:35] yes, we disable it for migration
[17:49:18] it is a small thing so no big deal
[17:49:25] as long as we leave it for later
[17:49:35] yeah, it is on the list, to put it back ON
[17:49:43] what I don't want is to keep users waiting for it
[17:49:49] so either reenable it in advance
[17:49:57] or do it as the last thing
[17:49:57] users waiting for what?
[17:50:31] outside of the "critical time"
[17:50:56] I am not sure what you mean, why would users wait for GTID?
[17:51:08] don't worry, we can talk tomorrow
[17:51:49] sure
[21:24:44] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10cwdent)
[21:49:57] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10awight) How many rows are in the tables? There should be a timestamp column, often `ts`, to query. It's probably worth keeping an archival dump if it might be real data.
[23:30:06] 10DBA, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Watching / External), and 2 others: Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (10Pchelolo) 05Resolved→03Open Hm, trying to deploy the service again I see the same issue: `...
[23:40:12] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10greg) >>! In T212487#4867200, @Marostegui wrote: > I asked @chasemp about `fab_migration`...
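The GTID/semi-sync ordering debated in the last exchange can be sketched as follows. Hedged sketch: GTID was already disabled for the migration, and the open question is whether to re-enable it in advance or as the very last step, so the change stays outside the read-only "critical time". The MariaDB statement syntax is real, but the step lists are assumptions about the etherpad checklist, not a copy of it.

```python
# Illustrative sketch (not the actual etherpad runbook) of the two
# orderings discussed: re-enable GTID in advance of the failover, or
# as the last thing afterwards -- in both cases outside the window in
# which the wikis are read-only.

CRITICAL = [
    "SET GLOBAL read_only=1  -- old master",
    "-- wait for replication to catch up, promote the new master",
    "SET GLOBAL read_only=0  -- new master",
]
GTID_ON = [
    "STOP SLAVE",
    "CHANGE MASTER TO MASTER_USE_GTID=slave_pos",
    "START SLAVE",
]

def failover_plan(reenable_last=True):
    """Order the steps so the GTID change never extends the
    read-only window."""
    if reenable_last:
        return CRITICAL + GTID_ON   # re-enable as the last thing
    return GTID_ON + CRITICAL       # re-enable in advance

for step in failover_plan():
    print(step)
```

Either ordering satisfies the constraint voiced in the log ("what I don't want is to keep users waiting for it"): the GTID change itself is cheap, but it must not sit inside the read-only window.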