[04:23:50] marostegui: jynus I can verify that I can access labsdb1004 from tools, so no need to massage VLANs or firewalls [04:24:00] I do find that it has fewer databases than 1005 tho. Not sure if that's expected [04:46:53] jynus: marostegui https://gerrit.wikimedia.org/r/#/c/337775/ will switch the aliases we ask people to use to labsdb1004 from 1005 [07:00:11] 10DBA, 06Labs, 13Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3028418 (10Marostegui) >>! In T153743#3026114, @jcrespo wrote: > I've added a workaround that makes no sense but that works for now, we need to revisit it... [07:15:59] yuvipanda: https://phabricator.wikimedia.org/P4935 [07:16:03] I guess it is not too worrying [07:19:13] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, 07User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (10Marostegui) ``` 04:23 < yuvipanda> marostegui: jynus I can verify that I can access labsdb1004 from tools, so no need to massage VLANs or fi... [08:30:49] 07Blocked-on-schema-change, 06Collaboration-Team-Triage, 10Notifications, 13Patch-For-Review, 07Schema-change: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428#3028544 (10Marostegui) x1 is done too, so I believe this ticket can be closed. I am not going to paste all... [08:31:00] 07Blocked-on-schema-change, 06Collaboration-Team-Triage, 10Notifications, 13Patch-For-Review, 07Schema-change: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428#3028545 (10Marostegui) 05Open>03Resolved [08:34:34] 10DBA, 06Operations: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#3028551 (10MoritzMuehlenhoff) My two cents: From a high level view I personally prefer the systemd unit to be in the Debian package since it's part... [09:38:26] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, 07User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3028611 (10Marostegui) After a chat with Jaime we have moved those old databases in labsdb1005 to: `labsdb1005:/srv/tmp/old_dbs` . They didn't have an... [10:00:44] 10DBA, 10Analytics, 06Labs: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166#3028656 (10JAllemandou) [10:10:03] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028719 (10Marostegui) I assume the following is going to happen in all the hosts as we have MIXED everywhere (except some specific cases like the sanitarium2 masters): ``` root@neodymium:~#... [10:26:33] Hello! qq - I am reviewing the analytics ACLs on cr1/cr2, and there is a rule called prelabsdb-mysql listing some IPs. One is not used anymore, one is now kubernetes1003, and the last one is db1057 [10:26:40] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028736 (10jcrespo) Use "--no-check-binlog-format"- pt-t-c forces binlog format already if using super for itself, it should only cause issues for multi-level slaves, but you can check lat...
[10:26:51] I think it is all old garbage but I wanted to double check with you [10:26:59] (my team does not remember) [10:28:17] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028741 (10jcrespo) Also, let's centralize the dsn tables on tendril or any other central place- so we do not have garbage tables in the future scattered all around. [10:29:28] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028742 (10jcrespo) Are you also sure you are using an updated version of pt-table-checksum, one without the binary bug? [10:46:14] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028759 (10Marostegui) >>! In T154485#3028736, @jcrespo wrote: > Use "--no-check-binlog-format"- pt-t-c forces binlog format already if using super for itself, it should only cause issues... [11:03:25] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028802 (10jcrespo) > Good point: pt-table-checksum 2.2.20 I do not know when that was fixed- `grep -A 15 ' CREATE TABLE checksums' $(which pt-table-checksum)` should force the table or t... [11:06:41] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028813 (10jcrespo) >>! In T154485#3028759, @Marostegui wrote: > That makes sense, however we'd need to truncate the table after using it as it will be used to check specific slaves from d... [11:09:35] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028822 (10Marostegui) >>! In T154485#3028802, @jcrespo wrote: >> Good point: pt-table-checksum 2.2.20 > > I do not know when that was fixed- `grep -A 15 ' CREATE TABLE checksums' $(wh... [11:43:15] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3028898 (10Marostegui) For the record I have created the dsns tables on tendril (and the first test with pt-table-checksum on m3 is using it). The only one that has data so far is dsns_m3... [11:48:47] 10DBA, 13Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#3028935 (10Marostegui) Not sure if it is worth altering the master (db1049) anymore as it is going to be decommissioned soon: T134476. Probably not worth the risk and the time. [14:33:00] 10DBA, 06Operations: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3029269 (10Marostegui) [14:49:22] 10DBA, 06Operations: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3029289 (10Marostegui) Server rebooted fine. It showed this on dmesg, which I am not completely sure what it means: ``` [ 32.823256] hpsa 0000:08:00.0: Acknowledging event: 0xc0000000 (HP SSD Smart Path configuration ch...
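For reference, the pieces discussed above (MIXED binlog format, the central DSN table on tendril, the `--replicate` checksums table) combine roughly as follows. A hedged sketch only: the `dsns_m3` table name comes from the log, but the tendril host, the `percona` schema, and the user are illustrative assumptions.

```
# --no-check-binlog-format: binlog_format is MIXED on these masters, and
#   pt-table-checksum forces STATEMENT for its own session anyway, so the
#   global format check can be skipped (the multi-level slave caveat aside).
# --recursion-method=dsn: discover replicas from the centralized DSN table
#   instead of scattering per-host tables around.
pt-table-checksum \
  --no-check-binlog-format \
  --replicate=percona.checksums \
  --recursion-method=dsn=h=tendril.eqiad.wmnet,D=percona,t=dsns_m3 \
  --databases=phabricator_file \
  --ask-pass \
  h=m3-master.eqiad.wmnet,u=checksum
```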
[15:00:40] 10DBA, 06Operations, 10ops-eqiad: Replaced BBU for db1060 - https://phabricator.wikimedia.org/T158194#3029382 (10Marostegui) p:05Triage>03High [16:05:49] 10DBA, 10MediaWiki-Database, 10MediaWiki-Logging, 07Performance, 07Schema-change: Logging needs an index to optimize searching by log_title - https://phabricator.wikimedia.org/T68961#723244 (10Huji) [16:06:28] 10DBA, 10MediaWiki-Database, 10MediaWiki-Logging, 06Performance-Team, and 2 others: Logging needs an index to optimize searching by log_title - https://phabricator.wikimedia.org/T68961#723244 (10Huji) [16:24:53] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, 07User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3029816 (10Marostegui) For the backup data: es1017 looks like a good candidate: ``` marostegui@es1017:~$ df -hT /srv Filesystem Type Size... [17:02:01] yuvipanda jynus the time has come I believe :-) [17:03:33] I'm here [17:04:16] I guess we need this: https://gerrit.wikimedia.org/r/#/c/337775/ to be deployed :) [17:04:28] before we can stop 1005 and copy it over [17:04:50] not yet, first we announce it [17:05:02] then we put the master in read only [17:05:07] ah :) [17:05:10] then we repoint [17:05:21] then we shut it down [17:05:36] we also need a puppet change while the server is down [17:06:29] I think db2062 didn't boot to a proper state [17:06:39] a proper state? [17:06:46] hello [17:06:48] I'm here [17:07:14] marostegui, https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db2062 [17:07:35] yuvipanda, let's announce the start of the work [17:07:40] on IRC [17:07:50] not sure if you usually do it twice on mail [17:08:01] jynus: oh wow…I will take care of it later then, thanks for the heads up [17:08:25] jynus: yeah, I do. let me do that [17:09:08] done [17:09:30] jynus: am ready to merge and test the DNS failover whenever you want :) [17:10:01] yuvipanda, wait [17:10:14] we need to disable 3 users [17:10:24] ok! [17:10:37] (the 3 that cannot replicate to 4) [17:11:06] I've changed the topic on labs [17:11:17] thanks! [17:11:28] so [17:11:38] I got the screen with the nc commands on es1017 and the iptables rule for the transfer ready [17:11:43] good [17:12:20] I will change the permissions of the conflictive accounts to root now [17:13:00] That is s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.% [17:13:12] on labsdb1004 [17:13:20] ping me if I am going to do something stupid [17:13:28] haha [17:13:34] so it makes all sense :) [17:16:02] [s51412__data]> create table test (i int); [17:16:08] ERROR 1005 (HY000): Can't create table `s51412__data`.`test` (errno: 13 "Permission denied") [17:16:32] \o/ [17:16:56] I checked that I can create and drop tables, still [17:17:17] so, next stop, setting master as read only [17:17:20] nice [17:17:23] and change replication direction [17:17:25] * marostegui taking notes to ask jynus tomorrow a few questions in our meeting [17:17:57] and note the binlog position of both servers [17:18:06] this should be done as fast as possible [17:18:14] yuvipanda, prepare the patch for merge [17:18:28] yes sir! [17:18:30] so this is as minimally disruptive as possible [17:18:34] but wait for our ok [17:18:42] we have to confirm replication works the other way [17:18:51] while in read only mode [17:19:23] jynus: yup! [17:19:34] patch ready whenever you are [17:20:20] marostegui, I do the changes, I assume?
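The errno 13 "Permission denied" test above points at a filesystem-level block rather than a GRANT change: handing the schema directories to root stops mysqld (running as the mysql user) from creating or altering files in them. A minimal sketch, assuming the `/srv/labsdb/data` datadir layout quoted later in the log:

```
# On labsdb1004: block writes to the three conflicting databases by chowning
# their directories to root. CREATE TABLE then fails with errno 13, exactly
# as tested above; the chown is reverted at the end of the maintenance.
for db in s51412__data s51071__templatetiger_p s52721__pagecount_stats_p; do
  chown -R root:root "/srv/labsdb/data/${db}"
done
```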
[17:20:31] yep, I am taking notes on an etherpad [17:20:37] setting labsdb1005 in read only [17:20:39] but I can double check the binlog position too [17:21:24] log.124340 | 21527357 for the current master [17:21:28] 1005: log.124340 21527357 [17:21:29] it is in read only mode [17:21:32] yep [17:21:37] copy that to the etherpad [17:21:51] done [17:21:53] and 1004 too [17:22:06] the slave is up to date with that [17:22:28] local master pos on 4 is : log.059837 | 11232939 [17:22:29] https://etherpad.wikimedia.org/p/labs-migration [17:22:49] good, we agree [17:22:58] now, resetting the replication [17:23:01] on 4 [17:23:32] and running change master on 5 [17:25:21] check the etherpad [17:25:24] for coords [17:25:31] marostegui? [17:25:36] yes [17:25:40] did you see my comment? [17:25:43] maybe I got disconnected [17:25:49] I did [17:25:53] makes sense? [17:25:58] it was the same number [17:26:03] then it looks good [17:26:09] running on 5 [17:27:39] I see it now [17:27:45] replication looks good [17:27:54] yep :) [17:27:55] we will see if it breaks :-) [17:27:58] XD [17:28:01] no ssl yet [17:28:05] we can merge, yuvipanda [17:28:15] oh wait [17:28:17] ok [17:28:26] it is still read_only=ON [17:28:27] we can put 4 in read/write [17:28:34] yep [17:28:35] merge now, doing it now [17:29:12] done now [17:29:19] let's repoint to labsdb1004 [17:29:43] see how replication and tools react, etc. [17:29:55] I've merged, it takes a little time for it to propagate anyway. let me force a puppet run [17:30:01] I know [17:30:10] we can put a proxy here in the future [17:30:19] if we get the money :-) [17:30:37] :D [17:31:38] some users seem to be using persistent connections [17:31:54] I can "help" changing the server once the change has been propagated [17:32:05] hahaha [17:32:07] "help" [17:32:07] XD [17:32:23] let's get that copy prepared meanwhile [17:32:30] and the puppet role change [17:33:19] once we stop the server I am ready to hit the enter and start the copy [17:33:26] good [17:33:38] then review my puppet change when it is ready [17:33:57] I will copy /srv/postgres and /srv/labsdb into two different tar.gz [17:34:12] it is already right? [17:34:14] the postgres one is just 67M :-) [17:34:28] No, I haven't started the copy [17:34:31] I do not think there is a real postgres there [17:34:48] I can do it now, there is no process, no [17:34:49] have you checked the contents? [17:35:03] indeed, there is nothing there XD [17:35:12] postgres is on 1006 and 1007 [17:35:23] it has 4T for it there, I am glad we are going to reimage :) [17:35:36] ok, once we are ready we can stop mysql and I will start the copy of /srv/labsdb [17:35:44] and 1004 [17:35:56] I thought the role was wrong [17:36:00] but it is right [17:36:07] it may need a check [17:36:58] this is simpler than I thought [17:37:01] mariadb10 => false [17:37:13] to nothing (we have 10 as default) [17:37:55] primary DNS complete, awaiting secondary DNS puppet run to finish [17:38:01] cool [17:39:34] just curious - are we also upgrading to mariadb 10? [17:39:39] yes [17:39:54] people asked/complained about it [17:40:06] about not being 10 [17:40:10] not about the upgrade [17:40:32] there are some tools blocked by it (e.g. wanting innodb fulltext search) [17:41:08] nice [17:41:10] \o/ [17:41:21] jynus: marostegui ok, DNS done [17:41:29] create database test; and it appears on the slave! [17:41:49] I drop it and it drops! magic!
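Condensed, the direction flip above amounts to the following sequence, using the coordinates recorded on the etherpad. The replication user and password are placeholders; they are not in the log.

```
# 1. Freeze the current master and record both sets of coordinates.
mysql -h labsdb1005.eqiad.wmnet -e "SET GLOBAL read_only = ON; SHOW MASTER STATUS;"
#    -> log.124340 / 21527357
mysql -h labsdb1004.eqiad.wmnet -e "SHOW MASTER STATUS;"
#    -> log.059837 / 11232939 (labsdb1004's own binlog position)
# 2. labsdb1004 stops being a slave.
mysql -h labsdb1004.eqiad.wmnet -e "STOP SLAVE; RESET SLAVE ALL;"
# 3. labsdb1005 starts replicating from labsdb1004's position noted above.
mysql -h labsdb1005.eqiad.wmnet -e "CHANGE MASTER TO
    MASTER_HOST='labsdb1004.eqiad.wmnet', MASTER_USER='repl',
    MASTER_PASSWORD='<secret>', MASTER_LOG_FILE='log.059837',
    MASTER_LOG_POS=11232939; START SLAVE;"
# 4. Once replication is confirmed working, open the new master for writes.
mysql -h labsdb1004.eqiad.wmnet -e "SET GLOBAL read_only = OFF;"
```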
[17:42:03] ok, I will kill all connections on the previous master [17:42:19] we've got a bunch yes [17:42:20] and wait for people to complain because they have not programmed [17:42:33] their services to reconnect [17:43:37] heh [17:46:20] 1 stubborn user reconnected [17:46:25] I see how the connections reduced, yep [17:46:27] from 90 to 12 XD [17:46:32] several, actually [17:47:17] I see people using the switchover, though [17:47:53] and I see the same user again there [17:48:14] replication is up [17:48:37] yuvipanda, should we wait to check important tools, or should we stop mysql already? [17:48:57] for me I would put it down already- less downtime [17:49:38] replication broke, but I think I know what it is [17:49:42] jynus: I think as long as we're sure we won't completely lose data, I say we stop it [17:49:52] we can fix it later [17:50:06] shutting down 1005 [17:50:09] yuvipanda: we won't lose data [17:50:21] I just checked PAWS, and it's just reconnected and come back up [17:51:28] 5 is down [17:51:33] jynus: I see the process is now down, you want me to start the copy? [17:51:38] yes [17:51:43] ok [17:52:27] started [17:52:36] copying /srv/labsdb to es1017:/srv/tmp [17:53:26] I do not like using production hosts [17:53:38] but there is not much now in the dc [17:53:58] yeah, me neither, but until we get the new dbstores...:( [17:54:01] let me disable puppet [17:54:05] and merge the change [17:54:08] cool [17:56:18] yuvipanda, we do not really need you around for 1-2 hours [17:56:23] marostegui, agree? [17:56:30] yep [17:56:35] I think it will take around 1h to finish the copy [17:56:42] ok then! I'll go shower and stuff :) [17:56:59] I will also check I can decompress the tar.gz (not the whole of it, but just a few files) [17:57:01] I'll check back in at most 1h but possibly earlier [17:57:07] feel free to call me if needed [18:51:38] 1h into the transfer and we have copied half of the dataset [18:51:48] :D [19:47:20] almost done [19:47:34] w00t [19:48:32] yeah, 60G to go :) [19:54:13] let's verify the tar when we are done [19:54:18] unrelated, I'd like to do https://phabricator.wikimedia.org/T146718#3028336 later this week. [19:54:30] yeah, I will check I can decompress it [19:54:36] for a few minutes and then ctrl+c [19:54:53] +1 [19:55:01] \o/ [19:55:59] yuvipanda, https://phabricator.wikimedia.org/T157359 has higher priority [19:56:06] it is part of our chosen goal [19:56:32] jynus: yeah, I agree. [19:56:39] the copy is done [19:56:42] let me verify the tar [19:56:49] yes, take your time [19:57:02] last thing we want is to lose it [19:57:08] * yuvipanda nods [19:57:37] extracting [19:59:07] let it reach something other than the binlogs [19:59:19] yeah [19:59:21] no worries [19:59:26] it is still extracting binlogs :) [19:59:34] I am going to let it run for a while [19:59:47] and meanwhile will merge this: https://gerrit.wikimedia.org/r/#/c/337911/ [19:59:53] if it gets verified sometime soon [19:59:56] :( [20:09:05] 10DBA, 06Operations, 13Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501#3030736 (10jcrespo) The user part should be fixed, or fixed when all trusties are decommissioned. The group part will take effect starting on stretch. This is mostly done...
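The transfer commands themselves were never pasted into the channel; a sketch of the nc-plus-tar pattern described ("the nc commands on es1017 and the iptables rule"), with the port number and netcat flavor as assumptions:

```
# On the receiver (es1017): let the sender through the firewall and write
# the stream to a tarball under /srv/tmp.
iptables -I INPUT -p tcp -s labsdb1005.eqiad.wmnet --dport 4444 -j ACCEPT
nc -l -p 4444 > /srv/tmp/labsdb.tar.gz     # BSD netcat wants 'nc -l 4444'
# On the sender (labsdb1005), once mysqld is stopped and the datadir is
# quiescent: stream /srv/labsdb as a gzipped tarball.
tar -czf - -C /srv labsdb | nc es1017.eqiad.wmnet 4444
```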
[20:10:53] oh, you merged it [20:11:06] I was waiting for the verify looking at: https://integration.wikimedia.org/zuul/ [20:12:04] it has extracted a few databases already, I think it is fine [20:12:19] ok for me [20:12:33] wait for dhcp change to apply [20:12:34] root@es1017:/srv/tmp/labsdb/data# du -sh . [20:12:34] 27G . [20:12:57] (not counting the binlogs) [20:13:59] dhcp updated [20:14:02] ok [20:14:06] let's reimage [20:14:15] * marostegui crosses his fingers [20:14:20] labsdb1005 [20:14:29] you do it or I do it? [20:14:34] I can do it [20:14:36] I shall cross fingers too [20:14:36] oki [20:14:46] oh, the reimage is the easy part [20:14:53] find me the ticket number, please [20:14:56] sure [20:15:25] https://phabricator.wikimedia.org/T157358 [20:15:33] T157358 [20:15:34] T157358: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358 [20:16:14] wmf-auto-reimage -p T157358 labsdb1005.eqiad.wmnet [20:16:18] ^ok ? [20:16:31] I was afraid of doing db1005 by accident [20:16:40] looks good to me [20:16:45] but there is no db1005, so it was not a huge issue [20:17:00] and we'd need to decommission it anyways if it existed :p [20:17:23] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, and 2 others: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['labsdb1005.eqiad.wmnet'] ``` The lo... [20:17:29] it is running now^ [20:17:30] \o/ [20:21:23] I can see it installing [20:21:27] nice! [20:21:41] I think it is jessie [20:21:57] hopefully! [20:32:41] finished, running puppet [20:33:11] yeah I am watching it live too [20:33:12] like a film [20:37:54] do you want to copy it back? [20:37:59] yep [20:38:03] is it back already? [20:38:24] wait [20:38:28] it may restart it once [20:38:30] 2017-02-15 20:38:15 [INFO] (jynus) wmf_auto_reimage::submit_job: Submitted job '20170215203815311253' on target '['labsdb1005.eqiad.wmnet']' with action 'system.reboot' and params '[]' [20:38:34] yep :) [20:38:36] it just did [20:38:51] that is useless for us [20:39:06] it would make sense if the data and mysql were already there [20:39:35] jynus: https://phabricator.wikimedia.org/T136192#3030801 is one of the users whose db isn't on 1004 (it is large). I'm going to respond to them saying it'll be back once maint completes. Is that accurate? [20:40:34] yes, tell them it is under maintenance, and that it will be available in some hours' time [20:41:09] ok [20:41:45] server is back [20:41:53] let's copy stuff back [20:41:56] ok [20:42:09] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, and 2 others: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3030805 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labsdb1005.eqiad.wmnet'] ``` and were **ALL** successful. [20:44:02] started [20:44:31] ETA?
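The partial verification mentioned above (extract a bit, interrupt, check) looks roughly like this; the tarball name is an assumption, while the /srv/tmp paths and the 27G figure are the ones quoted in the log:

```
cd /srv/tmp
tar -tzf labsdb.tar.gz | head      # sanity-check the listing first
tar -xzf labsdb.tar.gz             # ctrl+c once it is past the binlogs
du -sh /srv/tmp/labsdb/data        # 27G of databases extracted so far
```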
[20:44:39] 1:15 [20:44:56] good, will take a break, let's come back in 1 hour [20:45:06] for the "fun" part [20:45:06] yeah, going for a break as well [20:45:09] need some fresh air XD [20:45:32] (user ack'd my response) [20:46:05] good [20:46:15] again, nothing to see here until 1 hour [20:46:37] after that the actual upgrade, which is where things can go wrong [20:46:52] 5.5->10 upgrade [21:17:30] 10DBA: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3030925 (10Marostegui) The first test, with the phabricator_file database, generated a peak of 500 seconds of lag on db1048 and db2012 while checksumming the biggest table of the database: file_... [21:40:24] almost there, 12 minutes [21:42:01] w00t [21:44:07] marostegui, give a look at https://gerrit.wikimedia.org/r/337990 [21:44:58] checking [21:45:19] any particular reason? [21:46:10] well, it has a 5 MB one now [21:46:27] oh, I thought it had 128M XD [21:46:30] oh, no [21:46:36] that is some random file [21:46:40] on /srv [21:47:20] 128M is ok, I suppose [21:47:23] 10DBA, 06Operations, 10ops-codfw: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10Papaul) [21:47:29] I just didn't want only 5 [21:47:55] jynus: I am fine with 500M, I was just wondering why the increase from 128 to 500M, if it was for something specific [21:47:55] we could do a general check of the options, that may have not been checked for a while [21:48:10] especially anything that requires a reboot [21:48:26] what did you do for all the migrations 5.5->10 that you did in the past? [21:48:34] what do you mean [21:48:35] start with --skip-networking and then mysql_upgrade and hope for the best? [21:48:39] I mean in this case [21:48:53] because tools was restricted to 5.5 options for a long time [21:49:21] start with skip networking and --skip-slave-start [21:49:48] then we'll see [21:49:58] * marostegui crosses his fingers again [21:50:07] 3 minutes left for the transfer [21:50:34] 10DBA, 06Operations, 10ops-codfw: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031031 (10Papaul) [21:52:01] 10DBA, 06Operations, 10ops-codfw: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10Papaul) [21:55:10] well, the transfer is done guys [21:57:50] so, can I start the db? [21:57:58] or will you? [21:58:09] I will do it [21:58:21] I am tailing the error log [21:58:41] it is up [21:58:44] yep [21:58:49] complaining about p_s [21:58:55] but that is to be expected [21:59:05] let's run upgrade [21:59:37] ok [22:00:11] running [22:02:09] it may take some time [22:02:16] but notice anything strange [22:02:20] *note [22:02:29] yeah, it is in a screen so we can easily check it [22:02:31] so far so good [22:03:41] unix socket authentication worked nicely [22:04:25] finished [22:04:29] let me scan the log [22:04:34] to see if there is anything strange there [22:04:58] log = output [22:06:06] it all went fine [22:06:26] that is good, no corruption, no anything [22:06:47] let's stop and start with only --skip-slave? [22:06:50] let's restart for the changes to take place, but still skip networking and slave start [22:06:57] oki :) [22:07:02] I do not want anything connecting yet [22:07:18] I do not care for production, but this is so public...
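The 5.5-to-10.0 jump follows the restore-then-upgrade pattern sketched below. The socket path and the use of mysqld_safe are assumptions; the wmf-mariadb10 packaging details are not in the log.

```
# 1. Start the restored 5.5 datadir under the 10.0 binaries, reachable only
#    via the local socket and with replication held off. Complaints about
#    performance_schema (p_s) on first start are expected.
mysqld_safe --skip-networking --skip-slave-start &
# 2. Rebuild the system tables in place for the new version.
mysql_upgrade --socket=/run/mysqld/mysqld.sock
# 3. Restart, still with both skip flags, so the upgraded tables and any
#    changed options take effect before anything can connect.
# 4. START SLAVE manually (--skip-networking only closes the listening
#    port; the outbound slave connection still works) and let it catch up
#    with no users, then restart without the flags to open the port.
```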
[22:07:22] restarting [22:07:34] no errors [22:08:07] so I would start the slave [22:08:20] and I think the filters will cure the replication error we had later [22:08:47] we still have --skip-networking [22:09:02] oh, will that prevent the slave connection? [22:09:24] not sure, we can try to start the slave anyways :) [22:09:32] I thought that only affected the open port [22:09:38] try, please [22:10:00] started [22:10:03] and looking good [22:10:09] that is exactly what I wanted [22:10:13] catch up with no users [22:10:22] dup entry [22:10:30] :-( [22:10:31] is that the one you were expecting? [22:10:35] nope [22:11:02] but this happened to me before and I have not solved the problem [22:11:13] some kind of config incompatibility [22:11:35] but I supposed it was because of different versions [22:13:10] 37746 rows, I would say backup + ignore + start slave [22:13:39] go for it then [22:14:00] let's hope there are no more :| [22:14:08] backed up [22:14:11] let me know if / when you want to switchover tools.labsdb again [22:14:21] yuvipanda, if everything is ok, soon [22:14:55] ready for set global sql_slave_skip_counter = 1 then? [22:15:04] no no [22:15:08] we will not skip [22:15:14] just ignore the whole thing [22:15:14] https://gerrit.wikimedia.org/r/#/c/338012/ is ready to go whenever [22:15:45] not sure what you meant with ignore, sorry :( [22:15:59] you will see now [22:16:02] XD [22:16:14] I take control now, ok? [22:16:21] go ahead [22:18:18] I would say looks good now [22:18:28] but as soon as I say that, something will happen :-) [22:18:41] haha, so what did you do? just jumped to the next position? [22:18:54] made a backup, then ignored the db [22:19:17] when we are up to date, we can discuss details- recover the version on 4 [22:19:21] recover this version [22:19:27] roll in the binlog [22:19:28] etc. [22:19:30] ignored the db? [22:19:33] yep [22:19:44] as I said, only 1 table [22:19:47] how? [22:19:51] oh [22:19:57] that is the simplest part [22:20:10] run show slave status [22:20:13] :-) [22:20:22] aaaah [22:20:24] haha :) [22:20:25] gotcha [22:20:43] I've done too many bad things in the past that you should not do! [22:20:49] hahaha [22:20:57] this is prohibited on production [22:21:08] No, I didn't think of that, I was trying to think of something else, I was like: what is he doing… [22:21:11] XD [22:21:13] but the reason for the whole copy is that we cannot guarantee [22:21:23] users won't break their own db [22:21:34] because myisam, and shooting themselves in the foot [22:21:50] but the rest of the users shouldn't pay for the problems of a few [22:21:52] so ignore that [22:21:53] yeah [22:21:54] agreed [22:22:07] later we put a ticket to recover [22:22:19] or we can have a look at why it happened [22:22:20] 10DBA, 06Operations, 10ops-codfw: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031214 (10RobH) [22:22:27] it is a 33K table [22:22:34] it is not worrying [22:22:35] yeah, it is tiny [22:22:53] it is not like the revision table on commons [22:23:11] that you probably have not run pt-table-checksum on yet [22:23:15] (I have) [22:23:30] ok, looking good [22:23:35] when we catch up [22:23:36] nope, not yet, I have been playing around with m3 today only [22:23:38] we will set [22:23:47] :-) [22:23:51] read only on the master [22:23:55] and do the dance again [22:24:03] and then sleep \o/ [22:24:05] we also have to rememver [22:24:08] *remember [22:24:21] to put the user tables as read-write [22:24:26] true [22:24:44] so what is the order?
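The "backup + ignore" move above, spelled out. A sketch with two caveats: which of the two databases named later hit the first duplicate-entry error is not stated here, and it relies on MariaDB's replication filter variables being changeable at runtime (on servers where they are not dynamic, this needs a my.cnf entry and a restart instead).

```
# 1. Keep a copy of the diverged database before touching anything.
mysqldump --socket=/run/mysqld/mysqld.sock s52004__hocr > /srv/tmp/s52004__hocr.sql
# 2. Filter the database out of replication and resume; the db then shows
#    up under Replicate_Ignore_DB, which is why "run show slave status"
#    gives the trick away.
mysql -e "STOP SLAVE;
          SET GLOBAL replicate_ignore_db = 's52004__hocr';
          START SLAVE;"
```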
[22:24:57] catch up + read only + restart [22:25:12] I would say read_only on the master, change replication again, and then set the users to writable [22:25:20] well, yes, catch up of course first [22:28:15] but we need to restart [22:28:19] but the skip stuff [22:28:21] *for [22:28:26] yeah [22:28:44] probably read only first [22:28:49] or we will be here forever [22:28:56] 3 minutes more in read only will not hurt [22:28:59] yeah, let's do that [22:29:05] let's restart once it caught up [22:29:17] and once we've set read_only [22:29:59] oops, replication broken [22:30:01] yeah [22:30:56] same stuff [22:30:59] I would say [22:31:10] it is a bigger table, but let's do the same [22:32:04] worst case scenario, it is a backup from 20 minutes ago [22:32:38] yes [22:33:44] ok, let's set 4 as read only [22:34:15] ok, I will do it [22:34:24] just did [22:34:27] ah :) [22:34:31] ok, now [22:34:38] 10DBA, 06Operations, 10ops-codfw: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031361 (10RobH) @Papaul: The task description currently has a section for db70 reading: db2070 [] - setup new port configuration ge-6/0/18 [] - remove old port configuration ge-5/0/ However, I cur... [22:34:41] let's break replication [22:34:42] let's use the etherpad [22:34:43] yep [22:34:49] https://etherpad.wikimedia.org/p/labs-migration [22:34:57] binlogs [22:35:24] yep, we are in agreement :) [22:36:09] doing reset slave on 5 [22:36:24] ok [22:36:34] yuvipanda: you around? not yet but coming soon :) [22:38:01] marostegui: kk [22:38:20] :) [22:38:32] marostegui, can you check the change master params? [22:38:38] doing it [22:38:47] looks good [22:39:07] I am running that on 4 [22:39:16] excellent [22:39:29] if I find the password :-) [22:40:07] I can leave it on a screen if you like [22:40:17] I have it here [22:40:36] error connecting [22:40:43] ah [22:40:44] yes [22:40:45] can you give it a second look? [22:40:47] we didn't restart [22:40:57] he he [22:40:59] good catch [22:41:04] I was getting nervous [22:41:14] let's restart, in read write already [22:41:17] 1005 [22:41:22] ok [22:41:34] pending: change the databases to be writeable [22:41:42] yes [22:41:46] I can do that at the same time [22:42:14] ok [22:44:00] so did you restart [22:44:08] or were you waiting on me? [22:44:13] I was waiting, let me do it [22:44:21] yes, sorry [22:44:23] I wasn't clear [22:44:31] I was doing the chowns [22:44:33] no worries, better be safe than sorry at this point of the night :) [22:44:35] we can restart [22:44:40] restarting [22:44:44] you mean so early? [22:44:51] haha [22:44:53] restart done [22:45:21] slave connected [22:45:25] (labsdb1004) [22:45:55] yea [22:46:01] we will see in the future [22:46:10] because those 2 dbs [22:46:34] are we at this point ready for yuvipanda to push that change? [22:46:42] yes, I think we are [22:47:15] good! [22:48:04] yuvipanda [22:48:14] yeah rebasing now [22:48:18] sorry [22:49:57] puppet run is happening now [22:50:00] eta about 1min 30s [22:50:08] \o/ [22:51:34] I will do a backup of these 2 dbs [22:51:40] on the slave [22:51:43] ok [22:51:44] to have another copy [22:54:56] restart of nscd ongoing [22:55:02] :) [22:57:44] jynus: marostegui done [22:57:53] \o/ [22:57:55] cache issues, etc? [22:58:20] jynus: we did the best we could [22:58:24] I will kill existing connections, if any, to the again-slave [22:58:28] which is to restart nscd [23:02:40] I can see writes happening on 1005 just fine [23:02:41] :) [23:03:19] is templatetiger and stuff back?
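For completeness, the switch-back dance above condensed into one sequence; the same caveats as before apply (the replication user, flags, and etherpad coordinates are placeholders):

```
# 1. Freeze labsdb1004 (the temporary master), let labsdb1005 catch up.
mysql -h labsdb1004.eqiad.wmnet -e "SET GLOBAL read_only = ON;"
# 2. Tear down the old direction, noting coordinates on the etherpad.
mysql -h labsdb1005.eqiad.wmnet -e "STOP SLAVE; RESET SLAVE ALL;"
# 3. Restart mysqld on labsdb1005 WITHOUT the skip flags; the "error
#    connecting" above was labsdb1004 trying to reach a port that
#    --skip-networking still kept closed.
# 4. Point labsdb1004 at labsdb1005, reopen writes, and chown the blocked
#    user databases back to mysql:mysql.
mysql -h labsdb1004.eqiad.wmnet -e "CHANGE MASTER TO
    MASTER_HOST='labsdb1005.eqiad.wmnet', MASTER_USER='repl',
    MASTER_PASSWORD='<secret>', MASTER_LOG_FILE='<file from etherpad>',
    MASTER_LOG_POS=<pos>; START SLAVE;"
chown -R mysql:mysql /srv/labsdb/data/s51412__data   # and the other two
mysql -h labsdb1005.eqiad.wmnet -e "SET GLOBAL read_only = OFF;"
```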
let me look [23:03:40] yes, it should [23:03:51] everything that was originally there [23:04:03] especially templatetiger [23:04:19] that was never switched over [23:04:24] yeah [23:04:26] I see it now [23:04:58] so, there are 2 dbs that could be weird [23:05:12] right now, reverted hours or minutes [23:05:31] but we have not lost any data, we just need to know how the user wants it [23:06:13] s52004__hocr and s52421__commonsdelinquent_p [23:07:03] in what sense? [23:09:30] jynus: I'll respond to cyberpower. He has a history of blaming infrastructure for things and is not a very good use of your time [23:09:38] yes [23:10:15] those databases were reverted a few minutes in time [23:10:32] so we were not blocked by replication failures [23:10:48] we want the users to see if they are ok with the current state [23:11:11] and if they say "I am missing rows", we can add them (we have those logged) [23:11:21] we have like 7 backups [23:11:33] at different points in time [23:12:38] I will add a ticket for the users to clarify that [23:12:43] jynus: thanks! [23:12:43] so do not worry about that [23:13:08] jynus: is the window and stuff over now? or are we upgrading 1004 now too? [23:13:10] again, nothing was lost, it is that we needed to do some things in order to ensure the consistency of the db for other users [23:13:32] yeah, and it was no more than 3-4 minutes I believe [23:13:32] 1004 is already jessie and 10.0.29 [23:14:30] aaaaah [23:14:35] I didn't realize that [23:14:50] I "do not need you" :-) [23:14:53] to upgrade that [23:14:55] :( [23:14:56] right [23:15:03] because normally it is not active [23:15:19] we should focus next on the osm machines [23:15:27] jynus: right. [23:15:30] which I think nobody knows much about [23:15:41] Do you guys think we are done now for this maintenance? [23:15:45] alex knows most about them but he's out for a while [23:15:53] yep [23:16:31] then I think I am going to go to bed :) [23:16:41] Thank you guys for all the smooth maintenance!! Very well done :) [23:17:20] jynus: marostegui \o/ ok [23:17:25] so, should we send an email, yuvipanda ? [23:17:32] jynus: am doing that now [23:17:51] let me read it before you send it [23:19:09] jynus: > This is complete, and we only had a few minutes of it being readonly. Let us know if anything is amiss! Thanks :) [23:19:10] just that [23:20:40] should we mention the couple of accounts I will contact with a ticket? [23:20:54] or is it ok to just create the tickets [23:21:01] maybe it is better to do that by ticket? [23:21:04] comment that we are now on MariaDB 10 [23:21:10] that is good stuff [23:21:13] yeah! [23:21:22] some people depended on that [23:21:38] jynus: just ok to create tickets, and I can try to contact them elsewhere as well. Doing it in announce will probably lead to a lot of people pinging back and assuming this refers to their db [23:21:50] tools db is now on 10.0.29 [23:21:56] just add that [23:21:59] (MariaDB) [23:22:39] > This is complete, and we only had a few minutes of it being readonly. We are now running on MariaDB 10, so new features might be available to you! [23:22:39] Thanks to DBAs (Jynus / marostegui) for pushing it through! Let us know if anything is amiss! Thanks :) [23:22:50] marostegui, good night [23:23:08] good night guys, well done :) [23:23:08] going to send it [23:23:18] thanks, yuvipanda, for all the help [23:23:40] jynus: have a good night's sleep! Let's figure out postgres later.
[23:24:41] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, and 2 others: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3031456 (10jcrespo) This is mostly done, no major incidents- servers were only in read-only for a few seconds before and after the maintenance, for switc... [23:37:32] 10DBA, 06Operations, 10ops-codfw: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031499 (10Papaul)