[02:34:46] 10DBA, 10MediaWiki-Database, 10Patch-For-Review, 10Schema-change: clean up timestamp database fields - https://phabricator.wikimedia.org/T42626#3791135 (10Krinkle) Looks like a schema change patch exists at . Unsure about its status and/or opinion from DBAs, and in r...
[06:20:06] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3791378 (10Marostegui) Hey Anomie, No problem. I only finished dbstore1001 and dbstore1002 for s2. So I am happy to start with s3...
[06:23:33] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3791381 (10Marostegui)
[08:06:10] why reimage db1099?
[08:06:30] to convert it to multi-instance
[08:06:35] as it was the plan
[08:07:12] yes, but reimage? Is it jessie?
[08:07:30] We are going for stretch+10.1 for the multi-instance
[08:07:53] can't it be upgraded in place, keeping the existing data?
[08:08:14] I am keeping the existing data by copying it to dbstore1001 and then back
[08:08:17] I prefer a full reimage
[08:08:25] ok, you are doing the extra work
[08:08:32] so nothing to say about it
[08:08:45] I prefer a full reimage if possible
[08:12:33] :-)
[08:13:10] Don't really get the meaning of: "so nothing to say about it" ;)
[08:13:25] that I will not complain
[08:13:35] Ah ok ok
[08:17:05] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3791486 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1099.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017112...
[08:20:09] heads up for https://gerrit.wikimedia.org/r/#/c/393723/
[08:34:08] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3791572 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1099.eqiad.wmnet'] ``` and were **ALL** successful.
[08:41:18] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3791593 (10Marostegui) The following wikis have to be excluded from the ALTERs in s3, as they are new and were already created with...
[08:58:54] hello people
[08:59:16] I did some last-minute sanity checks, the log db on dbstore1002 can be dropped
[08:59:29] \o/
[08:59:33] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3791657 (10Marostegui) I would appreciate a review for: https://gerrit.wikimedia.org/r/#/c/393725/ @jcrespo @Anomie There is still...
[08:59:50] elukey: do you want to do it yourself or do you want me to?
[09:00:37] marostegui: if you have time that'd be great, just to be sure that I don't mess with anything
[09:00:48] ok
[09:00:52] I checked 'show full processlist \G' and the processes running
[09:01:00] yesterday I should have removed all the eventlogging stuff
[09:01:11] but if you want to triple check it would be great :)
[09:01:19] ok
[09:01:20] let me see
[09:02:26] ah I think we can also merge https://gerrit.wikimedia.org/r/#/c/393597/1 ?
[09:03:08] did you run a puppet compiler for db1047, db1046 and dbstore1002 for that change, just in case?
[09:03:41] ok, let's drop log from dbstore1002 then?
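(A minimal sketch of the kind of pre-drop sanity check being discussed here; the schema name `log` comes from the conversation above, and the exact queries are illustrative rather than a record of what was actually run:)
```
-- Check that no session is still using the `log` schema before dropping it
SELECT id, user, host, db, command, time
FROM information_schema.processlist
WHERE db = 'log';

-- Rough estimate of the space the drop should free, per table
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'log'
ORDER BY size_gb DESC;
```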
[09:04:10] marostegui: going to do it now (the pcc)
[09:04:16] good
[09:04:17] yeppa, let's drop it!
[09:04:18] \o/
[09:04:47] root@DBSTORE[(none)]> set session sql_log_bin=0;
[09:04:47] Query OK, 0 rows affected (0.00 sec)
[09:05:11] drop database if exists log;
[09:05:11] ok?
[09:05:27] looks good?
[09:05:38] yep
[09:05:43] there we go
[09:05:51] running
[09:06:02] root@DBSTORE[(none)]> drop database if exists log;
[09:06:03] Query OK, 379 rows affected (9.01 sec)
[09:06:42] \o/
[09:07:07] 10DBA, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3791689 (10Marostegui) ``` root@dbstore1002:~# mysql --skip-ssl Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB...
[09:07:12] marostegui: https://puppet-compiler.wmflabs.org/compiler02/9003/
[09:07:15] seems good
[09:07:16] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=dbstore1002&var-network=eth0
[09:07:21] no space gained yet
[09:07:28] now
[09:07:30] I do see space
[09:07:31] yeah
[09:07:44] \o/
[09:08:00] elukey: what about db1046?
[09:08:07] sorry to be a pain XD
[09:08:16] ahahah sure I can run it now
[09:09:54] marostegui: noop - https://puppet-compiler.wmflabs.org/compiler02/9004/
[09:11:10] ah last thing, I also completed https://gerrit.wikimedia.org/r/#/c/393220/3/modules/role/files/prometheus/mysql-misc_eqiad.yaml
[09:11:19] that was missing s/db1047/db1108
[09:11:53] cool
[09:13:14] it was a long road, people, but with your help we made it, thanks a lot!
[09:14:43] 10DBA, 10Performance-Team, 10MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), 10Patch-For-Review, 10Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3791691 (10jcrespo)...
[09:40:13] marostegui: one thing we may have to change is how we do schema changes
[09:40:31] what do you mean?
[09:40:35] the order for the next ones maybe should be vertical
[09:41:00] s3 codfw -> s3 eqiad -> s2 codfw -> s2 eqiad -> etc
[09:41:18] that is how we do it now, right?
[09:41:36] I think we do all of codfw
[09:41:42] and then all of eqiad
[09:41:47] yes
[09:41:57] I am not understanding what you mean I think :)
[09:41:59] I mean whole shards at a time, on both dcs
[09:42:05] Aaaaaah
[09:42:09] Right right
[09:42:14] we do not start with another shard until one is complete
[09:42:17] yeah
[09:56:59] I am going to move db1095:3315 below db1082
[09:57:31] cool!
[09:59:01] then I will stop db1082 and db1087 in sync and create the link to s8 and the filters
[10:00:47] 10DBA, 10Data-Services: Provide a new s8 master for sanitarium - https://phabricator.wikimedia.org/T177274#3791741 (10Marostegui)
[10:00:50] correction, I am going to move db1095:3306, default-master-connection=s5
[10:01:25] hehe that is what I understood from your comment; by saying 3315, I assumed you meant s5 :)
[10:05:53] I am not sure I can stop db1082 and db1087 in sync easily
[10:07:07] oh right, they are replicating from different masters
[10:07:29] maybe move them to the same level, do the change and then move one of them back to s8?
[10:07:44] marostegui: cannot
[10:08:06] db1095 needs the filters in place
[10:09:04] I will stop the servers, replicate until a heartbeat and sync them like that
[10:09:20] Ah
[10:09:22] I see what you mean
[10:09:39] they have different heartbeats?
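(A sketch of what "replicate until a heartbeat and sync them like that" can look like in practice, assuming pt-heartbeat's default heartbeat.heartbeat table; the binlog file name and position below are placeholders:)
```
-- On each replica, see how far it has replicated according to pt-heartbeat
SELECT server_id, ts, file, position
FROM heartbeat.heartbeat
ORDER BY ts DESC
LIMIT 5;

-- Then stop both replicas and let them run up to the same master position,
-- so they can be re-linked from a known common point:
STOP SLAVE;
START SLAVE UNTIL
    MASTER_LOG_FILE = 'db1070-bin.001234',  -- placeholder file/position
    MASTER_LOG_POS  = 98765432;
-- SHOW SLAVE STATUS reports Until_Condition while this is in effect
```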
[10:09:43] filters would be, for s5: existing + wikidatawiki.%,heartbeat.%
[10:09:57] for s8: existing + dewiki.%
[10:10:03] that is on ignore
[10:10:11] yeah, I was going to ask :)
[10:10:15] yes, that makes sense
[10:10:30] so s8 replicates the heartbeat for the 2
[10:10:47] which is not ideal because if s5 breaks,
[10:10:59] yes, but that will allow you to do the change you were saying
[10:11:23] well, db1082 can break and not db1087
[10:12:09] on the other hand, when db1087 stops replicating s5
[10:12:23] heartbeat will not be updated on failover
[10:13:01] and if I get any step wrong, I trash 4 servers
[10:13:53] should we maybe put all the steps in an etherpad and review them carefully?
[10:14:03] or if you are confident, no need :)
[10:18:07] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3791768 (10Marostegui) db1099:3318 is now replicating
[10:18:43] I honestly would prefer not to set up anything before the failover
[10:19:08] on switchover, we have perfectly gathered coordinates
[10:19:28] and if anything happens, a reboot, etc., we may lose filters
[10:19:36] that's true
[10:19:38] I prefer to replicate on s5
[10:19:47] and with time
[10:20:04] and during the failover everything will be stopped so it is just easier to do the change
[10:20:07] we set up s8 once s5 actually stops replicating it
[10:20:09] with no filters
[10:20:17] marostegui: yes
[10:20:32] but even if we do it later, it will be easier, I think
[10:20:49] I do not like adding complexity
[10:21:00] if it doesn't remove it later
[10:21:04] and this doesn't
[10:21:20] yeah
[10:21:28] and the change will not add a lot more time during the failover
[10:21:32] on switchover there will be no filter
[10:21:44] just a new connection
[10:21:47] exactly
[10:22:45] I would prefer to take the time to separate s5 and s8 into separate instances
[10:22:51] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3791773 (10Marostegui)
[10:24:32] to create one connection per shard you mean?
[10:25:34] divide db1095 into several instances, as we plan to do
[10:25:53] ah yes
[10:26:16] Oh I just remembered that we have to move db1095:s3 under db1072
[10:26:18] I will try to do that this week
[10:26:21] although that also cannot happen before the split
[10:27:19] I can do that if db1072 is ROW already
[10:27:50] yeah, it is ROW, although I would double check the binlog instead
[10:27:54] to be 100% sure
[10:27:56] but it should be
[10:28:14] s/instead/itself
[10:28:26] yes, it is
[10:28:30] at least the last log
[10:32:11] can I depool db1044 indefinitely?
[10:33:03] yeah!
[10:33:29] once db1095 is moved under db1072, let's leave it running for a week and then decom db1044, I would suggest
[10:36:34] by stopping db1044, it is much easier
[10:36:56] because I cannot replicate directly from the master - technically, I can, but I will not
[10:37:11] yeah, just stop them in sync and voila
[10:37:25] * marostegui out for errand+early lunch
[10:37:31] actually not, I cannot stop db1072
[10:37:36] but it is easier
[10:41:33] 10DBA, 10Toolforge: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486#3791841 (10Magnus)
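(On the "double check the binlog itself" point a few lines up: a minimal way to verify that a host really writes ROW-based binlogs; the binlog file name here is a placeholder:)
```
-- Configured value
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- And the binlogs themselves: with ROW you should see Table_map,
-- Write_rows, Update_rows and Delete_rows events rather than raw statements
SHOW BINARY LOGS;
SHOW BINLOG EVENTS IN 'db1072-bin.002345' LIMIT 30;  -- placeholder file name
```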
[10:44:02] 10DBA, 10Toolforge: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486#3791854 (10Magnus) Addendum: ``` MariaDB [wikidatawiki_p]> select @@hostname; +------------+ | @@hostname | +------------+ | labsdb1003 | +------------+ 1 row in set (0.00 sec) ```
[10:47:37] 10DBA, 10Toolforge: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486#3791841 (10jcrespo) @Magnus -unless you can confirm the same happens on the new hosts (labsdb1009, 10 or 11), I will close this as "won't fix"- new labsdb hosts should be used, and the old ones are o...
[10:56:50] 10DBA, 10Toolforge: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486#3791908 (10Magnus) "sql" is the recommended shell command to connect to the replicas on Toolforge, correct? "sql wikidata" connected me to labsdb1003. I do expect the default settings to give me so...
[10:58:55] 10DBA, 10Toolforge: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486#3791911 (10Magnus) OK: ``` MariaDB [wikidatawiki_p]> select @@hostname; +------------+ | @@hostname | +------------+ | labsdb1011 | +------------+ 1 row in set (0.01 sec) MariaDB [wikidatawiki_p]>...
[11:15:41] 10DBA, 10Toolforge: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486#3791960 (10jcrespo) > I do expect the default settings to give me something useful. Is that too much to ask? That is certainly a bug, but it was not the one originally reported "Toolforge Wikidata d...
[11:21:42] 10DBA, 10Toolforge: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486#3791989 (10jcrespo) Production masters provide the same result, empty sets- ``` root@db1070[(none)]> use wikidatawiki Database changed root@db1070[wikidatawiki]> SELECT * FROM wb_terms where term_f...
[11:37:37] hoo, Amir1: if you have one second, I have some complaints about missing rows on wikidata
[11:37:52] T181486
[11:37:53] T181486: Toolforge Wikidata database replica corruption? - https://phabricator.wikimedia.org/T181486
[11:38:01] it could be just a misunderstanding
[11:38:28] but I do not know much about the table to say if something is wrong
[11:49:37] jynus: I'm around
[11:49:41] let me see
[11:57:25] jynus: Added my notes there, hope that's helpful
[11:57:42] cool
[11:57:44] thanks
[11:57:59] can that happen because of jobs failing?
[11:58:07] or during periods of read only?
[11:59:43] I will let you decide if you want to resolve or follow up - either you or someone else
[12:12:18] jynus: probably because of readonly or stuff like that, it can happen and there should be some sort of safety measures
[12:12:27] I will talk to our PM to see what can be done
[13:48:16] 10DBA, 10Data-Services: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser - https://phabricator.wikimedia.org/T179628#3731402 (10chasemp) Thanks @jcrespo for explaining. Meta point I see here that we keep coming back to and that any variance from has created large hurdles is `Replica databases are sup...
[14:28:58] 10DBA, 10Data-Services: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser - https://phabricator.wikimedia.org/T179628#3792548 (10jcrespo) Yes, `CREATE TEMPORARY TABLES` can create lag on the replicas for long running queries (locking).
[14:31:24] https://tendril.wikimedia.org/host/view/db2039.codfw.wmnet/3306
[14:31:28] See replication lag
[14:31:36] also dbstore1002 broke again
[14:31:39] yeah
[14:31:45] is that the table that also broke last week, no?
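(A sketch of one common way to repair this kind of single-row replication breakage locally, without writing the fix into the binlog; the table and values below are purely illustrative, not the actual row involved here:)
```
-- Apply the missing/conflicting row only on the broken replica
SET SESSION sql_log_bin = 0;
REPLACE INTO enwiki.pagelinks (pl_from, pl_namespace, pl_title)
VALUES (12345, 0, 'Example_page');
SET SESSION sql_log_bin = 1;

-- Then resume replication and check that it catches up
START SLAVE;
SHOW SLAVE STATUS\G
```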
[14:32:10] all those *link tables are not very reliable on tokudb slaves
[14:32:30] do you want me to fix it and reimport it?
[14:32:59] I am not sure if it is worth reimporting it, as we should reload the whole thing
[14:35:28] I was thinking maybe we should reimport in innodb?
[14:35:30] those tables that fail?
[14:36:34] nah, the error is old
[14:36:39] it is not because it is tokudb
[14:37:07] but because of crashing/not being well maintained
[14:37:13] both 1001 and 1002 are unreliable
[14:37:18] then I guess it is easier to fix it than wasting time reimporting
[14:37:54] well, first it has to be fixed no matter way
[14:37:56] *what
[14:38:06] because we need it to replicate
[14:38:38] yes
[14:44:38] it was a single row out of a set of 20
[14:45:05] so, do I leave db1044 replicating or stopped, as it is now?
[14:45:06] Pretty much like last week
[14:45:24] Sorry to be nitpicking on the patch, it is just a quirk I have
[14:45:32] I don't care
[14:45:49] as in, tell me which way you prefer
[14:46:12] I would leave it replicating
[14:46:16] ok
[14:46:31] But I am not saying we should do it my way :)
[14:47:00] oh, no, we are doing it your way - which means you will be decommissioning it :-)
[14:47:05] haha
[14:48:41] I will now merge https://gerrit.wikimedia.org/r/393755
[14:49:32] go for it
[14:54:28] 10DBA, 10Operations, 10Patch-For-Review: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3792646 (10jcrespo)
[14:54:42] 10DBA, 10Operations, 10Patch-For-Review: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#2714228 (10jcrespo) a:03Marostegui
[15:38:37] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 10TCB-Team, and 10 others: Allow setting the watchlist table to read-only on a per-wiki basis - https://phabricator.wikimedia.org/T160062#3792739 (10Addshore) @Legoktm any chance you can take another look at this? :)
[15:47:21] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 10TCB-Team, and 11 others: Allow setting the watchlist table to read-only on a per-wiki basis - https://phabricator.wikimedia.org/T160062#3792759 (10Addshore)
[16:19:48] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3792931 (10Marostegui) db1099:3311 is now replicating
[16:19:59] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3792932 (10Marostegui)
[16:27:23] not sure if you are still around, but as we may have time, should we get rid of the extra s3 multi-instance server?
[16:27:41] o/
[16:27:48] not now
[16:27:57] to put on the TODO etherpad
[16:28:03] we do not have any multi-instance on s3
[16:28:07] on codfw yes
[16:28:09] in eqiad no
[16:28:11] on codgge
[16:28:13] arg
[16:28:15] codfw
[16:28:15] haha
[16:28:16] yeah
[16:28:20] agreed
[16:28:30] I would like to have the least amount of servers
[16:28:43] with redundancy
[16:28:47] we can actually get rid of both
[16:28:49] given the current situation
[16:28:55] well, I mean 1 physical
[16:28:57] 2 instances
[16:29:09] basically, merging the 2 with s3
[16:29:26] 85 and 92, I think
[16:29:31] I would leave it like eqiad
[16:29:40] yes, that is the idea
[16:30:00] well, not sure if the pairs are exactly the same
[16:30:08] but we can check, and set it equally
[16:30:23] No, I mean, s3, only with 2 big ones, like eqiad
[16:30:30] yes
[16:30:36] that is the idea
[16:30:42] well, 3 big ones
[16:30:48] with the master
[16:30:51] and a slow one
[16:30:57] yeap, didn't count the master
[16:31:06] not sure if sizes match fully
[16:31:12] because we have more servers
[16:31:17] but less powerful ones
[16:31:26] I will add it to the todo
[16:31:36] but I will do the eqiad config first
[16:31:47] I want to clean up eqiad.php
[16:32:00] it is a pain right now when there is an ongoing issue
[16:32:05] yeah, it needs a big clean-up
[16:32:12] actually, s3 is the only one that looks clean XD
[16:32:45] he he
[16:32:47] yeah
[16:32:53] that is why I thought about this
[16:32:59] I will update the etherpad
[16:33:02] good!
[16:33:04] thanks
[16:33:05] and take a look at the tickets
[17:22:33] https://mysqlserverteam.com/mysql-8-0-1-using-skip-locked-and-nowait-to-handle-hot-rows/ is interesting indeed. Though that reservation example is a bad one really, IMO (e.g. a user workflow with multiple HTTP requests mapping to a long-running DB transaction, heh).
[17:22:49] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3793300 (10Cmjohnson) Replaced disk 3 and it's rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: Online, Spun...
[17:24:07] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3793321 (10Marostegui) Thanks!
[19:22:53] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3794049 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good - thanks Chris! ``` root@db1051:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[19:29:46] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3794066 (10Papaul) @Marostegui Tracking information shows 10:30am CT as the delivery time and it is almost 2pm. I contacted UPS; they let me know that due to the past holidays the package will not be delivered un...
[19:46:43] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3794136 (10Marostegui) No problem at all! Thanks for the heads up!
[19:50:29] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3794148 (10Papaul) a:05Papaul>03Marostegui @Marostegui it looks like UPS gave me the wrong timing; I got the part. Disk replacement complete
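(On the mysqlserverteam.com post linked above at 17:22, a compact illustration of what SKIP LOCKED and NOWAIT do with hot rows; MySQL 8.0 syntax, and the `seats` table is just an illustrative schema in the spirit of the post, not anything that exists here:)
```
START TRANSACTION;

-- Grab one free seat, silently skipping rows that other transactions hold locked
SELECT seat_no FROM seats
WHERE booked = 0
ORDER BY seat_no
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- Or fail immediately instead of waiting on a hot row
SELECT seat_no FROM seats WHERE seat_no = 42 FOR UPDATE NOWAIT;

COMMIT;
```
The locks only live as long as the transaction, which is exactly why the reservation example criticised above breaks down once the "reservation" spans several HTTP requests.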