[06:19:26] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3784765 (10Marostegui) a:03Papaul @Papaul can we get this replaced? Thanks!
[08:28:49] Can I get a review for: https://gerrit.wikimedia.org/r/#/c/393175/
[09:45:37] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3785079 (10Marostegui)
[10:07:51] Unable to find pt-heartbeat row for {db_server}
[10:07:56] there are ongoing production errors
[10:08:10] must be db1092
[10:08:17] on enwiki
[10:08:20] enwiki??
[10:08:40] db2080
[10:08:45] ok, it is not eqiad
[10:09:07] but how is enwiki failing if we are not touching it?
[10:09:21] but yes, 9 errors on codfw on enwiki
[10:09:43] like the other day, maybe?
[10:09:59] but we haven't moved any enwiki host to s5 in codfw, no?
[10:10:23] the ones I have seen are not multi-instance slaves
[10:10:35] db2080, db2081, db2079
[10:10:43] those are not multi-instance
[10:10:44] I think they are enwiki domains
[10:10:48] but wikidata databases
[10:10:58] see the servers
[10:11:18] yeah, those are s8
[10:12:08] and why is db1092 not failing?
[10:12:16] for the same error
[10:12:35] on codfw, wikidata is pooled on s8
[10:12:40] on eqiad, it is on s5
[10:12:49] Aha!
[10:14:43] I need to change the topology
[10:15:27] ?
[10:17:52] can you depool all s8 slaves on eqiad?
[10:17:58] yep
[10:18:00] doing it now
[10:18:50] I will restart db2085 and db2086
[10:21:56] https://gerrit.wikimedia.org/r/#/c/393204/
[10:24:23] we have enough resources left, right?
[10:24:54] we will leave them out for the weekend?
[10:24:56] maybe we should
[10:25:01] if so, let me give some more api resources
[10:25:01] pool
[10:25:09] db1100 as api, main hybrid
[10:25:16] yes, i was thinking about that
[10:25:27] or mostly api
[10:25:35] with like 50 main
[10:25:40] yeah, agreed
[10:25:48] "we will leave them out for the weekend?" most likely not
[10:25:53] ok
[10:26:01] but we need them depooled for the topology changes
[10:26:21] new patch sent
[10:32:17] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3785176 (10Marostegui)
[10:33:34] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3762970 (10Marostegui)
[10:34:00] do you want me to downtime those hosts for 2 hours?
[11:31:37] 10DBA: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785328 (10jcrespo)
[11:33:32] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785328 (10Marostegui) The idrac console showed when I logged in: ``` [30996087.770298] megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0. [30996102.596599] megaraid_sas 0000:03:00.0:...
[11:34:14] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785346 (10Marostegui) dmesg: ``` [Fri Nov 24 10:14:53 2017] TCP: request_sock_TCP: Possible SYN flooding on port 5666. Sending cookies. Check SNMP counters. [Fri Nov 24 10:17:06 2017] INFO: task jbd2/sda1-8:934 blocked for...
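The "Unable to find pt-heartbeat row" errors above come down to which shard name the master's pt-heartbeat writes into the heartbeat table (wikidatawiki is s8 in codfw but still s5 in eqiad at this point). A minimal sketch of that kind of check, assuming a heartbeat.heartbeat table with a `shard` column as the --shard option mentioned later in the log suggests; the host name and credentials are placeholders, and this is not the production lag check:

```python
#!/usr/bin/env python3
"""Sketch: list the freshest heartbeat row per shard on a replica, to see
which shard names it is actually receiving rows for. Assumes a
heartbeat.heartbeat table with a `shard` column (see --shard later in
this log); host/credentials below are placeholders."""
import pymysql


def heartbeat_rows(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password,
                           database='heartbeat')
    try:
        with conn.cursor() as cur:
            # Freshest timestamp per shard name: a lag check expecting an
            # 's8' row fails with "Unable to find pt-heartbeat row" if the
            # masters above this replica only write rows tagged 's5'.
            cur.execute("SELECT shard, MAX(ts) FROM heartbeat GROUP BY shard")
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == '__main__':
    # Placeholder host/credentials; point it at the replica showing the error.
    for shard, ts in heartbeat_rows('db2080.codfw.wmnet', 'check', 'xxx'):
        print(shard, ts)
```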
[11:36:04] 10DBA, 10Operations, 10ops-codfw: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3785349 (10Marostegui)
[11:36:06] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785348 (10Marostegui)
[12:19:00] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785509 (10jcrespo) I would do a quick data check on enwiki around the time of the issue (compare.py) to see that no data has been lost, but other than that, this is fixed.
[12:28:31] I am going to test the master change procedure on db1063
[12:28:45] ok!
[12:30:52] oh, I do not see s8 entries on db1071
[12:31:03] what do you mean s8 entries?
[12:31:10] on heartbeat table
[12:31:49] I think 71 is inserting s5 entries
[12:32:00] because pt-heartbeat has not been restarted
[12:32:06] --shard=s5
[12:32:07] yeah
[12:32:09] you are correct
[12:32:22] well, that would break slaves the same way
[12:32:59] I think I can stop it
[12:33:21] and shouldn't break any host
[12:33:40] as s8 hosts should get their entries from codfw
[12:35:04] yeah, stop heartbeat (and disable puppet), move the s8 eqiad hosts, then reenable puppet
[12:35:36] 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3785521 (10Marostegui) >>! In T162593#3599903, @jcrespo wrote: > I am not fairly confident that the main tables on most relevant servers are the same. I have not checked and fixed every table and every server, but it sho...
[12:35:57] gah
[12:35:59] wrong ticket
[12:36:25] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785536 (10Marostegui) >>! In T181293#3785509, @jcrespo wrote: > I would do a quick data check on enwiki around the time of the issue (compare.py) to see that no data has been lost, but other than that, this is fixed. I am r...
[12:41:30] shout if you see any issue: https://tendril.wikimedia.org/tree
[12:41:38] (s5 tree)
[12:42:31] checking that
[12:42:33] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785559 (10Marostegui) I have compared the last value from enwiki at: `171123 22:16:12` (155663487) till the last one I just selected from the table (155702786). And no differences were found. Servers compared: es2018 with e...
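A minimal sketch of the kind of range comparison described in the comment just above. This is not the actual compare.py script; the table and column names, the second host and the credentials are placeholders (the comment is truncated before naming the other server), and only the id range 155663487-155702786 is taken from the log:

```python
#!/usr/bin/env python3
"""Sketch: chunk a primary-key range on two hosts, checksum each chunk on
both sides and report chunks that differ. Placeholders: table/column names,
the second host and the connection credentials."""
import pymysql

CHUNK = 10000


def chunk_checksum(conn, table, pk, lo, hi):
    with conn.cursor() as cur:
        # Order-insensitive aggregate checksum over the chunk.
        cur.execute(
            f"SELECT COUNT(*), BIT_XOR(CRC32(CONCAT_WS('#', {pk}, blob_text))) "
            f"FROM {table} WHERE {pk} BETWEEN %s AND %s", (lo, hi))
        return cur.fetchone()


def compare_range(host_a, host_b, table, pk, start, end):
    a = pymysql.connect(host=host_a, user='check', password='xxx', database='enwiki')
    b = pymysql.connect(host=host_b, user='check', password='xxx', database='enwiki')
    for lo in range(start, end + 1, CHUNK):
        hi = min(lo + CHUNK - 1, end)
        if chunk_checksum(a, table, pk, lo, hi) != chunk_checksum(b, table, pk, lo, hi):
            print(f"difference in {table}.{pk} [{lo}, {hi}]")


if __name__ == '__main__':
    # Id range quoted in the check above; hosts/table are placeholders.
    compare_range('es2018.codfw.wmnet', 'es20XX.codfw.wmnet',
                  'blobs_table', 'blob_id', 155663487, 155702786)
```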
[12:43:16] I have only migrated db1063
[12:43:22] yeah
[12:43:24] will do the rest if I see no problem
[12:43:30] I was checking that it is indeed STATEMENT, db1071
[12:43:31] they should be depooled
[12:43:33] looks good :)
[12:43:36] yes, they are depooled
[12:43:39] yeah, anything
[12:43:42] config misses
[12:43:47] errors on mediawiki
[12:44:01] pt-heartbeat for s8 is not working but that is known
[12:44:30] it is needed for practical reasons, will be reenabled soon
[12:45:40] oh, I have to delete the existing s5 record
[12:45:56] I think it will be updated and not inserted
[12:46:43] other than that, I cannot see anything wrong
[12:48:14] now that I remember, I have not started replication on es1014
[12:48:33] should I do that, or better reset it in the only host that is still replicating codfw -> eqiad
[12:48:59] that would be es1011
[12:49:19] ah yeah, let's stop it there
[13:06:35] 10DBA, 10Operations, 10ops-codfw: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3785600 (10Marostegui)
[13:06:37] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785599 (10Marostegui) 05Open>03Resolved
[13:12:29] 10DBA, 10Patch-For-Review: es2018 crashed - https://phabricator.wikimedia.org/T181293#3785603 (10jcrespo) a:03Marostegui
[13:15:52] moritzm: All my long running tasks in neodymium are finished
[13:16:36] did sarin get restarted in the end?
[13:16:47] yeah, sarin is now running 4.9.51
[13:17:24] I'll doublecheck sometime next week and then reboot neodymium as well, maybe you can move long-running DBA tasks to sarin for a few days?
[13:18:43] yes, no problem
[13:18:53] just wanted to know if we could do that now
[13:19:07] (moving the tasks, not the reboot)
[13:19:15] because it had already been rebooted
[13:19:21] you answered already
[13:19:52] ok
[13:20:21] it's pretty nice that with cumin we have effectively two maint hosts with equal powers
[13:20:29] jynus: should we repool the s8 hosts, as in revert https://gerrit.wikimedia.org/r/#/c/393204/
[13:20:36] before that, rebooting neodymium/saltmaster was much more risky
[13:21:05] marostegui: I would not repool *all of them*
[13:21:50] but maybe one by one and then leave them with low load
[13:21:56] as in
[13:22:14] each one as the main s5 host
[13:22:23] for some time?
[13:25:06] we can actually do that on monday
[13:26:47] if we finally go for the split, we have time to warm them up
[14:33:27] I also did https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1
[15:24:17] 10DBA, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3785829 (10elukey)
[15:24:44] 10DBA, 10MediaWiki-Configuration, 10Operations, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3785832 (10Addshore) Thanks! I only asked as the title of this ticket references testwikidatawiki, not wikidatawiki
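A hedged sketch of the "only host that is still replicating codfw -> eqiad" check discussed around 12:48 above: ask each candidate eqiad host for SHOW SLAVE STATUS and flag any whose configured master is in codfw. The host list and credentials are placeholders, not the actual inventory:

```python
#!/usr/bin/env python3
"""Sketch: report eqiad es hosts whose replication master is a codfw host.
Host list and credentials are placeholders; the account needs the
REPLICATION CLIENT privilege for SHOW SLAVE STATUS."""
import pymysql
from pymysql.cursors import DictCursor

# Placeholder list; only es1011/es1014 are mentioned in the log.
EQIAD_ES_HOSTS = ['es1011.eqiad.wmnet', 'es1014.eqiad.wmnet']


def master_of(host):
    conn = pymysql.connect(host=host, user='repl_check', password='xxx')
    try:
        with conn.cursor(DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row['Master_Host'] if row else None
    finally:
        conn.close()


if __name__ == '__main__':
    for host in EQIAD_ES_HOSTS:
        master = master_of(host)
        if master and '.codfw.' in master:
            print(f"{host} still replicates from {master}")
```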
[15:27:29] 10DBA, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3785843 (10elukey) a:03elukey
[16:21:28] 10DBA, 10MediaWiki-Configuration, 10Operations, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3785922 (10Marostegui) Yeah, we decided to go for wikidatawiki on codfw, as it is the passive DC :-)
[20:25:27] 10DBA, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3786146 (10Capt_Swing) The `shawn` table belonged to Shawn Walker, a research intern in 2011. These tables can be safely deleted.