[05:21:27] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Identify tools hosting databases on labsdb100[13] and notify maintainers - https://phabricator.wikimedia.org/T175096#3765539 (10bd808) MassMessage sent to c3.labsdb database users ([[https://wikitech.wikimedia.org/w/index.php?title=User_talk:Andrew_Bogo...
[06:05:59] 10DBA, 10Data-Services, 10Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3765554 (10Marostegui) 05Open>03Resolved a:03Marostegui I am going to mark this as resolved, as the scope of the ticket is finished. As nothing else...
[06:55:51] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3765612 (10Marostegui)
[07:07:02] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3765613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1101.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017111...
[07:26:57] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3765637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1101.eqiad.wmnet'] ``` and they were **ALL** successful.
[08:21:27] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3765701 (10Marostegui)
[08:58:59] 10DBA, 10Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3765720 (10Marostegui) New update from the bug: ``` Andrei Elkin closed MDEV-12012. ------------------------------- Fix Version/s: 10.1.30 10.2.11 (...
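The gtid_domain_id rollout mentioned at 08:58 ultimately means setting a value in each instance's config. A minimal sketch of generating such a fragment, assuming a simple numeric per-host/per-datacenter scheme (the actual WMF assignment policy is not stated in this log):

```shell
# Hedged sketch: emit a my.cnf fragment setting gtid_domain_id.
# The domain-id numbering passed in is hypothetical, not WMF policy.
make_gtid_fragment() {
    printf '[mysqld]\ngtid_domain_id = %s\n' "$1"
}

# Illustrative call; in practice the fragment would be puppet-managed.
make_gtid_fragment 2001
```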
[09:55:42] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3765921 (10Marostegui) db1101.s7 is now replicating
[10:22:22] labsdb1009 crashes on start
[10:23:07] uh??
[10:23:35] maybe the transfer corrupted some files?
[10:23:50] not sure
[10:24:01] check journalctl -u mariadb on labsdb1009
[10:24:09] let me see
[10:24:40] I had exactly the same issue a week ago, when transferring data from db1051 I think to a multi-instance host
[10:25:03] I re-did the transfer and it started fine the second time, so I guess a transfer corruption was my cause
[10:25:16] you did start with skip slave, right?
[10:25:32] yes
[10:26:11] Maybe try to run a checksum?
[10:26:24] To see if there are any differences in the files?
[10:26:31] Or just try another transfer
[10:26:53] not possible to checksum, I started labsdb1010 yesterday
[10:27:19] no, i meant a checksum against what was copied from labsdb1010 to the intermediate location
[10:27:40] and now I cannot leave labsdb1009 pooled, so I suppose I have to point everything to labsdb1011
[10:27:54] yeah :(
[10:28:00] wait, why don't you try again?
[10:28:21] there is not much to lose (apart from time)
[10:36:54] another question, just thinking about possibilities: was replication manually stopped on all slaves on labsdb1010 before stopping mysql?
[10:37:15] not that it should matter, because it should get stopped once mysql starts to shut down, but who knows
[10:40:39] I would say yes, but I cannot swear it
[10:43:30] 10DBA, 10Data-Services, 10Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3766086 (10russblau) 05Resolved>03Open No one contacted me during this whole process, until after everything had been "resolved". I won't run the jo...
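The checksum idea at 10:26-10:27 (comparing what landed on labsdb1009 against the intermediate copy taken from labsdb1010) could be sketched like this; the paths in the usage comment are illustrative, not the real ones:

```shell
# Build a checksum manifest of every file under a datadir copy, so two
# copies can be diffed to rule out transfer corruption.
checksum_dir() {
    (cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum)
}

# Illustrative usage (paths are made up):
#   checksum_dir /srv/intermediate/labsdb1010 > /tmp/src.sums
#   checksum_dir /srv/sqldata                 > /tmp/dst.sums
#   diff /tmp/src.sums /tmp/dst.sums && echo "copies match"
```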
[10:47:21] I wonder if I should reset slaves, in case it happens again
[10:50:14] I don't think that would be necessary really, if they are stopped and started without replication
[10:50:22] but maybe it is a safer option, don't know
[10:50:39] do you think the errors would happen again, but in a non-fatal way?
[10:51:34] yeah, like every time we do it without resetting the slave
[10:51:40] maybe a different behaviour in 10.1?
[10:52:57] have we ever copied from 10.1?
[10:53:21] Actually, it shouldn't really matter, because it happened to me from 10.0 to 10.1
[10:53:25] as i said earlier
[10:53:48] but it has happened just once, and I have copied lots of times from 10.0 to 10.1 to set up all the codfw multi-instance hosts
[10:56:29] 10DBA, 10Data-Services, 10Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3766125 (10Marostegui) They have been removed from config and from live replication.
[12:05:56] marostegui, jynus: ok to reboot the dbmonitor hosts or is it a bad time currently?
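The "stop replication before shutdown, start the copy with skip-slave-start, maybe reset slave" hygiene discussed above can be summarized as a dry run that only prints the MariaDB statements it would feed to mysql:

```shell
# Dry-run sketch of the replication hygiene discussed above: prints the
# MariaDB statements rather than executing them on a live server.

# Before shutting down the source of a copy: stop all replication
# connections explicitly, then shut down cleanly.
pre_shutdown_sql() {
    printf 'STOP ALL SLAVES;\nSHUTDOWN;\n'
}

# On the freshly copied instance (started with skip-slave-start):
# discard the inherited replication state before configuring it anew.
post_clone_sql() {
    printf 'RESET SLAVE ALL;\n'
}

pre_shutdown_sql
post_clone_sql
```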
[12:07:41] go ahead
[12:07:44] now is a perfect time
[12:08:15] doing that now, then
[12:13:52] completed
[14:52:15] jynus: https://gerrit.wikimedia.org/r/#/c/391830/
[14:55:31] I am checking errors on kibana
[14:55:53] Top DB Servers : 10.64.32.222:3311
[14:56:20] there are errors, but not more than usual- in fact I would say there are fewer
[14:56:38] yeah, I have had kibana open all day and the trend looks as usual
[14:58:03] 36 errors in the last 24 hours vs 62
[15:06:12] 10DBA, 10MediaWiki-Configuration, 10Operations, 10Wikidata: Test testwikidatawiki on s8 - https://phabricator.wikimedia.org/T180694#3766687 (10jcrespo)
[15:07:24] 10DBA, 10MediaWiki-Configuration, 10Operations, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3766687 (10jcrespo)
[15:07:28] 10DBA, 10Data-Services, 10Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3766703 (10Marostegui) 05Open>03Resolved
[15:09:14] so while the labsdb transfer is ongoing, what do you want me to do, set up a codfw config? set up a server?
[15:09:47] yeah, let's try to get a codfw configuration draft?
[15:09:51] ok
[15:11:29] should we abandon the other one and aim for an easier one?
[15:11:32] to avoid risks?
[15:12:11] we can keep the other one
[15:12:13] as in
[15:12:16] for later
[15:12:19] or for discussion
[15:12:22] not for now
[15:12:30] what is the state of db2038 ?
[15:12:34] is it depooled?
[15:12:38] let me see
[15:13:02] maybe my repo is outdated
[15:13:35] My bad, i left it, let me fix it
[15:14:01] was it used as a source of backups?
[15:14:09] I can pool it back as the new master
[15:14:13] if that is ok
[15:14:20] no, it is an rc slave
[15:14:22] so it has partitions
[15:14:26] let me leave the file as it should be
[15:14:44] ok
[15:15:05] so db2045 as master?
[15:15:10] for s8?
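The 10.64.32.222:3311 entry in the kibana top list refers to a multi-instance port rather than the default 3306. Assuming the 331x-port-per-section convention these hosts appear to use (an assumption on my part, not confirmed in this log), a tiny helper can map a port back to its section:

```shell
# Hedged sketch: map a multi-instance MySQL port to its section name,
# assuming ports 3311..3318 correspond to s1..s8 (unverified convention).
port_to_section() {
    case "$1" in
        331[1-8]) echo "s${1#331}" ;;
        *)        echo "unknown" ;;
    esac
}

# e.g. the kibana entry above:
port_to_section 3311
```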
[15:17:46] only yes
[15:19:02] db2045 looks like a decent candidate, 282 days of uptime, so no HW issues apparently
[15:19:11] (old kernel and all that, but you know what i mean)
[15:19:42] either 45 or 52
[15:20:40] both look good, i would go for 45, but after a reboot maybe?
[15:21:25] git pull to get the latest codfw.php file :)
[15:31:21] yes, reboots on codfw are easy, so no issue there
[15:34:06] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3766801 (10Marostegui)
[15:36:12] after review- (I would need to give it a second look), I would literally deploy that to codfw
[15:36:50] yeah, i am checking it
[15:47:05] 10DBA, 10Operations, 10ops-eqiad: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3766852 (10Marostegui)
[16:00:02] 10DBA, 10Cloud-Services, 10Community-Wikimetrics, 10Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3766900 (10jcrespo)
[16:10:58] should we change the topology before merging?
[16:11:01] or after?
[16:11:22] i would say before, no?
[16:11:35] so we can clearly see it in tendril and make sure it matches the config
[16:11:45] ok, I will prepare some repl.pl executions
[16:11:55] \o/
[16:12:08] we should think about an eqiad master, too
[16:12:50] yeah, I thought about db1071 (which i think you actually used in your first draft)
[16:12:54] because it is the 160GB one
[16:14:29] so I will set up replication db1063 -> db1071 -> db2045 -> rest of codfw
[16:14:46] I will depool db1071 to avoid issues
[16:14:57] db1071 is using statement, right?
[16:15:02] I will check
[16:15:02] ah no
[16:15:03] row
[16:15:06] row?
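The repl.pl topology moves planned above (db1063 -> db1071 -> db2045 -> rest of codfw) boil down to repointing each replica at its new master. A dry-run sketch that prints the statement instead of running it; the host and binlog coordinates in the example call are placeholders:

```shell
# Dry-run sketch of a topology move: print the CHANGE MASTER statement
# that would repoint a replica at a new master.
change_master_sql() {
    local new_master="$1" log_file="$2" log_pos="$3"
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$new_master" "$log_file" "$log_pos"
}

# e.g. hang db2045 off db1071 (coordinates are made-up placeholders):
change_master_sql db1071.eqiad.wmnet db1071-bin.000123 4
```

In practice the replica must be stopped at a known position first, and the statement would be fed to mysql rather than printed.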
[16:15:29] haha MIXED
[16:15:30] yeah
[16:15:40] not the time now
[16:15:57] but actually we do not need STATEMENT, except that it is a problem to think about that now
[16:16:24] I will depool and restart
[16:16:30] if it is going to be depooled, i would already convert it to STATEMENT
[16:16:33] yeah
[16:17:32] oh, and I already did the schema change there, great! :-)
[16:17:55] maybe I should pool another host to avoid having s5 with fewer servers
[16:18:09] db1104 as api?
[16:18:12] for s5?
[16:18:14] yeah, you can use db1104
[16:18:25] or db1100
[16:18:29] whichever you prefer
[16:18:44] I will warm up more servers
[16:18:55] and test them with more load, one never knows
[16:19:03] good idea, yeah
[16:21:09] https://phabricator.wikimedia.org/T132515
[16:21:44] All three hosts were shut down and thanks to @Cmjohnson got the thermal paste applied.
[16:21:59] so I assume it is ok
[16:23:32] yeah, db1071 has a good uptime, 514 days
[16:27:02] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3766973 (10alanajjar)
[16:27:10] upgrade to 10.0.33?
[16:27:24] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3766985 (10alanajjar)
[16:27:53] for the master?
[16:28:00] db1071
[16:28:21] not sure…, that means upgrading all the slaves that will be in s8 hanging from it
[16:28:41] not that it is a hard task, just saying… do we want a master with 10.0.33 already?
[16:29:06] what versions are there on s5?
[16:29:22] let me see
[16:30:01] 3 with 29
[16:30:04] the others with 32
[16:30:07] master is 29, then we have 30, 28, 32 and the multi-instance with 10.1
[16:42:44] db1022 is a decommissioned host? can't find it in site.pp and searching for db1022 in phab doesn't show a decom ticket.
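Answering "what versions are there on s5?" host by host gets tedious; a dry-run sketch that prints the per-host queries for version and binlog_format (the host list in the example is illustrative, not the real s5 membership):

```shell
# Dry-run sketch: print, for each host, the mysql invocation that would
# report its server version and binlog format.
survey_shard() {
    for host in "$@"; do
        echo "mysql -h $host -e \"SELECT @@hostname, @@version, @@global.binlog_format\""
    done
}

# Illustrative host list:
survey_shard db1071.eqiad.wmnet db1100.eqiad.wmnet db1104.eqiad.wmnet
```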
I'm wondering since it's still listed in https://servermon.wikimedia.org/hosts/
[16:43:10] but it hasn't received a puppet update in 1 month, 3 weeks either
[16:43:48] phabricator search is broken
[16:43:59] https://phabricator.wikimedia.org/T163778
[16:44:02] moritzm: ^
[16:44:20] what, why didn't I find this?
[16:44:38] phabricator search is broken
[16:45:03] ah
[16:45:04] like, not "it works badly"
[16:45:08] so as per chris, it is gone. https://phabricator.wikimedia.org/T163778#3688148
[16:45:11] it is really broken at the moment
[16:48:59] I think this decom missed the "puppet node deactivate" step, I just ran that on puppetmaster1001, let's see if that fixes it
[16:49:11] probably, tell chris
[16:49:29] we get rid of the responsibility after we set it as spare and create the ticket
[16:49:39] not in a bad way, but to avoid errors
[16:49:40] yeah, once this has properly cleaned out the node, I'll add a note to the ticket
[16:50:05] as in, we sometimes could have done more, but were told not to because it creates coordination problems
[16:50:24] give chris a heads up; to be fair, he has to control lots of servers, so human mistake is likely
[16:51:24] 10DBA, 10Operations, 10ops-eqiad: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3209308 (10MoritzMuehlenhoff) JFTR: The host was still showing up in puppetdb (e.g. via https://servermon.wikimedia....
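The missed decommission step from 16:48 can be captured as a small dry-run helper; it prints the puppet cleanup commands (deactivate the node in puppetdb, then purge its certificate and cached data) rather than running them on the puppetmaster:

```shell
# Dry-run sketch of the puppet-side decommission cleanup discussed above:
# prints the commands instead of executing them on the puppetmaster.
decom_puppet_cleanup() {
    local fqdn="$1"
    echo "puppet node deactivate $fqdn"
    echo "puppet node clean $fqdn"
}

decom_puppet_cleanup db1022.eqiad.wmnet
```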
[16:51:52] yeah, it's fine, these things can slip through, we have replaced quite a few servers
[16:52:11] we are in the middle of decomming 50 in eqiad alone
[16:52:30] phabricator being broken doesn't help
[16:53:00] I cannot even search whether that has been reported yet
[16:53:55] catch-22 :-)
[16:59:29] 10DBA, 10MediaWiki-Configuration, 10Operations, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3767076 (10Ladsgroup) I can help
[17:10:02] 10DBA, 10MediaWiki-Configuration, 10Operations, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3767125 (10jcrespo) Thanks, Ladsgroup, for starters we were thinking of preparing the topology changes for s8 on codfw (which if it br...
[17:15:18] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3767146 (10elukey) I am currently reviewing what tables to drop on db1047 and which ones to copy over to db1108, and this is what I gath...
[17:17:28] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3767179 (10jcrespo) ops are ours, we can handle that- just leave things as you found them. test is probably a mistake and probably shoul...
[17:29:30] Hi, how long will the dewiki database be locked due to maintenance?
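For the db1047 -> db1108 table moves elukey mentions, a dry-run sketch of the usual mysqldump pipeline; database and table names in the example call are placeholders, not the real list from the task:

```shell
# Dry-run sketch: print the mysqldump | mysql pipeline that would copy
# selected tables from one dbstore host to another.
copy_tables_cmd() {
    local src="$1" dst="$2" db="$3"; shift 3
    echo "mysqldump -h $src $db $* | mysql -h $dst $db"
}

# Placeholder names, not the actual tables under review:
copy_tables_cmd db1047.eqiad.wmnet db1108.eqiad.wmnet staging some_table
```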
[17:29:44] jynus: ^
[17:32:23] 10DBA, 10Operations: db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (10jcrespo)
[17:34:21] <_joe_> doctaxon: it's a hardware failure
[17:36:46] 10DBA, 10Operations, 10Patch-For-Review: db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (10fgiunchedi) I retrieved the kernel logs from syslog servers at the time of the incident: {P6332}
[18:58:32] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3767641 (10Cmjohnson)
[19:02:59] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3767656 (10Marostegui) All of s5 is being reverted because of T180714 (nothing to do with the crash), but the crash left half the schem...
[23:21:40] 10DBA, 10Operations, 10Patch-For-Review: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768544 (10greg)
[23:37:27] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768579 (10greg)
[23:38:01] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768584 (10greg) p:05Triage>03Unbreak!
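Pulling a crashed host's kernel messages out of a centralized syslog file, as fgiunchedi did for db1063 at 17:36, can be sketched like this; the log path and line format are assumptions, not the actual syslog setup:

```shell
# Hedged sketch: extract one host's kernel lines from an aggregated
# syslog file (standard "... <host> kernel: ..." line format assumed).
kernel_lines_for_host() {
    local host="$1" logfile="$2"
    grep -F " $host kernel:" "$logfile"
}

# Illustrative usage (path is made up):
#   kernel_lines_for_host db1063 /srv/syslog/syslog.log
```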