[05:06:46] 10DBA, 10Operations, 10Patch-For-Review: sql config differs between mwmaint1001 and deploy1001 - https://phabricator.wikimedia.org/T199009 (10Marostegui) 05Open>03Resolved a:03Marostegui This is now fixed after merging both patches.
[05:37:28] 10DBA, 10Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 (10Marostegui)
[05:59:12] 10Blocked-on-schema-change, 10MediaWiki-Database, 10Patch-For-Review: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 (10Marostegui)
[05:59:14] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 (10Marostegui)
[05:59:27] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 (10Marostegui)
[06:08:22] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 (10Marostegui)
[06:08:24] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 (10Marostegui)
[06:08:38] 10Blocked-on-schema-change, 10MediaWiki-Database, 10Patch-For-Review: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 (10Marostegui)
[06:10:41] 10Blocked-on-schema-change, 10MediaWiki-Database, 10Patch-For-Review: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 (10Marostegui) s1 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1001 [] dbstore1002 [] db1124 [] db1080 [] db1083 [] db1089 [] db1119 [] db1...
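The schema-change task comments above track per-replica progress with a simple checkbox convention ("s1 eqiad progress [] labsdb1009 [] labsdb1010 ..."). A minimal sketch of rendering such a checklist from a host list; the helper name and the `done` parameter are illustrative, not the actual WMF tooling:

```python
def progress_checklist(section, dc, hosts, done=()):
    """Render a schema-change progress checklist in the style of the
    Phabricator comments above: '[]' for pending hosts, '[x]' for done."""
    lines = ["%s %s progress" % (section, dc)]
    for host in hosts:
        mark = "[x]" if host in done else "[]"
        lines.append("%s %s" % (mark, host))
    return "\n".join(lines)

# A few of the hosts named in the task comments, one already done:
checklist = progress_checklist("s1", "eqiad",
                               ["labsdb1009", "labsdb1010", "db1124"],
                               done={"labsdb1009"})
```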
[06:10:55] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 (10Marostegui) s1 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1001 [] dbstore1002 [] db1124 [] db1080 [] db1083 [] db1089 [] db1119 [] db1067 [...
[06:10:57] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 (10Marostegui) s1 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1001 [] dbstore1002 [] db1124 [] db1080 [] db1083 [] db1089 [] db1119 []...
[06:12:54] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 (10Marostegui)
[06:13:09] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 (10Marostegui)
[06:13:22] 10Blocked-on-schema-change, 10MediaWiki-Database, 10Patch-For-Review: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 (10Marostegui)
[07:06:01] 10DBA: Optimize logging table - https://phabricator.wikimedia.org/T197459 (10Marostegui)
[07:06:27] 10DBA: Optimize logging table - https://phabricator.wikimedia.org/T197459 (10Marostegui) 05Open>03Resolved This is all done
[07:11:58] there were connection problems on db1122 just before 7UTC
[07:13:47] yeah, i saw the spike
[07:14:13] Probably related to the optimization I was doing there, I guess (but I only saw 6 errors)
[07:21:55] optimization could affect queries or replication, but not really connections
[07:22:25] but the connections seem stable
[07:22:42] they are aborted clients
[07:23:10] it seems that new connections took 20 seconds
[07:23:45] Yeah, but maybe a side effect, something like an IO spike or similar
[07:24:19] strange
[07:24:45] yeah, but it would be too much of a coincidence that it happened just
around the time I issued the optimize
[07:24:56] Although it is the only server that had those errors among all the other ones I have been doing
[08:34:37] 10DBA, 10Operations, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui)
[08:47:42] errors on db1076
[08:47:57] and db1074
[08:49:31] yeah, it is confirmed, it is due to the optimizations
[08:49:33] but it is strange
[08:49:37] because I have done like 20 servers
[08:49:44] might be a race condition or something
[08:51:36] what about db1125?
[08:51:56] it is lag coming from the optimizations of its master
[08:52:48] it would have paged if alerting was configured as we were recommended
[08:52:55] I downtimed it
[08:53:07] then we have icinga issues again
[08:53:56] Or maybe it expired already
[08:54:08] can you check?
[08:54:11] Or maybe I downtimed db1124 instead
[08:54:13] it is ok if it expired
[08:54:22] Sorry, I will check later, I am busy now fixing other stuff
[08:54:26] ok
[08:54:37] I will get back to it once I am done with this issue
[13:36:45] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 (10Marostegui)
[13:36:59] 10DBA, 10Patch-For-Review, 10Schema-change: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 (10Marostegui)
[13:37:37] 10Blocked-on-schema-change, 10MediaWiki-Database, 10Patch-For-Review: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 (10Marostegui)
[13:41:38] 10DBA, 10Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 (10Marostegui) a:03Marostegui
[14:09:59] if/when you have a couple of minutes I have a couple of questions for you ;) topic: datacenter switch
[14:11:46] yeah
[14:13:15] 1) (easy one) regarding DB sections the only difference from last switchdc is that we also have S8 now, right?
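The "aborted clients" symptom discussed above corresponds to MySQL's cumulative `Aborted_clients` status counter; a spike during an OPTIMIZE run can be spotted by comparing counter snapshots taken before and after. A sketch under the assumption that the snapshots come from `SHOW GLOBAL STATUS` (the threshold here is an arbitrary illustration, not a production value):

```python
def aborted_clients_spike(before, after, threshold=5):
    """Given two SHOW GLOBAL STATUS snapshots (dicts mapping counter
    name to value), return the Aborted_clients delta and whether it
    exceeds the threshold. The counter is cumulative since server
    start, so only the delta between snapshots is meaningful."""
    delta = int(after.get("Aborted_clients", 0)) - int(before.get("Aborted_clients", 0))
    return delta, delta > threshold

# The "only saw 6 errors" case from the conversation above:
delta, spiked = aborted_clients_spike({"Aborted_clients": "120"},
                                      {"Aborted_clients": "126"})
```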
[14:13:52] s*, es*, and x1; yes, one more: s8
[14:14:20] the problem is scripts are being written right now to make things different
[14:14:29] -mediawiki read only
[14:14:34] and check lag
[14:15:05] also the idea you proposed of running the heartbeat remotely
[14:15:15] but that doesn't affect the existing scripts
[14:16:22] RO should be settable via etcd, at least the global one, not yet the per-section one, but I'll keep an eye on the evolution of things
[14:18:03] 2) I guess that the role of $::mw_primary has changed over time, and I see it is still used in a couple of places
[14:18:23] I still don't use it for icinga
[14:18:25] I plan to
[14:19:17] but even if it wasn't, the worst case is alerts being fatal vs irc only
[14:19:27] the plan is to move that to etcd
[14:19:49] I think it is not the only usage
[14:19:56] but I think it is the only important one for us
[14:20:30] and I think I could change it in a day, it is just changing a puppet variable with a call to an api
[14:20:47] OR
[14:21:02] yeah the only question is when we should merge the patch during the switch, but it shouldn't be a problem
[14:21:11] the puppet patch?
[14:21:31] why not just remove the variable?
[14:21:54] if we can, even better!
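The "check lag" step mentioned above is typically derived from a heartbeat row in the style of pt-heartbeat: the master writes a timestamp every second, and a replica's lag is the current time minus the last timestamp it has replicated. A minimal sketch, assuming pt-heartbeat's default ISO timestamp format and ignoring clock skew (the real switchover scripts may differ):

```python
from datetime import datetime

def heartbeat_lag(heartbeat_ts, now):
    """Compute replication lag in seconds from the heartbeat timestamp
    written by the master (ISO format with microseconds) and the
    current time as seen on the replica."""
    written = datetime.strptime(heartbeat_ts, "%Y-%m-%dT%H:%M:%S.%f")
    return (now - written).total_seconds()

# A replica reading a heartbeat written 3.5 seconds earlier:
lag = heartbeat_lag("2018-07-09T14:14:30.500000",
                    datetime(2018, 7, 9, 14, 14, 34))
```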
[14:21:55] it is not that it would be so difficult, but we are not going to do it now
[14:22:06] as I sad, I can remove it in a day of testing
[14:22:09] *said
[14:22:17] but I think I am not the only one using it
[14:23:04] I may be the one using it
[14:23:08] actually
[14:23:17] no
[14:23:23] manifests/realm.pp:$mw_primary = $app_routes['mediawiki']
[14:23:26] that is not me
[14:23:30] the others are
[14:23:41] and it is easy to fix
[14:24:06] I already asked joe about what I wanted to do and he was ok
[14:24:18] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/345346/
[14:24:32] confctl --object-type mwconfig tags 'scope=common,name=WMFMasterDatacenter' --action get all
[14:24:44] and add it as client to https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster
[14:25:11] in the worst case scenario, I made the script fail gracefully
[14:25:19] ack
[14:25:26] but this is easy
[14:25:29] and the last question I have for now is
[14:26:06] I haven't done it before because testing alerting
[14:26:13] is problematic
[14:26:17] 3) do we still need the warmup? it's not documented on wikitech. Take into account that we'll not need to wipe memcached caches; it's not clear yet if APC needs to be wiped though
[14:26:36] volans: that is not my expertise
[14:26:55] not I asked for the warmup
[14:26:58] *nor
[14:27:14] I guess ask Krinkle?
[14:28:03] if there is no memcached wipe
[14:28:20] jynus: I meant point 1 in https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Days_in_advance_preparation
[14:28:20] we could do the warmup beforehand anyway?
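The confctl call above reads the current master datacenter from etcd, and the conversation mentions making the consuming script fail gracefully. A sketch of the consuming side; the JSON shape assumed here (an object keyed by the config name with a "val" field) is an assumption for illustration, not the documented conftool output format:

```python
import json

def master_datacenter(confctl_output):
    """Extract the master DC from confctl output for the
    WMFMasterDatacenter mwconfig object. The JSON layout is assumed;
    on any unexpected shape, return None so callers can fail
    gracefully, as suggested in the conversation above."""
    try:
        data = json.loads(confctl_output)
        return data["WMFMasterDatacenter"]["val"]
    except (ValueError, KeyError, TypeError):
        return None

# Illustrative output only -- not captured from production:
sample = '{"WMFMasterDatacenter": {"val": "eqiad"}}'
dc = master_datacenter(sample)
```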
[14:28:53] well, that is done every day
[14:28:59] it is more of a general reminder
[14:29:09] :)
[14:29:17] "make sure you do not restart your databases the day after"
[14:29:28] maybe I can clarify that better
[14:29:52] that was written the first time
[14:29:54] ack
[14:30:11] (sorry, have a meeting, my reply might be delayed)
[14:30:30] also we may want to do testing, but that's not really part of the failover process
[14:49:05] volans: what is the goal's ticket?
[14:52:20] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (10Ladsgroup) Last round:...
[15:08:05] jynus: T199073
[15:08:06] T199073: Perform a datacenter switchover - https://phabricator.wikimedia.org/T199073
[15:08:11] (sorry still in meetings)
[15:08:17] we are in a meeting too
[15:26:21] marostegui I have a disk ready for db1069... okay to replace or wait for you?
[15:28:02] in a meeting, I will put it offline in a bit and let you know when you can go ahead
[15:28:18] cmjohnson1: ^
[16:10:20] 10DBA, 10Operations, 10ops-eqiad: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) @Cmjohnson you can now proceed, I have set the disk offline: ``` Adapter #0 Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 0 W...
[16:11:35] 10DBA, 10Operations, 10Puppet: Remove all usages of $::mw_primary on puppet - https://phabricator.wikimedia.org/T199124 (10jcrespo)
[16:18:58] marostegui: disk swapped
[16:19:04] awesome! thank you!
[16:22:21] (I think a ticket may be generated automatically)
[16:22:23] ignore it
[16:33:34] 10DBA, 10Operations, 10ops-eqiad: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) disk swapped by chris: ``` root@db1069:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -a0 Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 28% in 16 Minutes. ```
[16:33:53] 10DBA, 10Operations, 10ops-eqiad: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui)
[16:34:08] 10DBA, 10Operations, 10ops-eqiad: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) disk swapped by chris: ``` root@db1069:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -a0 Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 28% in 16 Minutes. ```
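The `megacli -PDRbld -ShowProg` output quoted in T199056 ("Completed 28% in 16 Minutes.") also lets you estimate how long the rebuild has left, assuming a roughly constant rebuild rate. A small parsing sketch based only on the output format shown above:

```python
import re

def rebuild_eta(progress_line):
    """Parse a megacli rebuild progress line such as
    'Completed 28% in 16 Minutes.' and return the estimated
    remaining minutes (rounded to one decimal), or None if the
    line does not match or progress is still at 0%."""
    m = re.search(r"Completed (\d+)% in (\d+) Minutes", progress_line)
    if not m or m.group(1) == "0":
        return None
    pct, minutes = int(m.group(1)), int(m.group(2))
    total = minutes * 100.0 / pct  # extrapolate total duration
    return round(total - minutes, 1)

# The 28%-in-16-minutes figure above implies roughly 41 minutes left:
eta = rebuild_eta("Completed 28% in 16 Minutes.")
```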