[04:42:58] DBA, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Marostegui) p:Triage→Medium
[05:11:52] DBA, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui)
[05:12:16] DBA, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) The switchover was done, db1131 is no longer the primary master Times: RO started: 05:00:39 RO finished: 05:01:58 Total RO: 1 minute and 19 seconds
[05:13:39] load on db1088 is a bit on the high side
[05:13:48] processes: 75
[05:15:01] I will reduce the load a bit, but as we are running with one less slave, that's kinda expected
[05:15:06] db1131 is depooled for HW maintenance
[05:15:08] ok, fair
[05:15:19] hopefully back tomorrow
[05:15:54] another unrelated thing I am seeing is: Collection failures: db1117:13322
[05:15:56] done
[05:16:13] I guess that host needs a prometheus restart
[05:16:19] But it wasn't touched recently I think
[05:16:37] that is weird, let me see if something changed there recently
[05:17:00] it was restarted 4 days ago
[05:17:02] from what I can see
[05:17:18] I am going to get some breakfast
[05:17:44] I will restart prometheus
[05:17:49] see what happens
[05:18:45] thanks
[07:12:37] does https://phabricator.wikimedia.org/source/operations-puppet/history/production/ work for you?
[07:12:52] work as in, it has the latest merges?
[07:13:17] mmmm
[07:13:31] there have been a few more after 431c03cc0dfb
[07:13:37] indeed
[07:13:37] at least 1 more
[07:13:59] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet shows them
[07:14:10] yeah
[07:14:10] so I think it is a Diffusion issue, I will report it
[07:14:15] broken sync?
[07:14:16] yeah
[07:15:48] github mirror does look updated
[07:17:19] Update Frequency 6 m, 45 s
[07:17:30] but
[07:17:38] Initialization Error
[07:17:53] Pull of 'operations-puppet' failed: Command failed with error #128! COMMAND git fetch --no-tags --update-head-ok -- '********' '+refs/heads/*:refs/heads/*' STDOUT (empty) STDERR error: insufficient permission for adding an object to repository database ./objects fatal: failed to write object fatal: unpack-objects failed
[07:19:05] anything on SAL that might have changed since yesterday?
[07:19:31] I am going to report it, that error only happens there
[07:29:19] reported as T257895
[07:29:26] T257895: Diffussion (Phabricator) operations-puppet repo synchronization error - https://phabricator.wikimedia.org/T257895
[07:31:00] This is how backup monitoring exclusions ended up: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/1cd5aee3ff46cda2a1a5396266c24c51dc0ec2b5/modules/profile/files/backup/job_monitoring_ignorelist
[07:32:44] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo) I have added a rule to ignore labtestpuppetmaster2001 backup monitoring: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/1cd5aee3ff46cda2a1a...
[07:35:52] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (jcrespo) reviewdb is just backed up, but otrs backup on the same instance has yet to finish to make...
[07:42:07] once the review db backups finish and they are sent to bacula, I will focus on the otrs clone setup, if you can confirm db1077 is already available?
[08:01:08] jynus: yep, db1077 is yours
[08:01:22] taking over it
[08:01:45] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (Marostegui) >>! In T255715#6303435, @jcrespo wrote: > reviewdb is just backed up, but otrs backup on...
[08:14:04] I've updated https://wikitech.wikimedia.org/wiki/Bacula#Monitoring to reflect the latest changes to bacula monitoring
[10:30:01] DBA, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[10:32:46] DBA, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[10:59:19] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[11:01:56] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) I have finished s5 (including t...
[11:49:05] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[11:54:53] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[12:00:41] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (jcrespo) Bacula is running now.
[12:15:37] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[12:18:32] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[12:37:09] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[12:37:58] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[12:40:12] marostegui: well, i now have a disgustingly clear view of how horribly manual this process is ;)
[12:40:34] kormat: the es switchover is a bit worse, because you have to interact with MW
[12:40:45] the other ones are a bit easier thanks to jaime's script and dbctl
[12:41:24] switchover.py + replication_tree.py are ❤️
[12:41:43] yeah, definitely
[12:55:27] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui) 14th July: ` -rw-r--r-- 1 dump dump 1.1G Jul 14 00:46 dump.s4.2020-07-14--00-20-39/commonswiki.cu_changes.000...
[12:58:51] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[12:59:23] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) Open→Stalled Only primary masters pending. Stalling till the DC switchover is done.
[13:03:00] DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (Marostegui) a:Marostegui This is how the table looks like after the change on dbstore1005:3318 ` root@cumin1001:/home/marostegui# mysql.py -hdbstore1005:3318 -e "show create table wikidatawiki.pagelink...
[13:34:04] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Huji)
[13:46:09] kormat: I hope you may have some time in the future to integrate dbctl into switchover.py
[13:48:21] once we have full automation of some tasks we will be able to attack things in 2 ways: 1) simplify them (I automated tasks as they were, didn't have the bandwidth to change the process)
[13:48:43] and 2) try to apply them for unscheduled failovers
[13:48:47] but that is long term
[13:49:42] +1
[14:05:45] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (jcrespo) a:jcrespo→Marostegui `lines=10,name=db backup metadata root@db2093.codfw.wmnet[zarc...
[14:06:49] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (jcrespo) Last note: after db drop, grants for 'dump' users should be dropped too, so no backups are...
[14:13:00] DBA, OTRS, Operations, serviceops: Create a parallel OTRS database with a freezed snapshot of the production one - https://phabricator.wikimedia.org/T257928 (jcrespo)
[14:13:08] DBA, OTRS, Operations, serviceops: Create a parallel OTRS database with a freezed snapshot of the production one - https://phabricator.wikimedia.org/T257928 (jcrespo) p:Triage→Medium
[14:15:46] DBA, OTRS, Operations, serviceops: Create a parallel OTRS database with a freezed snapshot of the production one - https://phabricator.wikimedia.org/T257928 (jcrespo) I was planning on doing this slowly with @akosiaris so at the same time he learned about streamlined db provisioning system, but I...
[14:17:07] DBA, OTRS, Operations, serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (Reedy)
[14:20:43] DBA: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 (jcrespo)
[14:20:47] DBA, OTRS, Operations, serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (jcrespo)
[14:23:32] DBA: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 (jcrespo) Given 90% of m2 is OTRS database, I will setup db1077 with buster/MariaDB 10.4 on db1077 at T257928, and that will allow testing of the eventual upgrade of the primary instance to it. If that works correctly, I t...
[14:28:09] DBA, OTRS, Operations, serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (jcrespo) https://en.wiktionary.org/wiki/freezed {icon hand-peace-o spin}
[14:44:25] I am thinking of stopping backing up for long term snapshots (yes, I know I was the one that proposed that) to dedicate resources for: 1) more retention for logical backups 2) short term snapshots of everything, including misc hosts
[14:45:27] I change my mind from time to time, but the deeper I go into backups, the more I understand the different services' needs
[17:19:54] DBA, OTRS: OTRS database is "too large" - https://phabricator.wikimedia.org/T138915 (eyazi) I'd like to list the benefits, of which most are already stated here, and add my two cents to the topic. Switching storage from DB to FS will: — increase update speed (no altering in large tables) — increase data...
[21:23:59] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257983 (Majavah)
[21:31:33] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257983 (wiki_willy)
[21:31:38] DBA, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (wiki_willy)
[21:31:52] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257983 (wiki_willy) a:Jclark-ctr Duplicate task of T257253