[05:09:48] PROBLEM - MariaDB sustained replica lag on db2072 is CRITICAL: 3.254e+04 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2072&var-port=9104
[05:10:19] ^ downtime expired, MCR changes
[05:13:50] 10Blocked-on-schema-change, 10DBA: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476 (10Marostegui)
[05:30:04] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[07:04:20] bacula error is because the first backup has not yet finished - I will ack it but it should fix itself in a few hours
[07:04:40] we'll see if it completes or if hw (backup2001) keeps giving problems
[07:30:23] it seems backups to backup2001 are stuck again
[07:30:45] stuck as in...?
[07:31:00] nah, something is happening, but it is very slow
[07:31:40] maybe the first time the filesystem is used it is allocated virtually by the raid
[07:32:02] but it is taking many hours to do backups that in other cases take minutes
[07:34:46] I could debug if there were any useful io stats on the host overview :-(
[07:40:14] 10Blocked-on-schema-change, 10DBA: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476 (10Marostegui)
[07:42:32] there is a significant difference in backup sizes of x1 on both datacenters
[07:42:57] compression?
[07:43:03] let me paste
[07:43:32] https://phabricator.wikimedia.org/P12359
[07:44:45] 10Blocked-on-schema-change, 10DBA: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476 (10Marostegui)
[07:46:32] on disk, those two tables are the same on eqiad and on codfw
[07:46:50] the same, as in same content or same physical size?
[07:47:03] size
[07:47:06] on eqiad: ) ENGINE=InnoDB AUTO_INCREMENT=29344064 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8 |
[07:47:11] on codfw: ) ENGINE=InnoDB AUTO_INCREMENT=29344064 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED |
[07:47:32] maybe one got fragmented?
[07:47:49] I am doing a count(*)
[07:48:18] 25293126 vs 25293126
[07:48:22] so same amount
[07:48:23] on both dcs
[07:49:45] 10Blocked-on-schema-change, 10DBA: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476 (10Marostegui)
[07:52:05] when backups finish I can try to recompress it
[07:52:09] and see if it affects it
[07:52:21] cool
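(Editor's note: a minimal sketch of the kind of cross-DC size check being discussed above. The host names, schema/table names and the use of pymysql are illustrative assumptions, not the team's actual tooling; the log only tells us that the eqiad copy was built with KEY_BLOCK_SIZE=8 while the codfw one uses the default.)

```python
#!/usr/bin/env python3
# Illustrative only: compare on-disk size and compression options of the
# same table on one eqiad and one codfw replica. Host/table names are
# placeholders; pymysql stands in for whatever client is actually used.
import pymysql

HOSTS = {"eqiad": "db1103.eqiad.wmnet", "codfw": "db2096.codfw.wmnet"}  # hypothetical
QUERY = """SELECT table_rows, data_length, index_length, create_options
           FROM information_schema.tables
           WHERE table_schema = %s AND table_name = %s"""

for dc, host in HOSTS.items():
    conn = pymysql.connect(host=host, read_default_file="/root/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY, ("wikishared", "some_x1_table"))  # hypothetical names
            row = cur.fetchone()
            if row is None:
                print(f"{dc}: table not found")
                continue
            rows_est, data_len, idx_len, opts = row
            # table_rows is only an estimate; an exact COUNT(*), as done in the
            # log above, is the authoritative comparison.
            print(f"{dc}: ~{rows_est} rows, data={data_len} B, index={idx_len} B, options={opts!r}")
    finally:
        conn.close()
```

The recompression mentioned at 07:52 would amount to rebuilding the table (something along the lines of `ALTER TABLE ... ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8`), which also defragments it and should make the two copies comparable again.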
[08:13:04] 10Blocked-on-schema-change, 10DBA: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476 (10Marostegui)
[08:18:04] 10Blocked-on-schema-change, 10DBA: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476 (10Marostegui) s3 is the only one pending - will do it once the DC switchover is done next week.
[08:41:52] 10DBA: Replication broken on db1110 - https://phabricator.wikimedia.org/T261276 (10Marostegui)
[08:43:30] marostegui: want me to take care of restoring db1110 from backup?
[08:43:42] let me check what happened, but yeah, we probably should
[08:43:50] alrighty
[08:44:11] that's corruption not data corruption
[08:44:29] well, not at first, who knows what could have happened
[08:44:40] jynus: don't worry, it is being handled :)
[08:58:18] 10DBA: Replication broken on db1110 - https://phabricator.wikimedia.org/T261276 (10Marostegui) p:05Triage→03Medium a:05Marostegui→03Kormat So the error started: ` Aug 26 08:38:55 db1110 mysqld[4559]: 2020-08-26 8:38:55 495344457 [ERROR] InnoDB: Database page corruption on disk or a failed file read of t...
[08:59:30] Important, which version?
[09:06:13] 10.4, but it doesn't seem related to labsdb crashes
[09:06:37] or to db2125 (which was followed by a HW error)
[09:07:08] marostegui: thanks for catching that notifications were still disabled for db2125. i should have noticed that myself
[09:10:00] np!
[10:54:40] 10DBA: Replication broken on db1110 - https://phabricator.wikimedia.org/T261276 (10Kormat) Recovery done, mariadb upgraded (and mysql_upgrade run before and after mariadb upgrade), rebooted. It's now up and catching up on replication.
[10:55:30] 10DBA: Replication broken on db1110 - https://phabricator.wikimedia.org/T261276 (10Marostegui) Thank you
[11:06:54] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[11:07:49] 10DBA: Compare a few tables per section before the switchover - https://phabricator.wikimedia.org/T260042 (10Marostegui)
[11:12:42] 10DBA: Compare a few tables per section before the switchover - https://phabricator.wikimedia.org/T260042 (10Marostegui)
[12:12:24] 10DBA: Compare a few tables per section before the switchover - https://phabricator.wikimedia.org/T260042 (10Marostegui) I am starting the comparison on some sections, which will also help warming up those tables and get them in memory.
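(Editor's note: a rough sketch of what the per-table check in T260042 boils down to; the real comparison looks at row content, not just counts, and the hosts/tables below are made up. Equal counts don't prove the data is identical, but a mismatch is an immediate red flag, and the full scan has the useful side effect of warming the tables into the buffer pool, as mentioned above. pymysql is again only an assumption.)

```python
#!/usr/bin/env python3
# Illustrative only: spot-check that a few tables have the same row count on
# two hosts before the switchover. Reading them also pulls the data into the
# InnoDB buffer pool ("warming up"). Host and table names are placeholders.
import pymysql

HOSTS = ["db1083.eqiad.wmnet", "db2085.codfw.wmnet"]               # hypothetical
TABLES = [("enwiki", "change_tag_def"), ("enwiki", "site_stats")]  # hypothetical

def count_rows(host, schema, table):
    conn = pymysql.connect(host=host, read_default_file="/root/.my.cnf")
    try:
        with conn.cursor() as cur:
            # Full scan: slow on big tables, but exact, and it warms the cache.
            cur.execute(f"SELECT COUNT(*) FROM `{schema}`.`{table}`")
            return cur.fetchone()[0]
    finally:
        conn.close()

for schema, table in TABLES:
    counts = {host: count_rows(host, schema, table) for host in HOSTS}
    verdict = "OK" if len(set(counts.values())) == 1 else "MISMATCH"
    print(f"{schema}.{table}: {counts} -> {verdict}")
```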
[12:37:18] jynus: so it turns out that hashes are ordered in ruby; they retain the order in which the keys were inserted. and puppet apparently inserts the entries into the hash in the order they appear in the file (which makes sense). that's kinda nice.
[12:37:39] yes, that is what I meant, after a certain version
[12:38:06] "Ruby since version 1.9 (released dec 2007)"
[12:38:18] ah hah. nice :)
[12:38:42] the issue was I was mixing template variables and ruby variables
[12:38:47] * kormat nods
[12:39:23] so by having this, the merge on wmfmariadbpy gets unblocked
[12:39:40] and with this my work on wmfbackups also gets unblocked
[12:40:30] I will let you handle how to implement it on other hosts, but you will want that thought before the clouddb new hosts refactor
[12:54:06] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[12:54:37] jynus: i think i'll have a look at refactoring some of the puppet stuff today, once your CR is in
[12:55:15] so should I merge the cumin one now?
[12:55:54] yeah, please
[12:57:01] maybe I can assign https://gerrit.wikimedia.org/r/c/operations/puppet/+/620899/18 to you and then you decide whether to reuse it or abandon it?
[12:57:20] it will be useful at least for the cloud db work as reference
[12:57:25] yeah go for it
[12:57:54] in fact, the hiera changes will be useful
[12:57:57] anyway
[12:59:53] Notice: /Stage[main]/Mariadb::Wmfmariadbpy/File[/etc/mysql/section_ports.csv]/ensure: defined content as '{md5}4d469c9c5c7503276385e9be717df6ec'
[13:00:10] \o/
[13:00:51] what about https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/620291 ?
[13:02:10] oh yeah - +1'd, merge away
[13:02:28] https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/620319/3 too?
[13:02:35] that should be a noop
[13:03:34] yep, SGTM
[13:04:33] thanks, sorry to insist so much but this was blocking adding extra functionality to backups, which I needed as one of my goals
[13:04:52] I can now work on that, before or after creating a new repo
[13:05:10] do you by any chance have repo creating powers?
[13:05:30] thanks for working on this, it was badly needed
[13:05:47] mention the open issues you brought up on the cloud patch
[13:06:12] not sure if they will be open to wait for a larger refactoring, depending on their scheduling
[13:06:41] i do not. but there's a few people around who do. hashar/_joe_ for example :)
[13:06:41] but now there should be a single place where port mapping exists
[13:06:49] o/
[13:07:11] rather than 6 different locations
[13:08:04] if that is related to the dirty change I have made to run the wmfmariadbpy integration test, I went a bit wild because CI runs mysqld on a fixed socket name which is somewhere under /tmp (instead of /run)
[13:08:32] hashar: btw i had a look at your integration CR for wmfmariadbpy. i think we need to work on documenting the required env for the integration tests, and then we can see how easily we can satisfy that in CI. i'll keep it in mind for the future
[13:09:13] kormat: this is exactly why I didn't want to put so much burden on your shoulders
[13:09:21] just the minimal patch to unblock me
[13:09:35] there will always be time for refactoring as long as the patch is not too dirty
[13:10:17] manuel and you can now talk to analytics and cloud and propose a route forward
[13:11:24] I think we both agree, as I understood it, that our deployment system, mixing puppet and debian packages, has some weaknesses
[13:11:56] it's not ideal, indeed
[13:13:22] but i think it's manageable with some care
[13:13:53] +1
[13:18:35] yeah, I would like to hear kormat's ideas on https://gerrit.wikimedia.org/r/c/operations/puppet/+/622444
[13:20:09] i think it generally looks fine, so long as they use the new mapping that jynus just added
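(Editor's note: the "new mapping" is the /etc/mysql/section_ports.csv file puppet just deployed above, i.e. the single place where the section-to-port mapping now lives instead of "6 different locations". Its exact layout isn't shown in the log, so the sketch below assumes a plain "section,port" CSV purely for illustration; the point is that any consumer — puppet-managed config, wmfmariadbpy, backup tooling — can load the same file rather than keep its own copy of the mapping.)

```python
#!/usr/bin/env python3
# Illustrative reader for a central section->port mapping such as
# /etc/mysql/section_ports.csv. The real file format is assumed, not known:
# one "section,port" pair per line, with optional comment lines.
import csv

def load_section_ports(path="/etc/mysql/section_ports.csv"):
    """Return a dict mapping section name (e.g. 's1', 'x1') to its TCP port."""
    ports = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if not row or row[0].startswith("#") or len(row) < 2:
                continue  # skip blanks, comments, malformed lines
            ports[row[0].strip()] = int(row[1])
    return ports

if __name__ == "__main__":
    mapping = load_section_ports()
    print(mapping.get("x1"))  # every tool reads the same source of truth
```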
[13:20:40] there was this thing I was speaking before
[13:20:55] kormat: also I have made the test to use whatever is defined in the releng/tox-mysqld container which spawns its own mysqld. But you could have the testsuite to spawn 1..n mysqld as needed
[13:20:55] not sure how to implement that not all sections should be allower without duplicating definitions
[13:21:01] *allowed
[13:21:19] i'm going to send a series of puppet patches reducing duplication in the existing mariadb profiles,
[13:21:22] e.g. how to limit which of the sections from this list are available on each profile
[13:21:30] and either wmcs merge their CR first and i handle the change for them,
[13:21:36] or they update the CR based on what i get merged
[13:21:41] either is fine by me
[13:22:19] yeah, just make sure to coordinate with bstorm to avoid race conditions there
[13:24:07] i'll reply to her CR
[13:24:23] thanks
[13:33:04] done
[13:46:27] 10DBA: Replication broken on db1110 - https://phabricator.wikimedia.org/T261276 (10Kormat) 05Open→03Resolved Host is fully repooled, icinga is all green.
[14:05:53] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Kormat)
[14:06:21] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Kormat) I'll be the contact person for the data-persistence team for this.
[14:08:54] kormat: let's move that task to the blocked/external column if that's ok
[14:09:21] done
[14:09:26] :*
[14:10:02] and created 3 cal entries to remind myself :)
[14:10:23] thanks
[15:11:04] 10DBA, 10Operations, 10observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (10jijiki) 05Resolved→03Open
[15:14:07] 10DBA, 10Operations, 10observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (10jijiki) 05Open→03Resolved Reopened the wrong task, re-closing. Nothing to see here, move along.
[16:48:30] RECOVERY - MariaDB sustained replica lag on db2072 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2072&var-port=9104
[21:44:01] 10DBA, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Patch-For-Review, 10User-Marostegui: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (10Mholloway) Great! Looks like we're close to wrapped up with this. >>! In T246716#640...