[00:43:38] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[00:51:16] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[03:28:20] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The defragmentation finished: ` root@db1115:~# xfs_db -c frag -r /dev/md2 actual 8276, ideal 8176, fragmentation factor 1.21% ` Starting MySQL now without the...
[03:31:04] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) As soon as we start the events, the server goes crazy again.
[03:48:39] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Changing the isolation level to 'READ-UNCOMMITTED' seems to have worked and the server is now under control after restarting the event_scheduler. Also, I...
[03:55:25] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Changed back to `READ-COMMITTED`, as otherwise the Host view isn't really useful: it doesn't show the replication status and lag for most of the hosts.
[03:56:57] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The load and the generally bad status can be seen in this image: {F34435112}
[03:58:51] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) And the host is stuck at 1000 connections again.
[04:02:29] 10DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (10Marostegui) db1118 tables are good, will pool this host next week.
[04:03:23] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) I have increased the buffer pool to 50GB (we were using 20GB); those big tables are InnoDB, so let's see if this helps.
[04:19:19] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The server seems "stable" at around 600 connections, which is way higher than normal, but so far it is not increasing. I really think we've reached...
[04:28:10] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Disabled the `*schema` events (one runs every minute and the other one every day); we don't use them for anything: ` CREATE DEFINER=`root`@`localhost` EVEN...
[04:46:23] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The host is around 400-600 connections all the time. I am not going to make more changes; it is slow but functional at this point.
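A minimal SQL sketch of what the steps discussed in T281486 above could look like. The purge event is only one possible way to do what the task title asks for; its name, the event_time column, the retention window and the batch size are assumptions. The tuning statements mirror the 03:48/03:55 isolation-level change, the 04:03 buffer-pool bump from 20GB to 50GB, and the 04:28 disabling of the *schema events; the exact statements are not in the log, and the event name in the last line is hypothetical.

    -- Hypothetical retention event for tendril.general_log_sampled (T281486);
    -- column name, window and batch size are illustrative only.
    CREATE EVENT IF NOT EXISTS tendril.purge_general_log_sampled
      ON SCHEDULE EVERY 1 DAY
      DO
        DELETE FROM tendril.general_log_sampled
        WHERE event_time < NOW() - INTERVAL 30 DAY
        LIMIT 100000;

    -- Tuning described in the comments above; syntax is an assumption.
    SET GLOBAL tx_isolation = 'READ-UNCOMMITTED';                  -- later reverted to 'READ-COMMITTED' (03:55)
    SET GLOBAL innodb_buffer_pool_size = 50 * 1024 * 1024 * 1024;  -- 20GB -> 50GB (04:03)

    -- Disabling the unused *schema events (04:28); the real event names are
    -- truncated in the log, so this name is hypothetical.
    ALTER EVENT tendril.some_schema_event DISABLE;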
[04:56:26] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Seems a lot more stable now: {F34435137}
[05:00:19] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui)
[05:08:04] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1080.eqiad.wmnet` - db1080.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[05:10:20] 10DBA, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui) This is ready for DC-Ops.
[05:10:59] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[05:18:30] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (10Marostegui) s8 eqiad: [] labsdb1011 (not needed) [] labsdb1010 (not needed) [x] labsdb1009 (not needed) [x] dbstore1005 [x] db1177 [x] db1172 [x] db1167 [x] db1154 [] db1126 [x] db1116 [x] db11...
[05:18:55] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (10Marostegui)
[09:11:08] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10jcrespo) I checked all the other tables; they were good.
[12:40:57] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Marostegui) Thanks @Ladsgroup - keep me posted!
[12:42:33] 10DBA, 10Data-Persistence-Backup, 10SRE, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) @Ladsgroup if we are going to keep track of the testing database deletion on {T281548}, we can probably ignore T278614#7022985 and close th...
[12:48:28] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) We have created a document to try to come up with a movement plan and see if we can do this during the next DC switchover (T281515). Once we've got the pla...
[13:06:49] 10DBA, 10Data-Persistence-Backup, 10SRE, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) 05Open→03Resolved a:03Marostegui Let's call it done: "Create production databases for mailman3" is clearly done.
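For the report_host rollout tracked in T266483 above, a quick verification sketch (the check itself is a suggestion, not something stated in the task). report_host is set in the replica's server configuration and only takes effect after a restart; once it is active, the replica shows up by hostname on its primary.

    -- On a replica: confirm the configured value (empty means not yet enabled).
    SELECT @@report_host;

    -- On its primary: replicas that set report_host are listed with a hostname.
    SHOW SLAVE HOSTS;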
[13:24:21] * sobanski stepping out for a while, reachable via usual channels if needed
[13:24:50] sobanski: don't worry, tendril will be here waiting for you, no matter how far you go
[13:25:30] Like the Richard Marx song
[13:26:17] haha
[14:25:17] 10DBA, 10AbuseFilter: Check whether `FORCE INDEX page_timestamp` is still needed in LazyVariableComputer.php - https://phabricator.wikimedia.org/T281579 (10Daimona)
[16:13:07] FYI: I removed 30M rows from the watchlist table of commonswiki, that's 16% of it. Whether to shrink the table or not I'll let you decide, but at least it won't grow for a couple of years, hopefully.
[16:15:17] that's nice
[16:15:26] we can always do a test and see if it is worth it
[19:41:23] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[22:53:15] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10thcipriani) >>! In T274463#7046750, @Dzahn wrote: > Frankly, I am not sure I have the resources and knowledge to get into an entirely new LVM snapshott...
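Two of the items above lend themselves to short SQL illustrations. For the `FORCE INDEX page_timestamp` question at 14:25, deciding whether the hint is still needed usually means comparing plans with and without it; the query below is only a guess at the shape of what LazyVariableComputer.php issues, not the actual code. For the commonswiki watchlist cleanup at 16:13, the space freed by the deleted rows can be estimated from information_schema and reclaimed with a table rebuild; the schema and table names are the production ones, the rest is illustrative.

    -- Compare plans with and without the hint (the concrete query is assumed):
    EXPLAIN SELECT rev_id
    FROM revision FORCE INDEX (page_timestamp)
    WHERE rev_page = 12345
    ORDER BY rev_timestamp DESC
    LIMIT 1;

    -- Estimate reclaimable space in commonswiki.watchlist (data_free is only an
    -- approximation for InnoDB):
    SELECT ROUND(data_length / 1024 / 1024 / 1024, 1) AS data_gb,
           ROUND(data_free  / 1024 / 1024 / 1024, 1) AS free_gb
    FROM information_schema.tables
    WHERE table_schema = 'commonswiki' AND table_name = 'watchlist';

    -- Rebuilding the table reclaims the space; on production this would go through
    -- the normal depool/alter/repool procedure rather than being run on a live replica.
    ALTER TABLE commonswiki.watchlist ENGINE=InnoDB;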