[00:43:38] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[00:51:16] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[03:28:20] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The defragmentation finished: ` root@db1115:~# xfs_db -c frag -r /dev/md2 actual 8276, ideal 8176, fragmentation factor 1.21% ` Starting MySQL now without the...
[03:31:04] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) As soon as we start the events, the server goes crazy again.
[03:48:39] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Changing the isolation level to 'READ-UNCOMMITTED' seems to have worked and the server is now under control after restarting the event_scheduler. Also, I...
[03:55:25] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Changed back to `READ-COMMITTED`, as otherwise the Host view isn't really useful: it doesn't show the replication status and lag for most of the hosts.
[03:56:57] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The load and the generally bad status can be seen in this image: {F34435112}
[03:58:51] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) And the host is stuck at 1000 connections again.
[04:02:29] 10DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (10Marostegui) db1118 tables are good, will pool this host next week.
[04:03:23] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) I have increased the buffer pool to 50GB (we were using 20GB); those big tables are InnoDB, so let's see if this helps.
[04:19:19] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The server seems "stable" at around 600 connections, which is way higher than normal, but so far it is not increasing. I really think we've reached...
[04:28:10] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Disabled the `*schema` events (one runs every minute and the other one every day); we don't use them for anything: ` CREATE DEFINER=`root`@`localhost` EVEN...
[04:46:23] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) The host is around 400-600 connections all the time. I am not going to make more changes; it is slow but functional at this point.
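A minimal SQL sketch of what the steps discussed in T281486 above could look like. The purge event is only one possible way to do what the task title asks for; its name, the event_time column, the retention window and the batch size are assumptions. The tuning statements mirror the 03:48/03:55 isolation-level change, the 04:03 buffer-pool bump from 20GB to 50GB, and the 04:28 disabling of the *schema events; the exact statements are not in the log, and the event name in the last line is hypothetical.

    -- Hypothetical retention event for tendril.general_log_sampled (T281486);
    -- column name, window and batch size are illustrative only.
    CREATE EVENT IF NOT EXISTS tendril.purge_general_log_sampled
      ON SCHEDULE EVERY 1 DAY
      DO
        DELETE FROM tendril.general_log_sampled
        WHERE event_time < NOW() - INTERVAL 30 DAY
        LIMIT 100000;

    -- Tuning described in the comments above; syntax is an assumption.
    SET GLOBAL tx_isolation = 'READ-UNCOMMITTED';                  -- later reverted to 'READ-COMMITTED' (03:55)
    SET GLOBAL innodb_buffer_pool_size = 50 * 1024 * 1024 * 1024;  -- 20GB -> 50GB (04:03)

    -- Disabling the unused *schema events (04:28); the real event names are
    -- truncated in the log, so this name is hypothetical.
    ALTER EVENT tendril.some_schema_event DISABLE;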
[04:56:26] 10DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (10Marostegui) Seems a lot more stable now: {F34435137}
[05:00:19] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui)
[05:08:04] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1080.eqiad.wmnet` - db1080.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[05:10:20] 10DBA, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui) This is ready for DC-Ops.
[05:10:59] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[05:18:30] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (10Marostegui) s8 eqiad: [] labsdb1011 (not needed) [] labsdb1010 (not needed) [x] labsdb1009 (not needed) [x] dbstore1005 [x] db1177 [x] db1172 [x] db1167 [x] db1154 [] db1126 [x] db1116 [x] db11...
[05:18:55] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (10Marostegui)
[09:11:08] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10jcrespo) I checked all the other tables; they were good.
[12:40:57] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Marostegui) Thanks @Ladsgroup - keep me posted!
[12:42:33] 10DBA, 10Data-Persistence-Backup, 10SRE, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) @Ladsgroup if we are going to keep track of the testing database deletion on {T281548}, we can probably ignore T278614#7022985 and close th...
[12:48:28] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) We have created a document to try to come up with a movement plan and see if we can do this during the next DC switchover (T281515). Once we've got the pla...
[13:06:49] 10DBA, 10Data-Persistence-Backup, 10SRE, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) 05Open→03Resolved a:03Marostegui Let's call it done: "Create production databases for mailman3" is clearly done.
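For the report_host rollout tracked in T266483 above, a quick verification sketch (the check itself is a suggestion, not something stated in the task). report_host is set in the replica's server configuration and only takes effect after a restart; once it is active, the replica shows up by hostname on its primary.

    -- On a replica: confirm the configured value (empty means not yet enabled).
    SELECT @@report_host;

    -- On its primary: replicas that set report_host are listed with a hostname.
    SHOW SLAVE HOSTS;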
[13:24:21] * sobanski stepping out for a while, reachable via usual channels if needed
[13:24:50] sobanski: don't worry, tendril will be here waiting for you, no matter how far you go
[13:25:30] Like the Richard Marx song
[13:26:17] haha
[14:25:17] 10DBA, 10AbuseFilter: Check whether `FORCE INDEX page_timestamp` is still needed in LazyVariableComputer.php - https://phabricator.wikimedia.org/T281579 (10Daimona)
[16:13:07] FYI: I removed 30M rows from the watchlist table of commonswiki, that's 16% of it. Whether to shrink the table or not I'll let you decide, but at least it won't grow for a couple of years, hopefully.
[16:15:17] that's nice
[16:15:26] we can always do a test and see if it is worth it
[19:41:23] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[22:53:15] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10thcipriani) >>! In T274463#7046750, @Dzahn wrote: > Frankly, I am not sure I have the resources and knowledge to get into an entirely new LVM snapshott...
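Two of the items above lend themselves to short SQL illustrations. For the `FORCE INDEX page_timestamp` question at 14:25, deciding whether the hint is still needed usually means comparing plans with and without it; the query below is only a guess at the shape of what LazyVariableComputer.php issues, not the actual code. For the commonswiki watchlist cleanup at 16:13, the space freed by the deleted rows can be estimated from information_schema and reclaimed with a table rebuild; the schema and table names are the production ones, the rest is illustrative.

    -- Compare plans with and without the hint (the concrete query is assumed):
    EXPLAIN SELECT rev_id
    FROM revision FORCE INDEX (page_timestamp)
    WHERE rev_page = 12345
    ORDER BY rev_timestamp DESC
    LIMIT 1;

    -- Estimate reclaimable space in commonswiki.watchlist (data_free is only an
    -- approximation for InnoDB):
    SELECT ROUND(data_length / 1024 / 1024 / 1024, 1) AS data_gb,
           ROUND(data_free  / 1024 / 1024 / 1024, 1) AS free_gb
    FROM information_schema.tables
    WHERE table_schema = 'commonswiki' AND table_name = 'watchlist';

    -- Rebuilding the table reclaims the space; on production this would go through
    -- the normal depool/alter/repool procedure rather than being run on a live replica.
    ALTER TABLE commonswiki.watchlist ENGINE=InnoDB;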