[07:31:53] DBA, Operations: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) This has happened again: ` root@cumin1001:~# mysql.py -hdb1115 -e "show processlist" | grep Copy 35178632 root 10.64.32.25 tendril Connect 15049 Copying to tmp table insert into processlist...
[07:34:40] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui)
[07:35:01] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) p:Triage→High
[07:35:35] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Right after restarting mysql: ` 07:35:12 up 5 days, 2:39, 2 users, load average: 492.22, 411.17, 215.39 `
[08:03:42] DBA, Operations: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) All the events like `es1020_eqiad_wmnet_3306_schema` have been disabled across all the hosts as they were running every minute and we don't really use them.
[08:09:28] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Also noticed lots of: ` delete from global_status where server_id = @server_id and variable_name not like '%.%' ` They were taking ages to complete cause that table was around 15GB (and I don't think we e...
[08:10:56] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Most of the connections are getting stalled now on: ` 69415 root 10.64.32.13 tendril Connect 1084 updating delete from global_status where server_id = @server_id and variable_name not like '%.%' 0.000 6943...
[08:30:36] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) I have finally truncated also `global_status` as its data isn't really super used and it gets repopulated by the cronjobs and by the events - the table was around 10GB: An example of its data: ` mysql:root...
[08:47:47] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) And it is now again almost unresponsive
[08:51:59] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) I have given double its buffer pool size: from 20 to 40GB to innodb and from 30GB to 40GB to tokudb.
[08:57:33] DBA, Schema-change, Tracking-Neverending: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (Aklapper)
[09:06:46] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) At this point I think we've either hit some sort of limits on concurrency (we have 2273 events enabled) or there's some underlying HW issue that is making the host perform slower than usual. Random connect...
[09:25:30] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) There are many of these entries: ` May 9 19:25:39 db1115 smartd[807]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 77 May 9 19:25:39 db1115 smartd[807]: D...
[09:44:33] DBA, Operations: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) Fully rebooted the host, and started it with the event_scheduler disabled. I went to tendril's web and caught this query: ` select `server_id`, `variable_value` from tendril.global_variabl...
[09:58:16] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) The CPUs were running powersave as scaling_governor; I have changed them to performance. I don't expect miracles, but... The problem is still that not very big transactions take lots to finish and everyth...
[10:47:14] DBA, Operations: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) This is quite insane - how can we have that crazy amount of rows for just one month of data: ` mysql:root@localhost [tendril]> show explain for 283543; +------+-------------+---------------...
[10:48:37] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Fully rebooted the host, and started it with the event_scheduler disabled. I went to tendril's web and caught this query: ` select `server_id`, `variable_value` from tendril.global_variables where `varia...
[10:49:35] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) This is quite insane - how can we have that crazy amount of rows for just one month of data: ` mysql:root@localhost [tendril]> show explain for 283543; +------+-------------+-------------------+-------+---...
[10:51:59] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) ` mysql:root@localhost [(none)]> show explain for 291612; +------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+ | id | select_type |...
[11:04:31] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) So looks like we have an event to purge that table but doesn't seem to be working at all based on the dates below: ` mysql:root@localhost [tendril]> show create event tendril_purge_global_status_log_5m; +-...
[11:49:53] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) So looks like the above event was never able to execute when the scheduler is enabled due to, basically, locking issues due to all the concurrency: ` daemon.log:May 10 07:35:01 db1115 mysqld[9874]: 2020-05-...
[12:23:28] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Open→Resolved a:Marostegui I decided to finally fully truncate `global_status_log_5m` as we don't really use those values anyways. truncate + optimize fixed the thing. Tendril is now back at it...
[12:23:29] DBA, Operations: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui)
[12:32:21] DBA: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331 (Marostegui)
[12:33:00] DBA: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331 (Marostegui) p:Triage→High Setting it to high as this has (probably) been the cause of two recent incidents on tendril
[12:34:42] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Btw, the effect of truncating those tables on the used disk space: {F31811071}
[12:52:56] DBA, Schema-change, Tracking-Neverending: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (matthiasmullie)
[14:30:46] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (jcrespo) I would suggest to move zarcillo database away- it is very annoying to have the db down for an unrelated (to backups) reason, which will result in backup alerting on Monday. Unless you give me a good reason...
[14:39:42] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Fine with me
[17:00:29] DBA, Operations: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui)
[19:06:02] DBA, Beta-Cluster-Infrastructure, Performance-Team: Enable GTID on beta cluster mariaDB once upgraded - https://phabricator.wikimedia.org/T139044 (Krinkle)
[19:27:34] DBA: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (Privacybatm) > I think transfering, handling mariadb and handling the firewall as logically separate units that should be also separate on its own files. That will make the files smaller instead of a monolithic unit. What do you think? I...
[19:53:53] DBA, Data-Services, Patch-For-Review: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617 (bd808)