[06:04:50] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142794 (Marostegui) Resolved>Open This has happened again, so maybe the BBU is indeed faulty. ``` root@db1048:~# date ; mysql --skip-ssl -e "show slave status\G" | grep...
[06:06:44] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142797 (Marostegui) a:Marostegui>Cmjohnson
[06:25:20] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3142820 (Marostegui) The eventlogging script on db1047 is failing due to: ``` Thu Mar 30 06:07:49 UTC 2017 localhost ContentTranslationError_11767097, createERROR 1005 (...
[06:28:59] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142822 (Marostegui) ``` ˜/icinga-wm 8:27> RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds ``` ``` Battery State: Optimal ``` `...
[06:32:53] DBA, MediaWiki-Database, Patch-For-Review, PostgreSQL, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3142825 (Marostegui) db1094 (s7) is back in the pool with all the UNIQUE converted to PK for all the wik...
[07:26:36] DBA: s2 on db1069 stuck on a query for trwiki flaggedrevs_tracking table - https://phabricator.wikimedia.org/T161781#3142891 (Marostegui)
[07:26:55] DBA: s2 on db1069 stuck on a query for trwiki flaggedrevs_tracking table - https://phabricator.wikimedia.org/T161781#3142904 (Marostegui) Open>Resolved
[08:28:09] DBA, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Add autoincrement id to EventLogging MySQL tables. {oryx} - https://phabricator.wikimedia.org/T125135#3143065 (jcrespo) That would be huge win! And it would make the terbium back-filling unnecessary finally!
[08:44:32] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3143073 (jcrespo) Should we increase open_files_limit or do you think this was a one-time issue due to the rename process?
[08:45:59] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3143079 (jcrespo) I believe there was yesterday maintenance or trouble on Phabricator. I would ask RelEng first.
[08:46:06] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3143081 (Marostegui) I thought about it, but it has not happened until we did the rename thingy yesterday. So I would leave it for now.
[08:47:48] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3143084 (Marostegui) Yep, the deployment page said there was a phabricator update so maybe that put more stress on the server and made the BBU fail (again)? Because the fact tha...
[10:30:46] DBA, Wikidata, Patch-For-Review, User-Daniel, and 2 others: Use redis-based lock manager for dispatchChanges on test sites. - https://phabricator.wikimedia.org/T159828#3143259 (hoo) >>! In T159828#3140724, @daniel wrote: > @hoo Can you confirm that we are actually running multiple instances of di...
[11:45:08] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3143360 (Marostegui) db2019 (codfw) master is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2019.codfw.wmnet commonswi...
[13:56:11] DBA, MediaWiki-Database, Patch-For-Review, PostgreSQL, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3143554 (Marostegui) db1090 (s2) has had the following tables with UNIQUE -> PK: ``` categorylinks image...
[14:04:53] DBA, Patch-For-Review: run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038) - https://phabricator.wikimedia.org/T154485#3143594 (jcrespo) ``` $ python3 compare.py dbstore1002 db1036 svwiki text old_id --from-value=27558001 --to-value=27559000 --step...
[14:07:58] DBA, Patch-For-Review: run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038) - https://phabricator.wikimedia.org/T154485#3143599 (jcrespo) They are not toys! They are useful: ``` $ ./sql.py -h dbstore1002.eqiad.wmnet svwiki -e "SELECT * FROM text WHE...
[14:31:01] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3143638 (Ottomata) Wow, ok. Thanks.
[14:33:26] jynus: you playing with db2033? (x1)?
[14:33:33] it is lagging due to: 10401884 | system user | | NULL | Connect | 513 | Waiting for semi-sync ACK from slave | NULL
[14:39:08] oh, that host has been having timeouts for days now
[14:39:59] nope, not me
[14:40:15] it could be semi sync with a large timeout
[14:40:28] I enabled it there recently and we are missing a slave
[14:41:12] but I think that is a manifestation, not a consequence
[14:41:15] yeah, it has been having timeouts since 27th march
[14:41:19] but let me disable it
[14:41:22] just in case
[14:43:17] I have disabled it
[14:43:28] if it continues happening, it is something else
[14:43:34] it caught up instantly
[14:43:46] I was checking the disks and all that and that looks fine
[14:43:49] if it stops, we need a slave until we reenable it again
[14:44:22] it caught up in no time
[14:44:31] we should add the new slave and dbstore1002 as slaves
[14:44:39] and then reenable it again
[14:44:55] we also need to import x1 to dbstore2002
[14:45:04] do we?
[14:45:22] why not? we have space there
[14:45:36] sure
[14:45:47] I thought it was full already
[14:45:49] it is not high priority or anything, but i think we should
[14:46:08] we don't have space for another shard, but for x1 we do
[14:46:12] there is 1.2T available
[14:48:22] yeah, ok for me
[14:48:30] just not really a priority
[14:48:37] no, not at all
[14:48:38] those hosts are nice for backup
[14:48:59] but not really usable for production traffic
[14:49:13] yeah, I meant it as a backup of course
[14:49:16] we can do it at the same time the new host is set up
[14:49:19] yeah
[14:49:22] https://www.drupal.org/node/2856362
[14:49:45] "Percona XtraDB Cluster will deny a request if an operation is performed on a table without an explicit primary key."
[14:50:06] primary keys, primary keys, primary keys, lovely primary keys
[14:50:21] hahaha
[14:50:37] mysql should never let you create a table without PK!
[14:50:48] https://www.youtube.com/watch?v=M_eYSuPKP3Y
[14:51:02] hahahaha i love that one!!!
[14:52:07] I used to refer to that one when I was in amsterdam and we always had cheese sandwiches :)
[14:54:47] among our weaponry are: indexes, and an almost fanatical devotion to primary keys: https://www.youtube.com/watch?v=Nf_Y4MbUCLY
[14:56:27] I don't know that one!
[14:56:52] well, nobody expects the Spanish Inquisition!
[15:14:30] DBA, Analytics-Kanban, Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3143721 (Nuria) a:Ottomata
[15:15:29] DBA, Analytics-Kanban, Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (Nuria) Let's take advantage of the fact that after the rename we have now autoincrement ids on new tables.
[15:22:26] there is contention on db1089
[15:22:35] a friend of ours
[15:22:54] our lovely queries?
[15:22:58] yep
[15:23:12] hopefully tomorrow they will be gone :)
[15:26:35] pff
[15:26:38] it is in very bad shape
[15:46:05] DBA: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3143815 (Marostegui) dewiki is finished and had some differences on geo_tags table: ``` Differences on db2045 TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY dewiki.geo_tags 1 0 1 PRIMARY 38 8911555 Differenc...
[16:39:03] DBA, Analytics-EventLogging, Analytics-Kanban, Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3144009 (Nuria)
[16:57:12] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3144190 (jcrespo) Another close-to-outage incident with a long running query at: T159319#3144183 , This confirms I need to do this ASAP.
[17:20:09] DBA, Patch-For-Review: run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038) - https://phabricator.wikimedia.org/T154485#3144341 (jcrespo) text is now ok after reimport, oldimage, too (with some false positives due to some deleted revisions out of or...
[18:26:23] hm, Q
[18:26:31] is eventlogging data purged from the slaves?
[18:26:38] i'm looking at this el_sync script
[18:26:41] and the purging part is commented out