[08:09:17] DBA, Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3908642 (Marostegui) change_tag table has been fixed across all the servers. Next: tag_summary
[08:59:05] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3908702 (jcrespo)
[08:59:08] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3908700 (jcrespo) Resolved→Open different rows detected on at least frwiki.archive on db2039.
[12:21:01] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3909354 (jcrespo)
[12:21:05] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3909352 (jcrespo) Open→Resolved fixed
[14:35:18] so now, what, do I continue running compare.py on s3?
[14:35:27] and risking a problem
[14:35:42] or do I do it because, if there is a problem, we have to reproduce it?
[14:36:14] after all, if reading heavily on a single thread breaks a host, it is not a good host to begin with
[14:36:34] and better to learn that now than when it is in production
[14:37:15] and db2018 had the same load and didn't crash
[14:47:38] Yeah, I have been using it heavily lately
[14:47:41] with 0 issues
[14:47:45] (including db1052)
[16:28:16] DBA: Check data consistency across production shards - https://phabricator.wikimedia.org/T183735#3910061 (jcrespo)
[16:28:18] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3910062 (jcrespo)
[16:28:21] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3910059 (jcrespo) Resolved→Open db2040 and db1094 eswiki.archive had drifts. Checking them and correcting them.
[16:37:17] for some reason, the alter table on db1095 doesn't appear on tendril activity :-(
[16:38:40] probably because of row based replication
[16:39:16] (even if alters are always a statement query)
[16:59:45] DBA, Analytics, Patch-For-Review, User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3910160 (Nuria)
[16:59:47] DBA, Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3910159 (Nuria) Open→Resolved
[17:02:09] DBA, Analytics, Patch-For-Review, User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3910166 (elukey) Open→Resolved
[17:02:17] DBA, Analytics, Patch-For-Review, User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1618887 (elukey)
[17:15:19] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3910211 (jcrespo) This is bad (split brain between eqiad and codfw): ``` ./compare.py eswiki archive ar_id db1062 db1069 db1079 db1086 db1094 db1098:3317 db1101:3317 db2040 db2029 db2047 db2054 db2061 db2068 db2077 db...
[17:16:51] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3910213 (jcrespo) It is 2 rows that got swapped IDs, either caused by a bad query on the codfw master, or fixed on eqiad only: ``` mysql -h db2040.codfw.wmnet $db -e "SELECT * FROM $table WHERE $pk IN (12682920,...
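Editor's note: the checksum work discussed above (T162807, T160509, T163190) rests on the same basic technique: compute a deterministic per-chunk checksum on every replica and diff the results, then inspect the rows of any chunk that differs. A minimal SQL sketch of that idea follows; the 10k-row chunk size and the column subset of the `archive` table are illustrative assumptions only, and this is not the actual logic of pt-table-checksum or compare.py.

```sql
-- Sketch of chunk checksumming in the spirit of pt-table-checksum / compare.py.
-- Run the same statement on each replica and diff (chunk, row_count, chunk_checksum);
-- any chunk that differs contains a drift worth inspecting row by row.
-- Chunk size and column list are illustrative, not the full archive schema.
SELECT
    FLOOR(ar_id / 10000) AS chunk,        -- fixed 10k-row chunks by primary key
    COUNT(*)             AS row_count,
    BIT_XOR(CRC32(CONCAT_WS('#', ar_id, ar_namespace, ar_title, ar_timestamp)))
                         AS chunk_checksum
FROM archive
GROUP BY chunk
ORDER BY chunk;
```

Once a chunk mismatches, the individual rows can be pulled by primary key from both masters (as in the truncated `SELECT * FROM $table WHERE $pk IN (...)` query in the T163190 comment above) and compared by hand before deciding which data center holds the correct version.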
[17:25:50] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3910259 (jcrespo) Should be fixed now, I will wait until the s7 check is complete to close the ticket.
[18:30:17] jynus: I think it is not because of ROW but because it goes through replication
[18:30:27] because the ones done on the codfw masters don't appear when they run on the slaves
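Editor's note: the point being made is that a replicated ALTER does run on the replica, but it executes under the replication SQL thread rather than as a client connection, so an activity sampler that only watches client connections can miss it. One way to confirm this directly on the replica is to look at the processlist; a hedged sketch (how tendril itself samples activity is not shown in this log and may differ):

```sql
-- Confirm a replicated ALTER is executing on the replica: statements applied by the
-- replication SQL thread show up under user 'system user', not as a normal client,
-- so filtering on client connections alone would hide them.
SELECT id, user, db, time, state, LEFT(info, 100) AS query
FROM information_schema.processlist
WHERE user = 'system user'
   OR info LIKE 'ALTER TABLE%';
```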