[04:40:25] 10DBA, 06Labs: Labs database corruption - https://phabricator.wikimedia.org/T166091#3285070 (10Legoktm) From a production database server: ``` mysql:wikiadmin@db1083 [enwiki]> SELECT pl_namespace, pl_title -> FROM page -> JOIN pagelinks ON pl_from = page_id -> WHERE page_namespace=0 AND page_title=...
[06:03:13] 10DBA, 06Labs: Labs database corruption - https://phabricator.wikimedia.org/T166091#3285070 (10Marostegui) Looks like this is only happening on the old labs infra (db1069, labsdb1001 and labsdb1003). The new ones are showing the same value as production.
[06:06:23] 10DBA, 13Patch-For-Review: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611#3285395 (10Marostegui) db1021 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# for i in `cat /home/marostegui/T162611`; do echo $i; mysql --skip-ssl -hdb1021 $i -e "show create table revision\G";...
[07:07:48] 10DBA: Drop Gather tables from wmf wikis - https://phabricator.wikimedia.org/T166097#3285215 (10Marostegui) I have renamed the tables on a few hosts and will leave them like that for a few days to make sure no errors appear. They have been renamed to: ``` T166097_gather_list T166097_gather_list_flag T166097_gath...
[07:19:39] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285491 (10Marostegui) ruwiki on codfw is done (dbstore2001 will get it tomorrow, it is the delayed s...
[07:47:05] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285514 (10Marostegui) db1069 done (and replicated downstream): ``` root@neodymium:/home/marostegui/g...
[07:56:06] jynus: marostegui it looks like write times for cognate queries have increased on db1031 again
[07:56:25] and on https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-7d&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1031 I see a bunch of things like disk latency and wait time increasing
[07:56:28] anything to worry about?
[07:58:09] what happened at 4:30?
[07:59:24] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=19&fullscreen&orgId=1&var-server=db1031&var-network=eth0&from=now-24h&to=now
[07:59:47] wow, that's crazy disk usage
[08:00:11] !log the last script I started is now stopped
[08:00:12] addshore: Not expecting to hear !log here
[08:00:17] meh, wrong channel....
[08:00:19] It is not crazy per se, but it is crazy compared to the disk usage it had before
[08:06:11] I think I know what is going on
[08:06:16] Looks like a HW issue
[08:06:51] 1->36% in 2 mins does seem odd
[08:07:30] It is the BBU
[08:07:39] The raid controller went to writethru mode
[08:08:48] marostegui: looking back, it looks like this happened too during the x1 outage when cognate was disabled?
[08:08:50] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=19&fullscreen&orgId=1&var-server=db1031&var-network=eth0&from=1493816153771&to=1494017945000
[08:08:56] I think we might have just found the root cause?
[08:09:25] addshore: Don't think so, because if we do not replace the BBU or change the policy (which I am doing now) it wouldn't have fixed itself
[08:09:39] interesting (expanded) https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=19&fullscreen&orgId=1&var-server=db1031&var-network=eth0&from=1493643353000&to=1494017945000
[08:11:26] The timex of that disk usage match the incident exactly
[08:11:34] *times
[08:11:49] But that can be the consequence
[08:12:02] addshore, the original issue is 10000 connections waiting
[08:12:05] Again, this time it looks like the BBU, and that doesn't get fixed automatically and as far as I know we never touched it
[08:12:12] Let me try to confirm it is the BBU this time
[08:12:15] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&from=1493828117986&to=1493833737439&var-dc=eqiad%20prometheus%2Fops&var-server=db1031
[08:12:36] do you see that happening now?
[08:13:25] The disk usage appears before the 1000 connections: the DU spike started at 14:00, the first instance of a query being killed is at 14:21:59, the query killer caused the table locks which in turn caused the spike in connections
[08:13:52] the disk usage happens because of the switchover
[08:14:15] jynus: no, but I do see the write times of queries increasing https://grafana.wikimedia.org/dashboard/db/mediawiki-cognate?refresh=1m&orgId=1
[08:14:58] again, do you see the queries waiting?
[08:15:42] because if you have a hammer, all you see is nails
[08:16:47] not yet, but that took some hours before: DU increase @ 14:00, first query killed @ 14:21:59, spike in connections at 16:51.
[08:17:09] only writes slower than 60 seconds are killed, do you see any of those?
[08:17:13] And after forcing the policy to WriteBack, solved: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=19&fullscreen&orgId=1&var-server=db1031&var-network=eth0&from=1495511890227&to=1495527409190
[08:17:18] I am going to create the ticket now
[08:18:31] jynus: no, but again, this seemed to take time during the outage
[08:19:20] Auto-Learn Mode: Disabled
[08:19:25] I disabled it
[08:19:28] just in cae
[08:19:29] case
[08:19:31] did you change it now?
[08:19:41] from what?
[08:19:47] Auto-Learn Mode: Warn via Event
[08:19:50] warning
[08:19:51] after the outage, when the DU dropped, the contention issues immediately disappeared. With my limited db & hardware knowledge this still all seems to be very connected
[08:19:53] then why
[08:20:16] addshore: please stop
[08:20:34] I didn't want the BBU to do anything weird once I forced WriteBack
[08:20:40] Learn Cycle Requested : Yes
[08:20:43] yes
[08:20:47] I forced it first
[08:20:48] why?
[08:20:53] as that fixed db1048 first
[08:20:56] ah, so now
[08:21:16] do you have an output before touching it?
[08:21:21] yep
[08:21:25] I am pasting it on the ticket, one sec
[08:21:31] please share
[08:21:44] I am, give me a sec to finish the ticket :)
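For reference, the remediation described above (capture the BBU state, force the controller back to WriteBack, request a manual relearn) boils down to a handful of MegaCli calls. A minimal sketch, wrapped in Python purely for illustration — the flags are assumed from standard MegaCli usage rather than taken from the exact commands run on db1031, and the adapter/logical-drive selectors are examples:

```python
#!/usr/bin/env python
"""Sketch of the BBU remediation steps discussed above.

Assumptions: MegaCli is installed as `megacli`, adapter 0, all logical drives.
This mirrors the manual procedure; it is not the script that was actually run.
"""
import subprocess


def megacli(*args):
    """Run one megacli command and return its output as text."""
    return subprocess.check_output(('megacli',) + args).decode('utf-8', 'replace')


# 1. Capture the BBU state before touching anything (what got pasted to the ticket).
print(megacli('-AdpBbuCmd', '-a0'))

# 2. Force WriteBack even with a degraded battery ("ForcedWB" ignores the BBU state).
print(megacli('-LDSetProp', '-ForcedWB', '-Immediate', '-LAll', '-a0'))

# 3. Kick off a manual battery relearn cycle, as was done earlier on db1048.
print(megacli('-AdpBbuCmd', '-BbuLearn', '-a0'))

# 4. Verify the current cache policy is back to WriteBack.
print(megacli('-LDInfo', '-LAll', '-a0'))
```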
[08:27:40] 10DBA, 06Operations, 10ops-eqiad: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285567 (10Marostegui)
[08:27:40] jynus: ^
[08:27:59] Sorry it took some minutes, I wanted to paste the whole thing in order to give more context
[08:28:29] no, I want the fully original output before touching it
[08:28:56] if you have it
[08:29:18] I do
[08:29:45] https://phabricator.wikimedia.org/P5476
[08:29:47] that?
[08:30:12] yes, thanks
[08:31:10] Warn via Event and disabled mostly do the same
[08:31:33] it changed because the battery was low on charge
[08:32:06] Interesting, it now changed to Optimal again, which is the same behaviour db1048 had
[08:32:30] Let me change the policy back to default and let's see what it does
[08:33:17] heh, it now says WriteBack again
[08:33:33] So it looks pretty similar to db1048 (which was showing this from time to time)
[08:35:57] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285604 (10Marostegui) After a long while the BBU shows `Optimal` again, so looks like the manual relearn worked (the same way it did on db1048 - T160731#3109104 ) Setting the policy back to its default...
[08:38:01] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3285607 (10Marostegui) db2040 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hdb2040.codfw.wmnet frwiktionary -e "show create table revision\G...
[08:38:11] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3285608 (10Marostegui)
[08:38:45] addshore: this was the disk utilization of db2066 after the switchover - db2066 did not have a BBU failure, but the same pattern arises
[08:40:34] What is it that causes the DU spike for the period after the switchover?
[08:40:47] mysql
[08:40:52] mysql uses the disk
[08:41:00] now should it be using the disk so much?
[08:41:03] addshore: This is _clearly_ a BBU issue as it was fixed once we forced it to do what it normally does. When the other issue appeared, we never touched any BBU related things
[08:41:55] to give you an idea, in that case it was cold caches + inefficient queries
[08:42:27] marostegui: agreed, however as far as we can see the DU caused the write queries to progressively take longer and get killed, which caused the table level locks, and the locks caused the spike in waiting connections.
[08:42:36] disk usage is an indicator "something is happening"
[08:42:42] Unless we can think of something to avoid this at switchovers I expect this would likely happen again
[08:42:45] but 100% disk usage is not a problem
[08:42:56] by itself
[08:43:05] unless the bottom line is, we need to not do these queries?
[08:43:44] the direct cause of the outage was too many connections running, and those were blocked
[08:43:55] why that caused that, I do not know
[08:46:07] my skepticism is that with write through, I expect queries to be slower, of course
[08:46:50] but not in the many-seconds realm
[08:47:28] marostegui: no alarm or warning from icinga, I assume?
[08:47:35] nope, not that I saw
[08:48:02] maybe we could change it so that if the policy is write-through, it is critical
[08:48:28] yeah, there were some discussions on irc about BBU alerts and so forth indeed
[08:48:54] the other day it was temperature
[08:49:03] another day it can be anything else
[08:49:13] in the end, what we want is the cache working
[08:50:10] yes, totally
[08:50:42] 10DBA, 10Tool-Labs-tools-Other, 13Patch-For-Review: Tired of APIError: readonly - https://phabricator.wikimedia.org/T164191#3225024 (10Multichill) Yup, I'm tired of it too, but should be handled by Pywikibot. MediaWiki seems to have introduced a new way of throwing a readonly error and Pywikibot doesn't hand...
[08:51:03] if Current Cache Policy != WriteBack, error
[08:52:11] yes, i think that would be useful to catch things quickly
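The check proposed here ("if Current Cache Policy != WriteBack, error") is what later shows up as the `--policy=WriteBack` option to check-raid.py in the paste further down the log. A minimal stand-alone sketch of the idea — not the actual check-raid.py patch; the megacli invocation and output parsing are assumptions based on typical `-LDInfo` output:

```python
#!/usr/bin/env python
"""Sketch of an Icinga-style check: CRITICAL if any logical drive's current
cache policy is not WriteBack (e.g. the controller fell back to WriteThrough
because of a bad BBU). Illustration only, not the real check-raid.py."""
import subprocess
import sys


def current_cache_policies():
    """Return the 'Current Cache Policy' lines reported by megacli."""
    out = subprocess.check_output(
        ['megacli', '-LDInfo', '-LAll', '-aAll']).decode('utf-8', 'replace')
    return [line.strip() for line in out.splitlines()
            if 'Current Cache Policy' in line]


def main(expected='WriteBack'):
    bad = [p for p in current_cache_policies() if expected not in p]
    if bad:
        print('CRITICAL: cache policy is not %s: %s' % (expected, '; '.join(bad)))
        return 2  # Nagios/Icinga CRITICAL exit code
    print('OK: all logical drives are using %s' % expected)
    return 0


if __name__ == '__main__':
    sys.exit(main())
```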
[09:00:42] "The Backup Battery Units (BBU) on the RAID Controller cards have an average mean life of about 2 years."
[09:01:10] great :)
[09:01:23] well, to be honest, we have had quite a few in the last months
[09:01:43] and that is why we are decommissioning those older hosts
[09:01:51] db1048, db1031, db1047
[09:25:05] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3285737 (10Marostegui) db1041 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hdb1041 frwiktionary -e "show create table revision\G" **********...
[09:25:14] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3285738 (10Marostegui)
[09:56:06] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3285780 (10Marostegui) db1062 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hdb1062 frwiktionary -e "show create table revision\G" **********...
[09:56:19] 10DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3285783 (10Marostegui)
[09:56:21] 07Blocked-on-schema-change, 10DBA, 05MW-1.28-release (WMF-deploy-2016-08-30_(1.28.0-wmf.17)), 05MW-1.28-release-notes, 13Patch-For-Review: Clean up revision UNIQUE indexes - https://phabricator.wikimedia.org/T142725#3285784 (10Marostegui)
[09:56:23] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3285781 (10Marostegui) 05Open>03Resolved
[09:59:25] Just an observation but DU and latency seem to have shot back up again in the last 10 mins
[09:59:46] yes
[09:59:48] i just saw that
[09:59:55] the BBU went back to WriteThrough
[09:59:59] Same behaviour as db1048
[10:00:07] so I am going to force it to WB and leave it like that
[10:03:09] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285788 (10Marostegui) And this happened again: ``` root@db1031:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Voltage: 3830 mV Current: -685 mA Temperatur...
[10:06:37] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285791 (10Marostegui)
[10:12:52] 10DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3285793 (10Marostegui) I have finished with T165743, so I am going to attempt to run pt-table-checksum on `frwiktionary` again
[10:15:29] 10DBA, 06Operations: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3285795 (10akosiaris) Per https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=7680e3d95eee2aa98b1c461dbc0dcc5c&host=db1016&user=&schema=puppet&hours=24 this seems to be occurin...
[10:25:37] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285816 (10Marostegui) db1023 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysq...
[10:43:23] 10DBA: Drop Gather tables from wmf wikis - https://phabricator.wikimedia.org/T166097#3285823 (10Nemo_bis) >>! In T166097#3285459, @Marostegui wrote: > I have backuped those tables on: > ``` > dbstore1001:/srv/tmp/T166097 > ``` Thanks.
[10:46:13] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285826 (10Marostegui) db1085 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysq...
[10:55:42] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285831 (10Marostegui) db1088 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysq...
[11:09:23] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285866 (10Marostegui) db1093 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysq...
[11:31:38] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285930 (10Marostegui) db1050 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysq...
[11:32:22] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3285931 (10Marostegui)
[11:37:32] marostegui: trying to put as much info in the report as possible as I'm still not convinced we have found the issue. db2033 was the master for x1 in codfw during the switchover, right?
[11:38:30] addshore: yes, db2033 was the master in codfw
[12:04:24] regarding the disk utilization and read and write latency, both masters seemed to do very different things.
[12:20:36] qq question for the EL purging script - during the offsite we came up with https://gerrit.wikimedia.org/r/#/c/353265/12/modules/role/files/mariadb/eventlogging_purge.py (still not using the python library that you guys showed me a while ago but it will be easy to port the script to it)
[12:21:43] for the use case of updating only some fields to NULL we thought about a query like UPDATE set X=NULL,etc.. where id IN (SELECT .. LIMIT {} OFFSET {})
[12:22:04] the idea would be to iterate the updates in batches of a fixed amount of rows
[12:23:05] the amount of rows added per day is ~2M for the busiest table more or less
[12:23:21] not sure I would make the dbname an option
[12:23:46] ah yes the code is still WIP, this came out after some hacking
[12:23:47] unless you create a special user that only has rights on that db
[12:24:11] whatever that avois accidentally delete from non-eventloggin tables
[12:24:14] *avoids
[12:24:53] the logic itself looks sane, but it depends on the context
[12:26:01] maybe some extra error handling, and I'm not sure I see any kind of logging
[12:26:33] what is the plan, run it locally or how?
[12:28:18] yep, locally on each slave as we discussed
[12:28:32] and then we'll apply something to the master
[12:28:42] is there an index on timestamp?
[12:28:48] on all tables?
[12:29:30] not sure, the ts is part of the fixed EventLogging metadata so it could be, but I am going to check now
[12:31:51] also do not use format except for dynamic sql
[12:32:07] execute has its own filtering options
[12:32:53] https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-execute.html
[12:33:42] the most important part, which is the purging itself, is up to you
[12:34:09] you know the problems - potential contention with app inserts and "replication"
[12:34:49] yep exactly, but I wanted to get feedback from you on whether it was a stupid first prototype or if the choices made have some sense
[12:34:52] :)
[12:35:09] sounds like it is going in the right direction (will take into account your suggestions asap)
[12:44:21] check the indexes - while they may not be as important for performance, missing indexes can cause contention. You could also use some tricks so that the purge gets killed more often if there is a problem
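Putting the advice above together — parameterised `execute()` for values instead of `.format()`, fixed-size batches keyed on an indexed timestamp, and a pause between batches to limit contention and replication lag — a purge loop could look roughly like the sketch below. Table names, column names, batch sizes and credentials are placeholders, not the real EventLogging schema, and this is not the actual eventlogging_purge.py code:

```python
#!/usr/bin/env python
"""Sketch of a batched EventLogging sanitisation: NULL out sensitive fields in
small batches with parameterised queries, sleeping between batches.
Table/column names, sizes and credentials are illustrative placeholders."""
import time

import mysql.connector

BATCH_SIZE = 1000          # rows per UPDATE
SLEEP_BETWEEN_BATCHES = 1  # seconds; keeps contention and replication lag down


def purge(conn, table, fields, cutoff_ts):
    """Set `fields` to NULL for rows older than cutoff_ts, one batch at a time."""
    # Identifiers cannot be bound as parameters, so table/column names are
    # interpolated from a trusted, static list; all *values* go through execute().
    set_clause = ', '.join('%s = NULL' % f for f in fields)
    sql = ('UPDATE %s SET %s WHERE timestamp < %%s AND %s IS NOT NULL LIMIT %%s'
           % (table, set_clause, fields[0]))
    cursor = conn.cursor()
    while True:
        cursor.execute(sql, (cutoff_ts, BATCH_SIZE))
        conn.commit()
        if cursor.rowcount < BATCH_SIZE:  # last (possibly empty) batch: done
            break
        time.sleep(SLEEP_BETWEEN_BATCHES)
    cursor.close()


if __name__ == '__main__':
    db = mysql.connector.connect(host='localhost', database='log',
                                 user='el_purge', password='...')
    purge(db, 'ExampleSchema_12345678', ['userAgent', 'clientIp'], '20170101000000')
    db.close()
```

The `IS NOT NULL` guard replaces the `LIMIT/OFFSET` subquery from the original proposal so each pass makes progress without scanning an ever-growing offset; it assumes the guarded column is representative of rows still to purge.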
[12:55:37] 10DBA, 10MediaWiki-Database: Integrate Facebooks "Online Schema Change for MySQL" into MediaWiki - https://phabricator.wikimedia.org/T32824#3286094 (10jcrespo) Note that we already use online schema change directly supported on Mariadb 10 / MySQL 5.6 or pt-table-checksum (update.php would do that also automati...
[13:00:10] 10DBA, 10MediaWiki-Database: Integrate Facebooks "Online Schema Change for MySQL" into MediaWiki - https://phabricator.wikimedia.org/T32824#3286108 (10jcrespo) BTW, facebook's tool got rewritten in python: https://github.com/facebookincubator/OnlineSchemaChange however, most people are abandoning trigger-based...
[14:59:36] 10DBA, 10Analytics: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (10Marostegui)
[15:00:00] 10DBA, 10Analytics, 06Operations: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286450 (10Marostegui)
[16:20:11] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3286650 (10Marostegui) dbstore1002 was missing and it is now done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hdbstore1002 frwiktionary -e "show cr...
[16:20:24] 10DBA, 13Patch-For-Review: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743#3286651 (10Marostegui)
[16:21:09] 10DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3286653 (10Marostegui) dbstore1002 was missed on: T165743 so it messed up the checksum when it arrived to the revision table, I have fixed it and now I am waiting for dbstore1002 to catch up to start the run again
[16:27:44] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285567 (10jcrespo) The previous patch was reverted, I am creating a separate one to allow to enable or disable the extra check at will (for megacli first).
[16:29:27] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3286698 (10jcrespo) ``` root@prometheus1003:~$ python check-raid.py OK: optimal, 2 logical, 6 physical OK root@prometheus1003:~$ python check-raid.py --policy=WriteBack CRITICAL:...
[16:29:33] ^marostegui
[16:30:00] nice one :)
[16:31:17] now, aside from some minor style corrections we can modify the generic check to enable a policy based on a hiera key
[17:45:06] 10DBA, 10Monitoring, 06Operations, 10media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#3286990 (10jcrespo) I've checked, and the currently in use check does too much, probably we do not need such a thorough check every time icinga runs, wh...
[18:07:54] 10DBA, 06Reading-Web-Backlog: Drop Gather tables from wmf wikis - https://phabricator.wikimedia.org/T166097#3287041 (10Jdlrobson) Would it be possible to get access to those backups? I'd be interested in the public aspects of those database tables. What permissions do I need?
[18:12:50] 10DBA, 06Reading-Web-Backlog: Drop Gather tables from wmf wikis - https://phabricator.wikimedia.org/T166097#3285215 (10jcrespo) Not the backups, but maybe public dumps could be generated? But that requires someone compromising to sanitize them. If it is a one-time access, regular [[ https://wikitech.wikimedi...
[18:25:46] 10DBA, 07Schema-change: Drop titlekey table from all wmf databases - https://phabricator.wikimedia.org/T164949#3287140 (10demon) Also, there's no need for any backups here, and I can confirm nothing is still using this data. Safe to just drop outright.
[18:28:41] 10DBA, 07Schema-change: Drop titlekey table from all wmf databases - https://phabricator.wikimedia.org/T164949#3287141 (10jcrespo) When we say we take backups, what manuel means is that we temporarily move the tables (but only for a limited amount of time). It is not as much as we do not trust that they are no...
[22:39:48] 10DBA, 06Community-Tech, 10MediaWiki-User-management: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3287804 (10MusikAnimal) Well here are my results, sorry for the super long post! There are two variations for each query: (a...
[22:47:19] Ok, I'm stumped. There's a stupid query that seems to be the source of the replag. It keeps attempting (and failing?)
[22:47:19] I think it came from update.php
[22:47:19] USE enwikisource; UPDATE /* Wikimedia\Rdbms\Database::query */ page SET page_content_model = 'proofread-index' WHERE page_namespace = 106 AND page_content_model = 'wikitext' ORDER BY page_namespace, page_title LIMIT 1000;
[22:47:19] (why would we be mass-updating page_content_models outside of updates?)
[22:47:19] Failing...because of the LIMIT on UPDATE?
[22:47:19] Or because 0 rows updated?
[22:47:26] (this is on db04 in beta)
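On the last two questions: single-table `UPDATE ... ORDER BY ... LIMIT` is valid MySQL/MariaDB syntax, so the LIMIT by itself should not make the statement fail, and batched migrations of this shape normally loop until a batch affects zero rows. A rough sketch of that generic batching pattern, for illustration only — MediaWiki's real maintenance code is PHP and this is not it, and the connection details are placeholders:

```python
#!/usr/bin/env python
"""Sketch of the generic batched-migration loop that statements like the
page_content_model UPDATE above are usually part of. Illustration only."""
import time

import mysql.connector

SQL = ("UPDATE page SET page_content_model = %s "
       "WHERE page_namespace = %s AND page_content_model = %s "
       "ORDER BY page_namespace, page_title LIMIT 1000")


def migrate(conn, new_model='proofread-index', namespace=106, old_model='wikitext'):
    cursor = conn.cursor()
    while True:
        cursor.execute(SQL, (new_model, namespace, old_model))
        conn.commit()
        if cursor.rowcount == 0:  # zero affected rows is the normal stop condition
            break
        time.sleep(1)             # give replication a chance to catch up
    cursor.close()


if __name__ == '__main__':
    migrate(mysql.connector.connect(host='db04.example', database='enwikisource',
                                    user='wikiadmin', password='...'))
```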