[03:08:18] DBA, Analytics, Labs, MediaWiki-Page-deletion, and 2 others: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3287995 (kaldari)
[03:08:27] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288008 (kaldari)
[03:09:29] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3287995 (kaldari)
[03:12:48] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288010 (Tbayer)
[03:12:58] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288011 (kaldari) FWIW, this doesn't seem to be a lag issue as all the pages af...
[03:20:29] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288012 (kaldari) p:Triage>High Marking high priority since this is aff...
[05:58:20] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288072 (jcrespo) Open>Resolved a:jcrespo This is a known issue, wa...
[05:59:56] DBA, Patch-For-Review: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611#3288077 (Marostegui) db1054 (eqiad master) is finished: ``` root@neodymium:~# for i in `cat /home/marostegui/T162611`; do echo $i; mysql --skip-ssl -hdb1054 $i -e "show create table revision\G";done bgwiki *****...
[06:10:26] DBA, MediaWiki-Platform-Team, MediaWiki-Special-pages, Wikimedia-Site-requests, and 4 others: "Invalid DB key" errors on various special pages - https://phabricator.wikimedia.org/T155091#3288097 (TTO)
[06:11:36] DBA, MediaWiki-Platform-Team, MediaWiki-Special-pages, Wikimedia-Site-requests, and 4 others: "Invalid DB key" errors on various special pages - https://phabricator.wikimedia.org/T155091#2933244 (TTO) This has been fixed by running a maintenance script to remove invalid data from the database. If...
[06:11:44] DBA, MediaWiki-Platform-Team, MediaWiki-Special-pages, Wikimedia-Site-requests, and 4 others: "Invalid DB key" errors on various special pages - https://phabricator.wikimedia.org/T155091#3288099 (TTO) Open>Resolved
[06:38:29] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3288107 (Marostegui) Done `fawiki` directly on codfw master so all codfw gets done (dbstore2001 is...
[06:47:36] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3288112 (Marostegui) db1069 done and replicated downstream to labs hosts: ``` root@neodymium:/home/...
[07:19:07] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2415416 (kaldari) There seem to be several pages on English Wikipedia which have been deleted but still appear on the Labs and Analytics Store replicas in the `page` table. For example: ```lang=sql MariaDB [enwiki_p]> sele...
[07:26:41] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3288156 (jcrespo) "several" is too vague- I have fixed the one given: ``` root@neodymium:~$ ./sql.py -h labsdb1001.eqiad.wmnet enwiki -e "select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever'" -...
[07:27:46] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3287995 (Marostegui) Just for the record of this ticket, Jaime kindly fixed it...
[07:32:21] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3288161 (Marostegui) `fawiki` is done : ``` root@neodymium:/home/marostegui/git/software/dbtools# c...
[07:32:24] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3288162 (Marostegui)
[07:40:14] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, and 2 others: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3288176 (Marostegui) Open>Resolved I have double checked all the wikis and they are all done.
[07:56:07] DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3288202 (Marostegui) And finally `frwiktionary` is done.
The only difference (among all the hosts) is on the `archive` table.
[07:56:13] DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3288203 (Marostegui)
[07:56:33] DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3189184 (Marostegui) This shard is ready for compare.py to run.
[07:57:11] DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3288208 (Marostegui) This shard is ready for compare.py to run
[07:57:26] DBA: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3288209 (Marostegui) This shard is ready for compare.py to run
[07:59:19] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s1, s2, s4, s5 and s7 (eqiad) - https://phabricator.wikimedia.org/T164185#3288216 (Marostegui) a:jcrespo>Marostegui
[08:27:49] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3288285 (Marostegui)
[08:29:05] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205#3288300 (Marostegui)
[08:30:19] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3288314 (Marostegui)
[08:31:27] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207#3288328 (Marostegui)
[08:32:39] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3288342 (Marostegui)
[08:33:41] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3288356 (jcrespo) but `--no-replicate` will not replicate,
which is the point of it (not sure about that)
[08:34:54] yeah I saw that, I am still going to tweak it a bit :)
[08:34:58] Thanks though!
[08:48:43] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3288403 (Marostegui)
[09:01:59] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3288428 (Marostegui)
[09:10:18] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3288453 (Marostegui)
[09:19:20] so while looking at the write cache, I checked our configuration for hps
[09:19:33] it seems to be disabled
[09:20:00] which could or probably would make sense for ssds, but I think read cache is disabled too
[09:20:15] only on the HP?
[09:20:19] not sure which should be the right configuration
[09:20:34] but at least we should not the current state
[09:20:37] *know
[09:20:44] on the newer hosts
[09:21:33] and maybe even try to test it
[09:21:55] you have one host name handy?
[09:22:02] of the newer ones I mean
[09:22:17] I have not checked the latest ones
[09:22:24] ok
[09:22:26] i will check
[09:22:36] but I was looking the other day at db1095
[09:22:49] jynus: I cannot recall the details, but see also an email I sent you on 2016/05/30 about HP controllers, there were some links
[09:22:58] could be useful, could be not ;)
[09:23:00] yes, I remember that
[09:23:14] I think by default everything is disabled
[09:23:33] which is not necessarily bad, but it should be a more active decision
[09:23:58] Cache Status: Not Configured
[09:24:05] Cache Ratio: 100% Read / 0% Write
[09:24:10] Read Cache Size: 0 MB
[09:24:23] I was checking an old hp (db2034) and it is enabled there indeed
[09:24:36] ^from that I am not sure if we are using cache for reads there
[09:24:46] Drive Write Cache: Disabled
[09:24:51] definitely not for writes
[09:25:04] let me see one of the newer ones
[09:26:41] Current Cache Policy: WriteBack / Current Access Policy: Read/Write
[09:27:13] root@db2060:~# hpssacli controller all show detail | grep -i read
[09:27:13] Cache Ratio: 10% Read / 90% Write
[09:27:18] so the policy is different
[09:27:31] and that is not necessarily bad
[09:27:38] but we should look more closely at it
[09:28:00] and document some logic
[09:28:31] yeah, at least get to know our state
[09:28:33] and test if needed
[09:29:02] is Drive Write Cache: Disabled there?
[09:29:08] (I can check, sorry)
[09:29:17] yes
[09:29:17] it is
[09:29:56] does that mean that 90% of the size is unused, or how exactly?
[09:31:02] I would assume 10% Read / 90% Write is if Writes are enabled, if not, I guess it is 100% for read
[09:31:12] "the SSD hosts have this "HP SSD Smart Path" enabled by default INSTEAD of the standard BBU cache."
[09:32:16] so maybe the check in general should be more complex
[09:33:04] For this first phase, I would say just alert on !=
of WriteBack because we have seen issues with WriteThrough
[09:33:10] yes
[09:33:16] But yes, there are many things to take into account
[09:33:29] that is why I want to deploy the check now mostly
[09:33:47] for all the old hosts that will not work with it disabled
[09:34:05] I will leave it for the newer hosts, too
[09:34:29] but we can change on hiera the specific hosts if it is better to disable it for ssds
[09:44:53] WriteBack check is rolling out to all megacli dbs
[09:45:47] great
[09:46:21] I see no errors (as in both check errors and the check criticals)
[09:47:01] so far so good
[09:47:02] db1097 is for now in WriteBack
[09:47:30] But that should be its normal state, no?
[09:47:49] not 100% sure
[09:48:03] it could be, but maybe we should disable the cache
[09:48:19] in which case having the policy or not would be irrelevant
[09:48:49] but at least it would be an indirect check that something is wrong with the host
[09:48:53] Ah, you meant the default policy
[09:49:02] in case BBU goes bad
[09:49:36] I do not have perfect answers, but I think this is better than what we had before
[09:49:47] it definitely is
[09:50:01] I am going to change the policy of db1015
[09:50:06] to test the alarm
[09:50:14] ok
[09:50:41] mainly so it doesn't create a ticket
[09:51:11] which probably requires me running puppet on tegmen first
[10:33:10] it takes 15 minutes for the alert to show
[10:34:10] to not bother the controllers too much we have $check_interval = 10 and retries at $retry_interval = 5
[10:34:45] yes, which makes sense
[10:35:32] but makes the check mostly useless, because it will lag a lot before the alert goes off
[10:36:35] there it goes
[10:36:51] seems the filter works?
[10:37:04] * volans looking at the logs
[10:37:39] 2017-05-24 10:36:10 [INFO] raid_handler::main: Skipping RAID Handler execution for host 'db1015' and RAID type 'megacli', skip string 'must have write cache policy' detected in 'CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough'
[10:46:22] RECOVERY - MegaRAID on db1015 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[10:46:27] 15 minutes later
[11:18:03] DBA, Wikimedia-General-or-Unknown, Wikimedia-maintenance-script-run: Run updateRestrictions.php on WMF wikis - https://phabricator.wikimedia.org/T166184#3287810 (Krinkle)
[12:21:31] DBA: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3175341 (Marostegui)
[12:25:21] DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3288878 (Marostegui) a:Marostegui>None
[12:50:29] there is a relatively high number of query errors on special:recentchanges
[12:52:33] always db1036?
[13:05:25] DBA: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3288960 (Marostegui) All ready to start running it on s1. I have removed `tag_summary change_tag watchlist` tables from the ignored ones, as they can be checked now as we have a PK! :)
[15:29:38] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3289267 (Anomie) Here are a few more such deleted pages that I happen to know of: Labs: ``` MariaDB [enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title lik...
[15:35:26] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3289284 (jcrespo) To clarify, the only reliable solution is to start using the new servers already (labsdb-web.eqiad.wmnet and labsdb-analytics.eqiad.wnet). See: ``` root@labsdb1009[enwiki_p]> select count(*) from page...
[15:36:20] I need to restart
[15:37:00] did you fix your mic in the end?
[15:37:08] no
[15:37:17] I just rebooted into stretch
[15:37:21] xddddd
[15:37:51] which I need to undo now to test 10.0.31
[15:48:28] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3289314 (Anomie) Is there a task tracking the updating of the DNS entries such as enwiki.labsdb to point to the new servers?
[15:51:19] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3289317 (jcrespo) No, we are not ready yet because not all wikis are available yet, but you can follow the meta ticket at: T153058 We will be finishing the reimports first, then announcing it as beta first "opt-in", and l...
[16:36:07] I am uploading 10.0.31 today
[16:36:13] will try it on a random host, maybe
[17:17:44] DBA, Analytics-EventLogging, Analytics-Kanban, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3289504 (Nuria) @tbayer: no, every table not in whitelist whose entire data was older than 90 days was dropped. >FWIW, it looks like these three are cur...
[18:19:14] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3289757 (jcrespo) Labs now: ``` root@labsdb1001[enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G *************************** 1. ro...
[19:14:43] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3290004 (kaldari) @jcrespo: I don't have any way to identify all the extra pages on Tool Labs. All I can say is that there are 1032 extra there: Tool Labs: ```MariaDB [enwiki_p]> select count(*) from page; +----------+ |...
[19:17:23] DBA, Analytics-EventLogging, Analytics-Kanban, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3290043 (Tbayer) >>!
In T161855#3289504, @Nuria wrote: > @tbayer: no, every table not in whitelist whose entire data was older than 90 days was dropped....
[19:18:28] DBA, Analytics-EventLogging, Analytics-Kanban, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3290046 (Nuria) Amend: Here is the list of tables now, note that running this script was a one-off, purging script is being worked on in the prior ticket:...
[19:26:50] DBA, Analytics-EventLogging, Analytics-Kanban, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3290067 (Tbayer) Thanks for the list! Agree that leaving an empty table is much preferable to dropping it without a trace.
[20:39:52] DBA, ProofreadPage: fixProofreadIndexPagesContentModel doDBUpdates method turns infinitely - https://phabricator.wikimedia.org/T166261#3290249 (Dereckson)
[20:40:17] DBA, ProofreadPage: fixProofreadIndexPagesContentModel doDBUpdates method turns infinitely - https://phabricator.wikimedia.org/T166261#3290262 (Dereckson) p:Triage>High [ Set priority to high, as wikisource is currently affected by this issue. ]
[20:55:14] DBA, ProofreadPage: fixProofreadIndexPagesContentModel doDBUpdates method turns infinitely - https://phabricator.wikimedia.org/T166261#3290249 (Zppix) I have suggested in IRC to rollback the train strictly for affected projects if user-visible, until the issue is fixed, then go back to wmf 3 with the fix...
[21:36:39] DBA: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290441 (Dereckson)
[21:36:47] DBA, Operations: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290453 (Dereckson) p:Triage>High
[21:42:36] DBA, ProofreadPage, Patch-For-Review: fixProofreadIndexPagesContentModel doDBUpdates method turns infinitely - https://phabricator.wikimedia.org/T166261#3290476 (Tpt) Open>Resolved a:Tpt The script fixed by Chad has been pushed on production and run on all wikis.
[21:43:53] DBA, Operations: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290480 (Dereckson)
[21:46:45] DBA, Operations: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290483 (Dzahn) a:Dzahn
[21:46:56] DBA, Operations: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290441 (Dzahn) my fault it seems.. on it
[21:52:09] DBA, Operations: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290491 (Dzahn) it was an extra "," that was causing a syntax error. fixed with hotfix. gerrit follow-up coming..
[21:58:20] DBA, ProofreadPage, Patch-For-Review: fixProofreadIndexPagesContentModel doDBUpdates method turns infinitely - https://phabricator.wikimedia.org/T166261#3290537 (Dereckson) >>! In T166261#3290283, @Zppix wrote: > I have suggested in IRC to rollback the train strictly for affected projects if user visa...
[22:05:03] DBA, Operations, Patch-For-Review: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290547 (Dzahn) fixed. deployed on terbium and wasat.
also: 15:06 < mutante> !log terbium: dbtree: git stash and git pull origin to fix unclean repo state, depl...
[22:05:13] DBA, Operations, Patch-For-Review: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290548 (Dzahn) Open>Resolved
[22:12:43] DBA, Analytics, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3290582 (Tbayer) >>! In T166194#3288072, @jcrespo wrote: > This is a known issu...