[05:54:19] DBA, Patch-For-Review: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611#3266287 (Marostegui) db2063 is done: ``` root@neodymium:~# for i in `cat /home/marostegui/T162611`; do echo $i; mysql -hdb2063.codfw.wmnet --skip-ssl $i -e "show create table revision\G";done bgwiki ************...
[06:23:05] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3266308 (Marostegui) Hello, I have tested this change on a codfw slave of s3 and it took 4 seconds. After doing a select count on the table and after the fi...
[07:15:35] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3266381 (jcrespo) The problem usually is not the alter size, but the metadata locking, which creates way more contention.
[07:21:53] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3266382 (Marostegui) Ah right! I wasn't expecting a test table to have such issues, but yeah, that could be!
[07:29:28] jynus: yeah I didn't upload it yet, though the mysql_slave metrics will be renamed due to connection_name being added, are we using those in any dashboard you know?
[07:31:56] we use the seconds behind master
[07:32:15] but we can break those temporarily
[07:32:27] in fact, that may actually fix those for dbstores
[07:32:34] where no metrics are currently gotten
[07:33:03] nice, yeah the net effect is that on the dashboard we'll see the old and new metric for a period of time
[07:33:10] no problem
[07:33:16] can you upload as is
[07:33:23] and I will test it selectively?
[07:33:55] sure, I'll do that now
[07:34:43] fiber 50mbit symmetric from movistar at home is nice I must say
[07:35:17] I have 200 mb, and was complaining the other day :-)
[07:37:59] heheh here it has been stable most of the time afair
[07:38:20] at least according to this thing https://atlas.ripe.net/probes/21966/
[07:40:33] uploaded 0.10.0
[07:58:50] DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3266420 (Marostegui) viwiki is done and there are quite some differences across the hosts, but only on codfw: ``` Differences on db2047 TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY viwiki.category 260 0 1...
[07:59:06] DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3266421 (Marostegui)
[08:01:42] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 4 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3266430 (Marostegui) s3 in codfw is done - small example ``` root@db2018:/tmp# for i in `mysql --skip-ssl -e "sel...
[08:02:05] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 4 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3266431 (Marostegui)
[08:08:17] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 4 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3266440 (Marostegui)
[08:29:46] DBA, MediaWiki-Database, MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 3 others: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3266479 (Marostegui)
[08:29:51] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 4 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3266472 (Marostegui) Open>Resolved Everything is looking good. The only host which still doesn't have the...
[08:29:56] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 4 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3266480 (Marostegui)
[08:44:41] DBA: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190#3266500 (Marostegui) For `centralauth` the following tables are going to be checksummed as they have PK: ``` bug_54847_password_resets PRIMARY KEY (`r_wiki`,`r_username`), global_group_permissions PRIMARY KEY (`ggp_group`,`ggp_p...
[09:19:01] Blocked-on-schema-change, DBA, Patch-For-Review: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3266600 (Marostegui)
[09:20:46] Blocked-on-schema-change, DBA, Patch-For-Review: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2683529 (Marostegui) s3 in codfw is done (some examples): ``` zh_yuewiki PRIMARY KEY (`ts_id`), zhwikibooks PRIMARY KEY (`ts_i...
[09:20:54] Blocked-on-schema-change, DBA, Patch-For-Review: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3266604 (Marostegui)
[09:31:23] go and vote for this: https://jira.mariadb.org/browse/MDEV-12811
[09:32:30] done!
[09:33:18] jynus: typo :-P "are going are likely"
[09:38:15] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 4 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3266637 (Addshore) >>! In T130067#3266472, @Marostegui wrote: > Everything is looking good. > The only host which...
[09:38:38] the brand new mariadb backups fails to compile with a reference to a new library I have installed
[09:45:10] it requires a minimum version of lz4 that it didn't bother to check
[09:54:48] jynus: I have already built that new version for HHVM, let me check
[09:56:05] jessie-wikimedia/backports has lz4/r131, which was recent enough for HHVM 3.18
[09:58:56] it is this new function
[09:59:00] called loaddict
[09:59:11] I am trying to find out in which release it was added
[09:59:25] but github is horrible for that
[10:00:04] the problem is not that, the problem is that cmake didn't bother to check that was ok
[10:00:35] anyway, I have disabled that functionality on compile options, as we probably won't use it, I can enable it later
[10:01:02] or even package it separately
[10:02:12] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 4 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3266799 (Marostegui) >>! In T130067#3266637, @Addshore wrote: >>>! In T130067#3266472, @Marostegui wrote: >> Ever...
[10:04:36] DBA, Epic, Patch-For-Review, codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#3266810 (Marostegui)
[10:04:40] Blocked-on-schema-change, DBA, Patch-For-Review: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3266808 (Marostegui) Open>Resolved Everything is looking good. The only host which still doesn't have the new column is db...
[10:04:42] DBA, MediaWiki-Database, MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 3 others: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3266812 (Marostegui)
[10:05:16] Blocked-on-schema-change, DBA, Patch-For-Review: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3266827 (Marostegui)
[10:06:10] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3266828 (Marostegui) a:Marostegui Given that s3 codfw > eqiad replication is reset for other maintenance - I will execute this on the codfw master, and...
[10:12:24] yeah, I ran into a similar problem with hhvm, they don't have real lz4 releases at this point, Debian uses a fake 0.0~r131-1 based on the SVN revision...
[10:21:30] I have uploaded new versions of mariadb 10.1 and the client-only
[10:21:38] I will test them now
[10:30:23] upgraded on sarin and neodymium
[10:31:23] https://phabricator.wikimedia.org/P5450
[10:32:12] I will now upgrade db2062, and new labs-related hosts
[10:32:18] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3266941 (Ottomata) Update on this. @luca is working on T156933, and in talking, we realized that if we get rid of the second slave (db1047), we will only have one copy of E...
[10:34:44] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3266955 (jcrespo) If redundancy is the main reason, and not load balancing, I would suggest having the redundant server on codfw. But there is now no analytics server on co...
[10:36:57] heads up for !log upgrading and restarting db2062's mariadb service
[10:56:01] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3267070 (Marostegui) This has been deployed on codfw master and replicated downstream: ``` root@neodymium:/home/marostegui/git/software/dbtools# for i in `cat...
[11:08:27] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3267157 (Marostegui) Deployed on db1069 (sanitarium) and replicated to labsdb1001 and labsdb1003 ``` root@neodymium:/home/marostegui/git/software/dbtools# my...
[11:09:04] and another heads up: I am going to do a rolling upgrade of labsdb1009/10/11 - you will see warnings here from the proxies as I restart each server
[11:09:36] ack!
[11:12:09] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3267165 (Marostegui) db1015: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hdb1015 testwikidatawiki -e "show create table wb_ter...
[11:17:02] actually, I am going to have lunch
[11:17:11] so I will stop the upgrade
[11:17:17] enjoy your lunch
[11:17:25] and leave db1055 backing up
[11:27:09] some unpuppetized Replicate_Wild_Ignore_Table options may be lost on restart
[11:28:03] no worries
[11:29:41] DBA, Wikidata, Schema-change: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246#3267215 (Marostegui) Open>Resolved I believe everything is done now (remember, dbstore2001 will get the alter tomorrow as it is our delayed slave, if...
[11:31:46] error rate has decreased dramatically
[11:31:49] that is strange
[11:32:01] however, this time I can see the server depooled
[11:32:33] I think it is just the normal deploy slowdown
[11:32:46] it is going back to normal levels
[11:33:04] I don't see anything strange no
[11:36:00] leaving db1055 backup temporarily on dbstore1001
[11:36:14] ok
[11:36:37] hopefully icinga doesn't forget the downtimes
[11:40:08] DBA, MediaWiki-extensions-ORES, Scoring-platform-team, MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), and 3 others: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753#3267229 (jcrespo)
[11:41:31] DBA, MediaWiki-extensions-ORES, Scoring-platform-team, MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), and 3 others: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753#3077417 (jcrespo) Pending: running `ALTER TABLE ores_classification ENG...
[11:42:01] DBA, MediaWiki-extensions-ORES, Scoring-platform-team, MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), and 3 others: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753#3267233 (jcrespo) CC @Marostegui
[11:42:16] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, and 4 others: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753#3267235 (jcrespo)
[11:44:07] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, and 4 others: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753#3267239 (Marostegui) Given that replication codfw -> eqiad is now reset on all the shards, it might...
[11:45:56] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, User-Ladsgroup: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3267241 (Ladsgroup) @jcrespo This can be done too (maybe at the same time as T159753#3267229)
[12:08:53] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3267305 (Ottomata) +1, that sounds like a good idea to me!
[12:23:21] DBA, Wikidata, Patch-For-Review, Performance, and 2 others: Use replica for reading the last dispatch position (chd_seen) - https://phabricator.wikimedia.org/T162557#3267313 (Ladsgroup)
[12:23:25] DBA, Wikidata, Patch-For-Review, Performance, and 2 others: Consider only updating wb_changes_dispatch after a successful run - https://phabricator.wikimedia.org/T162556#3267314 (Ladsgroup)
[12:23:56] DBA, Wikidata, Patch-For-Review, Performance, and 3 others: Consider only updating wb_changes_dispatch after a successful run - https://phabricator.wikimedia.org/T162556#3267334 (Ladsgroup) a:Ladsgroup
[12:24:03] DBA, Wikidata, Patch-For-Review, Performance, and 3 others: Use replica for reading the last dispatch position (chd_seen) - https://phabricator.wikimedia.org/T162557#3267336 (Ladsgroup) a:Ladsgroup
[13:03:14] I wonder if we should have a completely separate model for lagged slaves, something like "get delayed 48 hours, and once a day, roll forward 24 hours"
[13:03:45] I'd rather not forward 24h
[13:03:53] You never know when something can happen
[13:04:00] And I am also fine by delaying them even more (48h)
[13:04:10] I do not understand
[13:04:28] you are not ok with the time, or with doing all at the same time?
[13:04:36] Haha sorry
[13:04:58] No, I am not ok with doing a roll forward
[13:04:58] I am saying to be between 24 hours and 48 hours behind
[13:05:04] Ah, right right
[13:05:09] always behind
[13:05:14] yeah, then yes
[13:05:14] of course
[13:05:23] Sorry I didn't understand correctly
[13:05:33] but if something breaks, it breaks at a reasonable time
[13:05:39] yes
[13:05:45] think 1-2 hours in the morning or something
[13:06:28] but we are not stopping-starting all the time, which is painful for alters, etc.
[13:06:55] Yeah, it is a bit of a pain, but that is the price we pay to be "safe" :)
[13:07:01] Which I am totally fine with
[13:07:14] yes, here the point is to keep the deplay
[13:07:21] but make it a bit better
[13:07:27] *delay
[13:07:37] e.g. no replication while backups are running
[13:08:27] I do not know how much it would take for the server to catch up 24 hours
[13:09:02] yeah, if we start all the threads at the same time, maybe it will take quite long
[13:09:08] but yes, no idea
[13:09:18] I think I did that
[13:09:22] I will check
[13:09:26] :)
[13:10:30] but I do not have historical data for time to catch-up
[13:11:11] Blocked-on-schema-change, DBA, MediaWiki-extensions-ORES, Scoring-platform-team, User-Ladsgroup: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3236866 (Marostegui) Yes, we could do both at the same time.
[13:15:12] it recovered 150000 seconds in half a day, so maybe around 4 hours for 24 hours
[13:15:45] oh wow, that is not bad at all
[13:44:51] DBA, Labs, wikitech.wikimedia.org, Schema-change: Drop Semantic Database tables from wikitech wikis - https://phabricator.wikimedia.org/T164887#3267595 (Marostegui) I have backed up the tables by the way: ``` dbstore1001:/srv/tmp/silver_labswiki_T164887.tar.gz dbstore1001:/srv/tmp/labtestweb2001_...
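The catch-up estimate given at 13:15:12 can be sanity-checked with a few lines of arithmetic. The sketch below is not from the log: the function name `catchup_hours` is hypothetical, it assumes the replica keeps recovering lag at a constant average rate, and since the exact length of "half a day" is not stated, the observation window is left as a parameter.

```python
def catchup_hours(lag_recovered_s, window_hours, lag_to_recover_s):
    """Estimate the hours needed to recover lag_to_recover_s seconds of lag.

    Assumption (not from the log): the replica keeps recovering lag at the
    same average rate it showed while recovering lag_recovered_s seconds
    over window_hours hours.
    """
    if lag_recovered_s <= 0 or window_hours <= 0:
        raise ValueError("observed recovery and window must be positive")
    return window_hours * lag_to_recover_s / lag_recovered_s


if __name__ == "__main__":
    day_s = 24 * 3600
    # 150000 seconds is the figure quoted in the log; the window lengths are guesses.
    for window in (7, 12):
        print(window, round(catchup_hours(150_000, window, day_s), 1))
```

Under these assumptions a 12-hour window gives roughly 7 hours to recover 24 hours of lag, while the roughly 4 hours mentioned above corresponds to a window of about 7 hours.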
[14:14:18] DBA: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3267675 (Marostegui)
[14:14:31] DBA: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3267688 (Marostegui) p:Triage>Normal
[14:21:25] there is non-critical corruption on labsdb1009
[14:21:45] some tables still exist on the data dictionary, but not physically
[14:22:25] not problematic, but annoying on start/upgrade
[14:43:52] DBA, Operations, ops-codfw: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3267751 (Marostegui)
[14:49:49] DBA, Patch-For-Review: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611#3267773 (Marostegui) db2056 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# for i in `cat /home/marostegui/T162611`; do echo $i; mysql -hdb2056.codfw.wmnet --skip-ssl $i -e "show create table...
[15:04:03] there is a spike of errors on bnwiki
[15:04:33] not sure why
[15:04:36] yes
[15:04:40] I was checking it
[15:04:50] To see if it was currently affected by any of the maintenance I am doing
[15:04:53] but it is not
[15:05:32] I do not see a pattern
[15:05:44] bnwiki is s3?
[15:06:07] it is
[15:06:07] yes
[15:07:03] it seems lag, but I would not see why it would only affect bnwiki
[15:07:24] oh, a bot doing many queries
[15:07:43] so not a general issue
[15:07:50] where did you see that?
[15:07:58] went to all logs
[15:08:05] searched for wiki:ngwiki
[15:08:08] *bn
[15:09:14] you can also see a matching spike at: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=slave
[15:10:17] Aaah I see 34% coming from the same ip
[15:29:39] I am going to restart and upgrade db1095 (sanitarium2)
[15:29:54] go ahead
[15:31:53] I see it down on icinga, any reason?
[15:32:23] did you run alter recently?
[15:35:33] DBA, Operations, Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3267987 (greg)
[15:38:25] DBA, Gerrit, Operations, Patch-For-Review, Release-Engineering-Team (Backlog): Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#3268002 (greg)
[15:43:07] DBA, Operations, Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3268061 (jcre...
[16:10:09] even if it doesn't look like it, I am testing the new prometheus exported on a few hosts (all that have been upgraded)
[16:10:14] *exporter
[16:10:42] DBA, Operations, Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3268180 (Jdfo...
[16:19:01] DBA, Operations, Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3268191 (jcre...
[18:35:28] DBA, Patch-For-Review: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611#3168583 (jcrespo) Can you double check tomorrow merged patches vs. deployed ones? I am not 100% sure about the status of the latest 2 ones.
[18:36:23] DBA, Patch-For-Review: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611#3268911 (jcrespo) (wrong ticket)