[03:26:29] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#603671 (10tstarling) Why can't you just sort by el_index? Then you could use el_index values for continuation. [05:58:44] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3216675 (10Marostegui) labsdb1001 is done: ``` [root@labsdb1001 05:57 /root] # mysql wikidatawiki -e "show create table... [05:59:38] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3216676 (10Marostegui) labsdb1001 is done: ``` [root@labsdb1001 05:57 /root] # mysql --skip-ssl wikidatawiki -e "... [06:02:59] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3216679 (10Marostegui) labsdb1011 is done: ``` root@labsdb1011:~# mysql --skip-ssl enwiki -e "show create table revision\G" *************************** 1. row *********... [06:05:41] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3216680 (10Marostegui) Pending hosts: dbstore1001 db1069 (sanitarium2) labsdb1001 labsdb1003 I think I will not do db1069, labsdb1001 and labsdb1003 as they will be... [06:13:57] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3216689 (10Marostegui) Ignore the last two posts from Stashbot, it was for another ticket number [06:14:13] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3216691 (10Marostegui) I am altering db1070 locally, instead of from neodymium as this host is going to be affected by:... [06:14:29] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3216694 (10Marostegui) I am altering db1070 locally, instead of from neodymium as this host is going to be affect... [06:38:26] 10DBA, 07Epic, 07Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3216724 (10Marostegui) [06:38:29] 10DBA, 06Operations: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927#3216721 (10Marostegui) 05Open>03Resolved a:03Marostegui This has been dropped from the random places where it existed (it had 0 rows everywhere): s2: bgwiktionary enwikiquote enwiktionary s3... [06:54:55] we need to put cebwiki outside of s3 [06:55:05] too big already? [06:55:12] maybe we can move it to s5 once we have moved wikidata away [06:55:15] It took 26,745s to alter templatelinks [06:55:17] yes [07:07:39] 07Blocked-on-schema-change, 10DBA: Convert unique keys into primary keys for some wiki tables on s3 - https://phabricator.wikimedia.org/T163912#3216759 (10jcrespo) During the night: ```lines=10 Altering cebwiki... 
ERROR 1091 (42000) at line 13: Can't DROP 'old_id'; check that column/key exists Altering cewiki.... [07:13:25] backups look ok, last backup (zhwiki) still ongoing [07:33:26] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3216796 (10Marostegui) So I have been talking to @ayounsi and the servers in row D still need to be recabled, so they will be affected by the small outage. As per his comment on: T148506#3215394 ``... [07:35:20] jynus ^ [07:38:58] no issue if we use replication [07:39:17] sure, just making sure you were aware, as there are so many stuff going on, it is easy to miss stuff :) [07:41:24] I love that 50% of the s3 hosts have one structure, 25% other and 25% other [08:11:29] Hey, I'm about to write my first schema change that will change the default values for site_stats table to be more consistent (ref T56888). I'm not sure about the "everything needs to be optional for some time" part: My guess is that it does no harm having the old defaults on some databases and the new one on others which would means there wouldn't be any additional necessary step to this. Would be glad to hear if I'm right or wrong about [08:11:29] that assumption. [08:11:30] T56888: Fresh install of MediaWiki lists "-1 recent contributors" in Special:UserLogin/signup - https://phabricator.wikimedia.org/T56888 [08:14:50] I do not undestand "having the old defaults on some databases and the new one on others which would means there wouldn't be any additional necessary step" [08:18:37] Do as told on T56888#3216523 [08:18:38] T56888: Fresh install of MediaWiki lists "-1 recent contributors" in Special:UserLogin/signup - https://phabricator.wikimedia.org/T56888 [08:21:41] jynus: Writing the schema change will need me to edit tables.sql for new wikis and the updater for existing databases. Question basically is if any additional step is necessary to fulfill "Make your schema change optional – All schema changes must go through a period of being optional." from [[mw:Development_policy#Database_patches]] [08:22:03] changes optional means [08:22:17] that if you do not run update.php, wikis should continue working [08:22:42] normally that is acieved with a configuration switch [08:22:55] but it depends on the case [08:23:06] Which they will regardless of what the default value in the schema is, right? [08:23:15] I do not know that [08:23:35] apparently there is some bug inserting -1, look at that [08:24:05] note we are wikimedia-databases [08:24:14] we do not use the updater at WMF [08:24:20] so we do not know much about it [08:24:27] but all other wikis out there use it [08:43:34] The "some bug inserting -1" _was_ the web installer inserting a row without specifying something for field ss_active_users, which lead to the schema default (-1) being used. That is fixed now (by fixing the web installer to specify a value for that field). [08:43:41] Now, while fixing this, we came up with the default values in the schema in site_stats being inconsistent. ss_total_edits, ss_good_articles, ss_images use 0. ss_total_pages, ss_users, ss_active_users use -1. We're noẃ following up to align them to all be -1. [08:43:46] Now, when that schema change gets applied, there will be a period at which "some databases" on WMF cluster will still have the "old default" (e.g. ss_images default 0) and "some databases" will have the "new defaults" (e.g. ss_images default -1) depending if they have been altered already or not. 
My _guess_ is that this will do no harm, but I'm not sure if I'm overlooking anything. [08:46:10] looks ok [08:48:27] Okay, thanks. Sorry if that was confusing, just asking in advance to be sure to do the right thing. :) [08:49:44] 10DBA: Convert unique keys into primary keys for some wiki tables on s6 - https://phabricator.wikimedia.org/T163979#3216957 (10jcrespo) [08:53:50] 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3216972 (10Marostegui) 05Open>03declined Feel free to reopen if it happens more often. At least we have now this for tracking. Thanks! [08:55:36] 10DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#3216982 (10Marostegui) [08:55:38] 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Create a cronjob/check to run check_private_data data script and report back - https://phabricator.wikimedia.org/T153680#3216979 (10Marostegui) 05Open>03Resolved a:03Marostegui Going to close this for now as the script is working fine as a fir... [09:18:58] ok to drop /dev/sda from labsdb1003 ? [09:19:38] yes :) [09:19:58] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1003&var-network=eth0&from=1493198375174&to=1493284775174 [09:20:01] and it will get worse [09:20:14] Yeah, that is my alter runnig + your alters later [09:20:57] mine has not yet arrived as labs is behind 3 levels of replicas [09:21:59] then it is only me! [09:22:06] it took 11 hours on labsdb1001 [09:22:25] so another 8 to go or so for it [09:42:23] 10DBA, 06Labs, 10Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3217063 (10jcrespo) 05Open>03Resolved We added 1 extra terabyte by deleting /srvuserdata on both hosts- this will likely impact performance negatively, but at leasy they can now rece... [09:50:57] 10DBA, 13Patch-For-Review, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3217083 (10jcrespo) I intend to mass-deploy this once 1.29.0-wmf.21 is everywhere. [10:21:36] 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Recreate a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#3217118 (10waldyrious) Thanks everyone, much appreciated! [10:47:43] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3217141 (10jcrespo) [10:48:36] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3217143 (10jcrespo) [11:27:04] 07Blocked-on-schema-change, 10DBA: Convert unique keys into primary keys for some wiki tables on s3 - https://phabricator.wikimedia.org/T163912#3217183 (10jcrespo) It "finished" now: ```lines=10 Altering ocwikibooks... Altering ocwiktionary... Altering officewiki... ERROR 1091 (42000) at line 13: Can't DROP '... [12:29:56] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3217276 (10Marostegui) I have downtimed for 20 hours the above hosts plus the slaves of those masters involved as replication broken will page: ``` db1095 db1053 db1056 db1059 db1064 db1081 db1084 db... 
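A minimal sketch of the site_stats default-value alignment discussed earlier this morning around T56888. The chat names ss_total_edits, ss_good_articles and ss_images as the columns still defaulting to 0, with -1 as the agreed target; the statement below only illustrates that idea and is not the patch that was actually written.

```
-- Hypothetical sketch only; the real change submitted for T56888 may differ.
ALTER TABLE site_stats
    ALTER COLUMN ss_total_edits   SET DEFAULT -1,
    ALTER COLUMN ss_good_articles SET DEFAULT -1,
    ALTER COLUMN ss_images        SET DEFAULT -1;
```

As discussed above, a default-only change like this is naturally "optional": existing rows already carry explicit values, so wikis that have not been altered yet keep working with the old defaults.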
[12:39:19] 10DBA, 10Analytics, 06Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3217291 (10Marostegui) That is fine by us, but then we probably want to go ahead and fix this: T159266 [12:56:57] db1040 copy finished [12:57:16] but I am running hashes to veryfy it was transmitted correctly [13:27:58] oh great [13:28:02] how long did it take in the end? [13:31:06] 24 hours [13:31:14] I tested a new compression method [13:31:21] it saved 40GB [13:31:24] oh, what did you do? [13:31:27] but it took 10 times more [13:31:31] so not worth it [13:31:38] :| [13:31:57] yeah it was also sending stuff from non ssd, yeah :( [13:32:08] cross dc [13:32:19] an encryption [13:32:21] not fast [13:32:28] but I made it even slower [13:32:57] I want to be 100% sure it is correct because being the s4 master [13:33:02] I am sure we will need it [13:33:05] (old master) [13:33:12] hopefully not! [13:33:15] I mean [13:33:20] to repair other slaves [13:33:21] not as is [13:33:25] ah yeah [13:33:54] marostegui I think I know where your errors come from [13:33:55] db1093 is broken because of the frwwiki alter [13:34:00] on it [13:34:04] which errors? [13:34:10] the non-existent dbs [13:34:18] ah, iluminame [13:34:33] I will illuminate you [13:34:37] :-) [13:34:45] :) [13:35:00] what are you using for going over all wikis list? [13:35:32] are you doing SHOW DATABASES? [13:35:41] yep [13:35:48] I think that is the issue [13:35:57] I think I will not have so many problems [13:35:59] I trust more what is in the master than the dblist [13:36:05] because I go over s2.dblist [13:36:06] maybe I am doing it wrong indeed [13:36:09] right! [13:36:11] skipping closed dbs [13:36:15] I will do that in the future yep [13:36:15] not wrong [13:36:23] there is a debate there [13:36:37] but I think my method is less annoying [13:36:56] yes, totally [13:37:03] i think i will start using it [13:37:05] we can add filters anyway [13:37:22] or reimport those wikis [13:37:28] so it doesn't happen again [13:37:28] I am basically going thru all the errors to make sure it is the "good" error [13:37:37] that is why I prefer to skip it manually [13:38:26] we should have consistency, as you wisely say [13:38:37] either it is everywhere or noware [13:38:58] another thing I do [13:39:03] is to use the database [13:39:12] yeah, when we have time (HA!) we should check that whatever is on the dblist is or not on the master [13:39:14] so that the filters discard the alters [13:39:30] like the ones I do on s3 [13:39:33] at least on labs [13:39:49] ah that is a good idea [13:39:52] so use X, ALTER Y; instead of ALTER X.Y [13:40:02] which I think works on the new ones [13:40:02] that is a pretty good tip! 
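A sketch of the tip traded just above about how the schema-change statements are written (the caveat about older hosts follows right below). The table and the change here are made up; the reasoning in the conversation is that selecting the database first lets per-database replication filters on hosts like sanitarium/labs discard the statement, whereas the fully qualified form may slip through.

```
-- Illustrative only: object names and the change itself are placeholders.

-- Qualified form, harder for per-database filters to catch on some replicas:
ALTER TABLE testwiki.sometable COMMENT = 'schema change example';

-- Form recommended in the conversation: set the default database first,
-- then alter, so filters keyed on that database apply as expected.
USE testwiki;
ALTER TABLE sometable COMMENT = 'schema change example';
```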
[13:40:10] but I am not sure on the old ones [13:40:47] I couldn't care less how you do it, eh [13:40:57] no no I know [13:41:03] but it is a good tip to get rid of these errors [13:41:05] and save time [13:41:11] but I feel pain with all the extra work you do [13:41:44] I also break stuff (see frwiki) [13:42:04] well, i left that there [13:42:09] so i technically broke it too :) [13:42:14] ha ha [13:42:19] I technically asked you to do it [13:42:22] hahaha [13:42:46] remember that everthing with thouse unrequested PKs was my idea [13:42:54] so everthign that broke [13:43:00] and will break, totally my decision [13:43:09] but in a few months we will say: hey, it was so great to get all the PKs added eh [13:43:18] well [13:43:25] let's wait for the failback [13:43:29] xdddddd [13:43:33] and you may think differently [13:43:51] * marostegui scared now [13:44:22] https://www.youtube.com/watch?v=EWAXitUcYHc [13:44:39] hahahahahahah [13:44:39] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3217482 (10Marostegui) We do not clone stuff from labs servers [13:44:41] haha [13:44:53] I can imagine you doing that on hangouts [13:49:46] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3217496 (10Marostegui) db1069 is done: ``` root@db1069:~# mysql -S /tmp/mysql.s5.sock --skip-ssl wikidatawiki -e "show c... [13:50:32] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3217514 (10Marostegui) db1069 is done: ``` root@db1069:~# mysql -S /tmp/mysql.s5.sock --skip-ssl wikidatawiki -e... [14:04:20] I think we took the right decision- text is very large- I think changes on large wikis will take 2 days, not 1 [14:05:58] yeah, at least the most difficult hosts are done [14:06:07] all the alters for tag summary and change tag finished on dbstore1001 [14:07:06] dumps too, I think [14:08:10] it didn't ? [14:09:22] i think they did [14:09:35] nothing on processlist [14:09:36] -rw-r----- 1 root root 7.7G Apr 20 03:59 zhwiki-201704190205.sql.gz [14:09:41] -rw-r----- 1 root root 4.6G Apr 27 10:48 zhwiki-201704260205.sql.gz [14:09:53] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3217601 (10Marostegui) I have altered silver and labstestweb2001 ``` mysql:root@localhost [labswiki]> select @@hostname; +------------+ | @@hostname | +------------+ |... [14:10:13] which normally it means backups were not transferred [14:10:43] let's see what bacula says [14:15:25] I am not able to see the jobs [14:15:35] bacula stucks on: Connecting to Client dbstore1001.eqiad.wmnet-fd at dbstore1001.eqiad.wmnet:9102 [14:17:13] do not worry, we have 647 GB of labtestweb backups [14:17:27] and 5 TB of OTRS [14:17:44] 10DBA, 13Patch-For-Review: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300#3217615 (10Marostegui) db1041 is now clean of partitions: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -e "show create table metawiki.pagelinks\G" -hdb1041 ******... [14:17:56] 647G??? 
[14:18:36] 10DBA, 07Epic, 13Patch-For-Review, 05codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#3217618 (10Marostegui) [14:18:39] 10DBA, 13Patch-For-Review: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300#3217616 (10Marostegui) 05Open>03Resolved a:03Marostegui [14:18:43] are you able to connect to the bacula client? [14:18:53] because it keeps saying connecting to me [14:18:56] but I can connect thru telnet [14:19:11] https://phabricator.wikimedia.org/P5341 [14:19:12] so if helium cannot connect, I guess the backups are not being taken? [14:19:21] oh so you can [14:19:26] and no backups [14:19:27] :( [14:19:31] for this week [14:20:58] last backup is 201704190205 [14:21:04] oh [14:21:09] look at this [14:21:12] logs? [14:21:21] 27-Apr 10:48 dbstore1001.eqiad.wmnet-fd JobId 52623: ClientRunBeforeJob: mysqldump: Error 1412: Table definition has changed, please retry transaction when dumping table `tag_summary` at row: 0 [14:21:26] could that send a kill to the job? [14:21:57] ok, that could be it [14:22:13] so totally our fault [14:22:17] yes [14:22:21] I will make a manual zhwiki [14:22:24] backup [14:22:51] and I think as a quick fix we should disable dumps on backup [14:23:05] and just backup /srv/backups [14:23:22] and put that in a cron we can easily kill and stop [14:23:32] is that the predump? [14:23:35] you lost me a bit [14:23:38] yes, sorry [14:23:38] sorry [14:24:16] and later add detailed monitoring and better format [14:24:35] plus doing that on both dbstores [14:24:51] even if we copy only from one [14:24:57] ah right, I get you, do a cronjob to mysqldump our stuff and then tell bacula, just take care of backuping this directory [14:25:00] is that? [14:25:05] yes [14:25:09] right right [14:25:10] yeah [14:25:14] the predumps works ok [14:25:26] for a small db [14:25:36] but it doesn't scale anymore for us [14:25:47] plus we are blocking other backups andrecoveries [14:26:22] yes, it is too much already [14:26:44] meta ticket: https://phabricator.wikimedia.org/T138562 [14:26:55] specific ticket: https://phabricator.wikimedia.org/T162789 [14:28:28] thanks [14:28:29] yeah [14:28:41] we better give some care to the backups [14:38:36] I am doing jawiki now [14:39:01] ok [14:39:12] I am triple checking all the watchlist, tag_siummary and change_tag changes across the shards [14:39:26] jawiki is normally quite small [14:43:54] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3217714 (10Papaul) p:05High>03Normal [14:46:10] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#3217729 (10Anomie) We can't use only by el_index since it's not unique. While MariaDB will probably return the rows in the same or... [14:51:53] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#3217763 (10Anomie) >>! In T59176#3217729, @Anomie wrote: > We can't use only by el_index since it's not unique. While MariaDB will... 
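On the T59176 exchange above and continued just below (sort by el_index and use its values for continuation, versus el_index not being unique): the generic answer to that kind of problem is keyset continuation on the sort column plus a unique tiebreaker. The query below is only a sketch of that pattern with placeholder values, and it assumes the el_id primary key as the tiebreaker; it is not necessarily the fix that ended up in ApiQueryExtLinksUsage.

```
-- Generic keyset-continuation sketch; the WHERE values would come from the
-- last row of the previous batch (placeholders here).
SELECT el_id, el_from, el_to, el_index
FROM externallinks
WHERE el_index > 'http://org.example./%'
   OR (el_index = 'http://org.example./%' AND el_id > 12345)
ORDER BY el_index, el_id
LIMIT 500;
```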
[14:53:30] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#3217786 (10jcrespo) > MariaDB will probably return the rows in the same order every time It definitely does not happen- this has... [15:40:30] remember to run puppet after the upgrade [15:41:11] yeah :) [15:49:17] let me tell you another thing that doesn't work well- stop slave on events + alter table3s [15:49:33] I am going to disable events on dbstore1001 [15:49:38] to avoid blockage [15:49:41] ok [15:49:42] i hate events [15:49:55] it is not events in this case [15:50:04] just stop/show/slave contention [15:50:14] a script would have done the same [15:50:28] and I am not breaking anything [15:50:43] dbstore1001 has not been delayed for a long time now [15:51:15] see: https://tendril.wikimedia.org/host/view/dbstore1001.eqiad.wmnet/3306 [15:52:06] checking dbstore2001 [15:52:36] dbstore1001 is finally fixed, BTW [15:53:14] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218132 (10Marostegui) This has happened again: `˜/icinga-wm 17:47> PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.00 seconds`... [15:53:32] thanks for fixing dbstore, it is always a pain [15:54:36] 2001 seems ok [15:56:01] I will do now a general check of s3 schema changes [15:56:17] and then start those on s1, s2, s4, s5 and s7 [15:56:22] (masters only) [15:56:29] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218173 (10Marostegui) And it recovered: ``` root@db1048:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Voltage: 4058 mV Current: 152 mA Temperature: 33 C... [15:57:31] md5sum keeps running [15:57:43] it will take a while yeah [15:57:49] go for the schema changes! \o/ [15:59:54] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3218190 (10Marostegui) db1070 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hdb10... [16:00:40] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3218197 (10Marostegui) db1070 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl... [16:01:28] so things that are fully down now: db1040 (pending to check the backup), db1022 (bogus data), es2019 (possible hw repair) [16:02:08] yes, that matches my list too [16:03:35] labs issue with row-based replication [16:03:49] Column 5 of table 'wikidatawiki.wb_terms' cannot be converted from type 'varchar(255)' [16:04:03] to type 'varbinary(32)' [16:04:04] which labs is that one?? [16:04:13] 1 and 3, I think [16:04:22] mmmmmm [16:05:12] probably we have to do the thing we did for the other table [16:05:21] ? [16:05:33] let me find the issue [16:07:26] https://phabricator.wikimedia.org/T73563#3117611 [16:08:01] although it is already set up with that mode on 1001 [16:08:20] are you sure you are not missing a schema change there? 
[16:09:38] not sure what you mean [16:10:11] https://gerrit.wikimedia.org/r/#/c/341322/4/repo/sql/AddTermsFullEntityId.sql [16:10:16] alter table wb_terms ADD COLUMN term_full_entity_id VARCHAR(32) DEFAULT NULL AFTER term_entity_id, add key term_full_entity (term_full_entity_id), add key term_search_full (term_language, term_full_entity_id, term_type, term_search_key(16)), drop key wb_terms_entity_type, drop key wb_terms_type; [16:10:19] there is a column missing [16:10:59] `term_full_entity_id` varbinary(32) DEFAULT NULL, [16:12:19] so it is the same column I added but varbinary [16:12:35] no, it is missing [16:12:53] I am looking at it right now [16:12:59] on which host? [16:13:10] labsdb1003, probably 1001 too [16:13:19] it is on db1069 [16:13:35] `term_full_entity_id` varbinary(32) DEFAULT NULL, -> that column is on 1001 [16:13:50] well, it is not on 1003 [16:15:16] no,because it is being added now [16:15:19] it is not finished yet [16:15:22] ok [16:15:25] but [16:15:34] it is failing on 1001 [16:15:35] the replication error is different from that [16:15:45] (and it is the same on 1003) [16:15:52] I am trying to see why [16:15:58] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 06TCB-Team, and 3 others: Add wl_timestamp to the watchlist table - https://phabricator.wikimedia.org/T125991#3218243 (10Addshore) So as I see this the semantics described in the patch still hold true. Having the date that a watched item was last ad... [16:15:59] Column 5 of table 'wikidatawiki.wb_terms' cannot be converted from type 'varchar(255)' to type 'varbinary(32)' [16:16:04] which is `term_language` varbinary(32) NOT NULL, [16:16:16] and we haven't touched that [16:16:40] no, that means it is trying to insert on the wrong column [16:17:06] https://phabricator.wikimedia.org/T163551 [16:17:26] maybe someone doing stuff there? [16:17:53] no, that is data [16:19:10] I have the query [16:20:33] yes, the insert is missing a field [16:20:47] any idea where is that query coming from? [16:20:53] i am glad it didn't happen in core [16:20:54] its master [16:21:31] could it be a decoodination? [16:21:40] between the alter order and the master change? [16:22:12] don't know, it is very strange [16:22:42] the master is changed, and goes to a master that hasn't applied the schema change yet [16:22:58] date 170427 5:12:16 [16:23:25] that is today…labsdb1001 was finished yesgterday evening late in the evening [16:23:38] we are missing a column on the updates [16:23:53] and db1069…was not ready yesterday [16:23:55] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 06TCB-Team, and 3 others: Add wl_timestamp to the watchlist table - https://phabricator.wikimedia.org/T125991#3218275 (10daniel) @Addshore Fine with me, but I want to make sure that this is really the semantics that is wanted/needed. The description... 
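For the column-by-column comparison going on around here (the "Column 5 ... cannot be converted" error, and checking what db1069, labsdb1001 and labsdb1003 each have): the conversation works from SHOW CREATE TABLE output, but the ordinal layout can also be pulled directly, which makes it easier to see which physical column a "Column N" replication error is pointing at. A sketch:

```
-- Run on each host being compared.
SELECT ORDINAL_POSITION, COLUMN_NAME, COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'wikidatawiki'
  AND TABLE_NAME = 'wb_terms'
ORDER BY ORDINAL_POSITION;
```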
[16:24:10] 69 doesn't need to be ready [16:25:21] can you check sanitarium2 around 2017-04-27T05:12:16.000890 [16:25:34] yes [16:25:35] and see if it has the new column on its binlog [16:25:49] if it does, we can change labsdb master to it [16:25:54] (asuming it has s5) [16:26:21] let me recheck the query again [16:26:33] no, sanitarium2 doesn't have s5 imported [16:26:45] it was a pending work for after the failover [16:26:56] it is ok [16:27:02] so possibilities here [16:27:22] we do the inverse schema change labsdb1001 and 3 [16:27:38] until it starts failing again [16:28:07] or we clone a new server from that date and set it in row [16:28:17] i would go for option 1 [16:28:23] it is faster and it is thursday [16:28:27] not sure it will be ready by tomorrow [16:28:47] and we need to log that query and try to get it fixed :( [16:28:52] how much time until 1003 is finished? [16:29:04] probably around 3-4 hours [16:29:12] but it has already failed on replication [16:29:31] so the thing is [16:29:40] by making the change asyncronous [16:30:01] binlog changes before it is ready, I think [16:30:03] or after [16:30:09] dpending how you see it [16:30:32] we should do all schema change blocking and on the sanitarium [16:30:38] "master" [16:30:40] and let it replicate yes [16:30:53] I am thinking of a faster option [16:30:58] so we should revert sanitarium and the slaves [16:31:17] sanitarium hasn't failed [16:31:49] the problem is we can do things on 10.1 with slave-only triggers [16:31:52] but what would happen if we remove the column and sanitarium does have the column? [16:31:53] but not on the current ones [16:32:14] let's kill 1003 alter [16:32:22] and restart replication [16:32:25] are you sure? [16:32:35] well, it is the same thing as reverting the column, no? [16:33:02] but I am not 100% that will work, too [16:33:21] think this [16:33:38] we are getting updates without the column [16:33:47] but they failed [16:34:08] and 1003 doesn't have yet the column [16:34:22] but it still fails [16:34:34] yeah, but I am not sure in which state the table is in the middle of the alter [16:35:16] how much time for a reimport? [16:35:19] maybe it is reading the table metadata and it sees the new column there even if it is still not there? [16:35:26] would that be faster? [16:35:32] let me check the size [16:35:48] root@db1069:/srv/sqldata.s5/wikidatawiki# ls -lh wb_terms.ibd [16:35:48] -rw-rw---- 1 mysql mysql 231G Apr 27 16:35 wb_terms.ibd [16:36:15] I think db1069 is to blame [16:36:19] here [16:36:43] the thing is that labsdb1001 got the alter before db1069 [16:36:46] you did it slave -> master [16:36:47] yes [16:36:48] so it should have not failed [16:36:56] which is the right way in statement [16:36:58] but in row [16:37:07] it has be be syncronous [16:37:15] think this [16:37:34] the column is there, but the inserts expect not have them [16:37:38] because unlike statement [16:37:48] they do not know about "default columns" [16:37:57] there is another possibility [16:38:12] which is ignoring the table, and reimporting [16:38:19] it may take more [16:38:23] but it would not block replication [16:38:32] and it is safe on row replication [16:38:56] reimporting from db1069 no? 
[16:38:56] better replication running and dropping the table (sorry, currently unavailable) [16:38:58] yeah [16:39:00] yea [16:39:06] or db1095 [16:39:13] no, db1095 doesn't have it [16:39:15] or production, I think it is 100% public [16:39:18] true [16:39:20] :-) [16:39:25] so we waitt [16:39:35] until we get the right schema [16:39:36] maybe we can filter wb_terms (Replication filters) so at least old data is there [16:39:42] while we do the dump from db1069 [16:39:43] yes [16:39:51] but not yet [16:40:04] only when we start getting events with the new column [16:40:18] which I think we do not get yet? [16:40:29] well, hopefully not, the column isn't even on codfw [16:40:35] no [16:40:36] I mean [16:40:43] not actual writes [16:40:53] just the column number on db1069 [16:41:15] let me show you [16:41:19] ok [16:41:35] https://phabricator.wikimedia.org/P5342 [16:41:43] this is the event we have from db1069 [16:42:05] for some reason, it is not getting the column (maybe it hadn't finish at the time) [16:42:18] we need to skip events since that time [16:42:28] and when we started the new structure [16:43:33] let me check that will arrive at some point [16:43:39] otherwise we have larger problems [16:43:47] I see [16:43:51] it is lagged [16:44:06] can we see how much? [16:44:14] oooh I get what you mean [16:44:37] it is not lagged according to monitoring [16:45:27] this is a more recent one: https://phabricator.wikimedia.org/P5342#28641 [16:45:33] so let's skip the table [16:45:46] and reimport it at some later time [16:45:54] better than general lag everywhere [16:45:57] yes [16:46:44] it is 231 G [16:46:47] :-/ [16:46:54] yes, it is massive :| [16:47:08] there is an alternative [16:47:12] skip it [16:47:34] and then use STATEMENT based replication to move it forward a few hours [16:47:52] it will take less time, but it is more involved [16:48:11] considering that we are going to import s5 on the new labsdb, maybe it is a good alternative [16:48:16] if we can discover [16:48:22] to leave this running with not super acurated data [16:48:24] when the new events started happeinng [16:48:36] no, data will be 100% accurate [16:48:42] more or les [16:49:05] let's discover when the new format got applied [16:49:40] let's do archeology [16:50:17] it started at 170427 5:12:16 [16:50:28] is that the first one? [16:50:38] when it failed [16:50:42] yes [16:50:46] first failed event [16:51:22] continued at 6:00 [16:51:51] another option would be to reapply those [16:51:55] at 7:00 [16:52:00] yes, that is what I said [16:52:06] jynus> and then use STATEMENT based replication to move it forward a few hours [16:52:22] ah sorry missed that [16:52:32] at 8 [16:53:22] not at 12 [16:53:58] until 10-11 [16:54:00] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218359 (10Papaul) Main board replacement DIMM B4 Replaced DIMM A1 Replaced BIOS update from 2.2.5 to 2.4.3 [16:54:10] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218360 (10Papaul) a:05Papaul>03Marostegui [16:55:48] how are you seeing those so fast! [16:58:04] 10:39-10:40 [16:59:21] until 546409737 [17:00:09] from s5-bin.003381:1041824315 to s5-bin.003382:546409737 [17:00:17] so the plan is [17:00:22] continue the alter [17:00:31] we will ignore the table [17:00:43] 1001 has the alter done, maybe we can start with it? 
[17:00:46] start replication until s5-bin.003382:546409737 [17:01:23] while replication stopped, apply the changes from the statement replication above (only on that table), that we should get now [17:01:38] and then restart, and we should be good to go [17:01:51] let me get the .sql of changes from the master [17:02:06] ok i will get the replication filters for 1001 [17:04:23] set global Replicate_Wild_Ignore_Table = "%wik%.%,information_schema_p.%,heartbeat.%,wikidatawiki.wb_terms" [17:04:37] uff [17:04:56] it is what it is now + the wb_terms [17:05:46] you are mixing do and ignore [17:05:58] yes [17:05:59] and that is a problem, because they are not mixable [17:06:00] you are right [17:06:15] it is a hierarchy [17:06:29] we can then just try the ignore to that first table [17:06:31] if it is in do, I think the ignore does not apply [17:06:43] although then you have the do [17:06:45] maybe we can rename the table [17:06:48] but that should be fine [17:06:49] no? [17:06:53] and create is as blackhole? [17:06:56] if you first ignore and then do? [17:07:03] doesn't work [17:07:12] we tried it for the knew labs [17:07:28] then, rename + blackhole [17:07:30] and realized it doesn't work after reading the 10 pages of documentation [17:07:33] yes [17:07:36] same thing applies [17:07:53] we do not ignore, just temporarily subtitute it for a blackhole table [17:07:59] yes [17:08:02] we will break some tables for some time [17:08:09] but it should only take a few moments [17:10:01] ok, let's rename it then and create a new one with blackhole [17:10:04] wait [17:10:12] I want to get the .sql first [17:10:17] yeah yeah [17:10:19] not doing it now [17:10:20] if not, the exercice is useless [17:10:31] I need more archeology [17:10:37] to get the script [17:12:05] sure [17:21:56] let's go for it then? [17:22:19] i think the stat and stop position for the binlogs matches what I have seen [17:22:29] wait, the script is missing ; [17:23:51] ok, rename the table [17:24:04] I will test the script on the blackhole table first [17:24:13] hehe [17:24:23] that is done [17:24:40] the new one is created [17:24:54] wait [17:24:56] so I will run the script on labsdb1001, nothing is moving yet, right? [17:24:59] it is created with the new column [17:25:17] just drop it [17:25:43] yes [17:25:46] like that [17:25:47] give me a sec [17:25:54] all you want [17:26:00] you gave me a lot to me [17:28:04] ok [17:28:07] we should be good now [17:28:13] try the script if you want [17:31:36] ok, testing on labsdb1001 [17:31:39] blackhole [17:31:45] ok [17:32:43] syntax error [17:33:18] what's going on here, 7 hours lag for a long time now: [17:33:24] MariaDB [dewiki_p]> select lag from heartbeat_p.heartbeat where shard='s5'; [17:33:27] +-------+ [17:33:29] | lag | [17:33:31] doctaxon: we are fixing it [17:33:32] +-------+ [17:33:34] doctaxon, there is maintenance ongoing [17:33:34] | 24727 | [17:33:36] it was announced [17:33:37] +-------+ [17:33:39] 1 row in set (0.02 sec) [17:33:46] thanks [17:33:47] please do not flood the channel [17:33:52] whereis the syntax error jynus? [17:34:44] line 26 [17:35:21] I think it was my bash script [17:35:34] that deleted escape character [17:35:39] let me see if I can fix it [17:35:56] weird, I can insert that perfectly on a test host [17:36:00] that line [17:36:11] really? 
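The rename-plus-blackhole substitution carried out above, as an outline. The object names follow the DO_NOT_DROP convention mentioned later in the log, and the exact commands used are not in the log; note that the placeholder's column layout has to match what the incoming binlog events expect, which is why the first placeholder, created with the new column, was dropped.

```
-- Outline only; the real commands may have differed in detail.
RENAME TABLE wikidatawiki.wb_terms TO wikidatawiki.wb_terms_DO_NOT_DROP;

-- Empty stand-in that absorbs incoming writes so replication keeps flowing
-- while the parked copy is fixed separately.
CREATE TABLE wikidatawiki.wb_terms LIKE wikidatawiki.wb_terms_DO_NOT_DROP;
ALTER TABLE wikidatawiki.wb_terms ENGINE = BLACKHOLE;

-- Later, once the parked copy is back in sync, swap it back in:
DROP TABLE wikidatawiki.wb_terms;
RENAME TABLE wikidatawiki.wb_terms_DO_NOT_DROP TO wikidatawiki.wb_terms;
```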
[17:36:46] yep [17:36:47] totally [17:37:02] I am using the one in db1063 though [17:37:19] yes, on the original script [17:37:23] not the updated one [17:38:02] I need to man echo [17:38:08] ah, the semicolom one fails yes [17:38:11] -E [17:38:49] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3218500 (10chasemp) a:03chasemp Approved by security in https://phabricator.wikimedia.org/T148583#2854927 [17:39:39] still happens [17:40:46] oh [17:40:46] i see it [17:41:11] it was really small! [17:43:06] it shoudl work now [17:43:16] bash == bad sed == good [17:43:20] haha [17:43:56] it worked [17:44:00] ok [17:44:07] so start replication until [17:44:13] or we apply the changes first? [17:44:25] I would apply the changes first [17:44:31] and see if we can replicate finely [17:44:52] well, it doesn't matter [17:45:02] they do not apply to that table for that time [17:45:11] ok [17:45:14] actually [17:45:20] we need to un-rename [17:45:22] to apply those [17:45:27] the table name is hardcoded [17:45:51] see the command [17:46:02] yeah [17:46:30] START REPLICATION UNTIL MASTER_LOG_FILE='s5-bin.003382', MASTER_LOG_POS=546409737; [17:46:38] that should be it [17:46:45] mmm 82? [17:46:46] ah yes [17:46:49] yes [17:46:56] it may take some time [17:47:09] those are 5 hours of replication [17:47:26] yeah [17:47:33] you want to apply first? [17:47:38] I cannot [17:47:48] I nee the table with the original name [17:47:54] just run the above command [17:48:00] we can rename back, apply, rename back, start replication [17:48:02] but yeah [17:48:04] it doesn't matter [17:48:05] but waut [17:48:09] on s [17:48:13] s5 [17:48:22] or default_master_connection='s5' [17:48:27] default yes [17:49:39] ok, so run it [17:50:00] ok! on my way [17:50:32] go [17:51:07] is it failing or not run yet? [17:51:11] check the correction i made [17:51:26] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3105094 (10Andrew) I merged the puppet change. Next we need to update things with ``` maintain-views --all-databases --table linter --debug --replace-all `... [17:51:35] wait [17:51:40] is the position right? 
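Roughly what the "start replication until" step being checked here comes down to in MariaDB multi-source syntax, using the s5 connection name and the positions quoted above (a sketch; the exact statements typed on labsdb1001/1003 are not in the log):

```
-- Point the session at the s5 connection, then replicate up to the position
-- found during the binlog archeology above.
SET default_master_connection = 's5';
START SLAVE UNTIL MASTER_LOG_FILE = 's5-bin.003382', MASTER_LOG_POS = 546409737;

-- Watch it catch up and confirm the UNTIL condition was honoured:
SHOW SLAVE 's5' STATUS\G
```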
[17:52:12] yes, of course [17:52:14] start slave [17:52:17] it is right hte position i think [17:52:20] the position is right [17:52:26] ok, so let's go for it then [17:52:37] it is the master position of sanitarium:s5 [17:53:00] go gog go [17:53:26] done [17:53:41] see, doctaxon, it is replicating now [17:54:06] thank you so much [17:54:17] but replag will be generalize for all week [17:54:26] not more than a few hours [17:54:29] but it will happen [17:54:36] we are doing important schema changes [17:54:38] on production [17:54:44] to modernize mediawiki [17:54:56] and actually, they will make labs faster and more reliable [17:54:58] is see the lag rising [17:55:05] actually [17:55:13] on s5 going down [17:55:19] on the other it depends on the time [17:55:37] as I said, we are under maintenance [17:55:45] nope, s5 is rising [17:55:50] those are on purpose, but unavoidable [17:55:59] oh, s5 may be on labsdb1003 [17:56:06] doctaxon: that is probably 1001 [17:56:06] so that is normal [17:56:09] sorry, 1003 [17:56:20] it will go up until it goes down again [17:56:32] just wait 2-3 more hours [17:56:38] okay [17:57:34] 26087 - 26096 - 26129 - 26188 - 26254 seconds [17:57:48] but if it is okay, all best [17:58:01] note that there is 2 options [17:58:17] you suffer 1 day the lag because the schema has changed [17:58:23] and no more lag later [17:58:39] or you stop receiving updates from production forever :-) [17:58:47] you probably want #1 :-) [17:58:51] hahaha [18:00:01] your work is great to maintain our systems. Great job! [18:00:36] we are upgrading the wb_terms table [18:00:52] check the mediawiki documentation and be ready for the change! [18:01:07] it will now have an extra column [18:01:50] what a column will it be [18:02:42] `term_full_entity_id` varbinary(32) DEFAULT NULL, [18:02:50] it will be empty for now [18:03:03] but it will allow for finer control of how wikidata is being used [18:03:24] what is a full entity id other than an entity id? [18:03:41] ah, the details to the devels [18:03:50] I am just a humble sysadmin [18:04:08] oh fine! ;) [18:06:38] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218689 (10Marostegui) Thanks @papaul! Let's see how it goes. [18:16:43] hallo Freddy2001 [18:16:54] hallo doctaxon [18:24:02] 10DBA, 10MediaWiki-API, 05MW-1.29-release-notes, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 2 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute... - https://phabricator.wikimedia.org/T115825#3218752 [18:24:23] 10DBA, 10MediaWiki-API, 05MW-1.29-release-notes, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 2 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute... - https://phabricator.wikimedia.org/T115825#3218753 [18:26:06] 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3218759 (10Umherirrender) >>! In T163597#3216972, @Marostegui wrote: > Feel free to reopen if it happens more often. At least we have now this for tracking. > Thanks! This dep... 
[18:38:34] jynus: 1001 caught up [18:38:51] 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3218792 (10Marostegui) I am not saying this is not an issue, what I meant is that if it only happens once it could be just a once time thing. If it happens more often, there mi... [18:38:54] good, I will rename the table [18:38:54] apply the script [18:40:33] now running the script [18:40:39] fingers crossed [18:41:06] ETA 2 minutes [18:41:06] you scared me for a sec, i did the show tabels and didn't see the DO_NOT_DROP table, you renamed it back faster than i checked it [18:41:09] no errors [18:42:03] and now, start replication [18:42:09] (s5) [18:42:21] ^^ [18:42:37] no error (yet) [18:42:46] so far so good yes [18:48:18] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218833 (10Cmjohnson) is there anything I need to be doing for this? [18:49:09] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218835 (10Marostegui) Do you have any spare BBU available? [18:50:22] 10DBA, 10MediaWiki-API, 05MW-1.29-release-notes, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 2 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute... - https://phabricator.wikimedia.org/T115825#3218842 [19:05:41] 10DBA, 10MediaWiki-API, 05MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), 05MW-1.29-release-notes, and 3 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute */ - https://phabricator.wikimedia.org/T115825#3218889 (... [19:06:07] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218890 (10Cmjohnson) @Marostegui yes, I can use one from a decommissioned server. [19:15:19] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3218897 (10Andrew) This is now available on all db hosts except for 1001, which is misbehaving. [19:17:20] 10DBA, 10MediaWiki-JobRunner, 07Wikimedia-log-errors: Job runners throw lots of "Can't connect to MySQL server" exceptions - https://phabricator.wikimedia.org/T121623#3218899 (10Krinkle) @jcrespo Hm... I still see them in logstash, though? 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3218930 (10Umherirrender) Another report, but works later: > [WQIzoQrAIE0AAJxBOI4AAAAT] 2017-04-27 18:08:55: Fataler Ausnahmefehler des Typs „DBQueryError“ From https://de.wi... 
[19:53:16] doctaxon, replicatin lag should now be going down on wikidata [19:53:42] and it is: https://tools.wmflabs.org/replag/ [19:54:21] glad to read this [19:54:44] you can investigate the new schema now :-) [19:55:38] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission db1057 - https://phabricator.wikimedia.org/T162135#3219053 (10Cmjohnson) p:05Normal>03Low [19:55:39] the other channels will go down, too [19:55:43] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id unsigned in WMF production - https://phabricator.wikimedia.org/T163911#3214088 (10Umherirrender) Was already created after merge of patch: T89737 [19:57:10] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id unsigned in WMF production - https://phabricator.wikimedia.org/T163911#3219070 (10jcrespo) [19:57:12] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id an unsigned int on wmf databases - https://phabricator.wikimedia.org/T89737#3219068 (10jcrespo) [19:58:58] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id an unsigned int on wmf databases - https://phabricator.wikimedia.org/T89737#1044134 (10jcrespo) The tags are very important for discoveribility! Otherwise duplicates are created, and this has been basically unateded since 2015 due... [19:59:56] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id an unsigned int on wmf databases - https://phabricator.wikimedia.org/T89737#3219080 (10jcrespo) [20:11:43] 10DBA, 13Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3219138 (10Cmjohnson) [20:11:45] 10DBA, 06Operations, 10ops-eqiad: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3219135 (10Cmjohnson) 05Open>03Resolved Reset the idrac and it appears that db1070 is not accessible from ipmi tool cmjohnson@db1070:~$ sudo ipmi-chassis --get-chassis-status System Power... [20:19:45] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219182 (10Cmjohnson) @Marostegui I will be in the data center Friday 4/27 at 0930. Let's get this take care of right away. [20:20:58] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10jcrespo) Cmjohnson- we really appreciate the effort- we now these days you have lots and lots of work! [20:29:03] 10DBA, 06Operations, 10ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3219234 (10Cmjohnson) @Marostegui Same thing as db1048? I can use a spare bbu from a decom server if you like or is this server nearing it's last days? [20:29:07] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219235 (10Marostegui) Thanks Chris!!! @jcrespo, this means reconfigure the slaves as the masters will change IPs... [20:33:38] 10DBA, 06Operations, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3219263 (10Cmjohnson) @jcrespo and @Marostegui d b1106 is racked, idrac/bios setup, switch cfg is done. dhcpd file is configured...ready for install [20:33:54] 10DBA, 13Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3219269 (10Marostegui) [20:41:47] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? 
- https://phabricator.wikimedia.org/T163895#3219307 (10Marostegui) >>! In T163895#3219235, @Marostegui wrote: > Thanks Chris!!! > @jcrespo, this means reconfigure the slaves as the masters will change IPs... Nevermind, just realised we re... [20:49:41] 10DBA, 06Labs, 10Tool-Labs, 10Tool-Labs-tools-Erwin's-tools, 10Tool-Labs-tools-Other: s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds - https://phabricator.wikimedia.org/T162519#3165938 (10Nemo_bis) Thanks... [20:56:29] 10DBA, 06Labs, 10Tool-Labs, 10Tool-Labs-tools-Erwin's-tools, 10Tool-Labs-tools-Other: s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds - https://phabricator.wikimedia.org/T162519#3219371 (10jcrespo) I have... [20:57:47] 10DBA, 06Labs, 10Tool-Labs, 10Tool-Labs-tools-Erwin's-tools, 10Tool-Labs-tools-Other: s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds - https://phabricator.wikimedia.org/T162519#3219375 (10jcrespo) I would... [21:35:26] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10jcrespo) > Nevermind, just remembered we replicate from fqdn and not IPs :) But mediawiki uses IPs. [21:56:50] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3219478 (10chasemp) a:05chasemp>03Andrew thanks @Andrew