[03:26:29] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#603671 (10tstarling) Why can't you just sort by el_index? Then you could use el_index values for continuation. [05:58:44] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3216675 (10Marostegui) labsdb1001 is done: ``` [root@labsdb1001 05:57 /root] # mysql wikidatawiki -e "show create table... [05:59:38] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3216676 (10Marostegui) labsdb1001 is done: ``` [root@labsdb1001 05:57 /root] # mysql --skip-ssl wikidatawiki -e "... [06:02:59] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3216679 (10Marostegui) labsdb1011 is done: ``` root@labsdb1011:~# mysql --skip-ssl enwiki -e "show create table revision\G" *************************** 1. row *********... [06:05:41] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3216680 (10Marostegui) Pending hosts: dbstore1001 db1069 (sanitarium2) labsdb1001 labsdb1003 I think I will not do db1069, labsdb1001 and labsdb1003 as they will be... [06:13:57] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3216689 (10Marostegui) Ignore the last two posts from Stashbot, it was for another ticket number [06:14:13] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3216691 (10Marostegui) I am altering db1070 locally, instead of from neodymium as this host is going to be affected by:... [06:14:29] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3216694 (10Marostegui) I am altering db1070 locally, instead of from neodymium as this host is going to be affect... [06:38:26] 10DBA, 07Epic, 07Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3216724 (10Marostegui) [06:38:29] 10DBA, 06Operations: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927#3216721 (10Marostegui) 05Open>03Resolved a:03Marostegui This has been dropped from the random places where it existed (it had 0 rows everywhere): s2: bgwiktionary enwikiquote enwiktionary s3... [06:54:55] we need to put cebwiki outside of s3 [06:55:05] too big already? [06:55:12] maybe we can move it to s5 once we have moved wikidata away [06:55:15] It took 26,745s to alter templatelinks [06:55:17] yes [07:07:39] 07Blocked-on-schema-change, 10DBA: Convert unique keys into primary keys for some wiki tables on s3 - https://phabricator.wikimedia.org/T163912#3216759 (10jcrespo) During the night: ```lines=10 Altering cebwiki... 
ERROR 1091 (42000) at line 13: Can't DROP 'old_id'; check that column/key exists Altering cewiki.... [07:13:25] backups look ok, last backup (zhwiki) still ongoing [07:33:26] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3216796 (10Marostegui) So I have been talking to @ayounsi and the servers in row D still need to be recabled, so they will be affected by the small outage. As per his comment on: T148506#3215394 ``... [07:35:20] jynus ^ [07:38:58] no issue if we use replication [07:39:17] sure, just making sure you were aware, as there are so many stuff going on, it is easy to miss stuff :) [07:41:24] I love that 50% of the s3 hosts have one structure, 25% other and 25% other [08:11:29] Hey, I'm about to write my first schema change that will change the default values for site_stats table to be more consistent (ref T56888). I'm not sure about the "everything needs to be optional for some time" part: My guess is that it does no harm having the old defaults on some databases and the new one on others which would means there wouldn't be any additional necessary step to this. Would be glad to hear if I'm right or wrong about [08:11:29] that assumption. [08:11:30] T56888: Fresh install of MediaWiki lists "-1 recent contributors" in Special:UserLogin/signup - https://phabricator.wikimedia.org/T56888 [08:14:50] I do not undestand "having the old defaults on some databases and the new one on others which would means there wouldn't be any additional necessary step" [08:18:37] Do as told on T56888#3216523 [08:18:38] T56888: Fresh install of MediaWiki lists "-1 recent contributors" in Special:UserLogin/signup - https://phabricator.wikimedia.org/T56888 [08:21:41] jynus: Writing the schema change will need me to edit tables.sql for new wikis and the updater for existing databases. Question basically is if any additional step is necessary to fulfill "Make your schema change optional – All schema changes must go through a period of being optional." from [[mw:Development_policy#Database_patches]] [08:22:03] changes optional means [08:22:17] that if you do not run update.php, wikis should continue working [08:22:42] normally that is acieved with a configuration switch [08:22:55] but it depends on the case [08:23:06] Which they will regardless of what the default value in the schema is, right? [08:23:15] I do not know that [08:23:35] apparently there is some bug inserting -1, look at that [08:24:05] note we are wikimedia-databases [08:24:14] we do not use the updater at WMF [08:24:20] so we do not know much about it [08:24:27] but all other wikis out there use it [08:43:34] The "some bug inserting -1" _was_ the web installer inserting a row without specifying something for field ss_active_users, which lead to the schema default (-1) being used. That is fixed now (by fixing the web installer to specify a value for that field). [08:43:41] Now, while fixing this, we came up with the default values in the schema in site_stats being inconsistent. ss_total_edits, ss_good_articles, ss_images use 0. ss_total_pages, ss_users, ss_active_users use -1. We're noẃ following up to align them to all be -1. [08:43:46] Now, when that schema change gets applied, there will be a period at which "some databases" on WMF cluster will still have the "old default" (e.g. ss_images default 0) and "some databases" will have the "new defaults" (e.g. ss_images default -1) depending if they have been altered already or not. 
My _guess_ is that this will do no harm, but I'm not sure if I'm overlooking anything. [08:46:10] looks ok [08:48:27] Okay, thanks. Sorry if that was confusing, just asking in advance to be sure to do the right thing. :) [08:49:44] 10DBA: Convert unique keys into primary keys for some wiki tables on s6 - https://phabricator.wikimedia.org/T163979#3216957 (10jcrespo) [08:53:50] 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3216972 (10Marostegui) 05Open>03declined Feel free to reopen if it happens more often. At least we have now this for tracking. Thanks! [08:55:36] 10DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#3216982 (10Marostegui) [08:55:38] 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Create a cronjob/check to run check_private_data data script and report back - https://phabricator.wikimedia.org/T153680#3216979 (10Marostegui) 05Open>03Resolved a:03Marostegui Going to close this for now as the script is working fine as a fir... [09:18:58] ok to drop /dev/sda from labsdb1003 ? [09:19:38] yes :) [09:19:58] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1003&var-network=eth0&from=1493198375174&to=1493284775174 [09:20:01] and it will get worse [09:20:14] Yeah, that is my alter runnig + your alters later [09:20:57] mine has not yet arrived as labs is behind 3 levels of replicas [09:21:59] then it is only me! [09:22:06] it took 11 hours on labsdb1001 [09:22:25] so another 8 to go or so for it [09:42:23] 10DBA, 06Labs, 10Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3217063 (10jcrespo) 05Open>03Resolved We added 1 extra terabyte by deleting /srvuserdata on both hosts- this will likely impact performance negatively, but at leasy they can now rece... [09:50:57] 10DBA, 13Patch-For-Review, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3217083 (10jcrespo) I intend to mass-deploy this once 1.29.0-wmf.21 is everywhere. [10:21:36] 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Recreate a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#3217118 (10waldyrious) Thanks everyone, much appreciated! [10:47:43] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3217141 (10jcrespo) [10:48:36] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3217143 (10jcrespo) [11:27:04] 07Blocked-on-schema-change, 10DBA: Convert unique keys into primary keys for some wiki tables on s3 - https://phabricator.wikimedia.org/T163912#3217183 (10jcrespo) It "finished" now: ```lines=10 Altering ocwikibooks... Altering ocwiktionary... Altering officewiki... ERROR 1091 (42000) at line 13: Can't DROP '... [12:29:56] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3217276 (10Marostegui) I have downtimed for 20 hours the above hosts plus the slaves of those masters involved as replication broken will page: ``` db1095 db1053 db1056 db1059 db1064 db1081 db1084 db... 
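A minimal sketch of the site_stats default-value alignment discussed earlier this morning around T56888. The chat names ss_total_edits, ss_good_articles and ss_images as the columns still defaulting to 0, with -1 as the agreed target; the statement below only illustrates that idea and is not the patch that was actually written.

```
-- Hypothetical sketch only; the real change submitted for T56888 may differ.
ALTER TABLE site_stats
    ALTER COLUMN ss_total_edits   SET DEFAULT -1,
    ALTER COLUMN ss_good_articles SET DEFAULT -1,
    ALTER COLUMN ss_images        SET DEFAULT -1;
```

As discussed above, a default-only change like this is naturally "optional": existing rows already carry explicit values, so wikis that have not been altered yet keep working with the old defaults.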
[12:39:19] 10DBA, 10Analytics, 06Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3217291 (10Marostegui) That is fine by us, but then we probably want to go ahead and fix this: T159266 [12:56:57] db1040 copy finished [12:57:16] but I am running hashes to veryfy it was transmitted correctly [13:27:58] oh great [13:28:02] how long did it take in the end? [13:31:06] 24 hours [13:31:14] I tested a new compression method [13:31:21] it saved 40GB [13:31:24] oh, what did you do? [13:31:27] but it took 10 times more [13:31:31] so not worth it [13:31:38] :| [13:31:57] yeah it was also sending stuff from non ssd, yeah :( [13:32:08] cross dc [13:32:19] an encryption [13:32:21] not fast [13:32:28] but I made it even slower [13:32:57] I want to be 100% sure it is correct because being the s4 master [13:33:02] I am sure we will need it [13:33:05] (old master) [13:33:12] hopefully not! [13:33:15] I mean [13:33:20] to repair other slaves [13:33:21] not as is [13:33:25] ah yeah [13:33:54] marostegui I think I know where your errors come from [13:33:55] db1093 is broken because of the frwwiki alter [13:34:00] on it [13:34:04] which errors? [13:34:10] the non-existent dbs [13:34:18] ah, iluminame [13:34:33] I will illuminate you [13:34:37] :-) [13:34:45] :) [13:35:00] what are you using for going over all wikis list? [13:35:32] are you doing SHOW DATABASES? [13:35:41] yep [13:35:48] I think that is the issue [13:35:57] I think I will not have so many problems [13:35:59] I trust more what is in the master than the dblist [13:36:05] because I go over s2.dblist [13:36:06] maybe I am doing it wrong indeed [13:36:09] right! [13:36:11] skipping closed dbs [13:36:15] I will do that in the future yep [13:36:15] not wrong [13:36:23] there is a debate there [13:36:37] but I think my method is less annoying [13:36:56] yes, totally [13:37:03] i think i will start using it [13:37:05] we can add filters anyway [13:37:22] or reimport those wikis [13:37:28] so it doesn't happen again [13:37:28] I am basically going thru all the errors to make sure it is the "good" error [13:37:37] that is why I prefer to skip it manually [13:38:26] we should have consistency, as you wisely say [13:38:37] either it is everywhere or noware [13:38:58] another thing I do [13:39:03] is to use the database [13:39:12] yeah, when we have time (HA!) we should check that whatever is on the dblist is or not on the master [13:39:14] so that the filters discard the alters [13:39:30] like the ones I do on s3 [13:39:33] at least on labs [13:39:49] ah that is a good idea [13:39:52] so use X, ALTER Y; instead of ALTER X.Y [13:40:02] which I think works on the new ones [13:40:02] that is a pretty good tip! 
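A sketch of the tip traded just above about how the schema-change statements are written (the caveat about older hosts follows right below). The table and the change here are made up; the reasoning in the conversation is that selecting the database first lets per-database replication filters on hosts like sanitarium/labs discard the statement, whereas the fully qualified form may slip through.

```
-- Illustrative only: object names and the change itself are placeholders.

-- Qualified form, harder for per-database filters to catch on some replicas:
ALTER TABLE testwiki.sometable COMMENT = 'schema change example';

-- Form recommended in the conversation: set the default database first,
-- then alter, so filters keyed on that database apply as expected.
USE testwiki;
ALTER TABLE sometable COMMENT = 'schema change example';
```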
[13:40:10] but I am not sure on the old ones [13:40:47] I couldn't care less how you do it, eh [13:40:57] no no I know [13:41:03] but it is a good tip to get rid of these errors [13:41:05] and save time [13:41:11] but I feel pain with all the extra work you do [13:41:44] I also break stuff (see frwiki) [13:42:04] well, i left that there [13:42:09] so i technically broke it too :) [13:42:14] ha ha [13:42:19] I technically asked you to do it [13:42:22] hahaha [13:42:46] remember that everthing with thouse unrequested PKs was my idea [13:42:54] so everthign that broke [13:43:00] and will break, totally my decision [13:43:09] but in a few months we will say: hey, it was so great to get all the PKs added eh [13:43:18] well [13:43:25] let's wait for the failback [13:43:29] xdddddd [13:43:33] and you may think differently [13:43:51] * marostegui scared now [13:44:22] https://www.youtube.com/watch?v=EWAXitUcYHc [13:44:39] hahahahahahah [13:44:39] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3217482 (10Marostegui) We do not clone stuff from labs servers [13:44:41] haha [13:44:53] I can imagine you doing that on hangouts [13:49:46] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3217496 (10Marostegui) db1069 is done: ``` root@db1069:~# mysql -S /tmp/mysql.s5.sock --skip-ssl wikidatawiki -e "show c... [13:50:32] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3217514 (10Marostegui) db1069 is done: ``` root@db1069:~# mysql -S /tmp/mysql.s5.sock --skip-ssl wikidatawiki -e... [14:04:20] I think we took the right decision- text is very large- I think changes on large wikis will take 2 days, not 1 [14:05:58] yeah, at least the most difficult hosts are done [14:06:07] all the alters for tag summary and change tag finished on dbstore1001 [14:07:06] dumps too, I think [14:08:10] it didn't ? [14:09:22] i think they did [14:09:35] nothing on processlist [14:09:36] -rw-r----- 1 root root 7.7G Apr 20 03:59 zhwiki-201704190205.sql.gz [14:09:41] -rw-r----- 1 root root 4.6G Apr 27 10:48 zhwiki-201704260205.sql.gz [14:09:53] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3217601 (10Marostegui) I have altered silver and labstestweb2001 ``` mysql:root@localhost [labswiki]> select @@hostname; +------------+ | @@hostname | +------------+ |... [14:10:13] which normally it means backups were not transferred [14:10:43] let's see what bacula says [14:15:25] I am not able to see the jobs [14:15:35] bacula stucks on: Connecting to Client dbstore1001.eqiad.wmnet-fd at dbstore1001.eqiad.wmnet:9102 [14:17:13] do not worry, we have 647 GB of labtestweb backups [14:17:27] and 5 TB of OTRS [14:17:44] 10DBA, 13Patch-For-Review: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300#3217615 (10Marostegui) db1041 is now clean of partitions: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -e "show create table metawiki.pagelinks\G" -hdb1041 ******... [14:17:56] 647G??? 
[14:18:36] 10DBA, 07Epic, 13Patch-For-Review, 05codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#3217618 (10Marostegui) [14:18:39] 10DBA, 13Patch-For-Review: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300#3217616 (10Marostegui) 05Open>03Resolved a:03Marostegui [14:18:43] are you able to connect to the bacula client? [14:18:53] because it keeps saying connecting to me [14:18:56] but I can connect thru telnet [14:19:11] https://phabricator.wikimedia.org/P5341 [14:19:12] so if helium cannot connect, I guess the backups are not being taken? [14:19:21] oh so you can [14:19:26] and no backups [14:19:27] :( [14:19:31] for this week [14:20:58] last backup is 201704190205 [14:21:04] oh [14:21:09] look at this [14:21:12] logs? [14:21:21] 27-Apr 10:48 dbstore1001.eqiad.wmnet-fd JobId 52623: ClientRunBeforeJob: mysqldump: Error 1412: Table definition has changed, please retry transaction when dumping table `tag_summary` at row: 0 [14:21:26] could that send a kill to the job? [14:21:57] ok, that could be it [14:22:13] so totally our fault [14:22:17] yes [14:22:21] I will make a manual zhwiki [14:22:24] backup [14:22:51] and I think as a quick fix we should disable dumps on backup [14:23:05] and just backup /srv/backups [14:23:22] and put that in a cron we can easily kill and stop [14:23:32] is that the predump? [14:23:35] you lost me a bit [14:23:38] yes, sorry [14:23:38] sorry [14:24:16] and later add detailed monitoring and better format [14:24:35] plus doing that on both dbstores [14:24:51] even if we copy only from one [14:24:57] ah right, I get you, do a cronjob to mysqldump our stuff and then tell bacula, just take care of backuping this directory [14:25:00] is that? [14:25:05] yes [14:25:09] right right [14:25:10] yeah [14:25:14] the predumps works ok [14:25:26] for a small db [14:25:36] but it doesn't scale anymore for us [14:25:47] plus we are blocking other backups andrecoveries [14:26:22] yes, it is too much already [14:26:44] meta ticket: https://phabricator.wikimedia.org/T138562 [14:26:55] specific ticket: https://phabricator.wikimedia.org/T162789 [14:28:28] thanks [14:28:29] yeah [14:28:41] we better give some care to the backups [14:38:36] I am doing jawiki now [14:39:01] ok [14:39:12] I am triple checking all the watchlist, tag_siummary and change_tag changes across the shards [14:39:26] jawiki is normally quite small [14:43:54] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3217714 (10Papaul) p:05High>03Normal [14:46:10] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#3217729 (10Anomie) We can't use only by el_index since it's not unique. While MariaDB will probably return the rows in the same or... [14:51:53] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#3217763 (10Anomie) >>! In T59176#3217729, @Anomie wrote: > We can't use only by el_index since it's not unique. While MariaDB will... 
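On the T59176 exchange above and continued just below (sort by el_index and use its values for continuation, versus el_index not being unique): the generic answer to that kind of problem is keyset continuation on the sort column plus a unique tiebreaker. The query below is only a sketch of that pattern with placeholder values, and it assumes the el_id primary key as the tiebreaker; it is not necessarily the fix that ended up in ApiQueryExtLinksUsage.

```
-- Generic keyset-continuation sketch; the WHERE values would come from the
-- last row of the previous batch (placeholders here).
SELECT el_id, el_from, el_to, el_index
FROM externallinks
WHERE el_index > 'http://org.example./%'
   OR (el_index = 'http://org.example./%' AND el_id > 12345)
ORDER BY el_index, el_id
LIMIT 500;
```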
[14:53:30] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#3217786 (10jcrespo) > MariaDB will probably return the rows in the same order every time It definitely does not happen- this has... [15:40:30] remember to run puppet after the upgrade [15:41:11] yeah :) [15:49:17] let me tell you another thing that doesn't work well- stop slave on events + alter table3s [15:49:33] I am going to disable events on dbstore1001 [15:49:38] to avoid blockage [15:49:41] ok [15:49:42] i hate events [15:49:55] it is not events in this case [15:50:04] just stop/show/slave contention [15:50:14] a script would have done the same [15:50:28] and I am not breaking anything [15:50:43] dbstore1001 has not been delayed for a long time now [15:51:15] see: https://tendril.wikimedia.org/host/view/dbstore1001.eqiad.wmnet/3306 [15:52:06] checking dbstore2001 [15:52:36] dbstore1001 is finally fixed, BTW [15:53:14] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218132 (10Marostegui) This has happened again: `˜/icinga-wm 17:47> PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.00 seconds`... [15:53:32] thanks for fixing dbstore, it is always a pain [15:54:36] 2001 seems ok [15:56:01] I will do now a general check of s3 schema changes [15:56:17] and then start those on s1, s2, s4, s5 and s7 [15:56:22] (masters only) [15:56:29] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218173 (10Marostegui) And it recovered: ``` root@db1048:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Voltage: 4058 mV Current: 152 mA Temperature: 33 C... [15:57:31] md5sum keeps running [15:57:43] it will take a while yeah [15:57:49] go for the schema changes! \o/ [15:59:54] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3218190 (10Marostegui) db1070 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hdb10... [16:00:40] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3218197 (10Marostegui) db1070 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl... [16:01:28] so things that are fully down now: db1040 (pending to check the backup), db1022 (bogus data), es2019 (possible hw repair) [16:02:08] yes, that matches my list too [16:03:35] labs issue with row-based replication [16:03:49] Column 5 of table 'wikidatawiki.wb_terms' cannot be converted from type 'varchar(255)' [16:04:03] to type 'varbinary(32)' [16:04:04] which labs is that one?? [16:04:13] 1 and 3, I think [16:04:22] mmmmmm [16:05:12] probably we have to do the thing we did for the other table [16:05:21] ? [16:05:33] let me find the issue [16:07:26] https://phabricator.wikimedia.org/T73563#3117611 [16:08:01] although it is already set up with that mode on 1001 [16:08:20] are you sure you are not missing a schema change there? 
[16:09:38] not sure what you mean [16:10:11] https://gerrit.wikimedia.org/r/#/c/341322/4/repo/sql/AddTermsFullEntityId.sql [16:10:16] alter table wb_terms ADD COLUMN term_full_entity_id VARCHAR(32) DEFAULT NULL AFTER term_entity_id, add key term_full_entity (term_full_entity_id), add key term_search_full (term_language, term_full_entity_id, term_type, term_search_key(16)), drop key wb_terms_entity_type, drop key wb_terms_type; [16:10:19] there is a column missing [16:10:59] `term_full_entity_id` varbinary(32) DEFAULT NULL, [16:12:19] so it is the same column I added but varbinary [16:12:35] no, it is missing [16:12:53] I am looking at it right now [16:12:59] on which host? [16:13:10] labsdb1003, probably 1001 too [16:13:19] it is on db1069 [16:13:35] `term_full_entity_id` varbinary(32) DEFAULT NULL, -> that column is on 1001 [16:13:50] well, it is not on 1003 [16:15:16] no,because it is being added now [16:15:19] it is not finished yet [16:15:22] ok [16:15:25] but [16:15:34] it is failing on 1001 [16:15:35] the replication error is different from that [16:15:45] (and it is the same on 1003) [16:15:52] I am trying to see why [16:15:58] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 06TCB-Team, and 3 others: Add wl_timestamp to the watchlist table - https://phabricator.wikimedia.org/T125991#3218243 (10Addshore) So as I see this the semantics described in the patch still hold true. Having the date that a watched item was last ad... [16:15:59] Column 5 of table 'wikidatawiki.wb_terms' cannot be converted from type 'varchar(255)' to type 'varbinary(32)' [16:16:04] which is `term_language` varbinary(32) NOT NULL, [16:16:16] and we haven't touched that [16:16:40] no, that means it is trying to insert on the wrong column [16:17:06] https://phabricator.wikimedia.org/T163551 [16:17:26] maybe someone doing stuff there? [16:17:53] no, that is data [16:19:10] I have the query [16:20:33] yes, the insert is missing a field [16:20:47] any idea where is that query coming from? [16:20:53] i am glad it didn't happen in core [16:20:54] its master [16:21:31] could it be a decoodination? [16:21:40] between the alter order and the master change? [16:22:12] don't know, it is very strange [16:22:42] the master is changed, and goes to a master that hasn't applied the schema change yet [16:22:58] date 170427 5:12:16 [16:23:25] that is today…labsdb1001 was finished yesgterday evening late in the evening [16:23:38] we are missing a column on the updates [16:23:53] and db1069…was not ready yesterday [16:23:55] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 06TCB-Team, and 3 others: Add wl_timestamp to the watchlist table - https://phabricator.wikimedia.org/T125991#3218275 (10daniel) @Addshore Fine with me, but I want to make sure that this is really the semantics that is wanted/needed. The description... 
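For the column-by-column comparison going on around here (the "Column 5 ... cannot be converted" error, and checking what db1069, labsdb1001 and labsdb1003 each have): the conversation works from SHOW CREATE TABLE output, but the ordinal layout can also be pulled directly, which makes it easier to see which physical column a "Column N" replication error is pointing at. A sketch:

```
-- Run on each host being compared.
SELECT ORDINAL_POSITION, COLUMN_NAME, COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'wikidatawiki'
  AND TABLE_NAME = 'wb_terms'
ORDER BY ORDINAL_POSITION;
```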
[16:24:10] 69 doesn't need to be ready [16:25:21] can you check sanitarium2 around 2017-04-27T05:12:16.000890 [16:25:34] yes [16:25:35] and see if it has the new column on its binlog [16:25:49] if it does, we can change labsdb master to it [16:25:54] (asuming it has s5) [16:26:21] let me recheck the query again [16:26:33] no, sanitarium2 doesn't have s5 imported [16:26:45] it was a pending work for after the failover [16:26:56] it is ok [16:27:02] so possibilities here [16:27:22] we do the inverse schema change labsdb1001 and 3 [16:27:38] until it starts failing again [16:28:07] or we clone a new server from that date and set it in row [16:28:17] i would go for option 1 [16:28:23] it is faster and it is thursday [16:28:27] not sure it will be ready by tomorrow [16:28:47] and we need to log that query and try to get it fixed :( [16:28:52] how much time until 1003 is finished? [16:29:04] probably around 3-4 hours [16:29:12] but it has already failed on replication [16:29:31] so the thing is [16:29:40] by making the change asyncronous [16:30:01] binlog changes before it is ready, I think [16:30:03] or after [16:30:09] dpending how you see it [16:30:32] we should do all schema change blocking and on the sanitarium [16:30:38] "master" [16:30:40] and let it replicate yes [16:30:53] I am thinking of a faster option [16:30:58] so we should revert sanitarium and the slaves [16:31:17] sanitarium hasn't failed [16:31:49] the problem is we can do things on 10.1 with slave-only triggers [16:31:52] but what would happen if we remove the column and sanitarium does have the column? [16:31:53] but not on the current ones [16:32:14] let's kill 1003 alter [16:32:22] and restart replication [16:32:25] are you sure? [16:32:35] well, it is the same thing as reverting the column, no? [16:33:02] but I am not 100% that will work, too [16:33:21] think this [16:33:38] we are getting updates without the column [16:33:47] but they failed [16:34:08] and 1003 doesn't have yet the column [16:34:22] but it still fails [16:34:34] yeah, but I am not sure in which state the table is in the middle of the alter [16:35:16] how much time for a reimport? [16:35:19] maybe it is reading the table metadata and it sees the new column there even if it is still not there? [16:35:26] would that be faster? [16:35:32] let me check the size [16:35:48] root@db1069:/srv/sqldata.s5/wikidatawiki# ls -lh wb_terms.ibd [16:35:48] -rw-rw---- 1 mysql mysql 231G Apr 27 16:35 wb_terms.ibd [16:36:15] I think db1069 is to blame [16:36:19] here [16:36:43] the thing is that labsdb1001 got the alter before db1069 [16:36:46] you did it slave -> master [16:36:47] yes [16:36:48] so it should have not failed [16:36:56] which is the right way in statement [16:36:58] but in row [16:37:07] it has be be syncronous [16:37:15] think this [16:37:34] the column is there, but the inserts expect not have them [16:37:38] because unlike statement [16:37:48] they do not know about "default columns" [16:37:57] there is another possibility [16:38:12] which is ignoring the table, and reimporting [16:38:19] it may take more [16:38:23] but it would not block replication [16:38:32] and it is safe on row replication [16:38:56] reimporting from db1069 no? 
[16:38:56] better replication running and dropping the table (sorry, currently unavailable) [16:38:58] yeah [16:39:00] yea [16:39:06] or db1095 [16:39:13] no, db1095 doesn't have it [16:39:15] or production, I think it is 100% public [16:39:18] true [16:39:20] :-) [16:39:25] so we waitt [16:39:35] until we get the right schema [16:39:36] maybe we can filter wb_terms (Replication filters) so at least old data is there [16:39:42] while we do the dump from db1069 [16:39:43] yes [16:39:51] but not yet [16:40:04] only when we start getting events with the new column [16:40:18] which I think we do not get yet? [16:40:29] well, hopefully not, the column isn't even on codfw [16:40:35] no [16:40:36] I mean [16:40:43] not actual writes [16:40:53] just the column number on db1069 [16:41:15] let me show you [16:41:19] ok [16:41:35] https://phabricator.wikimedia.org/P5342 [16:41:43] this is the event we have from db1069 [16:42:05] for some reason, it is not getting the column (maybe it hadn't finish at the time) [16:42:18] we need to skip events since that time [16:42:28] and when we started the new structure [16:43:33] let me check that will arrive at some point [16:43:39] otherwise we have larger problems [16:43:47] I see [16:43:51] it is lagged [16:44:06] can we see how much? [16:44:14] oooh I get what you mean [16:44:37] it is not lagged according to monitoring [16:45:27] this is a more recent one: https://phabricator.wikimedia.org/P5342#28641 [16:45:33] so let's skip the table [16:45:46] and reimport it at some later time [16:45:54] better than general lag everywhere [16:45:57] yes [16:46:44] it is 231 G [16:46:47] :-/ [16:46:54] yes, it is massive :| [16:47:08] there is an alternative [16:47:12] skip it [16:47:34] and then use STATEMENT based replication to move it forward a few hours [16:47:52] it will take less time, but it is more involved [16:48:11] considering that we are going to import s5 on the new labsdb, maybe it is a good alternative [16:48:16] if we can discover [16:48:22] to leave this running with not super acurated data [16:48:24] when the new events started happeinng [16:48:36] no, data will be 100% accurate [16:48:42] more or les [16:49:05] let's discover when the new format got applied [16:49:40] let's do archeology [16:50:17] it started at 170427 5:12:16 [16:50:28] is that the first one? [16:50:38] when it failed [16:50:42] yes [16:50:46] first failed event [16:51:22] continued at 6:00 [16:51:51] another option would be to reapply those [16:51:55] at 7:00 [16:52:00] yes, that is what I said [16:52:06] jynus> and then use STATEMENT based replication to move it forward a few hours [16:52:22] ah sorry missed that [16:52:32] at 8 [16:53:22] not at 12 [16:53:58] until 10-11 [16:54:00] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218359 (10Papaul) Main board replacement DIMM B4 Replaced DIMM A1 Replaced BIOS update from 2.2.5 to 2.4.3 [16:54:10] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218360 (10Papaul) a:05Papaul>03Marostegui [16:55:48] how are you seeing those so fast! [16:58:04] 10:39-10:40 [16:59:21] until 546409737 [17:00:09] from s5-bin.003381:1041824315 to s5-bin.003382:546409737 [17:00:17] so the plan is [17:00:22] continue the alter [17:00:31] we will ignore the table [17:00:43] 1001 has the alter done, maybe we can start with it? 
[17:00:46] start replication until s5-bin.003382:546409737 [17:01:23] while replication stopped, apply the changes from the statement replication above (only on that table), that we should get now [17:01:38] and then restart, and we should be good to go [17:01:51] let me get the .sql of changes from the master [17:02:06] ok i will get the replication filters for 1001 [17:04:23] set global Replicate_Wild_Ignore_Table = "%wik%.%,information_schema_p.%,heartbeat.%,wikidatawiki.wb_terms" [17:04:37] uff [17:04:56] it is what it is now + the wb_terms [17:05:46] you are mixing do and ignore [17:05:58] yes [17:05:59] and that is a problem, because they are not mixable [17:06:00] you are right [17:06:15] it is a hierarchy [17:06:29] we can then just try the ignore to that first table [17:06:31] if it is in do, I think the ignore does not apply [17:06:43] although then you have the do [17:06:45] maybe we can rename the table [17:06:48] but that should be fine [17:06:49] no? [17:06:53] and create is as blackhole? [17:06:56] if you first ignore and then do? [17:07:03] doesn't work [17:07:12] we tried it for the knew labs [17:07:28] then, rename + blackhole [17:07:30] and realized it doesn't work after reading the 10 pages of documentation [17:07:33] yes [17:07:36] same thing applies [17:07:53] we do not ignore, just temporarily subtitute it for a blackhole table [17:07:59] yes [17:08:02] we will break some tables for some time [17:08:09] but it should only take a few moments [17:10:01] ok, let's rename it then and create a new one with blackhole [17:10:04] wait [17:10:12] I want to get the .sql first [17:10:17] yeah yeah [17:10:19] not doing it now [17:10:20] if not, the exercice is useless [17:10:31] I need more archeology [17:10:37] to get the script [17:12:05] sure [17:21:56] let's go for it then? [17:22:19] i think the stat and stop position for the binlogs matches what I have seen [17:22:29] wait, the script is missing ; [17:23:51] ok, rename the table [17:24:04] I will test the script on the blackhole table first [17:24:13] hehe [17:24:23] that is done [17:24:40] the new one is created [17:24:54] wait [17:24:56] so I will run the script on labsdb1001, nothing is moving yet, right? [17:24:59] it is created with the new column [17:25:17] just drop it [17:25:43] yes [17:25:46] like that [17:25:47] give me a sec [17:25:54] all you want [17:26:00] you gave me a lot to me [17:28:04] ok [17:28:07] we should be good now [17:28:13] try the script if you want [17:31:36] ok, testing on labsdb1001 [17:31:39] blackhole [17:31:45] ok [17:32:43] syntax error [17:33:18] what's going on here, 7 hours lag for a long time now: [17:33:24] MariaDB [dewiki_p]> select lag from heartbeat_p.heartbeat where shard='s5'; [17:33:27] +-------+ [17:33:29] | lag | [17:33:31] doctaxon: we are fixing it [17:33:32] +-------+ [17:33:34] doctaxon, there is maintenance ongoing [17:33:34] | 24727 | [17:33:36] it was announced [17:33:37] +-------+ [17:33:39] 1 row in set (0.02 sec) [17:33:46] thanks [17:33:47] please do not flood the channel [17:33:52] whereis the syntax error jynus? [17:34:44] line 26 [17:35:21] I think it was my bash script [17:35:34] that deleted escape character [17:35:39] let me see if I can fix it [17:35:56] weird, I can insert that perfectly on a test host [17:36:00] that line [17:36:11] really? 
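The rename-plus-blackhole substitution carried out above, as an outline. The object names follow the DO_NOT_DROP convention mentioned later in the log, and the exact commands used are not in the log; note that the placeholder's column layout has to match what the incoming binlog events expect, which is why the first placeholder, created with the new column, was dropped.

```
-- Outline only; the real commands may have differed in detail.
RENAME TABLE wikidatawiki.wb_terms TO wikidatawiki.wb_terms_DO_NOT_DROP;

-- Empty stand-in that absorbs incoming writes so replication keeps flowing
-- while the parked copy is fixed separately.
CREATE TABLE wikidatawiki.wb_terms LIKE wikidatawiki.wb_terms_DO_NOT_DROP;
ALTER TABLE wikidatawiki.wb_terms ENGINE = BLACKHOLE;

-- Later, once the parked copy is back in sync, swap it back in:
DROP TABLE wikidatawiki.wb_terms;
RENAME TABLE wikidatawiki.wb_terms_DO_NOT_DROP TO wikidatawiki.wb_terms;
```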
[17:36:46] yep [17:36:47] totally [17:37:02] I am using the one in db1063 though [17:37:19] yes, on the original script [17:37:23] not the updated one [17:38:02] I need to man echo [17:38:08] ah, the semicolom one fails yes [17:38:11] -E [17:38:49] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3218500 (10chasemp) a:03chasemp Approved by security in https://phabricator.wikimedia.org/T148583#2854927 [17:39:39] still happens [17:40:46] oh [17:40:46] i see it [17:41:11] it was really small! [17:43:06] it shoudl work now [17:43:16] bash == bad sed == good [17:43:20] haha [17:43:56] it worked [17:44:00] ok [17:44:07] so start replication until [17:44:13] or we apply the changes first? [17:44:25] I would apply the changes first [17:44:31] and see if we can replicate finely [17:44:52] well, it doesn't matter [17:45:02] they do not apply to that table for that time [17:45:11] ok [17:45:14] actually [17:45:20] we need to un-rename [17:45:22] to apply those [17:45:27] the table name is hardcoded [17:45:51] see the command [17:46:02] yeah [17:46:30] START REPLICATION UNTIL MASTER_LOG_FILE='s5-bin.003382', MASTER_LOG_POS=546409737; [17:46:38] that should be it [17:46:45] mmm 82? [17:46:46] ah yes [17:46:49] yes [17:46:56] it may take some time [17:47:09] those are 5 hours of replication [17:47:26] yeah [17:47:33] you want to apply first? [17:47:38] I cannot [17:47:48] I nee the table with the original name [17:47:54] just run the above command [17:48:00] we can rename back, apply, rename back, start replication [17:48:02] but yeah [17:48:04] it doesn't matter [17:48:05] but waut [17:48:09] on s [17:48:13] s5 [17:48:22] or default_master_connection='s5' [17:48:27] default yes [17:49:39] ok, so run it [17:50:00] ok! on my way [17:50:32] go [17:51:07] is it failing or not run yet? [17:51:11] check the correction i made [17:51:26] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3105094 (10Andrew) I merged the puppet change. Next we need to update things with ``` maintain-views --all-databases --table linter --debug --replace-all `... [17:51:35] wait [17:51:40] is the position right? 
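Roughly what the "start replication until" step being checked here comes down to in MariaDB multi-source syntax, using the s5 connection name and the positions quoted above (a sketch; the exact statements typed on labsdb1001/1003 are not in the log):

```
-- Point the session at the s5 connection, then replicate up to the position
-- found during the binlog archeology above.
SET default_master_connection = 's5';
START SLAVE UNTIL MASTER_LOG_FILE = 's5-bin.003382', MASTER_LOG_POS = 546409737;

-- Watch it catch up and confirm the UNTIL condition was honoured:
SHOW SLAVE 's5' STATUS\G
```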
[17:52:12] yes, of course [17:52:14] start slave [17:52:17] it is right hte position i think [17:52:20] the position is right [17:52:26] ok, so let's go for it then [17:52:37] it is the master position of sanitarium:s5 [17:53:00] go gog go [17:53:26] done [17:53:41] see, doctaxon, it is replicating now [17:54:06] thank you so much [17:54:17] but replag will be generalize for all week [17:54:26] not more than a few hours [17:54:29] but it will happen [17:54:36] we are doing important schema changes [17:54:38] on production [17:54:44] to modernize mediawiki [17:54:56] and actually, they will make labs faster and more reliable [17:54:58] is see the lag rising [17:55:05] actually [17:55:13] on s5 going down [17:55:19] on the other it depends on the time [17:55:37] as I said, we are under maintenance [17:55:45] nope, s5 is rising [17:55:50] those are on purpose, but unavoidable [17:55:59] oh, s5 may be on labsdb1003 [17:56:06] doctaxon: that is probably 1001 [17:56:06] so that is normal [17:56:09] sorry, 1003 [17:56:20] it will go up until it goes down again [17:56:32] just wait 2-3 more hours [17:56:38] okay [17:57:34] 26087 - 26096 - 26129 - 26188 - 26254 seconds [17:57:48] but if it is okay, all best [17:58:01] note that there is 2 options [17:58:17] you suffer 1 day the lag because the schema has changed [17:58:23] and no more lag later [17:58:39] or you stop receiving updates from production forever :-) [17:58:47] you probably want #1 :-) [17:58:51] hahaha [18:00:01] your work is great to maintain our systems. Great job! [18:00:36] we are upgrading the wb_terms table [18:00:52] check the mediawiki documentation and be ready for the change! [18:01:07] it will now have an extra column [18:01:50] what a column will it be [18:02:42] `term_full_entity_id` varbinary(32) DEFAULT NULL, [18:02:50] it will be empty for now [18:03:03] but it will allow for finer control of how wikidata is being used [18:03:24] what is a full entity id other than an entity id? [18:03:41] ah, the details to the devels [18:03:50] I am just a humble sysadmin [18:04:08] oh fine! ;) [18:06:38] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218689 (10Marostegui) Thanks @papaul! Let's see how it goes. [18:16:43] hallo Freddy2001 [18:16:54] hallo doctaxon [18:24:02] 10DBA, 10MediaWiki-API, 05MW-1.29-release-notes, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 2 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute... - https://phabricator.wikimedia.org/T115825#3218752 [18:24:23] 10DBA, 10MediaWiki-API, 05MW-1.29-release-notes, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 2 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute... - https://phabricator.wikimedia.org/T115825#3218753 [18:26:06] 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3218759 (10Umherirrender) >>! In T163597#3216972, @Marostegui wrote: > Feel free to reopen if it happens more often. At least we have now this for tracking. > Thanks! This dep... 
[18:38:34] jynus: 1001 caught up [18:38:51] 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3218792 (10Marostegui) I am not saying this is not an issue, what I meant is that if it only happens once it could be just a once time thing. If it happens more often, there mi... [18:38:54] good, I will rename the table [18:38:54] apply the script [18:40:33] now running the script [18:40:39] fingers crossed [18:41:06] ETA 2 minutes [18:41:06] you scared me for a sec, i did the show tabels and didn't see the DO_NOT_DROP table, you renamed it back faster than i checked it [18:41:09] no errors [18:42:03] and now, start replication [18:42:09] (s5) [18:42:21] ^^ [18:42:37] no error (yet) [18:42:46] so far so good yes [18:48:18] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218833 (10Cmjohnson) is there anything I need to be doing for this? [18:49:09] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218835 (10Marostegui) Do you have any spare BBU available? [18:50:22] 10DBA, 10MediaWiki-API, 05MW-1.29-release-notes, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 2 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute... - https://phabricator.wikimedia.org/T115825#3218842 [19:05:41] 10DBA, 10MediaWiki-API, 05MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), 05MW-1.29-release-notes, and 3 others: action=query&list=pagepropnames really slow on a big wiki, got error with ppnlimit=500 function: /* ApiQueryPagePropNames::execute */ - https://phabricator.wikimedia.org/T115825#3218889 (... [19:06:07] 10DBA, 06Operations, 10Phabricator, 10ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218890 (10Cmjohnson) @Marostegui yes, I can use one from a decommissioned server. [19:15:19] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3218897 (10Andrew) This is now available on all db hosts except for 1001, which is misbehaving. [19:17:20] 10DBA, 10MediaWiki-JobRunner, 07Wikimedia-log-errors: Job runners throw lots of "Can't connect to MySQL server" exceptions - https://phabricator.wikimedia.org/T121623#3218899 (10Krinkle) @jcrespo Hm... I still see them in logstash, though? 10DBA, 10MediaWiki-General-or-Unknown: Timeout in WikiPage::insertRedirectEntry after move - https://phabricator.wikimedia.org/T163597#3218930 (10Umherirrender) Another report, but works later: > [WQIzoQrAIE0AAJxBOI4AAAAT] 2017-04-27 18:08:55: Fataler Ausnahmefehler des Typs „DBQueryError“ From https://de.wi... 
[19:53:16] doctaxon, replicatin lag should now be going down on wikidata [19:53:42] and it is: https://tools.wmflabs.org/replag/ [19:54:21] glad to read this [19:54:44] you can investigate the new schema now :-) [19:55:38] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission db1057 - https://phabricator.wikimedia.org/T162135#3219053 (10Cmjohnson) p:05Normal>03Low [19:55:39] the other channels will go down, too [19:55:43] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id unsigned in WMF production - https://phabricator.wikimedia.org/T163911#3214088 (10Umherirrender) Was already created after merge of patch: T89737 [19:57:10] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id unsigned in WMF production - https://phabricator.wikimedia.org/T163911#3219070 (10jcrespo) [19:57:12] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id an unsigned int on wmf databases - https://phabricator.wikimedia.org/T89737#3219068 (10jcrespo) [19:58:58] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id an unsigned int on wmf databases - https://phabricator.wikimedia.org/T89737#1044134 (10jcrespo) The tags are very important for discoveribility! Otherwise duplicates are created, and this has been basically unateded since 2015 due... [19:59:56] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id an unsigned int on wmf databases - https://phabricator.wikimedia.org/T89737#3219080 (10jcrespo) [20:11:43] 10DBA, 13Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3219138 (10Cmjohnson) [20:11:45] 10DBA, 06Operations, 10ops-eqiad: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3219135 (10Cmjohnson) 05Open>03Resolved Reset the idrac and it appears that db1070 is not accessible from ipmi tool cmjohnson@db1070:~$ sudo ipmi-chassis --get-chassis-status System Power... [20:19:45] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219182 (10Cmjohnson) @Marostegui I will be in the data center Friday 4/27 at 0930. Let's get this take care of right away. [20:20:58] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10jcrespo) Cmjohnson- we really appreciate the effort- we now these days you have lots and lots of work! [20:29:03] 10DBA, 06Operations, 10ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3219234 (10Cmjohnson) @Marostegui Same thing as db1048? I can use a spare bbu from a decom server if you like or is this server nearing it's last days? [20:29:07] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219235 (10Marostegui) Thanks Chris!!! @jcrespo, this means reconfigure the slaves as the masters will change IPs... [20:33:38] 10DBA, 06Operations, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3219263 (10Cmjohnson) @jcrespo and @Marostegui d b1106 is racked, idrac/bios setup, switch cfg is done. dhcpd file is configured...ready for install [20:33:54] 10DBA, 13Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3219269 (10Marostegui) [20:41:47] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? 
- https://phabricator.wikimedia.org/T163895#3219307 (10Marostegui) >>! In T163895#3219235, @Marostegui wrote: > Thanks Chris!!! > @jcrespo, this means reconfigure the slaves as the masters will change IPs... Nevermind, just realised we re... [20:49:41] 10DBA, 06Labs, 10Tool-Labs, 10Tool-Labs-tools-Erwin's-tools, 10Tool-Labs-tools-Other: s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds - https://phabricator.wikimedia.org/T162519#3165938 (10Nemo_bis) Thanks... [20:56:29] 10DBA, 06Labs, 10Tool-Labs, 10Tool-Labs-tools-Erwin's-tools, 10Tool-Labs-tools-Other: s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds - https://phabricator.wikimedia.org/T162519#3219371 (10jcrespo) I have... [20:57:47] 10DBA, 06Labs, 10Tool-Labs, 10Tool-Labs-tools-Erwin's-tools, 10Tool-Labs-tools-Other: s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds - https://phabricator.wikimedia.org/T162519#3219375 (10jcrespo) I would... [21:35:26] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10jcrespo) > Nevermind, just remembered we replicate from fqdn and not IPs :) But mediawiki uses IPs. [21:56:50] 10DBA, 06Labs, 10MediaWiki-extensions-Linter, 13Patch-For-Review: Make "linter" table available on Labs - https://phabricator.wikimedia.org/T160611#3219478 (10chasemp) a:05chasemp>03Andrew thanks @Andrew