[02:06:37] DBA, Operations: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2845942 (fgiunchedi)
[07:17:25] DBA, Operations, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846176 (Marostegui) Thanks for taking care of this. I have submitted a patch to skip this check: https://gerrit.wikimedia.org/r/325257 I am not completely aware of the who...
[07:51:17] DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2846231 (Marostegui) This is running on db1082
[08:09:41] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2846235 (Marostegui) I have started transferring data from labsdb1010 to labsdb1011. Once we have both up we can try to tes...
[08:18:30] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2846236 (Marostegui) On Friday Papaul and myself discussed about next steps for this host: We are going to recable the disk controller today (Monday) and Papaul is going to try to see if he can g...
[08:52:04] DBA, MediaWiki-extensions-Linter: DBA review of Linter extension - https://phabricator.wikimedia.org/T148866#2846276 (Legoktm) Ok, I think I've addressed all of the feedback, and would appreciate a re-review: ```lang=sql CREATE TABLE /*_*/linter ( -- primary key linter_id int UNSIGNED PRIMARY KEY not...
[09:11:03] DBA, Labs, Labs-Infrastructure: Create labsdb_accounts db on m5 to store state about labsdb accounts - https://phabricator.wikimedia.org/T152377#2846295 (yuvipanda)
[09:52:03] DBA, MediaWiki-extensions-Linter: DBA review of Linter extension - https://phabricator.wikimedia.org/T148866#2846381 (jcrespo) ``` CREATE INDEX /*i*/linter_cat ON /*_*/linter (linter_cat); CREATE UNIQUE INDEX /*i*/linter_cat_page_position ON /*_*/linter (linter_cat, linter_page, linter_start, linter_end)...
[09:53:52] DBA, Labs, Labs-Infrastructure: Create labsdb_accounts db on m5 to store state about labsdb accounts - https://phabricator.wikimedia.org/T152377#2846397 (jcrespo) a:yuvipanda>jcrespo
[10:08:02] jynus: did you deploy this? https://gerrit.wikimedia.org/r/#/c/325257/ so I can enable puppet on the affected two hosts?
[10:29:35] DBA, Operations, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846422 (jcrespo) This was caused by the cert expiration on all analytics hosts, making all mysql connections from other databases to fail. This was part of the mitigation o...
[10:30:01] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2841012 (jcrespo)
[10:30:03] DBA, Operations, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846424 (jcrespo)
[10:51:46] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2846466 (Marostegui) labsdb1011 is up and running with a single channel.
I will test two of them too
[10:52:07] marostegui, yes, did not run puppet on the hosts
[10:52:18] jynus: ok, I will take care of that
[10:52:45] DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2846467 (Marostegui) db1082 is done ``` root@neodymium:~# mysql -hdb1082 -A dewiki -e "show create table revision\G" *************************** 1. row *************************** Table: revision Creat...
[11:04:01] DBA, Operations, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846486 (Marostegui) I have re-enabled puppet and ran it to pick up the commit.
[11:39:11] DBA: 66 rows from external storage (dewiki) gave duplicate key errors on master failover - https://phabricator.wikimedia.org/T152385#2846544 (jcrespo)
[12:33:47] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2846647 (jcrespo) ``` db1095$ check_private_data.py -- Non-public databases that are present: DROP DATABASE IF EXISTS `tes...
[12:57:52] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2846675 (Marostegui) >>! In T152194#2846647, @jcrespo wrote: > ``` > db1095$ check_private_data.py > -- Non-public databas...
[13:10:48] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2846706 (Marostegui) I have started transferring the data to labsdb1009. Also took a backup of the existing data, just in...
[13:59:39] morning jynus, quick q, did you say you had patched up meta_p db manually for previous cases?
would you be willing to do it for this one https://phabricator.wikimedia.org/T151570#2845352
[14:00:15] the manage meta_p script is not in a state to run atm still, but I hate to make them wait if we can help it and if one of us is going mysql commando it would make sense to be one of you guys?
[14:00:16] I cannot
[14:00:23] unless I get the data
[14:00:31] I only did because krenair executed it
[14:00:40] I can if I have the data to fill it in
[14:00:46] (manually)
[14:01:11] get me the row values and no problem
[14:01:59] ok I'll see if I can make sense of it
[14:54:07] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2846967 (Marostegui) labsdb1009 is now up and running. The three servers are replicating fine. I have also enabled SSL. R...
[15:23:51] DBA, Operations, ops-eqiad: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847153 (Marostegui)
[15:26:29] DBA, Operations, ops-eqiad: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847171 (Marostegui) We have also marked 32:2 as failed. Both disks had media error, can we get them replaced?
[15:37:46] DBA, Edit-Review-Improvements-RC-Page, Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), Patch-For-Review: Implement functionality for RC page 'Experience level' filters - https://phabricator.wikimedia.org/T149637#2847200 (SBisson) From all the possible userExpLevel filter combinations, I...
[15:41:50] what is the state of the labsdb1009 copy?
[15:42:00] I have to reload the proxies when you are done
[15:42:19] so it un-failovers
[15:42:27] Oh, I am done: https://phabricator.wikimedia.org/T152194#2846967
[15:42:49] ah, thank you!
[15:42:56] regarding the buffer pool
[15:43:10] remember that the 25% was when tokudb was the main engine
[15:44:41] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2847231 (jcrespo) > I have checked the current labs servers and they have set 25% RAM for the buffer pool size. That was b...
[15:50:19] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2847244 (chasemp) a:yuvipanda>None >>! In T149933#2843924, @...
[15:50:37] good point
[15:59:53] DBA, Edit-Review-Improvements-RC-Page, Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), Patch-For-Review: Implement functionality for RC page 'Experience level' filters - https://phabricator.wikimedia.org/T149637#2847281 (jcrespo) When comparing query performance, EXPLAIN may be useful, b...
[16:01:49] DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2847284 (Marostegui) db1087 is done: ``` root@neodymium:~# mysql -hdb1087 -A dewiki -e "show create table revision\G" *************************** 1. row *************************** Table: revision Crea...
[16:03:39] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2847287 (yuvipanda) a:yuvipanda After more chat, we decided that...
[16:04:16] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2847295 (scfc)
[16:05:01] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Create labsdb_accounts db on m5 to store state about labsdb accounts - https://phabricator.wikimedia.org/T152377#2847309 (jcrespo) Open>Resolved
[16:05:03] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2847310 (jcrespo)
[16:11:22] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2847356 (jcrespo) As a reminder, ALTER TABLEs on m5, with just a few...
[16:13:06] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2847364 (Marostegui) We are testing the transfer again. What has been done - Recabled the RAID controller - Changed network cable Papaul has identified that db2034 has 1 disk which is a SAS 2.5...
[16:21:12] DBA, Operations, ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2847394 (Papaul) a:Papaul>Marostegui Disk replacement complete.
[16:22:12] DBA, Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2847396 (Papaul) a:Papaul>Marostegui Disk replacement complete
[16:22:33] DBA, Operations, ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2847399 (Marostegui) Thanks ``` physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Rebuilding) ```
[16:23:07] DBA, Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2847400 (Marostegui) Thanks! ``` physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, Rebuilding) ```
[17:32:53] DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2847728 (Marostegui) There was two disks in different spans with media errors: ``` 32:0 32:2 ``` Both have been marked as failed.
[17:33:26] DBA, Operations, ops-eqiad: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847730 (Marostegui)
[17:33:29] DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2847729 (Marostegui)
[17:33:45] DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2847734 (jcrespo) p:Low>Normal Lag is back.
[17:34:21] jynus: should we silence the alert or acking it means that it won't alert if it recovers and then lags again?
[17:34:34] downtime for 2 days
[17:34:53] or 1.5 days
[17:35:27] done
[17:43:01] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2847801 (Marostegui) And...it died ``` /system1/log1/record3 Targets Properties number=3 severity=Critical date=12/05/2016 time=17:16 description=System Power Fault Detect...
[17:59:10] DBA: Create a check/calendar alert for TLS certs - https://phabricator.wikimedia.org/T152427#2847847 (Marostegui)
[18:09:53] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2847917 (Marostegui) Papaul has replaced the 2.5" disk with a 3.5" one. The RAID is now rebuilding - once it is done I will try to crash it again.
[18:22:56] I am going to stop replication on db1048 for a second to change dbstore1002 m3 master
[18:23:02] so it doesn't get affected
[18:56:01] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2848149 (jcrespo) I have added the several admin users to labsdb1009...
[19:00:21] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2848170 (jcrespo) I have added the 3 labsdb hosts to tendril, cleaned up its accounts, added the admin ones that labs host...
[21:20:12] hi bd808_: what is this:
[21:20:13] 21:19 <@doctaxon> MariaDB [(none)]> USE dewiki_p;
[21:20:14] 21:19 <@doctaxon> ERROR 1049 (42000): Unknown database 'dewiki_p'
[21:20:44] doctaxon: which database replica are you logged into?
[21:20:47] and why is the table text missing in database
[21:21:13] we don't replicate the text tables to labs
[21:21:29] they are a) too big and b) stored in a format that doesn't replicate
[21:21:31] but in your manual is this stated
[21:21:44] well doesn't replicate like the rows do
[21:21:52] which manual?
[21:22:21] https://www.mediawiki.org/wiki/Manual:Text_table
[21:23:16] the database replica is dewiki_p
[21:23:59] doctaxon: sure. that manual describing the MediaWiki internals which do have the text table to hold revision data.
[21:24:19] We don't replicate them as I said though which I'm pretty sure is documented on wikitech
[21:24:51] internals?
[21:25:15] MediaWiki the software product
[21:26:05] okay back to above:
[21:26:18] 21:20 < doctaxon> 21:19 <@doctaxon> MariaDB [(none)]> USE dewiki_p;
[21:26:18] 21:20 < doctaxon> 21:19 <@doctaxon> ERROR 1049 (42000): Unknown database 'dewiki_p'
[21:26:33] doctaxon: if I do `sql dewiki_p` from tools-login I get into the dewiki_p data. How are you connecting?
[21:26:54] oh, I do 'mysql dewiki_p'
[21:27:01] this is probably a better conversation to have in #wikimedia-labs too
[21:27:02] just a moment
[21:27:25] okay thank you
[21:27:39] it's sql instead of mysql
[21:28:04] :) yw. The `sql` script is a wrapper that picks the right server to connect to and passes your credentials
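
The wrapper behaviour described in the last message (pick a server, pass credentials, then hand off to the `mysql` client) can be sketched roughly as below. This is not the actual `sql` script: the host naming pattern `<wiki>.replica.example`, the credentials file name, and the function name `build_mysql_argv` are all assumptions made for illustration. Only `--defaults-file` and `-h` are real `mysql` client options.

```python
import os

# Hypothetical host naming scheme; the real wrapper's server lookup
# is not shown anywhere in this log.
HOST_PATTERN = "{wiki}.replica.example"

# Hypothetical per-user credentials file (an assumption, not from the log).
DEFAULTS_FILE = os.path.join("~", "replica.my.cnf")


def build_mysql_argv(database, defaults_file=DEFAULTS_FILE):
    """Return the argument list a wrapper might run for a replica database."""
    # Derive the wiki name from the public database name,
    # e.g. "dewiki_p" -> "dewiki".
    wiki = database[:-2] if database.endswith("_p") else database
    host = HOST_PATTERN.format(wiki=wiki)
    return [
        "mysql",
        # --defaults-file has to be the first option given to the client.
        "--defaults-file=" + os.path.expanduser(defaults_file),
        "-h", host,
        database,
    ]


if __name__ == "__main__":
    # Show the command line instead of executing it.
    print(" ".join(build_mysql_argv("dewiki_p")))
```

Running plain `mysql dewiki_p` skips both steps, so the client connects to whatever server it defaults to. That would be consistent with the `Unknown database 'dewiki_p'` error quoted above, although the log itself does not say which server was reached.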