[07:06:36] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2834238 (Marostegui) The server survived the whole night without any issues. So it looks like the user CPU time/processes do not make the server crash.
[07:20:13] DBA: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2834257 (Marostegui) dbstore1001 is done ``` root@neodymium:~# mysql -hdbstore1001 -A dewiki -e "show create table revision\G" *************************** 1. row *************************** Table: revision Create Table: CREA...
[07:50:39] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2834273 (Marostegui) I killed all the CPU-burning processes and started a transfer from db2048, which killed the host after 30 minutes again. ``` date=11/30/2016 time=07:45 description...
[08:12:59] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2834284 (Marostegui) I am now testing generating lots of disk reads and writes locally, without involving an ethernet connection. This is how iotop looks now: ``` Total DISK READ : 108...
[08:45:43] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2834296 (Marostegui) The first attempt to crash the server locally has not been successful. I have started the test again with more threads. Current state: ``` top - 08:45:24 up 51 min, 10 users,...
[08:47:52] the unfiltered columns check doesn't work well
[08:48:11] for what?
[08:48:26] it shows no columns for db1095
[08:48:46] you mean the drops?
[08:49:06] but the other part, the drops, should be ok with some review
[08:49:57] you can have a look at it yourself
[08:50:28] can I just run the private data check on db1095 safely?
[08:50:39] yes, it is a check
[08:50:42] I just did
[08:50:56] as long as you do not redirect its output to mysql
[08:51:03] :)
[08:51:14] my philosophy is to create only pipeable commands
[08:51:24] so we can play with the output first
[08:51:44] so all that DROP TABLE output, for the tables with unfiltered columns, isn't 100% right?
[08:51:47] is that what you meant?
[08:52:22] well, the drops have to be audited
[08:52:45] I am fairly confident about them, but I would try them on test wiki first and have a look, etc.
[08:53:03] compare with sanitarium1, etc.
[08:53:09] yeah, we gotta be careful
[08:53:14] what I know is wrong is the output
[08:53:17] by the way, when do you want to try to run the sanitization on s3?
[08:53:22] -- Unfiltered columns that are present:
[08:53:31] well, this was part of the effort
[08:54:07] running this, then running sanitize_sanitarium.sh, then running this again
[08:54:43] I want to fix check_* first
[08:54:55] sure, sure
[08:55:00] maybe, if you have the time, you can start looking at the DROP commands
[08:55:06] to check they are correct
[08:55:15] maybe they should be DROP IF EXISTS?
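A minimal sketch of the DROP ... IF EXISTS form suggested above, following the pipe-first workflow also described above; object names here are purely illustrative (the real list comes from the script's output), and `generate_drops` is a hypothetical stand-in for whatever produces the statements:

```
# Pipe-first: review the generated statements before ever redirecting
# them to mysql.
#   generate_drops | less             # eyeball the DROPs first
#   generate_drops | mysql -h db1095  # apply only once audited
# IF EXISTS makes each statement idempotent: re-running the audited
# script, or running it on a host that never had the object, is harmless.
mysql -h db1095 -e "DROP TABLE IF EXISTS somewiki.some_private_table"
mysql -h db1095 -e "DROP DATABASE IF EXISTS somedeletedwiki"
```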
[08:55:23] any comment, basically, or test
[08:55:41] now is the time :-)
[08:55:43] at this point it doesn't really matter, but maybe yes, it is safer in case we plan to run this once we have slaves
[08:55:55] I will start checking the output, yes
[08:56:18] anything you see or test, please add it to the ticket, and I will try to fix it
[08:56:35] sounds good
[08:56:36] I will meanwhile try to fix the column checks
[09:24:04] I have fixed it, but
[09:24:12] now it takes a long time to run
[09:24:23] and I know we may have some false positives
[09:29:10] I am analyzing the databases now
[09:29:15] I will reply to the ticket in a bit :)
[09:41:29] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2834366 (Marostegui) >>! In T147052#2834292, @gerritbot wrote: > Change...
[09:42:54] ^nice
[09:51:01] we can go ahead with the DROPs as soon as you feel confident about them
[09:52:45] I think I broke the script again
[09:57:17] Before doing the drops, I was checking the repl filters
[09:59:28] I am a bit confused; for instance, there is one db that is supposed to be dropped from db1095
[09:59:31] dkwiki
[10:00:05] But it is not in the repl filters, and it has tables that are also not covered by the ignore-table filters, for instance the archive table
[10:00:24] so if we get an insert on that table, it would break replication on db1069 or db1095 (once the db is dropped)
[10:00:28] what am I missing?
[10:00:36] deleted wikis?
[10:00:51] should those be added to the ops private list?
[10:01:14] in the script I considered deleted wikis as private ones
[10:01:20] aaah ok
[10:01:28] so you are also checking deleted wikis
[10:01:28] maybe those are missing from the filters
[10:01:29] fine fine
[10:01:31] now I get it
[10:01:56] and I do not think those should be replicated
[10:02:08] but I am guessing, please check that against the dblists
[10:02:27] well, they are clearly not in use, otherwise db1069 would have broken
[10:02:31] I will check
[10:02:45] ah, so the current sanitarium agrees with that, I assume
[10:02:51] yes
[10:02:57] db1069 has no dkwiki
[10:03:00] and has the same filters
[10:03:04] which would confirm the need for more filters
[10:03:06] but I was like: how is this possible? what if...
[10:03:14] that is handled in realm.pp
[10:03:27] no, not really, we do not need more filters
[10:03:33] as those wikis are not in db1069
[10:03:37] I actually added the deleted ones to the filters
[10:03:42] so it is fine, I was just confused
[10:04:06] but I had to revert the commit
[10:05:56] I am checking, and the ones that are not in the repl filters are deleted
[10:06:00] so all good
[10:06:02] :)
[10:20:20] DBA, Wiki-Loves-Monuments-Database, Patch-For-Review: mysqldump is timing out preventing all tables from being included in the dump - https://phabricator.wikimedia.org/T138517#2834398 (Lokal_Profil) >>! In T138517#2834395, @gerritbot wrote: > Change 324396 had a related patch set uploaded (by Lokal P...
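The filter comparison discussed above can be read straight off the replica status on both sanitarium hosts; a sketch, assuming client access from a maintenance host as in the earlier snippets:

```
# Compare the active replication filters on the current sanitarium
# (db1069) and the new one (db1095). A wiki that is neither present on
# the host nor covered by these filters breaks replication on the
# first write to it.
for h in db1069 db1095; do
  echo "== $h =="
  mysql -h "$h" -e "SHOW SLAVE STATUS\G" \
    | grep -E 'Replicate_(Ignore_DB|Wild_Ignore_Table)'
done
```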
[10:28:23] I am confident enough to delete all the stuff; I will not delete the "test" db, as it is in db1069
[10:28:37] and its table is written to from time to time, so dropping it would break replication
[10:28:47] the content of that table is just... XD
[10:40:29] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2834437 (Marostegui) Interesting, the second attempt with even more threads didn't kill the server. However, starting a transfer killed the server again. I was generating waaaaay more load and writ...
[11:19:19] so check_private_data.py should finally do the right thing
[11:19:31] but the last check takes 20 minutes or so
[11:19:47] I am doing a test run on labsdb1001
[11:19:54] I am about to drop the databases :)
[11:19:59] mmm
[11:20:12] I didn't add a ';' at the end of each drop
[11:20:36] every time I commit, I find some small issue I missed
[11:20:57] good thing that committing is free :)
[11:21:35] well, only stable things are supposed to be committed
[11:21:48] all mine technically are, I check that it runs
[11:22:29] yes, go on with the databases
[11:22:32] BTW
[11:24:13] I am dropping them, I took a backup anyway
[11:24:19] just in case repl gets broken
[11:24:23] but I don't think it will
[11:24:29] those dbs do not exist on db1069 anyway
[11:24:33] backup?
[11:24:40] we call that production :-)
[11:25:08] haha
[11:25:19] I may start with the es* rolling restart
[11:25:28] DBA, Wikidata, Performance, User-Ladsgroup, Wikidata-Sprint-2016-11-15: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#2834541 (daniel)
[11:25:31] well, if there is something inserting into a db that I dropped, in the next few hours I can just import the single database instead of the whole thing :)
[11:25:56] the only thing that can go wrong there
[11:26:08] is that you drop it from the production master by mistake
[11:26:30] haha, that is why I checked a thousand times that I was on db1095 :)
[11:26:39] which can happen more easily than you think
[11:27:02] (that is why I never leave a master connection, ssh or mysql, open for long)
[11:29:04] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2834556 (Marostegui) I have deleted all the databases that the script su...
[11:30:02] good, not the tables :-)
[11:30:07] *now
[11:34:22] DBA, Operations: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2834580 (jcrespo)
[11:38:15] yeah, that one is going to be fun :p
[11:38:48] actually no
[11:38:53] I intend to do it today
[11:39:07] my question is whether to upgrade to .28 or not
[11:39:29] those are literally our most valuable servers
[11:40:05] on the other hand, we know there are software bugs before .28
[11:40:57] I would be conservative, to be honest
[11:41:13] if they have been functioning well, I would mix both operations (tls+upgrade)
[11:41:19] we are not in a rush to upgrade, right?
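Going back to the drops above, a minimal sketch of the single-database safety net mentioned there (dump first; re-import only that database if something still writes to it); host and wiki names are illustrative:

```
# Dump one database from the sanitarium before dropping it; if
# replication later breaks because something still writes to it, only
# this database has to be re-imported, not the whole host.
mysqldump -h db1095 --single-transaction somewiki > somewiki-predrop.sql
mysql -h db1095 -e "DROP DATABASE IF EXISTS somewiki"
# If it turns out to be needed after all:
#   mysql -h db1095 -e "CREATE DATABASE somewiki"
#   mysql -h db1095 somewiki < somewiki-predrop.sql
```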
[11:41:32] not to upgrade
[11:41:38] yes for the rolling restart
[11:41:50] if it wasn't, I wouldn't be doing it now
[11:41:54] then I would not upgrade
[11:42:23] so: mysql restart with no package upgrade
[11:42:27] because of the kernel, too
[11:42:29] or maybe
[11:42:37] one upgrade per shard
[11:43:12] or upgrade codfw
[11:43:31] jynus, marostegui: of course, if you need anything for the TLS changes on ES let me know, given I was involved in its rollout on the other shards ;)
[11:43:43] Upgrading codfw can be a good idea, but for eqiad I will just do the restart and plan the upgrade with more time
[11:43:56] I think I would do codfw
[11:44:02] and one on eqiad
[11:44:38] no, the one on eqiad doesn't work, because I have to promote the slave
[11:44:52] so we can do codfw
[11:45:16] and just a mysql restart for the others
[11:46:02] that sounds good, just codfw; see how that goes and leave them running for a while
[11:46:29] I can also temporarily disable the codfw -> eqiad link
[11:46:32] just in case
[11:46:47] +1 to that
[11:46:55] it will not hurt
[11:57:24] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2834642 (ShakespeareFan00) https://quarry.wmflabs.org/query/5979 As of 30 November 2016, this is showing 45 rows that the query says are 'orphaned' but which, when checked on English Wikipedia, certainly aren't.
[12:26:35] DBA, Operations, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2834676 (jcrespo) p:Normal>High a:jcrespo
[12:46:07] DBA, Monitoring, Operations: Create script to monitor that db dump backups are successful (and if not, that old backups are not deleted) - https://phabricator.wikimedia.org/T151999#2834710 (jcrespo)
[13:32:33] so the private data check still does not work well, because depending on the column, some are set to NULL
[13:32:43] some are set to '', and some are set to 0
[13:33:09] which is nice, because you can filter both null and not-null columns, and string and numerical ones
[13:33:26] but it is horrible to check if they are really sanitized
[13:34:27] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2834847 (Marostegui) I have been taking a look at the tables the script...
[13:34:55] the table checks are good though :)
[13:35:10] yes
[13:35:16] I am confident about those
[13:35:22] I would apply those now
[13:35:49] I would then sanitize
[13:36:09] and recheck; maybe I can tune the script a bit
[13:36:32] to do different checks depending on the table type and contents
[13:36:35] I am going to go ahead and drop those tables, yes
[13:36:38] *column
[13:37:00] or
[13:37:13] rewrite the sanitization script
[13:37:17] but not now
[13:39:37] I suspect there is a hw problem on es2 restarts
[13:39:42] for extra fun
[13:40:46] need help?
[13:40:51] not really
[13:40:55] it is done serially
[13:41:06] so not many things to do concurrently
[13:41:08] it isn't coming back
[13:41:09] ?
[13:41:21] nope
[13:41:32] I am kicking it and will see what happens
[13:41:56] hopefully it will not have been reimaged
[13:41:57] which one is it? 20127?
[13:41:59] 2017?
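For the "horrible to check" problem above, one possible shape of the check, with placeholder table and column names (which columns count as private is defined elsewhere, e.g. by the check script):

```
# String-typed private columns are blanked to NULL or '', numeric ones
# to NULL or 0, so each type needs its own predicate; any row left over
# is potentially unsanitized. All names below are placeholders.
mysql -h db1095 -e "SELECT COUNT(*) AS suspicious FROM somewiki.sometable
  WHERE some_text_col IS NOT NULL AND some_text_col <> ''"
mysql -h db1095 -e "SELECT COUNT(*) AS suspicious FROM somewiki.sometable
  WHERE some_int_col IS NOT NULL AND some_int_col <> 0"
```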
[13:42:13] the last one on the log, let me see
[13:42:30] yes, 2017
[13:42:32] on SAL
[13:42:51] the BIOS takes forever
[13:43:07] we may need to update it
[13:43:19] for it to go faster
[13:43:25] /s
[13:43:32] es2017 sounds familiar
[13:43:37] yep
[13:43:37] didn't it have some hw issues before?
[13:43:39] let me check
[13:43:46] I am doing the depooled ones first
[13:43:54] • Number of crashes es2017: 26th May, 30th May,
[13:43:55] and this was one of the ones that crashed
[13:43:58] long ago though
[13:43:58] yeah
[13:44:09] it seems the kernel is booting now
[13:44:16] no BIOS error
[13:44:28] mmm
[13:44:33] everything normal now
[13:44:37] :-/
[13:44:52] I will check IPMI
[13:44:54] do they do a memory check?
[13:45:01] not memory
[13:45:12] but they did an inventory of all the things there
[13:46:03] I will not go too in depth on this
[13:46:10] I have other machines to reboot
[13:46:14] sure :)
[13:46:17] and this one is part of the incident
[13:47:44] I checked the backups
[13:47:46] they are ok
[13:47:54] and asked about the dbstore1001 disk
[13:48:55] yeah, saw that
[13:48:58] it has been a while
[13:49:09] interesting that they arrived in Dallas a lot earlier
[13:49:23] they were more expensive, too
[14:06:52] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2834958 (Marostegui) I have dropped all the tables suggested by the scri...
[14:50:54] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2835049 (Marostegui) @Papaul db2034 and db2048 are now off. Please proceed and change the memory DIMMs and turn them on, so I can try to crash them later today or tomorrow morning. Thanks!
[15:46:17] DBA, Operations, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2835134 (jcrespo) !log stopping for 24 hours cross-dc replication on shards es2,es3 codfw->eqiad (es1015, es1019)
[16:36:24] DBA: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2835294 (Marostegui) dbstore1002 is done ``` MariaDB DBSTORE localhost dewiki > show create table revision\G *************************** 1. row *************************** Table: revision Create Table: CREATE TABLE `revisio...
[19:28:22] DBA, Operations: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029#2835972 (jcrespo)
[19:29:43] DBA, Operations: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029#2835972 (jcrespo) p:Normal>High
[20:01:23] DBA, Wikidata, Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2836131 (jcrespo) Another example of why long-running connections are a problem: I am depooling es1017 for important maintenance; I have depooled it, so I expect co...
[20:03:16] DBA, Wikidata, Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2836146 (jcrespo) I also do not want to make you work more than necessary. If you only need 1000 rows, and it contains no private data, I can give you access to...
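On the long-lasting connections complaint above, a sketch of how such connections might be spotted on a host about to be depooled; the one-hour threshold and the host are illustrative:

```
# List client connections that have been idle for over an hour; these
# survive a depool, so the maintenance cannot proceed until they close.
mysql -h es1017 -e "
  SELECT id, user, host, db, command, time
  FROM information_schema.processlist
  WHERE command = 'Sleep' AND time > 3600
  ORDER BY time DESC"
```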
[20:17:20] DBA, Wikidata, Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2836213 (daniel) @jcrespo a misc server would be fine, no private data there. We'll need to add a config variable to allow wb_changes_dispatch to live on a separat...
[20:59:47] DBA, Labs, Tool-Labs: Tool Labs: Add skin, language, and variant to user_properties_anon - https://phabricator.wikimedia.org/T152043#2836353 (Krinkle)
[21:00:00] DBA, Labs, Tool-Labs, Regression: Tool Labs: Add skin, language, and variant to user_properties_anon - https://phabricator.wikimedia.org/T152043#2836368 (Krinkle) p:Triage>High
[23:01:19] DBA, Labs: Prepare and check storage layer for new fi.wikivoyage.org - https://phabricator.wikimedia.org/T151756#2836726 (jcrespo) I've filtered the database on sanitarium (db1069) and checked that it is filtered on labs. I suppose views are pending.
[23:05:05] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2836731 (jcrespo) I wanted to sanitize this for T151756, I realized the database hasn't been...
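A sketch of the kind of spot-check implied by the fi.wikivoyage comment above; the database name is assumed to be fiwikivoyage, and any labs hostname beyond labsdb1001 (which appears earlier in the log) is an assumption:

```
# Confirm the new wiki's database is absent from the labs replicas
# until sanitization and views are in place; empty output means the
# filter is working.
for h in labsdb1001 labsdb1003; do
  echo "== $h =="
  mysql -h "$h" -e "SHOW DATABASES LIKE 'fiwikivoyage'"
done
```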