[04:03:12] DBA, MediaWiki-extensions-ORES: rcshow=oresreview is slow - https://phabricator.wikimedia.org/T152585#2853290 (Tgr)
[04:04:38] DBA, MediaWiki-extensions-ORES: rcshow=oresreview is slow - https://phabricator.wikimedia.org/T152585#2853304 (Tgr)
[05:12:50] DBA, MediaWiki-extensions-ORES: rcshow=oresreview is slow - https://phabricator.wikimedia.org/T152585#2853331 (Tgr) Huhh, query optimizer fail. ``` mysql:wikiadmin@db1080 [enwiki]> SELECT rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid FROM `recentchang...
[07:28:40] DBA, Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2853484 (Marostegui) Open>Resolved All good now - thanks Papaul! ``` hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F6F50) Gen8 ServBP 12+2 at P...
[07:34:30] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2853487 (Marostegui) The new labs servers (1009,1010 and 1011) as well as sanitarium2 (db1095) has gtid_domain_id now deployed. ``` root@neodymium:~# for i in db1095 labsdb1009 labsdb1010 la...
[07:35:15] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2853488 (Marostegui) The 3 new labsdb hosts and sanitarium2 have now gtid_domain_id variable deployed and enabled.
[08:09:12] DBA, Operations, ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2853507 (Marostegui) Open>Resolved This is good now - thanks Papaul! ``` root@db2042:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 001438031205F10)...
[08:36:36] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2853540 (Marostegui) The RAID finished rebuilding and it is marked as OK but the new disk is marked as a predictive failure: ``` physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive...
[08:48:46] DBA, MediaWiki-extensions-Linter: DBA review of Linter extension - https://phabricator.wikimedia.org/T148866#2853544 (jcrespo) Assuming linter_params won't go crazy (it is really extra info of a reasonable size, and not hidden structured data, I do not see any reason to block this deployment to productio...
[09:01:48] DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2853563 (Marostegui) I have been going thru all the idrac HW logs and I haven't found anything relevant. Only this: ``` OperationalStatus[1] = 3 (Degraded) ``` Which doesn't give many details, but I assume it is the RAI...
[09:04:13] DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2853576 (jcrespo) Given the cause were not the disks, should we rebuild them or should we change them anyway because of the media errors?
[09:05:05] DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2853577 (Marostegui) I would rather change them as they had media errors, just to be on the safe side
[09:07:10] DBA: Unknown cause is creating lag on db1048 under write load (but not on the other m3 slaves) - https://phabricator.wikimedia.org/T151039#2853578 (jcrespo)
[09:13:20] DBA: Unknown cause is creating lag on db1048 under write load (but not on the other m3 slaves) - https://phabricator.wikimedia.org/T151039#2853584 (Marostegui) Checked the config difference between db1048 and db2012 The only significant one is that db1048 has a larger thread_pool_size : ``` -thread_pool_siz...
[09:18:59] DBA, Labs, Labs-Infrastructure, Tool-Labs, Wikimedia-Developer-Summit (2017): Labsdbs for WMF tools and contributors: get more data, faster - https://phabricator.wikimedia.org/T149624#2853591 (Qgil) ... on the other hand this basically looks like a proposal for a presentation/training session...
[09:19:13] DBA, Operations: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853592 (jcrespo)
[09:19:36] DBA, Monitoring, Operations: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853604 (jcrespo)
[09:19:50] DBA, Monitoring, Operations: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853592 (jcrespo)
[09:19:53] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2853606 (jcrespo)
[09:19:55] DBA, Monitoring, Operations: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853607 (Marostegui) Should we merge these two: https://phabricator.wikimedia.org/T152427 ?
[09:20:51] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2853611 (jcrespo)
[09:20:53] DBA, Operations, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2853608 (jcrespo) Open>Resolved a:jcrespo The ongoing issues are now resolved. Long term fixes will go on T152188.
[09:21:08] DBA, Operations, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2853612 (jcrespo) a:jcrespo>Marostegui
[09:23:14] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2853613 (Marostegui) I have seen that: `modules/role/manifests/labs/db/replica.pp` already includes the firewall class: ``...
[09:24:24] DBA, Monitoring, Operations: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853615 (jcrespo) Sorry, I didn't see that one, my fault entirely, but you should have added me as subscriber.
[09:25:13] DBA, Monitoring, Operations: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853616 (Marostegui) Ah sorry - I thought that by adding the project DBA it would add you automatically. My bad!
[09:25:16] DBA: Create a check/calendar alert for TLS certs - https://phabricator.wikimedia.org/T152427#2853619 (jcrespo)
[09:25:18] DBA, Monitoring, Operations: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853621 (jcrespo)
[09:25:37] DBA: Create a check/calendar alert for TLS certs - https://phabricator.wikimedia.org/T152427#2847847 (jcrespo)
[09:25:40] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2853623 (jcrespo)
[09:26:04] DBA, Monitoring, Operations: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2847847 (jcrespo)
[09:27:36] DBA, Monitoring, Operations: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2853628 (jcrespo)
[09:30:08] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194#2853637 (jcrespo) We probably need custom rules, rather than `role::mariadb::ferm`.
[09:47:19] DBA, Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2853656 (Marostegui) db2023 is done ``` root@neodymium:/home/marostegui# mysql -hdb2023.codfw.wmnet -A wikidatawiki -e "show create table revision\G" *************************** 1. row ******...
[10:01:13] I want to restart dbstore1001, 2001 and 2002, ok with that?
[10:01:21] yep
[10:01:22] totally fine
[10:11:32] DBA, Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2853666 (Marostegui) codfw is done. Pending eqiad now: ``` labsdb1001.eqiad.wmnet PRIMARY KEY (`rev_id`), UNIQUE KEY `rev_page_id` (`rev_page`,`rev_id`), KEY `rev_timestamp` (`rev_time...
[10:21:20] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2853675 (Marostegui) The first has been successful and the server HAS NOT crashed. I have started the transfer again to make sure it wasn't just luck. We need to replace that disk though
[11:51:41] jynus: you happen to have the ALTERs needed for the rc slaves in enwiki handy?
[11:52:32] no rush or anything
[12:00:47] it is on software/dbtools
[12:01:04] ah cool, will look for it, thanks
[12:01:20] why?
[12:01:32] Looks like db2034 is not crashing any longer!!!
[12:01:35] https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/dbtools/s1-pager.sql
[12:01:42] I feared that
[12:01:49] just copy from the other one I did
[12:02:00] No, I would like to do a final test
[12:02:03] Which is a big alter
[12:02:10] the alter takes 3 days
[12:02:10] (the first time it crashed with a big alter)
[12:02:36] if you do, compress at the same time
[12:02:58] it has been depooled for almost a month, so…
[12:03:10] it will be good to see how it reacts also once we plug the new disk
[12:05:55] I cannot believe it, it crashed as soon as the last file was transferred :(
[12:09:12] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2853782 (Marostegui) The second transfer finished correctly but seconds after the last file was transferred, the server crashed: ``` date=12/07/2016 time=11:59 description=System Powe...
[12:11:06] marostegui, may be you will kill me... https://gerrit.wikimedia.org/r/325763
[12:11:17] haha I was commenting right noiw
[12:11:19] now
[12:16:26] next thing I plan to do is to create a script to transfer files and tablespaces
[12:16:31] and I will need your help
[12:17:19] sure thing
[12:55:36] DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2853843 (Marostegui) db1045 is done ``` root@neodymium:~# mysql -hdb1045 -A dewiki -e "show create table revision\G" *************************** 1. row *************************** Table: revision Creat...
[13:02:17] DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2853860 (Marostegui) The only pending host is db1049 (eqiad master)
[13:03:43] DBA, MediaWiki-extensions-ORES, Performance: rcshow=oresreview is slow - https://phabricator.wikimedia.org/T152585#2853863 (Aklapper)
[13:05:37] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2853865 (Marostegui) As the transfer was done correctly, I have started MySQL replication + an alter table to see if it dies with normal load too.
[14:02:26] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2853957 (Marostegui) The first alter over a 20G table went fine. I am altering now a 61G one.
[14:26:47] it takes an incredible amount of time to restart the dbstores with mechanical disks
[14:27:18] or maybe it is the compression
[14:30:06] incredible meaning…?
[14:30:16] like 30 minutes or so?
[14:30:21] the restart time is
[14:30:59] around 10 minutes
[14:31:21] but that is with half of the content it is supposed to have
[14:31:35] yeah, it has 4 shards out of 7
[14:31:47] could be the compression yes
[14:31:56] I thought innodb would speed it a bit
[14:31:58] how much does it take in dbstore100X?
[14:32:06] it was even worse on tokudb
[14:32:19] so no actual regression
[14:32:43] haha
[14:35:57] DBA, Wikidata, Patch-For-Review, Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#2854074 (Ladsgroup) @aaron: Hey, I'm working on this but it seems that RedisLockManager (and LockManagers in general) doesn't have...
[15:03:11] where do we calculate server_id? I cannot find it anywhere
[15:07:00] I think it is on the config.pp class of the mariadb module
[15:08:31] marostegui: yep there line 27
[15:08:38] is calculated from the ip
[15:09:23] so I am ok with $domain_id = $server_id
[15:09:28] but do it once
[15:09:30] Ah, I didn't have the submodule initialized since I got my wmf laptop that is why grep wasn't working :)
[15:09:34] there and job done
[15:09:52] if later we change it, we do not have to go over 20 templates
[15:11:01] yep, gtid_domain_id is actually going to be calculated the same way as server_id :)
[15:14:01] the other problem- imagine that we change later how server_id is calculated and we change without wanting domain_id too
[15:14:18] yes, that can be a big problem
[15:14:28] for now I will copy the same calculation so they do not depend on each other
[15:14:38] that is ok too, to me
[15:26:19] DBA, Wikidata, Patch-For-Review, Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#2854152 (aaron) There is no analogous method. Maybe a non-blocking engageClientLock() call can replace the isClientLockUsed() call?...
[15:39:38] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2854167 (Marostegui) The disk has been pulled out. We are going to let the big ALTER finish and then we will start the transfer again. `physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Fail...
[15:51:45] DBA, Labs, Labs-Infrastructure, Tool-Labs, Wikimedia-Developer-Summit (2017): Labsdbs for WMF tools and contributors: get more data, faster - https://phabricator.wikimedia.org/T149624#2854197 (bd808) >>! In T149624#2853591, @Qgil wrote: > ... on the other hand this basically looks like a prop...
[16:18:11] DBA, Wikidata, Patch-For-Review, Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#2854301 (daniel) Without looking at the code, it seems we should be able to do without Database::lockIsFree. It's nice to do a chec...
[17:11:59] DBA, Wikidata, Patch-For-Review, Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#2854471 (aaron) If the owner fatals, the lock will have to expire (the TTL depends on the LockManager instance config and/or whethe...
[18:08:47] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2854773 (jcrespo) We finally have the backups again up and running, with one day of delay. Reminder: check that all complete ok.
[18:13:15] DBA, Monitoring, Operations: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#2854794 (jcrespo) So, the main issue is that we can have problems like this: ``` 07-Dec 02:05 helium.eqiad.wmnet JobId 4293...
[18:14:00] DBA, Monitoring, Operations: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#2854796 (jcrespo) a:jcrespo>None
[19:00:46] DBA, MediaWiki-extensions-Linter: DBA review of Linter extension - https://phabricator.wikimedia.org/T148866#2735236 (Bawolff) I confirm for the security team that its fine to replicate this table to labs.
[19:04:49] DBA, MediaWiki-extensions-Linter: DBA review of Linter extension - https://phabricator.wikimedia.org/T148866#2854947 (Legoktm) Open>Resolved a:Legoktm Thanks all!
[19:33:04] DBA, Patch-For-Review: Moving eventlogging mariadb role into its own .pp - https://phabricator.wikimedia.org/T152081#2855082 (Ottomata)
[19:41:17] DBA, MediaWiki-Database, Performance-Team, Availability: wfWaitForSlaves in JobRunner can massively slow down run rate if just a single slave is lagged - https://phabricator.wikimedia.org/T95799#2855162 (Gilles) p:Normal>Low