[00:27:40] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10bd808) @jcrespo You flagged this in the last SRE meeting as needing #cloud-services-team help to finish up. Let me know what we can do, and...
[06:43:15] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009241 (10Marostegui) Whilst we follow up this potential bug with MariaDB - I have skipped those events queries on db2093 to let it replicate because even if w...
[06:55:14] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009279 (10Marostegui) We probably do need to filter (one, or maybe both):
- global_status
- global_status_log
[07:00:27] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009281 (10Marostegui) I have set: `Replicate_Wild_Ignore_Table: tendril.global_status_log` on db2093 to see how replication goes.
[07:04:16] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4009283 (10Marostegui) For s6 probably db1063 is the only host which is not a large server. However, I wouldn't like to place db1063 as a...
[08:10:54] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009339 (10Marostegui) An ignore on global_status_log is having no effect. Probably we should try to go for an ignore on `global_status` too...
[10:14:42] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009545 (10Marostegui) I have set this now - we will see how it goes and if the slave can start catching up: ``` Replicate_Wild_Ignore_Table: tendril.global_sta...
[10:22:16] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009554 (10jcrespo) Yes, global_status is a snapshot of "current" state, so rows are deleted and inserted all the time- it can be ignored too.
[10:25:14] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009555 (10Marostegui) >>! In T184704#4009554, @jcrespo wrote: > Yes, global_status is a snapshot of "current" state, so rows are deleted and inserted all the t...
[10:36:28] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009570 (10Jayprakash12345) 05Open>03stalled
[10:36:50] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009584 (10jcrespo) Replication is not a huge deal- if it is a controlled failover, we can copy data around. If it is an emergency failover, we can start from a...
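The `Replicate_Wild_Ignore_Table` values being set above are replication filters on the tendril replica; making them survive a restart means expressing them as server options rather than only on the running slave. A minimal sketch of what that could look like, assuming a dedicated include file and default client credentials (both assumptions, not the actual puppetized setup on db2093):

```
#!/bin/bash
# Sketch: persist the wild-ignore replication filters discussed above.
# /etc/mysql/tendril-filters.cnf is a hypothetical include file; the real
# hosts are managed by puppet, so this is illustration only.
cat >> /etc/mysql/tendril-filters.cnf <<'EOF'
[mysqld]
# Skip the constantly rewritten monitoring snapshot tables on the replica
replicate-wild-ignore-table = tendril.global_status_log
replicate-wild-ignore-table = tendril.global_status
EOF

# After restarting mysqld, the active filters show up in the replica status:
mysql -e "SHOW SLAVE STATUS\G" | grep -i 'Replicate_Wild_Ignore_Table'
```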
[10:44:47] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009601 (10Marostegui) >>! In T184704#4009584, @jcrespo wrote: > Replication is not a huge deal- if it is a controlled failover, we can copy data around. If it...
[10:53:43] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009570 (10Marostegui) Is this public? Does it need to be replicated to labs?
[11:00:15] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009647 (10Jayprakash12345) @Marostegui See T168788, Same thing will happen here.
[11:02:32] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009682 (10Marostegui) Thanks - let us know when it is created in production so we can:
- Check if T187089 T185128 T153182 need to be applied
- Saniti...
[11:05:00] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4009688 (10Marostegui) a:03Papaul This host is failing almost every day (the same slot). So I am starting to believe it is the controller and not the disks anymore.
[11:05:51] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4002363 (10jcrespo) @Papaul what disks are you using as replacement?
[11:39:28] I would like to restart db1115 to pick up the binlog disablement change
[11:41:17] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009780 (10Marostegui)
[11:52:37] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009801 (10jcrespo) This seems solved to me? We can talk backups on the other goal (e.g. regular logical backups of the structure and some tables, but not with m...
[11:57:16] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009813 (10Marostegui) >>! In T184704#4009801, @jcrespo wrote: > This seems solved to me? We can talk backups on the other goal (e.g. regular logical backups of...
[11:59:28] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009820 (10jcrespo) If it is just a restart, just disable event_scheduler, wait for a few seconds and restart now.
[11:59:54] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009821 (10Marostegui) ok! Doing it now - thanks
[12:00:49] 10DBA, 10Data-Services: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4009824 (10jcrespo) a:03jcrespo
[12:05:13] jynus: I have a quick question about backups generation on dbstore2001
[12:06:18] ^heads up
[12:06:21] shoot
[12:06:53] jynus: so backups on dbstore2001 are done, however not for s1, s3 and s4, as: SHARDS=${SHARDS:-"s2 s5 s6 s7 s8 x1"}
[12:07:01] do you want me to run those manually, or what do you normally do?
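The SHARDS line quoted above uses bash's `${VAR:-default}` expansion: an exported SHARDS value wins, otherwise the hard-coded list is used, which is why s1, s3 and s4 were not in this run. A minimal sketch of that pattern with a hypothetical wrapper name (not the actual dbstore2001 script):

```
#!/bin/bash
# dump_shards.sh -- hypothetical illustration of the ${SHARDS:-...} default seen above.
# The real backup script does the actual dumping; this only shows how the list is chosen.
SHARDS=${SHARDS:-"s2 s5 s6 s7 s8 x1"}

for shard in ${SHARDS}; do
    echo "would dump section ${shard} here"
done
```

With that pattern the remaining sections could in principle be run as `SHARDS="s1 s3 s4" ./dump_shards.sh`, although, as the next messages note, there is a separate shards_non_local script for exactly this case.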
[12:08:28] 10DBA: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4009850 (10Marostegui)
[12:08:32] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4009851 (10Marostegui)
[12:08:34] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009847 (10Marostegui) 05Open>03Resolved a:03jcrespo db1115 has been restarted
```
root@db1115[(none)]> show master status\G
Empty set (0.00 sec)
```
[12:10:39] marostegui: yes, there is 1 script for that
[12:10:59] yeah
[12:11:01] the shards_non_local
[12:11:02] right?
[12:11:05] yep
[12:11:11] check it looks fine
[12:11:14] and just run it
[12:11:16] ok, just asking to see if there was some more black magic
[12:11:17] cool :)
[12:11:18] thanks
[12:11:23] you can go back to the cave :p
[12:12:20] no, that is the bad stuff I am fixing now
[12:12:28] having 3 separate scripts because it was easier
[12:12:42] as a "things are broken, make it work"
[12:13:03] yeah, totally agreed
[12:13:12] I wanted to make sure I was not missing something :)
[12:13:21] in theory, the new system doesn't yet improve lots of things
[12:13:30] but makes it automatic and configurable
[12:13:40] well, that is a big step
[12:13:58] I guess we will have to make the actual goal in 1 month
[12:21:36] Started dump at: 2018-02-28 10:31:52
[12:21:44] Finished dump at: 2018-02-28 10:59:27
[12:21:46] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4009907 (10Marostegui) We have db1113 and db1114 as spares btw. (they are large servers)
[12:21:52] what's that dump?
[12:22:09] es2001:/srv/backups/dump.m3.2018-02-28T10:31:52
[12:22:15] <3
[12:22:20] I am not 100% sure I like the ':'
[12:22:25] but
[12:22:28] it is the iso format
[12:22:50] yeah, the ":" can be a pain to parse, I would prefer "_103152" or something like that
[12:23:21] marostegui - jynus -- a global renamer just performed https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Glorious_Engine
[12:23:25] +100,000 edits
[12:23:33] but didn't notify you
[12:23:38] that's ok
[12:23:39] I've already lectured him :P
[12:23:49] ok, talk to _kart on operations
[12:23:53] We should increase the threshold anyways
[12:24:01] because I would say we should ban global renames
[12:24:13] while their script is running
[12:24:41] or at least massive ones
[12:24:50] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress <-- 4 currently running
[12:25:10] if you need us to stop global renaming please warn us in advance, thanks :)
[12:25:29] it has not started
[12:25:31] Hauskatze: As jynus said, talk to kart_ on #operations to see whether that is a concern or not while the script runs
[12:25:33] but please talk there
[12:25:51] I don't think small ones will be a problem
[12:26:14] but we shouldn't do massive ones in parallel or under maintenance
[12:27:00] blame the renamer who didn't follow the protocol, Jaime, not me
[12:27:12] No one is blaming you :)
[12:27:21] I am not blaming you, just explaining here :-)
[12:27:23] well it doesn't really sound like it's not
[12:27:28] appreciate the clarification
[12:32:01] 10DBA, 10Operations, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009933 (10Marostegui)
[12:35:06] 10DBA, 10Operations, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009956 (10Marostegui)
[12:43:47] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009996 (10Marostegui)
[12:44:41] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4010000 (10Marostegui) a:03RobH All the DBA steps are done. Assigning it to @robh so this can continue. Thanks!
[13:22:27] 10DBA, 10Wikimedia-Incident: Investigate why query killer didn't kill 1-hour log queries - https://phabricator.wikimedia.org/T188505#4010174 (10jcrespo) p:05Triage>03High
[13:56:54] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010303 (10jcrespo) db1113 and db1114 are not spares, they were bought to generate backups on eqiad, we need them.
[13:58:13] Hey, is it okay to enable a feature on enwiki? It increases the size of wbc_entity_usage to some degree, but not super big
[13:58:33] If there are any storage concerns, I can stop it
[14:11:15] 10DBA: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4010363 (10jcrespo)
[14:11:19] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010364 (10jcrespo)
[14:11:21] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010361 (10jcrespo) 05Resolved>03Open Actually, we need to puppetize the event_scheduler on or off. Maybe an eqiad/codfw.yaml enabled/disabled hiera key?
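On the ':' discussion above: the dump directory name uses the full ISO 8601 timestamp, and the suggested alternative simply drops the colons from the time part. A small sketch of both forms, following the directory layout seen on es2001; the variable names are made up:

```
#!/bin/bash
# Sketch: the same dump directory name with and without ':' in the time part.
section="m3"
iso_ts="$(date -u +%Y-%m-%dT%H:%M:%S)"     # e.g. 2018-02-28T10:31:52 (current format)
flat_ts="$(date -u +%Y-%m-%d_%H%M%S)"      # e.g. 2018-02-28_103152 (easier to parse and copy around)

echo "/srv/backups/dump.${section}.${iso_ts}"
echo "/srv/backups/dump.${section}.${flat_ts}"
```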
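The reopened tendril task above is about keeping event_scheduler enabled only where the events should actively run. The hiera/puppet wiring itself is not shown in the log; purely as a sketch of the runtime side, assuming the active DC would come from such a configuration source rather than being hard-coded:

```
#!/bin/bash
# Sketch: turn the MariaDB event scheduler on in the active DC and off elsewhere.
# "active_dc" and "my_dc" would really come from puppet/hiera; hard-coded here for illustration.
active_dc="eqiad"
my_dc="codfw"

if [ "${my_dc}" = "${active_dc}" ]; then
    mysql -e "SET GLOBAL event_scheduler = ON;"
else
    mysql -e "SET GLOBAL event_scheduler = OFF;"
fi
# A matching event_scheduler = ON/OFF line in my.cnf is what makes the choice
# survive the reboots and crashes mentioned later in the log.
```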
[14:11:48] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010367 (10jcrespo) a:05jcrespo>03None
[14:29:17] I will probably use dbstore1001 as temporary storage for labsdb1011?
[14:29:52] that means deleting /srv/sqldata
[14:34:02] the big global rename seems to be doing fine fwiw
[14:34:43] cool
[14:36:44] I'll count how many wikis are left
[14:37:27] don't worry too much about it
[14:37:51] maybe the limit should be increased now that we have better hardware
[14:38:11] and better replication control
[14:38:57] MariaDB [centralauth_p]> select count(*) from renameuser_status where ru_status="queued" and ru_newname = 'Glorious Engine'; ==> 77
[14:39:12] we recently increased the limit from 50k to 100k
[14:41:46] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010458 (10Papaul) Disks from the decommissioned servers
[14:45:06] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010462 (10jcrespo) Maybe we can try a disk we know is in a good state to see if it is the disks or the controller/other disks, etc. CC @Marostegui ?
[14:56:51] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010522 (10Marostegui) >>! In T188286#4010462, @jcrespo wrote: > Maybe we can try a disk we know is in a good state to see if it is the disks or the controller/other disks, etc. CC @Marostegui ? Agreed....
[15:00:49] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010539 (10Papaul) On each decommissioned server, when a disk is blinking before decommissioning, the disk is labelled bad so I do not have to use it.
[15:05:09] I am currently copying labsdb1011 to dbstore1001
[15:07:50] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010608 (10Marostegui) I don't want to believe we have such bad luck that all the disks we have used happened to be bad or become bad after a few days :( As I said above maybe it is safer to promote db2055...
[15:09:38] https://jira.mariadb.org/browse/MDEV-12012 this is supposed to be fixed on 10.1.30
[15:10:54] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010624 (10Papaul) I am good with that. If you want to try another disk.
[15:15:36] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010638 (10Marostegui) Let's go for another disk then! Thanks guys!
[15:17:43] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010651 (10Marostegui) I know, just saying that as we ordered 8 servers already, we don't really need to wait for those to arrive if we want to use th...
[15:22:58] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010675 (10jcrespo) Oh, I didn't think about that- you are completely right.
[15:25:37] labsdb copy will probably take a day- taking a break
[16:00:21] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010823 (10Papaul) a:05Papaul>03Marostegui Disk replaced
[16:05:06] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010832 (10jcrespo)
```
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding)
```
[16:30:28] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010893 (10Marostegui) >>! In T184704#4010361, @jcrespo wrote: > Actually, we need to puppetize the event_scheduler on or off. Maybe an eqiad/codfw.yaml enabled...
[16:35:35] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010919 (10jcrespo) Ok to me, as long as it is puppetized- for when we have to reboot the servers or they crash.
[17:26:57] 10DBA: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4011097 (10Marostegui)
[17:27:01] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4011098 (10Marostegui)
[17:27:06] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4011094 (10Marostegui) 05Open>03Resolved a:03jcrespo In a very rudimentary and ugly way, that has been implemented. Closing this then
[18:21:12] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#4011277 (10Tpt)
[19:28:48] jynus: marostegui: Hey, the Wikidata team decided to pick this up: https://phabricator.wikimedia.org/T184485. What it means DB-wise is that the logging table on Wikidata (and most wikis, like Commons) will be cut to either (1) half or (2) one percent. That will free up lots of storage. ETA for this happening is next month. Any considerations? Will definitely let you know so you can optimize the tables
[19:29:10] The table itself is 600M rows with an average of 180 bytes per row (not counting indexes)
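For scale, 600M rows at ~180 bytes each is on the order of 100 GB of row data before indexes. The actual trimming approach belongs to the Wikidata team (T184485); purely as a hedged sketch, a cleanup of a table this size is usually done in small, replication-friendly batches rather than one huge DELETE. The log_type value, batch size and sleep below are assumptions, not the real plan:

```
#!/bin/bash
# Sketch only: batched delete pattern for trimming a large logging table.
# Column names follow the MediaWiki logging schema, but the log_type being
# trimmed here is hypothetical.
db="wikidatawiki"
batch=10000

while true; do
    deleted=$(mysql -N "${db}" -e "
        DELETE FROM logging
        WHERE log_type = 'some_noisy_type'
        ORDER BY log_id
        LIMIT ${batch};
        SELECT ROW_COUNT();")
    [ "${deleted}" -eq 0 ] && break
    sleep 2   # small pauses between batches keep replication lag under control
done
```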