[00:27:40] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10bd808) @jcrespo You flagged this in the last SRE meeting as needing #cloud-services-team help to finish up. Let me know what we can do, and...
[06:43:15] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009241 (10Marostegui) Whilst we follow up this potential bug with MariaDB - I have skipped those events queries on db2093 to let it replicate because even if w...
[06:55:14] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009279 (10Marostegui) We probably do need to filter (one, or maybe both):
- global_status
- global_status_log
[07:00:27] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009281 (10Marostegui) I have set: `Replicate_Wild_Ignore_Table: tendril.global_status_log` on db2093 to see how replication goes.
[07:04:16] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4009283 (10Marostegui) For s6 probably db1063 is the only host which is not a large server. However, I wouldn't like to place db1063 as a...
[08:10:54] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009339 (10Marostegui) An ignore on global_status_log is having no effect. Probably we should try to go for an ignore on `global_status` too...
[10:14:42] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009545 (10Marostegui) I have set this now - we will see how it goes and if the slave can start catching up: ``` Replicate_Wild_Ignore_Table: tendril.global_sta...
[10:22:16] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009554 (10jcrespo) Yes, global_status is a snapshot of "current" state, so rows are deleted and inserted all the time- it can be ignored too.
[10:25:14] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009555 (10Marostegui) >>! In T184704#4009554, @jcrespo wrote: > Yes, global_status is a snapshot of "current" state, so rows are deleted and inserted all the t...
[10:36:28] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009570 (10Jayprakash12345) 05Open>03stalled
[10:36:50] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009584 (10jcrespo) Replication is not a huge deal- if it is a controlled failover, we can copy data around. If it is an emergency failover, we can start from a...
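The `Replicate_Wild_Ignore_Table` values being set above are replication filters on the tendril replica; making them survive a restart means expressing them as server options rather than only on the running slave. A minimal sketch of what that could look like, assuming a dedicated include file and default client credentials (both assumptions, not the actual puppetized setup on db2093):

```
#!/bin/bash
# Sketch: persist the wild-ignore replication filters discussed above.
# /etc/mysql/tendril-filters.cnf is a hypothetical include file; the real
# hosts are managed by puppet, so this is illustration only.
cat >> /etc/mysql/tendril-filters.cnf <<'EOF'
[mysqld]
# Skip the constantly rewritten monitoring snapshot tables on the replica
replicate-wild-ignore-table = tendril.global_status_log
replicate-wild-ignore-table = tendril.global_status
EOF

# After restarting mysqld, the active filters show up in the replica status:
mysql -e "SHOW SLAVE STATUS\G" | grep -i 'Replicate_Wild_Ignore_Table'
```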
[10:44:47] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009601 (10Marostegui) >>! In T184704#4009584, @jcrespo wrote: > Replication is not a huge deal- if it is a controlled failover, we can copy data around. If it...
[10:53:43] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009570 (10Marostegui) Is this public? Does it need to be replicated to labs?
[11:00:15] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009647 (10Jayprakash12345) @Marostegui See T168788, Same thing will happen here.
[11:02:32] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4009682 (10Marostegui) Thanks - let us know when it is created in production so we can:
- Check if T187089 T185128 T153182 need to be applied
- Saniti...
[11:05:00] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4009688 (10Marostegui) a:03Papaul This host is failing almost every day (the same slot). So I am starting to believe it is the controller and not the disks anymore.
[11:05:51] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4002363 (10jcrespo) @Papaul what disks are you using as replacement?
[11:39:28] I would like to restart db1115 to pick up the binlog disablement change
[11:41:17] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009780 (10Marostegui)
[11:52:37] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009801 (10jcrespo) This seems solved to me? We can talk backups on the other goal (e.g. regular logical backups of the structure and some tables, but not with m...
[11:57:16] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009813 (10Marostegui) >>! In T184704#4009801, @jcrespo wrote: > This seems solved to me? We can talk backups on the other goal (e.g. regular logical backups of...
[11:59:28] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009820 (10jcrespo) If it is just a restart, just disable event_scheduler, wait for a few seconds and restart now.
[11:59:54] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009821 (10Marostegui) ok! Doing it now - thanks
[12:00:49] 10DBA, 10Data-Services: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4009824 (10jcrespo) a:03jcrespo
[12:05:13] jynus: I have a quick question about backups generation on dbstore2001
[12:06:18] ^heads up
[12:06:21] shoot
[12:06:53] jynus: so backups on dbstore2001 are done, however not for s1, s3 and s4, as: SHARDS=${SHARDS:-"s2 s5 s6 s7 s8 x1"}
[12:07:01] do you want me to run those manually, or what do you normally do?
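The SHARDS line quoted above uses bash's `${VAR:-default}` expansion: an exported SHARDS value wins, otherwise the hard-coded list is used, which is why s1, s3 and s4 were not in this run. A minimal sketch of that pattern with a hypothetical wrapper name (not the actual dbstore2001 script):

```
#!/bin/bash
# dump_shards.sh -- hypothetical illustration of the ${SHARDS:-...} default seen above.
# The real backup script does the actual dumping; this only shows how the list is chosen.
SHARDS=${SHARDS:-"s2 s5 s6 s7 s8 x1"}

for shard in ${SHARDS}; do
    echo "would dump section ${shard} here"
done
```

With that pattern the remaining sections could in principle be run as `SHARDS="s1 s3 s4" ./dump_shards.sh`, although, as the next messages note, there is a separate shards_non_local script for exactly this case.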
[12:08:28] 10DBA: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4009850 (10Marostegui)
[12:08:32] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4009851 (10Marostegui)
[12:08:34] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4009847 (10Marostegui) 05Open>03Resolved a:03jcrespo db1115 has been restarted
```
root@db1115[(none)]> show master status\G
Empty set (0.00 sec)
```
[12:10:39] marostegui: yes, there is 1 script for that
[12:10:59] yeah
[12:11:01] the shards_non_local
[12:11:02] right?
[12:11:05] yep
[12:11:11] check it looks fine
[12:11:14] and just run it
[12:11:16] ok, just asking to see if there was some more black magic
[12:11:17] cool :)
[12:11:18] thanks
[12:11:23] you can go back to the cave :p
[12:12:20] no, that is the bad stuff I am fixing now
[12:12:28] having 3 separate scripts because it was easier
[12:12:42] as a "things are broken, make it work"
[12:13:03] yeah, totally agreed
[12:13:12] I wanted to make sure I was not missing something :)
[12:13:21] in theory, the new system doesn't yet improve lots of things
[12:13:30] but makes it automatic and configurable
[12:13:40] well, that is a big step
[12:13:58] I guess we will have to make the actual goal in 1 month
[12:21:36] Started dump at: 2018-02-28 10:31:52
[12:21:44] Finished dump at: 2018-02-28 10:59:27
[12:21:46] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4009907 (10Marostegui) We have db1113 and db1114 as spares btw. (they are large servers)
[12:21:52] what's that dump?
[12:22:09] es2001:/srv/backups/dump.m3.2018-02-28T10:31:52
[12:22:15] <3
[12:22:20] I am not 100% sure I like the ':'
[12:22:25] but
[12:22:28] it is the iso format
[12:22:50] yeah, the ":" can be a pain to parse, I would prefer "_103152" or something like that
[12:23:21] marostegui - jynus -- a global renamer just performed https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Glorious_Engine
[12:23:25] +100,000 edits
[12:23:33] but didn't notify you
[12:23:38] that's ok
[12:23:39] I've already lectured him :P
[12:23:49] ok, talk to _kart on operations
[12:23:53] We should increase the threshold anyways
[12:24:01] because I would say we should ban global renames
[12:24:13] while their script is running
[12:24:41] or at least massive ones
[12:24:50] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress <-- 4 currently running
[12:25:10] if you need us to stop global renaming please warn us in advance, thanks :)
[12:25:29] it has not started
[12:25:31] Hauskatze: As jynus said, talk to kart_ on #operations to see whether that is a concern or not while the script runs
[12:25:33] but please talk there
[12:25:51] I don't think small ones will be a problem
[12:26:14] but we shouldn't do massive ones in parallel or under maintenance
[12:27:00] blame the renamer who didn't follow the protocol, Jaime, not me
[12:27:12] No one is blaming you :)
[12:27:21] I am not blaming you, just explaining here :-)
[12:27:23] well it doesn't really sound like it's not
[12:27:28] appreciate the clarification
[12:32:01] 10DBA, 10Operations, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009933 (10Marostegui)
[12:35:06] 10DBA, 10Operations, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009956 (10Marostegui)
[12:43:47] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009996 (10Marostegui)
[12:44:41] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4010000 (10Marostegui) a:03RobH All the DBA steps are done. Assigning it to @robh so this can continue. Thanks!
[13:22:27] 10DBA, 10Wikimedia-Incident: Investigate why query killer didn't kill 1-hour log queries - https://phabricator.wikimedia.org/T188505#4010174 (10jcrespo) p:05Triage>03High
[13:56:54] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010303 (10jcrespo) db1113 and db1114 are not spares, they were bought to generate backups on eqiad, we need them.
[13:58:13] Hey, is it okay to enable a feature on enwiki? It increases the size of wbc_entity_usage to some degree, but not super big
[13:58:33] If there are any storage concerns, I can stop it
[14:11:15] 10DBA: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4010363 (10jcrespo)
[14:11:19] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010364 (10jcrespo)
[14:11:21] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010361 (10jcrespo) 05Resolved>03Open Actually, we need to puppetize the event_scheduler on or off. Maybe an eqiad/codfw.yaml enabled/disabled hiera key?
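On the ':' discussion above: the dump directory name uses the full ISO 8601 timestamp, and the suggested alternative simply drops the colons from the time part. A small sketch of both forms, following the directory layout seen on es2001; the variable names are made up:

```
#!/bin/bash
# Sketch: the same dump directory name with and without ':' in the time part.
section="m3"
iso_ts="$(date -u +%Y-%m-%dT%H:%M:%S)"     # e.g. 2018-02-28T10:31:52 (current format)
flat_ts="$(date -u +%Y-%m-%d_%H%M%S)"      # e.g. 2018-02-28_103152 (easier to parse and copy around)

echo "/srv/backups/dump.${section}.${iso_ts}"
echo "/srv/backups/dump.${section}.${flat_ts}"
```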
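The reopened tendril task above is about keeping event_scheduler enabled only where the events should actively run. The hiera/puppet wiring itself is not shown in the log; purely as a sketch of the runtime side, assuming the active DC would come from such a configuration source rather than being hard-coded:

```
#!/bin/bash
# Sketch: turn the MariaDB event scheduler on in the active DC and off elsewhere.
# "active_dc" and "my_dc" would really come from puppet/hiera; hard-coded here for illustration.
active_dc="eqiad"
my_dc="codfw"

if [ "${my_dc}" = "${active_dc}" ]; then
    mysql -e "SET GLOBAL event_scheduler = ON;"
else
    mysql -e "SET GLOBAL event_scheduler = OFF;"
fi
# A matching event_scheduler = ON/OFF line in my.cnf is what makes the choice
# survive the reboots and crashes mentioned later in the log.
```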
[14:11:48] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010367 (10jcrespo) a:05jcrespo>03None
[14:29:17] I will probably use dbstore1001 as temporary storage for labsdb1011?
[14:29:52] that means deleting /srv/sqldata
[14:34:02] the big global rename seems to be doing fine fwiw
[14:34:43] cool
[14:36:44] I'll count how many wikis are left
[14:37:27] don't worry too much about it
[14:37:51] maybe the limit should be increased now that we have better hardware
[14:38:11] and better replication control
[14:38:57] MariaDB [centralauth_p]> select count(*) from renameuser_status where ru_status="queued" and ru_newname = 'Glorious Engine'; ==> 77
[14:39:12] we recently increased the limit from 50k to 100k
[14:41:46] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010458 (10Papaul) Disks from the decommissioned servers
[14:45:06] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010462 (10jcrespo) Maybe we can try a disk we know is in a good state to see if it is the disks or the controller/other disks, etc. CC @Marostegui ?
[14:56:51] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010522 (10Marostegui) >>! In T188286#4010462, @jcrespo wrote: > Maybe we can try a disk we know is in a good state to see if it is the disks or the controller/other disks, etc. CC @Marostegui ? Agreed....
[15:00:49] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010539 (10Papaul) On each decommissioned server, when a disk is blinking before decommissioning, the disk is labelled bad so I do not have to use it.
[15:05:09] I am currently copying labsdb1011 to dbstore1001
[15:07:50] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010608 (10Marostegui) I don't want to believe we have such bad luck that all the disks we have used happened to be bad or become bad after a few days :( As I said above maybe it is safer to promote db2055...
[15:09:38] https://jira.mariadb.org/browse/MDEV-12012 this is supposed to be fixed on 10.1.30
[15:10:54] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010624 (10Papaul) I am good with that. If you want to try another disk.
[15:15:36] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010638 (10Marostegui) Let's go for another disk then! Thanks guys!
[15:17:43] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010651 (10Marostegui) I know, just saying that as we ordered 8 servers already, we don't really need to wait for those to arrive if we want to use th...
[15:22:58] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010675 (10jcrespo) Oh, I didn't think about that- you are completely right.
[15:25:37] labsdb copy will probably take a day- taking a break
[16:00:21] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010823 (10Papaul) a:05Papaul>03Marostegui Disk replaced
[16:05:06] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010832 (10jcrespo)
```
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding)
```
[16:30:28] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010893 (10Marostegui) >>! In T184704#4010361, @jcrespo wrote: > Actually, we need to puppetize the event_scheduler on or off. Maybe an eqiad/codfw.yaml enabled...
[16:35:35] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4010919 (10jcrespo) Ok to me, as long as it is puppetized- for when we have to reboot the servers or they crash.
[17:26:57] 10DBA: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4011097 (10Marostegui)
[17:27:01] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4011098 (10Marostegui)
[17:27:06] 10DBA, 10Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4011094 (10Marostegui) 05Open>03Resolved a:03jcrespo In a very rudimentary and ugly way, that has been implemented. Closing this then
[18:21:12] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#4011277 (10Tpt)
[19:28:48] jynus: marostegui: Hey, the Wikidata team decided to pick this up: https://phabricator.wikimedia.org/T184485. What it means DB-wise is that the logging table on Wikidata (and most wikis, like Commons) will be cut to either (1) half or (2) one percent. That will free up lots of storage. ETA for this happening is next month. Any considerations? Will definitely let you know so you can optimize the tables
[19:29:10] The table itself is 600M rows with an average of 180 bytes per row (not counting indexes)
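For scale, 600M rows at ~180 bytes each is on the order of 100 GB of row data before indexes. The actual trimming approach belongs to the Wikidata team (T184485); purely as a hedged sketch, a cleanup of a table this size is usually done in small, replication-friendly batches rather than one huge DELETE. The log_type value, batch size and sleep below are assumptions, not the real plan:

```
#!/bin/bash
# Sketch only: batched delete pattern for trimming a large logging table.
# Column names follow the MediaWiki logging schema, but the log_type being
# trimmed here is hypothetical.
db="wikidatawiki"
batch=10000

while true; do
    deleted=$(mysql -N "${db}" -e "
        DELETE FROM logging
        WHERE log_type = 'some_noisy_type'
        ORDER BY log_id
        LIMIT ${batch};
        SELECT ROW_COUNT();")
    [ "${deleted}" -eq 0 ] && break
    sleep 2   # small pauses between batches keep replication lag under control
done
```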