[05:17:07] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4212803 (10Marostegui) This time it worked ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port 1I:box 1:bay 2,...
[05:18:06] 10DBA, 10Operations, 10ops-eqiad: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4212805 (10Marostegui) Looks like it was a one time thing: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : OK ``` I am going to...
[05:26:43] 10DBA, 10Operations, 10ops-eqiad: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4212814 (10Marostegui) After reboot: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 48 C Temperature : OK ```
[05:27:02] 10DBA: Failover s2 primary master - https://phabricator.wikimedia.org/T194870#4212817 (10Marostegui)
[05:27:08] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Move db1066 to row A - https://phabricator.wikimedia.org/T193847#4212815 (10Marostegui) 05Open>03Resolved Server repooled Thanks Chris for getting this done!
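The `megacli -AdpBbuCmd -a0 | grep Temper` check pasted above lends itself to scripting; a minimal sketch (assuming the output format shown in the paste, and a hypothetical 60 °C threshold that is not from the source) to extract the BBU temperature:

```python
import re

# Hypothetical alerting threshold -- the real limit depends on the controller model.
TEMP_LIMIT_C = 60

def bbu_temperature(megacli_output: str) -> int:
    """Extract the BBU temperature (Celsius) from `megacli -AdpBbuCmd` output."""
    match = re.search(r"Temperature:\s*(\d+)\s*C", megacli_output)
    if match is None:
        raise ValueError("no temperature line found")
    return int(match.group(1))

# Sample taken from the db1067 paste above.
sample = "Temperature: 47 C\nTemperature : OK"
temp = bbu_temperature(sample)
print(temp, "OK" if temp < TEMP_LIMIT_C else "TOO HOT")  # prints: 47 OK
```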
[05:28:39] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4212818 (10Marostegui)
[05:28:57] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4212819 (10Marostegui)
[05:28:59] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4212820 (10Marostegui)
[05:30:39] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4148947 (10Marostegui) s3 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore10...
[05:30:56] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4145124 (10Marostegui) s3 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1002 [] db1...
[05:30:59] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4212823 (10Marostegui) s3 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1002 [] db1095 []...
[05:31:21] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4212824 (10Marostegui)
[05:31:23] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4212825 (10Marostegui)
[05:31:49] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4212826 (10Marostegui)
[05:42:08] 10DBA, 10Patch-For-Review: Decommission db1053 - https://phabricator.wikimedia.org/T194634#4212837 (10Marostegui) Let's make sure we label this disk, somehow, as broken when we decommission this host - so it is not reused in the future to replace other disks: ``` Enclosure Device ID: 32 Slot Number: 10 ```
[05:43:24] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4212838 (10Marostegui) 05Open>03Resolved
[08:11:59] db1113:3316 is having spikes of lag
[08:13:12] from 0 to 2 minutes
[08:13:34] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1113&var-port=13316&from=1526458406888&to=1526631206888
[08:15:02] is that real? or could it be graphing issues?
[08:15:10] it is real
[08:15:16] mediawiki complains, too
[08:16:05] the other rc slave doesn't have that issue?
[08:16:23] I am checking now, but didn't see any other host on the logs
[08:16:38] I would depool and observe
[08:16:49] doing it now
[08:16:58] sounds good yeah
[08:17:06] to me that points to a hw or other instance-specific issue
[08:18:25] it is not an rc
[08:18:27] it is dump
[08:18:32] apparently
[08:18:42] but it is load 0
[08:18:46] that would explain the issue
[08:18:55] ah!
[08:19:10] I assumed it was rc because I forgot we now also have dumps with multi-instance
[08:22:24] https://gerrit.wikimedia.org/r/433697
[13:50:55] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4213769 (10Marostegui) We are all set for doing the copies to the new hardware once it arrives. eqiad: db1116: s1, s3, s5, s8...
[14:13:31] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4213808 (10jcrespo) One thing we could fix at the same time was the configuration of the triggers to write to the binlog- I am...
[14:14:04] ^ I don't get that comment
[14:29:05] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4213884 (10Marostegui) p:05Triage>03Normal a:03Cmjohnson Already talked to @Cmjohnson - he will replace it today. I manually failed it.
[14:29:55] thanks
[14:30:09] jynus: I didn't get your last comment about the triggers and the binlogs
[14:39:48] let me find the issue
[14:39:58] :)
[14:44:16] https://mariadb.com/kb/en/library/running-triggers-on-the-slave-for-row-based-events/
[14:45:20] Ah right
[14:45:39] You want to log them, no?
[14:47:48] or you want to get the labsdb servers to run the triggers too?
[14:50:59] https://phabricator.wikimedia.org/T190704#4213808
[14:51:12] I want to test which is the right option
[14:51:31] because we saw logging them with yes
[14:52:02] so basically, see the effects of the 2 configurations and "prove" we are doing it right (or correct it if there is an issue)
[14:52:15] I have updated with the documentation links
[14:52:21] yeah, just saw it
[14:53:00] As per the documentation, it looks like YES is the option we want
[14:56:36] shouldn't we want LOGGING?
[14:56:43] I am not sure
[14:56:52] I would like to test both, and see the differences
[14:56:57] sure, feel free :)
[14:57:04] I am reading this https://mariadb.org/mariadb-10-1-1-triggers-for-rbr/
[14:57:18] just wanted to add the task as unofficially part of the goal
[14:57:18] but it doesn't give any new info
[14:57:19] on logging
[14:57:35] to make sure we are not doing anything wrong
[14:57:52] (and there is not much to do about that until the hardware arrives)
[14:58:05] the eqiad hw was supposed to arrive today
[14:58:08] * marostegui crosses his fingers
[14:58:38] I will leave the partitioning on db1105 over the weekend
[14:59:14] coool
[14:59:14] did you try to stop and restart the servers you just set up?
[14:59:15] yeah
[14:59:16] a few times
[14:59:18] the 4 of them
[14:59:19] XD
[14:59:22] to try to replicate the errors
[14:59:32] so do you have a better guess of the reason?
[15:00:13] I guess the gtid table
[15:00:26] But it was truncated (I am 99% sure I truncated it before)
[15:00:57] I think there is weirdness with gtid
[15:01:05] populating that table
[15:01:14] from binlogs, from master, etc.
[15:01:56] I think logging isn't what we want
[15:02:00] But not sure of course
[15:02:19] that is why I want to test
[15:02:43] If we invoke triggers on sanitarium, then we are already redacting the data; if we include their effects on the binlogs, what will arrive to the labs hosts?
[15:02:44] see the effect and then propose an alternative, if needed
[15:02:56] i don't know
[15:02:59] not sure if 2 events
[15:03:00] We will be sending via replication the before and the after
[15:03:01] or 1
[15:03:09] I don't know
[15:03:11] whereas now, we only send the after, which is already sanitized
[15:03:27] technically, row sends the before and the after
[15:03:52] that's true
[15:04:19] I want to audit everything, so we actually know
[15:04:40] even if not part of what we do now
[15:04:45] yeah
[15:05:08] You can take one of the codfw instances if you like
[15:06:38] don't want to touch real ones
[15:06:52] will use a test deployment
[15:06:59] but probably not today
[15:07:11] I have broken enough things this week
[15:07:26] will stop working today soon
[15:07:33] It's the first week you've broken something; I am still quite a lot of breakages ahead of you!
[15:44:43] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4214123 (10Marostegui) Still looking good after 10 hours: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : O...
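For reference, the MariaDB option being debated above is `slave_run_triggers_for_rbr` (the KB page linked at 14:44:16). A my.cnf sketch of the two candidate values — illustrative only, not the deployed sanitarium configuration:

```ini
# Sketch of a sanitarium my.cnf fragment -- not the deployed config.

# YES: triggers fire on the replica for row-based replication events,
# but the changes made by the triggers are NOT written to the replica's
# binary log.
slave_run_triggers_for_rbr = YES

# LOGGING: triggers fire AND their changes are written to the binary
# log, so they propagate to downstream replicas (here, the labsdb hosts).
#slave_run_triggers_for_rbr = LOGGING
```

The variable is dynamic, so either value can be tried at runtime with `SET GLOBAL slave_run_triggers_for_rbr = ...` before committing it to the config, which fits the "test both and see the differences" plan above.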
[15:45:45] 10DBA: Failover s2 primary master - https://phabricator.wikimedia.org/T194870#4214130 (10Marostegui)
[15:49:32] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4214145 (10Marostegui)
[15:49:54] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui)
[15:49:58] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194780#4208342 (10Marostegui)
[15:50:02] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui)
[15:50:05] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4208357 (10Marostegui)
[15:52:11] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4214156 (10Marostegui) For the record, after the reboot: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 48 C Temperature ```
[17:14:17] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4214576 (10Marostegui) Thanks Chris ``` root@db1066:~# megacli -PDRbld -ShowProg -PhysDrv [32:6] -aALL Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 8% in 9 Minutes.
```
[18:09:22] 10DBA: Failover s2 primary master - https://phabricator.wikimedia.org/T194870#4215205 (10jcrespo)
[18:56:43] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4215416 (10Marostegui) This is all good now ``` root@db1066:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Prim...
[18:57:07] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4215417 (10Marostegui) 05Open>03Resolved
[19:55:11] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008#4215535 (10Urbanecm)
[19:55:26] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008#4215547 (10Urbanecm)
[20:15:07] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008#4215719 (10Urbanecm) a:05Urbanecm>03None I'm not supposed to do this...
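As a side note, the `megacli -PDRbld -ShowProg` line pasted at 17:14:17 ("Completed 8% in 9 Minutes.") allows a rough linear ETA. A small sketch, assuming the progress-line format shown above and that the rebuild rate stays roughly constant (which it rarely does exactly):

```python
import re

def rebuild_eta_minutes(megacli_line: str) -> float:
    """Rough linear estimate of remaining rebuild time, in minutes,
    from a `megacli -PDRbld -ShowProg` progress line."""
    m = re.search(r"Completed (\d+)% in (\d+) Minutes", megacli_line)
    if m is None:
        raise ValueError("unrecognised progress line")
    pct, minutes = int(m.group(1)), int(m.group(2))
    if pct == 0:
        raise ValueError("no measurable progress yet")
    return minutes * (100 - pct) / pct

# Line from the db1066 paste above: 8% done in 9 minutes
# -> about 103.5 minutes remaining at that rate.
line = "Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 8% in 9 Minutes."
print(round(rebuild_eta_minutes(line), 1))  # prints: 103.5
```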