[05:17:07] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4212803 (10Marostegui) This time it worked ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port 1I:box 1:bay 2,...
[05:18:06] 10DBA, 10Operations, 10ops-eqiad: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4212805 (10Marostegui) Looks like it was a one time thing: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : OK ``` I am going to...
[05:26:43] 10DBA, 10Operations, 10ops-eqiad: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4212814 (10Marostegui) After reboot: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 48 C Temperature : OK ```
[05:27:02] 10DBA: Failover s2 primary master - https://phabricator.wikimedia.org/T194870#4212817 (10Marostegui)
[05:27:08] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Move db1066 to row A - https://phabricator.wikimedia.org/T193847#4212815 (10Marostegui) 05Open>03Resolved Server repooled Thanks Chris for getting this done!
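The `megacli -AdpBbuCmd -a0 | grep Temper` check pasted above lends itself to scripting; a minimal sketch (assuming the output format shown in the paste, and a hypothetical 60 °C threshold that is not from the source) to extract the BBU temperature:

```python
import re

# Hypothetical alerting threshold -- the real limit depends on the controller model.
TEMP_LIMIT_C = 60

def bbu_temperature(megacli_output: str) -> int:
    """Extract the BBU temperature (Celsius) from `megacli -AdpBbuCmd` output."""
    match = re.search(r"Temperature:\s*(\d+)\s*C", megacli_output)
    if match is None:
        raise ValueError("no temperature line found")
    return int(match.group(1))

# Sample taken from the db1067 paste above.
sample = "Temperature: 47 C\nTemperature : OK"
temp = bbu_temperature(sample)
print(temp, "OK" if temp < TEMP_LIMIT_C else "TOO HOT")  # prints: 47 OK
```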
[05:28:39] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4212818 (10Marostegui)
[05:28:57] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4212819 (10Marostegui)
[05:28:59] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4212820 (10Marostegui)
[05:30:39] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4148947 (10Marostegui) s3 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore10...
[05:30:56] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4145124 (10Marostegui) s3 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1002 [] db1...
[05:30:59] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4212823 (10Marostegui) s3 eqiad progress [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1002 [] db1095 []...
[05:31:21] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4212824 (10Marostegui)
[05:31:23] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4212825 (10Marostegui)
[05:31:49] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4212826 (10Marostegui)
[05:42:08] 10DBA, 10Patch-For-Review: Decommission db1053 - https://phabricator.wikimedia.org/T194634#4212837 (10Marostegui) Let's make sure we label this disk, somehow, as broken when we decommission this host - so it is not reused in the future to replace other disks: ``` Enclosure Device ID: 32 Slot Number: 10 ```
[05:43:24] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4212838 (10Marostegui) 05Open>03Resolved
[08:11:59] db1113:3316 is having spikes of lag
[08:13:12] from 0 to 2 minutes
[08:13:34] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1113&var-port=13316&from=1526458406888&to=1526631206888
[08:15:02] is that real? or could it be graphing issues?
[08:15:10] it is real
[08:15:16] mediawiki complains, too
[08:16:05] the other rc slave doesn't have that issue?
[08:16:23] I am checking now, but didn't see any other host on the logs
[08:16:38] I would depool and observe
[08:16:49] doing it now
[08:16:58] sounds good yeah
[08:17:06] to me that points to a hw or other instance-specific issue
[08:18:25] it is not an rc
[08:18:27] it is dump
[08:18:32] apparently
[08:18:42] but it is load 0
[08:18:46] that would explain the issue
[08:18:55] ah!
[08:19:10] I assumed it was rc because I forgot we now also have dumps with multi-instance
[08:22:24] https://gerrit.wikimedia.org/r/433697
[13:50:55] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4213769 (10Marostegui) We are all set for doing the copies to the new hardware once it arrives. eqiad: db1116: s1, s3, s5, s8...
[14:13:31] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4213808 (10jcrespo) One thing we could fix at the same time was the configuration of the triggers to write to the binlog- I am...
[14:14:04] ^ I don't get that comment
[14:29:05] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4213884 (10Marostegui) p:05Triage>03Normal a:03Cmjohnson Already talked to @Cmjohnson - he will replace it today. I manually failed it.
[14:29:55] thanks
[14:30:09] jynus: I didn't get your last comment about the triggers and the binlogs
[14:39:48] let me find the issue
[14:39:58] :)
[14:44:16] https://mariadb.com/kb/en/library/running-triggers-on-the-slave-for-row-based-events/
[14:45:20] Ah right
[14:45:39] You want to log them, no?
[14:47:48] or you want to get the labsdb servers to run the triggers too?
[14:50:59] https://phabricator.wikimedia.org/T190704#4213808
[14:51:12] I want to test which is the right option
[14:51:31] because we saw logging them with yes
[14:52:02] so basically, see the effects of the 2 configurations and "prove" we are doing it right (or correct it if there is an issue)
[14:52:15] I have updated with the documentation links
[14:52:21] yeah, just saw it
[14:53:00] As per the documentation, it looks like YES is the option we want
[14:56:36] shouldn't we want LOGGING?
[14:56:43] I am not sure
[14:56:52] I would like to test both, and see the differences
[14:56:57] sure, feel free :)
[14:57:04] I am reading this https://mariadb.org/mariadb-10-1-1-triggers-for-rbr/
[14:57:18] just wanted to add the task as unofficially part of the goal
[14:57:18] but it doesn't give any new info
[14:57:19] on logging
[14:57:35] to make sure we are not doing anything wrong
[14:57:52] (and there is not much to do about that until the hardware arrives)
[14:58:05] the eqiad hw was supposed to arrive today
[14:58:08] * marostegui crosses his fingers
[14:58:38] I will leave the partitioning on db1105 over the weekend
[14:59:14] coool
[14:59:14] did you try to stop and restart the servers you just set up?
[14:59:15] yeah
[14:59:16] a few times
[14:59:18] the 4 of them
[14:59:19] XD
[14:59:22] to try to replicate the errors
[14:59:32] so do you have a better guess of the reason?
[15:00:13] I guess the gtid table
[15:00:26] But it was truncated (I am 99% sure I truncated it before)
[15:00:57] I think there is weirdness with gtid
[15:01:05] populating that table
[15:01:14] from binlogs, from master, etc.
[15:01:56] I think logging isn't what we want
[15:02:00] But not sure of course
[15:02:19] that is why I want to test
[15:02:43] If we invoke triggers on sanitarium, then we are already redacting the data; if we include their effects on the binlogs, what will arrive to the labs hosts?
[15:02:44] see the effect and then propose an alternative, if needed
[15:02:56] i don't know
[15:02:59] not sure if 2 events
[15:03:00] We will be sending via replication the before and the after
[15:03:01] or 1
[15:03:09] I don't know
[15:03:11] whereas now, we only send the after, which is already sanitized
[15:03:27] technically, row sends the before and the after
[15:03:52] that's true
[15:04:19] I want to audit everything, so we actually know
[15:04:40] even if not part of what we do now
[15:04:45] yeah
[15:05:08] You can take one of the codfw instances if you like
[15:06:38] don't want to touch real ones
[15:06:52] will use a test deployment
[15:06:59] but probably not today
[15:07:11] I have broken enough things this week
[15:07:26] will stop working today soon
[15:07:33] It's the first week you've broken something; I am still quite a lot of breakages ahead of you!
[15:44:43] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4214123 (10Marostegui) Still looking good after 10 hours: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : O...
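For reference, the MariaDB option being debated above is `slave_run_triggers_for_rbr` (the KB page linked at 14:44:16). A my.cnf sketch of the two candidate values — illustrative only, not the deployed sanitarium configuration:

```ini
# Sketch of a sanitarium my.cnf fragment -- not the deployed config.

# YES: triggers fire on the replica for row-based replication events,
# but the changes made by the triggers are NOT written to the replica's
# binary log.
slave_run_triggers_for_rbr = YES

# LOGGING: triggers fire AND their changes are written to the binary
# log, so they propagate to downstream replicas (here, the labsdb hosts).
#slave_run_triggers_for_rbr = LOGGING
```

The variable is dynamic, so either value can be tried at runtime with `SET GLOBAL slave_run_triggers_for_rbr = ...` before committing it to the config, which fits the "test both and see the differences" plan above.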
[15:45:45] 10DBA: Failover s2 primary master - https://phabricator.wikimedia.org/T194870#4214130 (10Marostegui)
[15:49:32] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4214145 (10Marostegui)
[15:49:54] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui)
[15:49:58] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194780#4208342 (10Marostegui)
[15:50:02] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui)
[15:50:05] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4208357 (10Marostegui)
[15:52:11] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4214156 (10Marostegui) For the record, after the reboot: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 48 C Temperature ```
[17:14:17] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4214576 (10Marostegui) Thanks Chris ``` root@db1066:~# megacli -PDRbld -ShowProg -PhysDrv [32:6] -aALL Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 8% in 9 Minutes.
```
[18:09:22] 10DBA: Failover s2 primary master - https://phabricator.wikimedia.org/T194870#4215205 (10jcrespo)
[18:56:43] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4215416 (10Marostegui) This is all good now ``` root@db1066:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Prim...
[18:57:07] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4215417 (10Marostegui) 05Open>03Resolved
[19:55:11] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008#4215535 (10Urbanecm)
[19:55:26] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008#4215547 (10Urbanecm)
[20:15:07] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008#4215719 (10Urbanecm) a:05Urbanecm>03None I'm not supposed to do this...
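As a side note, the `megacli -PDRbld -ShowProg` line pasted at 17:14:17 ("Completed 8% in 9 Minutes.") allows a rough linear ETA. A small sketch, assuming the progress-line format shown above and that the rebuild rate stays roughly constant (which it rarely does exactly):

```python
import re

def rebuild_eta_minutes(megacli_line: str) -> float:
    """Rough linear estimate of remaining rebuild time, in minutes,
    from a `megacli -PDRbld -ShowProg` progress line."""
    m = re.search(r"Completed (\d+)% in (\d+) Minutes", megacli_line)
    if m is None:
        raise ValueError("unrecognised progress line")
    pct, minutes = int(m.group(1)), int(m.group(2))
    if pct == 0:
        raise ValueError("no measurable progress yet")
    return minutes * (100 - pct) / pct

# Line from the db1066 paste above: 8% done in 9 minutes
# -> about 103.5 minutes remaining at that rate.
line = "Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 8% in 9 Minutes."
print(round(rebuild_eta_minutes(line), 1))  # prints: 103.5
```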