[05:58:20] DBA, Operations, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4252661 (Marostegui) db2095:s4 has been finally moved under db2073 as db2051 (codfw master) already caught up with eqiad and...
[05:58:44] DBA, Operations, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4252662 (Marostegui)
[07:16:55] DBA, Multi-Content-Revisions, Patch-For-Review, Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4252739 (Marostegui) This schema change broke replication on the sanitarium hosts because of the triggers. ```...
[07:48:14] db2059 complains about its tls certificate to be revoked
[07:48:32] is it pending to be reimaged/was reimaged?
[07:48:55] It is pending to be reimaged after failing on the non format srv and all that
[07:49:01] I was planning to take a look later
[07:49:20] I couldn't find why it was failing on the installation on friday, and left it for this week
[07:49:46] then the reimage removed and disabled its puppet cert
[07:50:01] I will take a look in a sec
[07:52:16] I would like to find out why it wasn't reimaged correctly
[08:34:43] DBA, Operations, ops-codfw: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (jcrespo)
[08:35:12] DBA, Operations, ops-codfw: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (Marostegui) After hard resetting it and checking how it boots: ``` Enumerating Boot options... Enumerating Boot options... Done UEFI0067: A PCIe link training failure is observed in Embedded Network Device...
[08:36:54] DBA, Operations, ops-codfw: pc2005 down - https://phabricator.wikimedia.org/T196339#4252928 (MoritzMuehlenhoff) a:Papaul
[08:38:08] I will depool the server
[08:38:30] thanks
[08:39:02] DBA, Operations, ops-codfw: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (Marostegui) p:Triage>High
[08:45:00] and that is why I want 4 and not only 3 servers
[08:45:05] :)
[09:00:26] DBA, Operations, ops-codfw, Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4252966 (jcrespo)
[09:01:57] DBA, Operations, ops-codfw, Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (jcrespo) @papaul Please try any trivial thing you want, but these hosts are leased, and provider should take care of any hw issues.
[09:14:33] jynus: ^ that's an assumption that doesn't necessarily hold :)
[09:15:00] not sure what you mean?
[09:15:47] you're saying that for leased hosts the provider should take care of issues
[09:16:02] that's an assumption, but it's not the case actually
[09:16:17] well, they were bought with pro support, as usual
[09:16:21] our leasing is purely a financial construct, it doesn't change warranties/service in any way
[09:16:35] so it's the same as with purchased servers really
[09:16:38] my point is
[09:16:45] once the lease finishes
[09:17:22] if we were to continue, we would pay for no support?
[09:18:21] depends, I would need to look at the contract what options exist beyond the term
[09:18:30] but there's no intention to do that if we can avoid it
[09:18:54] i know we have the option to buy the servers beyond the lease term, so we would then become the owners of them (and still have no support)
[09:19:23] yes, I am just justifying my comment on that, I would expect (maybe wrongly) to pay for continuous support on that situation
[09:19:52] or as you said, owning them
[09:20:46] anyway, I can change the comment from lease to 3-year support period
[09:20:58] which is what I meant
[09:21:20] yeah, it has nothing to do with leasing really
[12:40:32] DBA, User-Jayprakash12345: Prepare and check storage layer for bn.wikivoyage - https://phabricator.wikimedia.org/T196358#4253485 (Jayprakash12345)
[12:41:43] DBA, Cloud-Services: Prepare and check storage layer for bn.wikivoyage - https://phabricator.wikimedia.org/T196358#4253501 (Jayprakash12345) Open>stalled
[12:43:01] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for pswikivoyage - https://phabricator.wikimedia.org/T196359#4253505 (Urbanecm) Open>stalled
[12:44:41] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for sahwikiquote - https://phabricator.wikimedia.org/T196362#4253548 (Urbanecm) p:Triage>Low
[12:45:06] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for sahwikiquote - https://phabricator.wikimedia.org/T196362#4253548 (Urbanecm) p:Low>Triage
[12:49:06] DBA, Cloud-Services: Prepare and check storage layer for bn.wikivoyage - https://phabricator.wikimedia.org/T196358#4253599 (Jayprakash12345)
[12:51:31] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for sahwikiquote - https://phabricator.wikimedia.org/T196362#4253635 (Urbanecm) Open>stalled Wiki will be created later.
[12:54:52] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for pswikivoyage - https://phabricator.wikimedia.org/T196359#4253651 (Marostegui) Let us know when this is created so we can sanitize it.
[12:55:06] DBA, Cloud-Services: Prepare and check storage layer for bn.wikivoyage - https://phabricator.wikimedia.org/T196358#4253654 (Marostegui) Let us know when this is created so we can sanitize it.
[12:55:27] Didn't realise there were so many wikis to create again...
[12:55:30] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for sahwikiquote - https://phabricator.wikimedia.org/T196362#4253657 (Marostegui) Let us know when this is created so we can sanitize it.
[12:55:38] Reedy: Yeah, there are a few already
[12:55:47] Thought there was only 1 or 2
[12:56:02] 5 that I can see on our blocked column
[12:56:23] I can see why I was being poked to find a window then
[13:12:11] DBA: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#4253694 (Marostegui) p:Triage>Normal
[13:16:00] DBA: Implement a script to facilitate sanitarium failovers between DCs - https://phabricator.wikimedia.org/T196367#4253713 (Marostegui) p:Triage>Normal
[13:51:52] jynus: let's move labsdb1010 in around 10 mins or so?
[13:52:24] ok
[13:52:29] \o/
[13:52:44] let me finish the revert of codfw master's config
[13:52:50] and get a coffee
[13:52:56] sure thing
[13:53:01] I will do the pre steps and all that
[14:14:04] DBA, Multi-Content-Revisions, Patch-For-Review, Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4253882 (Anomie) >>! In T192926#4252739, @Marostegui wrote: > This would need to be done before any schema change on the sanitarium...
[14:15:55] All the pre steps are done
[14:16:07] coordinates left at: /home/marostegui/db1124.txt and db1125.txt on neodymium
[14:16:43] labsdb1010 is up to date, so we should be good to go and reset slave and configure new threads
[14:20:37] ok to me
[14:20:54] let's go then!
[14:22:45] started slave s5, now going to start sanitarium master for s5 and see how it goes
[14:23:09] replication caught up finely
[14:23:21] *cathing up
[14:23:24] catching
[14:24:08] I see new users being created and redacted correctly too
[14:24:18] going to start all slaves now
[14:24:25] ok
[14:24:43] Done
[14:25:02] I see all sections delayed, so that's good, no errors
[14:29:30] I think we are good
[14:30:36] DBA, Operations, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4253943 (Marostegui)
[14:30:59] DBA, Operations, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231531 (Marostegui) labsdb1010 was switched over to the new sanitarium hosts.
[15:05:38] DBA, MediaWiki-User-management, Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4254026 (dbarratt) >>! In T193449#4248663, @jcrespo wrote: > I don't expect having prepared code in advance, but...
[15:13:26] DBA: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376#4254041 (Marostegui) Open>stalled p:Triage>Normal
[15:14:02] DBA, Operations, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231989 (Marostegui)
[15:21:39] DBA, Operations, Performance-Team, Availability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#4254070 (jcrespo)
[15:21:45] DBA, Operations, Performance-Team, Availability (MediaWiki-MultiDC), Patch-For-Review: Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#4254068 (jcrespo) Open>Resolved I am going to consider this resolved- testing was done, it is not enough...
[15:25:30] DBA, Operations, Performance-Team, Availability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#4254103 (jcrespo)
[15:39:48] DBA, Schema-change: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379#4254136 (Anomie)
[15:40:28] DBA, Schema-change: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379#4254147 (Marostegui) p:Triage>Normal
[15:59:59] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254230 (Cmjohnson) @Marostegui I need to access the servers smart storage administrator which requires me to boot into during the post. When would be a good time for me to take the server down for 15...
[16:00:57] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254231 (Marostegui) @Cmjohnson I can depool the server for you tomorrow. Does that work?
[16:13:06] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254286 (Cmjohnson) @marostegui that will work! Thanks
[17:02:08] DBA, MediaWiki-User-management, Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4254487 (dmaza) >>! In T193449#4254026, @dbarratt wrote: > What is the expected growth of this(these) table(s)?...
[17:18:24] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4234323 (jcrespo) We can put labsdb1009 down, but for the future, we should install the utilities on the appropriate hosts- we shouldn't have to restart a server just to be able to change a disk.
[17:24:22] DBA, MediaWiki-User-management, Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4254584 (jcrespo) What is the rough timeline to production? Any hard deadlines you know as of now (if you have th...
[17:41:51] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254633 (Marostegui) I thought about it but there are no deb packages: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_b256f556f71b41cf99c67fc608&swEnvOid=4004#tab1 We can probably use alie...
[17:45:55] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4234323 (MoritzMuehlenhoff) @Marostegui : hpssaducli is present in the thirdparty/hwraid component for stretch already.
[17:48:58] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254650 (Marostegui) Looks like it was easier than expected and I was able to extract the binary after converting the rpm to deb. I have run: ``` root@labsdb1009:/home/marostegui# ./hpssaducli -ssd -f...
[17:50:09] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254653 (MoritzMuehlenhoff) >>! In T195690#4254646, @MoritzMuehlenhoff wrote: > @Marostegui : hpssaducli is present in the thirdparty/hwraid component for stretch already. And the component is enabled...
[17:51:18] DBA, Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254655 (Marostegui) >>! In T195690#4254653, @MoritzMuehlenhoff wrote: >>>! In T195690#4254646, @MoritzMuehlenhoff wrote: >> @Marostegui : hpssaducli is present in the thirdparty/hwraid component for s...
[22:31:54] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870#4255530 (Krinkle)