[05:10:59] DBA, Operations, ops-eqiad: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (Marostegui) a:Cmjohnson @Cmjohnson can we get this disk replaced? This host is old and will be replaced "soon", but this is m1 primary master, so better to have it replaced. We are in process of switc...
[05:13:00] DBA, Operations, ops-eqiad: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (Marostegui) p:Triage→Normal
[05:29:02] DBA: Replace db2044 with db2063 - https://phabricator.wikimedia.org/T230459 (Marostegui) Open→Declined This host is still failing with the idrac not being able to work. I think I will just decommission this one and pick another one, no need to waste more time with it.
[05:29:04] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[05:34:31] DBA, Operations: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui)
[05:35:36] DBA, Operations: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui) p:Triage→Normal
[05:38:21] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (MoritzMuehlenhoff) db2044 now has a second disk in predictive failure: ` # hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I...
[05:39:32] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) >>! In T208323#5420746, @MoritzMuehlenhoff wrote: > db2044 now has a second disk in predictive failure: > > ` > > # hpssacli controller all show config > > Smart Array P420i in Slot...
[05:40:32] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (Marostegui)
[05:40:46] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (Marostegui)
[05:40:48] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:47:41] DBA, DC-Ops, Operations, ops-eqiad: hw troubleshooting: power supply for db1129 - https://phabricator.wikimedia.org/T230458 (Marostegui) The alert cleared up - thanks!
[06:12:09] FYI, I'm upgrading PHP on tendril/dbmonitor, will be unavailable for a few seconds
[07:11:31] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2067.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201908190711_marostegui_14402...
[07:11:38] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (Marostegui) p:Triage→Normal
[07:12:37] DBA, Operations: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui)
[07:19:00] DBA, Operations: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:25:52] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2067.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2067.codfw.wmnet'] `
[07:28:38] DBA, Operations, ops-eqiad: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (Marostegui) >>! In T229452#5416446, @Cmjohnson wrote: > @Marostegui I see a potential issue with B3 as well. I will need to do a DIMM swap A -> B side and see if the e...
[07:44:50] DBA, Data-Services: Create replica of napwikisource on labs - https://phabricator.wikimedia.org/T230485 (Marostegui) Open→Invalid As Jaime pointed out, this will be handled at {T210762}, I am going to close this as duplicate. I am already working on T210762
[07:55:04] DBA, Data-Services, Operations: Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (Marostegui) Once this wiki is created, please let us know so we can sanitize it on labs and sanitarium before creating the views on the wikireplicas.
[08:07:35] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2067.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201908190807_marostegui_15464...
[08:23:11] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2067.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2067.codfw.wmnet'] `
[08:35:56] DBA, Cloud-Services, cloud-services-team, User-Banyek: Prepare and check storage layer for nap.wikisource - https://phabricator.wikimedia.org/T210762 (Marostegui) I have sanitized this on db1124 and db2094. I have checked that the new users are being sanitized correctly on the sanitarium and labs...
[08:39:43] DBA: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (Marostegui) @jcrespo - I have been thinking about this ticket lately. Given that switchover.py works so well already, do you think it would be doable to do a --emergency-slave-sw...
[08:48:32] DBA: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (jcrespo) Sadly switchover.py wouldn't be reusable or helpful (the replication and other libraries may be) for an emergency- it has to start from 0. Switchover.py assumes all host...
[08:58:55] DBA: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (Marostegui) Ah, I see!. Yeah, I was thinking about a very primitive way to do it (for now), which would require human intervention to decide which is the most suitable host to be...
[09:03:49] DBA: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (jcrespo) >>! In T196366#5420943, @Marostegui wrote: > Ah, I see!. > Yeah, I was thinking about a very primitive way to do it (for now), which would require human intervention to...
[09:07:23] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2067.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201908190907_marostegui_16701...
[09:07:29] DBA: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (jcrespo) Maybe gtid will become usable at 10.4 ? https://jira.mariadb.org/browse/MDEV-12012?focusedCommentId=132462#comment-132462
[09:08:35] DBA: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (Marostegui) >>! In T196366#5420946, @jcrespo wrote: >>>! In T196366#5420943, @Marostegui wrote: > > and a way to detect replicas from a master down (tendril replacement "zarcill...
[09:37:20] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2067.codfw.wmnet'] ` and were **ALL** successful.
[09:37:36] volans: ^ \o/
[09:38:32] :)
[11:00:42] DBA: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 (Marostegui) db2067 is now replicating from db2044. I am going to give it a few hours before promoting it to m2 codfw master.
[12:34:11] DBA, Data-Services, cloud-services-team: Prepare and check storage layer for nap.wikisource - https://phabricator.wikimedia.org/T210762 (Marostegui) a:Marostegui→None All the check data came up clean after the sanitization. This is ready for #cloud-services-team to create the views on the lab...
[12:39:04] DBA, MediaWiki-File-management, Performance-Team (Radar): Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195 (Marostegui) Confirmed with the wiki that was created past Wed 14th Aug 2019 (T210762#5413963) - this table is still created when a new wiki is created: ` root@db1075....
[12:41:48] DBA, Operations: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (Marostegui)
[12:42:38] DBA, Operations: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (Marostegui) p:Triage→Normal
[12:42:53] DBA, Operations: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[13:01:02] DBA, Operations: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (Marostegui)
[13:48:03] DBA, Wikimedia-General-or-Unknown, Wikimedia-database-error: Some rows (from the year 2004) in SQL databases have text in latin1 encoding - https://phabricator.wikimedia.org/T108434 (Scott)
[14:03:54] DBA, Operations, observability: Generate instance list of active database hosts to be monitored from prometheus - https://phabricator.wikimedia.org/T145072 (fgiunchedi) Open→Resolved a:fgiunchedi This happened in parent task!
[14:03:58] DBA, Operations, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (fgiunchedi)
[15:22:39] DBA, CirrusSearch, Discovery-Search, MediaWiki-Categories: Special:RandomInCategory does not return all pages with equal probability - https://phabricator.wikimedia.org/T200703 (Marostegui) It would be helpful if you guys can come with some specific example queries we could run on production, che...