[05:57:05] DBA, Operations: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) p:Triage→Normal
[06:03:24] DBA, Operations: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) There are no hardware logs: ` /admin1-> racadm getsel Record: 1 Date/Time: 07/12/2019 21:38:11 Source: system Severity: Ok Description: Log cleared. ----------------------------------------------------...
[06:13:45] DBA, Operations: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) The first traces of crash are: ` Nov 23 08:25:35 db2125 mysqld[13682]: InnoDB: Warning: a long semaphore wait: Nov 23 08:25:35 db2125 mysqld[13682]: --Thread 139387736135424 has waited at row0purge.cc line 772 for 24...
[06:18:16] DBA, Operations: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) More logs from the console: ` [10086760.709402] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/u480:0:6] [10086764.636175] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [sshd:107965] [1008676...
[06:20:28] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[06:27:55] DBA, Operations: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) Nothing apart from this on OS logs: ` Nov 23 08:17:09 db2125 systemd[1]: Started Time & Date Service. Nov 23 08:18:01 db2125 CRON[107127]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var...
[06:31:51] DBA, Operations: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) I have extracted the controller logs....nothing showing up there.
[06:34:32] DBA, Operations, ops-codfw: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) a:Papaul @Papaul can we upgrade firmware and BIOS on this host? It is a very new host, and if this crash happens again we might need to contact Dell.
[06:41:07] DBA, Operations, ops-codfw: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui)
[06:41:34] DBA, Operations, ops-codfw: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) This can be related: T238305
[07:26:40] DBA, Patch-For-Review: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (Marostegui) db2134 is now m3 codfw master db2065 will be decommissioned in a few days ` mysql --skip-ssl -hm3-master.codfw.wmnet -e "select @@hostname" +------------+ | @@hostname | +------------+ | db2134 |...
[07:27:43] DBA: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 (Marostegui)
[07:28:00] DBA: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 (Marostegui) Let's wait a few days before starting to decommission this host.
[07:28:29] DBA: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 (Marostegui)
[07:28:31] DBA, Operations: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:28:43] DBA, Operations: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:11:05] DBA, Patch-For-Review: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1067.eqiad.wmnet` - db1067.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Downtimed management...
[08:13:15] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[08:30:27] DBA, Patch-For-Review: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 (Marostegui)
[08:39:18] DBA, Operations, ops-codfw, Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) For what it's worth, this is the kernel this host is running at the moment (I have not upgraded it since the crash): ` root@db2125:~# uname -a Linux db2125 4.9.0-11-amd64 #1 SMP Deb...
[09:51:49] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (Marostegui) Just for the record, I observed this again on db1126. Further, when a host gets depooled, the threads keep connecting there: ` [10:31:58] <+logmsgbot> !log marostegui@cum...
[10:34:42] DBA, WMDE-Analytics-Engineering, Wikidata, Wikidata.org, Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (Ladsgroup) >>! In T68025#5678532, @Marostegui wrote: > I think the size as @Ladsgroup points out could be pretty cool. Not sure if...
[10:40:29] DBA, WMDE-Analytics-Engineering, Wikidata, Wikidata-Campsite, and 2 others: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (Ladsgroup)
[11:43:02] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (ArielGlenn) Snapshot1006 was running regular wikidata dumps. We don't flush LB config after every query for obvious reasons, though page content fetchers should fail and restart with a...
[11:46:17] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (Marostegui) Thanks for the explanation. However, db1126 isn't configured as vslow,dumps for s8 (wikidata) - which is the whole point of this ticket and the reason I was reporting it so...
[11:57:14] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (ArielGlenn) Sure thing! I'm just not sure of the way forward right now. Just to clarify, what is/was it pooled as, and when did it change from vslow?
[14:14:27] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (Marostegui) >>! In T143870#5689066, @ArielGlenn wrote: > Sure thing! I'm just not sure of the way forward right now. > > Just to clarify, what is/was it pooled as, and when did it ch...
[14:39:56] DBA: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 (Marostegui)
[16:02:31] DBA, CPT Initiatives (MCR Schema Migration): Drop pre-MCR schema fields in Wikimedia production - https://phabricator.wikimedia.org/T238966 (WDoranWMF) Following on from email I sent as a heads up for this work, I'm tagging DBA so it's on their radar.
[16:09:00] DBA, CPT Initiatives (MCR Schema Migration): Drop pre-MCR schema fields in Wikimedia production - https://phabricator.wikimedia.org/T238966 (Marostegui) >>! In T238966#5690005, @WDoranWMF wrote: > Following on from email I sent as a heads up for this work, I'm tagging DBA so it's on their radar. Thanks...
[16:16:55] DBA, CPT Initiatives (MCR Schema Migration): Drop pre-MCR schema fields in Wikimedia production - https://phabricator.wikimedia.org/T238966 (Anomie) Duplicate of {T184615}?
[17:20:02] DBA, Operations, ops-codfw, Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Papaul) a:Papaul→Marostegui complete Before BIOS Version 2.2.11 iDRAC Firmware Version 3.34.34.34 After BIOS Version 2.4.7 iDRAC Firmware Version 3.36.36.36
[17:21:27] DBA, Operations, ops-codfw, Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) Thank you Papaul. I will start MySQL and do a data consistency check.
[17:44:46] DBA, Operations, ops-codfw, Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) Kernel upgraded and host rebooted: ` root@db2125:~# uname -a Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux `
[21:47:41] DBA, cloud-services-team (Kanban): new nova database on m5 - https://phabricator.wikimedia.org/T239170 (Andrew)
[21:47:58] DBA, cloud-services-team (Kanban): nova: set up cell and host mappings - https://phabricator.wikimedia.org/T239160 (Andrew)
[21:49:15] DBA, cloud-services-team (Kanban): new nova database on m5 - https://phabricator.wikimedia.org/T239170 (Andrew)
[21:49:17] DBA, cloud-services-team (Kanban): nova: set up cell and host mappings - https://phabricator.wikimedia.org/T239160 (Andrew)
[21:49:41] DBA, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Andrew)