[02:08:13] DBA, Wiki-Loves-Monuments-Database: mysqldump is timing out preventing all tables from being included in the dump - https://phabricator.wikimedia.org/T138517#2801443 (Platonides) >>! In T138517#2796670, @Lokal_Profil wrote: > `ERROR 1214 (HY000) at line 15: The used table type doesn't support FULLTEXT in...
[07:11:56] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801687 (Marostegui) This is what I have seen - There is a big spike on disk writes just before the server died - The ILO logs after the reset show: ``` description=POST Error: 1792-Slot X Dr...
[07:43:52] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801712 (Marostegui) db2050 is located right on top of db2049 and its logs do not reveal any warning or any trace of overheat
[08:29:50] DBA, Operations, ops-eqiad: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2778014 (Marostegui) @Cmjohnson did HP come back to you about this issue? Thanks!
[08:52:59] what is your opinion about db2049?
[08:53:26] I think we should try to restart it manually and if it goes fine, start replication
[08:53:30] should we try to overload it to see if the problem repeats?
[08:54:14] did you ever see an overheat problem in codfw or is this the first one?
[08:54:20] obviously if it is a one-time hw issue we cannot do anything about it
[08:54:43] I think the last one was db1073, I think
[08:54:52] that is why I checked the activity graphs to see if it was under heavy load when it crashed
[08:55:03] I only saw the spike in disk activity
[08:55:11] but you say after
[08:55:17] which would be normal
[08:55:49] yes, that is why I am saying that maybe overloading it will not reproduce the crash, as it was not under heavy load
[08:56:02] But we can try some cpuburn for some hours
[08:56:05] to see what happens
[08:56:16] I do not know, I was asking your opinion
[08:56:37] the idea is
[08:56:46] if it happens again, we want to have more info
[08:56:52] yeah, totally
[08:57:00] I do not want another db2034 :(
[08:57:01] so what can we do about it?
[08:57:13] about having more info, not about the crash
[08:58:02] We can try the cpuburn approach for some hours, to see if it crashes, and if not, we can maybe discard "heavy load" as a cause
[09:03:38] whatever we decide, I would reboot it manually before
[09:08:07] yes
[09:13:07] marostegui: I would check also that remote IPMI is working fine too ;)
[09:13:22] what do you mean?
[09:14:23] for example that you're able to do a chassis power status with ipmitool from neodymium
[09:14:30] after the reboot
[09:15:14] ah right
[09:15:15] sure
[09:25:06] I am going to reboot it now
[09:25:09] jynus ^
[09:25:33] ok
[09:27:26] downtime on icinga ;)
[09:27:53] shit
[09:28:23] i thought it was downtime from yesterday
[09:28:29] let's see if I have been fast enough
[09:29:14] you should log it anyway
[09:29:41] that is more important IMHO
[09:29:58] helps identify if something crashed or was purposely down
[09:30:11] true, I did it retroactively
[09:30:13] for the record
[09:30:15] thanks
[09:37:42] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801859 (Marostegui) After the reboot the Cache message is gone.
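The remote IPMI check suggested at 09:14:23 (a chassis power status with ipmitool after the reboot) could be scripted roughly as below. This is a sketch only: the management hostname and user are hypothetical placeholders, the `lanplus` interface is an assumption about the setup, and the parser assumes ipmitool's usual "Chassis Power is on" / "Chassis Power is off" reply.

```python
import subprocess


def build_power_status_cmd(mgmt_host, user):
    """Build an ipmitool command line for a remote chassis power query.

    -E makes ipmitool read the password from the IPMI_PASSWORD
    environment variable, so it does not show up in the process list.
    mgmt_host and user are placeholders supplied by the caller.
    """
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user,
            "-E", "chassis", "power", "status"]


def parse_power_status(output):
    """Return True/False for a 'Chassis Power is on/off' reply, else None."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return True
    if text.endswith("is off"):
        return False
    return None


def remote_power_is_on(mgmt_host, user):
    """Run the query; raises CalledProcessError if ipmitool itself fails."""
    result = subprocess.run(build_power_status_cmd(mgmt_host, user),
                            capture_output=True, text=True, check=True)
    return parse_power_status(result.stdout)
```

A `None` result means the reply was not understood, which is itself worth investigating before relying on remote management for the next crash.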
[09:53:25] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801893 (Marostegui) I am running a burn test - I have started burning 3 CPUs and will leave it for a little while before starting with 3 more.
[09:56:15] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801894 (jcrespo) a:Marostegui Assigning it to you to credit you are working more on this.
[10:05:30] 2 wikibugs bots? ;)
[12:14:29] DBA: Set barracuda InnoDB file format as the deafault configuration everywhere - https://phabricator.wikimedia.org/T150949#2802293 (jcrespo)
[12:14:41] DBA: Set barracuda InnoDB file format as the default configuration everywhere - https://phabricator.wikimedia.org/T150949#2802293 (jcrespo)
[12:20:43] DBA, Wikimedia-Site-requests, Patch-For-Review: Recreate a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#2802309 (jcrespo)
[12:27:05] DBA, Labs, Labs-Infrastructure: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#2802317 (jcrespo)
[15:03:30] they fixed a problem I reported on 8.0.0 which interest us: https://bugs.mysql.com/?id=83706
[15:04:17] ah nice
[15:04:20] that was fast
[15:11:12] could it be that dbstore2X have each 24GB of buffer pool?
[15:13:08] maybe we should create a dbstore2 role, like sanitarium2
[15:13:14] that is what the config says indeed 24GB
[15:13:43] for innodb-only dbstore
[15:15:49] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802648 (jcrespo) p:Triage>Normal
[16:00:41] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802731 (Marostegui) I am burning 12 CPUs now. For the night I am planning to leave 24 of them and see what happens tomorrow morning.
[16:04:48] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 - https://phabricator.wikimedia.org/T150960#2802738 (Marostegui)
[16:17:30] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2802766 (Papaul) The ISO file is about 6.5 GB and has all firmware update for the Proliant G6 G7 and G8 . I had to burn it on a DVD and boot the server from it. it took approximately 15 minutes t...
[16:20:50] DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802774 (Cmjohnson) @jcrespo the disk has been swapped please let me know if you need anything else.
[16:27:39] DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802785 (jcrespo) Yes, I mentioned changing the thermal paste (unless you see some other reason to create thermal issues, such as a malfunctioning fan) and wiping the logical disk volume (not ph...
[16:34:48] DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802801 (Cmjohnson) @jcrespo sure, can I power down anytime?
[16:35:15] DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802802 (jcrespo) Yes.
[16:47:06] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#2802859 (Cmjohnson) p:Normal>Low
[16:53:19] DBA, Phabricator, Release-Engineering-Team: pbraicator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802868 (jcrespo)
[16:53:56] DBA, Operations, ops-eqiad: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2802881 (Cmjohnson) HP Support Case Opened.
Case ID: 5315048494 Case title: Failed Power Supply Severity 3-Normal
[16:54:02] DBA, Phabricator, Release-Engineering-Team: phabiicator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802882 (jcrespo)
[16:56:58] jynus: I am not finding any pattern (on queries) for 1001 and 1003 crashes
[16:57:06] (so far)
[17:01:04] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802900 (Aklapper)
[17:03:47] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802917 (mmodell) p:Triage>High
[17:09:37] marostegui, I would put a watchdog
[17:09:57] on memory usage
[17:10:06] per user
[17:10:38] that is, information_schema.user_statistics
[17:11:49] put it to log every minute, then we can see where it is taking so much memory
[17:12:12] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802937 (mmodell) This seems to be related to fulltext search, I'm still investigating further.
[17:14:11] jynus: You mean querying that for that specific user and logging it?
[17:14:32] which user, I did not mention any user
[17:14:44] just the table showing all activity
[17:14:50] sure
[17:15:36] selecting that and keeping it as a record
[17:15:59] until we debug where the OOMs come from
[17:17:38] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802956 (jcrespo) https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1043&from=now-24h&to=now This graph is intere...
[17:18:41] jynus: I will leave it in both labsdb1001 and 1003
[17:18:50] thank you
[17:19:24] the problem with that table, and P_S is that it gets deleted on restart
[17:19:32] I am logging it to a file
[17:19:49] which is exactly when normally you need it :-)
[17:20:06] phab people are already aware
[17:20:20] I may leave a pt-kill instance on a screen session
[17:20:28] will document it on the ticket
[17:20:44] sure, sounds like a sane idea
[17:23:50] DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802986 (Marostegui) I am burning 32 cores until tomorrow morning.
[17:27:06] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803004 (jcrespo) As I may disconnect soon, I've left a screen on db1043 killing queries running for over 600 seconds called 878.kill-long-queries as a mitiga...
[17:42:02] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803048 (mmodell) ok so I just deployed a hotfix which should limit those token_* queries to no more than 5 tokens and I filtered short words which could have...
[17:43:07] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803056 (mmodell) @jcrespo: The hotfix should be live, assuming it works (I'm pretty sure it will) then your mitigation script may not be needed.
[17:43:47] yep, assuming it works :-)
[17:43:57] >>> UNRECOVERABLE FATAL ERROR <<<
[17:49:08] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803063 (mmodell) @jcrespo: Thanks for catching this, I wouldn't have noticed until everything died.
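The watchdog discussed between 17:09 and 17:19 (select the whole `information_schema.user_statistics` table once a minute and keep the output in a file, since the table is lost on restart) could look roughly like this. It is a sketch, not what was actually run on labsdb1001 and labsdb1003: `fetch_rows` is a hypothetical callable standing in for whatever client runs the query, and the log line format is made up for illustration.

```python
import time
from datetime import datetime, timezone

# The table named in the chat; it is only populated when user statistics
# collection is enabled on the server.
QUERY = "SELECT * FROM information_schema.user_statistics"


def format_snapshot(rows, now):
    """Turn one result set into timestamped, tab-separated log lines."""
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return ["\t".join([stamp] + [str(col) for col in row]) for row in rows]


def run_watchdog(fetch_rows, log_file, interval=60, iterations=None,
                 sleep=time.sleep):
    """Append a snapshot to log_file every `interval` seconds.

    fetch_rows(query) is a hypothetical hook returning a list of row tuples.
    iterations=None loops until interrupted.
    """
    done = 0
    while iterations is None or done < iterations:
        now = datetime.now(timezone.utc)
        for line in format_snapshot(fetch_rows(QUERY), now):
            log_file.write(line + "\n")
        log_file.flush()  # keep the record on disk in case the server dies
        done += 1
        if iterations is None or done < iterations:
            sleep(interval)
```

Writing to a plain file rather than back into the database is the point of the exercise: the record has to survive the crash it is meant to explain.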
[17:50:47] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803065 (Paladox) I wonder what changed since the last phabricator update to cause this problem?
[17:54:20] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803071 (jcrespo) p:High>Low I have changed the script to print rather than to kill, will close this tomorrow and stop the script if I get no other in...
[17:54:30] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803074 (mmodell) @paladox: These two commits touched PhabricatorProjectQuery but I haven't figured out what might have caused the issue exactly: * {rPHAB9a1...
[17:55:10] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803075 (Paladox) @mmodell thanks, maybe this https://secure.phabricator.com/rPe053534c7e84b09e5f01ac3acb41352bb6a37e05 will improve things?
[17:56:08] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803076 (mmodell)
[18:09:42] DBA, Operations, ops-codfw: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2803133 (Papaul) a:Papaul>Marostegui @Marostegui disk replacement complete
[19:16:40] DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, Patch-For-Review: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2798143 (chasemp) I'm not sure how to know this is OK to expose. Is there anywhere that someo...
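The mitigation left on db1043 (act on queries running for over 600 seconds, later switched from killing to only printing) amounts to a filter over the process list. A rough sketch of that logic follows, assuming rows shaped like `SHOW PROCESSLIST` output reduced to (id, command, time, info). The function name and the `dry_run` flag are illustrative; this is neither the actual kill-long-queries script nor pt-kill.

```python
def long_query_actions(processlist, max_seconds=600, dry_run=True):
    """Return the statements a mitigation pass would print or execute.

    processlist: iterable of (thread_id, command, time_seconds, info) tuples.
    Only threads whose command is 'Query' are considered, so sleeping
    connections and replication threads are left alone.
    """
    actions = []
    for thread_id, command, seconds, info in processlist:
        if command != "Query" or seconds <= max_seconds:
            continue
        statement = f"KILL QUERY {thread_id}"
        if dry_run:
            # Print-only mode: report what would be killed and why.
            actions.append(f"-- would run: {statement} ({seconds}s) {info}")
        else:
            actions.append(statement)
    return actions
```

Defaulting to the print-only mode mirrors the later state of the mitigation, where the script only reports candidates once the hotfix is expected to have removed the need to kill anything.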
[19:31:29] DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, Patch-For-Review: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2803491 (kaldari) @dpatrick: Any chance you could OK replicating this data to Tool Labs? The d...
[19:33:52] DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, Patch-For-Review: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2803498 (chasemp) once @dpatrick gives this a once over you can assign to me and I'll knock it...
[20:56:47] DBA, Edit-Review-Improvements-RC-Page, Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), Patch-For-Review: Implement functionality for RC page 'Experience level' filters - https://phabricator.wikimedia.org/T149637#2803851 (jmatazzoni)
[21:37:22] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: CirrusSearch SQL query for locating pages for reindex performs poorly - https://phabricator.wikimedia.org/T147957#2803972 (Deskana) Open>Resolved a:Deskana
[21:39:04] jynus: So those replags that I think someone was mentioning last night? I think they're repeating about every hour....
[21:40:02] https://logstash.wikimedia.org/goto/e0d0c19d3ba53494d64f06a3e0f9f4bc
[21:45:59] what you are looking at is millions of logs saying "Server db1056 (#2) has >= 1.2185490131378 seconds of lag"
[21:46:20] when there is a +-1 second error on lag measuring
[21:46:49] what is difficult is when it is not logging
[21:47:26] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2803999 (Deskana) Open>Resolved
[21:49:11] jynus: Ah ok. So basically a logging/metrics problem and not a real problem?
[21:49:32] I mean, 20 minutes of lag is a problem
[21:49:40] 1 second it is not
[21:49:52] there is no perfect limit in between
[21:50:00] * ostriches nods
[21:50:05] but less than 15 seconds I do not think it should be a warning
[21:50:08] only debug
[21:50:22] there may be a problem there, though
[21:50:50] I see lots of "Server db1091 (#8) is not replicating?"
[21:51:36] but you are talking to the wrong person- I do not know who created that logging channel and what is the logic behind it
[21:53:10] It's just a filter on the MW replication lag errors.
[21:53:45] So anytime MW is noticing, it's logging to the usual MW log pile, the link I gave just filtered out that type of error.
[21:54:10] and when does it say "Server XXXXX (#8) is not replicating?"?
[21:54:56] DBA, Edit-Review-Improvements-RC-Page, Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), Patch-For-Review: Implement functionality for RC page 'Experience level' filters - https://phabricator.wikimedia.org/T149637#2804020 (jmatazzoni) Pau and I discussed this and have decided to go back to...
[21:57:28] jynus: I'm not entirely sure...
[21:57:51] well, neither am I :-)
[21:58:42] I have my own logs, I plot them in a nice graph: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen
[22:01:02] this one is better: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?from=now-24h&to=now
[22:23:40] DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2804147 (mmodell) Open>Resolved a:mmodell
[22:38:49] jynus: I see spikes in yours that correspond to ones in my MW logging. Perhaps MW is just too sensitive about it and needs to take a chill pill :)
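The thresholds floated above (lag measurements carry roughly a one-second error, anything under 15 seconds is arguably debug-level, 20 minutes is a real problem) could be expressed as below. A sketch under those assumptions: the regular expression is modelled only on the message quoted in the log ("Server db1056 (#2) has >= ... seconds of lag"), the cut-offs are the ones suggested in the chat rather than anything MediaWiki implements, and drawing the warning/error line at the 20-minute figure is purely illustrative, since the chat itself says there is no perfect limit in between.

```python
import re

# Pattern modelled on the single message quoted in the log.
LAG_RE = re.compile(
    r"Server (?P<host>\S+) \(#(?P<index>\d+)\) has >= "
    r"(?P<lag>\d+(?:\.\d+)?) seconds of lag"
)


def parse_lag(message):
    """Return (host, lag_seconds) from a lag log message, or None."""
    match = LAG_RE.search(message)
    if match is None:
        return None
    return match.group("host"), float(match.group("lag"))


def lag_level(lag_seconds, debug_below=15.0, error_from=1200.0):
    """Map a lag value to a log level using the thresholds from the chat."""
    if lag_seconds < debug_below:
        return "debug"
    if lag_seconds < error_from:
        return "warning"
    return "error"
```

With these cut-offs the 1.2-second message that fills the log would drop to debug level, while "is not replicating?" messages do not match the pattern at all and would need separate handling.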