[02:08:13] <wikibugs>	 DBA, Wiki-Loves-Monuments-Database: mysqldump is timing out preventing all tables from being included in the dump - https://phabricator.wikimedia.org/T138517#2801443 (Platonides) >>! In T138517#2796670, @Lokal_Profil wrote: > `ERROR 1214 (HY000) at line 15: The used table type doesn't support FULLTEXT in...
[07:11:56] <wikibugs_>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801687 (Marostegui) This is what I have seen  - There is a big spike on disk writes just before the server died - The ILO logs after the reset show:  ```     description=POST Error: 1792-Slot X Dr...
[07:43:52] <wikibugs_>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801712 (Marostegui) db2050 is located right on top of db2049 and its logs do not reveal any warning or any trace of overheat
[08:29:50] <wikibugs>	 DBA, Operations, ops-eqiad: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2778014 (Marostegui) @Cmjohnson did HP come back to you about this issue? Thanks!
[08:52:59] <jynus>	 what is your opinion about db2049?
[08:53:26] <marostegui>	 I think we should try to restart it manually and if it goes fine, start replication
[08:53:30] <jynus>	 should we try to overload it to see if the problem repeats?
[08:54:14] <marostegui>	 did you ever see an overheat problem in codfw or is this the first one?
[08:54:20] <jynus>	 obviously if it is a one-time hw issue we cannot do anything about it
[08:54:43] <jynus>	 I think the last one was db1073, I think
[08:54:52] <marostegui>	 that is why I checked the activity graphs to see if it was under heavy load when it crashed
[08:55:03] <marostegui>	 I only saw the spike in disk activity
[08:55:11] <jynus>	 but you say after
[08:55:17] <jynus>	 which would be normal
[08:55:49] <marostegui>	 yes, that is why I am saying that maybe overloading it will not reproduce the crash, as it was not in heavy load
[08:56:02] <marostegui>	 But we can try some cpuburn for some hours
[08:56:05] <marostegui>	 to see what happens
[08:56:16] <jynus>	 I do not know, I was asking your opinion
[08:56:37] <jynus>	 the idea is
[08:56:46] <jynus>	 if it happens again, we want to have more info
[08:56:52] <marostegui>	 yeah, totally
[08:57:00] <marostegui>	 I do not want another db2034 :(
[08:57:01] <jynus>	 so what can we do about it?
[08:57:13] <jynus>	 about having more info, not about the crash
[08:58:02] <marostegui>	 We can try the cpuburn approach for some hours, to see if it crashes, and if not, we can maybe discard "heavy load" as a cause
[09:03:38] <marostegui>	 whatever we decide, I would reboot it manually before
[09:08:07] <jynus>	 yes
[09:13:07] <volans>	 marostegui: I would check also that remote IPMI is working fine too ;)
[09:13:22] <marostegui>	 what do you mean?
[09:14:23] <volans>	 for example that you're able to do a chassis power status with ipmitool from neodymium
[09:14:30] <volans>	 after the reboot
[09:15:14] <marostegui>	 ah right
[09:15:15] <marostegui>	 sure
[09:25:06] <marostegui>	 I am going to reboot it now
[09:25:09] <marostegui>	 jynus ^
[09:25:33] <jynus>	 ok
[09:27:26] <volans>	 downtime on icinga ;)
[09:27:53] <marostegui>	 shit
[09:28:23] <marostegui>	 I thought it was downtime from yesterday
[09:28:29] <marostegui>	 let's see if I have been fast enough
[09:29:14] <jynus>	 you should log it anyway
[09:29:41] <jynus>	 that is more important IMHO
[09:29:58] <jynus>	 helps identify if something crashed or was purposely down
[09:30:11] <marostegui>	 true, I did it retroactively
[09:30:13] <marostegui>	 for the record
[09:30:15] <marostegui>	 thanks
[09:37:42] <wikibugs>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801859 (Marostegui) After the reboot the Cache message is gone.
[09:53:25] <wikibugs>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801893 (Marostegui) I am running a burn test - I have started burning 3 CPUs and will leave it for a little while before starting with 3 more.
[09:56:15] <wikibugs_>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801894 (jcrespo) a:Marostegui Assigning it to you to credit you are working more on this.
[10:05:30] <volans>	 2 wikibugs bots? ;)
[12:14:29] <wikibugs_>	 DBA: Set barracuda InnoDB file format as the deafault configuration everywhere - https://phabricator.wikimedia.org/T150949#2802293 (jcrespo)
[12:14:41] <wikibugs>	 DBA: Set barracuda InnoDB file format as the default configuration everywhere - https://phabricator.wikimedia.org/T150949#2802293 (jcrespo)
[12:20:43] <wikibugs>	 DBA, Wikimedia-Site-requests, Patch-For-Review: Recreate a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#2802309 (jcrespo)
[12:27:05] <wikibugs_>	 DBA, Labs, Labs-Infrastructure: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#2802317 (jcrespo)
[15:03:30] <jynus>	 they fixed a problem I reported on 8.0.0 which interest us: https://bugs.mysql.com/?id=83706
[15:04:17] <marostegui>	 ah nice
[15:04:20] <marostegui>	 that was fast
[15:11:12] <jynus>	 could it be that dbstore2X have each 24GB of buffer pool?
[15:13:08] <jynus>	 maybe we should create a dbstore2 role, like sanitarium2
[15:13:14] <marostegui>	 that is what the config says indeed 24GB
[15:13:43] <jynus>	 for innodb-only dbstore
[15:15:49] <wikibugs>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802648 (jcrespo) p:Triage>Normal
[16:00:41] <wikibugs>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802731 (Marostegui) I am burning 12 CPUs now. For the night I am planning to leave 24 of them and see what happens tomorrow morning.
[16:04:48] <wikibugs_>	 DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 - https://phabricator.wikimedia.org/T150960#2802738 (Marostegui)
[16:17:30] <wikibugs>	 DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2802766 (Papaul) The ISO file is about 6.5 GB and has all firmware update for the Proliant G6 G7 and G8 .  I had to burn it on a DVD and boot the server from it. it took approximately 15 minutes t...
[16:20:50] <wikibugs_>	 DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802774 (Cmjohnson) @jcrespo the disk has been swapped please let me know if you need anything else.
[16:27:39] <wikibugs>	 DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802785 (jcrespo) Yes, I mentioned changing the thermal paste (unless you see some other reason to create thermal issues, such as a malfunctioning fan) and wiping the logical disk volume (not ph...
[16:34:48] <wikibugs_>	 DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802801 (Cmjohnson) @jcrespo sure, can I power down anytime?
[16:35:15] <wikibugs>	 DBA, Operations, ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802802 (jcrespo) Yes.
[16:47:06] <wikibugs>	 DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#2802859 (Cmjohnson) p:Normal>Low
[16:53:19] <wikibugs>	 DBA, Phabricator, Release-Engineering-Team: pbraicator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802868 (jcrespo)
[16:53:56] <wikibugs_>	 DBA, Operations, ops-eqiad: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2802881 (Cmjohnson) HP Support Case Opened.  Case ID: 5315048494 Case title: Failed Power Supply Severity 3-Normal
[16:54:02] <wikibugs>	 DBA, Phabricator, Release-Engineering-Team: phabiicator close to saturate its database connections  - https://phabricator.wikimedia.org/T150965#2802882 (jcrespo)
[16:56:58] <marostegui>	 jynus: I am not finding any pattern (on queries) for 1001 and 1003 crashes
[16:57:06] <marostegui>	 (so far)
[17:01:04] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections  - https://phabricator.wikimedia.org/T150965#2802900 (Aklapper)
[17:03:47] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802917 (mmodell) p:Triage>High
[17:09:37] <jynus>	 marostegui, I would put a watchdog
[17:09:57] <jynus>	 on memory usage
[17:10:06] <jynus>	 per user
[17:10:38] <jynus>	 that is, information_schema.user_statistics
[17:11:49] <jynus>	 put it to log every minute, then we can see where it is taking so much memory
[17:12:12] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802937 (mmodell) This seems to be related to fulltext search, I'm still investigating further.
[17:14:11] <marostegui>	 jynus: You mean querying that for that specific user and logging it?
[17:14:32] <jynus>	 which user, I did not mention any user
[17:14:44] <jynus>	 just the table showing all activity
[17:14:50] <marostegui>	 sure
[17:15:36] <jynus>	 selecting that and keeping it as a record
[17:15:59] <jynus>	 until we debug where the OOMs come from
[17:17:38] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2802956 (jcrespo) https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1043&from=now-24h&to=now  This graph is intere...
[17:18:41] <marostegui>	 jynus: I will leave it in both labsdb1001 and 1003
[17:18:50] <jynus>	 thank you
[17:19:24] <jynus>	 the problem with that table, and P_S is that it gets deleted on restart
[17:19:32] <marostegui>	 I am logging it to a file
[17:19:49] <jynus>	 which is exactly when normally you need it :-)
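The watchdog discussed above (select information_schema.user_statistics every minute and keep it in a file, because the table is reset on restart) could be sketched roughly as follows. This is only an illustration, not the script that was actually left running: `fetch_rows` is a hypothetical callable that runs the query through whatever client is available and returns one dict per row, and the `SELECT *` and the JSON-lines file format are assumptions.

```python
import json
import time
from datetime import datetime, timezone

# The table name comes from the discussion; selecting every column with *
# is an assumption of this sketch.
USER_STATS_QUERY = "SELECT * FROM information_schema.user_statistics"


def append_snapshot(path, rows, now=None):
    """Append one timestamped JSON line per row so the data survives a restart."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    count = 0
    with open(path, "a", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps({"ts": stamp, **row}, default=str) + "\n")
            count += 1
    return count


def watch(fetch_rows, path, interval=60.0, iterations=None, sleep=time.sleep):
    """Call fetch_rows() every `interval` seconds and log each result to `path`.

    fetch_rows is a hypothetical callable that executes USER_STATS_QUERY and
    returns an iterable of dicts. iterations=None means run until interrupted.
    """
    done = 0
    while iterations is None or done < iterations:
        append_snapshot(path, fetch_rows())
        done += 1
        if iterations is None or done < iterations:
            sleep(interval)
    return done
```

Because every snapshot is appended to a file, the last entries before a crash remain available for comparison even though the in-memory table starts empty after the restart.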
[17:20:06] <jynus>	 phab people are already aware
[17:20:20] <jynus>	 I may leave a pt-kill instance on a screen session
[17:20:28] <jynus>	 will document it on the ticket
[17:20:44] <marostegui>	 sure, sounds like a sane idea
[17:23:50] <wikibugs_>	 DBA, Operations, ops-codfw: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802986 (Marostegui) I am burning 32 cores until tomorrow morning.
[17:27:06] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803004 (jcrespo) As I may disconnect soon, I've left a screen on db1043 killing queries running for over 600 seconds called 878.kill-long-queries as a mitiga...
[17:42:02] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803048 (mmodell) ok so I just deployed a hotfix which should limit those token_* queries to no more than 5 tokens and I filtered short words which could have...
[17:43:07] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803056 (mmodell) @jcrespo: The hotfix should be live, assuming it works (I'm pretty sure it will) then your mitigation script may not be needed.
[17:43:47] <jynus>	 yep, assuming it works :-)
[17:43:57] <jynus>	 >>> UNRECOVERABLE FATAL ERROR <<<
[17:49:08] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803063 (mmodell) @jcrespo: Thanks for catching this, I wouldn't have noticed until everything died.
[17:50:47] <wikibugs>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803065 (Paladox) I wonder what changed since the last phabricator update to cause this problem?
[17:54:20] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803071 (jcrespo) p:High>Low I have changed the script to print rather than to kill, will close this tomorrow and stop the script if I get no other in...
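The mitigation described on the ticket (act on queries running for over 600 seconds, first killing them and later only printing them) can be illustrated with a small stand-in. This is not pt-kill and not the script left on db1043: the row keys `ID`, `COMMAND` and `TIME` follow the columns of information_schema.processlist, while `execute` is a hypothetical callable that sends a statement to the server.

```python
def select_long_queries(processes, max_seconds=600):
    """Keep only active queries that have been running longer than max_seconds."""
    return [
        p for p in processes
        if p["COMMAND"] == "Query" and p["TIME"] > max_seconds
    ]


def handle_long_queries(processes, execute, max_seconds=600, dry_run=True):
    """Print (dry_run=True) or execute (dry_run=False) a KILL QUERY per long query.

    `execute` is a hypothetical callable taking one SQL statement; the
    600-second limit is the value mentioned on the ticket.
    """
    statements = [
        f"KILL QUERY {int(p['ID'])}"
        for p in select_long_queries(processes, max_seconds)
    ]
    for statement in statements:
        if dry_run:
            print(statement)
        else:
            execute(statement)
    return statements
```

Defaulting to `dry_run=True` mirrors the later state of the mitigation, where the script only reports candidates instead of killing them.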
[17:54:30] <wikibugs>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803074 (mmodell) @paladox: These two commits touched PhabricatorProjectQuery but I haven't figured out what might have caused the issue exactly:  * {rPHAB9a1...
[17:55:10] <wikibugs_>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803075 (Paladox) @mmodell thanks, maybe this https://secure.phabricator.com/rPe053534c7e84b09e5f01ac3acb41352bb6a37e05 will improve things?
[17:56:08] <wikibugs>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2803076 (mmodell)
[18:09:42] <wikibugs>	 DBA, Operations, ops-codfw: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2803133 (Papaul) a:Papaul>Marostegui @Marostegui  disk replacement complete
[19:16:40] <wikibugs>	 DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, Patch-For-Review: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2798143 (chasemp) I'm not sure how to know this is OK to expose.  Is there anywhere that someo...
[19:31:29] <wikibugs_>	 DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, Patch-For-Review: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2803491 (kaldari) @dpatrick: Any chance you could OK replicating this data to Tool Labs? The d...
[19:33:52] <wikibugs_>	 DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, Patch-For-Review: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2803498 (chasemp) once @dpatrick gives this a once over you can assign to me and I'll knock it...
[20:56:47] <wikibugs>	 DBA, Edit-Review-Improvements-RC-Page, Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), Patch-For-Review: Implement functionality for RC page 'Experience level' filters - https://phabricator.wikimedia.org/T149637#2803851 (jmatazzoni)
[21:37:22] <wikibugs>	 DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: CirrusSearch SQL query for locating pages for reindex performs poorly - https://phabricator.wikimedia.org/T147957#2803972 (Deskana) Open>Resolved a:Deskana
[21:39:04] <ostriches>	 jynus: So those replags that I think someone was mentioning last night? I think they're repeating about every hour....
[21:40:02] <ostriches>	 https://logstash.wikimedia.org/goto/e0d0c19d3ba53494d64f06a3e0f9f4bc
[21:45:59] <jynus>	 what you are looking at is millions of logs saying "Server db1056 (#2) has >= 1.2185490131378 seconds of lag"
[21:46:20] <jynus>	 when there is a +-1 second error on lag measuring
[21:46:49] <jynus>	 what is difficult is when it is not logging
[21:47:26] <wikibugs_>	 DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2803999 (Deskana) Open>Resolved
[21:49:11] <ostriches>	 jynus: Ah ok. So basically a logging/metrics problem and not a real problem?
[21:49:32] <jynus>	 I mean, 20 minutes of lag is a problem
[21:49:40] <jynus>	 1 second it is not
[21:49:52] <jynus>	 there is no perfect limit in between
[21:50:00] * ostriches nods
[21:50:05] <jynus>	 but for less than 15 seconds I do not think it should be a warning
[21:50:08] <jynus>	 only debug
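The rule of thumb suggested here (lag under 15 seconds is debug-level noise, anything above deserves a warning) might look like this as a severity function. It is a sketch of the suggestion only, not MediaWiki's actual logging logic; the function name and the choice to treat exactly 15 seconds as a warning are assumptions.

```python
import logging

# Threshold proposed in the discussion; measurements carry roughly a
# one second error, so values near 1 second are not meaningful.
WARN_THRESHOLD_SECONDS = 15.0


def lag_log_level(lag_seconds, warn_threshold=WARN_THRESHOLD_SECONDS):
    """Return logging.DEBUG for lag below the threshold, logging.WARNING otherwise."""
    if lag_seconds < warn_threshold:
        return logging.DEBUG
    return logging.WARNING


def log_lag(logger, server, lag_seconds):
    """Emit the lag message at the level chosen by lag_log_level()."""
    logger.log(
        lag_log_level(lag_seconds),
        "Server %s has >= %s seconds of lag", server, lag_seconds,
    )
```

With this rule the "1.2185490131378 seconds" messages would drop to debug level, while a 20 minute lag would still be logged as a warning.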
[21:50:22] <jynus>	 there may be a problem there, though
[21:50:50] <jynus>	 I see lots of "Server db1091 (#8) is not replicating?"
[21:51:36] <jynus>	 but you are talking to the wrong person- I do not know who created that logging channel and what is the logic behind it
[21:53:10] <ostriches>	 It's just a filter on the MW replication lag errors.
[21:53:45] <ostriches>	 So anytime MW is noticing, it's logging to the usual MW log pile, the link I gave just filtered out that type of error.
[21:54:10] <jynus>	 and when does it say "Server XXXXX (#8) is not replicating?"?
[21:54:56] <wikibugs_>	 DBA, Edit-Review-Improvements-RC-Page, Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), Patch-For-Review: Implement functionality for RC page 'Experience level' filters - https://phabricator.wikimedia.org/T149637#2804020 (jmatazzoni) Pau and I discussed this and have decided to go back to...
[21:57:28] <ostriches>	 jynus: I'm not entirely sure...
[21:57:51] <jynus>	 well, neither am I :-)
[21:58:42] <jynus>	 I have my own logs, I plot them in a nice graph: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen
[22:01:02] <jynus>	 this one is better: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?from=now-24h&to=now
[22:23:40] <wikibugs>	 DBA, Phabricator, Release-Engineering-Team: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965#2804147 (mmodell) Open>Resolved a:mmodell
[22:38:49] <ostriches>	 jynus: I see spikes in yours that correspond to ones in my MW logging. Perhaps MW is just too sensitive about it and needs to take a chill pill :)