[00:31:54] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (zhuyifei1999) The query was executing for too long then.
[06:56:44] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) >>! In T246970#5952464, @Mike_Peel wrote: > I'm now getting the normal 'killed' message for going over 30 minutes, rather than the MySQL error. So perhaps things ar...
[09:15:12] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar)
[09:15:28] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar)
[10:06:34] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar)
[10:09:31] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) > From time to time Could you be more specific, at random times? When under high load? Which approximate frequency: Once every month, every week, every day? I would bet (this is...
[10:10:34] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) Thanks for the extra info, I commented before you added the logstash link. Any way to reproduce it manually?
[10:28:43] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) I could not find a way to reproduce it. From the log of events, that seems to be a transient issue, occurred early in january and again now. It started being noticeable for the last...
[10:34:11] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) The last ERROR actually happened during a period of low load on the database: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1132&var-port=9104&var-dc=eqiad%2...
[10:37:26] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) > doing too many connections at the same time Not that, it is the first thing I checked, at least not from the perspective of the server (we have metrics of that, and it is not ob...
[10:41:22] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) The other things that is interesting is the: > proceedHandshakeWithPluggableAuthentication Could it be some kind of strange compatibility issue? MySQL 8 recently changed how auth...
[10:45:42] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) Excellent @jcrespo thank you very much for the detailed answers! I have no idea how Gerrit manages the database connections, the settings seem to be all default. Thus I don't know...
[10:45:54] jynus: thank you :] I have all the details I needed
[10:46:08] i will dig in gerrit db configuration settings related to the connection pool/keepalive etc
[10:46:21] and try to get jdbc connections monitoring to be fixed
[10:46:28] so for DBA, I think that is all set :]
[10:48:03] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) Oh, wait, this will connect using the a dbproxy, so maybe the issue is there, not on mysql. Will give that a look.
[10:58:12] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) I don't see immediate concerns about haproxy health, but I can see timeout is set as follows: ` timeout connect 3000ms timeout client 28800s timeout server 28800s ` Maybe the c...
[10:58:19] See my last comment
[10:58:28] it could be something related to the proxy
[10:59:10] sadly, we don't have lots of historical metrics for those yet
[10:59:51] but I could see those overloading at times
[11:00:47] the stack trace is nice because it says what, but doesn't say why
[11:01:07] but it looks to me more like a tcp issue than a mysql issue, probably at proxy level
[11:04:10] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (Paladox) >>! In T247591#5966895, @hashar wrote: > Excellent @jcrespo thank you very much for the detailed answers! > > I have no idea how Gerrit manages the database connections, the setti...
[11:05:41] This also happened last year, were we running MySQL 8 last year jynus ?
[11:08:00] we are not running mysql 8 at all
[11:10:39] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (Paladox) https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html
[11:24:09] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) From Gerrit database configuration documentation ( https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#database ) > **`database.connectionPool`** > > If `true`, use co...
[11:46:50] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) I have looked at the [[ https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=gerrit1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&fro...
[12:06:06] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) @hashar - I recommend the following: * Wait for T246098, which may impact negatively or positively this * Later, try to get some stats regarding TCP for the gerrit host and the pr...
[12:11:12] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (jcrespo) >>! In T247591#5967006, @hashar wrote: > I have looked at the [[ https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=gerrit1001&var-datasource=eqi...
[12:34:22] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (Paladox) @jynus could this https://bugs.mysql.com/bug.php?id=93590 be it? Also that’s fixed in https://dev.mysql.com/doc/relnotes/connector-j/5.1/en/news-5-1-48.html
[17:55:35] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) @paladox unlikely :] We need traces and try to find the root cause first! Thanks for more hints Jaime. Indeed lets hold for the m2 upgrade. We might also look into switching to c...
[17:55:55] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) Open→Stalled p:Triage→Medium