[06:33:48] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) I have enabled sync_binlog and trx_commit on db2052, db2051, db2048, db2040, db2039 and db2035 - I'll keep an eye on the lag for those sections during the day
[06:34:10] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui)
[06:38:09] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) Enabled GTID on db2034 and position starts correctly: ` Dec 17 06:36:50 db2034 mysqld[2632]: 2018-12-17 6:36:50 139848782440192 [Note] Slave SQL thread exiting, replication stoppe...
[06:49:36] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) Enabled GTID on db2035 and position starts correctly: ` Dec 17 06:48:22 db2035 mysqld[64960]: 2018-12-17 6:48:22 139957454210816 [Note] Slave SQL thread exiting, replication stopp...
[07:57:25] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) A compare.py on x1 between codfw master and an eqiad slave for enwiki echo_notification table reveals no differences. A compare.py on s2 between codfw master and eqiad slave for en...
[07:59:35] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) Enabled GTID on db2052 ` Dec 17 07:58:08 db2052 mysqld[1565]: 2018-12-17 7:58:08 139774979925760 [Note] Slave SQL thread exiting, replication stopped in log 'db1070-bin.002351' at...
[08:10:39] what happened with db1115?
When I logged in I saw you were already there, and I read the logs, but didn't see anything about why it restarted
[08:12:18] 07:28 < marostegui> Going to stop mysql for now
[08:12:18] 07:28 < marostegui> !log Stop MySQL on db1115 so tendril can get back to work
[08:12:51] I guess it's better to read irc first then. :(
[08:12:55] sorry
[08:13:02] Yes
[09:33:07] DBA: db1115 (tendril DB) had OOM for some processes - https://phabricator.wikimedia.org/T196726 (Marostegui) This happened again Sunday 16th Dec around 7:14AM (UTC) and paged due to nagios restarting (also tendril stopped working): ` 07:17 <+icinga-wm> PROBLEM - puppet last run on db1115 is CRITICAL: connect...
[09:36:42] I am comparing `mariadb::core::multiinstance`, `mariadb::sanitarium_multiinstance` and `mariadb::dbstore_multiinstance` because I am trying to figure out if it is worth creating a separate profile for `mariadb::analytics_multiinstance`, and I am pretty sure it's not. They're all practically the same, with minor differences (in `mariadb::core::multiinstance` a 'section' could be 'critical', and in `mariadb::sanitarium_multiinstance` there's no `x1` section).
[09:38:10] I propose to create a `mariadb::multiinstance` profile and use that in the roles, and when anything 'special' appears, create profiles like `mariadb::multiinstance::dbstore` or `mariadb::multiinstance::sanitarium` etc.
[09:38:53] It wouldn't be too much work
[09:40:24] I don't know, send a patch with your proposal
[09:40:33] Because sanitarium is _very_ different from the rest
[09:41:41] kk
[09:41:57] Keep in mind that the analytics dbstore is more or less the same nowadays, that doesn't mean it will be the same in the future
[09:42:08] And if analytics want to change stuff, if you have one same profile, that might affect our services
[09:42:15] Or vice versa
[09:42:32] Send a patch with a proposal and we can discuss there anyways
[09:43:12] yeah, that's a point.
We'll have a brainstorm today with elukey, I am curious what might pop out
[09:43:48] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) A compare.py on s5 between codfw master and an eqiad slave for dewiki revision table reveals no differences.
[09:44:52] * elukey sees DBAs always blaming analytics
[09:45:19] * elukey blames Manuel with italian gestures
[09:46:31] https://usercontent.irccloud-cdn.com/file/TwQNP72g/like%20this%3F
[10:00:39] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) s4 codfw master GTID enabled and SQL position looking fine before and after the stop: ` Dec 17 09:44:22 db2051 mysqld[1407]: 2018-12-17 9:44:22 140456208512768 [Note] Slave SQL th...
[12:36:04] marostegui: Hey, do you know about this? https://phabricator.wikimedia.org/T211849#4820772
[13:36:35] Amir1: I haven't seen that in the logs
[13:36:38] Is it happening often?
[13:36:57] We had two reports of that in the past couple of days
[13:37:16] Amir1: I haven't touched change_tag schema change
[13:37:27] It's probably happening more often. The codebase for that is actually removed and will be deployed in the next three days
[13:37:38] but something else might be happening here
[13:38:20] There is not much we can do from a DB point of view, if the transaction is taking so long that it doesn't allow others
[13:38:53] For what it's worth, that report is for enwiki master
[13:51:03] DBA: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973 (Marostegui) GTID enabled on s3 codfw master, db2043 and position before and after looking good: ` Dec 17 13:49:47 db2043 mysqld[3352]: 2018-12-17 13:49:47 139902475896576 [Note] Slave SQL thre...
[14:14:01] DBA: Check GTID, consistency options, notifications across the fleet and db-eqiad.php weights - https://phabricator.wikimedia.org/T211973 (Marostegui)
[14:26:42] DBA: db1115 (tendril DB) had OOM for some processes - https://phabricator.wikimedia.org/T196726 (Marostegui) Forgot to paste this graph: https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=1537280704382&to...
[16:43:23] DBA, Analytics, Analytics-Kanban, Data-Services, and 3 others: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 (Marostegui) >>! In T210693#4825092, @bd808 wrote: >>>! In T210693#4824836, @Milimetric wrote: >> I'm not sure fo...
[20:01:18] DBA, Recommendation-API, Core Platform Team Backlog (Watching / External), Services (watching): Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (mobrovac) The important thing to note here is that each worker tries to connect to the DB, so at th...
[20:05:26] DBA, Recommendation-API, Core Platform Team Backlog (Watching / External), Services (watching): Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (Pchelolo) > The important thing to note here is that each worker tries to connect to the DB, so at...
[20:14:34] DBA, Recommendation-API, Core Platform Team Backlog (Watching / External), Services (watching): Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (mobrovac) >>! In T212154#4828982, @Pchelolo wrote: >> The important thing to note here is that each...
[23:46:09] DBA, Jade, Operations, TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) Here are some example queries to help with reviewing the DDL. @Marostegui, I'm especially interested in your feedback o...