[00:44:40] PROBLEM - MariaDB sustained replica lag on db2132 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104 [00:51:32] PROBLEM - MariaDB sustained replica lag on db2132 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104 [01:05:12] RECOVERY - MariaDB sustained replica lag on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104 [05:29:20] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) [05:29:27] 10DBA, 10cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (10Marostegui) 05Stalled→03Open [05:35:07] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2094.codfw.wmnet'] ` The log can be found in `/var/log/wmf... [05:35:39] marostegui: hi, stupid question. Do you know what I can do to debug and find where this index is being used? https://phabricator.wikimedia.org/T163532 [05:35:57] it can be any code, from extensions to core [05:36:45] Amir1: Other than grepping in code there's no much way to know where it is being used :( [05:37:02] Amir1: I can only tell if it is being used, but not by whom [05:37:22] yeah, i was thinking maybe binlogs or something like that [05:37:24] but I doubt [05:37:38] nope :( [05:38:03] 😭 [05:39:38] Amir1: the only thing we could do would be to enable slow query log, grab a bunch of queries and then run explain and see if any of them use that index [05:39:44] but that would be pure luck if we find it [05:40:30] yeah, one suggestion is to keep track of its usage in performance schema tables, if it changes quite often, then our luck is higher [05:42:57] mysql:root@localhost [sys]> select * from schema_index_statistics where index_name like 'rev_page_id'; [05:42:58] +--------------+------------+-------------+---------------+----------------+---------------+----------------+--------------+----------------+--------------+----------------+ [05:42:58] | table_schema | table_name | index_name | rows_selected | select_latency | rows_inserted | insert_latency | rows_updated | update_latency | rows_deleted | delete_latency | [05:42:58] +--------------+------------+-------------+---------------+----------------+---------------+----------------+--------------+----------------+--------------+----------------+ [05:42:58] | enwiki | revision | rev_page_id | 61283890 | 2.24 m | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps | [05:42:58] +--------------+------------+-------------+---------------+----------------+---------------+----------------+--------------+----------------+--------------+----------------+ [05:42:59] 1 row in set (0.014 sec) [05:43:01] gah, sorry [05:43:17] it is definitely being used super often (almost every second) [05:43:27] so we could analyze the queries on the slow query log [05:43:32] and try to see if we find the one [05:45:17] Thanks <3 [05:45:49] I will get back to you today or Monday on that task if that's ok? [05:46:12] oh of course, this can wait [05:46:37] thanks :) [05:46:49] specially due to past experience with dropping indexes, we really need to be due diligent [05:50:35] yeah, this one is definitely being used [05:57:10] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2094.codfw.wmnet'] ` and were **ALL** successful. [06:03:06] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) db2094 has been migrated to Buster. I am now checking all the tables to look for corruption, as we saw some during while setting up the buster sanitarium hosts... [06:06:57] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2095.codfw.wmnet'] ` The log can be found in `/var/log/wmf... [06:29:09] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2095.codfw.wmnet'] ` and were **ALL** successful. [06:33:27] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) db2095 has been migrated to Buster. I am now checking all the tables to look for corruption, as we saw some during while setting up the buster sanitarium hosts... [06:34:56] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) [06:34:59] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) [06:35:06] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) a:03Marostegui [06:38:50] 10DBA: Migrate sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280315 (10Marostegui) [06:39:02] 10DBA: Migrate sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280315 (10Marostegui) p:05Triage→03Medium [06:45:06] 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) There's some corruption on both, db2094 and db2095, so once all the checks are finished, I will need to rebuild the affected tables. [07:01:50] Amir1: to make things more interesting, I am seeing the index being used, but the query capture and the explain doesn't show the index being used, cause probably it is a case of explain being different from the real plan being used (which can be seen by show explain for, which is almost impossible to run if the query is fast) [07:02:03] I have captured around 66k queries [07:02:32] I will see what I can do [07:02:38] oh :( [07:04:03] my hypotheses is that the index is not that different from another similar index and it just chose it at random [07:04:22] could be, I will try to dig more and see what I can find with this capture [07:04:33] thanks [09:19:39] Amir1 I have found something...2 DERIVED revision index_merge PRIMARY,rev_timestamp,page_timestamp,rev_page_id,rev_page_actor_timestamp rev_page_id,page_timestamp 8,24 NULL 3 Using sort_union(rev_page_id,page_timestamp); Using where [09:19:53] I now need to find what does that belong to :) [09:23:38] It is just one query out of 10k I captured [09:23:49] I need to find which one [09:27:48] oh thanks [09:28:17] that sounds terrible :( [09:29:57] I think I am getting closer! [09:30:41] It is one of this: MediaWiki\Revision\RevisionStore::countRevisionsBetween [09:31:47] I think it is a fun one, I got 13 identical queries, and only one seem to be using rev_page_id [09:31:49] let me confirm it [09:32:30] haha yes [09:32:36] they all use: rev_timestamp,page_timestamp [09:32:44] but only one uses: rev_page_id,page_timestamp [09:33:14] Amir1: SELECT /* MediaWiki\Revision\RevisionStore::countRevisionsBetween */ COUNT(*) AS `rowcount` FROM (SELECT 1 FROM `revision` WHERE rev_page = 26382735 AND ((rev_deleted & 1) = 0) AND ((rev_timestamp = '20210319080213' AND rev_id > 1012961028) OR rev_timestamp > '20210319080213') AND ((rev_timestamp = '20210319080238' AND rev_id < 1012961077) OR rev_timestamp < '20210319080238') LIMIT 1001 ) `tmp_count`; [09:33:16] that one [09:33:34] \o/ [09:33:36] Thanks [09:33:51] I will check what I can do about it and get rid of it ^^ [09:33:54] Not sure why that one uses rev_page_id,page_timestamp and the others use rev_timestamp,page_timestamp [09:34:01] Let me check on other hosts [09:35:36] so yes, all the hosts use rev_page_id,page_timestamp for that query [09:36:03] Amir1: EXPLAIN SELECT /* MediaWiki\Revision\RevisionStore::countRevisionsBetween */ COUNT(*) AS `rowcount` FROM (SELECT 1 FROM `revision` WHERE rev_page = 3957 AND ((rev_deleted & 1) = 0) AND ((rev_timestamp = '20151105194319' AND rev_id > 689230365) OR rev_timestamp > '20151105194319') AND ((rev_timestamp = '20151106134059' AND rev_id < 689338263) OR rev_timestamp < '20151106134059') LIMIT 1001 ) `tmp_count`; [09:36:13] This one uses rev_timestamp,page_timestmap [09:36:22] And it is the same query from what I can see [09:37:02] so basically this is just randomness? [09:37:14] yes, but that specific query uses that index on all the hosts [09:37:24] I can try to see what the optimizer is doing with its trace [09:37:44] so it's not even connected to cardinality [09:37:44] Amir1: https://phabricator.wikimedia.org/P15385 [09:37:47] ugh [09:38:07] interesting [11:04:42] Amir1: if I ignore rev_page_id for that specific query, it falls back to use rev_timestamp,page_timestamp and the query is as fast [11:05:54] oh noice [11:06:03] can we delete it now 🥺 [11:06:13] slowly in some replicas? [11:06:16] hahaha [11:06:29] yeah, maybe we can drop it on one replica of enwiki [11:06:31] and see how it goes [11:06:42] cooool [11:06:47] not today though :p [11:07:33] https://media.giphy.com/media/l0IsIZw8doJm3ysRq/giphy.gif [11:07:44] hahahaha [11:07:51] marostegui: I made popcorns :( [11:08:03] okay, next week then [11:08:04] * marostegui drops the index and switches off his phone, see you monday o/ [11:08:19] can you happily assign it to kormat ? [11:08:44] hahaha [12:17:27] bullseye release seems very imminent: https://lists.debian.org/debian-release/2021/04/msg00187.html [12:19:00] production seems very far from that, but regarding backups I would like to start playing around locally with it to make sure backup codebase keeps working [12:19:05] Exciting times! [12:19:53] Is bullseye + 10.4 the most reasonable combination to test for me for now, marostegui? [12:20:16] as in, mariabackup 10.4 on bullseye [12:21:13] if it helps, I think debian will ship 10.5 [12:23:15] and there has not been a single 10.6 tag, neither preview, but I think some fixes landed on that branch [12:26:58] I don't intend to spent time on this, but I want to know early if "backups will be completely broken on next debian version" (I hope not :-)) [14:28:15] I think bullseye will arrive before 10.5 for us yes [14:34:22] that being said, I do want to include 10.5 testing in our next year roadmap [14:34:57] it's to early to decide on that, I guess [14:35:01] *too [14:35:39] but with so many debian packages we maintain now, I don't want to leave for the last minute the next os version support [14:36:01] marostegui: bullseye is expected by end of may according to the release list [14:36:43] RhinosF1: and we haven't still moved out from stretch :) [14:36:48] Heh [14:37:25] * RhinosF1 is currently trying to actually come up with a plan for bullseye. He imagines at lot of pulling hair and waiting will be involved. [14:37:31] jynus: sure, but we are not even on buster yet 100% [14:38:09] I don't think it is buster what was the blocker, but the mariadb upgrade [14:38:41] you could consider staying on 10.4 and upgrade only the os-if that could help (but again, too early to say) [14:39:47] for now I am focusing on python libraries and interpreter and its dependencies [14:39:56] jynus: there was no 10.1 for buster anyways [14:40:36] buster shipped with 10.3, didn't it? [14:40:57] I think we gambled on going for a later version to not get too outdated, iirc [14:41:29] we may end up on an earlier or a later version (e.g. 10.6) by the time a proper plan is set, again, too early to say [14:41:42] we'll see [14:41:49] or -you know- migrating to postgresql :-) [14:42:00] I was thinking about MS Access [14:42:14] I think only MS SQL server has linux support [14:43:12] https://docs.microsoft.com/en-us/sql/linux/quickstart-install-connect-ubuntu?view=sql-server-ver15 [14:47:33] we can migrate to windows too [14:47:43] it is just an OS migration anyways [14:49:58] plus we can always run containers on WLS2, we only have to convince serviceops to support that stack [19:16:43] 10DBA, 10Wikimedia-Rdbms, 10Tracking-Neverending: Database replication lag issues (tracking) - https://phabricator.wikimedia.org/T3268 (10Krinkle) [21:59:47] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @Marostegui do you have any advice on how to configure clouddb1021 memory / memory alerts? Would it be worth doing to...