[00:44:40] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on db2132 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[00:51:32] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on db2132 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:05:12] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[05:29:20] <wikibugs>	 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui)
[05:29:27] <wikibugs>	 10DBA, 10cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (10Marostegui) 05Stalled→03Open
[05:35:07] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2094.codfw.wmnet'] ` The log can be found in `/var/log/wmf...
[05:35:39] <Amir1>	 marostegui: hi, stupid question. Do you know what I can do to debug and find where this index is being used? https://phabricator.wikimedia.org/T163532 
[05:35:57] <Amir1>	 it can be any code, from extensions to core
[05:36:45] <marostegui>	 Amir1: Other than grepping in code there's no much way to know where it is being used :(
[05:37:02] <marostegui>	 Amir1: I can only tell if it is being used, but not by whom
[05:37:22] <Amir1>	 yeah, i was thinking maybe binlogs or something like that
[05:37:24] <Amir1>	 but I doubt
[05:37:38] <marostegui>	 nope :(
[05:38:03] <Amir1>	 😭
[05:39:38] <marostegui>	 Amir1: the only thing we could do would be to enable slow query log, grab a bunch of queries and then run explain and see if any of them use that index
[05:39:44] <marostegui>	 but that would be pure luck if we find it
[05:40:30] <Amir1>	 yeah, one suggestion is to keep track of its usage in performance schema tables, if it changes quite often, then our luck is higher
[05:42:57] <marostegui>	 mysql:root@localhost [sys]> select * from schema_index_statistics where index_name like 'rev_page_id';
[05:42:58] <marostegui>	 +--------------+------------+-------------+---------------+----------------+---------------+----------------+--------------+----------------+--------------+----------------+
[05:42:58] <marostegui>	 | table_schema | table_name | index_name  | rows_selected | select_latency | rows_inserted | insert_latency | rows_updated | update_latency | rows_deleted | delete_latency |
[05:42:58] <marostegui>	 +--------------+------------+-------------+---------------+----------------+---------------+----------------+--------------+----------------+--------------+----------------+
[05:42:58] <marostegui>	 | enwiki       | revision   | rev_page_id |      61283890 | 2.24 m         |             0 | 0 ps           |            0 | 0 ps           |            0 | 0 ps           |
[05:42:58] <marostegui>	 +--------------+------------+-------------+---------------+----------------+---------------+----------------+--------------+----------------+--------------+----------------+
[05:42:59] <marostegui>	 1 row in set (0.014 sec)
[05:43:01] <marostegui>	 gah, sorry
[05:43:17] <marostegui>	 it is definitely being used super often (almost every second)
[05:43:27] <marostegui>	 so we could analyze the queries on the slow query log
[05:43:32] <marostegui>	 and try to see if we find the one
[05:45:17] <Amir1>	 Thanks <3
[05:45:49] <marostegui>	 I will get back to you today or Monday on that task if that's ok?
[05:46:12] <Amir1>	 oh of course, this can wait
[05:46:37] <marostegui>	 thanks :)
[05:46:49] <Amir1>	 specially due to past experience with dropping indexes, we really need to be due diligent 
[05:50:35] <marostegui>	 yeah, this one is definitely being used
[05:57:10] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2094.codfw.wmnet'] `  and were **ALL** successful.
[06:03:06] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) db2094 has been migrated to Buster. I am now checking all the tables to look for corruption, as we saw some during while setting up the buster sanitarium hosts...
[06:06:57] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2095.codfw.wmnet'] ` The log can be found in `/var/log/wmf...
[06:29:09] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2095.codfw.wmnet'] `  and were **ALL** successful.
[06:33:27] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) db2095 has been migrated to Buster. I am now checking all the tables to look for corruption, as we saw some during while setting up the buster sanitarium hosts...
[06:34:56] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui)
[06:34:59] <wikibugs>	 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui)
[06:35:06] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) a:03Marostegui
[06:38:50] <wikibugs>	 10DBA: Migrate sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280315 (10Marostegui)
[06:39:02] <wikibugs>	 10DBA: Migrate sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280315 (10Marostegui) p:05Triage→03Medium
[06:45:06] <wikibugs>	 10DBA, 10Patch-For-Review: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) There's some corruption on both, db2094 and db2095, so once all the checks are finished, I will need to rebuild the affected tables.
[07:01:50] <marostegui>	 Amir1: to make things more interesting, I am seeing the index being used, but the query capture and the explain doesn't show the index being used, cause probably it is a case of explain being different from the real plan being used (which can be seen by show explain for, which is almost impossible to run if the query is fast)
[07:02:03] <marostegui>	 I have captured around 66k queries
[07:02:32] <marostegui>	 I will see what I can do
[07:02:38] <Amir1>	 oh :(
[07:04:03] <Amir1>	 my hypotheses is that the index is not that different from another similar index and it just chose it at random
[07:04:22] <marostegui>	 could be, I will try to dig more and see what I can find with this capture
[07:04:33] <Amir1>	 thanks
[09:19:39] <marostegui>	  Amir1 I have found something...2	DERIVED	revision	index_merge	PRIMARY,rev_timestamp,page_timestamp,rev_page_id,rev_page_actor_timestamp	rev_page_id,page_timestamp	8,24	NULL	3	Using sort_union(rev_page_id,page_timestamp); Using where
[09:19:53] <marostegui>	 I now need to find what does that belong to :)
[09:23:38] <marostegui>	 It is just one query out of 10k I captured
[09:23:49] <marostegui>	 I need to find which one
[09:27:48] <Amir1>	 oh thanks
[09:28:17] <Amir1>	 that sounds terrible :(
[09:29:57] <marostegui>	 I think I am getting closer!
[09:30:41] <marostegui>	 It is one of this: MediaWiki\Revision\RevisionStore::countRevisionsBetween
[09:31:47] <marostegui>	 I think it is a fun one, I got 13 identical queries, and only one seem to be using rev_page_id
[09:31:49] <marostegui>	 let me confirm it
[09:32:30] <marostegui>	 haha yes
[09:32:36] <marostegui>	 they all use: rev_timestamp,page_timestamp
[09:32:44] <marostegui>	 but only one uses: rev_page_id,page_timestamp
[09:33:14] <marostegui>	 Amir1: SELECT /* MediaWiki\Revision\RevisionStore::countRevisionsBetween  */  COUNT(*) AS `rowcount`  FROM (SELECT  1  FROM `revision`    WHERE rev_page = 26382735 AND ((rev_deleted & 1) = 0) AND ((rev_timestamp = '20210319080213' AND rev_id > 1012961028) OR rev_timestamp > '20210319080213') AND ((rev_timestamp = '20210319080238' AND rev_id < 1012961077) OR rev_timestamp < '20210319080238')  LIMIT 1001  ) `tmp_count`;
[09:33:16] <marostegui>	 that one
[09:33:34] <Amir1>	 \o/
[09:33:36] <Amir1>	 Thanks
[09:33:51] <Amir1>	 I will check what I can do about it and get rid of it ^^
[09:33:54] <marostegui>	 Not sure why that one uses rev_page_id,page_timestamp and the others use rev_timestamp,page_timestamp
[09:34:01] <marostegui>	 Let me check on other hosts
[09:35:36] <marostegui>	 so yes, all the hosts use rev_page_id,page_timestamp for that query
[09:36:03] <marostegui>	 Amir1: EXPLAIN SELECT /* MediaWiki\Revision\RevisionStore::countRevisionsBetween  */  COUNT(*) AS `rowcount`  FROM (SELECT  1  FROM `revision`    WHERE rev_page = 3957 AND ((rev_deleted & 1) = 0) AND ((rev_timestamp = '20151105194319' AND rev_id > 689230365) OR rev_timestamp > '20151105194319') AND ((rev_timestamp = '20151106134059' AND rev_id < 689338263) OR rev_timestamp < '20151106134059')  LIMIT 1001  ) `tmp_count`;
[09:36:13] <marostegui>	 This one uses rev_timestamp,page_timestmap
[09:36:22] <marostegui>	 And it is the same query from what I can see
[09:37:02] <Amir1>	 so basically this is just randomness?
[09:37:14] <marostegui>	 yes, but that specific query uses that index on all the hosts
[09:37:24] <marostegui>	 I can try to see what the optimizer is doing with its trace
[09:37:44] <Amir1>	 so it's not even connected to cardinality 
[09:37:44] <marostegui>	 Amir1: https://phabricator.wikimedia.org/P15385
[09:37:47] <Amir1>	 ugh
[09:38:07] <marostegui>	 interesting
[11:04:42] <marostegui>	 Amir1: if I ignore rev_page_id for that specific query, it falls back to use rev_timestamp,page_timestamp and the query is as fast 
[11:05:54] <Amir1>	 oh noice
[11:06:03] <Amir1>	 can we delete it now 🥺
[11:06:13] <Amir1>	 slowly in some replicas?
[11:06:16] <marostegui>	 hahaha
[11:06:29] <marostegui>	 yeah, maybe we can drop it on one replica of enwiki
[11:06:31] <marostegui>	 and see how it goes
[11:06:42] <Amir1>	 cooool
[11:06:47] <marostegui>	 not today though :p
[11:07:33] <Amir1>	 https://media.giphy.com/media/l0IsIZw8doJm3ysRq/giphy.gif
[11:07:44] <marostegui>	 hahahaha
[11:07:51] <Amir1>	 marostegui: I made popcorns :(
[11:08:03] <Amir1>	 okay, next week then
[11:08:04] * marostegui drops the index and switches off his phone, see you monday o/
[11:08:19] <Amir1>	 can you happily assign it to kormat ?
[11:08:44] <marostegui>	 hahaha
[12:17:27] <jynus>	 bullseye release seems very imminent: https://lists.debian.org/debian-release/2021/04/msg00187.html
[12:19:00] <jynus>	 production seems very far from that, but regarding backups I would like to start playing around locally with it to make sure backup codebase keeps working
[12:19:05] <sobanski>	 Exciting times!
[12:19:53] <jynus>	 Is bullseye + 10.4 the most reasonable combination to test for me for now, marostegui?
[12:20:16] <jynus>	 as in, mariabackup 10.4 on bullseye
[12:21:13] <jynus>	 if it helps, I think debian will ship 10.5
[12:23:15] <jynus>	 and there has not been a single 10.6 tag, neither preview, but I think some fixes landed on that branch
[12:26:58] <jynus>	 I don't intend to spent time on this, but I want to know early if "backups will be completely broken on next debian version" (I hope not :-))
[14:28:15] <marostegui>	 I think bullseye will arrive before 10.5 for us yes
[14:34:22] <marostegui>	 that being said, I do want to include 10.5 testing in our next year roadmap
[14:34:57] <jynus>	 it's to early to decide on that, I guess
[14:35:01] <jynus>	 *too
[14:35:39] <jynus>	 but with so many debian packages we maintain now, I don't want to leave for the last minute the next os version support
[14:36:01] <RhinosF1>	 marostegui: bullseye is expected by end of may according to the release list
[14:36:43] <marostegui>	 RhinosF1: and we haven't still moved out from stretch :)
[14:36:48] <RhinosF1>	 Heh
[14:37:25] * RhinosF1 is currently trying to actually come up with a plan for bullseye. He imagines at lot of pulling hair and waiting will be involved.
[14:37:31] <marostegui>	 jynus: sure, but we are not even on buster yet 100%
[14:38:09] <jynus>	 I don't think it is buster what was the blocker, but the mariadb upgrade
[14:38:41] <jynus>	 you could consider staying on 10.4 and upgrade only the os-if that could help (but again, too early to say)
[14:39:47] <jynus>	 for now I am focusing on python libraries and interpreter and its dependencies
[14:39:56] <marostegui>	 jynus: there was no 10.1 for buster anyways
[14:40:36] <jynus>	 buster shipped with 10.3, didn't it?
[14:40:57] <jynus>	 I think we gambled on going for a later version to not get too outdated, iirc
[14:41:29] <jynus>	 we may end up on an earlier or a later version (e.g. 10.6) by the time a proper plan is set, again, too early to say
[14:41:42] <marostegui>	 we'll see
[14:41:49] <jynus>	 or -you know- migrating to postgresql :-)
[14:42:00] <marostegui>	 I was thinking about MS Access
[14:42:14] <jynus>	 I think only MS SQL server has linux support
[14:43:12] <jynus>	 https://docs.microsoft.com/en-us/sql/linux/quickstart-install-connect-ubuntu?view=sql-server-ver15
[14:47:33] <marostegui>	 we can migrate to windows too
[14:47:43] <marostegui>	 it is just an OS migration anyways 
[14:49:58] <jynus>	 plus we can always run containers on WLS2, we only have to convince serviceops to support that stack
[19:16:43] <wikibugs>	 10DBA, 10Wikimedia-Rdbms, 10Tracking-Neverending: Database replication lag issues (tracking) - https://phabricator.wikimedia.org/T3268 (10Krinkle)
[21:59:47] <wikibugs>	 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @Marostegui do you have any advice on how to configure clouddb1021 memory / memory alerts? Would it be worth doing to...