[03:46:22] DBA, SRE, Wikimedia-Mailing-lists: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm)
[03:47:54] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm)
[04:16:29] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Marostegui) p: Triage→Medium Let me know when you want this to be done, if it requires coordination with you or can be done any...
[04:30:08] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1118.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202104290429_mar...
[04:39:49] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1156 pooled in s2 with minimal weight
[04:52:27] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1118.eqiad.wmnet'] ` and were **ALL** successful.
[04:54:47] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) Transfer from db1083 to db1118 on-going
[05:00:59] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) Automatically pooling db1156 into s2.
[05:01:08] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:01:27] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) All the hosts in this task have been productionized. Pending: decommission the old ones.
[05:19:22] DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (Marostegui) @jcrespo I have noticed this on db2100 (10.1 backup source): ` Apr 27 13:28:37 db2100 mysqld[883]: InnoDB: tried to purge sec index entry not marked for deletion in Apr 27 13:28:37 db2100 mysqld[883]: I...
[05:35:20] DBA, Patch-For-Review: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 (Marostegui) Open→Resolved This has been enabled everywhere
[06:26:41] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) db1118 cloned from db1083, checking its tables now.
[06:26:59] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:36:51] DBA, Data-Persistence-Backup, SRE, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (jcrespo) Will update here next time db backups run there. I also saw a filesystem backup of 1.6GB here: https://grafana.wikimedia.org/d/413r2vbWk/bacu...
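A minimal sketch of the innodb_change_buffering change tracked in T263443 above, assuming it is applied live with SET GLOBAL; the log does not say whether the rollout was done at runtime or only via the puppet-managed my.cnf, so treat this as illustrative:
`
-- Check the current value first
SHOW GLOBAL VARIABLES LIKE 'innodb_change_buffering';
-- Restrict change buffering to inserts only. The variable is dynamic,
-- so this takes effect immediately, but it must also be persisted in
-- the managed config to survive a restart (assumption, not shown in the log).
SET GLOBAL innodb_change_buffering = 'inserts';
`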
[06:55:07] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:55:55] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) Open→Resolved db1118 has been cloned from db1083. Once its tables have been checked it will be pooled. Once it's been working fine for a few days, db1083 will be sent to decommissioning. Closin...
[06:55:59] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[06:56:01] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:56:03] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:56:48] DBA, decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (Marostegui)
[06:57:18] DBA, decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (Marostegui) Open→Stalled This is not yet ready, db1118 needs to be pooled, and once done, let's wait a week or so before decommissioning db1083.
[06:57:52] DBA, decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (Marostegui)
[06:57:54] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:57:56] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:58:17] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:04:43] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (jcrespo) >>! In T274463#7043866, @brennen wrote: > This sounds like the right approach if it's something we can reasonably do. I would support that if...
[07:05:42] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui) a: Marostegui
[07:12:09] DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (jcrespo) Thanks for the heads up. The fact that this is recent and dumps showed no error/warning message makes me think it is not a fatal error and that data was kept intact. I will make sure to rebuild it for 10.4...
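For the corrupted secondary-index entry reported on db2100 in T276742 above, a hedged sketch of how the table might be confirmed and rebuilt in place. The table name comes from the check table metawiki.watchlist output later in this log; disabling binary logging so the rebuild stays local to the affected host is an assumption, not something the log states:
`
-- Confirm the reported problem
CHECK TABLE metawiki.watchlist EXTENDED;
-- Keep the rebuild off the binlog so it only runs on this replica (assumption)
SET SESSION sql_log_bin = 0;
-- A "null" ALTER forces a full rebuild of the table and all its indexes
ALTER TABLE metawiki.watchlist FORCE;
`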
[07:17:30] I would appreciate a sanity check on this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/683515
[07:20:52] checking
[07:21:09] thanks
[07:23:31] one check is failing, but not sure why
[07:23:50] yeah, it was the same a few days ago when I depooled es4
[07:23:56] I didn't give it much importance
[07:24:32] the funny thing is that it said failure, but when I looked at the console output, it said success
[07:24:41] "failed: No such file or directory (2)" on rsync, so not code related
[07:45:49] Data-Persistence-Backup, SRE, Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo) Doing on backup1001|1002|2001|2002: ` rm /etc/ssl/openssl.cnf apt install --reinstall -o Dpkg::Options::="--force-confask...
[08:08:28] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm) I think it can be done anytime but @ladsgroup should confirm.
[08:08:53] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Marostegui) Thanks - I will wait for the confirmation
[08:18:19] Data-Persistence-Backup, SRE, Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo)
[08:37:51] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (jcrespo)
[08:38:21] Data-Persistence-Backup, SRE, Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo) Open→Resolved This has been successfully reverted and a backup each has been run from both stretch and buster host...
[08:41:56] kormat jynus I am about to finish upgrading all the buster kernels in codfw, only pending db2093, which I will do in around 30m when I am done with the pending hosts. so orchestrator will be failing during the reboot
[08:45:17] ok
[09:02:49] marostegui: reminder that db2093 is treacherous and will page
[09:03:26] i downtimed it and db1115
[09:03:34] 👍
[09:05:50] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Ladsgroup) Yes. Sorry I just woke up. Let's do it!
[09:07:19] orchestrator is back
[09:07:25] all codfw is now done
[09:22:27] DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (jcrespo) It looks like a single entry on the index, forcing a table rebuild: `name=db2100[(none)]> check table metawiki.watchlist; +--------------------+-------+----------+--------------------------------------------...
[09:24:57] Amir1: Do we have mailman processes writing to the mailinglist table?
[09:25:20] (or reading)
[09:25:32] I am trying to alter the table, but I am not able to overcome the metadata lock
[09:25:42] yeah, mailman-web service in lists1001
[09:25:50] right...
[09:26:01] do you want me to down it for a minute?
[09:26:03] maybe it can be stopped for a sec?
[09:26:04] yeah
[09:26:21] sure, let me grab my coffee. Is it okay to do it in five minutes?
[09:26:27] sure
[09:26:32] no problem
[09:32:19] back. Since it's a new service and not widely used, we can do it. For future cases I think we should make a down-time request, etc.
[09:32:22] anyway
[09:32:32] sounds good yeah
[09:34:16] Amir1: the alter went thru, I am now running it at testmailman3
[09:34:24] which is having the same issue :)
[09:35:08] ugh, I just down that quickly
[09:35:38] marostegui: go
[09:35:41] Amir1: done
[09:36:38] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Marostegui) Open→Resolved a: Marostegui All done: ` root@db1128.eqiad.wmnet[mailman3]> ALTER TABLE mailinglist MODIFY COLUM...
[09:36:47] Amir1: ^ \o/
[09:36:59] YAY
[09:38:06] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Ladsgroup) I confirm it fixed the issue (tried it on https://lists.wikimedia.org/postorius/lists/lgbt.lists.wikimedia.org/) mailman w...
[11:36:47] DBA: New database request: image_matching - https://phabricator.wikimedia.org/T280042 (gmodena) >>! In T280042#7035989, @Eevans wrote: >>>! In T280042#7034256, @gmodena wrote: >> Maybe premature optimisation, but this dataset stores text fields (part of a potential primary key) that can be relatively long (p...
[11:41:22] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s2 is fully done apart from the master (db1122)
[11:41:49] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[11:59:28] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s5 is fully done apart from the master (db1100)
[12:00:07] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[12:18:05] DBA: New database request: image_matching - https://phabricator.wikimedia.org/T280042 (gmodena) >> The full dataset for ImageMatching, generated on 321 wikis, is 2.6GB. It contains `23585365` records. > To be clear, a //record// as it is referred to here is one globally unique primary key, and the correspond...
[12:28:02] DBA, Data-Persistence-Backup, SRE, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Ladsgroup) I assume the size is for the search index.
[13:05:57] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[13:08:34] ^ marostegui to downtime on icinga or real issue?
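Returning briefly to the metadata-lock exchange at 09:25 above: a hedged sketch of how the blocking connections could be identified before deciding to stop mailman-web. The 60-second lock_wait_timeout value is illustrative, and the METADATA_LOCK_INFO table is only available when that MariaDB plugin is installed; neither is confirmed by the log:
`
-- Sessions queued behind the app's open handles show this state
SELECT id, user, state, info
FROM information_schema.processlist
WHERE state LIKE '%metadata lock%';
-- If the metadata_lock_info plugin is loaded, the lock holders are visible too
SELECT * FROM information_schema.METADATA_LOCK_INFO;
-- Fail fast instead of queueing indefinitely behind application traffic
-- (illustrative value; the default is much higher)
SET SESSION lock_wait_timeout = 60;
`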
[13:11:42] the usual pc lag
[13:11:44] I think it is not maintenance, but a real issue
[13:11:45] we need to adjust it
[13:11:54] there is increased uncached traffic
[13:11:58] s1 got a spike
[13:12:10] s/issue/traffic changes/
[13:12:14] not an issue at the moment
[13:12:15] yeah, all pc are lagging a bit
[13:12:20] which is "usual" :(
[13:17:55] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 26 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[13:23:55] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[13:28:42] es4 and es5 have increased their reads a lot, which likely means this is forced reparses of old content
[13:29:01] however, it was mostly http gets, I think
[13:29:07] so a scraper, maybe?
[13:29:29] that would correlate with parsercache yeah
[13:30:02] as nothing seems broken, I am going into a meeting, ping if things start to get interesting :-)
[13:30:15] thanks <3
[13:30:44] it would be a good time to tune s1 weight, if things were imbalanced
[13:31:05] I tuned it a few weeks ago during the last spike XD
[13:31:13] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 4.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[13:31:17] So far nothing pops out too crazy on tendril
[13:31:19] regarding QPS
[13:31:30] They are pretty balanced
[13:31:40] between 10-13k qps
[13:37:31] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[13:40:49] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[13:55:41] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 5.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:00:46] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s3 is fully done apart from the master (db1123)
[14:01:04] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[14:10:01] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:13:35] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 9 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[14:14:45] tendril is struggling again
[14:15:15] It might be our global_status friend
[14:15:41] or again general_log_sampled
[14:15:59] -rw-rw---- 1 mysql mysql 1.1T Apr 29 14:15 general_log_sampled.ibd
[14:16:01] awwwww
[14:16:45] I am stopping the event_scheduler for now to let things go thru, and then I will truncate that
[14:17:54] DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) Tendril is struggling again, the size of the table after a week since its truncation: ` -rw-rw---- 1 mysql mysql 1.1T Apr 29 14:15 general_log_sampled.ibd `
[14:21:20] I think it is also global_status causing issues
[14:22:56] tendril is back
[14:23:03] It was a combination of both tables from what I can see
[14:23:15] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 31.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[14:23:23] I truncated the 1T one and it sort of recovered, but it didn't fully recover until I truncated global_status_log
[14:23:45] DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) For the record: ` [16:21:20] I think it is also global_status causing issues [16:22:56] tendril is back [16:23:02] It was a combination of both tables from what I...
[14:26:18] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui)
[14:26:30] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) p: Triage→High
[14:39:35] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) I am doing some archeology and it looks like this only affects: https://tendril.wikimedia.org/report/sampled_queries and NOT https://tendril.wikimedia.org/...
[14:41:45] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[14:45:21] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:49:48] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) https://tendril.wikimedia.org/report/slow_queries doesn't seem to be using that table for anything: ` | 64231005 | tendril_web | 208.80.155.104:43546 |...
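T281486 above asks for a cronjob or an event to keep tendril.general_log_sampled from growing unbounded. A minimal sketch of the event-based option, assuming the table has an event_time column like mysql.general_log (the actual tendril schema may differ) and that the event scheduler is enabled; the event name and the 7-day retention window are illustrative only, not the deployed solution:
`
-- Requires the scheduler: SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS tendril.purge_general_log_sampled
  ON SCHEDULE EVERY 1 DAY
  COMMENT 'Illustrative retention job for T281486'
  DO
    DELETE FROM tendril.general_log_sampled
    WHERE event_time < NOW() - INTERVAL 7 DAY;  -- assumes an event_time column
`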
[15:02:09] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 8.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:12:45] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[15:20:03] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[15:21:23] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:22:02] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Urbanecm) Tagging #dba, as they might be able to offer some guidance on finding the issue here.
[15:36:01] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 3.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:40:51] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:45:52] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (jcrespo) p: Triage→Unbreak! This should be a blocker- es traffic has grown almost 100x since 14 April, correlat...
[15:51:14] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (LarsWirzenius) ACK, I'll make it a train blocker.
[15:51:30] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (LarsWirzenius)
[15:51:46] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 7.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:54:57] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Joe) Given we only make requests to external storage when parsercache has a miss, it seemed sensible to look for correspondi...
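For the "Too many connections" errors in T281480 above, a hedged sketch of the basic server-side checks a DBA might run on an affected es host; the grouping and limits are illustrative, and the Grafana/Prometheus dashboards referenced in the task remain the primary view of the traffic growth:
`
-- How close the server is to its connection ceiling
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
-- Which clients are holding the connections right now
SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
FROM information_schema.processlist
GROUP BY user, client
ORDER BY conns DESC
LIMIT 10;
`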
[15:57:00] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[16:00:33] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Joe) >>! In T281480#7046160, @Joe wrote: > Given we only make requests to external storage when parsercache has a miss, it s...
[16:08:12] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Pchelolo) > A better candidate for changing something is probably https://gerrit.wikimedia.org/r/c/mediawiki/core/+/677299....
[16:11:01] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Joe) There is definitely something going very wrong with memcached: https://grafana.wikimedia.org/d/000000316/memcache?view...
[16:25:44] PROBLEM - MariaDB sustained replica lag on db1087 is CRITICAL: 10 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1087&var-port=9104
[16:25:52] PROBLEM - MariaDB sustained replica lag on db2079 is CRITICAL: 37.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2079&var-port=9104
[16:28:54] PROBLEM - MariaDB sustained replica lag on db2152 is CRITICAL: 29 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2152&var-port=9104
[16:34:30] RECOVERY - MariaDB sustained replica lag on db2152 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2152&var-port=9104
[16:36:04] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:41:04] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:49:13] PROBLEM - MariaDB sustained replica lag on db2080 is CRITICAL: 344.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2080&var-port=9104
[16:54:05] RECOVERY - MariaDB sustained replica lag on db1087 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1087&var-port=9104
[16:57:45] RECOVERY - MariaDB sustained replica lag on db2080 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2080&var-port=9104
[16:58:41] RECOVERY - MariaDB sustained replica lag on db2079 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2079&var-port=9104
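The sustained-replica-lag alerts above come from the Prometheus/Icinga pipeline; the quick on-host equivalent is the replication status itself. A minimal sketch (Seconds_Behind_Master is only the coarse built-in measure; lag in this environment is normally derived from heartbeat rows, which is not shown here):
`
-- Coarse lag and thread health as seen by the replica
SHOW SLAVE STATUS\G
-- Relevant fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running
`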
[17:39:06] DBA, MediaWiki-Cache, MediaWiki-Revision-backend, Platform Engineering, and 2 others: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 (Krinkle)
[18:06:23] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (brennen) > Databases used to use that model (until we found alternative methods for snapshoting), we will still have the code on puppet- with backups t...
[18:08:26] DBA, MediaWiki-Cache, MediaWiki-Revision-backend, Platform Engineering, and 2 others: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 (Addshore)
[18:08:32] DBA, MediaWiki-Cache, MediaWiki-Revision-backend, Platform Engineering, and 2 others: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 (Addshore)
[18:20:32] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn) Frankly, I am not sure I have the resources and knowledge to get into an entirely new LVM snapshotting (and partman) setup (this quarter). I hav...
[19:07:03] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (brennen) > Frankly, I am not sure I have the resources and knowledge to get into an entirely new LVM snapshotting (and partman) setup (this quarter). I...
[19:56:37] tendril is again dead
[19:58:44] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) Tendril is again stuck and this is what I have seen: ` | 66202925 | root | 10.64.32.25 | tendril | Connect | 606 | Sen...
[20:13:10] I have no idea what is causing the issues, still investigating
[20:37:18] still no idea
[20:37:21] this is very strange
[20:39:21] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) For now I have: ` mysql:root@localhost [tendril]> alter table general_log_sampled ENGINE = BLACKHOLE; Query OK, 0 rows affected (0.36 sec) Records: 0 Dup...
[20:42:12] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) I have rebooted the host and will investigate if those memory errors are the source of the hangs or a consequence
[20:47:13] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) And after the reboot and the mysql start: ` root@db1115:~# w 20:47:03 up 4 min, 2 users, load average: 264.16, 65.70, 22.02 `
[20:53:59] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) The host is really unusable after some minutes, I am going to hard reset it, leave mysql stopped and run some xfs fragmentation reports ` -bash-4.4$ w 20:53:...
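A hedged sketch of how the footprint of the two offending tendril tables can be watched and the stop-gap BLACKHOLE swap above verified; the tendril schema and table names come from the log, everything else is illustrative:
`
SELECT table_name, engine,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'tendril'
  AND table_name IN ('general_log_sampled', 'global_status_log');
-- After ALTER ... ENGINE = BLACKHOLE the engine column should read BLACKHOLE
-- and size_gb should drop towards 0, since incoming writes are discarded.
`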
[20:57:26] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) ` root@db1115:~# xfs_db -c frag -r /dev/md2 actual 17139, ideal 8175, fragmentation factor 52.30% `
[21:01:43] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) Tendril and dbtree will remain stopped for the next few hours. I am defragmenting xfs.