[03:46:22] DBA, SRE, Wikimedia-Mailing-lists: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm)
[03:47:54] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm)
[04:16:29] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Marostegui) p: Triage→Medium Let me know when you want this to be done, if it requires coordination with you or can be done any...
[04:30:08] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1118.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202104290429_mar...
[04:39:49] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1156 pooled in s2 with minimal weight
[04:52:27] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1118.eqiad.wmnet'] ` and were **ALL** successful.
[04:54:47] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) Transfer from db1083 to db1118 on-going
[05:00:59] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) Automatically pooling db1156 into s2.
[05:01:08] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:01:27] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) All the hosts in this task have been productionized. Pending: decommission the old ones.
[05:19:22] DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (Marostegui) @jcrespo I have noticed this on db2100 (10.1 backup source): ` Apr 27 13:28:37 db2100 mysqld[883]: InnoDB: tried to purge sec index entry not marked for deletion in Apr 27 13:28:37 db2100 mysqld[883]: I...
[05:35:20] DBA, Patch-For-Review: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 (Marostegui) Open→Resolved This has been enabled everywhere
[06:26:41] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) db1118 cloned from db1083, checking its tables now.
[06:26:59] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:36:51] DBA, Data-Persistence-Backup, SRE, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (jcrespo) Will update here next time db backups run there. I also saw a filesystem backup of 1.6GB here: https://grafana.wikimedia.org/d/413r2vbWk/bacu...
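A minimal sketch of the innodb_change_buffering change tracked in T263443 above, assuming it is applied live with SET GLOBAL; the log does not say whether the rollout was done at runtime or only via the puppet-managed my.cnf, so treat this as illustrative:
`
-- Check the current value first
SHOW GLOBAL VARIABLES LIKE 'innodb_change_buffering';
-- Restrict change buffering to inserts only. The variable is dynamic,
-- so this takes effect immediately, but it must also be persisted in
-- the managed config to survive a restart (assumption, not shown in the log).
SET GLOBAL innodb_change_buffering = 'inserts';
`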
[06:55:07] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:55:55] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) Open→Resolved db1118 has been cloned from db1083. Once its tables have been checked it will be pooled. Once it's been working fine for a few days, db1083 will be sent to decommissioning. Closin...
[06:55:59] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[06:56:01] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:56:03] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:56:48] DBA, decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (Marostegui)
[06:57:18] DBA, decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (Marostegui) Open→Stalled This is not yet ready, db1118 needs to be pooled, and once done, let's wait a week or so before decommissioning db1083.
[06:57:52] DBA, decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (Marostegui)
[06:57:54] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:57:56] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:58:17] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:04:43] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (jcrespo) >>! In T274463#7043866, @brennen wrote: > This sounds like the right approach if it's something we can reasonably do. I would support that if...
[07:05:42] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui) a: Marostegui
[07:12:09] DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (jcrespo) Thanks for the heads up. The fact that this is recent and dumps showed no error/warning message makes me think it is not a fatal error and that data was kept intact. I will make sure to rebuild it for 10.4...
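For the corrupted secondary-index entry reported on db2100 in T276742 above, a hedged sketch of how the table might be confirmed and rebuilt in place. The table name comes from the check table metawiki.watchlist output later in this log; disabling binary logging so the rebuild stays local to the affected host is an assumption, not something the log states:
`
-- Confirm the reported problem
CHECK TABLE metawiki.watchlist EXTENDED;
-- Keep the rebuild off the binlog so it only runs on this replica (assumption)
SET SESSION sql_log_bin = 0;
-- A "null" ALTER forces a full rebuild of the table and all its indexes
ALTER TABLE metawiki.watchlist FORCE;
`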
[07:17:30] I would appreciate a sanity check on this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/683515
[07:20:52] checking
[07:21:09] thanks
[07:23:31] one check is failing, but not sure why
[07:23:50] yeah, it was the same a few days ago when I depooled es4
[07:23:56] I didn't give it much importance
[07:24:32] the funny thing is that it said failure, but when I looked at the console output, it said success
[07:24:41] "failed: No such file or directory (2)" on rsync, so not code related
[07:45:49] Data-Persistence-Backup, SRE, Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo) Doing on backup1001|1002|2001|2002: ` rm /etc/ssl/openssl.cnf apt install --reinstall -o Dpkg::Options::="--force-confask...
[08:08:28] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm) I think it can be done anytime but @ladsgroup should confirm.
[08:08:53] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Marostegui) Thanks - I will wait for the confirmation
[08:18:19] Data-Persistence-Backup, SRE, Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo)
[08:37:51] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (jcrespo)
[08:38:21] Data-Persistence-Backup, SRE, Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo) Open→Resolved This has been successfully reverted and a backup each has been run from both stretch and buster host...
[08:41:56] kormat jynus I am about to finish upgrading all the buster kernels in codfw, only pending db2093, which I will do in around 30m when I am done with the pending hosts. so orchestrator will be failing during the reboot
[08:45:17] ok
[09:02:49] marostegui: reminder that db2093 is treacherous and will page
[09:03:26] i downtimed it and db1115
[09:03:34] 👍
[09:05:50] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Ladsgroup) Yes. Sorry I just woke up. Let's do it!
[09:07:19] orchestrator is back
[09:07:25] all codfw is now done
[09:22:27] DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (jcrespo) It looks like a single entry on the index, forcing a table rebuild: `name=db2100[(none)]> check table metawiki.watchlist; +--------------------+-------+----------+--------------------------------------------...
[09:24:57] Amir1: Do we have mailman processes writing to the mailinglist table?
[09:25:20] (or reading)
[09:25:32] I am trying to alter the table, but I am not able to overcome the metadata lock
[09:25:42] yeah, mailman-web service in lists1001
[09:25:50] right...
[09:26:01] do you want me to down it for a minute?
[09:26:03] maybe it can be stopped for a sec?
[09:26:04] yeah
[09:26:21] sure, let me grab my coffee. Is it okay to do it in five minutes?
[09:26:27] sure
[09:26:32] no problem
[09:32:19] back. Since it's a new service and not widely used, we can do it. For future cases I think we should make a down-time request, etc.
[09:32:22] anyway
[09:32:32] sounds good yeah
[09:34:16] Amir1: the alter went thru, I am now running it at testmailman3
[09:34:24] which is having the same issue :)
[09:35:08] ugh, I just down that quickly
[09:35:38] marostegui: go
[09:35:41] Amir1: done
[09:36:38] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Marostegui) Open→Resolved a: Marostegui All done: ` root@db1128.eqiad.wmnet[mailman3]> ALTER TABLE mailinglist MODIFY COLUM...
[09:36:47] Amir1: ^ \o/
[09:36:59] YAY
[09:38:06] DBA, SRE, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Ladsgroup) I confirm it fixed the issue (tried it on https://lists.wikimedia.org/postorius/lists/lgbt.lists.wikimedia.org/) mailman w...
[11:36:47] DBA: New database request: image_matching - https://phabricator.wikimedia.org/T280042 (gmodena) >>! In T280042#7035989, @Eevans wrote: >>>! In T280042#7034256, @gmodena wrote: >> Maybe premature optimisation, but this dataset stores text fields (part of a potential primary key) that can be relatively long (p...
[11:41:22] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s2 is fully done apart from the master (db1122)
[11:41:49] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[11:59:28] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s5 is fully done apart from the master (db1100)
[12:00:07] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[12:18:05] DBA: New database request: image_matching - https://phabricator.wikimedia.org/T280042 (gmodena) >> The full dataset for ImageMatching, generated on 321 wikis, is 2.6GB. It contains `23585365` records. > To be clear, a //record// as it is referred to here is one globally unique primary key, and the correspond...
[12:28:02] DBA, Data-Persistence-Backup, SRE, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Ladsgroup) I assume the size is for the search index.
[13:05:57] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[13:08:34] ^ marostegui to downtime on icinga or real issue?
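Returning briefly to the metadata-lock exchange at 09:25 above: a hedged sketch of how the blocking connections could be identified before deciding to stop mailman-web. The 60-second lock_wait_timeout value is illustrative, and the METADATA_LOCK_INFO table is only available when that MariaDB plugin is installed; neither is confirmed by the log:
`
-- Sessions queued behind the app's open handles show this state
SELECT id, user, state, info
FROM information_schema.processlist
WHERE state LIKE '%metadata lock%';
-- If the metadata_lock_info plugin is loaded, the lock holders are visible too
SELECT * FROM information_schema.METADATA_LOCK_INFO;
-- Fail fast instead of queueing indefinitely behind application traffic
-- (illustrative value; the default is much higher)
SET SESSION lock_wait_timeout = 60;
`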
[13:11:42] the usual pc lag
[13:11:44] I think it is not maintenance, but a real issue
[13:11:45] we need to adjust it
[13:11:54] there is increased uncached traffic
[13:11:58] s1 got a spike
[13:12:10] s/issue/traffic changes/
[13:12:14] not an issue at the moment
[13:12:15] yeah, all pc are lagging a bit
[13:12:20] which is "usual" :(
[13:17:55] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 26 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[13:23:55] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[13:28:42] es4 and es5 have increased their reads a lot, which likely means this is forced reparses of old content
[13:29:01] however, it was mostly http gets, I think
[13:29:07] so a scraper, maybe?
[13:29:29] that would correlate with parsercache yeah
[13:30:02] as nothing seems broken, I am going into a meeting, ping if things start to get interesting :-)
[13:30:15] thanks <3
[13:30:44] it would be a good time to tune s1 weight, if things were imbalanced
[13:31:05] I tuned it a few weeks ago during the last spike XD
[13:31:13] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 4.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[13:31:17] So far nothing pops out too crazy on tendril
[13:31:19] regarding QPS
[13:31:30] They are pretty balanced
[13:31:40] between 10-13k qps
[13:37:31] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[13:40:49] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[13:55:41] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 5.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:00:46] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s3 is fully done apart from the master (db1123)
[14:01:04] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[14:10:01] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:13:35] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 9 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[14:14:45] tendril is struggling again
[14:15:15] It might be our global_status friend
[14:15:41] or again general_log_sampled
[14:15:59] -rw-rw---- 1 mysql mysql 1.1T Apr 29 14:15 general_log_sampled.ibd
[14:16:01] awwwww
[14:16:45] I am stopping the event_scheduler for now to let things go thru, and then I will truncate that
[14:17:54] DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) Tendril is struggling again, the size of the table after a week since its truncation: ` -rw-rw---- 1 mysql mysql 1.1T Apr 29 14:15 general_log_sampled.ibd `
[14:21:20] I think it is also global_status causing issues
[14:22:56] tendril is back
[14:23:03] It was a combination of both tables from what I can see
[14:23:15] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 31.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[14:23:23] I truncated the 1T one and it sort of recovered, but it didn't fully recover until I truncated global_status_log
[14:23:45] DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) For the record: ` [16:21:20] I think it is also global_status causing issues [16:22:56] tendril is back [16:23:02] It was a combination of both tables from what I...
[14:26:18] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui)
[14:26:30] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) p: Triage→High
[14:39:35] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) I am doing some archeology and it looks like this only affects: https://tendril.wikimedia.org/report/sampled_queries and NOT https://tendril.wikimedia.org/...
[14:41:45] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[14:45:21] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:49:48] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) https://tendril.wikimedia.org/report/slow_queries doesn't seem to be using that table for anything: ` | 64231005 | tendril_web | 208.80.155.104:43546 |...
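T281486 above asks for a cronjob or an event to keep tendril.general_log_sampled from growing unbounded. A minimal sketch of the event-based option, assuming the table has an event_time column like mysql.general_log (the actual tendril schema may differ) and that the event scheduler is enabled; the event name and the 7-day retention window are illustrative only, not the deployed solution:
`
-- Requires the scheduler: SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS tendril.purge_general_log_sampled
  ON SCHEDULE EVERY 1 DAY
  COMMENT 'Illustrative retention job for T281486'
  DO
    DELETE FROM tendril.general_log_sampled
    WHERE event_time < NOW() - INTERVAL 7 DAY;  -- assumes an event_time column
`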
[15:02:09] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 8.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:12:45] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[15:20:03] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[15:21:23] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:22:02] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Urbanecm) Tagging #dba, as they might be able to offer some guidance on finding the issue here.
[15:36:01] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 3.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:40:51] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:45:52] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (jcrespo) p: Triage→Unbreak! This should be a blocker- es traffic has grown almost 100x since 14 April, correlat...
[15:51:14] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (LarsWirzenius) ACK, I'll make it a train blocker.
[15:51:30] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (LarsWirzenius)
[15:51:46] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 7.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[15:54:57] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Joe) Given we only make requests to external storage when parsercache has a miss, it seemed sensible to look for correspondi...
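For the "Too many connections" errors in T281480 above, a hedged sketch of the basic server-side checks a DBA might run on an affected es host; the grouping and limits are illustrative, and the Grafana/Prometheus dashboards referenced in the task remain the primary view of the traffic growth:
`
-- How close the server is to its connection ceiling
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
-- Which clients are holding the connections right now
SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
FROM information_schema.processlist
GROUP BY user, client
ORDER BY conns DESC
LIMIT 10;
`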
[15:57:00] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[16:00:33] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Joe) >>! In T281480#7046160, @Joe wrote: > Given we only make requests to external storage when parsercache has a miss, it s...
[16:08:12] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Pchelolo) > A better candidate for changing something is probably https://gerrit.wikimedia.org/r/c/mediawiki/core/+/677299....
[16:11:01] DBA, MediaWiki-Revision-backend, Platform Engineering, Wikidata, and 2 others: Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 (Joe) There is definitely something going very wrong with memcached: https://grafana.wikimedia.org/d/000000316/memcache?view...
[16:25:44] PROBLEM - MariaDB sustained replica lag on db1087 is CRITICAL: 10 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1087&var-port=9104
[16:25:52] PROBLEM - MariaDB sustained replica lag on db2079 is CRITICAL: 37.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2079&var-port=9104
[16:28:54] PROBLEM - MariaDB sustained replica lag on db2152 is CRITICAL: 29 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2152&var-port=9104
[16:34:30] RECOVERY - MariaDB sustained replica lag on db2152 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2152&var-port=9104
[16:36:04] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:41:04] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:49:13] PROBLEM - MariaDB sustained replica lag on db2080 is CRITICAL: 344.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2080&var-port=9104
[16:54:05] RECOVERY - MariaDB sustained replica lag on db1087 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1087&var-port=9104
[16:57:45] RECOVERY - MariaDB sustained replica lag on db2080 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2080&var-port=9104
[16:58:41] RECOVERY - MariaDB sustained replica lag on db2079 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2079&var-port=9104
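The sustained-replica-lag alerts above come from the Prometheus/Icinga pipeline; the quick on-host equivalent is the replication status itself. A minimal sketch (Seconds_Behind_Master is only the coarse built-in measure; lag in this environment is normally derived from heartbeat rows, which is not shown here):
`
-- Coarse lag and thread health as seen by the replica
SHOW SLAVE STATUS\G
-- Relevant fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running
`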
[17:39:06] DBA, MediaWiki-Cache, MediaWiki-Revision-backend, Platform Engineering, and 2 others: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 (Krinkle)
[18:06:23] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (brennen) > Databases used to use that model (until we found alternative methods for snapshoting), we will still have the code on puppet- with backups t...
[18:08:26] DBA, MediaWiki-Cache, MediaWiki-Revision-backend, Platform Engineering, and 2 others: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 (Addshore)
[18:08:32] DBA, MediaWiki-Cache, MediaWiki-Revision-backend, Platform Engineering, and 2 others: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 (Addshore)
[18:20:32] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn) Frankly, I am not sure I have the resources and knowledge to get into an entirely new LVM snapshotting (and partman) setup (this quarter). I hav...
[19:07:03] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (brennen) > Frankly, I am not sure I have the resources and knowledge to get into an entirely new LVM snapshotting (and partman) setup (this quarter). I...
[19:56:37] tendril is again dead
[19:58:44] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) Tendril is again stuck and this is what I have seen: ` | 66202925 | root | 10.64.32.25 | tendril | Connect | 606 | Sen...
[20:13:10] I have no idea what is causing the issues, still investigating
[20:37:18] still no idea
[20:37:21] this is very strange
[20:39:21] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) For now I have: ` mysql:root@localhost [tendril]> alter table general_log_sampled ENGINE = BLACKHOLE; Query OK, 0 rows affected (0.36 sec) Records: 0 Dup...
[20:42:12] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) I have rebooted the host and will investigate if those memory errors are the source of the hangs or a consequence
[20:47:13] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) And after the reboot and the mysql start: ` root@db1115:~# w 20:47:03 up 4 min, 2 users, load average: 264.16, 65.70, 22.02 `
[20:53:59] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) The host is really unusable after some minutes, I am going to hard reset it, leave mysql stopped and run some xfs fragmentation reports ` -bash-4.4$ w 20:53:...
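A hedged sketch of how the footprint of the two offending tendril tables can be watched and the stop-gap BLACKHOLE swap above verified; the tendril schema and table names come from the log, everything else is illustrative:
`
SELECT table_name, engine,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'tendril'
  AND table_name IN ('general_log_sampled', 'global_status_log');
-- After ALTER ... ENGINE = BLACKHOLE the engine column should read BLACKHOLE
-- and size_gb should drop towards 0, since incoming writes are discarded.
`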
[20:57:26] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) ` root@db1115:~# xfs_db -c frag -r /dev/md2 actual 17139, ideal 8175, fragmentation factor 52.30% `
[21:01:43] DBA: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 (Marostegui) Tendril and dbtree will remain stopped for the next few hours. I am defragmenting xfs.