[07:43:08] kormat: I am still on a mobile connection
[07:43:26] they will come check the fiber at midday
[07:43:35] ouch, ok
[07:43:45] I am checking dbtree
[07:44:16] it seems it is taking 11-13 seconds to load from icinga and times out consistently since 0h
[07:44:33] I have no idea why that is happening
[07:45:06] the check runs from the icinga host: sudo -u nagios /usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor1001.wikimedia.org -u https://dbtree.wikimedia.org
[07:45:26] and curl tells me it takes just over 10 seconds
[07:45:46] no deployments seem to have happened at that time
[07:51:46] I finally found something potentially relevant: https://grafana.wikimedia.org/d/000000377/host-overview?panelId=13&fullscreen&orgId=1&refresh=5m&from=1597866688191&to=1597909888191&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[07:51:54] the timestamp matches the alert
[07:55:02] db1115 didn't crash, but there was a clear stall around 0h: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1115&var-port=9104&from=now-12h&to=now
[07:55:40] I see weird patterns since then
[07:57:13] should we try to restart db1115?
[07:58:07] huh. there's a ton of errors in the mariadb logs on db1115
[07:59:52] I see a cron job deleting millions of rows: Queried about 2674029 rows, Deleted about 2674000 rows
[08:00:30] kormat: is it only event scheduler or innodb errors?
[08:00:39] event scheduler
[08:00:51] nah, we don't care about those
[08:01:03] those are "normal" at tendril level
[08:01:29] although there could be more than usual
[08:01:39] there's a shitton
[08:02:07] 30k from 00 to 06
[08:03:27] my bet is the cron job, would fit if configured at 0h
[08:03:43] is it still running?
[08:03:44] causing stalls/locks
[08:03:46] yep
[08:03:52] is it important?
[08:04:16] it is cleanup - deleting old rows
[08:04:22] not important for day to day
[08:04:34] but killing it may make the problem worse
[08:04:45] because it takes 10x longer to roll back than to commit
[08:05:23] ugh
[08:05:36] you know that if it was up to me I would pour gasoline over tendril
[08:05:44] :-)
[08:05:47] hehe
[08:05:55] ok, so i guess we let it finish
[08:07:03] let's research how often this cron job runs
[08:13:23] it is not a cron, it is an event
[08:13:38] and it runs every 5 minutes
[08:13:54] so maybe it is not the real cause of the stalling/slowdown
[08:16:02] I found the immediate issue
[08:16:17] I used tendril to debug tendril, and found this: https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1115&user=&schema=&qmode=eq&query=&hours=12
[08:18:27] that query started taking an average of 11 seconds to run around the same time the webrequests started to take the same amount
[08:18:38] huuh
[08:20:08] I am going to create a ticket to share that before it gets lost in history
[08:20:13] +1
[08:30:56] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo)
[08:31:27] the hypothesis is that a query that used to take very little time, for some reason now chooses a bad query plan
[08:31:39] I will check that hypothesis
[08:32:13] that looks like a "calculating qps" kind of query to show on dbtree
[08:32:27] which would fit
[08:33:53] 228 rows in set (12.38 sec)
[08:38:29] uff
[08:38:36] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) `name=query plan root@db1115[tendril]> EXPLAIN select srv.id ... *************************** 1. row *************************** id: 1 select_type: SIMPLE table: str type: const possible_keys:...
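A rough sketch of the kind of check being run above. The real tendril query is much larger (it joins the server list against global_status_log and other tables); the table, column and time window below are simplified assumptions for illustration, not tendril's actual schema:

    -- Hypothetical stand-in for the slow dbtree status query: capture the
    -- plan the optimizer currently picks for the recent-status aggregation.
    EXPLAIN
    SELECT server_id, MAX(stamp) AS last_seen
      FROM global_status_log
     WHERE stamp > NOW() - INTERVAL 10 MINUTE
     GROUP BY server_id;

If the plan shows a full scan or a poorly selective index instead of the timestamp-based one, that would be consistent with the ~11-12 second runtime seen from check_http and curl.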
[08:41:22] I am running analyze table on the global_status_log table; that could cause some slowdowns
[08:42:43] I will cancel it if it takes too much time
[09:01:51] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Restored the data from backups, machine has caught up on replication with no further failures. No new errors in the hw logs. Going to repool it now.
[09:03:37] what do you think about giving the cpu a run to confirm it won't crash immediately? e.g. running a mysqldump --single-transaction --all-databases -h localhost > /dev/null
[09:04:19] (stressing cpu, memory and storage)
[09:04:27] and touching every single row on the db
[09:04:55] ah. should i depool it again first?
[09:09:00] not sure, if you want to be very safe yes
[09:10:06] it should be fine online, but it may add io stress to the point it could lag a bit
[09:10:54] after running analyze on the tables, the query time is the same (11.2 seconds), so it will need more digging
[12:06:41] I'm back
[12:44:21] I am going to ack dbprov100[12] backup freshness, as they will become stale until backup2001 is back working
[12:55:22] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo) Hey, cloud people! :-D Could I get an ack of this ticket from someone at cloud (or a redirection to the right server owners)? I don't particularly need this f...
[12:57:06] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo)
[13:05:46] I am going to restart db2125 on prometheus exporter
[13:05:52] or rather, the other way around
[13:05:58] hehe
[13:06:23] and starting to think we should start prometheus on every puppet run...
[13:06:57] or a cron
[13:07:31] I seem to remember moritz had something like that for other services
[13:10:48] and now going back to implementing "index" hints on tendril, fighting one alert at a time
[13:11:15] will ack the alert and update the ticket now that I know the cause
[13:11:19] when i get time i want to fix prom-mysql-exporter so this issue stops happening
[13:11:22] cool
[13:11:34] do you remember the issue?
[13:11:41] I think it was the package, not our puppet
[13:12:03] oh there's 2 separate issues
[13:12:12] that it bailed out on error rather than keep trying?
[13:12:13] one is that the non-multiinstance one sometimes tries to run on multiinstance hosts
[13:12:16] something like that?
[13:12:27] the other i'm pretty sure is a code bug where it keeps failing to access the mysql domain socket
[13:12:44] the first issue is probably a packaging and/or puppet bug
[13:12:46] nah, that is not a huge issue as it only happens on first install
[13:12:56] but the other is annoying
[13:13:37] I honestly don't remember the details, but I remember finding it
[13:13:45] then forgetting
[13:39:38] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) I believe this is what happened, after further investigation: 1. 2160c63c8bd7f66c73d639 (for T252331) and related was deployed, increasing the purge rate of the global_status_log table 2. 3 months later, after enough rows have...
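The comment above points at the optimizer statistics on global_status_log drifting after the purge rate increased; the "index" hints mentioned at 13:10 are one way to pin the plan regardless of what the statistics say. A minimal sketch, again assuming a simplified schema and an index named after the timestamp column (both assumptions, not tendril's real definitions):

    -- Inspect the cardinality estimates the optimizer is working from
    -- (ANALYZE TABLE, run at 08:41, refreshes them but did not change the plan here).
    SHOW INDEX FROM global_status_log;

    -- Pin the index choice instead of trusting the statistics; the index
    -- name 'stamp' is purely illustrative.
    SELECT server_id, MAX(stamp) AS last_seen
      FROM global_status_log FORCE INDEX (stamp)
     WHERE stamp > NOW() - INTERVAL 10 MINUTE
     GROUP BY server_id;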
[13:52:55] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) Current plan: ` +------+-------------+-------+-------+----------------------------+-------------+---------+-------+------+-------------- | id | select_type | table | type | possible_keys | key | key...
[13:58:49] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Completed successfully: ` kormat@db2125:~(0:0)$ time sudo mysqldump --single-transaction --all-databases > /dev/null real 249m58.312s user 132m25.215...
[14:01:37] ^indeed it was: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=25&orgId=1&var-server=db2125&var-datasource=thanos&var-cluster=mysql&from=1597912299929&to=1597932046507
[14:35:59] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (Andrew) Sorry about the slow response! Since you first opened this ticket I've moved all VMs off this puppetmaster; it's now set to role::spare::system. Anything you...
[14:39:48] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo) Thank you very much @andrew! Indeed, backup jobs have been automatically removed, so no need for any further action, except reverting from the alert ignore list....
[15:00:38] DBA, Patch-For-Review: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) p: Triage→Medium
[15:03:53] DBA, Patch-For-Review: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) The above patch should fix the alert, please have a look for obvious security bugs. A similar patch has to be ported to tendril, for both the tree and other views.
[15:12:29] jynus: i know absolutely nothing about php i'm afraid
[15:12:42] ha ha
[15:12:52] but you may know about sql injection
[15:13:12] I just want a pair of eyes checking for very obvious mistakes
[15:13:46] I tested it in the testing environment for dbtree, aka production
[15:14:25] the existing orm is not great but we can get away with it because we are the only users of it
[15:15:19] DBA, Patch-For-Review, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo) Open→Resolved a: jcrespo
[15:15:37] dbtree has no parameter input, so the attack surface is very small
[15:15:50] unless, as I said, there are very obvious mistakes
[15:16:02] tendril is more flexible, but it is behind auth
[15:17:25] if you still don't feel comfortable, I can try to phish someone on sre
[15:17:50] i literally cannot evaluate this at all
[15:18:00] ok, np
[15:18:05] :-)
[15:18:06] so yeah i suggest you trick someone else into it :)
[17:34:11] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for lldwiki - https://phabricator.wikimedia.org/T259436 (nskaggs) a: Bstorm
[17:34:25] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (nskaggs) a: nskaggs
[19:15:50] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for lldwiki - https://phabricator.wikimedia.org/T259436 (Bstorm) Open→Resolved All scripts run. Seems to work well from Toolforge.
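The lldwiki storage-layer task above ends with a quick check from Toolforge. A minimal sketch of what such a smoke test against the wiki replicas might look like (the meta_p lookup and the <dbname>_p view naming follow the usual replica conventions, but treat the exact names as assumptions):

    -- Confirm the new wiki is registered in the replica metadata...
    SELECT dbname, url FROM meta_p.wiki WHERE dbname = 'lldwiki';
    -- ...and that its views actually answer queries.
    SELECT COUNT(*) FROM lldwiki_p.page;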
[19:16:36] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Bstorm) a: nskaggs→Bstorm Stealing this task from @nskaggs to get a successful run of a new script.
[19:22:17] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Bstorm) It failed on something that should have worked, and had worked before. I wonder why.
[19:39:10] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (nskaggs) I didn't see you stealing this @Bstorm :-)
[19:41:13] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Bstorm) a: Bstorm→nskaggs Giving task back to @nskaggs to finalize. The DNS step is done if you haven't done it, but the script would not have done an...
[19:53:27] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (nskaggs) Open→Resolved This should be fully complete now.
[19:54:28] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Urbanecm) ` MariaDB [thankyouwiki_p]> select count(*) from user; +----------+ | count(*) | +----------+ | 5 | +----------+ 1 row in set (0.01 sec) `...
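A possible follow-up to the count(*) check above would be to confirm the full set of replica views exists for the new wiki; a sketch, assuming the standard <dbname>_p layout:

    -- List the views created for thankyouwiki on the replicas (illustrative only).
    SHOW FULL TABLES IN thankyouwiki_p WHERE Table_type = 'VIEW';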