[07:43:08] kormat: I am still on a mobile connection
[07:43:26] they will come check the fiber at midday
[07:43:35] ouch, ok
[07:43:45] I am checking dbtree
[07:44:16] it seems it is taking 11-13 seconds to load from icinga and times out consistently since 0h
[07:44:33] I have no idea why that is happening
[07:45:06] the check runs from the icinga host: sudo -u nagios /usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor1001.wikimedia.org -u https://dbtree.wikimedia.org
[07:45:26] and curl tells me it takes just over 10 seconds
[07:45:46] no deployments seem to have happened at that time
[07:51:46] I finally found something potentially relevant: https://grafana.wikimedia.org/d/000000377/host-overview?panelId=13&fullscreen&orgId=1&refresh=5m&from=1597866688191&to=1597909888191&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[07:51:54] the timestamp matches the alert
[07:55:02] db1115 didn't crash, but there was a clear stall around 0h: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1115&var-port=9104&from=now-12h&to=now
[07:55:40] I see weird patterns since then
[07:57:13] should we try to restart db1115?
[07:58:07] huh. there's a ton of errors in the mariadb logs on db1115
[07:59:52] I see a cron job deleting millions of rows: Queried about 2674029 rows, Deleted about 2674000 rows
[08:00:30] kormat: is it only event scheduler or innodb errors?
[08:00:39] event scheduler
[08:00:51] nah, we don't care about those
[08:01:03] those are "normal" at tendril level
[08:01:29] although there could be more than usual
[08:01:39] there's a shitton
[08:02:07] 30k from 00 to 06
[08:03:27] my bet is the cron job, would fit if configured at 0h
[08:03:43] is it still running?
[08:03:44] causing stalls/locks
[08:03:46] yep
[08:03:52] is it important?
[08:04:16] it is cleanup - deleting old rows
[08:04:22] not important for day to day
[08:04:34] but killing it may make the problem worse
[08:04:45] because it takes 10x longer to roll back than to commit
[08:05:23] ugh
[08:05:36] you know that if it was up to me I would pour gasoline over tendril
[08:05:44] :-)
[08:05:47] hehe
[08:05:55] ok, so i guess we let it finish
[08:07:03] let's research how often this cron job runs
[08:13:23] it is not a cron, it is an event
[08:13:38] and it runs every 5 minutes
[08:13:54] so maybe it is not the real cause of the stalling/slowdown
[08:16:02] I found the immediate issue
[08:16:17] I used tendril to debug tendril, and found this: https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1115&user=&schema=&qmode=eq&query=&hours=12
[08:18:27] that query started taking an average of 11 seconds to run around the same time the webrequests started to take the same amount
[08:18:38] huuh
[08:20:08] I am going to create a ticket to share that before it gets lost in history
[08:20:13] +1
[08:30:56] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo)
[08:31:27] the hypothesis is that a query that used to take very little time, for some reason now chooses a bad query plan
[08:31:39] I will check that hypothesis
[08:32:13] that looks like a "calculating qps" kind of query to show on dbtree
[08:32:27] which would fit
[08:33:53] 228 rows in set (12.38 sec)
[08:38:29] uff
[08:38:36] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) `name=query plan root@db1115[tendril]> EXPLAIN select srv.id ... *************************** 1. row *************************** id: 1 select_type: SIMPLE table: str type: const possible_keys:...
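A rough sketch of the kind of check being run above. The real tendril query is much larger (it joins the server list against global_status_log and other tables); the table, column and time window below are simplified assumptions for illustration, not tendril's actual schema:

    -- Hypothetical stand-in for the slow dbtree status query: capture the
    -- plan the optimizer currently picks for the recent-status aggregation.
    EXPLAIN
    SELECT server_id, MAX(stamp) AS last_seen
      FROM global_status_log
     WHERE stamp > NOW() - INTERVAL 10 MINUTE
     GROUP BY server_id;

If the plan shows a full scan or a poorly selective index instead of the timestamp-based one, that would be consistent with the ~11-12 second runtime seen from check_http and curl.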
[08:41:22] I am running analyze table on the global_status_log table; that could cause some slowdowns
[08:42:43] I will cancel it if it takes too much time
[09:01:51] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Restored the data from backups, machine has caught up on replication with no further failures. No new errors in the hw logs. Going to repool it now.
[09:03:37] what do you think about giving the cpu a run to confirm it won't crash immediately? e.g. running a mysqldump --single-transaction --all-databases -h localhost > /dev/null
[09:04:19] (stressing cpu, memory and storage)
[09:04:27] and touching every single row on the db
[09:04:55] ah. should i depool it again first?
[09:09:00] not sure, if you want to be very safe yes
[09:10:06] it should be fine online, but it may add io stress to the point it could lag a bit
[09:10:54] after running analyze on the tables, the query time is the same (11.2 seconds), so it will need more digging
[12:06:41] I'm back
[12:44:21] I am going to ack dbprov100[12] backup freshness, as they will become stale until backup2001 is back working
[12:55:22] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo) Hey, cloud people! :-D Could I get an ack of this ticket from someone at cloud (or a redirection to the right server owners)? I don't particularly need this f...
[12:57:06] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo)
[13:05:46] I am going to restart db2125 on prometheus exporter
[13:05:52] or rather, the other way around
[13:05:58] hehe
[13:06:23] and starting to think we should start prometheus on every puppet run...
[13:06:57] or a cron
[13:07:31] I seem to remember moritz had something like that for other services
[13:10:48] and now going back to implementing "index" hints on tendril, fighting one alert at a time
[13:11:15] will ack the alert and update the ticket now that I know the cause
[13:11:19] when i get time i want to fix prom-mysql-exporter so this issue stops happening
[13:11:22] cool
[13:11:34] do you remember the issue?
[13:11:41] I think it was the package, not our puppet
[13:12:03] oh there's 2 separate issues
[13:12:12] that it bailed out on error rather than keep trying?
[13:12:13] one is that the non-multiinstance one sometimes tries to run on multiinstance hosts
[13:12:16] something like that?
[13:12:27] the other i'm pretty sure is a code bug where it keeps failing to access the mysql domain socket
[13:12:44] the first issue is probably a packaging and/or puppet bug
[13:12:46] nah, that is not a huge issue as it only happens on first install
[13:12:56] but the other is annoying
[13:13:37] I honestly don't remember the details, but I remember finding it
[13:13:45] then forgetting
[13:39:38] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) I believe this is what happened, after further investigation: 1. 2160c63c8bd7f66c73d639 (for T252331) and related was deployed, increasing the purge rate of the global_status_log table 2. 3 months later, after enough rows have...
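The comment above points at the optimizer statistics on global_status_log drifting after the purge rate increased; the "index" hints mentioned at 13:10 are one way to pin the plan regardless of what the statistics say. A minimal sketch, again assuming a simplified schema and an index named after the timestamp column (both assumptions, not tendril's real definitions):

    -- Inspect the cardinality estimates the optimizer is working from
    -- (ANALYZE TABLE, run at 08:41, refreshes them but did not change the plan here).
    SHOW INDEX FROM global_status_log;

    -- Pin the index choice instead of trusting the statistics; the index
    -- name 'stamp' is purely illustrative.
    SELECT server_id, MAX(stamp) AS last_seen
      FROM global_status_log FORCE INDEX (stamp)
     WHERE stamp > NOW() - INTERVAL 10 MINUTE
     GROUP BY server_id;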
[13:52:55] DBA: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) Current plan: ` +------+-------------+-------+-------+----------------------------+-------------+---------+-------+------+-------------- | id | select_type | table | type | possible_keys | key | key...
[13:58:49] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Completed successfully: ` kormat@db2125:~(0:0)$ time sudo mysqldump --single-transaction --all-databases > /dev/null real 249m58.312s user 132m25.215...
[14:01:37] ^indeed it was: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=25&orgId=1&var-server=db2125&var-datasource=thanos&var-cluster=mysql&from=1597912299929&to=1597932046507
[14:35:59] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (Andrew) Sorry about the slow response! Since you first opened this ticket I've moved all VMs off this puppetmaster; it's now set to role::spare::system. Anything you...
[14:39:48] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo) Thank you very much @andrew! Indeed, backup jobs have been automatically removed, so no need for any further action, except reverting from the alert ignore list....
[15:00:38] DBA, Patch-For-Review: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) p: Triage→Medium
[15:03:53] DBA, Patch-For-Review: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876 (jcrespo) The above patch should fix the alert, please have a look for obvious security bugs. A similar patch has to be ported to tendril, for both the tree and other views.
[15:12:29] jynus: i know absolutely nothing about php i'm afraid
[15:12:42] ha ha
[15:12:52] but you may know about sql injection
[15:13:12] I just want a pair of eyes checking for very obvious mistakes
[15:13:46] I tested it in the testing environment for dbtree, aka production
[15:14:25] the existing orm is not great but we can get away with it because we are the only users of it
[15:15:19] DBA, Patch-For-Review, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (jcrespo) Open→Resolved a: jcrespo
[15:15:37] dbtree has no parameter input, so the attack surface is very small
[15:15:50] unless, as I said, there are very obvious mistakes
[15:16:02] tendril is more flexible, but it is behind auth
[15:17:25] if you still don't feel comfortable, I can try to phish someone on sre
[15:17:50] i literally cannot evaluate this at all
[15:18:00] ok, np
[15:18:05] :-)
[15:18:06] so yeah i suggest you trick someone else into it :)
[17:34:11] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for lldwiki - https://phabricator.wikimedia.org/T259436 (nskaggs) a: Bstorm
[17:34:25] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (nskaggs) a: nskaggs
[19:15:50] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for lldwiki - https://phabricator.wikimedia.org/T259436 (Bstorm) Open→Resolved All scripts run. Seems to work well from Toolforge.
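The lldwiki storage-layer task above ends with a quick check from Toolforge. A minimal sketch of what such a smoke test against the wiki replicas might look like (the meta_p lookup and the <dbname>_p view naming follow the usual replica conventions, but treat the exact names as assumptions):

    -- Confirm the new wiki is registered in the replica metadata...
    SELECT dbname, url FROM meta_p.wiki WHERE dbname = 'lldwiki';
    -- ...and that its views actually answer queries.
    SELECT COUNT(*) FROM lldwiki_p.page;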
[19:16:36] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Bstorm) a: nskaggs→Bstorm Stealing this task from @nskaggs to get a successful run of a new script.
[19:22:17] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Bstorm) It failed on something that should have worked, and had worked before. I wonder why.
[19:39:10] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (nskaggs) I didn't see you stealing this @Bstorm :-)
[19:41:13] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Bstorm) a: Bstorm→nskaggs Giving task back to @nskaggs to finalize. The DNS step is done if you haven't done it, but the script would not have done an...
[19:53:27] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (nskaggs) Open→Resolved This should be fully complete now.
[19:54:28] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 (Urbanecm) ` MariaDB [thankyouwiki_p]> select count(*) from user; +----------+ | count(*) | +----------+ | 5 | +----------+ 1 row in set (0.01 sec) `...
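A possible follow-up to the count(*) check above would be to confirm the full set of replica views exists for the new wiki; a sketch, assuming the standard <dbname>_p layout:

    -- List the views created for thankyouwiki on the replicas (illustrative only).
    SHOW FULL TABLES IN thankyouwiki_p WHERE Table_type = 'VIEW';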