[08:22:16] <jynus>	 there are some connection errors on db1082, db1098, maybe weights or latency should be checked
[08:22:59] <marostegui>	 maybe because of the depools
[08:23:11] <jynus>	 just a FYI
[08:27:43] <marostegui>	 There are more locks on wikidata, but I would assume that is because of the WRITE_BOTH from Amir1 
[08:36:16] <Amir1>	 Yeah, it should slowly get better. 
[08:36:40] <Amir1>	 As caches warm up
[08:40:03] <marostegui>	 thanks
[08:59:28] <arturo>	 FYI after the openstack upgrade yesterday, nova-conductor is again running out of mysql connections
[08:59:29] <arturo>	 T234876
[08:59:30] <stashbot>	 T234876: nova-conductor running out of mysql connections - https://phabricator.wikimedia.org/T234876
[08:59:43] <arturo>	 we have a fix in place, but sharing here just FYI
[08:59:53] <marostegui>	 arturo: why is it requiring more?
[09:00:11] <arturo>	 I don't know; new version new behaviour?
[09:00:42] <marostegui>	 not nice, but glad you have a fix in place :)
[09:01:25] <arturo>	 also AFAIK openstack in general is a great DB connection consumer
[09:06:23] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:10:22] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui) We have to compress also the `logging` table once it gets its partitioning removed.
[09:12:29] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:19:15] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:21:14] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:22:02] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:27:45] <arturo>	 I remember you had a handy graph for openstack DB connections, right?
[09:31:41] <wikibugs>	 DBA, Cloud-VPS, cloud-services-team (Kanban): CloudVPS: m5-master databases for openstack may require re-enconding - https://phabricator.wikimedia.org/T234830 (aborrero)
[09:32:02] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1570089136586&to=1570095339868&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104
[09:32:04] <marostegui>	 this is the master
[09:32:16] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104
[09:33:22] <arturo>	 seems to confirm what I said earlier
[09:33:47] <arturo>	 (we have more connections now, that is)
[09:33:55] <marostegui>	 yep, looks like 
[09:34:14] <arturo>	 some openstack docs suggest we should use a 2000-connection limit on the DB side
[09:34:38] <marostegui>	 that's a bit crazy :-/
[09:34:40] <arturo>	 what do you think about that from the DB point of view?
[09:35:09] <marostegui>	 nova has now 100 connections
[09:35:11] <marostegui>	 (limit)
[09:35:24] <marostegui>	 2000 sounds just crazy to me
[09:37:04] <arturo>	 I have no previous knowledge in this field. Is that something a database can not handle?
[09:37:39] <marostegui>	 It can probably handle it, but the host is shared with other things (like wikitech). Going from 100 to 2000 is just insane, and I think we'd need a more realistic limit
[09:37:56] <marostegui>	 2000 running connections is just crazy
[09:38:46] <arturo>	 I have another question. Perhaps we raise the limit from, let's say, 100 to 200 and nova just consumes them with no visible improvement?
[09:38:58] <marostegui>	 arturo: we already have more connections there than on an enwiki slave
[09:39:32] <marostegui>	 arturo: yes, I am sure if we raise it, nova will use them even at idle; that's why increasing connections when there are problems normally doesn't solve things, but makes them worse
[09:39:45] <marostegui>	 that's why we normally say "no" to: hey we are getting too many connections errors
[09:39:56] <marostegui>	 Increasing connections will make things worse
[09:40:12] <marostegui>	 This is an enwiki slave: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104
[09:40:26] <marostegui>	 So nova has almost 3x more connections
[09:40:54] <arturo>	 how many DB slaves like this do you have?
[09:40:57] <marostegui>	 Under problems, I'd rather have the host giving "too many connections" than crashing (as we've seen in the past)
[09:41:37] <marostegui>	 for m5? everything goes to the master
[09:41:47] <arturo>	 for enwiki I mean
[09:41:50] <marostegui>	 A few
[09:43:01] <arturo>	 I also see a 2x in QPS in m5 in the 2 dashboards you shared
[09:43:33] <arturo>	 does that number include schema changes?
[09:44:30] <marostegui>	 I am not sure what you mean, the QPS are the same, but I don't understand your point
[09:44:55] <marostegui>	 I think asking to raise connections from 100 to 2000 is a bit strange
[09:45:19] <arturo>	 I'm looking at the top left panel "QPS"
[09:45:27] <arturo>	 for db1133
[09:45:41] <marostegui>	 Yes, but what's the point?
[09:46:14] <arturo>	 I'm not trying to make a point :-) I'm just trying to understand what's going on with the new openstack version
[09:46:44] <marostegui>	 arturo: I can provide you the processes that are currently running on the master, if that helps
[09:47:05] <arturo>	 yes please
[09:47:13] <arturo>	 or share the cmdline to generate them
[09:47:19] <marostegui>	 yep
[09:49:12] <marostegui>	 arturo: https://phabricator.wikimedia.org/P9260
[09:50:58] <arturo>	 "sleep" means the connection is doing nothing?
[09:51:15] <marostegui>	 yeah, and it is bad practice to leave stuff hanging there
[09:51:23] <marostegui>	 do your stuff, close connection
[09:51:50] <arturo>	 I guess opening a connection costs a lot of time, right?
[09:53:35] <arturo>	 BTW this is the patch that andrew merged to reduce nova DB connections https://gerrit.wikimedia.org/r/c/operations/puppet/+/541407
[09:54:52] <marostegui>	 there is indeed a reduction in connections at around the time of the merge
[09:54:57] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104&from=1570457315261&to=1570527131375&panelId=37&fullscreen
[09:55:20] <marostegui>	 arturo: As I said, we can increase the number a bit (we now have a more powerful host than we used to) but going from 100 to 2000 is a bit extreme
[09:56:14] <arturo>	 let's wait. I don't see any more leaks or errors this morning
[09:57:13] <marostegui>	 ok!
[09:58:32] <arturo>	 do you know how many QPS a single DB connection can handle?
[09:58:43] <arturo>	 depending on the query?
[09:58:57] <marostegui>	 that's hard to say indeed
[09:59:11] <marostegui>	 a connection should only be able to handle one query at a time, no? :)
[09:59:47] <arturo>	 so, shall I understand the 100 connection limit as "nova can do 100 queries in parallel"?
[10:00:03] <marostegui>	 arturo: yes, but the nova user is allowed from various IPs I believe
[10:00:25] <arturo>	 other than cloudcontrol servers?
[10:00:47] <marostegui>	 | nova | 208.80.154.132         |                  100 |
[10:00:52] <marostegui>	 | nova | 208.80.154.23          |                  100 |
[10:00:56] <marostegui>	 | nova | 208.80.154.92          |                  100 |
[10:01:03] <marostegui>	 so 100 connections for each of those IPs
[10:01:11] <marostegui>	 and yes, 100 connections in parallel
[10:01:34] <arturo>	 I don't know what 208.80.154.92 is
[10:01:42] <arturo>	 we may need to review grants/acls
[10:02:02] <marostegui>	 +1
[10:02:07] <marostegui>	 If we can remove that IP, let me know
[10:04:01] <arturo>	 what's the connection limit in enwiki again?
[10:05:03] <marostegui>	 ?
[10:05:45] <arturo>	 you said nova uses 3x more connections than enwiki
[10:06:00] <jynus>	 arturo, per host, no more than 64 connections can run in parallel
[10:06:09] <jynus>	 because otherwise they get queued
[10:06:40] <arturo>	 oh I see, so you have many hosts with lower limits
[10:06:44] <jynus>	 and around that size, running more queries will result in slower query throughput
[10:07:25] <jynus>	 obviously it depends on the queries, but 32-128 is the server-side pool size based on testing
[10:07:58] <marostegui>	 arturo: https://mariadb.com/kb/en/library/thread-pool-system-status-variables/#thread_pool_size
[10:08:19] <jynus>	 we allow for more connections so we can kill them after a short spike
[10:08:33] <jynus>	 but we do not allow more than those running connections on production
[10:11:30] <jynus>	 and probably they have similar limits on misc, but I would have to check
[10:14:27] <arturo>	 👍
[10:26:37] <arturo>	 I have no idea how to do T234830, or if that causes downtime, etc
[10:26:37] <stashbot>	 T234830: CloudVPS: m5-master databases for openstack may require re-enconding - https://phabricator.wikimedia.org/T234830
[11:11:27] <marostegui>	 arturo: that ticket has almost no information, maybe it is a good idea to start gathering affected databases, tables, desired encoding etc
[11:11:42] <arturo>	 ok, will do
[11:12:11] <marostegui>	 ta
[11:12:40] <wikibugs>	 DBA, Operations, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) I got finally the director running, but sadly it won't start with no devices or clients provisioned, so I created a duplicate of the ones pup...
[11:32:34] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[12:31:12] <jynus>	 6 complaints about es1012 from xmldumps
[12:31:42] <marostegui>	 I chatted about those things with ariel last week about another es host
[13:12:03] <wikibugs>	 DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Ottomata) > bacula backups for the analytics databases and the snapshot for the log database should be enough for this use case Q, will the bacula backups also include the `log` database?  Might...
[13:16:15] <wikibugs>	 DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Marostegui) >>! In T234826#5555449, @Ottomata wrote: >> bacula backups for the analytics databases and the snapshot for the log database should be enough for this use case > Q, will the bacula ba...
[13:17:50] <wikibugs>	 DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Ottomata) Oh, I think it was a Q for Luca about how we intended to set that up.  I assume we can do it either way.  We wouldn't //have// to back up `log` to Bacula if it is too large since we'd h...
[13:19:36] <wikibugs>	 DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) If the log db can be stored in Bacula it would be great! Otherwise HDFS is fine in my opinion..
[13:44:25] <wikibugs>	 DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[14:59:28] <jynus>	 should I file a ticket for https://logstash.wikimedia.org/goto/b642781f0c9943e59b05ba13fe904d44 or is it "expected"?
[14:59:43] <jynus>	 it is not a lot of them
[14:59:53] <jynus>	 but I don't know if it is a regression
[15:00:22] <jynus>	 or maybe it is within the expected range because of batch conversion
[15:00:33] <jynus>	 I see no SAL at 0:00
[15:01:25] <jynus>	 it seems some trend started at 2019-10-08T00:00:02
[15:02:17] <jynus>	 ^Amir1 sorry to ping you, but if it is expected I can avoid an unnecessary ticket
[15:02:46] <Amir1>	 All good, let me check
[15:02:56] <jynus>	 it is not too worrying
[15:03:21] <jynus>	 but I prefer to check for (potentially new) bad patterns
[15:03:51] <Amir1>	 that doesn't look like anything we started but definitely should be tracked 
[15:04:00] <jynus>	 ok, then I will file a task
[15:04:05] <Amir1>	 (we don't deploy at 00:00 UTC :D)
[15:04:09] <jynus>	 maybe e.g. something got started at that time
[15:04:25] <jynus>	 maybe it could be a cron, e.g. backups
[15:04:29] <Amir1>	 even the maintenance script restarts every :30
[15:04:44] <jynus>	 I will file a task, it is not high importance
[15:04:44] <Amir1>	 yeah, it can be backups or dumpers
[15:04:55] <jynus>	 but I prefer to have it filed
[15:06:35] <Amir1>	 (while dumpers should not cause an insert)
[15:07:25] <jynus>	 who knows at this time! :-P
[15:08:07] <jynus>	 it wouldn't be the weirdest thing I've discovered in my archeological excursions on a database :-D
[15:08:11] * Amir1 imagines sleepy Ariel :D
[15:08:35] <Amir1>	 besides tmp indexes?
[15:08:48] <jynus>	 less jokingly, it could be some read pattern that interacts in some strange way with writes
[15:09:09] <jynus>	 e.g. we had some issues with select for update after table refactoring
[15:15:45] <Amir1>	 it's likely, also some caches can get expired etc.
[15:16:26] <jynus>	 yep
[16:19:09] <wikibugs>	 DBA, Cloud-VPS, cloud-services-team (Kanban): CloudVPS: m5-master databases for openstack may require re-enconding - https://phabricator.wikimedia.org/T234830 (bd808) utf8 is probably not the 'right' encoding. Mysql's utf8 is only capable of 3-byte code points. latin1 and utfmb4 would both work for 4...
[18:27:20] <wikibugs>	 DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic)
[18:29:26] <wikibugs>	 DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic) Apologies for splatter-gunning the tags; this is very much a cross-team ques...
[18:31:53] <wikibugs>	 DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic)
[20:03:46] <wikibugs>	 DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic) Related tasks:  * {T230185} – relates to edit summary not displayed nor edit...
[22:33:17] <wikibugs>	 DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (matmarex) One more related task, about the same kind of issue in "new" wikitext edito...