[08:22:16] there are some connection errors on db1082, db1098, maybe weights or latency should be checked
[08:22:59] maybe because of the depools
[08:23:11] just a FYI
[08:27:43] There are more locks on wikidata, but I would assume that is because of the WRITE_BOTH from Amir1
[08:36:16] Yeah, it should slowly get better.
[08:36:40] As caches warm up
[08:40:03] thanks
[08:59:28] FYI after the openstack upgrade yesterday, nova-conductor is again running out of mysql connections
[08:59:29] T234876
[08:59:30] T234876: nova-conductor running out of mysql connections - https://phabricator.wikimedia.org/T234876
[08:59:43] we have a fix in place, but sharing here just FYI
[08:59:53] arturo: why is it requiring more?
[09:00:11] I don't know; new version new behaviour?
[09:00:42] not nice, but glad you have a fix in place :)
[09:01:25] also AFAIK openstack in general is a great DB connection consumer
[09:06:23] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:10:22] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui) We have to compress also the `logging` table once it gets its partitioning removed.
[09:12:29] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:19:15] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:21:14] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:22:02] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[09:27:45] I remember you had a handy graph for openstack DB connections, right?
[09:31:41] DBA, Cloud-VPS, cloud-services-team (Kanban): CloudVPS: m5-master databases for openstack may require re-enconding - https://phabricator.wikimedia.org/T234830 (aborrero)
[09:32:02] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1570089136586&to=1570095339868&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104
[09:32:04] this is the master
[09:32:16] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104
[09:33:22] seems to confirm what I said earlier
[09:33:47] (we have more connections now, that is)
[09:33:55] yep, looks like it
[09:34:14] some openstack docs suggest we should use a 2000 connection limit on the DB side
[09:34:38] that's a bit crazy :-/
[09:34:40] what do you think about that from the DB point of view?
[09:35:09] nova now has 100 connections
[09:35:11] (limit)
[09:35:24] 2000 sounds just crazy to me
[09:37:04] I have no previous knowledge in this field. Is that something a database cannot handle?
[09:37:39] It can probably handle it, but it is shared with other things (like wikitech), and going from 100 to 2000 is just insane. I think we'd need a more realistic limit
[09:37:56] 2000 running connections is just crazy
[09:38:46] I have another question. Perhaps we raise the limit from, let's say, 100 to 200 and nova just consumes them with no visible improvement?
[09:38:58] arturo: we already have more connections there than in an enwiki slave
[09:39:32] arturo: yes, I am sure if we raise it, nova will use them even at idle. That's why increasing connections under problems normally doesn't solve things, but makes them worse
[09:39:45] that's why we normally say "no" to: hey we are getting too many connections errors
[09:39:56] Increasing connections will make things worse
[09:40:12] This is an enwiki slave: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104
[09:40:26] So nova has almost 3x more connections
[09:40:54] how many DB slaves like this do you have?
[09:40:57] Under problems, I'd rather have the host giving "too many connections" than crashing (as we've seen in the past)
[09:41:37] for m5? everything goes to the master
[09:41:47] for enwiki I mean
[09:41:50] A few
[09:43:01] I also see a 2x in QPS in m5 in the 2 dashboards you shared
[09:43:33] does that number include schema changes?
[09:44:30] I am not sure what you mean, the QPS are the same, but I don't understand your point
[09:44:55] I think asking to raise connections from 100 to 2000 is a bit strange
[09:45:19] I'm looking at the top left panel "QPS"
[09:45:27] for db1133
[09:45:41] Yes, but what's the point?
[09:46:14] I'm not trying to make a point :-) I'm just trying to understand what's going on with the new openstack version
[09:46:44] arturo: I can provide you the processes that are currently running on the master, if that helps
[09:47:05] yes please
[09:47:13] or share the cmdline to generate them
[09:47:19] yep
[09:49:12] arturo: https://phabricator.wikimedia.org/P9260
[09:50:58] "sleep" means the connection is doing nothing?
[09:51:15] yeah, and it is bad practice to leave stuff hanging there
[09:51:23] do your stuff, close connection
[09:51:50] I guess opening a connection costs a lot of time, right?
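Editor's note on the "sleep" connections discussed at 09:50: a minimal sketch of how idle connections could be tallied per user from rows shaped like `SHOW PROCESSLIST` output. The rows below are invented for illustration (documentation IP ranges, made-up times); they are not the contents of P9260, and the helper name `sleeping_by_user` is hypothetical.

```python
from collections import Counter

# Hypothetical rows shaped like SHOW PROCESSLIST output (User, Host, Command, Time).
# Invented sample data, NOT taken from the paste linked in the log.
rows = [
    {"User": "nova", "Host": "192.0.2.10:51234", "Command": "Sleep", "Time": 120},
    {"User": "nova", "Host": "192.0.2.10:51240", "Command": "Query", "Time": 0},
    {"User": "nova", "Host": "192.0.2.11:40112", "Command": "Sleep", "Time": 45},
    {"User": "otherapp", "Host": "192.0.2.20:33000", "Command": "Sleep", "Time": 3},
]


def sleeping_by_user(processlist, min_idle_seconds=0):
    """Count connections in the Sleep state per user.

    Only rows idle for at least min_idle_seconds are counted.
    """
    counts = Counter()
    for row in processlist:
        if row["Command"] == "Sleep" and row["Time"] >= min_idle_seconds:
            counts[row["User"]] += 1
    return dict(counts)


print(sleeping_by_user(rows))
print(sleeping_by_user(rows, min_idle_seconds=30))
```

Raising `min_idle_seconds` narrows the tally to connections that have been held open without work for a while, which is the pattern called bad practice above.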
[09:53:35] BTW this is the patch that andrew merged to reduce nova DB connections https://gerrit.wikimedia.org/r/c/operations/puppet/+/541407
[09:54:52] there is indeed a reduction in connections at around the time of the merge
[09:54:57] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104&from=1570457315261&to=1570527131375&panelId=37&fullscreen
[09:55:20] arturo: As I said, we can increase the number a bit (we now have a more powerful host than we used to) but going from 100 to 2000 is a bit extreme
[09:56:14] let's wait. I don't see any more leaks or errors this morning
[09:57:13] ok!
[09:58:32] do you know how many QPS a single DB connection can handle?
[09:58:43] depending on the query?
[09:58:57] that's hard to say indeed
[09:59:11] a connection should only be able to handle one query at a time, no? :)
[09:59:47] so, shall I understand the 100 connection limit as "nova can do 100 queries in parallel"?
[10:00:03] arturo: yes, but the nova user is allowed from various IPs I believe
[10:00:25] other than cloudcontrol servers?
[10:00:47] | nova | 208.80.154.132 | 100 |
[10:00:52] | nova | 208.80.154.23 | 100 |
[10:00:56] | nova | 208.80.154.92 | 100 |
[10:01:03] so 100 connections for each of those IPs
[10:01:11] and yes, 100 connections in parallel
[10:01:34] I don't know what 208.80.154.92 is
[10:01:42] we may need to review grants/acls
[10:02:02] +1
[10:02:07] If we can remove that IP, let me know
[10:04:01] what's the connection limit in enwiki again?
[10:05:03] ?
[10:05:45] you said nova uses 3x more connections than enwiki
[10:06:00] arturo, per host, no more than 64 connections can run in parallel
[10:06:09] because otherwise they get queued
[10:06:40] oh I see, so you have many hosts with lower limits
[10:06:44] and around that size, running more queries will result in slower query throughput
[10:07:25] obviously it depends on the queries, but 32-128 is the server side pool based on testing
[10:07:58] arturo: https://mariadb.com/kb/en/library/thread-pool-system-status-variables/#thread_pool_size
[10:08:19] we allow for more connections so we can kill them after a short spike
[10:08:33] but we do not allow more than those running connections on production
[10:11:30] and probably they have similar limits on misc, but I would have to check
[10:14:27] 👍
[10:26:37] I have no idea how to do T234830, or if that causes downtime, etc
[10:26:37] T234830: CloudVPS: m5-master databases for openstack may require re-enconding - https://phabricator.wikimedia.org/T234830
[11:11:27] arturo: that ticket has almost no information, maybe it is a good idea to start gathering affected databases, tables, desired encoding etc
[11:11:42] ok, will do
[11:12:11] ta
[11:12:40] DBA, Operations, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) I got finally the director running, but sadly it won't start with no devices or clients provisioned, so I created a duplicate of the ones pup...
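Editor's note on the 10:06 to 10:08 exchange (many connections allowed, but only a capped number running, the rest queued): a small analogy using Python's standard `ThreadPoolExecutor`. This is not MariaDB's thread pool implementation; `POOL_SIZE`, `CLIENTS`, and `fake_query` are stand-ins chosen for illustration, and the sizes are scaled down from the 64 mentioned in the log.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4   # stand-in for the server-side cap on running queries
CLIENTS = 12    # stand-in for connected clients that all submit a query

lock = threading.Lock()
running = 0     # queries executing right now
peak = 0        # highest value of `running` observed


def fake_query(i):
    """Pretend to run a query and record how many run at the same time."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.02)  # simulated work
    with lock:
        running -= 1
    return i


# All CLIENTS tasks are accepted, but at most POOL_SIZE execute concurrently;
# the remainder wait in the executor's queue until a worker frees up.
with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    results = list(pool.map(fake_query, range(CLIENTS)))

print(f"completed={len(results)} peak_running={peak}")
```

Every submitted task completes, yet the observed peak never exceeds the pool size, which mirrors the distinction drawn above between allowed connections and running connections.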
[11:32:34] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[12:31:12] 6 complaints about es1012 from xmldumps
[12:31:42] I chatted about those things with ariel last week about another es host
[13:12:03] DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Ottomata) > bacula backups for the analytics databases and the snapshot for the log database should be enough for this use case Q, will the bacula backups also include the `log` database? Might...
[13:16:15] DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Marostegui) >>! In T234826#5555449, @Ottomata wrote: >> bacula backups for the analytics databases and the snapshot for the log database should be enough for this use case > Q, will the bacula ba...
[13:17:50] DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Ottomata) Oh, I think it was a Q for Luca about how we intended to set that up. I assume we can do it either way. We wouldn't //have// to back up `log` to Bacula if it is too large since we'd h...
[13:19:36] DBA, Analytics: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) If the log db can be stored in Bacula it would be great! Otherwise HDFS is fine in my opinion..
[13:44:25] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[14:59:28] should I file a ticket for https://logstash.wikimedia.org/goto/b642781f0c9943e59b05ba13fe904d44 or is it "expected"?
[14:59:43] it is not a lot of them
[14:59:53] but I don't know if it is a regression
[15:00:22] or maybe it is within the expected range because of batch conversion
[15:00:33] I see no SAL at 0:00
[15:01:25] it seems some trend started at 2019-10-08T00:00:02
[15:02:17] ^Amir1 sorry to ping you, but if it is expected I can avoid an unnecessary ticket
[15:02:46] All good, let me check
[15:02:56] it is not too worrying
[15:03:21] but I prefer to check for (potentially new) bad patterns
[15:03:51] that doesn't look like anything we started but definitely should be tracked
[15:04:00] ok, then I will file a task
[15:04:05] (we don't deploy at 00:00 UTC :D)
[15:04:09] maybe e.g. something got started at that time
[15:04:25] maybe it could be a cron, e.g. backups
[15:04:29] even the maintenance script restarts every :30
[15:04:44] I will file a task, it is not high importance
[15:04:44] yeah, it can be backups or dumpers
[15:04:55] but I prefer to have it filed
[15:06:35] (while dumpers should not cause an insert)
[15:07:25] who knows at this time! :-P
[15:08:07] it wouldn't be the weirdest thing I've discovered in my archeological excursions on a database :-D
[15:08:11] * Amir1 imagines sleepy Ariel :D
[15:08:35] besides tmp indexes?
[15:08:48] less jokingly, it could be some read pattern that interacts in some strange way with writes
[15:09:09] e.g. we had some issues with select for update after table refactoring
[15:15:45] it's likely, also some caches can get expired etc.
[15:16:26] yep
[16:19:09] DBA, Cloud-VPS, cloud-services-team (Kanban): CloudVPS: m5-master databases for openstack may require re-enconding - https://phabricator.wikimedia.org/T234830 (bd808) utf8 is probably not the 'right' encoding. Mysql's utf8 is only capable of 3-byte code points. latin1 and utf8mb4 would both work for 4...
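Editor's note on the 16:19 comment about MySQL's `utf8` being limited to 3-byte code points: a minimal sketch showing which characters need 4 bytes in UTF-8 and therefore would not fit in a 3-byte-per-character charset. The helper names `utf8_len` and `fits_three_byte_utf8` are hypothetical, and the check only models the byte-length limit, not any other server behaviour.

```python
def utf8_len(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text):
    """True if every character encodes to at most 3 bytes in UTF-8.

    This models only the per-character byte limit mentioned in the log.
    """
    return all(utf8_len(ch) <= 3 for ch in text)


# Sample characters: ASCII, Latin letter with accent, euro sign, emoji.
for name, ch in [("a", "a"), ("e-acute", "\u00e9"),
                 ("euro", "\u20ac"), ("emoji", "\U0001F600")]:
    print(name, utf8_len(ch), fits_three_byte_utf8(ch))
```

Characters outside the Basic Multilingual Plane, such as most emoji, are the ones that take 4 bytes, so text containing them is what a 3-byte limit would reject or truncate.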
[18:27:20] DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic)
[18:29:26] DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic) Apologies for splatter-gunning the tags; this is very much a cross-team ques...
[18:31:53] DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic)
[20:03:46] DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (Pelagic) Related tasks: * {T230185} – relates to edit summary not displayed nor edit...
[22:33:17] DBA, Contributors-Team, Editing Design, MobileFrontend, and 5 others: Varying approaches to section names in edit summaries (mobile vs desktop and visual vs wikitext) - https://phabricator.wikimedia.org/T234982 (matmarex) One more related task, about the same kind of issue in "new" wikitext edito...