[06:07:01] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), and 2 others: Wikibase client description API module results in 15k selected rows with new term storage - https://phabricator.wikimedia.org/T246005 (10Marostegui) >>! In T246005#5916080, @Ladsgroup wrote:...
[06:08:24] 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Marostegui) I can see the battery - thank you ` root@db1095:~# hpssacli controller all show detail | grep -i battery No-Battery Write Cache: Disabled Battery/Capacitor Count: 1...
[06:18:16] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) >>! In T243963#5916919, @jcrespo wrote: > es1019 is just pending the last config push back to normal traffic weights (and reducing the master's). Done! Thank you for handling this
[07:29:24] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui)
[07:30:56] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) p:05Triage→03Medium
[07:31:26] jynus: just to get your attention ^ as this is a backup source
[07:31:38] is there any on-going stuff there that might have caused this?
[07:34:59] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Something happened to its master (db1122), there's a glitch of everything on it: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1582659145265&to=1582702345265&var-dc=eqiad%20prometheus%2...
[07:48:27] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) From those graphs there is another glitch that apparently happened around 22:06 yesterday, but that one didn't cause s2 replication to stall on db1140, from what I can see.
[07:54:05] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Nothing in db1122 HW logs
[07:57:42] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) db1122 network graph {F31631537}
[07:59:28] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) db1122 writes {F31631542}
[08:02:44] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Interestingly, there are no connection errors on logstash: https://logstash.wikimedia.org/goto/c3946ad3027be1f306f2981ebb7f4846
[08:06:06] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) So, given that db1122 is the primary master, I cannot see any other slave apart from db1140 failing to connect: ` Feb 26 07:18:11 db1140 mysqld[3685]: 2020-02-26 7:18:11 139851380549376 [ERROR] Slave...
[08:14:51] network didn't go to 0, maybe only the connection between master and replica went down?
[08:15:11] we had network failures like this before, only some internal routes failing
[08:15:46] https://phabricator.wikimedia.org/T246198#5918910
[08:16:00] but that might be normal from what I am seeing
[08:16:12] I would have expected more slaves failing indeed
[08:16:40] let's see if arzhel can give us more graphs from the switch side
[08:16:52] normal?
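A minimal sketch of the replica-side check behind the investigation above; per the log these backup sources run one mysqld instance per section, so this would be run against the s2 instance on db1140 (exact port/socket not shown in the log):

```sql
-- Run from the mysql client against the s2 instance on db1140.
SHOW SLAVE STATUS\G
-- Fields worth reading after an incident like the one above:
--   Slave_IO_Running / Slave_SQL_Running  -> both should be back to 'Yes'
--   Last_IO_Errno, Last_IO_Error          -> the reconnect/SSL error seen in the journal
--   Seconds_Behind_Master                 -> lag accumulated while the I/O thread was down
```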
[08:17:10] I've seen connection failures before, that is indeed normal
[08:17:26] but not failing to reconnect after a few retries
[08:17:54] No, what I am saying is that those drops in writes are kinda normal from the graphs, which also correlates with the fact that there were no logged errors
[08:18:06] In logstash, which I would expect if the master is having connection issues
[08:18:10] and there are no errors logged there
[08:18:40] it mentions ssl connection issues, could it be something unrelated to the network, e.g. the internal clock?
[08:18:55] yep
[08:19:08] it is a very specific error and only from that host
[08:19:41] other people on stackoverflow mention the clock as a possible reason
[08:20:19] I think that is being monitored, let me see if there is a log about clock corrections
[08:20:36] just to be sure, you restarted replication later and no issue, right?
[08:20:42] nope, I didn't
[08:20:52] It retried and succeeded
[08:21:00] but it alerted? strange
[08:21:09] yeah, cause it took 5 minutes
[08:21:15] wow
[08:21:16] Feb 26 07:18:11 db1140 mysqld[3685]: 2020-02-26 7:18:11 139851380549376 [ERROR] Slave I/O: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: SSL connection error: error:00000000:lib(0):func(0):reason(0), Internal MariaDB error code: 2026
[08:21:16] Feb 26 07:23:25 db1140 mysqld[3685]: 2020-02-26 7:23:25 139851380549376 [Note] Slave: connected to master 'repl@db1122.eqiad.wmnet:3306',replication resumed in log 'db1122-bin.001369' at position 786291425
[08:21:19] check the times
[08:21:39] let me find clock logs
[08:21:43] ok!
[08:22:01] wait, but that is db1140
[08:22:29] I thought the issue was some other host
[08:22:38] db1140 is on 10.1.44
[08:25:20] no, the issue only showed up on db1140
[08:27:30] there was a root login a few minutes before that data, from cumin, will check but maybe for snapshotting
[08:27:35] *date
[08:29:18] first error: Feb 26 07:15:09 db1140 mysqld[3685]: 2020-02-26 7:15:09 139851380549376 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
[08:30:47] not much else
[08:31:14] yeah
[08:31:27] and only db1140 from looking at the central logging
[08:31:32] did you touch es1019?
[08:31:41] I still had to repool it fully
[08:31:45] Yeah
[08:31:47] It is done already
[08:31:48] if not, I will do it before
[08:31:50] ah
[08:31:55] including master depool?
[08:31:58] yep
[08:32:00] ok, thanks
[08:32:09] then I can research this further
[08:32:25] I can take care of it, unless I see nothing, or I'll ask for help
[08:32:39] I am going to leave in like 10 minutes, I have a day off and I have worked a couple of hours already, so will be back in the evening to finish off
[08:32:46] no problem
[08:34:35] it is not the network of the host as otherwise the other instances would have also logged something
[08:34:47] yeah, I think so too
[08:34:50] but it could be the link between the 2 hosts
[08:35:05] check the task (if you haven't already) as I have posted there what I have researched so far
[08:35:12] thanks
[08:35:39] let me paste the previous error
[08:38:00] 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Marostegui) a:05Jclark-ctr→03jcrespo Assigning to Jaime to reflect that this is now pending his follow-up. Thank you John!
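The "retry-time: 60 maximum-retries: 86400" figures in the error quoted at 08:21 correspond to stock MariaDB replication settings; a sketch of where to read them, assuming defaults rather than anything WMF-specific:

```sql
-- How many reconnect attempts the I/O thread makes before giving up for good
-- (default 86400, matching "maximum-retries" in the log line above).
SHOW GLOBAL VARIABLES LIKE 'master_retry_count';

-- The 60-second "retry-time" is the Connect_Retry value shown by SHOW SLAVE STATUS;
-- it is set per connection with CHANGE MASTER ... MASTER_CONNECT_RETRY = 60
-- (mentioned here only to explain where the number comes from, not as something to run).
SHOW SLAVE STATUS\G
```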
[08:38:29] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10jcrespo) first error: ` Feb 26 07:15:09 db1140 mysqld[3685]: 2020-02-26 7:15:09 139851380549376 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013) ` Nothi...
[08:38:51] the root logins were the snapshots of s3 and x1 10 minutes before
[08:39:02] s2 was done hours ago
[08:39:19] unless there is IO or CPU saturation, it should not have caused issues
[08:42:00] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10jcrespo) There was an increase in load at that time, which led to an increase of network errors: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1140&var-datasource=eqiad%2...
[08:43:05] 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) Thanks, I will repopulate this host with some production data, which may help with T246198.
[09:32:21] marostegui: hey, I know you're busy so when you have time, it would be great if you could help me figure out the exact number for this batch size: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/574863/4/lib/includes/Store/Sql/Terms/PrefetchingItemTermLookup.php Anything would be good, like "go with N and adjust as necessary"
[09:33:14] this is blocking moving to more reads on the new term store (we read more from the new term store than the old one since yesterday)
[09:37:17] he is not busy, he is away, but I am around
[09:38:32] jynus: oh if he's away, can you help? especially since you gave the number for the old term store
[09:38:48] https://phabricator.wikimedia.org/T163544#3201562
[09:39:26] honestly, please let me remember the context, I commented that 3 years ago
[09:40:45] this is not super similar as it's a different table with different indexes and such
[09:41:06] The query I want to chunk up is https://phabricator.wikimedia.org/P10524
[09:41:53] so I would stick to 9 based on not approximating query plans
[09:41:58] but it depends on the query
[09:42:28] where do the ids come from?
[09:43:06] can't they be taken from another query so we avoid the extra client-server bandwidth?
[09:43:35] no they come from Lua calss
[09:43:39] *calls
[09:44:52] so I say run it, profile it and get some numbers: execution time with uncached data, rows read, etc.
[09:45:24] sure, this is one of the big ones: https://phabricator.wikimedia.org/P10524
[09:45:24] the problem with such queries is that if cached, they may be fast, but if uncached they may be very slow
[09:45:50] no numbers there :-)
[09:46:27] I mean this is the query :D
[09:47:26] If we chunk it to nine, the first batch would be SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin */ wbtl_id, wbtl_type_id, wbxl_language, wbx_text, wbit_item_id FROM `wbt_term_in_lang`, `wbt_text_in_lang`, `wbt_text`, `wbt_item_terms` WHERE (wbtl_text_in_lang_id=wbxl_id) AND (wbxl_text_id=wbx_id) AND wbit_item_id IN (976055, 976055, 595104, 595104, 7482576, 7482576, 206855, 33938, 33938) AND
[09:47:26] wbxl_language IN ('hu', 'en') AND (`wbit_term_in_lang_id` = wbtl_id)
[09:47:55] but it has to do it 500 times
[09:47:59] the nine was based on another query
[09:48:05] which had bad query plans
[09:48:12] I don't know this one
[09:48:19] you must get some stats
[09:48:38] yup that's why I'm asking for another number. How can I get some stats, what should I look for?
[09:49:14] EXPLAIN, show status like 'Hand%' and latency numbers
[09:49:34] (on cached and uncached rows)
[09:50:38] Sure
[09:51:57] have a way to test it on random valid ids so you can test it with different parameters
[09:52:15] maybe it only has issues with certain ranges or amounts of data
[09:53:07] I can get some in tendril
[09:53:54] I get this on a large batch:
[09:53:59] https://www.irccloud.com/pastebin/FvY8DQPn/
[09:54:15] (with 2k entities)
[09:54:40] how many rows does that return?
[09:54:58] it should return around 4k rows
[09:55:22] (two languages)
[09:55:33] get the handler status on an actual run, use the slow query replica
[09:55:36] maybe less, I don't think there are labels in all cases
[09:56:02] jynus: sorry for the stupid question but how can I get the handler status?
[09:56:06] connect -> run query -> SHOW STATUS like 'Hand%';
[09:56:39] thanks!
[09:57:16] documentation at https://www.slideshare.net/jynus/query-optimization-with-mysql-80-and-mariadb-103-the-basics/97
[09:58:05] with that we can start talking, but we must gather data first :-)
[09:59:24] Another stupid question: I can't find vslow as they are not in db-eqiad.php anymore. https://noc.wikimedia.org/db.php doesn't have anything either
[10:00:47] so, you should not know that, but you should have asked that earlier :-D
[10:01:05] Amir1: https://noc.wikimedia.org/dbconfig/eqiad.json
[10:02:00] Thanks!
[10:02:20] db1101:3318
[10:04:53] for some reason, the link to that is small and at the end of the page
[10:07:25] So this is the main query:
[10:07:33] https://www.irccloud.com/pastebin/cziiN5fy/
[10:09:09] This is for the batch of 50:
[10:09:12] https://www.irccloud.com/pastebin/5aqNnuv6/
[10:09:59] nope, it's wrong, the query has duplicate entries
[10:11:02] 36780 handler_read_next, which means index scanning rows, and 70K Handler_read_key (key lookups), using icp
[10:12:20] This is the batch of fifty
[10:12:23] https://www.irccloud.com/pastebin/FMDaw3eY/
[10:12:32] so, first of all, the query is badly written, which is offtopic here
[10:12:51] but I cannot even understand how that passed query review
[10:13:07] what is FROM `wbt_term_in_lang`, `wbt_text_in_lang`, `wbt_text`, `wbt_item_terms` ?????
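A sketch of the measurement loop jynus outlines at 09:49–09:56 (EXPLAIN plus handler counters and latency), assuming a session on the vslow replica db1101:3318 mentioned above and reusing the batch-of-9 query pasted at 09:47; it illustrates the workflow rather than the exact commands that were run:

```sql
-- Reset the session counters so the handler numbers reflect only the next query
-- (FLUSH STATUS needs the RELOAD privilege; otherwise note the counters before and after).
FLUSH STATUS;

-- Run the batched query exactly as pasted at 09:47 (here the batch of 9, duplicates included):
SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin */
       wbtl_id, wbtl_type_id, wbxl_language, wbx_text, wbit_item_id
FROM wbt_term_in_lang, wbt_text_in_lang, wbt_text, wbt_item_terms
WHERE wbtl_text_in_lang_id = wbxl_id
  AND wbxl_text_id = wbx_id
  AND wbit_item_id IN (976055, 976055, 595104, 595104, 7482576, 7482576, 206855, 33938, 33938)
  AND wbxl_language IN ('hu', 'en')
  AND wbit_term_in_lang_id = wbtl_id;

-- Key lookups vs. index range scans done by that one query:
SHOW SESSION STATUS LIKE 'Handler%';

-- Prefix the same statement with EXPLAIN to compare the plan across batch sizes (9, 50, 2000, ...).
```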
[10:13:55] they are normalization of the wb_terms entries
[10:14:07] that's why it has to join
[10:14:16] needs MOAR Join, implicit joins would be the equivalent in programming to reusing the same variable name for 5 different things
[10:14:45] let's rewrite it to use JOIN
[10:15:01] yeah, I see
[10:15:09] it is "dirty" how it is now
[10:15:14] not good practice at all
[10:15:21] and doing that leads to bugs
[10:15:26] SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin */ wbtl_id, wbtl_type_id, wbxl_language, wbx_text, wbit_item_id FROM `wbt_term_in_lang`, `wbt_text_in_lang`, `wbt_text`, `wbt_item_terms` WHERE (wbtl_text_in_lang_id=wbxl_id) AND (wbxl_text_id=wbx_id) AND wbit_item_id IN (15, 41, 96, 114, 117, 148, 155, 159, 183, 244, 252, 258, 467, 577, 657, 668, 801, 833, 843, 865, 902, 916, 928, 953, 962,
[10:15:26] 1009, 1020, 1032, 1033, 1065, 3542, 3787, 4412, 4917, 6256, 6655, 7159, 7502, 7785, 7809, 7825, 8441, 8449, 8475, 8673, 9043, 9051, 11191, 11229, 14005) AND wbxl_language IN ('hu', 'en') AND (`wbit_term_in_lang_id` = wbtl_id)
[10:15:27] because people forget the ON
[10:15:35] also let's separate ON and WHERE
[10:15:48] those quality of life changes are not only formatting
[10:15:56] this is the batch of 50, let's rewrite it and I'll apply it (I didn't write that query btw)
[10:16:29] yeah, not blaming you
[10:21:10] o/ I am back for around 1h or so
[10:21:31] Amir1: sorry but I have a hectic schedule today, I am taking half a day off - thanks Jaime for helping :)
[10:22:14] if you can just check alerts/graphs, so I can give my full attention to Amir1
[10:22:48] I was trying to handle db1095 & db1140, but I cannot just do everything :-/
[10:24:14] jynus: definitely!
[10:24:41] I am right now moving s3, and only s3, to db1095
[10:25:09] will leave s3 on db1095 and s2/x1 on db1140, that should solve most of our issues, I believe
[10:25:44] I forgot to say that I pooled special replicas on s8 with larger weight
[10:26:01] because there was a spike of load yesterday evening
[10:26:07] I would live it as is for now
[10:26:11] *leave
[10:26:56] yeah, saw that yesterday
[10:26:58] makes sense yeah
[10:27:06] they have weight 100, so that's good anyways
[10:27:48] I put them to 150
[10:28:12] on main
[10:28:31] Oh, then my browser is cached
[10:28:32] let me see
[10:28:42] yeah, now I see it
[10:29:39] not cdanis only wanted them to have weights larger than 100 so that percentage worked well
[10:29:43] note
[10:30:03] I know that
[10:30:10] I don't care if they end up with load 103 or 104 :-)
[10:30:45] we can reevaluate at a later time, but there was an issue after you left, so that is the only thing I could think about
[10:31:01] thanks
[10:31:09] sorry I was at daily
[10:31:11] what was s10 btw, I cannot remember
[10:31:16] wikitech
[10:31:28] cool
[10:31:50] Amir1: can we talk later?
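A sketch of the explicit-JOIN form being asked for above: the same tables and conditions as the 50-item batch pasted at 10:15, with the join conditions moved into ON clauses and only the filters left in WHERE; not a reviewed Wikibase patch:

```sql
SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin */
       wbtl_id, wbtl_type_id, wbxl_language, wbx_text, wbit_item_id
FROM wbt_item_terms
JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
JOIN wbt_text         ON wbxl_text_id = wbx_id
WHERE wbit_item_id IN (15, 41, 96, 114, 117, 148, 155, 159, 183, 244, 252, 258, 467, 577,
                       657, 668, 801, 833, 843, 865, 902, 916, 928, 953, 962, 1009, 1020,
                       1032, 1033, 1065, 3542, 3787, 4412, 4917, 6256, 6655, 7159, 7502,
                       7785, 7809, 7825, 8441, 8449, 8475, 8673, 9043, 9051, 11191, 11229, 14005)
  AND wbxl_language IN ('hu', 'en');
```

The optimizer should treat both forms the same; the point raised above is that the join conditions become reviewable at a glance and cannot silently fall out of the WHERE clause.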
[10:31:55] sure
[10:31:57] if this can wait a bit
[10:32:18] we have reduced DBA coverage and there are some things going on at the same time
[10:32:30] please get those numbers I wanted, to evaluate performance
[10:32:51] Sure
[10:32:51] put them on a ticket and I will get to those when I am less stressed 0:-D
[10:33:26] reads of 10000+ rows for fast queries are frowned upon
[10:33:37] I have been rereading all the stuff about db1122 and db1140, and I think (unless arzhel proves us wrong once he wakes up and provides network graphs and stuff) that the issue is within db1140 and not within db1122
[10:33:47] because very rarely one has to return more than 10000 rows and that means there is room for optimization
[10:33:49] Or the communication between the two of them, but not with the rest of the replicas
[10:33:50] which is good
[10:34:09] marostegui: my update on the ticket pointed to overload due to the backup schedule
[10:34:24] network overload?
[10:34:24] s3 now is "fast"
[10:34:27] or just general overload?
[10:34:29] to backup
[10:34:35] so it overlaps with x1
[10:34:48] and we happen to have both on the same source server
[10:34:56] does the timing match with backups generations?
[10:35:15] it starts indeed 10 minutes before the first error
[10:35:24] and if you look at host graph
[10:35:29] there is a spike of network errors
[10:35:30] uuuuh suspicious
[10:35:47] after all, it is transferring twice as fast as possible
[10:35:58] I am moving s3 away
[10:36:02] and that should take care of it
[10:36:07] that is my best explanation
[10:36:15] now that coincidentally db1095 is fixed
[10:36:34] out of precaution, I will use the previous snapshot, not this one
[10:37:10] we may also want to stop replication when backing up s3
[10:37:19] like we do for s1, s8 and s4
[10:40:10] marostegui: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1140&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=1582698824998&to=1582706133704
[10:40:29] see how it jumped to 2 dropped packets per second as the second overlapping backup started
[10:40:32] that definitely matches the first error, indeed
[10:40:48] so "general overload"
[10:41:14] yeah
[10:41:28] not the first time we see general overload causing packet drops
[10:41:29] db1095 should fix it
[10:42:14] bbu arrived right on time!
[10:42:24] I confess I am a bit more rough than usual with source hosts as errors there are not user errors
[10:42:30] what was the other for?
[10:43:09] db1084? that's a normal s4 core host
[10:43:30] I assume you are talking about what was the other host with BBU issues?
[10:43:35] yes, sorry
[10:44:00] yep, db1084 (which I powered off today for maintenance once eqiad wakes up) is an s4 core host with a broken BBU
[10:44:42] the other thing that I would like to know is db108(7?) wikidata compression progress
[10:45:14] if you can get, based on bytes written, a percentage of progress or so
[10:48:09] yeah
[10:48:14] I checked this morning
[10:48:19] I think it will be done in around 24h
[10:48:23] cool
[10:48:27] that will help
[10:48:35] I saw it by accident on a cumin screen
[10:48:39] will monitor it tomorrow
[10:48:50] yeah, my calculations are that the new table will be around 600GB
[10:48:55] anything I should know about that?
[10:48:58] and the temp table has now around 420G
[10:49:11] before pooling it, aside from the usual?
[10:49:14] e.g. pending work
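On "we may also want to stop replication when backing up s3 like we do for s1, s8 and s4": a sketch of what that looks like on a multi-instance backup source, with the caveat that this is an idea floated above, not an agreed procedure:

```sql
-- On the mysqld instance that replicates s3 on the backup source, before the s3 dump starts:
STOP SLAVE;

-- ... run the s3 logical backup / snapshot here ...

-- Once the dump has finished, let the instance catch up again and confirm it reconnected:
START SLAVE;
SHOW SLAVE STATUS\G
```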
[10:49:15] etc
[10:49:21] yeah, moving back labs hosts
[10:49:25] let's do that together
[10:49:29] on friday if you want
[10:49:29] yeah
[10:49:36] (I don't think it will be ready tomorrow anyways)
[10:49:40] as it needs to catch up also
[10:49:44] indeed
[10:49:51] ok, then I will wait until friday
[10:49:56] yeah, I am back friday
[10:50:00] but will start replication
[10:50:07] if I see it finished the alter
[10:51:23] db1087 used to be the vslow host, so its weight was 1, this was the depooling patch: https://phabricator.wikimedia.org/P10493
[10:51:28] I saw it
[10:51:54] it was not "normal" to see the current state
[10:52:04] so I guessed that was the case
[10:52:21] yeah
[10:52:22] but I won't rush the repooling
[10:52:37] I don't think it will be ready by tomorrow, no
[10:52:37] after the alter, the buffer pool will not be in a great state
[10:56:39] jynus: done now: https://phabricator.wikimedia.org/T246159#5919266 I go to a meeting, ping me once you're done. Thanks!
[11:00:46] will do
[12:14:40] jynus: I have a tentative fix for what broke SSO auth to tendril yesterday afternoon, can I test that in ~ 15min? I can revert quickly to the LDAP auth if it doesn't work
[12:16:51] yeah, actually now is a great moment
[12:16:57] because I am about to take a break
[12:17:05] feel free to break anything now :-D
[12:17:26] (hopefully it works, I want it to work)
[12:17:32] cool, will start in ~ 10m
[12:17:36] thanks!
[12:17:42] enjoy your break!
[12:17:50] yesterday late was bad timing
[12:17:55] because of the s8 issues, sorry
[12:18:01] sure, sure
[12:32:01] for when you're back: tendril works fine using CAS now (and also with a fresh (not formerly existing) session from a different browser), maybe do a quick smoke check and then I'll properly transfer my live-hacked change into Puppet
[12:32:45] not yet gone
[12:33:10] Waiting for idp.wikimedia.org...
[12:33:25] let me try on a private window
[12:34:14] both tests work
[12:34:20] see you
[12:34:25] cool, thanks
[13:18:34] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10ayounsi) Indeed, from https://librenms.wikimedia.org/graphs/to=1582722600/id=12170/type=port_bits/from=1582693200/ and https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1140&var-datasou...
[13:42:34] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Thanks @ayounsi - looks like this confirms that the backups generation caused some temporary issues here.
[13:44:14] jynus: do you want to try es4 on Tuesday next week at around 9AM UTC maybe?
[13:46:16] 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10jcrespo) 05Open→03Resolved a:03jcrespo Closing this as resolved, as this backup overlap should not happen again due to work on T244958. As a general idea, we should purchase all new DB hosts with 10G cards...
[13:46:27] let me see
[13:46:59] I had aimed for the switchover, but that is outdated
[13:47:09] so it should be good.
[13:47:10] yeah, that's not happening
[13:47:15] let me create a calendar
[13:47:19] ok, let me send an invite and block it on deployments
[13:47:25] ok, you do it
[13:47:25] No, don't worry, I will do that :)
[13:47:27] yep
[13:47:29] thank you
[13:48:04] I will try to do the task and the patchset too
[13:48:34] will we do an all day deployment thingy? or just the first production change then?
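For the db1087 compression-progress estimate discussed above (new table around 600GB, temp table around 420G so far), a rough way to read current table sizes from SQL; the wikidatawiki schema name and the wbt_ prefix are assumptions from context, and the in-place ALTER's intermediate copy may only be visible as a #sql file in the datadir rather than in information_schema:

```sql
-- Approximate on-disk size per Wikibase term table, in GiB.
SELECT table_name,
       ROUND((data_length + index_length) / POW(1024, 3), 1) AS size_gib
FROM information_schema.tables
WHERE table_schema = 'wikidatawiki'
  AND table_name LIKE 'wbt\_%'
ORDER BY size_gib DESC;
```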
[13:48:57] in other words, when are you aiming to complete the switch, along the week?
[13:49:26] I am aiming for es4 entirely yeah
[13:49:34] I would like to give it a few days before going for es5
[13:49:39] ok, understood
[13:49:55] if it goes super smooth, maybe we can do es5 the next day
[13:49:56] I don't need exact dates, just understanding the scope of tuesday work
[13:50:04] all cool :-D
[13:50:09] so for tuesday I would like to leave es4 fully deployed yeah
[13:50:13] cool
[13:50:16] :*
[13:56:58] 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) I have updated zarcillo to point prometheus to the right instances (CC @Marostegui to try to have that up to date, because I sometimes forget, and it is totally my fault for not...
[13:58:05] 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) I will run a data check on s3 and s2 and then consider this fixed. Maybe moving s2 on codfw to mirror data distribution?
[14:15:39] 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10Goal, 10Patch-For-Review: Enable es4 and es5 as writable new external store sections - https://phabricator.wikimedia.org/T246072 (10Marostegui) We are aiming to fully deploy es4 Tuesday 3rd March, starting at 09:00 AM UTC. The idea would be to fo...
[21:06:01] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui) db1087 finished and it is now catching up
[21:06:11] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui)
[21:57:08] 10DBA, 10Operations: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Jclark-ctr)
[21:59:12] * addshore has things to discuss in the morning :)