[06:07:01] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), and 2 others: Wikibase client description API module results in 15k selected rows with new term storage - https://phabricator.wikimedia.org/T246005 (10Marostegui) >>! In T246005#5916080, @Ladsgroup wrote:...
[06:08:24] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Marostegui) I can see the battery - thank you ` root@db1095:~# hpssacli  controller all show detail | grep -i battery    No-Battery Write Cache: Disabled    Battery/Capacitor Count: 1...
[06:18:16] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) >>! In T243963#5916919, @jcrespo wrote: > es1019 is just pending the last config push back to normal traffic weights (and reducing the master's).  Done! Thank you for handling this
[07:29:24] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui)
[07:30:56] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) p:05Triage→03Medium
[07:31:26] <marostegui>	 jynus: just to get your attention ^ as this is a backup source
[07:31:38] <marostegui>	 is there any on-going stuff there that might have cause this?
[07:34:59] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Something happened to its master (db1122), there's a glitch of everything on it: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1582659145265&to=1582702345265&var-dc=eqiad%20prometheus%2...
[07:48:27] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) From those graphs there is another glitch that apparently happened around 22:06 yesterday, but that one didn't cause s2 replication to stall on db1140, from what I can see.
[07:54:05] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Nothing db1122 HW logs
[07:57:42] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) db1122 network graph {F31631537}
[07:59:28] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) db1122 writes {F31631542}
[08:02:44] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Interestingly, there are no connection errors on logstash: https://logstash.wikimedia.org/goto/c3946ad3027be1f306f2981ebb7f4846
[08:06:06] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) So, given that db1122 is the primary master, I cannot see any other slave apart from db1140 failing to connect: ` Feb 26 07:18:11 db1140 mysqld[3685]: 2020-02-26  7:18:11 139851380549376 [ERROR] Slave...
[08:14:51] <jynus>	 network didn't go to 0, maybe only the connection between master and replica went down?
[08:15:11] <jynus>	 we had network failures like this before, only some internal routes failing
[08:15:46] <marostegui>	 https://phabricator.wikimedia.org/T246198#5918910
[08:16:00] <marostegui>	 but that might be normal from what I am seeing
[08:16:12] <marostegui>	 I would have expected more slaves failing indeed
[08:16:40] <marostegui>	 let's see if arzhel can give us more graphs from the switch side
[08:16:52] <jynus>	 normal?
[08:17:10] <jynus>	 I've seen connection failures before, that is indeed normal
[08:17:26] <jynus>	 but not not reconnecting after a few retries
[08:17:54] <marostegui>	 No, what I am saying is that those drops in writes are kinda normal from the graphs, which also correlates with the fact that there were no logged errors
[08:18:06] <marostegui>	 In logstash, which I would expect if the master is having connection issues
[08:18:10] <marostegui>	 and there are no errors logged there
[08:18:40] <jynus>	 it mentions ssl connection issues, could it be something unrelated to network, like e.g. internal clock
[08:18:55] <marostegui>	 yep
[08:19:08] <marostegui>	 it is a very specific error and only from that host
[08:19:41] <jynus>	 other people on stackoverflow mention the clock as a possible reason
[08:20:19] <jynus>	 I think that is being monitored, let me see if there is log about clock corrections
[08:20:36] <jynus>	 just to be sure, you restarted replication later an no issue, right?
[08:20:42] <marostegui>	 nope, I didn't
[08:20:52] <marostegui>	 It retried and succedeed
[08:21:00] <jynus>	 but it alerted? strange
[08:21:09] <marostegui>	 yeah, cause it took 5 minutes
[08:21:15] <jynus>	 wow
[08:21:16] <marostegui>	 Feb 26 07:18:11 db1140 mysqld[3685]: 2020-02-26  7:18:11 139851380549376 [ERROR] Slave I/O: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0), Internal MariaDB error code: 2026
[08:21:16] <marostegui>	 Feb 26 07:23:25 db1140 mysqld[3685]: 2020-02-26  7:23:25 139851380549376 [Note] Slave: connected to master 'repl@db1122.eqiad.wmnet:3306',replication resumed in log 'db1122-bin.001369' at position 786291425
[08:21:19] <marostegui>	 check the times
[08:21:39] <jynus>	 let me find clock logs
[08:21:43] <marostegui>	 ok!
[08:22:01] <jynus>	 wait, but that is db1140
[08:22:29] <jynus>	 I though the issue was some other host
[08:22:38] <jynus>	 db1140 is on 10.1.44
[08:25:20] <marostegui>	 no, the issue only showed up on db1140
[08:27:30] <jynus>	 there was a root login a few minutes before that data, from cumin, will check but maybe for snapshoting
[08:27:35] <jynus>	 *date
[08:29:18] <jynus>	 first error: Feb 26 07:15:09 db1140 mysqld[3685]: 2020-02-26  7:15:09 139851380549376 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
[08:30:47] <jynus>	 not much else
[08:31:14] <marostegui>	 yeah 
[08:31:27] <marostegui>	 and only db1140 from looking at the central logging
[08:31:32] <jynus>	 did you touch es1019?
[08:31:41] <jynus>	 I had pending to repool it fully
[08:31:45] <marostegui>	 Yeah
[08:31:47] <marostegui>	 It is done already
[08:31:48] <jynus>	 if not, I will do it before
[08:31:50] <jynus>	 ah
[08:31:55] <jynus>	 including master depool?
[08:31:58] <marostegui>	 yep
[08:32:00] <jynus>	 ok, thanks
[08:32:09] <jynus>	 then I can research this further
[08:32:25] <jynus>	 I can take care unless I see nothing or ask for help
[08:32:39] <marostegui>	 I am going to leave in like 10 minutes, I have day off and I have worked a couple of hours already, so will be back in the evening to finish off 
[08:32:46] <jynus>	 no problem
[08:34:35] <jynus>	 it is not the network of the host as otherwise the other instances would have also logged something
[08:34:47] <marostegui>	 yeah, I think so too
[08:34:50] <jynus>	 but it could be the link between the 2 hosts
[08:35:05] <marostegui>	 check the task (if you haven't already) as I have posted there what I have researched so far
[08:35:12] <jynus>	 thanks
[08:35:39] <jynus>	 let me paste the previous error
[08:38:00] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Marostegui) a:05Jclark-ctr→03jcrespo Assigning to Jaime to reflect that this is now pending his follow-up Thank you John!
[08:38:29] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10jcrespo) first error:  ` Feb 26 07:15:09 db1140 mysqld[3685]: 2020-02-26  7:15:09 139851380549376 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013) ` Nothi...
[08:38:51] <jynus>	 the root logins were the snapshots or s3 and x1 10 minutes before
[08:39:02] <jynus>	 s2 was done hours ago
[08:39:19] <jynus>	 unless there is io or cpu saturation, it should not have caused issues
[08:42:00] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10jcrespo) There was an increase in load at that time, which led to an increase of network errors: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1140&var-datasource=eqiad%2...
[08:43:05] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) Thanks, I will repopulate this host with some production data, which may help with T246198.
[09:32:21] <Amir1>	 marostegui: hey, I know you're busy so when you have time, it would be great if you help me figure out the exact number for this batch size: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/574863/4/lib/includes/Store/Sql/Terms/PrefetchingItemTermLookup.php Anything would be good, like "go with N and adjust as necessary" 
[09:33:14] <Amir1>	 this is blocking moving to more reads on the new term store (we read more from the new term store than the old one since yesterday)
[09:37:17] <jynus>	 he is not busy, he is away, but I am around
[09:38:32] <Amir1>	 jynus: oh if he's away, can you help? specially you gave the number for the old term store
[09:38:48] <Amir1>	 https://phabricator.wikimedia.org/T163544#3201562
[09:39:26] <jynus>	 honestly, please let me remember the context, I commented that 3 years ago
[09:40:45] <Amir1>	 this is not super similar as it's different table with different indexes and such
[09:41:06] <Amir1>	 The query I want to chunk up is https://phabricator.wikimedia.org/P10524
[09:41:53] <jynus>	 so I would stick to 9 based on not approximating query plans
[09:41:58] <jynus>	 but it depends on the query
[09:42:28] <jynus>	 where do the ids come from?
[09:43:06] <jynus>	 cannot they be taken from another query so we avoid the extra client server bandwidth?
[09:43:35] <Amir1>	 no they come from Lua calss
[09:43:39] <Amir1>	 *calls
[09:44:52] <jynus>	 so I say run it, profile it and get some numbers: execution time if uncached data, rows read, etc.
[09:45:24] <Amir1>	 sure, this is one of the big ones: https://phabricator.wikimedia.org/P10524
[09:45:24] <jynus>	 the problem with such queries is that if cached, they may be fast, but if uncached they may be very slow
[09:45:50] <jynus>	 no numbers there :-)
[09:46:27] <Amir1>	 I mean this is the query :D
[09:47:26] <Amir1>	 If we chunk it to nine, the first batch would be SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin */ wbtl_id, wbtl_type_id, wbxl_language, wbx_text, wbit_item_id FROM `wbt_term_in_lang`, `wbt_text_in_lang`, `wbt_text`, `wbt_item_terms` WHERE (wbtl_text_in_lang_id=wbxl_id) AND (wbxl_text_id=wbx_id) AND wbit_item_id IN (976055, 976055, 595104, 595104, 7482576, 7482576, 206855, 33938, 33938) AND 
[09:47:26] <Amir1>	 wbxl_language IN ('hu', 'en') AND (`wbit_term_in_lang_id` = wbtl_id)
[09:47:55] <Amir1>	 but it has to do it 500 times
[09:47:59] <jynus>	 the nine was bsed on another query
[09:48:05] <jynus>	 which had bad query plans
[09:48:12] <jynus>	 I don't know this one
[09:48:19] <jynus>	 you must get some stats
[09:48:38] <Amir1>	 yup that's why I'm asking for another number. How can I get some stats, what should I look for?
[09:49:14] <jynus>	 EXPLAIN, show status like 'Hand%' and latency numbers
[09:49:34] <jynus>	 (on cached and uncached rows)
[09:50:38] <Amir1>	 Sure
[09:51:57] <jynus>	 have a way to test it on random valid ids so you can test it with different parameters
[09:52:15] <jynus>	 maybe it only has issues with certain ranges or amounts of data
[09:53:07] <Amir1>	 I can get some in tendril 
[09:53:54] <Amir1>	 I get this on a large batch:
[09:53:59] <Amir1>	 https://www.irccloud.com/pastebin/FvY8DQPn/
[09:54:15] <Amir1>	 (with 2k entities)
[09:54:40] <jynus>	 how many rows does that return?
[09:54:58] <Amir1>	 it should return around 4k rows
[09:55:22] <Amir1>	 (two languages)
[09:55:33] <jynus>	 get the handler status on an actual run, use the slow query replica
[09:55:36] <Amir1>	 maybe less I don't think there are labels in all cases
[09:56:02] <Amir1>	 jynus: so for stupid question but how can I get the handler status?
[09:56:06] <jynus>	 connect -> run query > SHOW STATUS like 'Hand%';
[09:56:39] <Amir1>	 thanks!
[09:57:16] <jynus>	 documentation at https://www.slideshare.net/jynus/query-optimization-with-mysql-80-and-mariadb-103-the-basics/97
[09:58:05] <jynus>	 with that we can start talking, but we must gather data first :-)
[09:59:24] <Amir1>	 Another stupid question: I can't find vslow as they are not in the db-eqiad.php anymore. https://noc.wikimedia.org/db.php doesn't have anything either 
[10:00:47] <jynus>	 so, you should not know that, but you should have asked that earlier :-D
[10:01:05] <jynus>	 Amir1: https://noc.wikimedia.org/dbconfig/eqiad.json
[10:02:00] <Amir1>	 Thanks!
[10:02:20] <jynus>	 db1101:3318
[10:04:53] <jynus>	 for some reason, the link to that is small and at the end of the page
[10:07:25] <Amir1>	 So this is the main query:
[10:07:33] <Amir1>	 https://www.irccloud.com/pastebin/cziiN5fy/
[10:09:09] <Amir1>	 This is for the batch of 50:
[10:09:12] <Amir1>	 https://www.irccloud.com/pastebin/5aqNnuv6/
[10:09:59] <Amir1>	 nope, it's wrong, the query has duplicate enteries
[10:11:02] <jynus>	 36780 handler_read_next, which means index scanning rows, and 70K Handler_read_key (key lookups), using icp
[10:12:20] <Amir1>	 This is fifty batch
[10:12:23] <Amir1>	 https://www.irccloud.com/pastebin/FMDaw3eY/
[10:12:32] <jynus>	 so, first of all, the query is badly written, which is offtopic here
[10:12:51] <jynus>	 but I cannot even know how that passed query review
[10:13:07] <jynus>	 what is FROM `wbt_term_in_lang`, `wbt_text_in_lang`, `wbt_text`, `wbt_item_terms` ?????
[10:13:55] <Amir1>	 they are normalization of the wb_terms entries
[10:14:07] <Amir1>	 that's why it has to join
[10:14:16] <jynus>	 needs MOAR Join, implicit joins would be the equivalent in programming to reuse the same varaible name for 5 different things
[10:14:45] <jynus>	 let's rewrite it to use JOIN
[10:15:01] <Amir1>	 yeah, I see
[10:15:09] <jynus>	 it is "dirty" how it is now
[10:15:14] <jynus>	 not good practice at all
[10:15:21] <jynus>	 and doing that leads to bugs
[10:15:26] <Amir1>	 SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin */ wbtl_id, wbtl_type_id, wbxl_language, wbx_text, wbit_item_id FROM `wbt_term_in_lang`, `wbt_text_in_lang`, `wbt_text`, `wbt_item_terms` WHERE (wbtl_text_in_lang_id=wbxl_id) AND (wbxl_text_id=wbx_id) AND wbit_item_id IN (15, 41, 96, 114, 117, 148, 155, 159, 183, 244, 252, 258, 467, 577, 657, 668, 801, 833, 843, 865, 902, 916, 928, 953, 962, 
[10:15:26] <Amir1>	 1009, 1020, 1032, 1033, 1065, 3542, 3787, 4412, 4917, 6256, 6655, 7159, 7502, 7785, 7809, 7825, 8441, 8449, 8475, 8673, 9043, 9051, 11191, 11229, 14005) AND wbxl_language IN ('hu', 'en') AND (`wbit_term_in_lang_id` = wbtl_id)
[10:15:27] <jynus>	 because people forget to ON
[10:15:35] <jynus>	 also let's separate ON and WHERE
[10:15:48] <jynus>	 those quality of life are not only formatting
[10:15:56] <Amir1>	 this is 50 batch, let rewrite and I apply it (I didn't that query btw)
[10:16:29] <jynus>	 yeah, not blaming you
[10:21:10] <marostegui>	 o/ I am back for around 1h or so
[10:21:31] <marostegui>	 Amir1: sorry but I have a hectic schedule today, I am taking half day off - thanks Jaime for helping :)
[10:22:14] <jynus>	 if you can just check alerts/graphs, so I can give my full atention to Amir1
[10:22:48] <jynus>	 I was trying to handle db1095 & db1140, but I cannot just do everything :-/
[10:24:14] <marostegui>	 jynus: definitely!
[10:24:41] <jynus>	 I am right now moving s3, and only s3 to db1095
[10:25:09] <jynus>	 will leave s3 on db1095 and s2/x1 on db1140, that should slove most of our issues, I belive
[10:25:44] <jynus>	 I forgot to say that I pooled special replicas on s8 with larger weight
[10:26:01] <jynus>	 because there was a spike of load yesterday evening
[10:26:07] <jynus>	 I would live it as is for now
[10:26:11] <jynus>	 *leave
[10:26:56] <marostegui>	 yeah, saw that yesterday
[10:26:58] <marostegui>	 makes sense yeah
[10:27:06] <marostegui>	 they have weight 100, so that's good anyways
[10:27:48] <jynus>	 I put them to 150
[10:28:12] <jynus>	 on main
[10:28:31] <marostegui>	 Oh, then my browser is cached
[10:28:32] <marostegui>	 let me see
[10:28:42] <marostegui>	 yeah, now I see it
[10:29:39] <jynus>	 not cdanis only wanted them to have weights larger than 100 so that percentage worked well
[10:29:43] <jynus>	 note
[10:30:03] <marostegui>	 I know that
[10:30:10] <jynus>	 I don't care if they end up with load 103 or 104 :-)
[10:30:45] <jynus>	 we can reevaluate at a later time, but there was an issue after you left, so that is the only thing I could think about
[10:31:01] <marostegui>	 thanks
[10:31:09] <Amir1>	 sorry I was at daily
[10:31:11] <jynus>	 what was s10 btw, I cannot remember
[10:31:16] <marostegui>	 wikitech
[10:31:28] <jynus>	 cool
[10:31:50] <jynus>	 Amir1: can we talk later?
[10:31:55] <Amir1>	 sure
[10:31:57] <jynus>	 if this can wait a bit
[10:32:18] <jynus>	 we have reduced DBA coverage and there is some things going on at the same time
[10:32:30] <jynus>	 please get those number I wanted to evaluate performance
[10:32:51] <Amir1>	 Sure
[10:32:51] <jynus>	 put them on a ticket and I will get to those when I am less stressed 0:-D
[10:33:26] <jynus>	 reads of 10000+ rows for fast queries are frowned upon
[10:33:37] <marostegui>	 I have been rereading all the stuff about db1122 and db1140, and I think (unless arzhel probes us wrong once he wakes up and provides network graphs and stuff) that the issue is within db1140 and not within db1122
[10:33:47] <jynus>	 because very rarely one has to return more than 10000 rows and that means there is room for optimization
[10:33:49] <marostegui>	 Or the communication between the two of them but not with the rest of replicas
[10:33:50] <marostegui>	 which is good
[10:34:09] <jynus>	 marostegui: my update on ticket pointed to overload due to backup schedule
[10:34:24] <marostegui>	 network overload?
[10:34:24] <jynus>	 s3 now is "fast"
[10:34:27] <marostegui>	 or just general overload?
[10:34:29] <jynus>	 to backup
[10:34:35] <jynus>	 so it overlaps with x1
[10:34:48] <jynus>	 and we happen to have both on the same source server
[10:34:56] <marostegui>	 does the timing match with backups generations?
[10:35:15] <jynus>	 it starts indeed 10 minutes before the first error
[10:35:24] <jynus>	 and if you look at host graph
[10:35:29] <jynus>	 there is a spike of network errors
[10:35:30] <marostegui>	 uuuuh suspicius 
[10:35:47] <jynus>	 after all, it is transfering twice as fast as possible
[10:35:58] <jynus>	 I am moving s3 away
[10:36:02] <jynus>	 and that should take care of it
[10:36:07] <jynus>	 that is my best explanation
[10:36:15] <jynus>	 now that coincidentally db1095 is fixedf
[10:36:34] <jynus>	 out of precaution, I will use the previous snapshot, not this one
[10:37:10] <jynus>	 we may also want to stop replication when backing up s3
[10:37:19] <jynus>	 like we do for s1, s8 and s4
[10:40:10] <jynus>	 marostegui: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1140&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=1582698824998&to=1582706133704
[10:40:29] <jynus>	 see how it jumped to 2 dropped packages per second as the second overlapping backup started
[10:40:32] <marostegui>	 that definitely matches the first error, indeed
[10:40:48] <jynus>	 so "general overload"
[10:41:14] <marostegui>	 yeah
[10:41:28] <marostegui>	 not the first time we see general overload causing packet drops
[10:41:29] <jynus>	 db1095 should fix it
[10:42:14] <marostegui>	 bbu arrived right on time!
[10:42:24] <jynus>	 I confess I am a bit more rough than usual with source hosts as errors there are not user errors
[10:42:30] <jynus>	 what was the other for?
[10:43:09] <marostegui>	 db1084? that's a normal s4 core host
[10:43:30] <marostegui>	 I assume you are talking about what was the other host with BBU issues?
[10:43:35] <jynus>	 yes, sorry
[10:44:00] <marostegui>	 yep, db1084 (which I powered off today for maintenance once eqiad wakes up) is a s4 core host with broken BBU
[10:44:42] <jynus>	 the other thing that I would like to know is db108(7?) wikidata compression progress
[10:45:14] <jynus>	 if you can get based on bytes written a percentage of progress or so
[10:48:09] <marostegui>	 yeah
[10:48:14] <marostegui>	 I checked this morning
[10:48:19] <marostegui>	 I think it will be done in around 24h
[10:48:23] <jynus>	 cool
[10:48:27] <jynus>	 that will help
[10:48:35] <jynus>	 I saw it by accident on a cumin screen
[10:48:39] <jynus>	 will monitor it tomorrow
[10:48:50] <marostegui>	 yeah, my calculations are that the new table will be around 600GB
[10:48:55] <jynus>	 anything I should know about that?
[10:48:58] <marostegui>	 and the temp table has now around 420G
[10:49:11] <jynus>	 before pooling it, aside from the usual?
[10:49:14] <jynus>	 e.g. pending work
[10:49:15] <jynus>	 etc
[10:49:21] <marostegui>	 yeah, moving back labs hosts
[10:49:25] <marostegui>	 let's do that together
[10:49:29] <marostegui>	 on friday if you want
[10:49:29] <jynus>	 yeah
[10:49:36] <marostegui>	 (I don't think it will be ready tomorrow anyways)
[10:49:40] <marostegui>	 as it needs to catch up also
[10:49:44] <jynus>	 indeed
[10:49:51] <jynus>	 ok, then I will wait until friday
[10:49:56] <marostegui>	 yeah, I am back friday
[10:50:00] <jynus>	 but will start replication
[10:50:07] <jynus>	 if I see it finished the alter
[10:51:23] <marostegui>	 db1087 used to be the vslow host, so its weight was 1, this was the depooling patch: https://phabricator.wikimedia.org/P10493
[10:51:28] <jynus>	 I saw it
[10:51:54] <jynus>	 it was not "normal" to see the current state
[10:52:04] <jynus>	 so I guessed that was the case
[10:52:21] <marostegui>	 yeah
[10:52:22] <jynus>	 but I won't rush the repooling
[10:52:37] <marostegui>	 I don't think it will be ready by tomorrow, no
[10:52:37] <jynus>	 after the alter, buffer pool will not be in a great state
[10:56:39] <Amir1>	 jynus: done now: https://phabricator.wikimedia.org/T246159#5919266 I go to a meeting, ping me once you're done. Thanks!
[11:00:46] <jynus>	 will do
[12:14:40] <moritzm>	 jynus: I have a tentative fix for what broke SSO auth to tendril yesterday afternoon, can I test that in ~ 15min? I can revert quickly to the LDAP auth if it shouldn't work 
[12:16:51] <jynus>	 yeah, actually now is a great moment
[12:16:57] <jynus>	 because I am about to take a break
[12:17:05] <jynus>	 feel free to break anything now :-D
[12:17:26] <jynus>	 (hopefully it works, I want it to work)
[12:17:32] <moritzm>	 cool, will start in ~ 10m
[12:17:36] <jynus>	 thanks!
[12:17:42] <moritzm>	 enjoy your break!
[12:17:50] <jynus>	 yesterday late was a bad timing
[12:17:55] <jynus>	 because s8 issues, sorry
[12:18:01] <moritzm>	 sure, sure
[12:32:01] <moritzm>	 for when you're back: tendril works fine using CAS now (and also with a fresh (not formerly existing session from a different browser)), maybe do a quick smoke check and then I'll properly transfer my live-hacked change into Puppet
[12:32:45] <jynus>	 not yet gone
[12:33:10] <jynus>	 Waiting for idp.wikimedia.org...
[12:33:25] <jynus>	 let me try on a private window
[12:34:14] <jynus>	 both tests work
[12:34:20] <jynus>	 see you
[12:34:25] <moritzm>	 cool, thanks
[13:18:34] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10ayounsi) Indeed, from https://librenms.wikimedia.org/graphs/to=1582722600/id=12170/type=port_bits/from=1582693200/ and https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1140&var-datasou...
[13:42:34] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10Marostegui) Thanks @ayounsi - looks like this confirms that the backups generation caused some temporary issues here.
[13:44:14] <marostegui>	 jynus: do you want to try es4 on Tuesday next week at around 9AM UTC maybe?
[13:46:16] <wikibugs>	 10DBA: db1140 s2 replication disconnected temporarily - https://phabricator.wikimedia.org/T246198 (10jcrespo) 05Open→03Resolved a:03jcrespo Closing this as resolved, as this backup overlap should not happen again due to work on T244958.  As a general idea, we should purchase all new DB hosts with 10G cards...
[13:46:27] <jynus>	 let me see
[13:46:59] <jynus>	 I have aim for switchover, but that is outdated
[13:47:09] <jynus>	 so it sould be good.
[13:47:10] <marostegui>	 yeah, that's not happening
[13:47:15] <jynus>	 let me create a calendar
[13:47:19] <marostegui>	 ok, let me send an invite and block it on deployments
[13:47:25] <jynus>	 ok, you do it
[13:47:25] <marostegui>	 No, don't worry, I will do that :)
[13:47:27] <marostegui>	 yep
[13:47:29] <marostegui>	 thank you
[13:48:04] <marostegui>	 I will try to the task and the patchset too
[13:48:34] <jynus>	 will we do an all day deployment thingy? or just the first production change then?
[13:48:57] <jynus>	 in other words, when are you aiming to complete the switch- along the week?
[13:49:26] <marostegui>	 I am aiming for es4 entirely yeah
[13:49:34] <marostegui>	 I would like to give it a few days before going for es5
[13:49:39] <jynus>	 ok, understood
[13:49:55] <marostegui>	 if it goes super smooth, maybe we can do es5 the next day
[13:49:56] <jynus>	 I don't need exact dates, just understanding the scope of tuesday work
[13:50:04] <jynus>	 all cool :-D
[13:50:09] <marostegui>	 so for tuesday I would like to leave es4 fully deployed yeah
[13:50:13] <jynus>	 cool
[13:50:16] <marostegui>	 :*
[13:56:58] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) I have updated zarcillo to point prometheus to the right instances (CC @Marostegui to try to have that up to date, because I sometimes forget, and it is totally my fault for not...
[13:58:05] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) I will run a data check on s3 and s2 and then consider this fixed. Maybe moving s2 on codfw to to mirror data distribution?
[14:15:39] <wikibugs>	 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10Goal, 10Patch-For-Review: Enable es4 and es5 as writable new external store sections - https://phabricator.wikimedia.org/T246072 (10Marostegui) We are aiming to fully deploy es4 Tuesday 3rd March, starting at 09:00 AM UTC. The idea would be to fo...
[21:06:01] <wikibugs>	 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui) db1087 finished and it is now catching up
[21:06:11] <wikibugs>	 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui)
[21:57:08] <wikibugs>	 10DBA, 10Operations: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Jclark-ctr)
[21:59:12] * addshore has things to discuss in the morning :)