[06:14:31] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Structured-Data-Commons, 10Wikidata: DROP unused 'slots' table (WAS: In the slots table, replace slot_inherited with slot_origin) - https://phabricator.wikimedia.org/T190153#4067553 (10Marostegui) 05Open>03Resolved
[07:03:34] 10DBA, 10Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4067583 (10Marostegui)
[07:07:42] 10DBA, 10Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4067588 (10Marostegui)
[07:09:33] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4067590 (10Marostegui) a:05Marostegui>03RobH This host is now ready for DC Ops steps. Assigning to @RobH
[07:21:42] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4067603 (10Marostegui)
[07:22:59] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4067605 (10Marostegui)
[07:54:57] I would like to try to deploy https://gerrit.wikimedia.org/r/420331
[07:55:23] may need some help to check heartbeat, prometheus, icinga
[07:57:21] sure
[07:57:32] btw: https://gerrit.wikimedia.org/r/#/c/420955/
[08:02:00] ok, I would have gone with the next one, but that is ok
[08:02:35] why? Because pc1004 was restarted yesterday and might still not be as warm?
[08:02:53] if you want to know why (but don't amend): it's that, and that it minimizes the per-host garbage
[08:03:05] it is mostly irrational, but in the past
[08:03:12] purges caused issues with performance
[08:03:24] I think I asked to fix that, so that should be ok
[08:03:27] Ah, good to know :)
[08:03:31] Thanks for the context
[08:03:48] in the past, 30 days after doing that, we had a pc outage
[08:03:59] I think the purge every day will also help
[08:04:34] Hopefully!
[08:04:37] Going to merge then
[08:04:40] Thanks :)
[08:05:08] so I will wait for that to be done before the merge
[08:05:28] ok, it shouldn't take long - thanks
[08:05:41] once you are done, let's change the master for db1095 from db1065 to db1106 too
[08:05:46] what a nice queue of things to do XD
[08:05:54] ok
[08:06:28] meanwhile I will clean up the mysql config on eqiad, I think it still has some no-longer-in-use hosts
[08:06:36] both eqiad and codfw
[08:10:02] when we have 4 hosts, 1 spare
[08:10:28] we can connect the spare to the one on maintenance and replicate all lost writes
[08:10:40] (talking about pc*)
[08:10:50] ah right, I was like: I am missing context haha
[08:11:02] yeah, that is a good solution
[08:11:05] plus not "contaminating" each shard
[08:11:15] here shard is the right word
[08:11:36] we could also try to redo the shard strings
[08:11:58] that is work to do before the end of the year
[08:12:11] what do you mean by shard strings?
[08:12:53] the partitioning keys
[08:12:59] ah yeah
[08:13:14] something more useful like pc1/2/3
[08:13:36] with 4 we could even pre-warm the one under maintenance
[08:13:42] yeah, because it is a bit messy and error-prone
[08:14:18] although we may need some host as a proxy
[08:21:04] jynus: I am done with pc1005, we can go ahead with your misc changes
[08:23:57] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4067669 (10Marostegui)
[08:24:57] ok, doing
[08:25:10] nothing should explode, as it is just a file change
[08:25:25] but monitoring, alerts and heartbeat could be confused
[08:25:30] famous last words :)
[08:25:45] we upgraded all of m1, m2 and m5 to stretch, right?
[08:26:18] correct, all the new hosts run 10.1
[08:27:23] I think heartbeat is not used on those misc hosts
[08:27:34] I mean, for monitoring or the application
[08:27:51] although maybe it should be added, in a non-critical way (emails)
[08:27:54] yeah
[08:41:15] everything is looking good to me so far
[08:41:45] yeah, I was checking tendril and there was no lag generated or anything indeed
[08:42:10] oh, db1016 and db1020 are still 10.0
[08:42:14] will they break?
[08:42:19] db1020 is stopped
[08:42:25] and 16?
[08:42:32] we disable puppet?
[08:43:02] db1016 will complain, but it will be gone soon
[08:43:04] we can just disable puppet
[08:43:14] done
[08:43:19] \o/
[08:43:36] 1001 will also complain
[08:43:40] actually, not done
[08:43:41] just disable puppet there too
[08:43:49] because puppet ran already
[08:43:57] but I can revert the changes manually
[08:44:04] I have disabled puppet on 1001
[08:44:13] check if it ran
[08:44:25] it didn't :)
[08:45:38] cool
[08:46:10] I "fixed" /var/lib/prometheus/.my.cnf and /etc/my.cnf
[08:46:17] for 1016
[08:46:44] will restart heartbeat on db1051 and db1073
[08:46:51] cool!
[08:47:52] after the pcs and es are all upgraded (no rush) we could set the default basedir to 10.1
[08:48:22] I will do the last pc tomorrow \o/
[08:49:35] actually, I would not upgrade the pcs
[08:49:48] change the socket and upgrade to the latest minor, yes
[08:49:52] I mean upgrade to 10.0.34 and the new socket path :)
[08:50:03] but I would wait for the 10.1 upgrade until new hardware
[08:50:12] oh yeah, agreed
[08:50:13] basically, what you are doing
[08:51:01] if you are done, let's do the change master for db1095?
[08:51:01] mmm, heartbeat on db1073 failed
[08:51:06] and I know why
[08:51:10] oh
[08:51:12] the socket was not changed there
[08:51:12] let's see
[08:51:22] we might need to restart mysql, no?
[08:52:00] for now, I am creating a symbolic link
[08:52:11] but the basedir will be wrong
[08:52:24] ah no
[08:52:27] the basedir will be ok
[08:52:32] only the socket location
[08:53:04] with the symbolic link, that will fix prometheus and monitoring
[08:53:38] but it will break on restart if the link is missing (it is deleted on server start)
[08:54:17] can you help me do a brief health status of all the servers?
[08:54:22] monitoring, etc.
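To illustrate the parsercache "shard strings" idea discussed above (08:10-08:14), here is a minimal sketch of a stable key-to-shard mapping where logical labels like pc1/pc2/pc3 are decoupled from physical hosts, so a spare can be pointed at whichever shard is under maintenance. This is not MediaWiki's actual implementation; the hashing scheme, the host list and the sample key are all illustrative assumptions.

```
import hashlib

# Hypothetical mapping of logical parsercache shards to physical hosts.
# Swapping the host behind a label is what lets a spare replace the host
# that is under maintenance without changing the key layout.
SHARD_TO_HOST = {
    "pc1": "pc1004.eqiad.wmnet",
    "pc2": "pc1005.eqiad.wmnet",
    "pc3": "pc1006.eqiad.wmnet",  # invented host, for illustration only
}

def shard_for_key(cache_key: str) -> str:
    """Deterministically map a parser cache key to a logical shard label."""
    digest = hashlib.md5(cache_key.encode("utf-8")).hexdigest()
    index = int(digest, 16) % len(SHARD_TO_HOST)
    return sorted(SHARD_TO_HOST)[index]

if __name__ == "__main__":
    key = "enwiki:pcache:idhash:12345-0!canonical"  # made-up example key
    shard = shard_for_key(key)
    print(f"{key} -> {shard} ({SHARD_TO_HOST[shard]})")
```

The point of the indirection is exactly what the conversation describes: the partitioning key stays stable while the host behind it can change, which is less messy and error-prone than embedding host names in the keys.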
[08:54:27] *brief
[08:54:33] yeah
[08:54:38] I was checking the slave of db1073
[08:54:42] ok
[08:54:47] which might have the same issue
[08:54:50] also that heartbeat flows normally
[08:54:54] oh
[08:54:58] create the link, then
[08:55:01] or we can restart it
[08:55:02] yeah :)
[08:55:03] that is easier
[08:55:16] yeah, let's restart that one
[08:55:18] I will do it
[08:55:31] ok, meanwhile, I will check the other hosts
[09:05:43] I noticed an error: https://gerrit.wikimedia.org/r/#/c/420970/
[09:10:02] oh
[09:10:03] that is bad
[09:10:15] how did no one notice?
[09:10:36] I didn't notice either :)
[09:10:42] only when restarting db2037
[09:10:45] it is the second time I've done one of those
[09:10:58] But those are easy to miss, especially because "ops" makes sense
[09:11:38] thanks for the update @T148507
[09:12:11] I could also restart db2078?
[09:12:19] I am doing it now :)
[09:12:26] see operations :)
[09:12:35] sorry
[09:12:49] will enable gtid there
[09:12:52] it was disabled
[09:12:58] ah, yes
[09:13:00] we need to do a loop and check where gtid is disabled
[09:13:08] needed for the topology change
[09:13:39] it is rather silly that we have to disable gtid to change the position of a replica
[09:13:56] yeah, but better be safe XD
[09:14:40] do you have a sec to do the master change? I basically need another pair of eyes to confirm the coordinates of db1106 :)
[09:15:52] coordinates?
[09:16:04] oh, you are now talking about production
[09:16:10] yeah, sorry
[09:16:20] db1095 to be changed from db1065 to db1106 :)
[09:16:29] one sec, I'll finish with misc
[09:16:33] sure!
[09:19:17] collection failures on db1016
[09:19:29] and db1073
[09:19:40] I guess the last one is recovering
[09:19:54] where are you seeing those?
[09:20:22] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated
[09:21:18] manually fixing the opt on db1016
[09:21:32] oki
[09:21:58] but the metrics collection still fails
[09:22:38] the socket is still pointing to tmp
[09:23:22] yeah, but I created a symlink, I think
[09:23:46] no, it is not there
[09:23:49] or did I stop puppet?
[09:23:55] I am confused now
[09:24:12] puppet is stopped, yes
[09:24:16] but the socket is still on tmp
[09:24:22] and my.cnf points to tmp
[09:24:28] so it is technically correct :)
[09:24:35] I will fix it manually
[09:25:33] no no
[09:25:35] don't
[09:25:47] too late
[09:25:52] revert
[09:25:58] I only changed the socket path
[09:26:04] where?
[09:26:10] on my.cnf
[09:26:13] to what?
[09:26:17] to var run
[09:26:28] no
[09:26:33] I stopped puppet
[09:26:40] it should be on /tmp
[09:26:55] it is something else
[09:27:00] the prometheus daemon or something
[09:27:10] ok, I will revert
[09:27:25] done
[09:28:09] Error pinging mysqld: dial unix /tmp/mysql.sock: connect: no such file or directory
[09:28:31] that could be from while I was restarting
[09:29:14] I think it is fixed now
[09:29:25] is it only happening on db1016?
[09:29:38] yes, it is fixed
[09:29:44] so what was it?
[09:29:45] only 28 errors now
[09:29:54] I had stopped puppet
[09:30:01] but not manually restarted the prometheus exporter
[09:30:04] aha!
[09:30:10] as puppet does that automatically
[09:30:56] so all good then?
[09:31:00] yes
[09:31:04] \o/
[09:31:13] I also checked gtid on all replicas
[09:31:22] not all, all misc
[09:31:32] cool!
[09:31:51] what is your plan with db1065?
[09:32:08] sooo
[09:32:16] the plan is: stop db1065 and db1106 on the same position
[09:32:28] are both depooled?
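For the "loop and check where gtid is disabled" mentioned at 09:13, a minimal sketch of such a check, assuming pymysql, a readable ~/.my.cnf and a hard-coded host list (in practice the inventory would come from tendril/puppet rather than being typed in). On MariaDB the relevant field is the Using_Gtid column of SHOW SLAVE STATUS (No / Slave_Pos / Current_Pos); multi-source hosts would need SHOW ALL SLAVES STATUS instead.

```
import os.path

import pymysql

# Illustrative host list; not the real inventory.
REPLICAS = ["db2037.codfw.wmnet", "db2078.codfw.wmnet"]

def gtid_status(host: str) -> str:
    """Return the Using_Gtid value from SHOW SLAVE STATUS, or 'not a replica'."""
    conn = pymysql.connect(
        host=host,
        read_default_file=os.path.expanduser("~/.my.cnf"),
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Using_Gtid"] if row else "not a replica"
    finally:
        conn.close()

if __name__ == "__main__":
    for replica in REPLICAS:
        print(f"{replica}: Using_Gtid = {gtid_status(replica)}")
```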
[09:32:55] then connect to db1095: SET @@default_master_connection='s1'; stop slave; reset slave all;
[09:32:58] and then:
[09:33:36] change master to master_host='db1106.eqiad.wmnet', master_user='repl', master_password='xx', master_port=3306, master_log_file='yyy', master_log_pos=zzz;
[09:33:40] actually
[09:33:48] change master 's1' to master_host='db1106.eqiad.wmnet', master_user='repl', master_password='xx', master_port=3306, master_log_file='yyy', master_log_pos=zzz;
[09:33:58] put that on the etherpad
[09:35:32] check it now
[09:36:11] how is it possible that a jenkins edit is at 9:39 AM
[09:36:16] if it is :36 now?
[09:36:25] https://gerrit.wikimedia.org/r/#/c/420964/
[09:36:39] that time isn't utc
[09:36:44] ah!
[09:36:47] sorry
[09:37:06] I was like: is my clock bad, do we have issues with ntp?
[09:37:09] sorry
[09:37:41] that happened to me a few days ago XD
[09:39:27] looks ok to me
[09:39:48] ok!
[09:39:50] let's go then
[09:40:10] let me double check db1106 is in row
[09:40:16] I know it is in the config
[09:40:18] yeah :)
[09:40:22] I checked the binlog itself
[09:40:22] but let me check the binlogs
[09:40:25] ok, then
[09:40:26] but please, go and check it too :)
[09:40:29] ok
[09:40:30] no no, go ahead, please
[09:42:15] is that a recent change? it was statement early in the morning
[09:42:27] oh, sorry, my fault
[09:42:31] no, it was set up days ago
[09:42:32] I was checking the relays
[09:42:36] actually, it was started as row
[09:42:43] which of course are statement (master)
[09:42:44] from the start
[09:42:48] hehe yeah
[09:43:39] I can see them as row, as expected
[09:43:46] ok
[09:43:51] will go ahead and stop replication then
[09:43:53] thanks for checking
[09:44:07] done
[09:44:08] let's verify
[09:44:29] they are on the same position and it is not moving
[09:44:59] go on
[09:45:10] I can see it too
[09:45:23] File: db1106-bin.000009
[09:45:23] Position: 474802220
[09:45:27] looks like the new position for db1095
[09:46:11] confirm
[09:46:11] going to stop and reset slave all on db1095
[09:46:38] we might need to manually add all the filters
[09:46:47] before starting the new slave thread
[09:47:04] probably not, but check it
[09:47:18] do a slow slave status before the start
[09:47:22] *show
[09:47:26] yeah
[09:48:03] ok, they are there
[09:48:12] ready to start replication, can you double check too?
[09:49:14] slave status looks good to me
[09:49:20] let's go for it then
[09:49:36] ok, connected well
[09:49:40] let's start replication on db1106
[09:50:13] I see no errors
[09:50:20] it is catching up fine
[09:50:54] going to start slave on db1065 too
[09:50:58] mediawiki errors seem fine, too
[09:51:05] oh, yes, I forgot
[09:51:33] it is all looking good
[09:51:43] let's leave it a day and tomorrow I will move db1065 to misc
[09:52:04] I don't think we need to stop, really
[09:52:15] to stop?
[09:52:21] to wait, I mean
[09:52:24] ah, to wait
[09:52:35] ok, let's leave it until lunch XD
[09:52:35] if something went wrong, db1065 will not fix it
[09:52:39] because replication
[09:52:40] that's true
[09:52:44] other thing
[09:52:45] is
[09:52:53] if you don't trust db1106 as a slave
[09:53:01] but that is a different thing
[09:53:09] No, I do :)
[09:53:19] It has 50 as main traffic weight, btw
[09:53:34] but if for some reason db1106 goes wrong regarding replication
[09:53:43] the only fix would be to put another random host
[09:54:02] (BTW, we could have, like for the master, a candidate on ROW, just in case)
[09:54:16] to failover sanitarium?
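A rough consolidation of the repointing steps discussed and executed above, using MariaDB multi-source replication with connection name 's1'. It only assembles the statements and prints them; the coordinates are the ones pasted in the log (db1106-bin.000009 / 474802220), the password is left as the 'xx' placeholder used there, and actually running them (plus re-checking the replication filters first) would be done by hand as in the conversation.

```
def change_master_statements(connection, new_master, log_file, log_pos):
    """Build the MariaDB multi-source statements for repointing one connection."""
    return [
        f"STOP SLAVE '{connection}';",
        f"RESET SLAVE '{connection}' ALL;",
        (f"CHANGE MASTER '{connection}' TO "
         f"MASTER_HOST='{new_master}', MASTER_USER='repl', "
         f"MASTER_PASSWORD='xx', MASTER_PORT=3306, "
         f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};"),
        # Verify filters with SHOW SLAVE 's1' STATUS before this last one:
        f"START SLAVE '{connection}';",
    ]

if __name__ == "__main__":
    for stmt in change_master_statements(
            "s1", "db1106.eqiad.wmnet", "db1106-bin.000009", 474802220):
        print(stmt)
```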
[09:54:19] failover
[09:54:22] even on codfw
[09:54:32] to failover OR
[09:54:48] to apply manually row-based statements
[09:55:02] because if they fail, all the others are on statement or mixed
[09:55:16] yeah
[09:55:26] which shouldn't be a huge problem
[09:55:36] but with lower priority
[09:55:43] better to have a plan B :-)
[09:55:48] yeah, agreed!
[09:56:16] so my advice would be to set up db1065 right away and decom 1001 asap
[09:56:29] yeah, I am going to prepare the puppet patches
[09:56:30] I can do that
[09:56:52] we need to reimage and all that
[09:56:58] I will prepare the patches
[09:57:04] and send it to reimage
[09:57:04] that is the "easy" part
[09:57:21] yeah, then we can stop db1016 and clone it right away
[09:57:22] you do that, then?
[09:57:28] yep
[09:57:33] or at least start?
[09:57:41] repool production first
[09:57:41] I will try to do the whole thing
[09:57:44] or I can do that
[09:57:46] it is done already
[09:57:50] ok
[09:58:00] so you don't leave me anything?
[09:58:05] hahaha
[09:58:34] wait till we hit problems and I need the calvary :)
[09:58:53] * chivalry
[09:59:18] ?
[09:59:45] nevermind, it was a joke :)
[10:01:06] actually, decom 1016 and 1001, one for copying its data, the other for replacing its functionality
[10:01:25] yeah :)
[10:01:32] I will use db1016 to clone
[10:02:14] I will then do the review of the mediawiki config
[10:03:14] what do you mean?: https://gerrit.wikimedia.org/r/420976 ?
[10:03:54] not only that
[10:04:00] there are other hosts that are leftovers
[10:04:04] ah ok ok :)
[10:04:27] let me show you with a WIP
[10:06:06] https://gerrit.wikimedia.org/r/420978
[10:10:58] aaah right right!
[10:31:19] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4067978 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1065.eqiad.wmnet ``` The lo...
[10:34:58] 10DBA, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4067995 (10Marostegui)
[10:36:21] 10DBA, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4065400 (10Marostegui)
[10:46:23] 10DBA: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4068032 (10Marostegui) p:05Triage>03Normal
[10:46:40] 10DBA: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4068032 (10Marostegui)
[10:46:44] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4068044 (10Marostegui)
[10:48:44] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026873 (10Marostegui)
[10:52:04] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4068065 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1065.eqiad.wmnet'] ``` and were **ALL** successful.
[11:11:51] marostegui: will merge https://gerrit.wikimedia.org/r/420978
[11:11:59] will do further cleanup later
[11:12:43] then sync with dc-ops to check we are on the same page with what is decommissioned
[11:13:29] jynus: cool!
[11:13:39] I talked to robh the other day about the tickets they have pending
[11:14:01] it is ok as long as it is tracked
[11:14:14] my fear is if there are things that are not tracked
[11:14:22] yeah
[11:14:23] sure
[11:15:14] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4068156 (10Marostegui) db1065 is now replicating in m1. I will leave mysql on db1016 stopped
[11:16:22] 10DBA: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4068163 (10Marostegui) db1065 is now replicating in m1, let's wait 24h before going to decommission this host
[11:23:19] 10DBA, 10Goal, 10Patch-For-Review: Decommission database hosts <= db2031 (tracking) - https://phabricator.wikimedia.org/T176243#4068183 (10jcrespo) @robh @Papaul From our point of view, all hosts from db2001 to db2032 have been either decommissioned, scheduled for decommission or renamed/never existed. Can y...
[11:43:33] 10DBA, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4068279 (10Marostegui) db1016 data has been copied over to db1065
[11:43:47] 10DBA, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4068280 (10Marostegui)
[11:45:44] 10DBA, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4068283 (10jcrespo) db1016 removed from tendril (only 1 host, ofc)
[11:49:07] we have reduced the number of db hosts to 185, which is a much nicer number than over 200
[12:20:26] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4068432 (10Marostegui)
[12:20:56] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026946 (10Marostegui)
[12:45:27] I cannot find this week's s3 backups
[12:46:58] Actually I can only see s6 and m5
[12:51:20] I can see on es2001 that the cron was run correctly yesterday at 19:00
[12:51:29] Well, correctly as in: it was run
[12:54:28] FW issues between es2001 and dbstore2001?
[12:54:40] I can connect fine
[13:00:49] I don't think it has finished yet
[13:01:02] I said it takes 24 hours, and I meant it
[13:01:12] but shouldn't it be on ongoing?
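The question being asked here ("I can only see s6 and m5", "shouldn't it be on ongoing?") is essentially a freshness check across the backup directories. A minimal sketch of such a check, assuming the ongoing/latest/archive layout and dump.<section>.<timestamp> naming mentioned in the conversation; the base path, section list and age threshold are invented for illustration.

```
import datetime
import pathlib

BACKUP_ROOT = pathlib.Path("/srv/backups")   # assumed base path
SECTIONS = ["s1", "s2", "s3", "s6", "m1", "m2", "m5"]  # illustrative list
MAX_AGE = datetime.timedelta(days=8)

def check_latest(section: str) -> str:
    """Report whether 'latest' holds a fresh dump for a section."""
    dumps = sorted((BACKUP_ROOT / "latest").glob(f"dump.{section}.*"))
    if not dumps:
        ongoing = list((BACKUP_ROOT / "ongoing").glob(f"dump.{section}.*"))
        return "still running" if ongoing else "MISSING"
    age = datetime.datetime.now() - datetime.datetime.fromtimestamp(
        dumps[-1].stat().st_mtime)
    return "ok" if age < MAX_AGE else f"STALE ({age.days} days old)"

if __name__ == "__main__":
    for section in SECTIONS:
        print(f"{section}: {check_latest(section)}")
```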
[13:01:18] on the directory, I mean
[13:01:26] not necessarily
[13:01:38] mmm
[13:01:41] nothing ongoing
[13:01:46] yeah :)
[13:01:52] no mydumper threads or anything
[13:02:22] something killed the backups at m2
[13:02:56] and killed the whole process
[13:02:59] maybe
[13:03:02] will check later
[13:03:13] maybe all the restarts we were doing or something
[13:03:30] it would be nice to have an entry in the logs saying: backup started - timestamp
[13:03:54] you already get that
[13:04:07] on the metadata file
[13:04:30] Started dump at: 2018-03-20 20:33:01
[13:04:36] Finished dump at: 2018-03-21 00:51:03
[13:04:50] ah, the mydumper metadata, yeah
[13:04:54] but the script apparently failed between that and the next one
[13:07:36] so possibly it threw an exception
[13:08:22] it wasn't out of the question - it was the first time it rotated files
[13:08:33] so maybe some file operations failed
[13:08:50] yeah, I was checking if s3 was finished as I wanted to do a schema change on codfw, and couldn't see it, and then I realised I only saw m2 and s6
[13:09:07] s6 is correctly placed in latest :)
[13:09:11] s3 didn't even start
[13:09:30] dump.s6.2018-03-20--19-14-16# cat metadata | grep dump
[13:09:30] Started dump at: 2018-03-20 19:14:17
[13:09:30] Finished dump at: 2018-03-20 20:33:01
[13:09:33] so that worked :)
[13:09:37] it failed at the end of m2
[13:09:39] it was moved from ongoing to latest
[13:09:59] m5 is also on latest
[13:10:00] and because it failed, it wasn't moved to latest
[13:10:10] which is a good fallback
[13:10:13] indeed
[13:10:22] so m5 and s6 worked and were moved to latest
[13:10:29] and monitoring should kick in
[13:10:34] I also checked
[13:10:41] and it deleted the older backups correctly
[13:10:51] but it is interesting that the metadata of m2 says it actually finished: Finished dump at: 2018-03-21 00:51:03
[13:10:53] I did an ls -lha on archive
[13:11:05] you can see it as ls
[13:11:29] diff <(ls -lha) <(cat ls)
[13:11:37] actually, no, no file was deleted
[13:11:42] so probably that was the issue
[13:12:02] one file was scheduled to be purged, and the process failed
[13:12:07] not mydumper itself
[13:12:18] if you are blocked on m3, I can run it now
[13:12:29] and debug later
[13:12:57] s3 yes. I wanted to deploy the change either today or tomorrow
[13:13:02] but it is not a big thing if it doesn't happen
[13:13:36] I will go afk in a bit, so not a problem
[13:15:39] well, I have to fix backups today, that is for sure
[13:15:48] I am just not 100% back
[13:16:19] will have a look at it with the highest priority
[13:16:56] I suspect it is the file deletions breaking
[13:17:05] as this was the first time we did that
[13:18:12] because the order is -> purge old ones -> move latest to archive -> move finished ongoing to latest
[13:18:40] yeah, we can troubleshoot later, that is no problem :)
[13:29:16] oh
[13:29:21] I may see the problem
[13:29:26] there is a backup on archive
[13:29:43] called dump.m2.2018-03-16--16-20-10
[13:29:47] with root privileges
[13:30:00] could it be something you created for the m failovers?
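A minimal sketch of the rotation order described at 13:18 (purge old ones, move latest to archive, move finished ongoing to latest), with the failure mode being diagnosed here (an entry in archive/ that the script cannot delete, e.g. a root-owned manual dump) downgraded to a warning instead of aborting the run. Paths, naming and structure are assumptions based on the conversation, not the actual script.

```
import pathlib
import shutil

BACKUP_ROOT = pathlib.Path("/srv/backups")  # assumed base path

def rotate(section: str) -> None:
    """Purge archive, then archive 'latest', then promote 'ongoing' for one section."""
    archive = BACKUP_ROOT / "archive"
    latest = BACKUP_ROOT / "latest"
    ongoing = BACKUP_ROOT / "ongoing"

    # 1. Purge old copies; skip anything we cannot remove instead of dying.
    for old in archive.glob(f"dump.{section}.*"):
        try:
            shutil.rmtree(old)
        except OSError as exc:
            print(f"WARNING: could not purge {old}: {exc}; skipping")

    # 2. Move the previous 'latest' dump into archive.
    for previous in latest.glob(f"dump.{section}.*"):
        shutil.move(str(previous), str(archive / previous.name))

    # 3. Promote the finished dump from ongoing to latest.
    for finished in ongoing.glob(f"dump.{section}.*"):
        shutil.move(str(finished), str(latest / finished.name))
```

Keeping manual dumps in a separate directory (the /older mentioned below) avoids step 1 ever touching them at all, which is the fix the conversation converges on.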
[13:30:23] if it is root, it cannot be deleted, and I have not created any kind of check - it assumes everything that matches that pattern can be deleted
[13:30:44] I mean, it could fail more gracefully, but it cannot really continue
[13:40:32] I've restarted the whole process, except the ones that worked
[13:41:10] we shouldn't manually touch the other dirs, I created /older precisely for manual backups
[15:13:32] ˜/jynus 14:30> could it be something you created for the m failovers? -> yeah, could be
[15:14:04] sudo -u dump
[15:14:39] yeah, I didn't think of that being an issue, my fault
[15:15:12] no one's fault
[15:17:56] I see you already fixed the grants, shall I run the backups again then?
[15:18:09] they are running already
[15:18:13] oh, you already did
[15:18:13] yeah
[15:18:14] I said I would fix it
[15:18:22] thanks
[15:18:36] didn't feel like logging it, as running backups is not really a loggable thing
[15:18:49] yeah, no worries, I should've checked before asking :)
[15:18:57] I've restarted the whole process, except the ones that worked
[15:19:31] :)
[15:21:03] I think I should convert dump_sections.py into dump_section.py --config=x
[15:21:14] <3 <3 <3 <3 <3
[15:21:18] I would love that
[15:21:26] and if you don't provide a config file, you can do dump_section.py m3
[15:21:52] or dump_section.py --host --user, etc.
[15:22:46] yeah, that'd be really useful for quick backups before a specific maintenance
[15:23:50] the problem is the authentication, I restricted access to the script to a very small set of servers
[15:24:08] I will create a task and see the best way to do it
[15:45:05] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4069196 (10Marostegui) s6 db1061 master: ``` root@db1061:~# megacli -LDPDInfo -aAll | egrep -i "slot|error|failure count|s.m.a.r.t" Slot Number: 0 Media Error Count: 0 Other Error Count:...
[15:46:42] 10DBA, 10Operations, 10ops-eqiad: db1061 (s6 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190299#4069201 (10Marostegui) p:05Triage>03Normal
[15:49:39] 10DBA, 10Operations, 10ops-eqiad: db1061 (s6 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190299#4069225 (10Marostegui)
[15:52:16] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4069231 (10Marostegui) s7 db1062 master: ``` root@db1062:~# megacli -LDPDInfo -aAll | egrep -i "slot|error|failure count|s.m.a.r.t" Slot Number: 0 Media Error Count: 0 Other Error Count:...
[15:53:54] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4069249 (10Marostegui) s8 db1071 master: ``` root@db1071:~# megacli -LDPDInfo -aAll | egrep -i "slot|error|failure count|s.m.a.r.t" Slot Number: 0 Media Error Count: 199 Other Error Coun...
[15:55:03] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4069256 (10Marostegui) 05Open>03Resolved a:03Marostegui I am going to close this as resolved. We can use this for future tracking to see if errors increased or not. Also going to tra...
[15:55:15] 10DBA, 10Operations, 10ops-eqiad: db1061 (s6 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190299#4069261 (10jcrespo) Ok to me, we should wait for chris to be around.
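A rough sketch of the command-line interface proposed at 15:21 (dump_section.py taking either a config file or an ad-hoc section plus overrides). Only --config, --host and --user and the positional section are taken from the conversation; the rest of the options and all behaviour here are invented for illustration, and the sketch only parses arguments rather than dumping anything.

```
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Dump a single section, from a config file or ad hoc.")
    parser.add_argument("section", nargs="?",
                        help="section to dump when no config is given, e.g. m3")
    parser.add_argument("--config", help="config file describing the section")
    parser.add_argument("--host", help="override: host to dump from")
    parser.add_argument("--port", type=int, default=3306,
                        help="override: port to connect to")
    parser.add_argument("--user", help="override: user to connect as")
    args = parser.parse_args()
    if not args.config and not args.section:
        parser.error("either a section name or --config is required")
    return args

if __name__ == "__main__":
    args = parse_args()
    print(f"would dump section={args.section!r} config={args.config!r} "
          f"host={args.host!r}")
```

As noted right after the proposal, the hard part is not the interface but authentication, i.e. where the script is allowed to run from and with which credentials.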
[15:57:20] 10DBA, 10Operations, 10ops-eqiad: db1052 (s1 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190301#4069271 (10Marostegui) p:05Triage>03Normal
[16:01:39] 10DBA, 10Operations, 10ops-eqiad: db1054 (s2 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190302#4069301 (10Marostegui) p:05Triage>03Normal
[16:03:34] 10DBA, 10Operations, 10ops-eqiad: db1062 (s7 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190303#4069324 (10Marostegui) p:05Triage>03Normal
[16:04:28] 10DBA, 10Operations, 10ops-eqiad: db1052 (s1 master) disks with lots of predictive failure errors - https://phabricator.wikimedia.org/T190301#4069340 (10Marostegui)
[16:05:35] 10DBA, 10Operations, 10ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#4069343 (10Marostegui) 05Open>03declined This host is no longer the master and will be decommissioned - T190179
[16:08:29] https://phabricator.wikimedia.org/P6873 in case you want to check latencies from neodymium
[16:09:33] * marostegui saves that to his notes
[16:09:35] thanks!
[16:12:22] this is what worries me: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=1521475930648&to=1521648730648
[16:12:34] ufff
[16:13:07] but I don't really see changes on the db itself
[16:13:34] maybe the latest exporter does more stuff
[16:13:56] this is after the restart of db1079 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1079&var-port=9104
[16:14:11] :|
[16:14:30] note the "Monitoring response time" title is on purpose
[16:14:33] I restarted es1019 but nothing showing there
[16:14:34] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=es1019&var-port=9104
[16:14:43] it doesn't really measure the query or connection response time
[16:14:59] only the monitoring response time, which is many queries
[16:15:19] maybe the latest version does more stuff, but it is only seen after it restarts
[16:15:46] otherwise I don't see changes in query patterns or hw performance
[16:16:13] yeah, it is very weird
[16:17:10] I think this is the most revealing: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1099&var-port=13311&from=now-12h&to=now
[16:17:26] went high at 10:30
[16:17:34] now low again
[16:17:39] with only p99 high
[16:17:47] and db1099 wasn't touched today
[16:17:56] not that I know of
[16:18:09] so maybe the latency is from the prometheus host
[16:18:14] and it got restarted or something
[16:18:26] which would explain why it is so slow everywhere
[16:18:35] ah, that could be
[16:18:41] because db1099 wasn't touched
[17:09:09] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4069638 (10Marostegui) So this has been working fine for the last few weeks then, right?
[17:09:44] db1113 is you?
[17:14:42] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4069663 (10jcrespo) Yes, but I have not yet done a 100% deploy (didn't have time to baby sit it) only on s8 and certain s1 hosts.
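One way to check the "maybe the latency is from the prometheus host" hypothesis from the Prometheus side is to look at how long the scrapes themselves take, using the standard scrape_duration_seconds metric via the HTTP query API. The endpoint URL, the one-second threshold and the job label are assumptions for illustration and would need adjusting to the local setup.

```
import json
import urllib.parse
import urllib.request

# Assumed Prometheus endpoint and job label; adjust to the local setup.
PROMETHEUS = "http://prometheus.example.org:9090"
QUERY = 'scrape_duration_seconds{job="mysql-core"} > 1'

def slow_scrapes() -> list:
    """Return series whose last scrape took longer than one second."""
    url = (f"{PROMETHEUS}/api/v1/query?"
           + urllib.parse.urlencode({"query": QUERY}))
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return payload["data"]["result"]

if __name__ == "__main__":
    for series in slow_scrapes():
        instance = series["metric"].get("instance", "unknown")
        print(f"{instance}: {series['value'][1]}s")
```

If the slow scrapes cluster by Prometheus server rather than by database host, that would point at the collector side rather than the exporters, which is what the conversation suspects.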
[17:17:37] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4069687 (10Marostegui) >>! In T183983#4069663, @jcrespo wrote: > Yes, but I have not yet done a 100% deploy (didn't have time to baby sit it) only on s8 and certain s1 hosts. I'm confuse...
[17:21:18] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4069701 (10jcrespo) Oh, sorry, I mixed my tickets. I just was editing the related T149421. This seems to work, but I have not touched it since a long time ago.
[17:25:27] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4069714 (10Marostegui) >>! In T183983#4069701, @jcrespo wrote: > Oh, sorry, I mixed my tickets. I just was editing the related T149421. This seems to work, but I have not touched it since...
[18:24:48] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 7 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4070023 (10Pchelolo)
[18:33:57] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 7 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4070065 (10Pchelolo) The change that partitioned the `refreshLinks` topic in line with MySQL sharding has been deployed. Now we just need...
[18:37:51] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 7 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4070079 (10mobrovac) Relevant dashboards to monitor (for posterity): - [MySQL open connections](https://grafana-admin.wikimedia.org/dashb...