[07:16:26] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[07:23:00] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) p:Triage→High BBU broke ` Battery/Capacitor Count: 0 ` @Cmjohnson Can we give this host some priority? I wouldn't want to have it down for the whole offsite week. I believe its support just...
[07:24:25] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) @jcrespo I am going to place db1135 temporarily (T222682) to replace this host until we have found a solution
[07:46:24] DBA, Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (Marostegui)
[08:02:01] thanks for the quick reaction
[08:02:35] jynus: I am going to leave mysql stopped for now, until we hear from cmjohnson1 , once that is done, I will start mysql and check the data
[08:02:40] sounds good?
[08:02:51] I doubt cmjohnson1 will have a spare BBU, but let's wait
[08:03:02] the support finished the 24th may, bad luck :(
[08:03:51] I am not sure of that
[08:03:55] let me check
[08:04:16] not sure of what?
[08:05:52] indeed they arrived 3 years ago
[08:11:35] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[08:36:55] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/514439/
[08:38:17] DBA, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) @Marostegui Ok if I send them to DC Ops?
[08:38:34] DBA, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui) +1
[08:52:50] I am going to try a restore before full decom
[08:53:50] sure
[08:54:02] can you take a look at the above patch for db1135 when you have a minute?
[09:15:46] DBA, Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (Marostegui)
[09:22:06] how did you set it up?
[09:22:45] with a snapshot
[09:22:55] cool
[09:23:01] :)
[09:23:11] I am recovering to dbprov2001 and 2
[09:23:30] FYI in case something goes wild like consuming 10 TB of data or something
[09:23:39] good test indeed
[09:23:40] DBA, Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (Marostegui)
[09:23:51] the state is a bit worrying, will tell later
[09:24:25] sure
[09:25:52] it has been weeks since I was the DBQuery so clean XD
[09:26:03] the DBQuery logstash channel
[09:26:07] indeed
[09:26:20] s/was/saw
[09:26:20] except for the 400K when hw crashes
[09:26:27] :-)
[09:26:32] haha yeah
[09:27:27] actually, aside from the 17 ongoing queries
[09:27:30] no fatals
[10:01:18] DBA, Operations, observability: Generate instance list of database hosts to be monitored automatically from exported resources - https://phabricator.wikimedia.org/T177779 (fgiunchedi) I was reviewing #observability backlog, to me it looks like this is a duplicate of {T145072} ?
[10:03:35] DBA, Operations, observability: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072 (jcrespo)
[10:03:38] DBA, Operations, observability: Generate instance list of database hosts to be monitored automatically from exported resources - https://phabricator.wikimedia.org/T177779 (jcrespo)
[10:05:13] DBA, Operations, observability: Generate instance list of active database hosts to be monitored from prometheus - https://phabricator.wikimedia.org/T145072 (jcrespo)
[10:18:09] DBA, MediaWiki-extensions-Renameuser: Fix use of DB schema so RenameUser is trivial - https://phabricator.wikimedia.org/T33863 (Aklapper)
[10:29:57] DBA, Wikimedia-Site-requests: Global renaming of B_dash → A1Cafel - https://phabricator.wikimedia.org/T224916 (revi) @jcrespo please consider sending an email to global-renamers - at - lists.wikimedia.org in case of improvements like this -- which everyone will love.
[10:36:40] DBA, Wikimedia-Site-requests: Global renaming of B_dash → A1Cafel - https://phabricator.wikimedia.org/T224916 (jcrespo) @revi I will suggest the developers to announce it there (all credit to the core team), although I am mostly sure they are aware of this; just not ready enough to be announced yet.
[12:25:20] restore works as intended, although there are many things I don't like
[12:26:21] like?
[12:26:35] retention and purging, and what that entails
[12:26:44] also not a clear recovery path due to the above
[12:26:59] policies should be different, etc.
[12:27:11] although that was mostly all known
[12:27:26] the important thing is that I was able to recover month-old backups
[12:27:57] do you want to give the dbstores a last look
[12:28:19] or did you do it already, or something else?
[12:31:30] no, I already did
[12:31:49] nuke them
[12:32:30] thanks, sorry to ask many times
[12:32:45] one is never sure of potentially destructive things
[12:34:38] I saw a backup of db2037 on dbprov2001
[12:34:55] It doesn't bother me, but is it ok if I put a calendar entry for its deletion?
[12:39:00] sure
[12:39:04] I will do that
[12:39:12] 1 month? 3 months?
[12:39:18] ok, you handle it
[12:39:34] I just want to keep things cleaner on the new servers
[12:39:34] I will give it 3 months yeah
[12:45:58] DBA, MediaWiki-Cache, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (jcrespo) The hit rate seems back to normal, so I am guessing in a couple of weeks this could get resolved?
[12:46:59] DBA, MediaWiki-Cache, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (Marostegui) >>! In T210725#5236616, @jcrespo wrote: > The hit rate seems back to normal, so I am guessing in a couple...
[12:48:00] DBA, MediaWiki-Cache, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (jcrespo) > there is still one key to be changed That is what I meant with "resolved" :-P.
[12:49:11] wow september, but the year just started!
[12:50:20] that's 3 months :p
[12:50:26] I know
[12:50:45] Of course if we run into disk space issues we can kill it earlier
[12:51:19] the recovery is finishing, I have pending to remove those now
[12:51:39] bye bye dbstores....
[12:51:43] so long!
[13:00:07] DBA: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (Marostegui)
[13:01:33] DBA: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 (Marostegui)
[13:01:54] I am going to start a new codfw backup into es2002,3 of es2,3
[13:02:05] DBA: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (Marostegui)
[13:02:21] you will see lots of activity, it should take I think 12 hours or 1 day
[13:02:50] DBA: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (Marostegui)
[13:02:51] great
[13:02:55] thanks for the heads up :)
[13:03:06] DBA: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (Marostegui) p:Triage→Normal
[13:08:18] DBA: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (Marostegui)
[13:18:20] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=es2&var-shard=es3&var-role=All&from=1559729894252&to=1559740694253
[13:18:40] it may have caused some temporary impact on replication performance on codfw
[13:18:45] but I see no lag
[13:19:05] lots of rows read of course, as expected
[13:19:11] yeah
[13:19:17] I am curious to see how long it takes
[13:19:41] "but I see no lag" meaning "no further ongoing lag issues after a spike"
[13:20:29] speed won't be great as those are pure HD hosts
[13:20:40] and not warmed at all
[14:21:37] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) a:jcrespo→Cmjohnson @robh @Cmjohnson These 3 hosts are ready to be decommissioned. Alerts have been disabled, roles deleted (it is still a spare role),...
[14:22:08] Good afternoon! db1091...i do have a spare bbu but that spare has been helpful the last year or so. HP is slow to send out the batteries, they can take days to get because of their slow response time and then having to ship batteries via ground transportation only. If I use it for this server then I am not able to quickly change out the bbu on something that may be more important in the future. The call
[14:22:08] is yours since you have the most BBU issues.
[14:22:19] cmjohnson1: you have a spare BBU??
[14:22:26] niiiiiiice
[14:22:27] i do but see above
[14:23:20] cmjohnson1: Yeah, I see, I think we do need it for this host, as it is one of the ones that support most of the weight in s4 (commonswiki) which is one of the biggest wikis
[14:23:42] cmjohnson1: we might get 2 extra hosts at the end of q1 if analytics are able to free them up, but for now I think we do need db1091 in service
[14:23:59] okay, works for me I will get to it today...can you leave it down.
[14:24:09] I will power it off for you yep
[14:24:55] cmjohnson1: db1091 is now poweredoff, thank you so much
[14:25:22] YW
[14:25:35] I'm pasting these comments in the task
[14:25:47] great
[14:25:48] thanks
[14:26:35] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo)
[14:26:46] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Cmjohnson) Good afternoon! db1091...i do have a spare bbu but that spare has been helpful the last year or so. HP is slow to send out the batteries, they can take days to get because of their slow response time...
[14:27:37] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) a:Cmjohnson→RobH
[14:28:40] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo)
[14:29:00] DBA: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 (Marostegui)
[14:29:27] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo)
[14:30:14] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui)
[14:30:23] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) I've marked the hosts as "decomissioning", the template and the wiki seem to be outdated and unclear what to do?
[15:59:00] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) I've moved them to active as per volans' advice.
[17:10:24] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Cmjohnson) Open→Resolved The bbu has been replaced.
[17:31:42] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) Thank you so much @Cmjohnson I can see the battery now: ` Cache Backup Power Source: Batteries Battery/Capacitor Count: 1 Battery/Capacitor Status: OK ` Next steps I will take: - Start MySQL...