[04:57:11] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[04:57:30] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) db2044 again: ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I Port Name: 2I Gen8 ServBP...
[04:58:08] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[05:02:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Marostegui)
[05:03:31] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Marostegui) 05Open→03Resolved All these hosts are now ready to be productionized at T220572. There is a problem with the controller exposure to the OS w...
[05:52:38] 10DBA: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 (10Marostegui) >>! In T220572#5104345, @MoritzMuehlenhoff wrote: > The RAID controller shows up in early device detection by the kernel: > > > ` > [ 4.385654] smartpqi 0000:5c:00.0:...
[06:07:48] 10DBA: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 (10MoritzMuehlenhoff) I tried 4.19 on db2102, it doesn't make a difference.
[06:22:24] 10DBA: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 (10MoritzMuehlenhoff) HPE renamed the tool, I installed "ssacli" and now "ssacli controller all show config" works fine.
[06:37:43] 10DBA: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 (10Marostegui)
[06:43:05] 10DBA: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 (10Marostegui) Great catch @MoritzMuehlenhoff, thanks! I have created T220787 to follow up on the tooling and monitoring changes needed to adapt to the new Gen10
[07:07:41] Hey, the url shortener has been live since yesterday; looking at the graphs of x1, things are normal, but if there's anything, let me know
[07:10:03] wilco
[07:10:04] thanks
[07:27:51] 10DBA: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 (10Marostegui) p:05Triage→03Normal
[07:43:39] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) @robh @faidon Re: T219461#5103942 I wonder if we should document this step as one to do for these models. The sda/sdb renam...
[07:55:08] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10MoritzMuehlenhoff) >>! In T219461#5106335, @jcrespo wrote: > @robh @faidon Re: T219461#5103942 I wonder if we should document this s...
[07:58:56] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) @MoritzMuehlenhoff, just guessing, but I am assuming it is a chassis "bundled" SD card reader, not something we have bought...
[07:59:08] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) >>! In T219461#5106371, @MoritzMuehlenhoff wrote: >>>! In T219461#5106335, @jcrespo wrote: >> @robh @faidon Re: T219461#...
[09:56:17] parsercache hit ratio is now back at the original values from before the key change
[10:02:26] 10DBA, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (10Marostegui) parsercache hit ratio values are back to normal values after the 1st key change past...
[10:10:09] 10DBA, 10Operations, 10Patch-For-Review, 10codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (10jcrespo) A bit of a recap on the original questions: * Parsercache keys are renamed to pc1, pc2, pc3 at: T210725 * Parsercaches are write-wri...
[10:31:40] hit ratio is normal, but there are still slightly higher than normal levels of miss (absent), vs lower mix (expired)
[10:31:47] *miss
[10:32:13] yeah, that's why I want to wait a full month
[10:32:17] to see how the purge goes too
[10:32:22] we are not in a rush anyways
[10:32:23] free space is still going down, almost 10%
[10:33:17] maybe we should start a faster/harder purging methodology?
[10:33:18] yeah, I guess it still needs to stabilize
[10:33:53] it is at 60% usage now
[10:34:07] not worrying, of course
[10:34:17] but we should keep an eye on it
[10:34:21] yep, I assume it will stabilize
[10:34:28] And then it will purge the old entries
[10:34:49] maybe purge doesn't work well because of the resharding?
[10:35:15] maybe it is looking for those new keys?
[10:35:17] after all, there may be a lot of invalid keys now
[10:35:34] but those will not be purged until 30 days later
[10:35:36] plus
[10:35:57] lots of keys will expire at the same time
[10:36:06] that caused issues in the past
[10:36:25] but the last time, when we forgot to enable replication when coming back from codfw, it didn't have an impact
[10:36:30] so if the trend keeps happening, we should consider doing some random deletions
[10:36:30] and most of them got flushed at the same time
[10:36:43] we can try running the script too
[10:36:45] manually
[10:36:50] that is true
[10:37:13] I don't think we should take measures yet
[10:37:20] yeah, agreed
[10:37:28] but we can force the script to purge stuff if we need
[10:37:34] but I am thinking of possibilities
[10:37:37] of actionables
[10:37:47] and we should reevaluate on Monday
[10:37:59] we can also try a manual run to see if it shows something unexpected
[10:38:24] the old keys should still be getting purge every hour
[10:38:29] getting purged
[10:39:02] so the disk should in theory be the same, as we are just renaming keys really
[10:39:03] I think the only thing I would do for now
[10:39:08] not adding new ones
[10:39:11] is checking if there are any old keys
[10:39:20] that should have been purged?
[10:39:39] older than TTL + 1 day, on any of the 8 hosts
[10:40:04] marostegui: yeah, that may have been missed because of the reorganization
[10:40:18] I can see that happening on the 4th host, or codfw
[10:40:34] let me check the min exptime for each table
[10:41:36] for example, that only runs on the active dc
[10:41:48] and probably should run on both? (not sure)
[10:42:04] as technically pc is already active-active
[10:42:17] min exptime looks correct on pc1007, they are all 11th April or 12th April
[10:42:26] I am going to propose that on T133523
[10:42:27] so there is nothing from 8th April or older
[10:42:27] T133523: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523
[10:42:36] we changed the key on the 9th
[10:42:55] so at least the purge is working there
[10:43:24] I think pc1010 is the one that may have issues?
[10:43:34] Yeah, still on 1009 :)
[10:43:35] or pc20*
[10:44:10] keep in mind that 1010 replicates from 1007, so it should be getting the deletes
[10:44:16] yes
[10:44:35] but the "pc1 deletes", not the pc2 or 3 ones
[10:44:44] pc1010 looks good too, everything is from 11/12 April
[10:45:18] ah wait a second
[10:45:19] I am an idiot
[10:45:24] forgot to change one thing
[10:45:29] pc1010 is indeed wrong
[10:45:39] pc194
[10:45:39] +---------------------+
[10:45:39] | min(exptime)        |
[10:45:39] +---------------------+
[10:45:39] | 2019-03-20 01:08:16 |
[10:45:41] +---------------------+
[10:45:58] so we don't need to fix that now, but I want to keep it in mind
[10:46:10] noted, and maybe write a proposal
[10:46:53] I will write something to fix pc1010 manually
[10:46:56] for now
[10:49:20] 10DBA, 10Operations, 10Patch-For-Review, 10codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (10jcrespo) 3 additional items/proposals regarding purging: * Smarter purging- something maybe priority queue based, while respecting TTL, not s...
[10:49:27] ^these are my notes
[10:50:20] yeah, the #2 and #3 are keys, I think
[10:50:32] I will meanwhile write a delete manually for pc1010
[10:50:37] keys as in important?
[10:50:43] keys as in important, yep
[10:52:20] we may have to do some defragmentation when everything is finished
[10:52:27] technically it is not important
[10:52:27] yeah
[10:52:39] but it will allow us to see cache issues earlier
[10:52:50] like the ones we had that time
[10:53:09] so we would maybe want to do it
[10:53:23] not a big deal right now
[10:53:24] I have done a test on pc1010 with a certain table
[10:53:33] 1521 keys were to be deleted
[10:53:44] and I am checking an optimization too
[10:53:46] on that same table
[10:54:05] we could in the future do local purging
[10:54:45] yeah, that would solve these problems
[10:54:46] as in "per host purging", not replicated
[10:54:50] yeah
[10:54:57] not sure
[10:55:28] as usual, we haven't yet "fixed" HA, so too early to propose something
[10:57:06] haha yeah
[10:57:59] so we gained around 1GB on the table where I deleted + optimized rows
[10:58:58] the table I deleted == truncated?
[10:59:08] no, the table I deleted old rows from
[10:59:14] Keys that should have been expired
[10:59:26] keys as in primary key :-P?
[10:59:34] rows :p
[10:59:37] he he
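A minimal sketch of the kind of per-table check and manual cleanup being discussed here, and of the one-line delete loop mentioned a bit further down. The actual statements were not pasted in the channel, so the `parsercache` schema name and the objectcache-style `exptime` column are assumptions used only for illustration; `pc194` is the table shown in the paste above.

```sql
-- Hypothetical sketch, not the actual loop used on pc1010.
-- Assumes the objectcache-style parsercache layout (keyname, value, exptime)
-- and that the tables live in a schema called `parsercache`.

-- 1. Purge sanity check: if the hourly purge is working, nothing should
--    still be sitting there past its expiry time.
SELECT MIN(exptime) FROM parsercache.pc194;

-- 2. Manual cleanup of rows whose expiry time has already passed
--    (what the purge script should have removed).
DELETE FROM parsercache.pc194 WHERE exptime < NOW();

-- 3. Reclaim the freed space (the "optimization" being tested above).
OPTIMIZE TABLE parsercache.pc194;
```

Repeating the DELETE/OPTIMIZE pair over every pc### table would correspond to the one-line loop and the depool-and-optimize scenario mentioned later in the conversation.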
[10:59:45] that is a lot?
[11:00:01] yeah, but who knows when the last time we optimized that table was
[11:00:06] don't spend time on that
[11:00:14] there are around 1500 rows (average) per table that need to be deleted
[11:00:19] I just had a theory and wanted to see if it was true
[11:00:36] we have to wait a bit more
[11:00:37] it is true for pc1010
[11:00:58] there is also the fact that pc1010 has served different keys
[11:01:00] as it is the spare
[11:01:04] at least now they only go to the spare
[11:01:21] yeah, on Monday I will delete those rows anyways
[11:01:26] so it is clean for Easter
[11:01:37] those rows == rows that should not be there :p
[11:01:45] I wouldn't spend time on that
[11:01:59] I have the loop ready, it is just one line :)
[11:02:00] not until we have finished the migration
[11:02:14] yeah, if the disk doesn't grow much!
[11:02:21] the usage, I mean
[11:02:40] we can also put that on a cron, on all hosts
[11:02:43] but yeah, knowing it is only pc1010 is better
[11:02:56] or an alert
[11:02:59] the others are getting the purges ok
[11:03:31] this is pc1007 though https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=12&fullscreen&orgId=1&var-server=pc1007&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=now-30d&to=now
[11:04:04] ah, the second derivative is negative
[11:04:04] 8% in 3 days
[11:04:29] well, not sure
[11:04:37] but not immediately worrying
[11:04:39] 6%
[11:05:35] do you know why there is no data beyond 2 weeks ago?
[11:05:36] yeah, we can depool it and run optimizes
[11:06:02] no, but it happens on all the hosts, db1112: https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=12&fullscreen&orgId=1&from=now-30d&to=now&var-server=db1112&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[11:06:16] there is data on codfw
[11:07:41] maybe we should ask Chris
[11:07:43] I am going to grab food
[11:07:48] bye
[11:08:01] I will later do a test with pc1010, will clean the old rows and optimize all the tables
[11:08:24] I want to see how much we get back in case we have to depool pc1007, 1008 and 1009 to do an emergency optimization
[11:24:07] FYI, I'm upgrading db1114 to the latest buster state
[11:27:13] noob question
[11:27:23] regarding DB grants syntax
[11:27:34] GRANT ALL PRIVILEGES ON <%= @db_user %>.* TO 'keystone'@'<%= @ipaddress %>' IDENTIFIED BY '<%= @db_pass %>';
[11:27:52] that `ipaddress` is the address of the DB server or the connecting client?
[11:28:05] does that accept a FQDN?
[11:28:44] cc jynus ^^^
[12:24:50] https://dev.mysql.com/doc/refman/8.0/en/grant.html
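On the grants question at the end: in MySQL and MariaDB the host part of 'user'@'host' identifies the client the connection comes from, not the database server, and it does accept a hostname/FQDN (or an IP, or a % wildcard pattern), which is what the linked manual page spells out. A small sketch with made-up names:

```sql
-- The host part is matched against the connecting client: either its IP, or
-- the hostname obtained from reverse DNS of that IP (unless skip_name_resolve
-- is set). Database, host and password values below are made up.
GRANT ALL PRIVILEGES ON keystone.* TO 'keystone'@'10.0.0.15'          IDENTIFIED BY 'secret';
GRANT ALL PRIVILEGES ON keystone.* TO 'keystone'@'client.example.org' IDENTIFIED BY 'secret';
GRANT ALL PRIVILEGES ON keystone.* TO 'keystone'@'%.example.org'      IDENTIFIED BY 'secret';
-- Note: GRANT ... IDENTIFIED BY works on MariaDB and MySQL 5.x; on MySQL 8.0
-- the account must be created first with CREATE USER and then granted.
```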