[07:31:53] Amir1: can I start the schema change in s8?
[09:02:27] federico3: go for it
[09:02:44] regarding the es one, I would need to look at our docs
[09:02:50] or previous cases
[09:02:55] x1 eqiad snapshot very_wrong_size 11 hours ago 307.9 GB -20.2 % The previous backup had a size of 385.8 GB, a change larger than 15.0%.
[09:02:55] dump very_wrong_size 1 day, 7 hours ago 96.8 GB +17.5 % The previous backup had a size of 82.4 GB, a change larger than 15.0%.
[09:03:50] I really think the increase is because of double compression. Is it possible to exclude a table from dump compression, since it's already compressed?
[09:04:52] Also ^_^
[09:04:53] s4 eqiad snapshot wrong_size 4 hours ago 1.8 TB -5.4 % The previous backup had a size of 1.9 TB, a change larger than 5.0%.
[09:04:58] (categorylinks)
[09:05:01] I grew the xfs partition manually for now - my question is whether we need to open a bug with a different team for the host provisioning, or whether we want to do the repartitioning on new hosts
[09:54:57] https://www.irccloud.com/pastebin/epMuFJtr/
[09:55:14] The script is deleting these thumbnail sizes (and it continues in both directions)
[10:00:34] uh?
[10:04:44] I restarted ferm but /var/lib/prometheus/node.d/check_ferm_active.prom is still stale, is there any other known workaround?
[10:05:31] It would be extremely special to have an exception like that
[10:07:27] uh?
[10:07:44] I am answering Amir
[10:10:11] jynus: noted, no big deal. I'm asking them to move it to ES anyway, it'll change
[10:13:25] I'm seeing nrpe2nodexp-ferm_active logging an error about permissions for /var/lib/prometheus/node.d - I'm not seeing bug reports for this in phab - has anyone seen this before?
[10:14:14] given that it is probably a new alert, I don't think it has been seen before
[10:16:42] you should mention it to Tiziano, I think: T384472
[10:16:42] T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager - https://phabricator.wikimedia.org/T384472
[10:21:33] it's probably a red herring, other "healthy" hosts are showing the same error but the stale file is not there
[10:22:19] thanks, I'll poke him
[10:27:13] Amir1: also, compressing on backup still gets a 42% reduction on backed-up size
[10:28:32] I think the increase could be because it has to be stored in hexadecimal for text dumping
[10:30:12] Aaah. That makes more sense
[10:30:42] I knew double compression increases the size but 20% is way too much for that
[10:31:00] no no, compression decreases the size
[10:31:00] This makes a lot more sense
[10:31:24] I am now wondering if what's stored in the rows is very inefficient
[10:31:35] if it is hexadecimal at the source
[10:31:58] It's gzipped content of the original
[10:32:17] I hope it's not turned into hex before storage
[10:33:02] I'll check
[10:47:14] so it has "rawdeflate," + base64-encoded content, with around a +1/3 overhead
[10:47:28] yup, I just checked
[10:50:38] still, it should be around half the size of the original, with some extra overhead
[10:51:19] 1383 bytes on db, 2673 bytes the original
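A minimal sketch (not the MediaWiki code itself) of the storage format described above: raw-deflated text, base64-encoded, with a "rawdeflate," prefix. The helper name and sample input are illustrative; the point is that base64 expands the compressed payload by roughly 4/3, which is consistent with the 1383-byte row versus the 2673-byte original quoted above.

```python
# Sketch of the "rawdeflate," + base64 format discussed above; illustrative only.
import base64
import zlib


def store_rawdeflate(text: bytes) -> bytes:
    # Raw DEFLATE stream (negative wbits = no zlib header or checksum).
    co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    deflated = co.compress(text) + co.flush()
    # base64 expands the compressed payload by ~4/3, plus the literal prefix.
    return b"rawdeflate," + base64.b64encode(deflated)


original = b"[[Example]] wikitext, repeated so it compresses reasonably well. " * 40
stored = store_rawdeflate(original)
print(len(original), len(stored))
# Compression roughly halves the text, then base64 adds back about a third.
```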
[11:54:59] federico3: would you mind running the checklist script again on this ticket? https://phabricator.wikimedia.org/T395241
[12:10:38] checklist script?
[12:17:52] Amir1: I updated the task with a few updated hosts
[12:18:25] thanks
[13:27:30] federico3: I don't know where you're up to with es2049, but it has an alert active against it for 'The /var/lib/prometheus/node.d/check_ferm_active.prom metrics file has not been updated in 5d 0h 0m 5s. Check processes responsible for updating the file on es2049:9100'
[13:30:51] Emperor: I've been talking to tappof, see https://phabricator.wikimedia.org/T403617 and https://phabricator.wikimedia.org/T403615 - the first seems to be affecting all hosts due to the puppet config creating the /var/lib/prometheus directory, yet only es2049 is also showing the stale file
[13:32:16] ah, cool. Worth silencing the alert for a bit with a link to one or other task, then?
[13:34:00] ok
[16:35:09] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184544 this should help both for the new host and the existing ones, fleet-wide, but I would recommend rolling the change out incrementally, one host at a time. Amir1 any thoughts?
[18:32:43] would it be okay if I borrowed backup1012 to test a BIOS patch from Supermicro? the box is currently insetup
[19:39:41] given /backup is empty, I'm going to be bold and borrow it
[22:23:00] well I didn't finish my testing, so if you folks don't mind, I will use backup1012 tomorrow as well
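For reference, a minimal sketch of the staleness condition behind the check_ferm_active.prom alert on es2049 mentioned earlier, assuming it boils down to the metrics file's age exceeding a threshold. The file path and the 5-day figure come from the alert text; the check logic itself is an illustration, not the production Prometheus/Alertmanager rule.

```python
# Illustrative staleness check for a node_exporter textfile metric.
import os
import time

PROM_FILE = "/var/lib/prometheus/node.d/check_ferm_active.prom"
MAX_AGE_SECONDS = 5 * 24 * 3600  # the alert above fired after ~5 days without an update

age = time.time() - os.path.getmtime(PROM_FILE)
if age > MAX_AGE_SECONDS:
    print(f"{PROM_FILE} is stale: last updated {age / 3600:.1f} hours ago")
else:
    print(f"{PROM_FILE} is fresh: updated {age / 3600:.1f} hours ago")
```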