[07:07:00] 10DBA, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green): Compose query for minor edit count - https://phabricator.wikimedia.org/T235572 (10jcrespo) 0.46 sec is still bad for a public api endpoint- and causes outages like T234450 when run multiple times by external users. Plea...
[09:26:19] 10DBA, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green): Compose query for minor edit count - https://phabricator.wikimedia.org/T235572 (10jcrespo) To add more context: Please don't blame us DBAs! 0:-) On another app- 1000 runs a day of 0.46 seconds would be more than accepta...
[10:14:40] jynus: I see backup1001 has progressed. Now fighting with all the small files in fermium (lists.wikimedia.org), but otherwise progressing
[10:14:56] we can probably increase the max concurrent jobs for the storage daemon at some point, by 50% or 100%
[10:14:59] 3 or 4, that is
[10:18:29] apparently if you ask to restore
[10:18:42] it has to stop all jobs before the restore starts
[10:18:49] that is not ideal
[10:19:08] maybe we have to add different pools, can't say
[10:19:48] once they have all run and we have fresh stats on files and GBs, we can see what is the best way to proceed
[10:23:15] also it didn't help that dbprov2002 has very low IO performance, for some reason
[10:23:42] I am checking why now (unrelated to bacula, it happens for db snapshots too)
[10:24:37] akosiaris: I will only ask you for feedback on parametrizable freshness checks- gerrit dumps happen every hour, and they are the most likely to get delayed
[10:31:04] restore is higher priority
[10:31:07] that does make sense
[10:35:15] oh, yes, but I would have expected, with a concurrency of 2, to allow 1 restore and 1 backup at the same time
[10:42:58] ah, that depends on the volume being used
[10:43:13] if you are writing to a volume, the assumption is that you can't read from it as well
[10:43:24] yeah, so that was the main issue
[10:43:27] that's a limitation that comes from tapes
[10:43:50] so in some cases we may want to increase the pools, with separate volumes
[10:44:05] how would that help?
[10:44:06] in fact, that was one of the reasons for separate db
[10:44:10] pools
[10:44:24] dbs are large, so they won't interact with the rest
[10:44:35] so we can "kill" db backup operations
[10:44:43] to restore from there
[10:44:53] without affecting non-database backups
[10:45:34] Elapsed time: 4 hours 27 mins 25 secs
[10:45:40] SD Bytes Written: 44,421,314,005 (44.42 GB)
[10:45:41] what?
[10:45:48] that's phab1003 btw
[10:45:51] yeah
[10:45:53] 4 hours for 40G?
[10:45:56] there may be too many files
[10:45:58] that's insane, what's to blame?
[10:46:04] that is something we fixed with dbs
[10:46:12] by pre-taring some files
[10:46:24] that is not that easy to do in other cases
[10:46:30] we are talking about 2.8 MBytes/s
[10:46:48] that's extremely low
[10:46:48] in any case, phab1003 may not be in production
[10:46:59] so I'm not sure what was actually backed up
[10:47:31] phab1001 is more or less the same
[10:47:36] Transfer rate=4.786 M Bytes/second
[10:47:39] that's really bad as well
[10:47:49] 18 mins for 4G
[10:47:54] I believe it is like cobalt
[10:48:10] one file for each file on the repos
[10:48:15] at least people1001 is fast. Transfer rate=23.65 M Bytes/second
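The pre-taring trick mentioned above maps onto Bacula's ClientRunBeforeJob directive. A minimal sketch, assuming hypothetical job, client, fileset and script names (this is not the actual WMF configuration): the client tars the many small repo files into one archive before the job runs, so bacula-fd streams a single large file instead of opening each of them.

```
# Director configuration sketch; names, paths and the FileSet are hypothetical.
Job {
  Name = "phab-repos-pretar"
  Type = Backup
  Client = phab1001-fd
  FileSet = "phab-pretar"          # would include only the pre-tarred archive dir
  # Tar the repo tree into a single archive before the backup runs...
  ClientRunBeforeJob = "/usr/local/sbin/pretar-phab-repos"
  # ...and clean it up afterwards.
  ClientRunAfterJob = "/bin/rm -f /srv/pretar/repos.tar"
  # Schedule, Storage, Pool and Messages omitted for brevity.
}
```

This mirrors what was done for the database backups, at the cost that a restore yields the tarball rather than individual files.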
[10:48:21] don't worry too much about that
[10:48:27] netmon2001 is at 58M/s
[10:48:33] ok, so it's not the storage daemon to blame
[10:48:35] as part of my work
[10:48:51] I intend to review all backups and see what optimizations could be done
[10:49:00] but at a later stage
[10:49:18] Could not stat "/var/lib/librenms": ERR=No such file or directory
[10:49:20] hmmm
[10:49:25] yeah that would be nice
[10:49:32] for what host/job?
[10:49:43] netmon1002 librenms
[10:49:51] I said to m*rk I had at least 1 year of work :-D
[10:49:54] probably moved to some other place without the fileset being updated
[10:50:11] yeah, that is another thing- unused filesets
[10:50:20] or ones not backing up what they were supposed to
[10:50:29] but I have to go client by client and ask people
[10:50:47] also in some cases I suspect people may be backing up things like a mysql datadir
[10:50:54] and that wouldn't work on restore
[10:51:11] so it needs review- don't worry, that was all planned for the next step
[10:51:31] remember I said "step 1- failover, later, see what has to be redesigned/fixed"
[10:51:36] in the document
[10:52:01] maybe producing some rules on best practices for client additions, etc.
[10:52:29] Files=6,445,956 Bytes=143,670,573,981 AveBytes/sec=2,752,889 LastBytes/sec=3,686,050
[10:52:30] look, you had like 1000 responsibilities, and you still have those
[10:52:33] Files=1,411,043 Bytes=12,803,467,738 AveBytes/sec=981,635 LastBytes/sec=841,920
[10:52:40] sigh... this is going to take forever
[10:52:44] I am here to help with the burden
[10:53:33] yeah, that's appreciated, I am just pointing out issues as I see them. I don't expect them to be solved instantly or anything
[10:53:38] I might even help solve some
[10:54:13] funnily enough, for dbs I was called out for having 200K files, and "fixed" that by pre-taring them down to 5000
[10:54:24] but those were much worse!
[10:54:27] poor bacula-fd on phab2001 is at 100% cpu
[10:54:37] 1 thread, usually
[10:54:45] going back to multiple pools, that could help
[10:54:51] in those extreme cases
[10:55:17] but again, I don't have a concrete proposal, just something I brought up
[10:55:18] yeah, but keep in mind we have those low concurrency numbers because bacula used to not be great at high concurrency levels
[10:55:28] it used to lock up between the threads
[10:55:37] well, the blocker here is the fd, not the sd
[10:55:41] hence the conservative approach
[10:55:45] of course
[10:55:57] granted, that was in version 5.0
[10:56:04] but as you said, the sd right now has low throughput
[10:56:14] and we are on 9.0, and from what I know a lot of bugs have been opened and fixed regarding this
[10:56:24] we'll see about that
[10:56:31] not that they are not fixed
[10:56:41] but what new ones were introduced :-)
[10:56:46] :D
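Both ideas in this exchange, raising the storage daemon's concurrency and giving the pathological clients their own pool, are plain Bacula configuration. A sketch with hypothetical names, not the production config:

```
# In bacula-sd.conf: raising Maximum Concurrent Jobs (e.g. from 2 to 4) lets a
# restore overlap with backups, as long as they don't need the same volume.
Storage {
  Name = backup1001-sd
  Maximum Concurrent Jobs = 4
  # WorkingDirectory, Pid Directory, etc. omitted.
}

# In the director: a dedicated pool (and thus separate volumes) for the huge or
# many-file clients, the same reasoning as the separate db pools: pausing
# writes to its volumes for a restore then doesn't block unrelated backups.
Pool {
  Name = "production-large"
  Pool Type = Backup
  Label Format = "prod-large-"
  # Volume Retention, Maximum Volume Bytes, etc. omitted.
}
```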
[10:56:51] I have another question
[10:57:10] the setup for backup2001- not sure what the current status is according to puppet
[10:57:35] will we need to set up puppet code to "transfer" backups remotely?
[10:57:53] as that is now supported in 9?
[10:57:55] yeah, there is some already IIRC
[10:58:07] but we should revisit this
[10:58:21] another thing you mentioned is the possibility of keeping backups local to the DC
[10:58:26] yeah
[10:58:41] it might make sense by default to send all backups to the local SD
[10:58:42] maybe actually migrate to a dual director(?)
[10:58:49] dual SD, not dual director
[10:58:58] I know it is dual sd
[10:59:01] it's possible and probably rather easy
[10:59:17] but I was also thinking, later on, of a dual director for different rules, but that can be delayed
[10:59:26] maybe some puppet coding to let a hiera var override per DC
[10:59:38] not sure that multiple directors are that well supported
[10:59:45] we will need to test
[10:59:50] well, it wouldn't be dual directors
[10:59:55] as much as a parallel installation
[11:00:04] but again, I don't have a concrete proposal
[11:00:45] we have to think, because in some cases it may make sense to back up only to the local dc
[11:00:51] libcrypto.so.1.0.2
[11:00:54] (like db backups)
[11:01:03] that's what perf top says
[11:01:07] interesting
[11:01:15] but in other cases we want dc redundancy
[11:01:28] I have really not thought much about that yet, honestly
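For reference, the "dual SD" option needs no director-level redesign: one director can drive several storage daemons, and the per-DC choice could then hang off a hiera variable as suggested above. A sketch of a second Storage resource on the director, pointing at the codfw sd (device and media type names are hypothetical):

```
# Director configuration sketch; Device/Media Type names are hypothetical.
Storage {
  Name = backup2001-sd
  Address = backup2001.codfw.wmnet
  SD Port = 9103
  Password = "..."                 # per-pair secret, managed by puppet
  Device = FileStorageCodfw
  Media Type = FileCodfw
}
```

Jobs for DC-local clients (like db backups) would then select backup2001-sd, while everything needing cross-DC redundancy keeps the remote one.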
[11:02:07] I think more important than all of this is to have feature parity with the old setup
[11:02:28] aka mount and test the archive, mount the database-only pool
[11:02:44] and get to a 100% success rate
[11:03:51] just to be clear, I know you are very busy, so I really just needed your help for the migration
[11:04:10] I can keep monitoring the status and will call you again when I get stuck
[11:07:12] ok, good to know, thanks
[11:07:24] I'll be keeping an eye on it every now and then and point out stuff, if you don't mind
[11:21:29] @meeting
[12:29:32] 10DBA, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green): Compose query for minor edit count - https://phabricator.wikimedia.org/T235572 (10WDoranWMF) @jcrespo thank you for the help! I'll sync with Bill when he's on.
[14:06:50] jynus: telnet -4 backup1001.eqiad.wmnet 9103
[14:06:50] Trying 10.64.48.36...
[14:06:54] router ACLs it seems
[14:07:07] I'll file a task for netops
[14:08:25] I can connect
[14:08:42] what is that port?
[14:08:49] 9103
[14:08:56] it's the analytics VLAN firewall
[14:09:06] it has 10.64.0.179/32 (helium's IP)
[14:09:07] where are you trying from?
[14:09:14] an-master1002
[14:09:26] which is in the analytics VLAN
[14:09:39] ah, I see
[14:10:02] sorry, I forgot the context, I was actually working on another connectivity issue
[14:14:31] https://phabricator.wikimedia.org/T237016
[14:15:08] so I restarted an unrelated host, and it got incomplete firewall rules
[14:15:26] probably unrelated
[14:15:30] ?
[14:15:35] weird...
[14:15:36] but writing it here
[14:16:10] did you check the firewall too? just to rule out that they are related?
[14:16:22] the host fw, I mean
[14:17:52] yeah, it was ok
[14:17:56] ok
[14:18:16] sorry, I was debugging that, and got confused by your catch
[14:18:18] :-D
[14:19:56] phab1002 is indeed because of the many files btw
[14:20:00] er, phab2001
[14:20:11] I've straced it... it opens thousands of files per minute
[14:20:45] the codfw one we can disable if it causes overload
[14:20:57] only the primary is worth backing up, really
[14:20:58] niah, it's just making things slower
[14:21:02] ok
[14:21:23] the rest are moving quite fast overall
[14:21:34] then again, most have very little data to back up
[14:21:52] e.g. confXXXX hosts have something like 2MB?
[14:22:05] it's important we don't lose the data, but it's very little data
[14:22:59] there were some hosts that got stuck the previous days
[14:23:15] on lack of client (fd) reload
[14:23:21] just FYI
[14:23:25] on the tls error
[14:23:46] please allow me to ignore you, akosiaris, while I work on this new firewall issue ATM
[14:25:20] hahahaha
[14:25:22] ok
[14:31:36] it should be fixed now
[15:12:27] I am back
[15:12:35] and "Fresh: 83 jobs" looks good
[15:12:41] better than helium
[15:47:33] here
[15:47:39] :)
[15:47:52] so jbond42 discovered one issue, akosiaris, with the way backup passwords are generated
[15:48:07] https://phabricator.wikimedia.org/T221083
[15:48:46] fqdn_rand_string is stable?
[15:49:07] yes
[15:49:16] on purpose
[15:49:20] fqdn_rand_string should be around for some time. however, that would be predictable
[15:49:26] yeah
[15:49:33] not sure how predictable uniqueid is
[15:49:36] so what is the best strategy?
[15:49:42] ah, uniqueid dies?
[15:49:43] dammit
[15:49:52] if you have one, jbond42
[15:50:02] otherwise we will figure out something
[15:50:10] akosiaris: we have recreated it internally, so it's not a big deal to maintain support if this is a problem
[15:50:33] well, but we shouldn't maintain cruft
[15:50:34] no it's not, as long as we find a valid replacement
[15:50:41] as for a replacement, we could use the seeded_rand function from stdlib with a seed pulled in from the private repo
[15:50:43] all we want is something that is local to the host and not public
[15:50:46] we can maybe program something?
[15:50:50] and is stable across facter runs
[15:50:57] yeah, deterministic but not guessable
[15:51:12] or stable but not guessable
[15:51:13] we don't even care too much if it's guessable. Just not fully public
[15:51:28] well, if it is "the host name", that is deterministic
[15:51:30] :-P
[15:51:37] but a bad choice
[15:51:39] https://github.com/puppetlabs/puppetlabs-stdlib/blob/master/REFERENCE.md#seeded_rand_string should do the trick
[15:51:40] :-)
[15:51:40] yeah, but it isn't, from what I remember
[15:52:15] I will add that task to the backup goal
[15:52:16] I remember uniqueid not having many bits of entropy, but still needing some work to guess
[15:52:33] yes, I think it's 6 or 8 chars, from memory
[15:53:03] actually, https://github.com/puppetlabs/puppetlabs-stdlib/blob/master/REFERENCE.md#fqdn_rand_string with a seed from private. I'll create a CR and add you both
[15:53:18] that would be nice. thanks!
[15:53:24] np
[15:53:45] oh, he asks for help, but ends up doing all the work, where do I sign!?
[15:53:54] :-D
[15:53:55] lol
[15:53:57] :D lol
[15:54:29] thanks for the heads up, I wasn't even aware of that issue
[15:55:12] it's low prio as it's not affecting anything, so I had mostly forgotten about it tbh
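To make the issue concrete: the generated password lives on both ends of the client/director pair, which is also why rotating it takes a puppet run on both sides. A sketch with hypothetical host and resource names, showing where a value from stdlib's fqdn_rand_string with a private-repo seed (jbond42's proposal) would land:

```
# On the director (bacula-dir.conf), in the per-host Client resource. Today the
# value is derived from the public FQDN alone; seeding it from the private repo
# keeps it stable across runs while no longer being derivable from the hostname.
Client {
  Name = an-master1002-fd
  Address = an-master1002.eqiad.wmnet
  Password = "<value of fqdn_rand_string(32, '', $seed_from_private_repo)>"
  # Catalog, File Retention, etc. omitted.
}

# ...and the same value on the client itself, in bacula-fd.conf's Director
# resource; both ends must agree for the connection to authenticate.
Director {
  Name = backup1001-dir
  Password = "<same generated value>"
}
```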
[15:57:52] so now I understand the original question
[15:58:34] when that changes, easy automatic recovery will not be possible
[15:58:49] but as long as we have the old director password, we should be ok
[15:59:08] akosiaris: agree? ^
[15:59:22] actually, it is the puppet cert
[15:59:26] nope
[15:59:33] it's a password that is unique per host
[15:59:44] and we can change it whenever we want
[15:59:55] as long as puppet runs on both the host and the bacula server
[15:59:57] it should be fine
[15:59:59] ah, I am mixing up certs and passwords
[16:00:05] thanks
[16:00:10] so all we require is 1 hour during which backups are not being run
[16:00:21] sorry
[18:07:22] 10DBA, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green): Compose query for minor edit count - https://phabricator.wikimedia.org/T235572 (10eprodromou) I've captured a task T237043 for doing these counts the "right way" by keeping them in the database and incrementing on edit....