[09:43:52] <_joe_> jynus: I was noticing https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=37&fullscreen&orgId=1&from=1572340997147&to=1572342192229
[09:44:13] <_joe_> what's test-s1?
[09:44:23] it is a test host, you can ignore it
[09:44:35] a production test host - e.g. we recover backups there
[09:44:42] <_joe_> but still, writes to s1 spiked since 9:30
[09:44:47] s1?
[09:45:10] <_joe_> both s1 and test-s1
[09:45:14] <_joe_> according to that graph
[09:45:34] I will check
[09:45:54] <_joe_> https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=7&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All&from=now-1h&to=now s1
[09:45:57] they seem to line up, which is strange
[09:46:18] <_joe_> yeah, that too; it seems like a monitoring artifact (that they line up)
[09:46:34] so test-s1 normally replicates from s1
[09:46:41] to test that imports happen correctly
[09:46:50] but they shouldn't show up on s1-core stats
[09:47:01] maybe it is duplicated?
[09:48:21] indeed it is
[09:48:32] so a bug on monitoring (my fault)
[09:48:45] db1114 was on both s1 and test-s1
[09:48:58] I will fix it
[09:49:22] I think this was not because it is a test host, but because it used to be a production s1 host
[09:49:30] and it was added to the new section but not removed from the old one
[09:49:40] thanks for the heads up, _joe_
[09:50:53] _joe_: on the next puppet run I think it should retroactively fix it
[09:53:46] <_joe_> great :)
[09:54:08] although there is a point to reverting my change
[09:54:36] as test-s1 is, at the same time, part of the s1 replica set but part of the test-s1 group
[09:54:59] I need to define those groups better, in a way that makes sense for monitoring
[09:55:28] it definitely shouldn't have shown up on core
[09:55:55] maybe we can create a new group "test"
[10:17:07] 10DBA, 10Operations: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10jcrespo) I noticed db2062 isn't set on m1, is that on purpose because it is going to be decommissioned? Or because we didn't want it to alert? Or something else? CC @Marostegui
[10:31:33] FYI, I'm upgrading PHP on dbmonitor1001, there'll be a blip of a few seconds on tendril
[10:32:08] completed
[11:02:26] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) I've created a copy of the bacula database on the bacula9 one, and then ran: ` sudo -u bacula ./update_mysql_tables -h m1-master.eqiad.wmnet bacul...
[11:31:57] jynus: o/. should we go over the process one last time before we begin?
[11:32:43] so there are changes
[11:33:13] because bacula won't support an sd that is a different version, nor will it work for buster
[11:33:44] so in theory we will have to upgrade the current host to buster for it to work
[11:33:53] more than in theory, really
[11:34:36] but at this point I would like to see new backups succeeding on backup1001
[11:35:52] akosiaris: let me know your thoughts
[11:36:38] (switchover on a different database, make it work, revert if we cannot)
[11:36:57] (then reevaluate next steps)
[11:40:16] jynus: isn't backup1001 buster already?
[11:40:22] * akosiaris can't connect for some reason, debugging
[11:40:36] yes, but it won't be able to work with the existing helium
[11:40:51] or the buster client will not work with helium
[11:41:11] indeed, but that's the case right now as well, isn't it?
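
A hedged sketch of the "created a copy of the bacula database on the bacula9 one" step from the 11:02 phabricator comment above. The database names bacula/bacula9 and the m1-master host come from the conversation; credentials/defaults-file handling is omitted and would need to match the real environment.

    # copy the existing catalog into a new database before running the
    # bacula 9 schema upgrade on the copy only (sketch, not the exact commands)
    mysql -h m1-master.eqiad.wmnet -e "CREATE DATABASE IF NOT EXISTS bacula9"
    mysqldump -h m1-master.eqiad.wmnet --single-transaction bacula \
      | mysql -h m1-master.eqiad.wmnet bacula9
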
[11:41:17] yes
[11:41:40] but it invalidates the step of "make it work with the current director as a storage daemon"
[11:42:06] only partially (the rest of the fleet still works) but ok, it's indeed a complication
[11:42:26] maybe after some months
[11:42:33] we can upgrade helium before decom
[11:42:36] bast3004:~$ telnet -4 backup1001.eqiad.wmnet 22
[11:42:37] Trying 10.64.48.36...
[11:42:38] ?
[11:42:49] it is up, or it used to be up
[11:42:50] what have we forgotten there?
[11:42:58] it's up, it's bast3004 that is the issue
[11:43:04] I am connected
[11:43:23] ah, yeah, that was warned about yesterday I think
[11:43:47] puppet hasn't run perhaps?
[11:43:58] puppet is disabled there indeed
[11:44:05] ah ok, that explains it
[11:44:12] as I was doing tests and bacula wipes the config
[11:44:15] can you do a service ferm stop for a bit?
[11:44:24] so I can log in without having to go through the mgmt?
[11:44:29] and I'll temp-fix the ferm rules
[11:44:38] ok
[11:45:18] done
[11:45:24] thanks
[11:45:26] I can also enable puppet
[11:45:44] it will just wipe out everything on the config
[11:46:17] we will anyway have to, but gimme 10 to reassess the plan and the situation a bit
[11:46:28] ok
[11:47:33] ok, that part fixed, /me revisiting the plan
[11:52:11] jynus: "Test a full backup/restore cycle on a new remote host towards backup1001 to validate new setup [Jaime is doing this]"
[11:52:20] do I understand correctly this has happened already?
[11:52:29] not to the extent I wanted
[11:52:46] what do you mean?
[11:53:01] it is difficult to manually configure bacula
[11:53:33] and I just wiped the database
[11:58:55] so?
[11:59:29] oh, I was planning to test a single host backup/restore; if it worked that's fine, we will address issues with other hosts as we meet them
[11:59:46] we WILL have issues
[11:59:49] with pools
[11:59:59] and the db not matching current entries
[12:00:08] helium being uncontactable
[12:00:10] etc.
[12:00:30] but at this point I just want new backups to work, as I don't think there is a good way to migrate things
[12:00:50] I don't even think that migrating the db will sserver for anything
[12:00:57] *work
[12:01:19] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10akosiaris)
[12:01:35] what do you mean?
[12:02:04] well, the old director will become uncontactable, and I am unsure we will ever make it work at a later time
[12:02:19] which may make the backups not work in general, because of lots of unusable entries
[12:02:42] uncontactable because of the ferm rules?
[12:02:52] we can dig holes for those
[12:02:54] no, because sd == dir
[12:03:10] bacula segfaults when using a different protocol version
[12:03:10] we have an sd role, we can switch it to that
[12:03:16] yes, of course
[12:03:20] we will lose nothing
[12:03:21] not a different one, an incompatible one
[12:03:31] we can always revert
[12:03:34] backup1001 will be able to talk to helium
[12:03:43] I disagree
[12:03:48] after my testing
[12:03:56] and < v9 clients (aka buster) will be able to talk to helium as well
[12:04:11] ah, you mean as an SD?
[12:04:12] but hey, better if you are right :-D
[12:04:17] exactly
[12:04:19] yes, that is right I am afraid
[12:04:23] but that's actually fixable
[12:04:30] I've backported bacula 9 to jessie
[12:04:30] with an upgrade?
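
An aside on the connectivity checks above (the 11:42 telnet test, and the daemon reachability questions that follow): the standard bacula ports are 9101 for the director, 9102 for the file daemon and 9103 for the storage daemon. A hedged sketch of a quick reachability check; the host list is just an example taken from the conversation, and not every host runs all three daemons.

    # quick port scan of the bacula daemons involved in the switchover
    for host in backup1001.eqiad.wmnet helium.eqiad.wmnet; do
      for port in 9101 9102 9103; do
        nc -z -w 2 "$host" "$port" \
          && echo "$host:$port open" \
          || echo "$host:$port closed/filtered"
      done
    done
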
[12:04:40] we can just install the bacula-sd package on helium manually
[12:04:41] ok, that changes everything
[12:04:51] you didn't tell me that
[12:04:53] wait out the 3 months and then forget about it, since we will kill the box
[12:05:13] well, finally we have the archive thing
[12:05:19] I did
[12:05:31] https://phabricator.wikimedia.org/T235838#5587009
[12:05:37] which we expect to be able to reconnect
[12:05:55] I missed the "do already have debs"
[12:05:58] sorry
[12:06:10] no worries
[12:06:23] ok, then I suggest we proceed with the deploy
[12:06:28] ok, I think we can go forward https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/544665/
[12:06:30] we should lose nothing
[12:06:35] lemme double check it
[12:06:35] just a bit of downtime
[12:06:48] see how I created bacula9 on purpose
[12:06:51] and upgrade only that
[12:07:00] so the original one is untouched
[12:07:16] we may miss backups in the last hour or so
[12:07:50] I also have offline backups
[12:07:58] yup, but we can't really do much better, so it should be ok
[12:08:01] +1ed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/544665/
[12:08:17] so the thing that is likely to fail is pool management
[12:08:28] not sure if those will be created automatically
[12:08:39] the new pools in the config?
[12:08:44] IIRC they will
[12:08:48] more like the old ones
[12:09:11] because there is some weirdness with those configured by clients expected on the local sd
[12:09:12] hmmm
[12:09:17] aaaaah
[12:09:20] we forgot 1 thing
[12:09:24] what?
[12:09:27] we should copy over the .bsr files
[12:09:35] normally they will be rewritten on the first backup
[12:09:40] I thought about those
[12:09:46] I opted not to
[12:09:58] as we have not migrated the files
[12:10:11] and we can do it at a later time
[12:10:33] true, and I guess they will all have been recreated before the end of next week
[12:10:38] exactly
[12:10:49] later we can reevaluate
[12:10:55] I've only had to resort to them once in my life
[12:10:57] I think the main lesson for the future
[12:11:08] is to be able to have multiple directors
[12:11:14] with different config
[12:11:21] but that will require a lot of work
[12:11:27] but will help on the next migration
[12:11:41] quite some work indeed. But now with hiera being around it should be easier
[12:11:55] a lot of the design issues in the backup/bacula modules stem from the lack of hiera when I wrote them
[12:11:58] we may even need them for codfw vs eqiad different policies
[12:12:09] yeah, that is more than understandable
[12:12:22] I just wanted to see if you agreed with me working on that
[12:12:30] in the future
[12:12:38] overall? yes!
[12:12:52] akosiaris: your puppet code was better than average!
[12:12:52] anyway, let's break everything, shall we?
[12:12:58] let's do it
[12:12:59] :-D
[12:13:35] I am just a bit pessimistic, plus bacula Enterprise backup solution weirdness doesn't help
[12:14:20] as long as we have a way back, we will be ok
[12:15:15] deploying
[12:16:30] should I run puppet on helium?
[12:16:30] hmm, we will have to manually delete the production00XX entries from the bacula db
[12:16:34] yes, go ahead
[12:16:50] cause we haven't copied over those files and we want to anyway keep them frozen
[12:16:53] so better to create new ones
[12:16:57] ah wait
[12:17:04] ?
[12:17:06] will helium move over to using bacula9?
[12:17:19] it should- eventually
[12:17:24] no scratch that, unimportant
[12:17:37] yeah, it is not easy? :-D
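
A hedged sketch of the "install the bacula-sd package on helium manually" idea from 12:04 above, using the jessie backport of bacula 9 mentioned in the conversation. Which apt repo/component the backported debs actually live in is an assumption here and would need to be adjusted.

    # on helium: confirm the backported 9.x storage daemon is visible to apt,
    # then install only the sd (leaving the old director untouched)
    apt-get update
    apt-cache policy bacula-sd bacula-common
    apt-get install bacula-sd bacula-common
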
[12:17:57] I've been thinking a bit about this, that is why I wanted to deploy and do a proper production test
[12:17:58] :-)
[12:18:14] we should stop the director on helium however
[12:18:17] just to make sure
[12:18:19] yes
[12:18:28] I will run puppet and stop it
[12:18:38] ok
[12:18:49] note it is on a different database
[12:18:53] so that should be ok
[12:19:06] yup, which is why I said we should stop it
[12:19:18] can you downtime helium and bacula?
[12:19:20] I would like that database to not be touched at all for a few days
[12:19:24] and heze probably too
[12:19:25] yeah, will do
[12:19:40] ah!
[12:19:44] one minor issue
[12:19:47] ?
[12:19:56] we WILL have to restart bacula-fd manually all over the fleet
[12:20:03] oh
[12:20:07] I didn't know that
[12:20:13] for the key reload?
[12:20:19] it doesn't support reloads, so on config changes we don't restart it, in order to not stop backups that are currently running
[12:20:20] or the password, or whatever it is on the config
[12:20:29] well, that should be ok
[12:20:31] for pretty much all of the above
[12:20:36] yeah, adding it to the task
[12:20:41] we can do it for some hosts first
[12:20:47] in case we revert quickly
[12:22:47] downtimed until end of February in icinga: helium/heze
[12:22:52] bacula is not running on helium
[12:23:01] except the sd and the fd
[12:23:22] ok, cool
[12:23:28] I will enable puppet and run it on backup1001?
[12:23:40] ok?
[12:23:47] go for it
[12:24:12] I'll delete the production00XX volumes on backup1001
[12:24:24] and let bacula recreate/label them automatically on the first backup
[12:24:25] ok
[12:25:12] not sure if puppet will have to run on all fds for them to have good config?
[12:25:37] it has scheduled jobs as expected
[12:25:41] so that should be ok
[12:26:02] echo 'list media pool=production' | sudo bconsole | awk '/production00/ {print "delete volume=" $4}'
[12:26:02] delete volume=production0001
[12:26:29] jynus: no, let's do a couple manually and let the rest coalesce slowly over the next 1 hour
[12:26:31] Device File: "FileStorageArchive" (/srv/archive) is not open.
[12:26:38] Available Space=514.7 GB
[12:26:41] mmm
[12:26:48] I may rename the archive files
[12:26:57] I don't want them to be touched
[12:27:08] Device File: "FileStorageDatabases" (/srv/databases) is not open
[12:27:13] Available Space=38.32 TB
[12:27:34] any fileset that is fast to do a full?
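
The 12:26 one-liner above only prints the delete commands. A hedged sketch of how that could be taken one step further; the awk field index is the one from the original command, and since bconsole asks a per-volume confirmation for delete, the generated commands are reviewed and then run interactively rather than piped blindly.

    # generate the delete commands and save them for review
    echo 'list media pool=production' | sudo bconsole \
      | awk '/production00/ {print "delete volume=" $4}' \
      | tee /tmp/delete-volumes.txt
    # then open an interactive session and paste the reviewed commands,
    # answering the confirmation prompt for each volume
    sudo bconsole
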
[12:28:08] puppetmaster1001 is fast
[12:28:19] but don't run it yet, still deleting volumes
[12:28:31] ok
[12:28:36] please note
[12:28:38] on archive
[12:28:44] for backup1001
[12:28:55] !log delete all production00 volumes on backup1001
[12:28:55] akosiaris: Not expecting to hear !log here
[12:28:57] sigh
[12:28:58] baculasd2, those are copies of archive
[12:29:06] yeah, only the production ones
[12:29:12] not the others
[12:29:27] yeah, not touching archive
[12:29:58] and that is the weirdness that may have puppet bugs
[12:30:05] *may show weirdness
[12:30:15] with the old and the new config, we'll see
[12:30:22] (nothing seen yet)
[12:30:43] at bacula0033
[12:30:47] still some way to go
[12:31:16] meanwhile I will run puppet on icinga to get the new checks
[12:31:24] ok, do puppetmaster1001 as well
[12:31:32] it will give us some visibility
[12:33:49] on icinga, director removed from helium, enabled on b1001
[12:34:37] I am expecting the backup freshness check to give a "78 with no backups"
[12:36:28] ok, all volumes removed
[12:36:32] FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/bconsole': '/usr/bin/bconsole'
[12:36:50] heh
[12:36:54] moved to sbin
[12:36:57] Interesting: which bconsole ~> /usr/sbin/bconsole
[12:37:24] funnily, I can now run ./check_bacula.py --bconsole=/usr/sbin/bconsole
[12:37:26] ah I see, /usr/bin/bconsole -> ../sbin/bconsole
[12:37:32] on helium that is
[12:37:39] All failures: 26 (archiva1001, ...), No backups: 1 (webperf2002) jobs
[12:37:50] ^that is the output right now on backup1001
[12:39:31] https://phabricator.wikimedia.org/P9493
[12:41:42] log is no longer on /var/lib/bacula/log
[12:41:50] yeah, just noticed that too
[12:41:59] it's on /var/log now
[12:42:04] good
[12:42:47] I don't see it there, however; systemctl + syslog maybe?
[12:43:48] ah, so our config still has it in /var/lib/bacula/log
[12:43:57] it should show up there when we do our first test backup
[12:44:14] but let's take an item to change it to /var/log/bacula afterwards
[12:44:28] but it is not being created
[12:44:42] IIRC it creates it when it wants to append to it
[12:44:47] but I may be wrong though
[12:44:48] ah, ok
[12:44:57] not sure about it, we should just test and see
[12:45:29] let me check the status from bconsole
[12:45:44] puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-ssl
[12:45:48] should we try this ^ ?
[12:46:01] 2019-10-29 08:39:20: type: I, status: f, bytes: 0
[12:46:17] last try, which I think is on helium still
[12:46:31] so let's do it, as that will also confirm the buster fix
[12:46:39] do you schedule it or do I?
[12:46:41] ls -l /var/lib/bacula/log
[12:46:41] -rw-r--r-- 1 bacula bacula 3197 Oct 29 12:46 /var/lib/bacula/log
[12:46:45] just did it
[12:46:49] ok, thanks
[12:46:55] 29-Oct 12:46 backup1001.eqiad.wmnet-fd JobId 158813: Wrote label to prelabeled Volume "production0056" on File device "FileStorageProduction" (/srv/production)
[12:46:56] sigh
[12:47:02] Backup OK
[12:47:02] it started counting from 56 ...
[12:47:03] :-(
[12:47:12] 56?
[12:47:14] my inner self is crying a bit right now
[12:47:21] anyway, unimportant
[12:47:23] I don't get it
[12:47:44] it autocreates and labels the new volumes
[12:47:52] and the last one was 0055
[12:47:54] well, it makes sense
[12:47:59] it keeps something somewhere, I guess
[12:48:07] and it may even be good to avoid issues
[12:48:08] anyway, unimportant as I said
[12:48:12] 100%
[12:48:30] so, test restore?
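
For reference, the "just did it" at 12:46 above (scheduling the puppetmaster test job by hand) can be done non-interactively; a hedged sketch, with the job name taken verbatim from the conversation and level=Full/yes being standard run options:

    echo 'run job=puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-ssl level=Full yes' \
      | sudo bconsole
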
[12:48:31] for tapes it is not reused, that is expected
[12:48:50] not sure if I want to restore to puppetmaster
[12:49:00] restores are safe
[12:49:02] but I guess it is ok, if on /tmp or whatever
[12:49:04] they go under /var/tmp/
[12:49:13] as long as it doesn't run out of space
[12:49:19] the operator has the final manual task of putting them in the correct place
[12:49:20] please just check that
[12:49:48] du -sh /var/lib/puppet/ssl/
[12:49:48] 164K /var/lib/puppet/ssl/
[12:49:48] root@puppetmaster1001:~# df -h /var
[12:49:48] Filesystem Size Used Avail Use% Mounted on
[12:49:49] /dev/md0 46G 9.7G 34G 23% /
[12:49:54] I think we have 164k ;-)
[12:49:58] cool
[12:50:14] ah no, it's 1.1G, but still, good enough
[12:50:15] hey, when I suggest that, it is because I tried to restore a database on / !
[12:50:21] it's /var/lib/puppet/server/ssl
[12:50:28] and then it broke the host :-D
[12:50:41] some time ago
[12:50:44] it happens!
[12:50:56] Job queued. JobId=158814
[12:51:00] I had that too
[12:51:05] my transfer.py utility actually has a check for that
[12:51:14] SD termination status: OK
[12:51:14] Termination: Restore OK
[12:51:20] perfect
[12:51:28] double checking just to be sure
[12:51:49] I will update the default bconsole check to /sbin
[12:51:53] root@puppetmaster1001:/var/tmp/bacula-restores# find var/ |wc -l
[12:51:53] 1502
[12:51:54] to monitor the backup status
[12:51:56] nice
[12:52:08] as one will fail
[12:52:13] for one reason or another
[12:52:51] rerun the script btw, it should say that == jobs_with_fresh_backups (1) ==
[12:53:01] doing
[12:53:39] All failures: 92 (an-master1002, ...), No backups: 1 (webperf2002), Fresh: 1 jobs
[12:53:46] \o/
[12:54:00] honestly, this is going better than I anticipated
[12:54:02] ok, lemme create a script to have backup1001 connect to all clients
[12:54:09] jynus: same here, but knock on wood
[12:54:12] connect?
[12:54:30] status client=grafana1002.eqiad.wmnet-fd
[12:54:30] Connecting to Client grafana1002.eqiad.wmnet-fd at grafana1002.eqiad.wmnet:9102
[12:54:30] grafana1002.eqiad.wmnet-fd Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian buster/sid
[12:54:33] etc etc etc
[12:54:42] ah, I see
[12:54:50] to see the protocol problem is fixed
[12:54:57] it should tell us of network, incompatibility, etc. issues
[12:55:50] you can steal my python "read all config" if you need it
[12:56:07] list clients | bconsole | awk '{blah}' ;-)
[12:56:11] ok
[12:56:55] it connects fine to its own SD and backup2001, so that's good
[12:57:28] it connected to heze too ... weird
[12:57:47] helium as well... anyway, maybe ignore those for now
[12:58:04] and I don't even need a script after all ... status all \o/
[12:58:51] heh
[12:58:58] ok, some things complained
[12:59:02] 29-Oct 12:58 backup1001.eqiad.wmnet JobId 0: Error: openssl.c:68 Connect failure: ERR=error:1425F102:SSL routines:ssl_choose_client_version:unsupported protocol
[12:59:02] 29-Oct 12:58 backup1001.eqiad.wmnet JobId 0: Fatal error: TLS negotiation failed with FD at "cobalt.wikimedia.org:9102".
[12:59:05] * akosiaris looking
[12:59:10] failures?
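
A hedged sketch of the "connect to all clients" check hinted at with the 12:56 one-liner above. The awk field index depends on the exact table layout that "list clients" prints, so $3 is an assumption to adjust; as noted at 12:58, a plain "status all" in bconsole achieves much the same thing.

    # iterate over every registered client and ask the director to contact it,
    # surfacing version mismatches and TLS/network errors
    echo 'list clients' | sudo bconsole \
      | awk -F'|' '/-fd/ {gsub(/ /, "", $3); print "status client=" $3}' \
      | sudo bconsole \
      | grep -Ei 'Connecting to Client|Version:|Error|Fatal'
    # simpler alternative mentioned in the conversation:
    # echo 'status all' | sudo bconsole
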
copy the output to a paste
[12:59:17] so we can both see it
[12:59:21] doing so
[12:59:45] it might be they are transient, btw
[12:59:56] https://phabricator.wikimedia.org/P9494
[13:00:21] Notice: /Stage[main]/Gerrit::Jetty/Systemd::Service[gerrit]/Service[gerrit]/ensure: ensure changed 'stopped' to 'running'
[13:00:24] sigh, on cobalt
[13:00:32] that terrified me for a second
[13:00:42] but it happens on every run
[13:01:13] lol
[13:01:22] that may explain the session loss?
[13:01:26] but other worry
[13:01:35] *worry for another day
[13:01:42] ah, cobalt is role::spare now?
[13:01:50] ah, yes, it got migrated
[13:02:04] ah, so maybe the ones that failed are old hosts
[13:02:07] nah, it still is role(gerrit)
[13:02:18] could be, but this is a bit weird
[13:02:22] wasn't it gerrit1001?
[13:02:37] yeah, but cobalt still has the role as well
[13:02:48] and the config still gets populated from what it seems
[13:02:55] so it should work
[13:03:06] let me finish my check deploy and I will help with those
[13:03:11] ok
[13:03:23] the check will help understand the status
[13:03:34] ERR=error:1425F102:SSL routines:ssl_choose_client_version:unsupported protocol
[13:03:44] why do I have the feeling this is about the TLS version?
[13:03:47] jessie?
[13:03:51] yes :-(
[13:03:59] I fear all the failed hosts are jessie
[13:04:00] don't tell me choose jessie or buster?
[13:04:02] doublechecking
[13:04:41] yeah, all are jessie
[13:04:43] dammit
[13:04:49] ok, let's see how we bypass that
[13:05:13] to be fair, backups of buster > backups of jessie
[13:07:47] in fact, bacula people claim to support older versions of clients always
[13:07:56] yeah, it's openssl, not bacula
[13:08:06] I think I got a fix, testing
[13:08:34] ok, check works again
[13:08:46] All failures: 92 (an-master1002, ...), No backups: 1 (webperf2002), Fresh: 1 jobs
[13:09:12] [system_default_sect]
[13:09:12] MinProtocol = TLSv1.2
[13:09:20] changed it to TLSv1
[13:09:26] that's /etc/ssl/openssl.cnf
[13:09:41] we should downgrade for as long as we have jessie and revert afterwards
[13:09:46] I 'draft the patch
[13:09:49] I see
[13:09:51] I 'll
[13:09:54] so that is actually cool
[13:10:09] but let's look moritzm
[13:10:12] *loop
[13:10:17] what's cool is that we could fix it in like 5 mins ... otherwise I'd be crying
[13:10:24] yeah, I'll add him in the patch
[13:10:57] and we should add the revert to the "remove all jessies" task
[13:11:13] so we don't forget and get a downgrade attack later
[13:11:25] sure
[13:11:58] now that I think about it...
[13:11:58] if you are doing that, I will try to do a larger backup to stress the system
[13:12:02] ?
[13:12:09] we should do it for the offsite sd as well
[13:12:18] not just backup1001, but backup2001 as well
[13:12:25] indeed
[13:12:48] maybe for heze/helium if we plan to upgrade them later
[13:12:57] (too)
[13:13:14] I think you only plan to upgrade bacula
[13:13:18] not the os
[13:13:19] only bacula-sd
[13:13:22] nothing else
[13:13:23] ok
[13:14:10] let me do a larger backup, see what happens
[13:14:49] lowering MinProtocol on the new director until we get rid of jessies seems fine to me
[13:15:12] moritzm: cool. thanks for chiming in
[13:15:16] I mean, it was mostly so you are aware as an "upgrade coordinator"
[13:15:25] there's a UC role?
[13:15:33] so we can revert when we get rid of jessies
[13:15:33] is it in any way related to the IC role?
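
A hedged sketch of the MinProtocol workaround discussed at 13:09-13:14 above. The section name, directive and values are the ones quoted in the conversation; the actual change went in via a puppet patch, and it is meant to be reverted once the jessie clients are gone.

    # /etc/ssl/openssl.cnf on the buster director: allow TLSv1 so it can
    # still negotiate with the jessie bacula-fd clients
    #   [system_default_sect]
    #   MinProtocol = TLSv1      # was TLSv1.2
    sudo sed -i 's/^MinProtocol = TLSv1\.2/MinProtocol = TLSv1/' /etc/ssl/openssl.cnf
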
[13:15:37] ha ha
[13:15:51] "the person pushing the most to get rid of jessies"
[13:16:03] :-P
[13:16:36] 158816 Back Full 0 0 dbprov2001.codfw.wmnet-Monthly-1st-Thu-production-mysql-srv-backups-dumps-latest is running
[13:16:54] ^I know it is a bit stupid, as I actually plan to remove those from production
[13:17:11] hehe :-)
[13:17:12] but it is the largest dataset I know
[13:18:09] 10GB copied so far
[13:18:13] seems ok
[13:19:02] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) //TODO: Move log from /var/lib/bacula/log to /var/log/bacula/log
[13:23:21] akosiaris: as I don't see immediate fixes, and my backup will actually take some time to run, allow me to take a break for lunch + pdu maintenance work (dbs)
[13:23:37] jynus: yup, I am on it, go have lunch
[13:24:39] later I will do more tests, and update the todo list with pending tasks, of which there are a few
[13:45:48] ok, with the TLS patch the list in https://phabricator.wikimedia.org/P9494 is now down to 0 hosts
[15:07:32] 158816 Full 5,480 949.8 G OK 29-Oct-19 14:58 dbprov2001.codfw.wmnet-Monthly-1st-Thu-production-mysql-srv-backups-dumps-latest
[15:07:47] will not recover it partially
[15:08:25] and another automated one (gerrit) finished correctly
[15:09:10] I have to do some db maintenance, will go back to backups later
[15:10:51] nice!
[15:11:13] I wonder if the incrementals are based on the db
[15:11:25] will check later
[15:11:36] yes, they are. It's because of the db that it knows what it has already taken a full of
[15:12:00] I wouldn't mind starting with a full backup on the new storage
[15:12:19] I guess we could do something if we had to recover in an emergency
[15:13:35] oh, we've removed all volumes, everything is going to first have a full
[15:13:57] incrementals get autopromoted to full if no full exists
[15:14:17] yeah, but I was wondering if "exists" was based on the db or physically
[15:14:29] so you think it would not be physically?
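
A hedged sketch of watching a long-running stress-test job like 158816 (13:16 and 15:07 above) from the new director; "status director" and "list jobid=" are standard bconsole commands, and the jobid is the one from the conversation.

    # jobs currently running, as seen by the director
    echo 'status director' | sudo bconsole
    # final status and byte count of the stress-test job once it finishes
    echo 'list jobid=158816' | sudo bconsole
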
[15:14:31] well, the one has to match the other
[15:14:39] that's why I did the delete volume thing
[15:14:55] physically == /srv/production on backup1001 in this case
[15:15:06] if you mean physically on the host being backed up
[15:15:18] ok, so then the db has marked everything as failed
[15:15:25] everything deleted, I mean
[15:15:30] yup
[15:15:38] the history is still there for stats, but otherwise yes
[15:16:06] everything is going to be a full tomorrow morning
[15:16:35] I wonder if that complicates the reattachment of helium
[15:16:52] or we will just use the .bsr files to kickstart the volumes
[15:16:57] or something
[15:17:08] probably that
[15:17:23] I have lots of questions, but I hope I am not the only one :-D
[15:17:24] we can also scan the volumes and populate the db
[15:17:30] there is a tool that does that
[15:17:36] cool
[15:17:48] then things are more robust than I thought
[15:18:04] you know, I am accustomed to things that use the db, like etherpad
[15:18:09] bextract
[15:18:09] or gerrit
[15:18:15] ahahaha
[15:18:21] or mediawiki
[15:18:27] :-P
[15:18:29] no, Kern has generally created a pretty sane database schema
[15:18:43] well, the important part is the business logic
[15:18:49] and overall the database is essentially caching
[15:19:00] everything can be recreated from the volumes if required
[15:19:01] how it reacts to less-than-ideal statuses
[15:19:07] it's a pain, but it's possible
[15:19:41] and that is why I wanted to build a "database exporting system" but keep using someone else's backup system
[15:19:59] even if I normally call it "database backups"
[15:25:16] BTW, m*rk will be happy to know that we achieved a 99% backup failure rate right now, as it is one of the metrics he will report to C-levels
[15:25:33] lol
[15:25:49] even if we predict 100% from now on
[15:26:29] we went from like 94% to 75% after I started working on it, because of the buster upgrades
[15:28:28] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo)
[15:34:35] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/546972 about the logs. Not urgent though
[15:34:39] 10DBA, 10Operations, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10Ottomata) a:03jcrespo Jaime, assigning to you, feel free to undo or reassign if this is not correct.
[15:35:10] 10DBA, 10Operations, 10serviceops, 10Patch-For-Review: Backups on buster hosts fail to run - https://phabricator.wikimedia.org/T235838 (10Ottomata) a:03jcrespo Jaime, assigning to you, feel free to undo or reassign if this is not correct.
[15:36:10] 10DBA, 10Operations, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) Thanks, it is indeed correct and this just happened today (even if alex did most of the work). Not closing because it is high...
[15:46:14] 10DBA, 10Operations, 10serviceops, 10Patch-For-Review: Backups on buster hosts fail to run - https://phabricator.wikimedia.org/T235838 (10jcrespo) p:05High→03Low We believe this is fixed after T236406. Keeping it open until all hosts run at least once.
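
Regarding "we can also scan the volumes and populate the db" at 15:17 above: the tool usually used to rebuild catalog entries from existing volumes is bscan (bextract, named at 15:18, pulls files out of a volume without the catalog). A hedged sketch; the flags, config path, volume and device names are from memory and the conversation, so double-check the bscan man page before relying on it.

    # recreate catalog records (-s) and media info (-m) from one volume
    sudo bscan -v -s -m \
      -c /etc/bacula/bacula-sd.conf \
      -V production0056 \
      FileStorageProduction
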
[16:23:57] 10DBA, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green): Compose Count Queries - https://phabricator.wikimedia.org/T231598 (10eprodromou) 05Open→03Resolved a:03eprodromou This seems done
[16:28:59] 10DBA, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green): Compose New History Queries - https://phabricator.wikimedia.org/T231599 (10eprodromou) 05Open→03Resolved a:03eprodromou
[17:41:14] 10DBA, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green): Compose query for minor edit count - https://phabricator.wikimedia.org/T235572 (10WDoranWMF) @BPirkle I think on this basis it's ok for us to move forward.
[18:04:19] I have a bit of code that behaves differently on two different db hosts:
[18:04:21] https://www.irccloud.com/pastebin/5sZk486M/
[18:04:40] on one db (on production) it works right, only creating a new record if one doesn't already exist
[18:04:47] on the other db (testing) it creates a new record every single time
[18:05:04] I'm guessing that the difference is something about setting a unique key on the database…?
[18:25:55] hm… fixed I think
[19:28:37] 10DBA, 10Operations, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) For the record: * https://gerrit.wikimedia.org/r/c/operations/puppet/+/546928 (and followup https://gerrit.wikimedia.org/r/c...
[19:30:27] 10DBA, 10Operations, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) I thought this error was due to a backup attempt, pre-patch. However, after I ran it manually, it failed again: ` 29-Oct 19...
[19:36:31] 10DBA, 10Operations, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) That wasn't enough: ` 158826 Full 0 0 Error 29-Oct-19 19:35 install1002.wikimedia.org-Monthly-1st-Wed...
[19:47:27] 10DBA, 10Operations, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) I got it, it was the storage daemon that hadn't been restarted, not the clients (that is why the director could connect, but...
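
A hedged sketch of the fix described in the 19:47 comment above (the storage daemon on the new host had not been restarted after the config change, so jobs failed even though the director could reach the clients). The unit name is the one shipped by the Debian bacula-sd package; the follow-up check is just one way to confirm the director sees the sd again.

    # on backup1001: restart the storage daemon so it picks up the new config
    sudo systemctl restart bacula-sd
    # then confirm from the director that jobs can be scheduled/are running
    echo 'status director' | sudo bconsole
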