[13:40:23] should we move here as it is easier to read? [13:40:31] +1 [13:40:50] yeah [13:41:08] one thing that I don't get is if the rack's switch is booting, or if it is still down [13:41:55] is it all hosts in that rack then? [13:41:59] elukey: cmjohnson1: said was booting [13:42:03] ack super [13:42:03] <_joe_> ok, who's incident commander? [13:42:35] marostegui: "dbproxy1016 haproxy failover" [13:42:41] <_joe_> akosiaris: are you looking at mobileapps by any chance? [13:42:41] ? [13:42:42] that can be production problematic [13:42:49] jynus: what? [13:43:02] if some misc are now in read only because of network isssues [13:43:04] _joe_: yes [13:43:08] I can take incident commander [13:43:14] super thanks a lot apergos [13:43:25] <_joe_> ok, I'm trying to understand what's going on with restbase in front [13:43:32] jynus: I don't understand what you mean, sorry [13:43:50] not sure if dbproxy1016 failed over [13:43:53] can we get a confirmation of the affected racks? [13:43:57] or just cannot be reached [13:44:12] if it is active, it will affect an actual pooled service [13:44:38] marostegui: CRITICAL check_failover servers up 1 down 1 [13:44:49] jynus: there are no active masters on row d, neither active proxies [13:45:15] so it may be failing over only the replica? [13:45:18] XioNoX: are you around? [13:45:19] or it is not in use? [13:45:28] jynus: I am not following you, let's move to -databases [13:46:11] i will text arzhel [13:46:15] and then have a look myself [13:46:17] mark: I texted him via vo [13:46:24] 10s ago [13:46:24] ok [13:46:45] the memcached errors are related to tkos in eqiad, mw is probably complaining about proxies not available (not a big deal for the moment) [13:46:46] can do sms too JIC [13:47:35] ok asw-d3 is indeed missing in the virtual chassis [13:47:44] the other switches are present [13:47:51] so are we seeing impact by other hosts? [13:47:52] <_joe_> we are losing log messages now [13:47:55] sms sent too [13:48:05] mark: hosts in D1 are also down [13:48:05] D1: Initial commit - https://phabricator.wikimedia.org/D1 [13:48:25] <_joe_> we are now losing dns it seems [13:48:29] <_joe_> this is a huge problem [13:48:32] D1 is directly connected to D3 [13:48:32] arzhel is on his way [13:48:32] D3: test - ignore - https://phabricator.wikimedia.org/D3 [13:48:45] any other racks? [13:48:52] hi [13:48:55] mark: d1, d3 and d4 so far are confirmed [13:48:55] so multi-rack? [13:49:01] ok [13:49:07] D4 is also connected to D3 [13:49:07] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [13:49:11] I have 183 hosts down in icinga... what on earth [13:49:11] D2 as well, any down in there? [13:49:11] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [13:49:20] mark: checking [13:49:31] <_joe_> bblack: what can we do about DNS? [13:49:32] if needed all row D hosts: https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&rack_group_id=8 [13:49:38] sorry, 87 PEBKAC [13:49:55] _joe_: recursive you mean? it should be relatively-ok, as there's another one in row A [13:49:58] bblack: only dns1002 is in row D AFAICT [13:50:10] <_joe_> we just got a page [13:50:15] <_joe_> about auth dns [13:50:36] mark: d2 has some hosts down too [13:50:48] _joe_: yeah those alerts, may need tweaking as part of the after-action [13:50:55] apergos: got an incident doc? 
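A note on the virtual-chassis check mentioned above: the quickest way to confirm which member dropped out of the stack is from the VC master itself. A minimal sketch, assuming ssh access to the row D stack (the asw2-d-eqiad hostname is taken from later in this log; the exact ssh target and login are assumptions):

    # List member IDs, roles and state; a lost member shows up as NotPrsnt or Inactive
    ssh asw2-d-eqiad 'show virtual-chassis status'

    # Show the VC port links, to see which neighbours the missing member was cabled to
    ssh asw2-d-eqiad 'show virtual-chassis vc-port'

Both are standard Junos operational-mode commands; this is the same kind of output being discussed when asw-d3 is reported as missing from the virtual chassis.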
[13:51:08] the recursors run local authdns instances for themselves, but that's not our public authdns [13:51:12] I'm here [13:51:12] backup1001 and analytics1076 on D2 are down [13:51:17] <_joe_> the wikis are stable, current only impact is featured feeds are not working as expected [13:51:23] <_joe_> bblack: can you ack the alert then? [13:51:39] working on it [13:51:40] https://docs.google.com/document/d/1hsg3CQMXfBp7JqoPabmyUABsOzeH5P8eNw6tYiz4r2I/edit#heading=h.vg6rb6x2eccy akosiaris just the template [13:51:44] thanks [13:51:47] * volans acked [13:51:49] If you're making a comprehensive list, Horizon is broken too. Not a huge deal though. [13:51:55] thanks apergos - I will help filling it out [13:51:58] <_joe_> again, who's IC? [13:52:04] _joe_: apergos: [13:52:04] switch in D3 is up but is inactive [13:52:12] <_joe_> andrewbogott: no, I was just checking the sites are still visible [13:52:19] 'k [13:52:31] <_joe_> having lost ~ 8% of the production servers, it's the first thing to worry about [13:52:31] ok [13:52:47] yeah, I can confirm basic site functionality still up [13:52:55] let me check mobile [13:53:26] so the only user impact is featured feeds not working as expected, is that correct? [13:53:37] XioNoX: i am on the switches, if you are too, let's sync up [13:53:39] mobile seems unaffected too [13:53:52] <_joe_> apergos: and that's recovering rn [13:54:08] ok. I'll list it as impacted and when it recovers we can add that to the doc [13:54:13] mark: I'm ssh'ed on asw2-d, and about to jump on console for D3 [13:54:13] ps1-d3 and ps1-d4 are reported both down by icinga [13:54:13] D3: test - ignore - https://phabricator.wikimedia.org/D3 [13:54:14] apergos: even that is a bit far fetched. It seems ok as well, I am trying to gauge the extent of the issues there in the last 30m [13:54:20] XioNoX: ok [13:54:21] if featured is affected, mobile seems gracefully degraded (I don't notice errors) [13:54:29] i just jumped on D3 and did a request virtual-chassis reactive [13:54:31] didn't do anything [13:54:37] it's not clear why rerouting around D3 is not working [13:54:42] volans: so, the pdu work was supposed to happen on D3 and D4 [13:54:42] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [13:54:45] actionable: Create icinga hostgroup per rack row [13:54:54] XioNoX: lmk if you see anything interesting on the console? [13:54:56] I am still chasing down hosts to racks [13:55:04] bblack: I am receiving alerts from various varnishkafkas, data is not being delivered to kafka for some of them [13:55:11] akosiaris: so far I have confirmed D1, D2, D3 and D4 with hosts down [13:55:12] D1: Initial commit - https://phabricator.wikimedia.org/D1 [13:55:12] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [13:55:16] akosiaris: D8 have hosts up [13:55:16] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [13:55:22] apergos: ^ [13:55:23] mark: we need to reboot D3, innactive is probably because it booted on the backup partition with a different junos [13:55:29] XioNoX: could be, let's do it [13:55:47] noted akosiaris [13:56:06] XioNoX: it looks like it's running a different version indeed [13:56:09] that would explain it [13:56:13] will you reboot it? [13:56:16] marostegui: similar view from my side [13:56:23] akosiaris: ok, I will add that to the doc then [13:56:30] elukey: looking at e.g. 
webrequest_text topic [13:56:32] mark: yep one sec [13:56:34] some partitions still have leader 1006 [13:56:38] which is not reachable [13:56:47] apergos: where is the incident doc? [13:56:55] https://docs.google.com/document/d/1hsg3CQMXfBp7JqoPabmyUABsOzeH5P8eNw6tYiz4r2I/edit# [13:56:57] the controller broker is 1003, so i'm not totally sure why that is [13:56:59] andrewbogott: can you give me a hostname for horizon, to double check its rack? [13:57:01] elukey: cp1087, cp1088, cp1089, cp1090 are all in row D [13:57:02] ah, can you add it to the topic? [13:57:06] 1006 still says it's in the ISR for those partitions [13:57:07] ottomata: ack, I was about to roll restart the vks [13:57:08] !log asw2-d-eqiad> request system reboot member 3 [13:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] marostegui: labweb1002.wikimedia.org [13:57:21] I'm going to depool those four from their respective pools, in case pybal-vs-DR issues are keeping them afloat [13:57:23] elukey: you can try, but i think until kafka shows 1006 as not a leader, the result will be the same [13:57:30] fpc3 is rebooting [13:57:30] i dunno why 1006 would still be in the ISR... [13:57:32] bblack: I have also hosts from eqsin/ulsfo/etc.. not pushing data to jumbo, I think vk is timing out to the brokers affected as ottomata was saying [13:57:40] ok [13:57:43] andrewbogott: yep, D4 [13:57:44] thanks [13:57:48] i was about to try a leader election...but 1006 is still in the ISR [13:57:51] I am not a channel operator [13:58:14] andrewbogott: so horizon is fully down? [13:58:26] so far it doesn't seem to be rebooting [13:58:41] XioNoX: will it use the correct partition on power cycle? [13:58:44] nevermind it's rebooting [13:58:46] marostegui: in theory it's an lvs pair but losing 1002 seems to have broken things entirely. That's a topic for research after the dust clears [13:58:47] <_joe_> can we get an idea of what's happening in the datacenter cmjohnson1 / jclark-ctr ? [13:58:48] a bunch of recoveries from icinga [13:58:52] ottomata: confirmed, all KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1006.eqiad.wmnet:9093/1006: 3 request(s) timed out etc.. [13:58:54] _joe_: in what respect? [13:58:59] mark: on clean power cycle it should use the primary one [13:59:02] standing by waiting on instruction [13:59:07] cmjohnson1: thank you [13:59:07] andrewbogott: gotcha, will add it to the doc, thank you [13:59:10] <_joe_> if they have power to the servers or not [13:59:11] wikifeeds has seen some pretty steep latency increases (from ~80ms to >2s) but is now behaving again [13:59:23] andrewbogott: what does horizon being down impact as far as the users? [13:59:23] _joe_: we already got that confirmed, only switch in D3 is impacted [13:59:24] D3: test - ignore - https://phabricator.wikimedia.org/D3 [13:59:24] <_joe_> akosiaris: now, or since a few minutes? [13:59:52] _joe_: couple of mins [13:59:54] elukey: i am able to consume from partition 0 with leader 1006 now [14:00:02] apergos: it's the control plane, so it prevents users from creating/deleting/etc VMs but during any given hour that's like 0 or 1 user who notices [14:00:03] is anyone in touch with WMCS? [14:00:04] _joe_: added in the doc, not actionable for now [14:00:08] (and right now I'm the 1) [14:00:11] noted, thank you [14:00:12] <_joe_> akosiaris: so we have some hidden dependencies [14:00:13] ah andrew [14:00:25] <_joe_> akosiaris: the latency was in both DCs, correct?
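For context on the leader/ISR discussion above: with the stock Kafka command-line tools the situation can be inspected and, where appropriate, leadership can be moved back to the preferred replicas. A rough sketch, assuming the standard Kafka CLI scripts are on the path and using a placeholder ZooKeeper address (the exact connection flags depend on the Kafka version; newer tooling takes --bootstrap-server instead):

    # Show partitions whose ISR is smaller than the replica set,
    # e.g. ones still listing the unreachable broker 1006 as leader
    kafka-topics.sh --describe --topic webrequest_text \
        --under-replicated-partitions --zookeeper zk-placeholder:2181

    # Trigger a preferred-replica election; as noted in the discussion above,
    # this only changes anything once Kafka has actually dropped 1006 from the ISR
    kafka-preferred-replica-election.sh --zookeeper zk-placeholder:2181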
[14:00:31] actual running services/VMs/etc are fine as best I can tell [14:00:34] ottomata: I see some graphs recovering from vk [14:00:36] aye [14:00:50] <_joe_> we're down to 33 hosts down [14:00:54] _joe_: yes [14:00:54] andrewbogott: added an action item for you guys to investigate the horizon stuff on the incident doc [14:00:59] thx [14:01:02] XioNoX: status of switch d3? [14:01:16] "NotPrsnt" in the VCF [14:02:04] mark: yep, console, shows the prompt now [14:02:10] icinga down hosts count down to 53 [14:02:22] but now additional problems (was 33 30s ago) [14:02:34] for mobileapps it seems to be business as usual, so not adding anything to that. Nothing noticeable in the graphs [14:02:36] So I can ping d1 hosts now [14:02:37] XioNoX: looks like it's on another version still? [14:02:48] yeah 15.1R4.6.. [14:03:16] something's wrong, icinga got recovered up to 33 hosts down and now is back to 76 [14:03:31] only racks d1,3 and 4 are affected, is that right? [14:03:32] issue with console is that the prompt is trying to log on the master [14:03:35] like it did recover just temporarily [14:03:47] trying to find the command to switch to the specific member, one sec [14:03:50] volans: I am on a D1 host and it seems stable [14:03:50] D1: Initial commit - https://phabricator.wikimedia.org/D1 [14:03:52] XioNoX: request session member 3 [14:03:58] thanks :) [14:04:37] apergos: can't confirm or deny, I saved the current icinga host down list for later analysis [14:04:50] volans: some of the D1 hosts are indeed going down again, I cannot ping some of them [14:04:57] so D1 is partly up [14:05:06] i think this VCF does not work well [14:05:43] yeah it's not cabled correctly, scheduled to fix that next week... [14:05:54] is this the old cabling? [14:06:21] yep, non VC compliant [14:06:26] ok, that would explain [14:06:40] I'm looking at options on how to fix it, 2min [14:07:08] it seemed that things worked better with D3 not present at all [14:07:08] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:07:22] rather than the current "inactive" [14:07:28] Active Partition: da0s1a [14:07:29] Backup Partition: da0s2a [14:07:29] Currently booted from: backup (da0s2a) [14:08:33] stupid question: can't we restore what was running before the accidental reboot? [14:08:50] volans: i don't understand what you mean [14:09:34] maybe I'm not understanding what's wrong with the switch, it's not rebooting from its primary partition? [14:09:40] so either the partition is corrupted, and we need to re-install it [14:09:43] it's not booting from the correct partition it seems [14:09:53] or it just booted from the backup just-in case [14:09:56] while it was rebooting, more hosts seemed connected [14:10:10] so we can switch it off, and have things work except rack D3, but we risk more problems when we recover that switch [14:10:12] <_joe_> yes, all but the ones in D 3 [14:10:15] the reboot didn't fix it, so it seems like it's the former [14:10:40] yep seems to do more harm than when shutdown [14:10:40] XioNoX: can we make it not attempt to join the VCF? 
[14:10:45] like, shut down the VC ports [14:10:54] mark: yes, we can [14:10:56] so it doesn't impact the other switches until we figure this out [14:11:00] we may have to move to serial [14:11:03] but that still seems better than now [14:11:09] yeah, serial + usb for the OS [14:11:13] yes [14:11:48] so maybe we should shut the ports on the -other- switches [14:12:05] 3 (FPC 3) Inactive PE3716030336 ex4300-48t 0 Linecard Y F 1 vcp-255/1/0 [14:12:05] 2 vcp-255/1/1 [14:12:05] 4 vcp-255/1/2 [14:12:08] yep, crafting the commands [14:14:51] !log request virtual-chassis vc-port delete pic-slot 1 member 1 port 1 [14:14:51] !log request virtual-chassis vc-port delete pic-slot 0 member 2 port 50 [14:14:51] !log request virtual-chassis vc-port delete pic-slot 1 member 4 port 0 [14:15:20] done [14:15:20] wrong channel for !log [14:15:39] ok [14:15:45] so now the other racks should be unimpacted, hopefully [14:15:51] except if the VCF cabling prevents that [14:16:01] * effie is around [14:16:15] good [14:16:21] icinga back to 35 hosts down and counting [14:16:27] so D3 will stay down while XioNoX debugs the issue with the firmwre [14:16:41] yep [14:16:44] once D3 is back up with the correct firmware, we can choose to reconnect it to the other switches [14:16:45] mark: just D3 or also D4? [14:16:49] <_joe_> let's wait for icinga to settle a bit and we can assess what needs to be done if this is long-term [14:16:52] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [14:16:53] marostegui: I hope D4 works, but don't know [14:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:58] mark: I will check [14:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:02] D4 is connected to D7 [14:17:08] list of D3 hosts: https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&rack_group_id=8&rack_id=37 [14:17:12] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:17:12] mark: I can access D4 hosts [14:17:13] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [14:17:13] D7: Testing: DO not merge - https://phabricator.wikimedia.org/D7 [14:17:15] good [14:17:36] <_joe_> can you all write D with a space so that stashbot stops pestering us? :) [14:17:49] or d3 [14:18:00] <_joe_> or that yes [14:18:10] <_joe_> akosiaris: so the two restbase hosts are still down [14:18:14] <_joe_> but wikifeeds is back [14:18:27] <_joe_> I don't... know what this can be caused by [14:18:37] so d1, d2 and d4 hosts are up [14:18:52] <_joe_> 33 hosts down [14:19:11] <_joe_> effie: can you check none of the mw hosts that are down in eqiad are scap or mcrouter proxies? [14:19:27] I have a meeting, about on-call rotation no less, in 10 mins ;) [14:19:31] i was hoping to make that [14:19:39] _joe_: sure [14:19:53] the 3 VMs on ganeti1019 (that is in d3) are: logstash1031, releases1002, schema1004 [14:19:58] cmjohnson1, can you put apt1001:/srv/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz on a USB drive? [14:19:59] D42: Document NRPE checks - https://phabricator.wikimedia.org/D42 [14:20:26] andrewbogott: labweb1002's rack should be up [14:20:47] marostegui: yep, things are looking better [14:20:57] nice [14:20:59] cmjohnson1: or let me know if I should copy it somewhere easier for you, you can also download it from https://apt.wikimedia.org/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz [14:21:22] <_joe_> we're still receiving dns pages bblack [14:22:28] what is being worked on now exactly? 
trying to get the right firmware onto d3? [14:22:36] apergos: correct [14:22:41] noted, thanks [14:23:22] <_joe_> hnowlan: cassandra has lost two nodes in eqiad [14:23:37] jclark-ctr, see the above too ^ (can you put apt1001:/srv/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz on a USB drive, or let me know if I should copy it somewhere easier for you, you can also download it from https://apt.wikimedia.org/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz ) [14:23:48] i'm grabbing a usb [14:24:05] _joe_: looking [14:24:12] XioNoX, others: I will step away now to go into a meeting soon, if you need me, please text/call me [14:24:13] thanks! [14:24:21] sure, thanks for the help! [14:24:42] <_joe_> hnowlan: it's restbase1018 and 1025, let's see if we need to do anything about it now [14:25:12] is there anything *NOT* in D3 still impacted, flapping or down? [14:25:12] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:25:46] <_joe_> down, not AFAICT [14:25:52] <_joe_> impaced, quite a few stuff [14:25:53] /ignore stashbot :) [14:26:01] <_joe_> or start writing d3 :P [14:26:52] impacted as of falloff? or or connectivity issues (trying to figure out if the switch stack is stable [14:26:53] ) [14:27:29] <_joe_> akosiaris: tileratorui is down on all maps hosts in eqiad [14:28:10] _joe_: and? [14:28:41] XioNoX: I think all the hosts on d3 are down, I have tried a bunch and they are all down [14:28:47] <_joe_> is that fundamental to maps working? [14:28:51] no [14:28:52] * volans looking at mc1033, [14:28:57] releases1002 appears to be down in icinga and not in rack d 3 [14:29:01] no _joe_ it is just used for enqueuing tile rendering jobs [14:29:04] apergos: VM listed above [14:29:14] other than that icinga list looks ok [14:29:22] marostegui: yes all d3 is hard down until I fix the partitions by re-installing junos [14:29:23] <_joe_> the "ui" in the name made me dubious [14:29:26] apergos: together with logstash1031 and schema1004 [14:29:30] yeah I couldn't remember if it was ganeti or not [14:29:50] XioNoX: Sorry, I read anything IN d3 XD [14:30:03] marostegui: but wondering if anything else not in d3 [14:30:05] yeah :) [14:30:07] no pb :) [14:30:17] not according to icinga, nope [14:31:16] thanks! [14:32:02] when you start the junos re-install, lemme know so I can add it to the timeline (or add it yourself if you like) [14:32:19] <_joe_> can someone from traffic fix dns1002? [14:32:40] _joe_: we've been covering that in another channel, it's depooled from service [14:32:48] XioNoX do you need it as an ISO or just move the file to usb and plug in? [14:32:49] looks like my icinga-level ack didn't go through, stupid icinga login case [14:32:55] <_joe_> bblack: so why are we getting paged? [14:33:05] nsa-v4 page just now was a Resolve [14:33:13] cmjohnson1: just the file [14:33:23] or alert anyways, not page? 
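A side note on the Cassandra point above: whether losing restbase1018 and 1025 needs any action is normally judged from the ring state on a surviving node. A minimal sketch, assuming shell access to any healthy Cassandra host in that cluster (multi-instance setups may use per-instance wrappers rather than plain nodetool):

    # DN marks nodes seen as down; as long as enough replicas per token range stay UN,
    # reads and writes at the usual consistency levels keep succeeding
    nodetool status

    # Confirms the surviving nodes agree on cluster membership and schema
    nodetool describecluster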
[14:34:21] if there is something to check I'm free right now [14:34:48] apergos: will do, the upgrade and reboot will not do anything visible as it's virtually disconnected from everything around it, only when we will re-enable the VC-ports things will happen [14:35:05] <_joe_> we're still without logs AIUI [14:35:14] XioNoX done [14:35:17] thx [14:36:47] that's ok, it just lets us know how things progressed [14:36:54] copying the image on the switch [14:37:37] cmjohnson1: you can unplug it [14:38:11] <_joe_> godog or anyone else from observability: we still have a lot of delivery errors reported for syslog [14:38:23] <_joe_> is that just an alert that's going away soon, or still an issue? [14:38:54] _joe_: I'm looking, seems on its way to recovery [14:39:15] adding at least an AI on dealing with one centrallog down [14:39:23] Is there something I can help with/check? [14:39:32] apergos: installing the image on fpc3 [14:39:38] logstash-next seems unaffected afaict [14:39:38] there seems to be a lot of recoveries going on [14:39:44] <_joe_> jynus: any backup-related issue? [14:40:10] unavailability, but aside from possible cancellation of ongoing jobs (unlikely at this hour) I don't expect any issues [14:40:11] weird, it went wayyy too fast [14:40:24] let me check the status if I can log in [14:40:46] noted, thanks! [14:40:57] mmm [14:41:14] actually there is something going on, metadata check is failing, maybe a db issue? [14:42:05] <_joe_> great I see syslog recovering [14:42:18] marostegui: db1080 (m1) should be up, right? [14:42:31] jynus: correct, it is up [14:42:38] dbproxy1017 is in the affected hosts list, but nothing else I can see [14:42:43] mm, maybe bacula ended up in a bad state [14:42:51] currently researching [14:42:57] apergos: yep, but it is not an active proxy [14:43:04] good! :-) [14:43:31] sometimes when communication to storage gets closed, daemons are kept in a bad state [14:43:36] apergos: trying a reboot of fpc3 [14:43:45] crossing fingers :-) [14:44:21] Director authorization problem. If you are using TLS, there may have been a certificate validation error during the TLS handshake. [14:44:30] that is weird, given it is a local connection [14:46:49] yeah, some ongoing full backups got errors, no big deal [14:46:54] I will check all daemons look healthy, I needed a restart [14:47:11] current backups are running normally and completing [14:47:23] checking storage [14:48:03] media looks healthy, icinga checks should go back to green soon [14:48:18] daemon was in a bad state, probably due to network or dns [14:49:10] Installing jbundle-ex-4300-14.1X53-D42.3-jbundle-ex-4300, etc... [14:49:10] D42: Document NRPE checks - https://phabricator.wikimedia.org/D42 [14:49:15] looks good so far [14:49:48] Were actual backups impacted, jynus, or just some sort of monitoring for them? [14:50:09] stash bot: shush! [14:50:28] what are the various bundles, XioNoX? [14:51:19] well, for the few backups which started with no network/dns, they failed [14:51:34] wmcs things seem to be working as expected now. I'll follow up on the gdoc later in the day but about to vanish into a meeting now. Thanks for the quick rescue everyone! [14:51:34] backup freshness was not affected as it has a buffer [14:51:50] apergos: it's different parts of the Junos image, dunno about each specifically [14:52:00] apergos: backups can fail for a bit with no real impact to availability, that is not unusual [14:52:03] ok.
I have added a note in the top section in the google doc, if you want to elaborate/correct, please do (jynus) [14:52:12] apergos: I did already [14:52:15] ty [14:52:32] I am more worried about needing to restart the director [14:52:47] probably because it lost contact with the db that holds the metadata [14:53:21] ok, noted Xio NoX [14:53:38] the general idea is if a backup fails to run punctually, no issue, if it fails always it is an issue. This was more of the first kind [14:54:16] I may force a rerun of some if needed [14:54:41] ok! [14:55:01] monitoring also did break: https://grafana.wikimedia.org/d/413r2vbWk/bacula should be back soon [14:56:28] right [14:57:29] I will shut up but the general idea with backup errors is that, like http errors, they are only useful in context- what you want is to have backups, not matter how many times they failed before [14:57:46] :-) indeed! [14:57:58] so it re-installed the good version over the primary partition, and left the wrong version on the backup one... (and booted from the primary) [14:58:08] so now copying the primary partition to the backup one [14:58:10] lol of course [14:58:20] things are looking much better on icinga [14:58:38] as it is written, "Nobody cares about backups. What people want is RESTORES." [14:58:46] correct, rzl [14:59:04] I'm probably going to re-enable the VC ports in ~10min [14:59:05] the issue is we used to use a naive metric on grafana (successful backups) [14:59:30] which mean connectivity might flap the time everything "re-converge" [14:59:30] I want to change that to "% of time we have fresh backups available" [14:59:51] right. please holler when you're ready to go, Xio NoX :-) [15:00:06] now both partitions are running the correct one [15:00:23] jynus: as long as you have a clear definition of freshness, it makes sense to me [15:00:33] yep, there is one [15:00:49] it is the one we use for the icinga alert [15:00:54] and confirmed it's the same version as the remaining of the VC stack [15:01:06] XioNoX: so d3 hosts will come back? [15:01:28] marostegui: they should come back as soon as I issue the 3 commands to re-enable its VC ports [15:01:32] excellent [15:01:39] the icinga alert came backup, but not the prometheus scrapping, investigating [15:01:43] but might cause some flapping in the process [15:01:46] sure [15:02:03] * volans here if needed [15:02:25] I'm ready to go, and I think there are enough eyes around? [15:02:35] go go go [15:02:38] ok [15:02:53] XioNoX: go! [15:03:13] I see, too many failures, but systemd didn't complain [15:03:19] done [15:03:29] restarting prometheus-bacula-exporter.service [15:03:38] XioNoX: waiting for ping to reply on a d3 host [15:03:40] 3 (FPC 3) Prsnt [15:03:46] XioNoX: ping is back [15:03:50] ice! [15:03:51] nice! [15:03:52] I can ssh into mw1363 [15:03:59] recoveries coming up too [15:04:06] switch logs are quiet [15:04:49] icinga looks muuuuch better [15:04:50] only host down unacked are now ps1-d[3-4]-eqiad [15:04:59] icinga looking clean now, apart from the ps1 pdus [15:05:17] wow yes! (re icinga looks better) [15:06:20] do we declared the incident over? 
[15:06:25] *declare [15:06:42] apergos: can give it a few more minutes but I'd say yes so far [15:06:58] im going to start bird on dns1002 again [15:07:18] I'm happy to camp out here for a while longer [15:07:33] we have a bunch of action items that need people to sign up for them via google comments [15:07:46] prometheus for bacula is back https://grafana.wikimedia.org/d/413r2vbWk/bacula [15:07:50] it also needed a restart [15:08:56] So the PDU maintenance on D3 and D4 isn't done, right? will that be postponed? cmjohnson1 or jclark-ctr? [15:08:57] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [15:08:57] D3: test - ignore - https://phabricator.wikimedia.org/D3 [15:09:09] akosiaris: are you down for "Create icinga hostgroup per rack row"? any other action items you want to claim? [15:09:15] pdu swap is completed [15:09:22] physically [15:09:23] jclark-ctr: for d3 and d4? [15:09:28] both [15:09:29] marostegui: pdu maintenance is complete. updating netbox information now [15:09:45] thanks cmjohnson1 and jclark-ctr! [15:10:31] <_joe_> I see an alert on mc1033 that says pdu status is critical, but I'd wait a bit before worrying [15:11:25] "if it ain't incandescent..." [15:11:38] ... it's fluorescent? :-P [15:12:30] Well, everything is incandescent, if you remove the "in the visible spectrum" bit [15:12:34] _joe_: that host got rebooted so might have an issue with one psu maybe [15:12:57] <_joe_> it got rebooted? not all hosts in that rack got rebooted [15:12:58] apergos: yeah [15:13:18] ok, I shall assign to you in the doc unless you want to assign it to yourself :-D [15:13:20] akosiaris: [15:13:53] if all's still ok, going to repool the row D cp servers too [15:14:06] (done) [15:14:32] <_joe_> bblack: eqiad is still completely depooled from traffic, so it seems relatively safe [15:14:44] I don't see why not, XioNoX any reason not to? [15:15:07] all green from me [15:15:31] 👍 [15:16:12] we still have some action items left to claim, step right up, don't be shy. any takers for "what happened to wikifeeds"? [15:17:44] apergos: I have a theory, I'll add it in the doc [15:17:51] great! [15:18:18] icinga still having an embarrassing amount of unknowns [15:18:22] I guess they will recover eventually [15:19:25] lots of check systemd state, ugh [15:21:37] who will be looking into why the switch booted into a different partition? XioNoX, is that you or would someone else probably take it? [15:22:06] apergos: that's for me, opening a task right now [15:22:11] awesome [15:22:18] the why is easy: power outage [15:22:30] he he [15:22:44] the why there was a different (higher) version on the backup partition is probably: oversight [15:23:49] a lot of grunge work involved in checking all those I guess [15:23:50] so the task will be about going over all the switches and making sure it's not the case anywhere else, fix them if needed, and update the doc to not forget in the future [15:24:21] it's not that bad I think, I only looked at row D for now [15:24:50] is the anycast stuff yours also or does that go to traffic? [15:25:33] I can take it too [15:25:42] a glutton for punishment! adding you [15:26:19] thank you for being that glutton! [15:26:37] only the rsyslog item is left. who can take it? [15:27:01] (I feel like an auctioneer at a shady fly by night outfit...) [15:27:56] just look at these rsyslog deliveries. a unique item, it will look great on your mantlepiece...
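On the follow-up task above (auditing every switch for a stale or mismatched backup partition), the per-switch check and fix are roughly the two Junos commands below, shown as a sketch rather than a full procedure; <switch> is a placeholder and any per-member targeting flags on a virtual chassis are left out:

    # Compare the Junos version installed on the active and backup root slices
    ssh <switch> 'show system snapshot media internal'

    # If they differ, clone the currently running (known-good) slice onto the backup one
    ssh <switch> 'request system snapshot slice alternate'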
[15:31:13] in time-honored tradition, I have given it to the author of the task linked from the item [15:33:11] ok, I am officially out of IC commander role, have a good evening everyone (yeah I'm still around, just back to regular work) [15:41:54] apergos: SGTM (re: action item) [15:42:03] great! [18:11:20] shdubsh, herron o/ around by any chance? [18:11:33] o/ [18:11:42] hello :) [18:11:55] do you have a minute for a prometheus targets question? [18:12:16] sure :) [18:12:25] thanks :) [18:12:48] so I just merged a change for the discovery team that created a new file on the prometheus1003/4 hosts, namely [18:12:54] /srv/prometheus/ops/targets/mjolnir_kafka_msearch_daemon_instance_eqiad.yaml [18:13:08] that is expected, it contains 4 targets etc.. [18:13:18] but one old file was left there, /srv/prometheus/ops/targets/mjolnir_msearch_eqiad.yaml [18:13:36] they have one target in common, search-loader1001:9171 [18:14:09] basically mjolnir_msearch_eqiad.yaml is not in the puppet catalog anymore, and it can be removed manually.. but I guess that it will also mean a reload of prometheus? [18:16:52] prometheus watches those files, so it shouldn't need a reload. puppet should clean it up if it is no longer in puppetdb though [18:17:05] ah! [18:19:56] I am trying to rm the file + run puppet on 1003 [18:20:03] just completed, the file is not re-created [18:21:56] 👍 [18:22:01] shdubsh: all right will clean up, thanks :)
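For reference on the file_sd exchange above: Prometheus watches the target files referenced from file_sd_configs, so adding or deleting one of these files takes effect without a reload, and Puppet only recreates files still present in the catalog. A rough sketch of what such a file contains and how to confirm the running instance picked up the change; the file path and the search-loader1001:9171 target come from the discussion above, while the label set and the local port/URL prefix of the "ops" instance are assumptions:

    # file_sd target files are just lists of target groups, e.g.:
    #   - targets:
    #       - 'search-loader1001:9171'
    #     labels:
    #       cluster: mjolnir        # label set is illustrative
    cat /srv/prometheus/ops/targets/mjolnir_kafka_msearch_daemon_instance_eqiad.yaml

    # Ask the running instance which targets it is actually scraping
    # (port and path prefix here are assumptions, not the documented setup)
    curl -s http://localhost:9900/ops/api/v1/targets \
        | jq -r '.data.activeTargets[].scrapeUrl' | grep search-loader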