[13:40:23] should we move here as it is easier to read? [13:40:31] +1 [13:40:50] yeah [13:41:08] one thing that I don't get is if the rack's switch is booting, or if it is still down [13:41:55] is it all hosts in that rack then? [13:41:59] elukey: cmjohnson1: said was booting [13:42:03] ack super [13:42:03] <_joe_> ok, who's incident commander? [13:42:35] marostegui: "dbproxy1016 haproxy failover" [13:42:41] <_joe_> akosiaris: are you looking at mobileapps by any chance? [13:42:41] ? [13:42:42] that can be production problematic [13:42:49] jynus: what? [13:43:02] if some misc are now in read only because of network isssues [13:43:04] _joe_: yes [13:43:08] I can take incident commander [13:43:14] super thanks a lot apergos [13:43:25] <_joe_> ok, I'm trying to understand what's going on with restbase in front [13:43:32] jynus: I don't understand what you mean, sorry [13:43:50] not sure if dbproxy1016 failed over [13:43:53] can we get a confirmation of the affected racks? [13:43:57] or just cannot be reached [13:44:12] if it is active, it will affect an actual pooled service [13:44:38] marostegui: CRITICAL check_failover servers up 1 down 1 [13:44:49] jynus: there are no active masters on row d, neither active proxies [13:45:15] so it may be failing over only the replica? [13:45:18] XioNoX: are you around? [13:45:19] or it is not in use? [13:45:28] jynus: I am not following you, let's move to -databases [13:46:11] i will text arzhel [13:46:15] and then have a look myself [13:46:17] mark: I texted him via vo [13:46:24] 10s ago [13:46:24] ok [13:46:45] the memcached errors are related to tkos in eqiad, mw is probably complaining about proxies not available (not a big deal for the moment) [13:46:46] can do sms too JIC [13:47:35] ok asw-d3 is indeed missing in the virtual chassis [13:47:44] the other switches are present [13:47:51] so are we seeing impact by other hosts? [13:47:52] <_joe_> we are losing log messages now [13:47:55] sms sent too [13:48:05] mark: hosts in D1 are also down [13:48:05] D1: Initial commit - https://phabricator.wikimedia.org/D1 [13:48:25] <_joe_> we are now losing dns it seems [13:48:29] <_joe_> this is a huge problem [13:48:32] D1 is directly connected to D3 [13:48:32] arzhel is on his way [13:48:32] D3: test - ignore - https://phabricator.wikimedia.org/D3 [13:48:45] any other racks? [13:48:52] hi [13:48:55] mark: d1, d3 and d4 so far are confirmed [13:48:55] so multi-rack? [13:49:01] ok [13:49:07] D4 is also connected to D3 [13:49:07] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [13:49:11] I have 183 hosts down in icinga... what on earth [13:49:11] D2 as well, any down in there? [13:49:11] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [13:49:20] mark: checking [13:49:31] <_joe_> bblack: what can we do about DNS? [13:49:32] if needed all row D hosts: https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&rack_group_id=8 [13:49:38] sorry, 87 PEBKAC [13:49:55] _joe_: recursive you mean? it should be relatively-ok, as there's another one in row A [13:49:58] bblack: only dns1002 is in row D AFAICT [13:50:10] <_joe_> we just got a page [13:50:15] <_joe_> about auth dns [13:50:36] mark: d2 has some hosts down too [13:50:48] _joe_: yeah those alerts, may need tweaking as part of the after-action [13:50:55] apergos: got an incident doc? 
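A note on the virtual-chassis check mentioned above: the quickest way to confirm which member dropped out of the stack is from the VC master itself. A minimal sketch, assuming ssh access to the row D stack (the asw2-d-eqiad hostname is taken from later in this log; the exact ssh target and login are assumptions):

    # List member IDs, roles and state; a lost member shows up as NotPrsnt or Inactive
    ssh asw2-d-eqiad 'show virtual-chassis status'

    # Show the VC port links, to see which neighbours the missing member was cabled to
    ssh asw2-d-eqiad 'show virtual-chassis vc-port'

Both are standard Junos operational-mode commands; this is the same kind of output being discussed when asw-d3 is reported as missing from the virtual chassis.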
[13:51:08] the recursors run local authdns instances for themselves, but that's not our public authdns [13:51:12] I'm here [13:51:12] backup1001 and analytics1076 on D2 are down [13:51:17] <_joe_> the wikis are stable, current only impact is featured feeds are not working as expected [13:51:23] <_joe_> bblack: can you ack the alert then? [13:51:39] working on it [13:51:40] https://docs.google.com/document/d/1hsg3CQMXfBp7JqoPabmyUABsOzeH5P8eNw6tYiz4r2I/edit#heading=h.vg6rb6x2eccy akosiaris just the template [13:51:44] thanks [13:51:47] * volans acked [13:51:49] If you're making a comprehensive list, Horizon is broken too. Not a huge deal though. [13:51:55] thanks apergos - I will help filling it out [13:51:58] <_joe_> again, who's IC? [13:52:04] _joe_: apergos: [13:52:04] switch in D3 is up but is inactive [13:52:12] <_joe_> andrewbogott: no, I was just checking the sites are still visible [13:52:19] 'k [13:52:31] <_joe_> having lost ~ 8% of the production servers, it's the first thing to worry about [13:52:31] ok [13:52:47] yeah, I can confirm basic site functionality still up [13:52:55] let me check mobile [13:53:26] so the only user impact is featured feeds not working as expected, is that correct? [13:53:37] XioNoX: i am on the switches, if you are too, let's sync up [13:53:39] mobile seems unaffected too [13:53:52] <_joe_> apergos: and that's recovering rn [13:54:08] ok. I'll list it as impacted and when it recovers we can add that to the doc [13:54:13] mark: I'm ssh'ed on asw2-d, and about to jump on console for D3 [13:54:13] ps1-d3 and ps1-d4 are reported both down by icinga [13:54:13] D3: test - ignore - https://phabricator.wikimedia.org/D3 [13:54:14] apergos: even that is a bit far fetched. It seems ok as well, I am trying to gauge the extent of the issues there in the last 30m [13:54:20] XioNoX: ok [13:54:21] if featured is affected, mobile seems gracefully degraded (I don't notice errors) [13:54:29] i just jumped on D3 and did a request virtual-chassis reactive [13:54:31] didn't do anything [13:54:37] it's not clear why rerouting around D3 is not working [13:54:42] volans: so, the pdu work was supposed to happen on D3 and D4 [13:54:42] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [13:54:45] actionable: Create icinga hostgroup per rack row [13:54:54] XioNoX: lmk if you see anything interesting on the console? [13:54:56] I am still chasing down hosts to racks [13:55:04] bblack: I am receiving alerts from various varnishkafkas, data is not being delivered to kafka for some of them [13:55:11] akosiaris: so far I have confirmed D1, D2, D3 and D4 with hosts down [13:55:12] D1: Initial commit - https://phabricator.wikimedia.org/D1 [13:55:12] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [13:55:16] akosiaris: D8 have hosts up [13:55:16] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [13:55:22] apergos: ^ [13:55:23] mark: we need to reboot D3, innactive is probably because it booted on the backup partition with a different junos [13:55:29] XioNoX: could be, let's do it [13:55:47] noted akosiaris [13:56:06] XioNoX: it looks like it's running a different version indeed [13:56:09] that would explain it [13:56:13] will you reboot it? [13:56:16] marostegui: similar view from my side [13:56:23] akosiaris: ok, I will add that to the doc then [13:56:30] elukey: looking at e.g. 
webrequest_text topic [13:56:32] mark: yep one sec [13:56:34] some partitions still have leader 1006 [13:56:38] which is not reachable [13:56:47] apergos: where is the incident doc? [13:56:55] https://docs.google.com/document/d/1hsg3CQMXfBp7JqoPabmyUABsOzeH5P8eNw6tYiz4r2I/edit# [13:56:57] the controller broker is 1003, so i'm not totally sure why that is [13:56:59] andrewbogott: can you give me a hostname for horizon, to double check its rack? [13:57:01] elukey: cp1087, cp1088, cp1089, cp1090 are all in row D [13:57:02] ah, can you add it to the topic? [13:57:06] 1006 still says it's in the ISR for those partitions [13:57:07] ottomata: ack, I was about to roll restart the vks [13:57:08] !log asw2-d-eqiad> request system reboot member 3 [13:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] marostegui: labweb1002.wikimedia.org [13:57:21] I'm going to depool those four from their respective pools, in case pybal-vs-DR issues are keeping them afloat [13:57:23] elukey: you can try, but i think until kafka shows 1006 as not a leader, the result will be the same [13:57:30] fpc3 is rebooting [13:57:30] i dunno why 1006 would still be in the ISR... [13:57:32] bblack: I have also hosts from eqsin/ulsfo/etc.. not pushing data to jumbo, I think vk is timing out to the brokers affected as ottomata was saying [13:57:40] ok [13:57:43] andrewbogott: yep, D4 [13:57:44] thanks [13:57:48] i was about to try a leader election...but 1006 is still in the ISR [13:57:51] I am not a channel operator [13:58:14] andrewbogott: so horizon is fully down? [13:58:26] so far it doesn't seem to be rebooting [13:58:41] XioNoX: will it use the correct partition on power cycle? [13:58:44] nevermind it's rebooting [13:58:46] marostegui: in theory it's an lvs pair but losing 1002 seems to have broken things entirely. That's a topic for research after the dust clears [13:58:47] <_joe_> can we get an idea of what's happening in the datacenter cmjohnson1 / jclark-ctr ? [13:58:48] a bunch of recoveries from icinga [13:58:52] ottomata: confirmed, all KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1006.eqiad.wmnet:9093/1006: 3 request(s) timed out etc.. [13:58:54] _joe_: in what respect? [13:58:59] mark: on clean power cycle it should use the primary one [13:59:02] standing by waiting on instruction [13:59:07] cmjohnson1: thank you [13:59:07] andrewbogott: gotcha, will add it to the doc, thank you [13:59:10] <_joe_> if they have power to the servers or not [13:59:11] wikifeeds has seen some pretty steep latency increases (from ~80ms to >2s) but is now behaving again [13:59:23] andrewbogott: what does horizon being down impact as far as the users? [13:59:23] _joe_: we already got that confirmed, only switch in D3 is impacted [13:59:24] D3: test - ignore - https://phabricator.wikimedia.org/D3 [13:59:24] <_joe_> akosiaris: now, or since a few minutes? [13:59:52] _joe_: couple of mins [13:59:54] elukey: i am able to consume from partition 0 with leader 1006 now [14:00:02] apergos: it's the control plane, so it prevents users from creating/deleting/etc VMs but during any given hour that's like 0 or 1 user who notices [14:00:03] is anyone in touch with WMCS? [14:00:04] _joe_: added in the doc, not actionable for now [14:00:08] (and right now I'm the 1) [14:00:11] noted, thank you [14:00:12] <_joe_> akosiaris: so we have some hidden dependencies [14:00:13] ah andrew [14:00:25] <_joe_> akosiaris: the latency was in both DCs, correct?
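For context on the leader/ISR discussion above: with the stock Kafka command-line tools the situation can be inspected and, where appropriate, leadership can be moved back to the preferred replicas. A rough sketch, assuming the standard Kafka CLI scripts are on the path and using a placeholder ZooKeeper address (the exact connection flags depend on the Kafka version; newer tooling takes --bootstrap-server instead):

    # Show partitions whose ISR is smaller than the replica set,
    # e.g. ones still listing the unreachable broker 1006 as leader
    kafka-topics.sh --describe --topic webrequest_text \
        --under-replicated-partitions --zookeeper zk-placeholder:2181

    # Trigger a preferred-replica election; as noted in the discussion above,
    # this only changes anything once Kafka has actually dropped 1006 from the ISR
    kafka-preferred-replica-election.sh --zookeeper zk-placeholder:2181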
[14:00:31] actual running services/VMs/etc are fine as best I can tell [14:00:34] ottomata: I see some graphs recovering from vk [14:00:36] aye [14:00:50] <_joe_> we're down to 33 hosts down [14:00:54] _joe_: yes [14:00:54] andrewbogott: added an action item for you guys to investigate the horizon stuff on the incident doc [14:00:59] thx [14:01:02] XioNoX: status of switch d3? [14:01:16] "NotPrsnt" in the VCF [14:02:04] mark: yep, console, shows the prompt now [14:02:10] icinga down hosts count down to 53 [14:02:22] but now additional problems (was 33 30s ago) [14:02:34] for mobileapps it seems to be business as usual, so not adding anything to that. Nothing noticeable in the graphs [14:02:36] So I can ping d1 hosts now [14:02:37] XioNoX: looks like it's on another version still? [14:02:48] yeah 15.1R4.6.. [14:03:16] something's wrong, icinga got recovered up to 33 hosts down and now is back to 76 [14:03:31] only racks d1,3 and 4 are affected, is that right? [14:03:32] issue with console is that the prompt is trying to log on the master [14:03:35] like it did recover just temporarily [14:03:47] trying to find the command to switch to the specific member, one sec [14:03:50] volans: I am on a D1 host and it seems stable [14:03:50] D1: Initial commit - https://phabricator.wikimedia.org/D1 [14:03:52] XioNoX: request session member 3 [14:03:58] thanks :) [14:04:37] apergos: can't confirm or deny, I saved the current icinga host down list for later analysis [14:04:50] volans: some of the D1 hosts are indeed going down again, I cannot ping some of them [14:04:57] so D1 is partly up [14:05:06] i think this VCF does not work well [14:05:43] yeah it's not cabled correctly, scheduled to fix that next week... [14:05:54] is this the old cabling? [14:06:21] yep, non VC compliant [14:06:26] ok, that would explain [14:06:40] I'm looking at options on how to fix it, 2min [14:07:08] it seemed that things worked better with D3 not present at all [14:07:08] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:07:22] rather than the current "inactive" [14:07:28] Active Partition: da0s1a [14:07:29] Backup Partition: da0s2a [14:07:29] Currently booted from: backup (da0s2a) [14:08:33] stupid question: can't we restore what was running before the accidental reboot? [14:08:50] volans: i don't understand what you mean [14:09:34] maybe I'm not understanding what's wrong with the switch, it's not rebooting from its primary partition? [14:09:40] so either the partition is corrupted, and we need to re-install it [14:09:43] it's not booting from the correct partition it seems [14:09:53] or it just booted from the backup just-in case [14:09:56] while it was rebooting, more hosts seemed connected [14:10:10] so we can switch it off, and have things work except rack D3, but we risk more problems when we recover that switch [14:10:12] <_joe_> yes, all but the ones in D 3 [14:10:15] the reboot didn't fix it, so it seems like it's the former [14:10:40] yep seems to do more harm than when shutdown [14:10:40] XioNoX: can we make it not attempt to join the VCF? 
[14:10:45] like, shut down the VC ports [14:10:54] mark: yes, we can [14:10:56] so it doesn't impact the other switches until we figure this out [14:11:00] we may have to move to serial [14:11:03] but that still seems better than now [14:11:09] yeah, serial + usb for the OS [14:11:13] yes [14:11:48] so maybe we should shut the ports on the -other- switches [14:12:05] 3 (FPC 3) Inactive PE3716030336 ex4300-48t 0 Linecard Y F 1 vcp-255/1/0 [14:12:05] 2 vcp-255/1/1 [14:12:05] 4 vcp-255/1/2 [14:12:08] yep, crafting the commands [14:14:51] !log request virtual-chassis vc-port delete pic-slot 1 member 1 port 1 [14:14:51] !log request virtual-chassis vc-port delete pic-slot 0 member 2 port 50 [14:14:51] !log request virtual-chassis vc-port delete pic-slot 1 member 4 port 0 [14:15:20] done [14:15:20] wrong channel for !log [14:15:39] ok [14:15:45] so now the other racks should be unimpacted, hopefully [14:15:51] except if the VCF cabling prevents that [14:16:01] * effie is around [14:16:15] good [14:16:21] icinga back to 35 hosts down and counting [14:16:27] so D3 will stay down while XioNoX debugs the issue with the firmwre [14:16:41] yep [14:16:44] once D3 is back up with the correct firmware, we can choose to reconnect it to the other switches [14:16:45] mark: just D3 or also D4? [14:16:49] <_joe_> let's wait for icinga to settle a bit and we can assess what needs to be done if this is long-term [14:16:52] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [14:16:53] marostegui: I hope D4 works, but don't know [14:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:58] mark: I will check [14:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:02] D4 is connected to D7 [14:17:08] list of D3 hosts: https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&rack_group_id=8&rack_id=37 [14:17:12] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:17:12] mark: I can access D4 hosts [14:17:13] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [14:17:13] D7: Testing: DO not merge - https://phabricator.wikimedia.org/D7 [14:17:15] good [14:17:36] <_joe_> can you all write D with a space so that stashbot stops pestering us? :) [14:17:49] or d3 [14:18:00] <_joe_> or that yes [14:18:10] <_joe_> akosiaris: so the two restbase hosts are still down [14:18:14] <_joe_> but wikifeeds is back [14:18:27] <_joe_> I don't... know what this can be caused by [14:18:37] so d1, d2 and d4 hosts are up [14:18:52] <_joe_> 33 hosts down [14:19:11] <_joe_> effie: can you check none of the mw hosts that are down in eqiad are scap or mcrouter proxies? [14:19:27] I have a meeting, about on-call rotation no less, in 10 mins ;) [14:19:31] i was hoping to make that [14:19:39] _joe_: sure [14:19:53] the 3 VMs on ganeti1019 (that is in d3) are: logstash1031, releases1002, schema1004 [14:19:58] cmjohnson1, can you put apt1001:/srv/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz on a USB drive? [14:19:59] D42: Document NRPE checks - https://phabricator.wikimedia.org/D42 [14:20:26] andrewbogott: labweb1002's rack should be up [14:20:47] marostegui: yep, things are looking better [14:20:57] nice [14:20:59] cmjohnson1: or let me know if I should copy it somewhere easier for you, you can also download it from https://apt.wikimedia.org/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz [14:21:22] <_joe_> we're still receiving dns pages bblack [14:22:28] what is being worked on now exactly? 
trying to get the right firmware onto d3? [14:22:36] apergos: correct [14:22:41] noted, thanks [14:23:22] <_joe_> hnowlan: cassandra has lost two nodes in eqiad [14:23:37] jclark-ctr, see the above too ^ (can you put apt1001:/srv/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz on a USB drive, or let me know if I should copy it somewhere easier for you, you can also download it from https://apt.wikimedia.org/junos/jinstall-ex-4300-14.1X53-D42.3-domestic-signed.tgz ) [14:23:48] i'm grabbing a usb [14:24:05] _joe_: looking [14:24:12] XioNoX, others: I will step away now to go into a meeting soon, if you need me, please text/call me [14:24:13] thanks! [14:24:21] sure, thanks for the help! [14:24:42] <_joe_> hnowlan: it's restbase1018 and 1025, let's see if we need to do anything about it now [14:25:12] is there anything *NOT* in D3 still impacted, flapping or down? [14:25:12] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:25:46] <_joe_> down, not AFAICT [14:25:52] <_joe_> impaced, quite a few stuff [14:25:53] /ignore stashbot :) [14:26:01] <_joe_> or start writing d3 :P [14:26:52] impacted as of falloff? or or connectivity issues (trying to figure out if the switch stack is stable [14:26:53] ) [14:27:29] <_joe_> akosiaris: tileratorui is down on all maps hosts in eqiad [14:28:10] _joe_: and? [14:28:41] XioNoX: I think all the hosts on d3 are down, I have tried a bunch and they are all down [14:28:47] <_joe_> is that fundamental to maps working? [14:28:51] no [14:28:52] * volans looking at mc1033, [14:28:57] releases1002 appears to be down in icinga and not in rack d 3 [14:29:01] no _joe_ it is just used for enqueuing tile rendering jobs [14:29:04] apergos: VM listed above [14:29:14] other than that icinga list looks ok [14:29:22] marostegui: yes all d3 is hard down until I fix the partitions by re-installing junos [14:29:23] <_joe_> the "ui" in the name made me dubious [14:29:26] apergos: together with logstash1031 and schema1004 [14:29:30] yeah I couldn't remember if it was ganeti or not [14:29:50] XioNoX: Sorry, I read anything IN d3 XD [14:30:03] marostegui: but wondering if anything else not in d3 [14:30:05] yeah :) [14:30:07] no pb :) [14:30:17] not according to icinga, nope [14:31:16] thanks! [14:32:02] when you start the junos re-install, lemme know so I can add it to the timeline (or add it yourself if you like) [14:32:19] <_joe_> can someone from traffic fix dns1002? [14:32:40] _joe_: we've been covering that in another channel, it's depooled from service [14:32:48] XioNoX do you need it as an ISO or just move the file to usb and plug in? [14:32:49] looks like my icinga-level ack didn't go through, stupid icinga login case [14:32:55] <_joe_> bblack: so why are we getting paged? [14:33:05] nsa-v4 page just now was a Resolve [14:33:13] cmjohnson1: just the file [14:33:23] or alert anyways, not page? 
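A side note on the Cassandra point above: whether losing restbase1018 and 1025 needs any action is normally judged from the ring state on a surviving node. A minimal sketch, assuming shell access to any healthy Cassandra host in that cluster (multi-instance setups may use per-instance wrappers rather than plain nodetool):

    # DN marks nodes seen as down; as long as enough replicas per token range stay UN,
    # reads and writes at the usual consistency levels keep succeeding
    nodetool status

    # Confirms the surviving nodes agree on cluster membership and schema
    nodetool describecluster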
[14:34:21] if there is something to check I'm free right now [14:34:48] apergos: will do, the upgrade and reboot will not do anything visible as it's virtually disconnected from everything around it, only when we will re-enable the VC-ports things will happen [14:35:05] <_joe_> we're still without logs AIUI [14:35:14] XioNoX done [14:35:17] thx [14:36:47] that's ok, it just lets us know how things progressed [14:36:54] copying the image on the switch [14:37:37] cmjohnson1: you can unplug it [14:38:11] <_joe_> godog or anyone else from observability: we still have a lot of delivery errors reported for syslog [14:38:23] <_joe_> is that just an alert that's going away soon, or still an issue? [14:38:54] _joe_: I'm looking, seems on its way to recovery [14:39:15] adding at least an AI on dealing with one centrallog down [14:39:23] Is there something I can help with/check? [14:39:32] apergos: installing the image on fpc3 [14:39:38] logstash-next seems unaffected afaict [14:39:38] there seems to be a lot of recoveries going on [14:39:44] <_joe_> jynus: any backup-related issue? [14:40:10] unavailability, but aside from possible cancellation of ongoing jobs (unlikely at this hour) I don't expect any issues [14:40:11] weird, it went wayyy too fast [14:40:24] let me check the status if I can log in [14:40:46] noted, thanks! [14:40:57] mmm [14:41:14] actually there is something going on, metadata check is failing, maybe a db issue? [14:42:05] <_joe_> great I see syslog recovering [14:42:18] marostegui: db1080 (m1) should be up, right? [14:42:31] jynus: correct, it is up [14:42:38] dbproxy1017 is in the affected hosts list, but nothing else I can see [14:42:43] mm, maybe bacula ended up in a bad state [14:42:51] currently researching [14:42:57] apergos: yep, but it is not an active proxy [14:43:04] good! :-) [14:43:31] sometimes when communication to storage gets closed, daemons are kept in a bad state [14:43:36] apergos: trying a reboot of fpc3 [14:43:45] crossing fingers :-) [14:44:21] Director authorization problem. If you are using TLS, there may have been a certificate validation error during the TLS handshake. [14:44:30] that is weird, given it is a local connection [14:46:49] yeah, some ongoing full backups got errors, no big deal [14:46:54] I will check all daemons look healthy, I needed a restart [14:47:11] current backups are running normally and completing [14:47:23] checking storage [14:48:03] media looks healthy, icinga checks should go back to green soon [14:48:18] daemon was in a bad state, probably due to network or dns [14:49:10] Installing jbundle-ex-4300-14.1X53-D42.3-jbundle-ex-4300, etc... [14:49:10] D42: Document NRPE checks - https://phabricator.wikimedia.org/D42 [14:49:15] looks good so far [14:49:48] Were actual backups impacted, jynus, or just some sort of monitoring for them? [14:50:09] stash bot: shush! [14:50:28] what are the various bundles, XioNoX? [14:51:19] well, for the few backups which started with no network/dns, they failed [14:51:34] wmcs things seem to be working as expected now. I'll follow up on the gdoc later in the day but about to vanish into a meeting now. Thanks for the quick rescue everyone! [14:51:34] backup freshness was not affected as it has a buffer [14:51:50] apergos: it's different parts of the Junos image, dunno about each specifically [14:52:00] apergos: backups can fail for a bit with no real impact to availability, that is not unusual [14:52:03] ok.
I have added a note in the top section in the google doc, if you want to elaborate/correct, please do (jynus) [14:52:12] apergos: I did already [14:52:15] ty [14:52:32] I am more worried about needing to restart the director [14:52:47] probably because it lost contact with the db that holds the metadata [14:53:21] ok, noted Xio NoX [14:53:38] the general idea is if a backup fails to run punctually, no issue, if it fails always it is an issue. This was more of the first kind [14:54:16] I may force a rerun of some if needed [14:54:41] ok! [14:55:01] monitoring also did break: https://grafana.wikimedia.org/d/413r2vbWk/bacula should be back soon [14:56:28] right [14:57:29] I will shut up but the general idea with backup errors is that, like http errors, they are only useful in context- what you want is to have backups, not matter how many times they failed before [14:57:46] :-) indeed! [14:57:58] so it re-installed the good version over the primary partition, and left the wrong version on the backup one... (and booted from the primary) [14:58:08] so now copying the primary partition to the backup one [14:58:10] lol of course [14:58:20] things are looking much better on icinga [14:58:38] as it is written, "Nobody cares about backups. What people want is RESTORES." [14:58:46] correct, rzl [14:59:04] I'm probably going to re-enable the VC ports in ~10min [14:59:05] the issue is we used to use a naive metric on grafana (successful backups) [14:59:30] which mean connectivity might flap the time everything "re-converge" [14:59:30] I want to change that to "% of time we have fresh backups available" [14:59:51] right. please holler when you're ready to go, Xio NoX :-) [15:00:06] now both partitions are running the correct one [15:00:23] jynus: as long as you have a clear definition of freshness, it makes sense to me [15:00:33] yep, there is one [15:00:49] it is the one we use for the icinga alert [15:00:54] and confirmed it's the same version as the remaining of the VC stack [15:01:06] XioNoX: so d3 hosts will come back? [15:01:28] marostegui: they should come back as soon as I issue the 3 commands to re-enable its VC ports [15:01:32] excellent [15:01:39] the icinga alert came backup, but not the prometheus scrapping, investigating [15:01:43] but might cause some flapping in the process [15:01:46] sure [15:02:03] * volans here if needed [15:02:25] I'm ready to go, and I think there are enough eyes around? [15:02:35] go go go [15:02:38] ok [15:02:53] XioNoX: go! [15:03:13] I see, too many failures, but systemd didn't complain [15:03:19] done [15:03:29] restarting prometheus-bacula-exporter.service [15:03:38] XioNoX: waiting for ping to reply on a d3 host [15:03:40] 3 (FPC 3) Prsnt [15:03:46] XioNoX: ping is back [15:03:50] ice! [15:03:51] nice! [15:03:52] I can ssh into mw1363 [15:03:59] recoveries coming up too [15:04:06] switch logs are quiet [15:04:49] icinga looks muuuuch better [15:04:50] only host down unacked are now ps1-d[3-4]-eqiad [15:04:59] icinga looking clean now, apart from the ps1 pdus [15:05:17] wow yes! (re icinga looks better) [15:06:20] do we declared the incident over? 
[15:06:25] *declare [15:06:42] apergos: can give it a few more minutes but I'd say yes so far [15:06:58] im going to start bird on dns1002 again [15:07:18] I'm happy to camp out here for a while longer [15:07:33] we have a bunch of action items that need people to sign up for them via google comments [15:07:46] prometheus for bacula is back https://grafana.wikimedia.org/d/413r2vbWk/bacula [15:07:50] it also needed a restart [15:08:56] So the PDU maintenance on D3 and D4 isn't done, right? will that be postponed? cmjohnson1 or jclark-ctr? [15:08:57] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [15:08:57] D3: test - ignore - https://phabricator.wikimedia.org/D3 [15:09:09] akosiaris: are you down for "Create icinga hostgroup per rack row"? any other action items you want to claim? [15:09:15] pdu swap is completed [15:09:22] physically [15:09:23] jclark-ctr: for d3 and d4? [15:09:28] both [15:09:29] marostegui: pdu maintenance is complete. updating netbox information now [15:09:45] thanks cmjohnson1 and jclark-ctr! [15:10:31] <_joe_> I see an alert on mc1033 that says pdu status is critical, but I'd wait a bit before worrying [15:11:25] "if it ain't incandescent..." [15:11:38] ... it's fluorescent? :-P [15:12:30] Well, everything is incandescent, if you remove the "in the visible spectrum" bit [15:12:34] _joe_: that host got rebooted so might have an issue with one psu maybe [15:12:57] <_joe_> it got rebooted? not all hosts in that rack got rebooted [15:12:58] apergos: yeah [15:13:18] ok, I shall assign to you in the doc unless you want to assign it to yourself :-D [15:13:20] akosiaris: [15:13:53] if all's still ok, going to repool the row D cp servers too [15:14:06] (done) [15:14:32] <_joe_> bblack: eqiad is still completely depooled from traffic, so it seems relatively safe [15:14:44] I don't see why not, XioNoX any reason not to? [15:15:07] all green from me [15:15:31] 👍 [15:16:12] we still have some action items left to claim, step right up, don't be shy. any takers for "what happened to wikifeeds"? [15:17:44] apergos: I have a theory, I'll add it in the doc [15:17:51] great! [15:18:18] icinga still having an embarrassing amount of unknowns [15:18:22] I guess they will recover eventually [15:19:25] lots of check systemd state, ugh [15:21:37] who will be looking into why the switch booted into a different partition? XioNoX, is that you or would someone else probably take it? [15:22:06] apergos: that's for me, opening a task right now [15:22:11] awesome [15:22:18] the why is easy: power outage [15:22:30] he he [15:22:44] the why there was a different (higher) version on the backup partition is probably: oversight [15:23:49] a lot of grunge work involved in checking all those I guess [15:23:50] so the task will be about going over all the switches and making sure it's not the case anywhere else, fix them if needed, and update the doc to not forget in the future [15:24:21] it's not that bad I think, I only looked at row D for now [15:24:50] is the anycast stuff yours also or does that go to traffic? [15:25:33] I can take it too [15:25:42] a glutton for punishment! adding you [15:26:19] thank you for being that glutton! [15:26:37] only the rsyslog item is left. who can take it? [15:27:01] (I feel like an auctioneer at a shady fly by night outfit...) [15:27:56] just look at these rsyslog deliveries. a unique item, it will look great on your mantlepiece...
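On the follow-up task above (auditing every switch for a stale or mismatched backup partition), the per-switch check and fix are roughly the two Junos commands below, shown as a sketch rather than a full procedure; <switch> is a placeholder and any per-member targeting flags on a virtual chassis are left out:

    # Compare the Junos version installed on the active and backup root slices
    ssh <switch> 'show system snapshot media internal'

    # If they differ, clone the currently running (known-good) slice onto the backup one
    ssh <switch> 'request system snapshot slice alternate'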
[15:31:13] in time-honored tradition, I have given it to the author of the task linked from the item [15:33:11] ok, I am officially out of IC commander role, have a good evening everyone (yeah I'm still around, just back to regular work) [15:41:54] apergos: SGTM (re: action item) [15:42:03] great! [18:11:20] shdubsh, herron o/ around by any chance? [18:11:33] o/ [18:11:42] hello :) [18:11:55] do you have a minute for a prometheus targets question? [18:12:16] sure :) [18:12:25] thanks :) [18:12:48] so I just merged a change for the discovery team that created a new file on the prometheus1003/4 hosts, namely [18:12:54] /srv/prometheus/ops/targets/mjolnir_kafka_msearch_daemon_instance_eqiad.yaml [18:13:08] that is expected, it contains 4 targets etc.. [18:13:18] but one old file was left there, /srv/prometheus/ops/targets/mjolnir_msearch_eqiad.yaml [18:13:36] they have one target in common, search-loader1001:9171 [18:14:09] basically mjolnir_msearch_eqiad.yaml is not in the puppet catalog anymore, and it can be removed manually.. but I guess that it will also mean a reload of prometheus? [18:16:52] prometheus watches those files, so it shouldn't need a reload. puppet should clean it up if it is no longer in puppetdb though [18:17:05] ah! [18:19:56] I am trying to rm the file + run puppet on 1003 [18:20:03] just completed, the file is not re-created [18:21:56] 👍 [18:22:01] shdubsh: all right will clean up, thanks :)
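For reference on the file_sd exchange above: Prometheus watches the target files referenced from file_sd_configs, so adding or deleting one of these files takes effect without a reload, and Puppet only recreates files still present in the catalog. A rough sketch of what such a file contains and how to confirm the running instance picked up the change; the file path and the search-loader1001:9171 target come from the discussion above, while the label set and the local port/URL prefix of the "ops" instance are assumptions:

    # file_sd target files are just lists of target groups, e.g.:
    #   - targets:
    #       - 'search-loader1001:9171'
    #     labels:
    #       cluster: mjolnir        # label set is illustrative
    cat /srv/prometheus/ops/targets/mjolnir_kafka_msearch_daemon_instance_eqiad.yaml

    # Ask the running instance which targets it is actually scraping
    # (port and path prefix here are assumptions, not the documented setup)
    curl -s http://localhost:9900/ops/api/v1/targets \
        | jq -r '.data.activeTargets[].scrapeUrl' | grep search-loader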