[06:13:48] backups from etcd failed 2 days in a row
[06:14:02] I will retry them manually again
[06:30:54] <_joe_> from what etcd, and why?
[06:37:09] etcd1001, 2 and 3
[06:37:16] the why I don't know yet
[06:37:48] manual run failed again, checking logs
[06:38:17] Could not connect to Client: etcd1003.eqiad.wmnet-fd on etcd1003.eqiad.wmnet:9102. ERR=Connection timed out
[06:38:27] Fatal error: bsockcore.c:209 Unable to connect to Client: etcd1003.eqiad.wmnet-fd on etcd1003.eqiad.wmnet:9102. ERR=Interrupted system call
[06:38:56] that looks like a network connection error, probably either firewall or file daemon not running
[06:40:00] mmm, I cannot even ssh to it, I guess it is down?
[06:41:01] jynus: they are offline
[06:41:06] don't exist anymore
[06:41:33] but still on puppet, so backup is still configured
[06:41:48] I can disable its check until they are fully decommed
[06:41:51] I doubt it
[06:42:04] I mean, I can see them on icinga
[06:42:31] the decom was not run for them?
[06:42:46] I am guessing no, otherwise they would be out of the resource for backups
[06:43:57] https://netbox.wikimedia.org/search/?q=etcd100
[06:44:02] they are offline in netbox, that means unracked
[06:44:07] ah no sorry
[06:44:08] my bad
[06:44:09] VMs
[06:44:16] weird
[06:44:23] note no replacement backup is set up on bacula FYI
[06:44:31] not sure if that is an issue
[06:44:45] not sure why they are offline
[06:44:54] the state of VM in netbox is gathered from Ganeti automatically
[06:44:57] so they are offline in ganeti
[06:45:05] e.g. no more etcd backups AFAIKS (e.g. I would expect to backup etcd1004 or etcd2XXX)
[06:46:00] so just mentioning it in case this may be an issue (I got the alert from backups failing)
[06:46:10] they are still in dns and have SRV records for k8s
[06:46:40] maybe conf2001 are the codfw equivalents?
[06:46:48] might be like 3aad96eb97 (dns repo)
[06:47:08] I think they are replaced by kubetcd100....
[06:47:14] I see
[06:47:27] but not sure why they are offline in ganeti and not removed
[06:47:43] we need to check with akosiaris or jayme I guess
[06:48:06] so 2 issues here I just want to raise: the backups are still configured, and backups may want to be set up in the new env (if needed)
[06:48:41] this is not a blocker on my side, just reporting
[06:48:45] jynus: https://phabricator.wikimedia.org/T239835#6457487
[06:49:03] cool, I will report there
[06:49:11] thanks for finding it
[06:49:19] they will be decomm'ed today apparently
[06:51:09] I will add a patch to ignore the alerts on those hosts only
[06:51:46] thank you again, volans
[06:51:54] your feedback was most useful
[06:52:08] np, sorry for the initial confusion, my sleeping eye saw offline and went for the physical host mental path
[06:58:08] jynus: replacements are kubetcd100[456]. If we don't have backups configured, let's get them configured then.
[06:58:31] jynus: volans: I did downtime and shut down etcd100[1-3] on monday to make sure we can decommission them without issues later this week
[06:58:34] I'll see what needs to be done on the puppet level, but later in the day
[06:58:36] at your own pace, don't worry (although backups == good normally :-P)
[06:58:44] 0:-)
[06:58:47] oh, hi alex :)
[06:58:47] OTRS migration is done /me doublechecking stuff
[06:58:51] nice
[06:58:52] o/
[06:58:59] well done.
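(For reference, a minimal sketch of how the 06:38 failure could be reproduced outside of a scheduled Bacula run, assuming the standard file-daemon port 9102 and that bconsole is available on the director; host and client names are taken from the error messages above, everything else is an assumption:)

    # Check whether the Bacula file-daemon port on the client is reachable at all
    nc -z -w 5 etcd1003.eqiad.wmnet 9102 && echo "fd port open" || echo "fd port unreachable"

    # Ask the director for the client status, using the same connection path Bacula itself uses
    echo "status client=etcd1003.eqiad.wmnet-fd" | bconsole

A timeout on the first command is consistent with either a firewall rule or the host simply being down, which is what the discussion below confirms.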
[06:59:05] by done I mean the ALTER tables stuff
[06:59:06] sorry didn't mean to stress people
[06:59:12] just acting on alerting
[06:59:16] there are still a couple of steps, but it's still looking good
[06:59:34] this proves that if you ignore your problems, they just go away :-DDD https://gerrit.wikimedia.org/r/c/operations/puppet/+/627647
[07:33:16] jynus: Please make sure your database accepts packages over 64 MB in size (it currently only accepts packages up to 32 MB). Please adapt the max_allowed_packet setting of your database in order to avoid errors.
[07:33:25] just fyi, I doubt we care at this point
[07:33:31] our packages are really really small anyway
[07:36:04] we can change that
[07:36:16] it was complaining on backup
[08:28:07] o/ If I had a 1TB file that I wanted to transfer out of the cluster (one of the wdqs JNL files), does anyone have any thoughts as to the best route out?
[08:30:34] <_joe_> addshore: my suggestion includes a Blu-ray disc, and Fedex
[08:30:46] <_joe_> I'm not sure that's what you were searching for though
[08:31:31] :D
[08:31:45] do we have Blu-ray burners? ;)
[08:32:24] <_joe_> we can ask dcops :P
[08:32:34] So I'm looking for either somewhere to transfer it to that is exposed as a public webserver with 1TB space so I can wget it basically, otherwise I'll have to do something else via some other means
[08:32:39] <_joe_> jokes aside, I don't have a good solution
[08:33:05] <_joe_> my suggestion would be to archive it as smaller, compressed files
[08:33:44] addshore: define "out of the cluster", one time transfer by you or accessible by multiple people?
[08:34:36] * volans assuming there are no PII and it's all public data anyway, correct me if I'm wrong
[08:37:07] <_joe_> he said wdqs JNL files, so public data
[08:46:12] One time transfer
[08:46:20] Public data
[08:46:55] I have a Google cloud bucket I could try to curl it up to, but I guess I would need to go through webproxy and that might rate limit it a fair bit?
[08:50:05] maybe we could push it to that or S3 or similar, not sure if in one piece though, split might be required anyway
[08:52:33] Would it need to go via the webproxy?
[08:53:00] I could probably also create a machine to SCP it up to
[08:53:45] addshore: for something that large i'd use rsync over scp, so you can resume the transfer if something fails
[08:53:48] in that case I'd suggest rsync via ssh so that can be resumed, but still might be much easier splitting it
[08:54:02] lol kormat
[08:54:12] 🚤
[08:54:45] Amazing, thanks for the input, and rsync over ssh avoids my webproxy issue I guess
[08:57:42] yes, and if you have a local disk big enough you can even do it locally
[08:58:38] <_joe_> please be sure to set a bwlimit :)
[08:58:39] addshore: what I would suggest though if you're not too much in a hurry is maybe to use the ratelimit option in rsync (I guess arzhel might give you a ballpark value)
[08:58:55] <_joe_> s/maybe// and s/suggest/require/
[08:58:59] <_joe_> :)
[08:59:06] don't we already do something at the bastions?
[08:59:16] I thought we did but wasn't sure
[08:59:39] I can do that! :)
[09:00:22] what's the issue?
[09:00:39] git slow to review for anyone else or just me?
[09:01:03] git as in gerrit
[09:01:38] XioNoX: give a bwlimit to use in rsync for a 1TB rsync transfer via ssh out of cluster (so via bastion)
[09:01:48] on which bastion?
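(As a rough sketch of the transfer being discussed: hostnames come from the log, while the username, the destination path and the exact limit are assumptions — rsync's --bwlimit is in units of 1024 bytes/s, so 60000 is roughly the 500 Mbps mentioned further down; ssh -J is the OpenSSH jump-host option:)

    # Resumable pull of the ~1TB JNL file from wdqs1010 via a bastion,
    # capped so the host's 1G NIC is not saturated
    rsync --archive --partial --progress --bwlimit=60000 \
        -e 'ssh -J addshore@bast1002.wikimedia.org' \
        addshore@wdqs1010.eqiad.wmnet:/srv/wdqs/wikidata.jnl ./wikidata.jnl

Because rsync keeps the partial file (--partial), the same command can simply be re-run if the connection drops partway through.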
[09:02:03] I can we can pick it :)
[09:02:07] * I guess
[09:02:19] if you have preferences
[09:02:55] mine is stuck at "2020-09-16 11:01:44.738537 Running: git push origin HEAD:refs/for/production" for puppet
[09:03:04] volans: bast1002 is a 1G host so do whatever you need on it :)
[09:03:51] XioNoX: you mean 10G?
[09:04:09] jynus: same for me, git fetch ongoing for a minute or so
[09:04:13] volans: no, 1G
[09:04:18] jynus: I get some 500 for posting to gerrit
[09:04:26] so no risk of saturating the network, only that host
[09:04:39] thanks, moritzm, jayme CC hashar
[09:04:44] XioNoX: if the rsync is done from GCE it could, I would suggest to use a rate limit anyway
[09:05:48] volans: what's the source/destination?
[09:06:31] XioNoX: source: wdqs1010 dest: addshore's pc or GCE instance
[09:08:31] would it be via bastion? can I use ssh from wdqs -> the GCE instance? (would that go / need to go via bastion still)?
[09:08:56] addshore: wdqs only have internal IPs
[09:09:04] so you need a jumphost anyway
[09:09:07] gotcha!
[09:11:21] wdqs1010 has a 1G nic, so you can set a limit depending on what it needs for normal operations
[09:11:34] eg. 700Mbps or 500
[09:14:47] mind that --bwlimit is in Bytes, not bits
[09:15:21] actually KB
[13:47:21] marostegui: c6 power restored, moving on to c7 now
[13:47:28] cmjohnson1: excellent thanks!
[14:36:06] marostegui c7 power restored...moving on to d6
[14:36:07] d8
[14:36:14] cmjohnson1: thank you!
[15:52:20] ottomata: godog: so I wanted to ask something about intake-logging.wikimedia.org which is currently being used for client JS error reporting, and soon to be used for NEL reporting -- do you think the existing use case 'cares' if it goes to a different edge datacenter than usual? it is a significant improvement for the NEL use case: https://phabricator.wikimedia.org/T261340
[16:11:12] cdanis: AIUI no, reporting to another edge datacenter shouldn't have a significant impact
[16:11:33] cool! I didn't think so either, but wanted to make sure to ask. anyone else you think is worth asking godog ?
[16:12:07] marostegui: racks d7 and d8 are completed. tomorrow is d1/d2 and c1 (fundraising)
[16:12:24] cmjohnson1: sweet! thank you - I am going to restart our services then!
[16:12:36] cdanis: I'd say heads up to "hip" / Jason from Product Infra but that's it
[16:12:59] ok great
[16:13:05] cmjohnson1: do you mind having a look at https://phabricator.wikimedia.org/T262998? I guess it is a loose cable?
[16:21:21] marostegui power is there, could be a stuck error. cleared the log to see if it reappears
[16:21:44] cmjohnson1: let me check the ilo to see if it says something else
[16:29:05] cmjohnson1: the status hasn't really changed on ipmitool status, so I believe it is still being seen as down by the host:
[16:29:14] Status | 85h | ok | 10.1 | Presence detected, Power Supply AC lost
[16:29:14] Status | 86h | ok | 10.2 | Presence detected
[16:29:15] PS Redundancy | 77h | ok | 7.1 | Redundancy Lost
[16:29:43] okay, I am not sure what to say here, dell idrac says it's okay and there are green LEDs for both power supplies
[16:30:04] interesting
[16:30:07] let's give it a few hours then
[16:30:38] I have a tendency to trust the native dell check over ipmitool
[16:32:10] cmjohnson1: Yeah I am checking the idrac CLI and on one side it says: HealthState = 25 (Critical failure) and on another it says HealthState = 5 (OK)
[16:42:00] cmjohnson1: the idrac CLI shows it as down: https://phabricator.wikimedia.org/P12612
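(For reference, the kind of commands behind the power-supply readings above — a generic sketch assuming plain ipmitool on the host and racadm on the iDRAC, not any site-specific wrapper:)

    # Read the power-supply sensors over local IPMI (source of the 'AC lost' lines above)
    ipmitool sdr type "Power Supply"

    # Dump and, if needed, clear the system event log, so a stale entry can be ruled out
    ipmitool sel list
    ipmitool sel clear

    # Cross-check against the iDRAC's own sensor view (run via the racadm CLI)
    racadm getsensorinfo

Comparing the two views is what the discussion above is doing by hand: the IPMI sensor data still reports "Power Supply AC lost" while the iDRAC health summary shows both supplies green.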