[06:42:34] Krenair: AES128-SHA TLS ciphersuite deprecation [06:59:17] _joe_: so... I've been debugging the June 26th pybal/scap issue [06:59:34] <_joe_> vgutierrez: oh? do tell [06:59:52] _joe_: at some point.. something (I'm assuming scap) asked for a 100% depool via etcd [07:00:09] <_joe_> yes [07:00:21] <_joe_> that's known, see the incident report [07:00:42] yup [07:00:44] Jun 26 09:33:35 service: api_80: 0 enabled 46 disabled: 0% enabled [07:00:53] that's the timestamp for api_80 for instance [07:00:57] at least from pybal point of view [07:01:12] and what I'm seeing right now on pybal code [07:01:14] <_joe_> https://wikitech.wikimedia.org/wiki/Incident_documentation/20180626-LoadBalancers see here [07:01:29] yup, I've read that :) [07:01:39] <_joe_> the issue is that T184715 is not really resovled [07:01:40] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 [07:02:00] pybal trusts the config enough to depool servers without going through canDepool() logic [07:02:26] <_joe_> yeah, which was ok when using files [07:02:30] <_joe_> now it's not [07:02:35] <_joe_> so we wanted to change that [07:02:39] <_joe_> but we apparently failed [07:02:47] https://github.com/wikimedia/PyBal/blob/1.15-stretch/pybal/coordinator.py#L110 [07:03:08] <_joe_> I know [07:12:11] _joe_: hmmm well... T184715 refers to depooling when a server is detected as down [07:12:12] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 [07:12:23] _joe_: and that's properly done [07:14:08] <_joe_> vgutierrez: well, in the discussion in the ticket https://phabricator.wikimedia.org/T184715#3896284 [07:14:30] <_joe_> but that's hard to solve in the current code, I know [07:14:49] yey... your comment is changing the scope of the task [07:15:10] "So when we want to depool a server, either logically or because of monitoring, we should first check that we have enough pooled servers to be able to depool the current machine." :) [07:35:04] mark: I've two tests in pybal currently failing in master [07:35:13] pybal.test.test_monitors.UDPMonitoringProtocolTestCase.testInit [07:35:21] pybal.test.test_monitors.UDPMonitoringProtocolTestCase.testRun [07:36:22] exceptions.AttributeError: 'UDPMonitoringProtocol' object has no attribute 'interval' [07:36:27] exceptions.AttributeError: 'UDPMonitoringProtocol' object has no attribute 'loop' [07:40:50] forget about it... [07:40:54] rusty pyc files :/ [07:41:15] that actually makes sense [07:41:20] hey, I'm merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440157/ with puppet disabled on text hosts [07:42:40] * vgutierrez hides [07:42:45] ema: good luck <3 [07:46:04] !log install misc VCL on a text host (cp3030) T164609 [07:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:08] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [07:48:07] ema: does it mean that the misc merge to text happens today? 
(just to alert my team) [07:48:16] elukey: nope [07:48:33] ah only the multiple VCL thing [07:54:51] elukey: yes :) [07:55:15] this is about installing misc-specific VCL files on text hosts and do auto-reloads in case they change [07:56:04] VCL switching already does work and is tested: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443573/2/modules/varnish/files/tests/text/22-cache_misc-vcl-switch.vtc [07:56:12] we just don't do it yet [07:56:32] nice [07:57:49] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10Gehel) For WDQS, we should keep access to at least 2 nodes in both eqiad and codfw. I propose: wdqs1003: 10.64.0.14 wdqs1004: 10.64.0.17 wdqs2001: 10.192.32.1... [07:58:55] !log install misc VCL on all text hosts T164609 [07:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:58] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [08:10:04] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) First batch of changes: ``` delete firewall family inet filter analytics-in4 term logstash delete firewall family inet filter analytics-in4 term event... [08:29:03] ema: 4.9.107~wmf1 works fine on multatuli, uploading to apt.wikimedia.org [08:29:17] (but also needs an update of linux-meta as the ABI was bumped) [08:29:45] moritzm: great, thanks! [09:26:40] ema: meta packages now also upgraded, "apt-get -y install linux-meta will do the right thing", maybe start with a misc host, though [09:53:46] * mark is nearby esams [09:53:51] i will eat something and then head over [09:55:30] nice, I'm about to have lunch too :) [09:56:49] * mark grins [09:56:56] i just found myself ordering food in english [09:57:03] haha [09:57:46] pretty common brain bug [09:57:54] unable to write in one language and speak in other [10:02:46] so i'm actually sitting at the beach [10:03:16] https://www.google.com/maps/place/Het+Veerkwartier/@52.3877732,4.6722825,3a,75y,90t/data=!3m8!1e2!3m6!1sAF1QipPFfn4p9LB0Csl8QC2mMp0Llcrw2J22ozfiNIDd!2e10!3e12!6shttps:%2F%2Flh5.googleusercontent.com%2Fp%2FAF1QipPFfn4p9LB0Csl8QC2mMp0Llcrw2J22ozfiNIDd%3Dw114-h86-k-no!7i4032!8i3024!4m5!3m4!1s0x47c5e5853612c017:0x454ab64f9b0e37e!8m2!3d52.3875056!4d4.673433 [10:03:55] it's really the wrong weather to work in a hot data center, but oh well ;-) [10:07:43] mark: cool place! [10:07:59] yeah pretty good [10:14:37] where's the sea? [10:14:51] * volans associate beach with sea ;) [10:16:12] actuallly not far either ;) [10:17:51] so esams will definitely get flooded some day with rising sea levels hehe [10:19:00] isn't esams below see level? [10:19:05] yes [10:19:42] I was thinking the same ;) [10:19:42] well, you would get free cooling! 
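
Going back to the pybal/T184715 discussion above: a minimal, self-contained sketch of the guard being asked for, i.e. running config-driven depools (enabled=False arriving via etcd) through the same "can we afford to lose this server?" check that monitoring-driven depools already get. This is not pybal's actual coordinator code; the class and the names (Pool, depool_threshold) are invented here for illustration.

    # Illustrative sketch only -- not pybal's coordinator.py. It shows the idea
    # from T184715: a config change that depools a host should pass the same
    # threshold check as a failed health check.

    class Pool:
        def __init__(self, servers, depool_threshold=0.5):
            # servers: {hostname: pooled_bool}; depool_threshold: minimum
            # fraction of servers that must stay pooled (pybal has a similar
            # per-service setting).
            self.servers = dict(servers)
            self.depool_threshold = depool_threshold

        def can_depool(self):
            pooled = sum(self.servers.values())
            minimum = int(len(self.servers) * self.depool_threshold)
            # Depooling one more host must not take us below the minimum.
            return pooled - 1 >= minimum

        def depool(self, host, reason):
            if not self.servers.get(host, False):
                return True  # already depooled
            if not self.can_depool():
                print(f"refusing to depool {host} ({reason}): too few servers left")
                return False
            self.servers[host] = False
            print(f"depooled {host} ({reason})")
            return True


    pool = Pool({f"mw{i}.eqiad.wmnet": True for i in range(4)}, depool_threshold=0.5)
    # A config update asking to depool everything is capped at the threshold:
    for host in list(pool.servers):
        pool.depool(host, reason="etcd says enabled=False")

With a 0.5 threshold over four servers, the last two depool requests are refused; roughly the behaviour that would have capped the June 26th "0% enabled" event instead of letting it go through.
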
[10:19:57] and great interconnection [10:26:43] Microsoft already does that: https://natick.research.microsoft.com/ [10:28:41] huh, our data center, evoswitch, apparently has been acquired by iron mountain [10:28:43] i missed that [10:30:04] not quite sure what is 'mountain' about this very flat place below sea level ;p [10:31:28] s/esams/imams/ then [10:31:34] :P [10:32:02] i'm done with food, gonna pack up and head over [10:51:23] _joe_: possibly related to the traffic anomalies in codfw, look at this IP on turnilo: https://bit.ly/2tUeyah [10:56:36] ema: so I'd like to start cannibilizing cp3048 [10:56:42] we can probably get it back up again afterwards, [10:56:48] but it won't be any more reliable than it already was ;) [10:57:17] or was not, hehe [10:58:35] hm [10:58:43] actually I only need one of its SSDs I think... [10:59:59] mark: alright, the host is depooled and the most recent hardware task is T190607 [10:59:59] T190607: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 [11:00:24] for the record :) [11:00:38] yeah, i'm working off the ops-esams workboard, "next visit" column [11:01:01] so cp3043 has a disk failure, I will steal an SSD from cp3048 [11:02:01] sdb... [11:02:21] mark: sounds good, I'll depool cp3043 whenever you're ready to begin working on it [11:02:32] go ahead [11:03:01] I see cp3048 has varnish running too [11:03:16] * mark stops it [11:03:40] !log depool cp3043 (cache_upload) for hardware maintenance T179953 [11:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:44] T179953: cp3043 disk failure - https://phabricator.wikimedia.org/T179953 [11:03:52] can I downtime it on icinga? [11:04:07] jynus: sure, thanks :) [11:04:16] 43 or 48? [11:04:40] I guess 43? [11:04:45] both [11:04:52] ok [11:04:54] doing [11:05:50] you can see which hosts are pooled here https://config-master.wikimedia.org/pybal/esams/upload [11:06:19] Iv'e down them for 24 hours [11:07:06] ok [11:07:12] i'm going to pull sdb from cp3048 now [11:08:01] ok [11:09:35] sorry if it still complained on irc, some errors were already soft when I downtimed them [11:11:24] ema: could you shutdown processes on cp3043? [11:11:29] some stuff is still accessing its cp3043 [11:11:32] mark: sure [11:11:33] probably varnish, because it's out of the raid [11:11:49] its sdb I mean [11:13:05] lmk when done [11:13:41] mark: varnishes stopped, we can also shutdown the host altogether if you prefer [11:13:50] well [11:13:53] i would prefer to rebuild the raid first [11:14:02] not entirely sure if it wouldn't pick cp3048's otherwise ;) [11:14:05] since that is the new drive [11:14:10] k [11:14:14] but after that we should reboot yes [11:14:23] i will pull cp3043 sdb now [11:14:28] and put the new drive in [11:14:38] ok [11:16:28] it detected the new one as sdc but I think that will correct after reboot [11:17:48] ok, raid1 rebuilt, I will now reboot? [11:17:55] go ahead [11:18:07] damn yuviguard [11:19:06] haha is that a new molly-guard? [11:19:26] I'll ACK the IPSec alerts meanwhile [11:19:51] ;) [11:21:28] hey that was a fast reboot [11:21:37] ok [11:21:48] yeah [11:21:55] everything looks good? 
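
The "everything looks good?" check after a rebuild like the one above usually comes down to reading /proc/mdstat. A small sketch of that, relying only on the standard md status format (nothing here is cp3043-specific):

    # Parse /proc/mdstat and report, per md array, its members, whether all
    # mirrors are in sync ([UU]) and any resync/recovery progress.
    import re

    def mdstat_summary(path="/proc/mdstat"):
        with open(path) as f:
            text = f.read()
        # Each array stanza starts with "mdX : active ..." and is followed by a
        # status line like "... [2/2] [UU]" and, while rebuilding, a
        # "[>....] recovery = 1.0% ... finish=292.5min" line.
        for block in re.split(r"\n(?=md\d+ :)", text):
            m = re.match(r"(md\d+) : (\w+) (\S+) (.*)", block)
            if not m:
                continue
            name, state, level, members = m.groups()
            degraded = re.search(r"\[\d+/\d+\] \[[U_]*_[U_]*\]", block) is not None
            recovery = re.search(r"(recovery|resync)\s*=\s*(\S+)", block)
            print(f"{name}: {level} {state}, members: {members.strip()}")
            if recovery:
                print(f"  {recovery.group(1)} in progress: {recovery.group(2)}")
            elif degraded:
                print(f"  WARNING: {name} is degraded")
            else:
                print("  all members in sync")

    if __name__ == "__main__":
        mdstat_summary()

On cp3043 this should now list both sda1 and sdb1 as in-sync members of the raid1, which is exactly what gets confirmed next.
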
[11:22:04] looks good I see sda1 and sdb1 in the raid [11:22:09] yes [11:22:16] if varnish backend works then I suppose we're done [11:22:38] varnish-be is up [11:22:54] cool, then I will resolve the ticket [11:23:05] \o/ [11:23:48] nice [11:27:24] ok next [11:27:25] cp3034 [11:27:32] memory error [11:27:40] so yeah, I will probably have to steal cp3048's memory then [11:27:44] just swap it altogether [11:28:04] or I could just do the one dimm, says it lists it as B3... [11:28:25] ema: could you depool/shutdown cp3034 entirely so I can open it up? [11:28:49] mark: 1 sec [11:28:59] and the same for cp3048 I guess ;) [11:29:32] sure [11:30:17] !log shutdown cp3048 and cp3034 (both already depooled) for hardware maintenance T190607 T189305 [11:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:22] T190607: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 [11:30:22] T189305: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305 [11:33:01] mark: done, cp3034 downtimed for 2h [11:33:07] ok [11:36:55] cp3034 DIMM b3 removed... [11:37:00] will now pull the same one from cp3048 [11:37:22] ok [11:44:01] swapped, powering on cp3034 [11:45:31] i don't see anything on the console yet... [11:45:40] mmh [11:47:06] i'll try a power cycle [11:48:30] ok, I've canceled downtime for cp3043 which looks good [11:49:00] Message PR1: Replaced part detected for device: DDR4 DIMM(Socket B3). [11:49:05] smart servers [11:49:22] linux booting [11:49:26] yay [11:49:58] up, looking good [11:50:09] i will put the bad dimm in cp3048 [11:50:14] k [11:50:20] it may just work without issue after a reseat [11:51:43] there are no other servers I need cp3048 parts for, right? [11:51:46] cp3009 is older hardware... [11:52:40] possibly bast3002? [11:52:44] much older [11:52:47] ok [11:52:50] different drives also [11:52:54] i will look at what I can do for it in a bit [11:53:05] ok then I will connect cp3048 back up and we'll see how it does, but I wouldn't pool it ;p [11:53:11] it's lacking a drive also hehe [11:53:52] poor fella [11:54:11] ok repooling cp3043 [11:54:44] !log repool cp3043 after hardware maintenance T179953 [11:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:47] T179953: cp3043 disk failure - https://phabricator.wikimedia.org/T179953 [11:57:55] UEFI0081: Memory configuration has changed from the last time the system was [11:57:55] started. [11:57:55] If the change is expected, no action is necessary. Otherwise, check the DIMM [11:57:55] population inside the system and memory settings in System Setup. [11:57:56] UEFI0081: Memory configuration has changed from the last time the system was [11:57:56] started. [11:57:58] If the change is expected, no action is necessary. Otherwise, check the DIMM [11:58:00] population inside the system and memory settings in System Setup. 
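
On the downtime side of this maintenance (the 24h downtime set earlier for cp3043/cp3048, the 2h one for cp3034, and the cancellations above): with Icinga 1, scheduling host downtime boils down to writing SCHEDULE_HOST_DOWNTIME and SCHEDULE_HOST_SVC_DOWNTIME lines into the external command FIFO on the icinga server. A sketch; the FIFO path below is an assumption (the real one is whatever command_file in icinga.cfg points at), and in practice a wrapper script on the icinga host does this.

    import time

    CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed path, see command_file

    def downtime_host(host, hours, author, comment):
        now = int(time.time())
        start, end = now, now + hours * 3600
        # fixed=1 -> downtime runs exactly from start to end; trigger_id=0;
        # duration only matters for flexible downtimes.
        lines = [
            f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};1;0;0;{author};{comment}",
            f"[{now}] SCHEDULE_HOST_SVC_DOWNTIME;{host};{start};{end};1;0;0;{author};{comment}",
        ]
        with open(CMD_FILE, "w") as f:
            for line in lines:
                f.write(line + "\n")

    if __name__ == "__main__":
        for host in ("cp3043", "cp3048"):
            downtime_host(host, hours=24, author="ops",
                          comment="hardware maintenance T179953 / T190607")

Cancelling early, as done for cp3043 above, is the matching DEL_HOST_DOWNTIME / DEL_SVC_DOWNTIME external commands against the downtime IDs.
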
[11:58:05] that must have been why cp3034 didn't boot the first time, too [11:59:23] cp3048 is booting but is missing its sdb [11:59:29] it will continue after 90s [12:01:41] went into emergency shell [12:02:45] I wouldn't spend too much time on it, it's not that we're gonna pool it anyways in its sorry state [12:03:07] exactly [12:03:12] so I'll leave it in that state [12:03:15] +1 [12:03:26] * mark checks the SEL [12:03:43] ------------------------------------------------------------------------------- [12:03:43] Record: 108 [12:03:43] Date/Time: 07/04/2018 11:55:40 [12:03:43] Source: system [12:03:43] Severity: Critical [12:03:43] Description: The chassis is open while the power is off. [12:03:46] sneaky buggers [12:04:18] heh so [12:04:23] it actually reports all sensors as good [12:04:25] atm [12:04:32] that may or may not change again [12:04:50] so if we can get it a different SSD it /may/ be worth trying again, possibly with a cpu reseat/swap [12:04:51] but maybe not now [12:04:58] we can check on my next visit [12:05:06] i left a comment on the task with what i've done to it [12:05:27] next up are bast3002 and cp3009 [12:05:31] but I need a drink [12:05:36] it's super hot and very dry here [12:05:39] bbi 10 [12:05:47] enjoy [12:14:48] back [12:15:43] ok so bast3002 [12:15:50] it doesn't have hotswap drive, so i will need to shut it down [12:15:55] and find a donating server as well [12:17:52] ...and first I need to figure out which server is bast3002 as it has not been physically relabeled here ;) [12:18:34] hooft... ok [12:18:34] ha [12:18:41] hooft was the original bastion [12:19:01] * mark makes a ticket to relabel it [12:21:39] we've come full circle! [12:21:58] /o\ [12:22:03] \o/ [12:22:27] yeah [12:22:27] hahaha lovely, thanks vgutierrez [12:22:29] so there's a bunch of similar boxes, unused [12:22:34] I will steal a drive from amslvs3 [12:22:43] which is powered off already [12:22:57] time to get new hardware in soon [12:23:09] bast3002 downtime will mean prometheus downtime I think? [12:23:22] correct yeah [12:24:17] and a bunch of sre with higher latency ssh towards the cluster [12:25:06] actually every SRE that's working today [12:25:18] * volans not ;0 [12:25:19] ;) [12:25:40] hahaha.. iron for you [12:25:40] some of us are using iron [12:25:52] so higher latency by default :-P [12:26:35] yes [12:26:37] so i'll try to be quick [12:26:43] if someone can shut down bast3002, then I can swap its sdb [12:26:52] btw it's also athe install server, but I guess no reimages going on in esams today [12:27:03] even if there were they would be out of luck :P [12:27:23] i need to make my very scarce time here count ;) [12:27:40] * volans downtiming bast3002 on icinga [12:27:42] is anyone shutting bast3002 down? [12:27:45] or I can [12:28:15] godog: anything special to do? [12:28:18] for prometheus [12:28:40] downtimed already [12:31:29] just shutting prometheus is fine yeah [12:31:41] volans: ^ [12:31:55] godog: ack, mark I'm stopping prom. and shutting it down [12:32:01] cool [12:33:56] mark: going down... [12:34:00] ok [12:34:02] swapping drive... 
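
For the prometheus instance riding on bast3002: once the host is back, the standard Prometheus HTTP API gives a quick way to confirm scraping has resumed — any target still matching "up == 0" hasn't recovered yet. A sketch; the base URL is a placeholder, point it at wherever the esams instance actually listens (host, port and any URL prefix).

    import json
    import urllib.parse
    import urllib.request

    PROMETHEUS = "http://localhost:9090"  # placeholder: adjust host/port/prefix

    def still_down(base_url=PROMETHEUS):
        query = urllib.parse.urlencode({"query": "up == 0"})
        with urllib.request.urlopen(f"{base_url}/api/v1/query?{query}") as resp:
            data = json.load(resp)
        for sample in data["data"]["result"]:
            labels = sample["metric"]
            yield labels.get("job", "?"), labels.get("instance", "?")

    if __name__ == "__main__":
        down = list(still_down())
        if not down:
            print("all targets are being scraped again")
        for job, instance in down:
            print(f"still down: {job} {instance}")

This is the programmatic version of the "are the graphs filling in again?" check done a little later once bast3002 is back up.
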
[12:34:11] check the leds [12:35:42] I've canceled downtime for cp3034 too, pooling it back shortly [12:39:03] bast3002 powered back up [12:39:08] !log cp3034 repooled after hw maintenance T189305 [12:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:11] T189305: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305 [12:39:29] i'm blind without console access now [12:39:37] hopefully it just boots back up ;) [12:39:43] or I will need to go find a console cart [12:40:15] mark: why no mgmt? I'm in [12:40:32] is booting [12:40:34] because to access the management network... we need to go via the bastion [12:40:39] maybe it works via iron these days actually [12:40:45] at login [12:40:46] i'm still thinking of the old situation without full global connectivity heh [12:40:56] and I can ssh into the host [12:41:04] yeah nice [12:41:14] so the drive sdb will need repartitioning and raid rebuild [12:41:19] is one of you able to do that so I can focus on dc stuff? [12:41:34] the drive is from amslvs3 which will have had very different partitions [12:41:48] any of you do you have at hand the commands? I'm sure you've already done that on the other hosts [12:42:08] i've done the raid rebuild today, not the repartitioning [12:42:13] but it basically means playing around with parted [12:42:19] and copying the partition table from sda to sdb [12:42:31] after that, mdadm /dev/md0 --add /dev/sdb1 etc [12:42:49] sure I can take that, I usually use sfdisk -d | sfdisk [12:42:58] here's our swift man [12:43:02] not afraid of a few partitions [12:43:03] ;) [12:43:08] lol, indeed [12:43:08] thanks godog! [12:43:10] hahahah [12:43:22] godog eats drive labels for breakfast [12:43:40] ty! [12:43:42] so next up, cp3009 [12:43:48] gotta get those vitamins going in the morning [12:43:50] which has been broken for nearly 2 years, heh [12:43:53] bad memory [12:44:05] we have a lot of decom'ed servers from the same hw batch, except most of them will have less memory [12:44:10] so i'm not sure we actually have a parts donor there [12:44:16] and apparently it was 1 out of 4 misc varnish servers [12:44:26] ema: do you feel it's worth repairing cp3009, considering that? [12:44:30] or is 3 servers enough? ;) [12:45:26] mark: 3 is enough I think [12:45:56] ok [12:46:31] especially considering the ongoing misc<->text merging work [12:46:34] ema: do you want to decom it then? [12:46:35] yeah [12:46:48] not sure what exactly needs doing for that [12:47:32] godog: anything to do to restart prometheus? the process seems up but I'm still not seeing data from other esams hosts in grafana [12:47:43] mark: so, the host is not in etcd [12:48:46] mmmh some graphs have data, mybe I just need to wait few more minutes [12:48:49] volans: should be unattended, perhaps in a minute or two [12:48:50] yeah [12:49:05] yeah seems so, I'll keep an eye, thanks [12:49:12] I see the unknowns in icinga recovering too [12:49:29] [>....................] recovery = 1.0% (4485312/438449152) finish=292.5min speed=24722K/sec [12:50:18] will take a while :D [12:51:19] heheh indeed, so happy we're defaulting to ssds nowadays [12:51:37] grafana seems ok, all graphs that I've checked have new datapoints [12:51:50] ema: i don't really know what that means, may or may not have been fully decom'ed? [12:51:55] ntp will take a while too but usually recovers by itself [12:53:36] mark: partially decom'ed, I'll search and removing the remaining things [12:53:48] *remove [12:54:00] ty! 
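
The repartition-and-rebuild job handed over above ("sfdisk -d | sfdisk", then mdadm --add) in script form, as a sketch. The device names and the array/partition pairs are assumptions for illustration — check /proc/mdstat and the existing layout on sda before running anything like this.

    import subprocess

    SOURCE, TARGET = "/dev/sda", "/dev/sdb"
    # Assumed layout: md0 <- sd[ab]1, md1 <- sd[ab]2. Verify against /proc/mdstat.
    ARRAYS = [("/dev/md0", "/dev/sdb1"), ("/dev/md1", "/dev/sdb2")]

    def run(cmd, **kw):
        print("+", " ".join(cmd))
        return subprocess.run(cmd, check=True, **kw)

    # Dump the partition table of the healthy disk...
    dump = run(["sfdisk", "--dump", SOURCE], capture_output=True, text=True).stdout
    # ...and replay it onto the replacement (the `sfdisk -d | sfdisk` step).
    run(["sfdisk", TARGET], input=dump, text=True)

    # Re-add each partition to its array; md starts resyncing immediately.
    for array, partition in ARRAYS:
        run(["mdadm", array, "--add", partition])

    # Progress shows up in /proc/mdstat as the "recovery = ..." line quoted below.
    print(open("/proc/mdstat").read())
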
[12:54:02] ok [12:54:14] volans: yup... sometimes needs an extra help with a manual restart fo the service though [12:54:15] i am now going to fetch packages from evoswitch, they were a bit unhappy with me not coming by to pick stuff up ;p [12:54:17] s/fo/of [15:20:48] ema: could you clarify the status of some esams cp systems? [15:20:54] i have a whole bunch listed for removal soon [15:21:00] https://phabricator.wikimedia.org/T184063 [15:21:07] but looking at puppet site.pp, some are still in there [15:21:17] node /^cp300[3-6]\.esams\.wmnet$/ { [15:21:17] # ex-cache_maps, to be decommed [15:21:17] role(spare::system) [15:21:17] } [15:21:26] so not in use, but not really decom'ed either [15:21:30] can I disconnect them? [15:21:45] and why is this one special: [15:21:46] node 'cp3022.esams.wmnet' { [15:21:46] include ::standard [15:21:46] } [15:23:48] so for the ex-cache_maps systems (cp300[3-6]) I see there's a decom task -> T167376 [15:23:49] T167376: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376 [15:24:45] would you be able to do all the things one can do online for those so I can focus on physical stuff now? [15:25:16] yeah [15:25:33] awesome [15:25:38] as for why is cp3022 not role spare I'm not sure, but it's part of T130883 [15:25:38] T130883: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883 [15:25:58] ok that task is further ahead [15:26:06] so I will proceed with removing cabling for cp3011-3022 [15:26:11] and will check back in with you after that [15:27:23] ok! [15:35:11] btw, what do you think is the timeline for folding cache misc into text? [15:35:23] wondering if I should expect those remaining 3 misc servers here to also get recycled [15:35:29] (no problem if not though, they can go in the next round :) [15:35:52] surely no merge before the next round :) [15:36:03] ok [15:36:25] I -think- cp3003-3006 are identical hardware to those (including cp3009) btw, if we want more spares ;) [15:39:43] so, cp3003-3006: [15:39:58] 1) icinga checks disabled [15:40:11] 2) puppet disabled [15:40:29] 3) puppet patch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443845/ [15:40:36] 4) dns patch https://gerrit.wikimedia.org/r/#/c/operations/dns/+/443846/ [15:41:25] I can now proceed with merging the puppet change and powering down the hosts [15:43:24] cool [15:43:29] then i will remove their cabling [15:43:34] and their drives will get destroyed like all others [15:46:07] mark: alright, powering them off [15:47:53] ty! [15:48:24] mark: done! Do you know offhand how to disable the switch ports? [15:51:41] yeah I can do that [15:51:57] ty [15:52:10] !log cp300[3-6]: puppet node clean/deactivate T167376 [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:13] T167376: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376 [15:58:21] mark@csw2-esams# set interfaces interface-range disabled-ports member-range xe-5/0/0 to xe-5/0/15 [16:02:10] prod dns entries removed [16:02:47] cables removed [16:03:20] I've prepared https://gerrit.wikimedia.org/r/#/c/operations/dns/+/443851/ to remove the mgmt entries too once we're done with everything else [16:05:38] mark: anything else I can help with? 
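
The "puppet node clean/deactivate" half of the cp3003-3006 decom above, sketched out. This assumes it runs as root on the puppetmaster and covers only the CA/PuppetDB side; the site.pp and DNS removals still go through the gerrit changes linked above.

    import subprocess

    HOSTS = [f"cp30{n:02d}.esams.wmnet" for n in range(3, 7)]  # cp3003-cp3006

    for host in HOSTS:
        for action in ("clean", "deactivate"):
            # `puppet node clean` revokes the cert and removes cached data;
            # `puppet node deactivate` marks the node as deactivated in PuppetDB.
            cmd = ["puppet", "node", action, host]
            print("+", " ".join(cmd))
            subprocess.run(cmd, check=True)

Deactivating in PuppetDB is what lets the hosts drop out of exported resources, and hence out of the generated monitoring config, on subsequent puppet runs.
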
[16:06:20] I don't think so :) [16:06:33] mark: cp3008 seems to be down [16:06:38] checking [16:07:03] fixed [16:07:07] sorry about that :( [16:08:30] mark: no big deal, luckily it's a misc host :) [16:09:00] pybal would have caught it quickly I hope [16:10:30] mark: it did, see https://grafana.wikimedia.org/dashboard/db/pybal?orgId=1&var-datasource=esams%20prometheus%2Fops&var-server=lvs3004&var-service=misc_weblb_443&from=1530718730844&to=1530720594489 [16:12:00] mark: I'm now going afk, but I'm at home. don't hesitate to call if you need anything [16:14:57] thanks! [16:15:06] i'll keep cleaning up a bit more here and then head home too [16:15:11] getting hungry [16:15:20] and before anything else, a large pile of hardware needs to go to the recycler...