[00:49:15] ^ those reinstalled hosts didnt come back from reimaging
[00:49:31] they were suddenly down in Icinga and on console i saw they were sitting at initramfs
[00:49:45] with that "disk-by-uid.." doesn't exist- error
[00:50:02] and then i powercycled them and they came back normal.. i saw they are all role::spare now
[00:50:38] some others were up but they salt-master didn't accept the keys, (reinstall, makes sense), so i deleted some keys and accepted new ones
[00:55:20] lvs1007 current at installer dialog what wants an IP address to be configured
[00:55:42] s/what/that/g (not touching it)
[00:56:36] well, i hit enter when connecting but it goes back to the configure network step
[01:06:24] yeah lvs1007+ are still a mystery
[01:06:41] the cp's moving to role::spare, over half didn't work the first go-round, then I did a second one on the obvious failures
[01:06:53] I'm just going to have to audit through the list and see which still aren't in a clean state at this point I think
[01:07:39] some them simply never rebooted for reinstall (I guess IPMI fail)
[01:17:00] i cleaned up a couple, see recoveries in -operations and SAL. the remaining alerts about "systemd state" are not salt-minion anymore but puppet
[01:17:22] but f.e. cp1060, can't get to it right now with either install-console or without
[01:17:41] (to start initial run and then sign new cert)
[01:18:00] and now i logged in, just had to say it
[01:22:18] they show "systemctl --failed" that puppet.service was failed, hence degraded, but simply starting puppet or waiting and it recovers too
[02:55:06] Traffic, Operations, Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3331087 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3006.esams.wmnet', 'cp1046.eqiad.wmnet', 'cp1047.e...
[03:47:15] what a clusterfuck reimagine 15 systems to the spare role turned out to be :P
[03:47:23] s/reimagine/reimaging/
[03:49:41] borked ipmi->pxe on some, some randomly failed reimage the first time but worked the second (or third), the whole non-idempotency of auto-reimage around salt-key/puppet-cert removal vs --new (esp after having part of a batch fail in some way), tons of pointless spam icinga alerts, the "systemd state" alert (which is fixed by "service puppet start; service puppet stop" :P), salt-keys only coming t
[03:49:47] ogether right about half the time, etc, etc.
[03:50:26] anyways, they're all clean now
[03:51:46] Traffic, Operations, Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3331125 (BBlack) Open>Resolved a:BBlack
[03:58:27] Traffic, Operations, ops-esams: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3331150 (BBlack) Open>declined cp3003 is decomming for good in T167376
[09:27:10] bblack: the systemd one can be fixed just by 'systemctl reset-failed puppet' if is the one I think it is ;)
[10:20:28] bblack, ema: FYI, I'm upgrading our openssl 1.1 package to 1.1.0f, no security fixes, but lots of general bugfixes
[10:20:37] the changes in 1.0.2 are less interesting, I'll update that one in git, but probably skip the cluster rollout until the next security release
[12:05:08] bblack: volans is going to work on auto-reimage (port it to cumin) soon if he hasn't already
[12:05:18] bblack: so do tell him of your woes :)
[12:22:53] hey guys
[12:30:51] FarmerJoeDotOrg: hi :)
[12:43:38] hi bblack
[12:43:41] good to see you
[12:44:10] GSM is easy to crack .. why do they still use it?
[12:44:18] It's monsterous
[12:55:11] GSM as in the mobile standard?
[12:57:28] Yup
[12:58:16] You can hack GSM with a cheap laptop and a hackRF or a BladeRF ... even, to a great extent with the $20 RTL-SDR
[12:58:23] Good shit, meng
[12:58:54] With the BladeRF (full duplex) you can hack and monitor LTE
[12:59:00] it's wacko
[15:14:30] Traffic, Operations: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3332571 (ema) I've collected 60s of ICMP traffic from GCE on the load balancers and sent a report through https://support.google.com/code/contact/cloud_platform_report?hl=en. I've also added a c...
[16:23:13] Traffic, Analytics-Kanban, Operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3332875 (Nuria)
[17:13:42] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3333191 (Fjalapeno)
[17:41:52] netops, Operations, Patch-For-Review: Rancid improvements - https://phabricator.wikimedia.org/T167288#3333354 (ayounsi) > I think there's a lot of value in doing so. Agreed on the rest. Converted! > Upgrade to 3.6.2 Done > Switch from CVS to GIT Done > Replace password auth with ssh key auth Done,...
[19:03:37] netops, Labs, Labs-Infrastructure, Operations, ops-codfw: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3333688 (RobH) Resolved>Open Nevermind, I had a bad config and it didn't commit. I need to investiage and redo the change.
[19:04:06] netops, Labs, Labs-Infrastructure, Operations, ops-codfw: codfw: labtestneutron2002 sswitch port configuration - https://phabricator.wikimedia.org/T167326#3333692 (RobH) a:Papaul>RobH
[19:47:35] netops, Labs, Labs-Infrastructure, Operations, ops-codfw: codfw: labtestneutron2002 switch port configuration - https://phabricator.wikimedia.org/T167326#3333850 (Papaul)