[05:11:07] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T221481 (10Marostegui) 05Open→03Resolved The failed disk is now ok: ` root@db2047:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0DB0) Port Name: 1I Po...
[05:20:59] 10DBA: Fix revision special indexes and partitions on db1103:3314 and db1113:3316 - https://phabricator.wikimedia.org/T221782 (10Marostegui) a:03Marostegui
[08:51:22] I'm deploying PHP sec updates, tendril will be unavailable for a few seconds shortly
[08:51:32] 10DBA, 10Operations, 10decommission, 10ops-eqiad: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) 05Resolved→03Open @RobH @Cmjohnson there are still DNS entries for all these hosts: ` templates/wmnet:pc1...
[08:51:43] moritzm: thanks for the heads up
[08:52:25] and it's back and working fine
[08:53:28] \o/
[12:34:55] 10DBA, 10Patch-For-Review: Fix revision special indexes and partitions on db1103:3314 and db1113:3316 - https://phabricator.wikimedia.org/T221782 (10Marostegui) db1103:3314 is done and the table is now the same as db1097:3314: ` Query OK, 326003615 rows affected (6 hours 17 min 9.99 sec) Records: 326003615 Du...
[13:51:34] 10DBA, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) @RobH @Cmjohnson there are also entries on site.pp, I have sent a patch for that: https...
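[editor's note] The T221481 resolution above was verified by eyeballing `hpssacli controller all show config` output. A minimal sketch of automating that check — the `physicaldrive ... (..., STATUS)` line format is assumed from typical hpssacli output, and the function name is hypothetical:

```python
import re

def failed_drives(hpssacli_output: str) -> list:
    """Return physicaldrive IDs whose reported status is not OK.

    Assumes lines shaped like:
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
    where the status is the last comma-separated field in the parentheses.
    """
    bad = []
    for match in re.finditer(r"physicaldrive (\S+) \(([^)]*)\)", hpssacli_output):
        drive_id, details = match.groups()
        status = details.rsplit(",", 1)[-1].strip()
        if status != "OK":
            bad.append(drive_id)
    return bad
```

This would let a monitoring check flag any non-OK drive instead of relying on a human reading the controller dump.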
[15:53:32] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson)
[16:04:19] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson)
[16:15:32] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson)
[16:34:12] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson)
[17:04:55] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo)
[17:06:19] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo)
[17:08:57] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) 05Open→03Resolved This is now done, both servers are in production (although not with 100% of the final load, only all logic...
[17:21:34] jynus: did you update the Status of dbprov100x in netbox by yourself or did someone ask you?
[17:21:37] I'm curious!
[17:21:40] thanks for doing that regardless
[17:21:58] (it was the right thing to do!)
[17:22:21] the ./activate_database.py script did it!
:-D
[17:22:31] just kidding, it was added on the checkbox
[17:23:02] paravoid: see https://phabricator.wikimedia.org/T219399
[17:23:08] I didn't know I had to do it
[17:23:12] so I asked robh
[17:23:24] and he added it to the checkbox, so it is not forgotten
[17:23:46] I didn't even know about it until I ran into a ticket
[17:23:49] cool :)
[17:40:48] paravoid: may I ask how you learned about the change?
[17:41:08] it's on the front page of netbox, right-hand corner under changelog
[17:41:09] but also
[17:41:17] it was showing up on https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/
[17:41:23] which we're close to getting to zero
[17:41:28] this is a new thing we've built
[17:41:33] I am a netbox noob
[17:41:39] where it checks the data that we have in Netbox against what's in Puppet
[17:41:46] so if for instance there is a host that is online and in puppet
[17:41:48] nice
[17:41:51] but is marked as offline or inventory
[17:41:52] it warns
[17:42:04] or planned, as the case may be here
[17:42:13] what are your thoughts about netbox scope?
[17:42:25] pure inventory
[17:42:30] it also checks that the serial numbers are the same as the ones reported from SMBIOS->Facter->Puppet->PuppetDB :)
[17:42:30] or something else?
[17:42:45] e.g.
a more general source of truth
[17:43:02] there have been ideas floated around using it as a Puppet ENC, I don't know how I feel about that yet
[17:43:14] for now the rough plan is:
[17:43:43] I guess for now there is still some more basic automation and monitoring
[17:43:48] 1) Make sure that the data we have in Netbox is correct, accurate and complete (we're nearly there), as well as build infrastructure (reports) to make sure that we catch future errors
[17:44:01] nice
[17:44:03] so we have all Ganeti VMs in there now, automatically
[17:44:16] I didn't know that
[17:44:33] and we have reports that warn us if we miss asset tags, serial numbers, purchase dates, phab tasks, or if they're malformed, or if they don't match what's in Puppet
[17:44:43] we have inventory items for all networking equipment
[17:44:46] power cables in some cases
[17:44:48] etc.
[17:45:03] because a lot of manual work goes into server setup (some unavoidable), I see this as a very valuable asset
[17:45:21] almost every batch an IP is wrong, or a DNS entry has a mistake, etc.
[17:45:29] 2) Make sure that we have the necessary abstractions in Spicerack
[17:45:32] this will help detect those
[17:45:48] so we have a Netbox abstraction now there (as well as a Ganeti one), and a "makevm" cookbook for creating new VMs
[17:45:59] I didn't know about those either
[17:45:59] almost there, that is
[17:46:12] so we're almost there now for (2)
[17:46:20] oh, for (1) there's more reports coming, e.g. one checking LibreNMS
[17:46:29] as network devices are not in Puppet
[17:46:33] next step is...
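[editor's note] The serial-number check described above (Netbox data vs. what SMBIOS reports via Facter/PuppetDB) boils down to a diff between two host→serial mappings. A minimal self-contained sketch of that core comparison — not the actual report code, which runs inside Netbox's reports framework:

```python
def serial_mismatches(netbox: dict, puppetdb: dict) -> dict:
    """Compare host -> serial mappings from two sources of truth.

    Returns {hostname: problem description} for hosts that are missing
    from one source or whose serials disagree.
    """
    issues = {}
    for host in set(netbox) | set(puppetdb):
        nb_serial = netbox.get(host)
        pdb_serial = puppetdb.get(host)
        if nb_serial is None:
            issues[host] = "missing from Netbox"
        elif pdb_serial is None:
            issues[host] = "missing from PuppetDB"
        elif nb_serial != pdb_serial:
            issues[host] = f"serial mismatch: netbox={nb_serial} puppetdb={pdb_serial}"
    return issues
```

The same shape works for the state checks mentioned above (e.g. warning when a host active in Puppet is still marked planned/offline in Netbox): gather both views, diff, report.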
[17:46:37] it is almost perfect, except that it uses postgres :-D
[17:46:48] 3) make sure that the existing cookbooks (wmf-reimage and makevm) also do all the necessary state transitions
[17:47:05] so that you don't have to do the planned -> staged in the future, wmf-reimage would do that automatically
[17:47:10] which is not a usual diss, we are way behind in postgres automation and support
[17:47:12] (and similar for the opposite in decom)
[17:47:40] 4) start adding more and more of the manual steps to these cookbooks, hand-in-hand with adding more data into netbox
[17:47:41] although I saw the other day that at least the basic backups are happening
[17:47:56] the most likely next step is to generate all the PXE boot configuration
[17:48:07] oh, so that is almost a yes to my question
[17:48:08] instead of doing puppet commits to add MAC addresses to dhcpd.conf
[17:48:22] it is more than pure inventory
[17:48:39] the next step after that (or before, perhaps, still unclear!) is DNS generation of all .mgmt.$site.wmnet entries
[17:48:46] and then after that, even the primary DNS
[17:49:07] 5) is to start using its IPAM functionality
[17:49:31] so basically, instead of manually picking IP addresses, just give it the subnets/pools (halfway there), plus all reserved IPs
[17:49:38] and let it pick the next available address
[17:49:48] as part of e.g. wmf-reimage or another cookbook
[17:49:56] and finally
[17:50:21] paravoid: I think you have work for a year already :-D
[17:50:22] 6) integrate it all with our network configuration management (while rewriting that)
[17:50:27] that's also halfway there
[17:50:37] so VLANs in switches, interface ports, VLAN assignments, port descriptions
[17:50:50] oh and
[17:51:26] 7) tangentially related to Netbox, still unclear: provision iDRAC/iLO+BIOS+RAID settings centrally/automatically
[17:51:55] for example, avoiding accidental reimages if it is "active"
[17:52:00] (?)
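[editor's note] The IPAM step (5) above — "give it the subnets/pools plus all reserved IPs and let it pick the next available address" — is what Netbox's available-IPs API provides server-side. A minimal local sketch of the same logic with the standard library, assuming allocations and reservations are available as a set of address strings:

```python
import ipaddress
from typing import Optional

def next_available_ip(prefix: str, in_use: set) -> Optional[str]:
    """Return the first usable host address in `prefix` that is neither
    allocated nor reserved, or None if the prefix is exhausted."""
    network = ipaddress.ip_network(prefix)
    for addr in network.hosts():  # skips network and broadcast addresses
        if str(addr) not in in_use:
            return str(addr)
    return None
```

A cookbook like wmf-reimage could call this (or, more likely, the equivalent Netbox API endpoint) instead of a human manually picking addresses, which is where the "almost every batch an ip is wrong" class of errors comes from.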
[17:52:28] well yes, that's also part of it, but were you asking in relation to something else I said?
[17:52:30] maybe that is separate work
[17:52:43] no, that's part of it
[17:53:04] 7 has a relationship with the full lifecycle
[17:53:13] no no
[17:53:18] no?
[17:53:19] (7) is about provisioning iDRAC
[17:53:29] yeah, but that is done during install
[17:53:35] (sure, not later)
[17:53:37] before OS install
[17:53:54] but it is one of the tasks
[17:53:56] OS install (+ wipe, for that matter) will be all automated
[17:54:01] incl. state transitions
[17:54:05] exactly
[17:54:08] that is what I meant
[17:54:10] and phab bug reports for "please unrack" when wiping
[17:54:11] part of a larger task
[17:54:15] of status changes
[17:54:20] yeah
[17:54:25] statuses we've documented here: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States
[17:54:30] including a state diagram
[17:54:37] I didn't know about those
[17:54:42] I am not sure others do
[17:54:51] I just learned almost by accident
[17:55:08] that's good feedback
[17:55:16] so it probably needs more publicity internally
[17:55:31] (maybe it was communicated while I was on vacation)
[17:55:51] also we are much more part of the setup process
[17:56:10] than other ops, because hardware is important for us, so we may be "special"
[17:56:22] how so?
[17:56:57] I think we are more on top of hardware than other people because we buy a lot and it is expensive, etc
[17:57:09] ssds, raid setup
[17:57:25] we have our own wiki page for chris and papaul, for example :-D
[17:57:41] haha
[17:57:54] anyway, that was a nice chat, but I have to go
[17:58:01] see you!
[17:58:23] so
[17:58:28] Dell has a feature
[17:58:36] where the iDRAC, given a DHCP option
[17:58:44] can fetch an HTTP URL with a json file
[17:58:47] which has all the BIOS settings
[17:58:56] (you can save those from an existing system, and modify)
[17:59:03] iDRAC + BIOS + HW RAID, that is
[17:59:13] I haven't looked at HPE, I'd guess it's similar
[17:59:25] it's farther out on my roadmap
[17:59:36] but if it saves you guys some time, then we can start exploring it sooner
[18:07:06] paravoid: that would definitely save time for chris and papaul, as they have to go through all the iDRAC, BIOS and especially RAID config manually
[18:07:38] not sure if by default they are shipped with PXE enabled, but if not, that is also a step to save: configure PXE automatically or from the factory
[18:07:56] I know some vendors allow that sort of configuration by default, i.e. configure PXE on eth1 from the factory
[18:31:16] 10DBA, 10Patch-For-Review: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Papaul)
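[editor's note] The Dell feature described in the 17:58–18:07 exchange is the iDRAC's auto-config import of a Server Configuration Profile: a file exported from a reference machine, edited, and served over the network for new hosts to fetch at first boot. A hypothetical fragment of what such a JSON profile might look like — the attribute names and values below are illustrative, not taken from an actual WMF profile:

```json
{
  "SystemConfiguration": {
    "Comments": ["Illustrative sketch only; export a real profile from a reference host"],
    "Components": [
      {
        "FQDD": "BIOS.Setup.1-1",
        "Attributes": [
          {"Name": "BootMode", "Value": "Bios"},
          {"Name": "PxeDev1EnDis", "Value": "Enabled"}
        ]
      },
      {
        "FQDD": "iDRAC.Embedded.1",
        "Attributes": [
          {"Name": "IPMILan.1#Enable", "Value": "Enabled"}
        ]
      }
    ]
  }
}
```

Applying one profile per hardware generation would replace the manual iDRAC/BIOS/RAID clicking mentioned above, including enabling PXE on the desired NIC from the start.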