[06:33:54] from https://apachecon.com/acah2020/tracks/cassandra.html - "We at Netflix have about 70% of our fleet on Apache Cassandra 2.1, while the remaining 30% is on 3.0. We have embarked on a multi quarter task of upgrading our 2.1 fleet to 3.0..."
[07:15:52] <_joe_> elukey: "large web shop with a lot of history takes a long time to do a transition/upgrade"
[07:15:57] <_joe_> paint me surprised
[07:25:09] _joe_ sure but since 4.0 is almost out, and they have (IIRC) committers/pmcs in the project I thought they were on 3.0 by now
[07:25:52] but I am happy since AQS is on 2.2.6 :P
[07:26:12] (and we'll migrate in place to 3.0 hopefully during the next months)
[07:26:49] also when 4.x is out 2.x will not be supported anymore by upstream (in theory)
[07:31:06] speaking of cassandra, does anyone remember why we opted to use separate IPs rather than ports for multi-instance?
[07:32:06] gehel, ryankemper: there are a couple of search alerts that started over the weekend: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.codfw.wmnet&service=ElasticSearch+unassigned+shard+check+-+9243
[07:32:31] XioNoX: thanks, looking
[07:32:49] XioNoX: <3 for looking into alerts and pinging folks
[07:33:11] XioNoX: already pinged Discovery earlier on, dcausse is working on those
[07:33:40] nothing looks critical overall, but there are a few things that started over the weekend
[07:33:45] elukey: thanks :)
[07:34:07] that alert is about api feature usage, looks like we lost indices.
[07:34:15] that's definitely unexpected
[07:39:51] <_joe_> paravoid: I think godog is the only person who might remember
[07:40:22] <_joe_> or urandom :)
[07:55:48] yeah IIRC because a node in cassandra is identified by its address, not by its address+port, or at least it was in 2015
[08:53:41] really kudos to all the people who worked on the decom cookbook, really amazing
[08:54:54] it's only a matter of time until it shapeshifts and also takes care of unracking servers
[08:55:46] looking forward to it
[08:58:15] <_joe_> I have a puppet mystery to solve
[08:58:32] _joe_: that's a _terrible_ way to start a week
[08:58:54] <_joe_> I did follow the steps to update the certificate listed here https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate and all went well
[08:59:09] <_joe_> but then puppet seems not to apply the change
[08:59:40] <_joe_> oh, I see the issue now
[09:00:05] <_joe_> sslcert::certificate still expects the public cert to be committed to the public puppet repo
[12:58:50] about the D4 switch replacement, I'm waiting to see what's up in eqiad, we will probably delay the start by 30min or more - https://phabricator.wikimedia.org/T196487
[13:42:03] ok, we're going to shut down the D4 ToR in ~20min
[13:42:09] eqiad D4
[13:58:22] volans: quick trial of what you requested: https://gerrit.wikimedia.org/r/c/operations/puppet/+/630596 - what do you think?
[13:59:48] kormat: thanks! I think I would have tried to embed the break logic too
[14:00:01] volans: using an exception? or..
[14:00:35] to be pythonic yeah, but also given your low-level script returning a should_break :-P
[14:00:52] but I agree that in that case it would not save lines of code
[14:01:00] or improve clarity
[14:01:26] the current change saves 2 lines of code per call, but i'm not sure it improves readability
[14:02:24] if you prefer to keep it duplicated no big deal
[14:02:53] funny the asymmetry of assigning pos on one side and calling check_needle in the return line on the other :-P
[14:04:05] i could do `return find_line_start(f), check_needle(f, needle)`, and depend on python's order of evaluation. but it felt a bit cheaty
[14:04:28] I would have assigned both for clarity indeed :D
[14:04:57] ah, nicely keeping full line-count parity with the merged version :)
[14:05:20] anyway. there's no real clear win among any of these for me, so i'm inclined to just leave it as is.
[14:06:03] ahahah
[14:06:10] I've commented more or less the same in the CR
[14:06:34] CR abandoned with the message "Gleefully"
[14:07:20] lol
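For readers following along, here is a minimal sketch of the three shapes being weighed in the thread above. The helper bodies are invented stand-ins; only the names find_line_start and check_needle come from the discussion.

```python
# Hypothetical stand-ins for the helpers in the actual change.
def find_line_start(f):
    """Return the offset of the current line start in file f."""
    return f.tell()

def check_needle(f, needle):
    """Return True if the next line of f contains needle."""
    return needle in f.readline()

# Shape 1 (as merged): assign one value, compute the other in the
# return line -- the "funny asymmetry" volans pokes at.
def scan_asymmetric(f, needle):
    pos = find_line_start(f)
    return pos, check_needle(f, needle)

# Shape 2: build the whole tuple in the return line. Python evaluates
# tuple elements left to right, so find_line_start(f) runs before
# check_needle(f, needle) advances the file position -- correct, but
# the ordering dependency is implicit (the "cheaty" part).
def scan_terse(f, needle):
    return find_line_start(f), check_needle(f, needle)

# Shape 3: assign both for clarity, then return them together.
def scan_explicit(f, needle):
    pos = find_line_start(f)
    matched = check_needle(f, needle)
    return pos, matched
```

As the thread concludes, none of the three is a clear win, which is why the change was abandoned and the merged version kept as-is.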
[14:16:29] mutante jbond42 thank you for the swift refactor patches! are they good to be reviewed yet? I956fb365 and Ib86e7efda7 that is
[14:17:29] fyi, the hosts are coming back online
[14:17:36] kormat: ^ elukey ^
[14:17:44] one at a time as Chris patches them
[14:17:45] do they look angry?
[14:18:31] kormat: they've been properly fed during the maintenance
[14:18:38] perfect :D
[14:18:45] phew 😅
[14:21:12] all done!
[14:21:37] 🎉
[14:21:46] godog: probably not just yet. i ended up going down a bit of a swift rabbit hole today. the PS from me (https://gerrit.wikimedia.org/r/c/operations/puppet/+/630568) i think includes all the changes needed and can in theory be reviewed on its own. however i think it may be better to a) have mutante revert to an earlier simpler PS (before i requested some feature creep) b) potentially split mine up
[14:21:48] we now have 1 more 10G rack
[14:21:52] so it's easier to review so may be better to ...
[14:21:55] ... wait for mutante to come online and comment first
[14:22:20] and icinga is all green
[14:37:33] jbond42: ok! please LMK when they are good to be reviewed (cc mutante)
[14:38:20] godog: ack will do
[15:31:00] volans: i forgot to address your exit() suggestion, so https://gerrit.wikimedia.org/r/c/operations/puppet/+/630622 as i prefer your approach there
[15:32:43] there are three options there, exit(), sys.exit() and raise SystemExit
[15:32:56] everyone prefers one or the other and there is no real winner
[15:33:06] idc
[15:33:17] """There should be one—and preferably only one—obvious way to do it.
[15:33:18] """
[15:33:21] for the win
[15:33:34] it passed volint, good enough for me :)
[15:40:11] <_joe_> I usually use sys.exit()
[15:40:29] <_joe_> but yeah I would not object to anyone doing something else
[15:44:34] is d4 down for a while or just a few minutes?
[15:45:29] sys.exit just raises SystemExit afaik, although i always prefer the former for reasons I can't quite pin down
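That "afaik" is right. A quick, self-contained demonstration of how the three spellings relate (this is illustrative, not from the CR in question):

```python
import sys

# sys.exit() simply raises SystemExit, so it can be caught like any
# other exception.
try:
    sys.exit(3)
except SystemExit as e:
    print("sys.exit raised SystemExit with code", e.code)

# The builtin exit() does the same thing, but it is only injected into
# builtins by the `site` module (intended for interactive sessions),
# so scripts run with `python -S` won't have it -- one reason scripts
# usually prefer sys.exit().
try:
    exit(4)
except SystemExit as e:
    print("exit also raised SystemExit with code", e.code)

# Raising SystemExit directly is the third, equivalent spelling.
raise SystemExit(0)
```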
[15:45:34] andrewbogott: see -operations, it was planned to be back already, investigation in progress
[15:51:10] I managed to do a puppet-merge just as puppetmaster1002 went down - is there a simple command to re-sync?
[15:52:37] wikitech:Puppet claims you can just ssh to puppetmaster1002 and run sudo puppet-merge there
[15:52:47] that's a lie, unfortunately
[15:52:51] :(
[15:53:49] godog: labweb is back up but it didn't auto-resolve on VictorOps
[15:53:56] hnowlan: I'm taking a look
[15:54:30] thanks!
[15:54:45] you should be able to run puppet-merge.py with the latest sha1
[15:55:07] XioNoX: indeed, I'm taking a look
[15:55:15] thx!
[15:55:17] running the wrapper script with a SHA is broken :(
[15:56:42] hnowlan: okay, you're saved because someone else just did a merge in the meantime
[15:57:01] but yes, I think you should have been able to `sudo puppet-merge.py -o FETCH_HEAD` on puppetmaster1002
[15:57:08] (the shell script is a wrapper around the python)
[15:57:19] ah sweet, my favourite kind of doing something is doing nothing
[15:57:23] thanks for looking!
[15:57:34] yes that should work. cdanis: if you can raise a task for the wrapper script i'll take a look when i have a sec
[15:57:50] thanks John! happy to take the CR on that in exchange
[15:58:38] even better thx :)
[16:02:28] XioNoX: I was sure we had a similar task already but can't find it, filed as https://phabricator.wikimedia.org/T264016
[16:03:48] cool!
[20:15:06] Do I remember correctly that there's some way to have wmf-auto-reimage-host skip partitioning/reformat and just install a new OS? I have some hosts where I'd like to upgrade to buster but preserve data in the non-OS partition.
[20:16:47] of /srv specifically, with a special partman config
[20:17:18] and I don't think it works for all hosts, just with certain schemes?
[20:17:22] k.ormat wrote it
[20:18:31] hm
[20:18:52] /srv is what I need; I'll look at partman rules
[20:20:01] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/autoinstall/partman/reuse-raid1-2dev.cfg
[20:20:57] andrewbogott: looks like it has been removed again: https://gerrit.wikimedia.org/r/c/operations/puppet/+/608306
[20:21:34] ah, that was the old method to not format /srv then
[20:21:56] I see 'keep' in partman recipes which seems promising
[20:22:20] yes, that is implemented via https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/autoinstall/reuse-parts.cfg which fetches https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/autoinstall/scripts/reuse-parts.sh
[20:22:47] so I think you can take any partman config and write a 'keep' variant now
[20:23:02] with a reuse_partitions_recipe
[20:25:36] you need some other preseed data, there are details in reuse-parts.sh and some real live examples in netboot.cfg
[20:25:42] dbprov[12]00[123]) echo reuse-parts.cfg partman/custom/reuse-dbprov.cfg ;; \
[20:25:44] db[12]*|dbstore[12]*|es[12]*|pc[12]*|labsdb1*) echo reuse-parts.cfg partman/custom/reuse-db.cfg ;; \
[20:25:51] sretest*) echo reuse-parts.cfg partman/reuse-raid1-2dev.cfg ;; \
[20:25:53] etc
[20:28:16] cdanis: I see an example that just uses 'method{ keep } \' in an otherwise normal-looking recipe
[20:28:35] prometheus.cfg
[20:28:39] the part I was pointing out was that you need reuse-parts.cfg in the preseed data in netboot.cfg to reference that at all
[20:28:41] of course I don't know that it actually works :)
[20:29:13] ok, looking...
[20:29:18] prometheus.cfg hasn't been edited since before Stevie's changes, so I think that is something else entirely
[20:29:45] ok
[20:29:48] I'm tempted to try it :)
[20:30:31] yeah, I don't think there are fully-cooked end-user instructions, but it definitely works, it's the default for a bunch of db hosts nowadays
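Tying the thread together: per the discussion, preserving /srv on reimage seems to take two pieces, a partman recipe whose /srv entry uses 'method{ keep }', and a netboot.cfg entry that loads reuse-parts.cfg ahead of that recipe. A hypothetical netboot.cfg entry modeled on the db lines quoted above (the host pattern and the custom partman file name here are made up for illustration):

```
# hypothetical: reuse existing /srv partitions on cloudfoo hosts
cloudfoo100[12]*) echo reuse-parts.cfg partman/custom/reuse-cloudfoo.cfg ;; \
```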