[08:00:54] hello folks!
[08:01:19] As announced, Moritz and I are going to move puppet-merge to puppetserver1001
[08:01:44] if you can wait half an hour to merge your changes we'd be grateful :)
[08:01:52] or otherwise, if you have urgent work to do please tell us
[08:22:33] we are solving a quick issue with labs private, should be done soon
[08:24:44] (basically when we run puppet-merge.py remotely we don't update labs private on all the nodes, so the one on puppetserver1001 had its last commit dated 2023)
[08:28:38] the final puppet-merge on puppetmaster1001 has happened; that host was bought over eight years ago! it replaced palladium back at that time (if you ever used that, you're eligible for a sabbatical :-)
[08:29:17] :D
[08:29:45] \o/
[08:29:49] I'll be thinking hard about that in ~10 days
[08:30:02] (no)
[08:30:09] congrats!!
[08:30:18] moritzm: shall we give the green light for puppet-merges?
[08:30:19] also yes, congratulations
[08:31:13] yeah, everyone can merge freely on puppetserver1001 now
[08:31:48] moritzm, elukey: do you have the patches for the cookbooks ready?
[08:32:10] volans: not yet, we need to create them
[08:32:51] good point, I hadn't thought of the spicerack constants
[08:35:39] I *think* it might be enough to fix the cookbooks that use puppet_master as the host on which to check the puppet and puppet private repos
[08:35:53] at least for now, until we really get rid of puppet5
[08:38:16] Going to do it later on; IIUC we need to fix those, but it seems not super urgent
[08:39:41] had a quick look: get_puppet_ca_hostname() itself is fine, it's only used to specifically determine the Puppet 5 CA as used by the decom cookbook, and that will continue to be needed until we have gotten rid of the last Puppet 5 node
[08:41:21] the decom cookbook needs to check the code on the puppetserver, not the puppetmaster, see PUPPET_REPO_PATH and PUPPET_PRIVATE_REPO_PATH
[08:43:33] volans: just filed the change
[08:44:20] <3
[08:44:21] and also, I know the fix wasn't complete
[08:44:25] now it is
[08:48:23] * elukey running a quick errand
[09:03:59] <_joe_> debian experts, I need some advice on packaging
[09:04:40] <_joe_> conftool has a new binary and thus should have another debian package like python3-conftool-{dbctl,requestctl}
[09:05:16] <_joe_> but this code only works on python 3.10+ so I'd like to build and upload the package only on bookworm, not on buster/bullseye
[09:05:28] <_joe_> is that doable with a single control file or won't that be possible?
[09:12:08] we ran into the same issue with debmonitor, where the server needs Bookworm and the latest Django, while the client needs to be supported on all distros. In the end we split the source packages for easier handling
[09:12:24] the other option would be to check dependencies in the build itself
[09:12:44] and simply skip the new component during the build if on an older OS
[09:13:01] and then just have an empty deb stub package on Bullseye/Buster
[09:13:46] the typical mechanism within Debian for packages which are heavily backported
[09:13:50] like openjdk
[09:14:08] is to have some mechanism within debian/rules which enables/disables features based on an OS preference
[09:14:37] like when openjdk is backported, it updates the dependencies for the different OS or disables features which are not available on older distros
[09:14:52] probably not warranted here, though
[09:17:33] <_joe_> actually that might work
[09:17:53] <_joe_> also to be clear, the packages will build, and will mostly work
[09:18:00] <_joe_> but we give no guarantees
[09:18:20] you can also add deps in setup.py that depend on the Python version; not sure if you can do the same for entry points
[09:18:32] <_joe_> so I'm inclined to just let it build everywhere, we will use it only on bookworm anyway and the software itself refuses to start
[09:18:57] <_joe_> if sys.version_info < (3, 11):
[09:18:59] <_joe_> raise RuntimeError("Python 3.11 or higher is required")
[09:19:24] <_joe_> it's just aesthetically unpleasant
[09:19:37] <_joe_> but it's debian packaging, it's ugly by contract, I mean :)
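For reference, a minimal sketch of the setup.py approach suggested above: PEP 508 environment markers can gate individual dependencies on the Python version, while the runtime guard _joe_ pasted makes the new tool fail fast on older interpreters. The package, module, and dependency names below are hypothetical, not conftool's actual layout.

```python
# Hypothetical setup.py sketch, not conftool's real packaging.
from setuptools import setup, find_packages

setup(
    name="conftool",
    packages=find_packages(),
    install_requires=[
        "pyyaml",
        # PEP 508 environment marker: only pulled in on Python >= 3.11,
        # so the base package still builds/installs on older distros.
        "some-newer-dependency; python_version >= '3.11'",
    ],
    entry_points={
        "console_scripts": [
            # The new console script is registered everywhere; on older
            # interpreters it raises the sys.version_info RuntimeError
            # quoted in the discussion above instead of working.
            "newctl = conftool.newctl:main",
        ],
    },
)
```

Whether the entry point itself can be made conditional is the open question in the chat; this sketch sidesteps it by always registering the script and relying on the runtime check.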
[09:25:25] btullis: good to merge 'Move the misc_crons dumper role from snapshot1017 to snapshot1016'?
[09:25:33] Yes please.
[09:25:35] ack
[09:25:38] ty
[10:30:05] Something is off with MW Memcached
[10:30:17] see https://logstash.wikimedia.org/app/dashboards#/view/memcached
[10:38:41] it seems to be connection errors/timeouts to mcrouter's svc in k8s, but from https://grafana.wikimedia.org/d/ltSHWhHIk/mw-mcrouter I don't see anything popping up
[10:38:51] effie: --^
[10:41:26] looking
[10:41:50] need to run afk, will check later sorry
[10:42:08] I tried to look into mcrouter's logs but I can't find a clear smoking gun
[10:42:59] I think it is puppet related
[10:43:35] not mcrouter related, cc jayme; I think the timeouts happen every time puppet runs
[10:44:05] https://usercontent.irccloud-cdn.com/file/VQyhYl6U/image.png
[10:44:22] sigh, this ferm/k8s problem is not going away
[10:47:59] it will go away after puppet has run on all k8s hosts, but we are aware of this and we have tried to work around it
[11:25:39] elukey: I got an error running `puppet-merge` on puppetserver1001 after merging a change to labsprivate. https://phabricator.wikimedia.org/P69324
[11:26:21] Looks probably non-serious, affecting puppetmaster1003.eqiad.wmnet and puppetmaster2002.codfw.wmnet only.
[12:26:05] btullis: checking!
[12:26:35] ah snap ok I know what it is
[12:32:05] mmm sort of, so those should be puppetmaster backends, and they don't have the repo checked out
[12:48:31] the change wasn't propagated to all puppetservers though
[12:49:13] only to puppetserver1001
[12:55:11] * elukey checking puppet-merge.py
[13:01:50] ahhh okok found the issue
[13:02:01] it is in /etc/puppet-merge/shell_config.conf
[13:02:30] for labs private it uses the MASTERS variable to get the target hosts to run puppet-merge.py on
[13:02:41] on puppetmaster1001 we have
[13:02:43] MASTERS="puppetmaster2001.codfw.wmnet"
[13:02:54] on puppetserver1001
[13:02:54] MASTERS="puppetmaster1001.eqiad.wmnet puppetmaster1003.eqiad.wmnet puppetmaster2001.codfw.wmnet puppetmaster2002.codfw.wmnet puppetserver1002.eqiad.wmnet puppetserver1003.eqiad.wmnet puppetserver2001.codfw.wmnet puppetserver2002.codfw.wmnet puppetserver2003.codfw.wmnet"
[13:04:44] I'd argue that it was also wrong on puppetmaster
[13:11:26] CR incoming
[13:12:45] is it ok/safe to run puppet-merge now elukey, or should I wait?
[13:15:10] godog: for ops/puppet yes, for labsprivate better to wait
[13:15:18] ack, thank you
[13:19:45] fix should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074176
[13:20:01] moritzm: --^ if you have time
[13:23:13] ack, reading through backscroll now
[13:28:41] looks good, when we're fully on Puppet 7 we'll have so many crazy corner cases eliminated...
[13:29:02] I agree
[13:30:31] not urgent, but any chance T375193 could be related to the ongoing Wikitech migration work? maybe it previously used InstantCommons but now needs to use another way to use Commons images, more similar to other Wikimedia wikis? (just guessing)
[13:30:31] T375193: Add.svg image shows up as “Add patch” redlink on Wikitech deployment calendar - https://phabricator.wikimedia.org/T375193
[13:33:01] okay, looks like it fixed itself for now, feel free to ignore
[13:38:52] btullis: the issue should be fixed now!
[13:39:24] I manually ran git pull on all puppetservers and master frontends for labs-private
[13:41:21] and also updated https://config-master.wikimedia.org/labsprivate-sha1.txt
[14:05:38] elukey: Thanks.
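A rough illustration of the fan-out behaviour described above: labs/private changes only reach the hosts listed in MASTERS, so a stale or incomplete list leaves stale checkouts everywhere else. This is a hypothetical sketch, not the actual puppet-merge.py code; the repository path and the remote git command are assumptions made for the example.

```python
# Hypothetical illustration of the MASTERS fan-out, NOT the real puppet-merge.py.
import shlex
import subprocess

CONFIG = "/etc/puppet-merge/shell_config.conf"

def read_masters(path=CONFIG):
    """Return the host list from the MASTERS=... line of the shell config."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("MASTERS="):
                # MASTERS="host1 host2 ..." -> ["host1", "host2", ...]
                return shlex.split(line.split("=", 1)[1])[0].split()
    return []

def propagate_labsprivate(repo_path):
    """Fast-forward the labs/private checkout on every host listed in MASTERS."""
    # If MASTERS only lists a subset of the fleet (as on puppetmaster1001),
    # the checkout on every other server silently falls behind.
    for host in read_masters():
        subprocess.run(
            ["ssh", host, "git", "-C", repo_path, "pull", "--ff-only"],
            check=True,
        )
```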
[15:17:07] FYI, at around 17:00 UTC (with a bit of prep before), we're going to run a test in which we temporarily depool mw-api-int-ro, mw-api-ext-ro, and mw-web-ro in codfw.
[15:17:07] the goal is to validate capacity estimates for serving on 100% k8s from a single DC before the switchover.
[15:17:07] we'll keep an eye on those services themselves and WAN links adjacent to eqiad.
[15:17:07] more details in https://phabricator.wikimedia.org/T371273
[15:17:33] mutante: FYI, as fellow oncaller at that time ^
[15:18:03] unrelated but perhaps related, I plan to depool ulsfo with the DNS admin cookbook to make sure everything works as expected -- we have run a test before but there have been some changes so I want to run a real-world test
[15:18:19] and to avoid any surprises before next week's switchover
[15:18:51] swfrench-wmf: ok, thanks
[15:19:43] sukhe: ack, thanks for the heads-up - do you know when you might be doing that?
[15:19:57] swfrench-wmf: I think I will do it right now unless there are any objections
[15:20:05] basically depool and then wait a bit and pool back again
[15:20:25] sounds great, no objections on my end
[15:20:43] ship it
[15:21:19] thanks
[15:32:02] volans: hi, quick question if you are around
[15:32:30] set_and_verify("pooled", False) seems to be error-ing out for me
[15:32:46] the error message is fairly explanatory: ValueError: False not in 'yes | no'
[15:33:03] but on the flip side, I see uses of set_and_verify that set False and not 'yes | no'
[15:33:54] and also, test-cookbook didn't actually catch this, while an actual cookbook run seems to error out on it
[15:38:08] sukhe: interesting
[15:38:30] oh, interesting ... yeah, passing False to that will certainly type-check, but it seems that the actual validation applied by the underlying object type is not applied until you try it for real
[15:38:51] (my goodness, my typing is terrible)
[15:40:08] (i.e., the object's `update` is never called)
[15:40:43] updated, trying again. good thing we did a sample run :)
[15:41:00] like an actual live sample run!
[15:41:43] * volans checking the schema
[15:42:14] because it's used by dnsdisc and mediawiki with booleans
[15:43:10] yeah
[15:44:36] _schema = {"weight": get_validator("int"), "pooled": get_validator("enum:yes|no|inactive")}
[15:44:44] that's hardcoded into conftool
[15:45:13] hmm
[15:45:38] and enum gets a validator that checks if what you pass is "in" the list of choices
[15:45:43] so yeah in this case strings
[15:46:11] sorry for the confusion I might have added by bringing up the existing usages as examples
[15:46:14] yeah, ok then, I reverted to literal yes and no in the current CR, will try again.
[15:46:48] as for test-cookbook, why do you say it didn't catch it? did you run only in dry-run?
[15:46:55] yeah
[15:47:13] DRY-RUN: END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: no reason specified, no task ID specified]
[15:47:19] you can do a real run from test-cookbook too, the logs will be in your home but by default it will log to SAL
[15:47:35] for testing purposes it's ok
[15:47:41] oh right, if I skip --dry-run
[15:47:56] yeah, just wanted to check the rest of the logic was fine and hence the dry-run
[15:48:11] ofc dry-run first
[15:56:06] ok, that worked fwiw, "yes|no" as expected
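To illustrate why the failure above only shows up on a live run: per the discussion, the enum validator checks whether the supplied value is in the list of string choices, so a boolean passes Python type checks and a dry-run (where the object's update is never called) and only fails when the write actually happens. The validator below is a simplified stand-in for illustration, not conftool's real implementation.

```python
# Simplified stand-in for the enum validator described above, for illustration only.
def get_validator(spec):
    """Return a validator for specs like 'enum:yes|no|inactive'."""
    kind, _, arg = spec.partition(":")
    if kind == "enum":
        choices = arg.split("|")

        def validate(value):
            # Membership check against string choices: booleans never match.
            if value not in choices:
                raise ValueError(f"{value} not in {arg!r}")

        return validate
    raise NotImplementedError(kind)

_schema = {"pooled": get_validator("enum:yes|no|inactive")}

# A dry-run never calls update(), so nothing complains about the boolean yet.
pooled = False

_schema["pooled"]("yes")          # ok: literal string is in the choices
try:
    _schema["pooled"](pooled)     # live run: the boolean is rejected
except ValueError as exc:
    print(exc)                    # False not in 'yes|no|inactive'
```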
[16:06:25] FYI, the ToR switch migrations in codfw today are going to touch conf2006, which is used by pybals in eqsin and ulsfo, as well as confd in those sites plus codfw itself.
[16:06:25] once the migration is done, we'll check to see if those pybals are unhappy, and if needed restart them.
[16:06:46] also, we'll be progressively restarting confd in those locations as well.
[16:06:48] Thanks Scott
[16:07:07] topranks: if you could ping here when they're done, that would be swell
[16:07:11] It will also affect bast2003 briefly, so anyone connected using SSH sessions via that host should expect them to be unresponsive for a few seconds
[16:07:21] swfrench-wmf: yes absolutely
[16:07:38] swfrench-wmf: should we point them away from conf2006?
[16:07:45] are we ready to start from a conf2006 point of view?
[16:08:28] sukhe: I would expect that to be more disruptive on net than a brief connectivity blip. or, I assume you mean the pybals?
[16:08:28] sukhe: restarting pybal should be enough
[16:08:43] swfrench-wmf: I meant the pybals
[16:09:08] what was it, did it fix after cache reload/reparse?
[16:09:27] we can point them to 1009 temporarily, it's a hiera change but requires restarts, yes
[16:09:30] oh, ignore me. I had an old backscroll
[16:10:04] sukhe: we will be following what is documented here: https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Reimage_cluster Since either of the two (hiera change or restart) requires a restart, let's just do the simplest one :)
[16:11:01] it's also very possible that no restart is needed, at least for the pybal case
[16:11:16] that too yes
[16:11:33] topranks: sorry, missed your message in there - yes, good to go
[16:11:35] we will be losing connectivity to conf2006 during this, right?
[16:11:38] topranks: good to go
[16:11:53] sukhe: aye
[16:11:58] ok thanks both - we are doing rack d7 now and will move on to d8 (where the conf host is) in a few mins
[16:12:08] if we missed something, we will follow the alerting trail
[16:17:56] effie: surely I am missing something, but where in the documentation is it an either/or?
[16:18:07] > Make Pybal use the other cluster
[16:18:20] the gerrit change points it to the other confd host and then the restart happens
[16:18:51] sukhe: the documentation was written for reimaging one, so a longer downtime is expected in that case
[16:19:28] precisely, yeah - this is the difference between tens of minutes vs. seconds of unavailability
[16:20:02] stated differently, we're borrowing some bits from those docs, but it's not entirely applicable
[16:20:04] ok. the only concern I have is: what if it's not that, given it is a maintenance?
[16:20:09] but it's fine, I will stop now :]
[16:20:51] effie: ok conf2006 moved, we lost about 11 seconds of pings
[16:21:01] thanks topranks!
[16:21:18] topranks: sounds like ragnarok, yes
[16:21:22] great! that's not too bad
[16:21:55] I'll check the status of pybals first, and then get the confd restarts rolling
[16:22:13] thanks - all the moves are done now and all hosts responding to ping again
[16:22:19] cc arnaudb
[16:22:22] swfrench-wmf: IMO not required but yeah, thanks for checking!
[16:22:49] [the hard requirement is when the config changes and in case we had pointed away from conf2006]
[16:32:24] I will restart codfw backups, ok?
[16:37:58] jynus: sure go ahead
[16:38:16] alright, pybals still happy, confd restarts done, effie took care of navtiming on webperf2003. I think we're done.
[16:38:43] sukhe: anything outstanding from your end?
[16:42:01] effie: thanks, looks good! (I checked "journalctl -u pybal --since='2 hour ago'" on 'A:lvs and (A:ulsfo or A:eqsin)', plus I manually depooled and pooled to make sure connectivity is fine)
[16:42:31] thanks, sukhe!
[16:43:06] alright, now on to the next excitement: the pre-switchover capacity validation test will start at roughly 17:00 UTC.
[16:43:39] I will, however, be starting to scale up the relevant deployments now-ish
[16:44:25] good luck folks!
[17:03:00] starting now, will be mainly posting to -operations
[18:06:17] test is complete, no significant issues encountered, and indeed the capacity math seems to math alright.
[22:23:46] no events today for on-call. really away from keyboard now.