[08:48:15] XioNoX, godog o/ can you check https://phabricator.wikimedia.org/P14076 when you have a moment? Those are uncommitted DNS changes for Netbox
[08:48:25] if those are ok we can run the cookbook
[08:49:52] elukey: the network ones are fine
[08:50:44] <3
[08:51:18] elukey: yep all expected, logstash-be are being provisioned IIRC
[08:51:37] super, going to run the cookbook then
[09:01:45] it didn't really go well
[09:01:46] info: admin_state: checking state file '/tmp/dns-check.r4ssaz_l/state/admin_state'...
[09:01:49] error: Cannot open '/tmp/dns-check.r4ssaz_l/zones/netbox/7.0.e.f.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa' for reading: No such file or directory
[09:01:52] fatal: Initial load of zone data failed info: admin_state: checking state file '/tmp/dns-check.r4ssaz_l/state/admin_state'...
[09:01:56] error: Cannot open '/tmp/dns-check.r4ssaz_l/zones/netbox/7.0.e.f.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa' for reading: No such file or directory
[09:01:59] (double paste sorry)
[09:02:58] going to open a task sigh
[09:03:19] elukey: ah, one sec
[09:03:23] IIUC this was a failed reload on all dns auth servers, so nothing really problematic
[09:04:03] elukey: in templates/0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa we need to remove $INCLUDE netbox/7.0.e.f.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
[09:04:42] the other error mentions
[09:04:43] error: Cannot open '/tmp/dns-check.byofbm4p/zones/netbox/3.0.e.f.3.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa' for reading: No such file or directory
[09:04:51] but it is not consistent across auth servers
[09:05:22] probably the other link, I'll have a look
[09:05:33] ack
[09:06:10] are you going to send a patch for DNS? I think it is also fine to wait for more people to join
[09:11:05] (bbiab)
[09:16:53] elukey: thanks for the trust ;)
[09:17:21] elukey: https://gerrit.wikimedia.org/r/c/operations/dns/+/660767 (cc/ volans|off )
[09:29:56] XioNoX: checking
[09:33:01] XioNoX: ship it
[09:33:26] I haven't yet had time to add the check for creating new files / deleting old ones
[09:33:57] in the meanwhile we should be careful checking what the changes are and, if there is a removed/new file, check that the includes are correct
[09:35:03] thx!
[09:36:02] thanks all :)
[09:37:13] merged!
[09:37:18] elukey: let me know if it works now
[09:39:38] XioNoX: running
[09:41:54] FYI in those cases the procedure to follow is outlined here:
[09:41:54] https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change
[09:47:03] volans: ah yes I was about to ask, of course there is no diff to commit
[09:47:38] if the dns patch was merged and deployed successfully there should be nothing more to do
[09:48:13] volans: because the authdns servers were reloaded, so they also picked up the new netbox zone changes?
[09:48:44] yes, because those were already pushed by the cookbook (just the reload failed)
[09:49:08] you can also force a reload through the cookbook, see the --force option (needs a SHA1)
[09:49:09] gooood
[09:49:18] nono I am fine :)
[11:05:37] Hey, is there a friendly SRE who would merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/660783 [beta] Add vote.wikimedia.beta.wmflabs.org to beta_sites for me?
[11:13:59] Urbanecm: done :)
[11:14:30] Thx
[11:29:51] elukey: patch works, but unraveled a mistake I made in the MW side of the config. But I'll fix that myself :)
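For context on the failed reloads above: they came from $INCLUDE directives in the zone templates pointing at auto-generated netbox fragments that no longer exist. A minimal sketch of a pre-flight check for such dangling includes, run from an operations/dns checkout; the zones/netbox layout is an assumption based on the temp-dir paths in the error messages, not the repo's actual structure:

    #!/bin/bash
    # Hedged sketch: flag $INCLUDE directives whose netbox fragment is missing.
    # ZONES_DIR is an illustrative assumption; the generated fragments live
    # wherever the check stages them (zones/netbox in the temp layout above).
    ZONES_DIR=${ZONES_DIR:-zones}
    grep -rhoE '\$INCLUDE +netbox/[^ ]+' templates/ | awk '{print $2}' | sort -u |
    while read -r frag; do
        [ -e "${ZONES_DIR}/${frag}" ] || echo "dangling include: ${frag}"
    done

This is roughly the "check that the includes are correct" step volans describes at 09:33:57, just automated.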
[11:40:21] super
[12:03:01] _joe_, I am guessing only some of the data was unavailable/long latency and that could flap a bit, the same way that lvs -> appserver -> db is difficult when db has partial problems
[12:03:19] it is a difficult problem to handle :-/
[12:03:20] <_joe_> so, I think there are two things here:
[12:03:40] <_joe_> 1 - it's ok if we get paged if the swift backend is suffering
[12:04:02] <_joe_> 2 - it's not ok if pybal depools swift proxies because the swift backend is suffering
[12:04:23] indeed, we could maybe tune the lvs check/behaviour
[12:04:24] <_joe_> if anything, it only risks making the problems worse for end users
[12:04:35] <_joe_> yeah that's the one actionable I have here
[12:04:41] maybe increasing the number of min depooled
[12:05:01] <_joe_> besides that... maybe depooling swift the moment we rebalance it could be a good idea from now on
[12:05:16] or increasing the check to cover multiple objects?
[12:05:23] <_joe_> given IIRC something similar happened last time godog did a rebalance
[12:05:40] <_joe_> not paging, but high load and some issues on the backends
[12:06:04] yeah noisy rebalances are a long-standing issue
[12:06:23] I can go ahead and merge puppet changes right now, right? we're not in a bad spot?
[12:06:46] given that we're out of the woods I'll go afk and back to lunch, will follow up with actionables/tasks later
[12:06:51] unless objects ?
[12:06:52] apergos, no blockers on deployments, imho
[12:06:55] <_joe_> godog: +1
[12:06:56] objections even
[12:07:07] <_joe_> we will look at repooling codfw later in the day
[12:07:16] apergos, unless you plan to deploy to the lvs config of swift :-DDD
[12:07:21] thanks!
[12:07:29] not in my todo list for today! :-P :-D
[12:07:31] <_joe_> I suspect that, without external traffic, it should rebalance faster :)
[12:07:50] did we check eqiad was ok with the extra load?
[12:08:01] <_joe_> I was keeping an eye on it
[12:08:10] I am guessing yes at this hour
[12:08:34] <_joe_> jynus: let's put it this way: if it's not, it's a problem we need to fix :)
[12:08:39] hehe
[12:08:41] indeed
[12:09:05] so this reminds me of similar issues with the layers when we had db issues but not a full hard down
[12:09:32] it is very easy to flap, although in that case normally we don't notice it
[12:09:50] except in an increased number of errors
[12:10:09] automation is hard :-/
[13:27:38] re: the earlier swift page, filed https://phabricator.wikimedia.org/T273453
[13:34:24] (actually writing an incident report in short form on wikitech as we speak)
[14:48:21] <_joe_> godog: should we repool swift in codfw? the load on the backends is going down progressively
[14:53:00] _joe_: yeah sounds good to me
[14:53:46] <_joe_> done!
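On the "depool swift the moment we rebalance" idea: pooled state at WMF is managed with conftool, so wrapping a rebalance in a depool/repool would look roughly like the sketch below. The selector values (dc/cluster/service names) are illustrative assumptions, not taken from the log:

    # Hedged sketch: depool the codfw swift frontends before a ring rebalance
    # and repool afterwards. Selector values are assumptions.
    sudo confctl select 'dc=codfw,cluster=swift,service=swift-https' set/pooled=no
    # ... run the rebalance, wait for backend load to settle ...
    sudo confctl select 'dc=codfw,cluster=swift,service=swift-https' set/pooled=yes

The repool step matches what _joe_ did manually at 14:53:46 after the backend load dropped.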
[14:54:11] nice, thank you
[17:50:26] FYI puppetlabs updated their documentation on pson https://github.com/puppetlabs/puppet/blob/main/api/docs/pson.md hopefully you don't need this information but it's here for the curious and insane :)
[17:57:08] TL;DR: "almost JSON" :D
[17:59:09] I just know when I fetch it from the puppet compiler I can usually parse it with jq ;)
[18:00:18] btw if someone wants to take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/659991 it might be helpful wrt T273312
[18:00:18] T273312: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312
[18:04:23] cdanis: looks good but I wonder if some or all of the output of `sudo facter -p os.release os.distro` may be useful to include?
[18:10:08] hey! that's useful
[18:20:44] jbond42: what does -p do in facter?
[18:20:53] it does not appear in the manpage
[18:21:08] oh, I see, loads some extra Puppet-specific stuff
[18:21:19] cdanis: it resolves facts provided by puppet, in this instance you don't need it tbh
[18:21:24] yes
[18:21:47] it can slow things down quite a bit so it's nice to drop it if not needed
[18:22:11] * jbond42 just muscle memory on my part
[18:25:02] cool, I'm writing a more general version using this
[18:25:20] cool
[18:27:02] cdanis: the options '-j, -y and --no-ruby' may also be useful (only use --no-ruby if you really need to, as facter 4 moves back to ruby)
[18:28:39] jbond42: `DISTRO_FACTS=$(facter -j os.distro | jq -e '."os.distro"')`
[18:28:41] :)
[18:30:02] :) nice
[18:39:28] jbond42: ... why did facter stop including point releases in Buster?
[18:40:52] "id":"Debian","release":{"full":"10","major":"10"}}}
[18:40:57] whereas /etc/debian_version says 10.7
[18:42:23] "lsb_release -r" also just says "10"
[18:42:41] cdanis: os.release (specifically os.release.full) I'm not sure why the difference, however it's likely due to trying to keep a similar interface across all OS's, but it could also be a bug
[18:43:13] yeah it's interesting since pre-buster it does report the point release
[18:43:37] anyway jbond I am going to merge the original version for now; I have a version that 'does it all' but I'd rather figure that out
[18:45:39] cdanis: sounds good to me, as to the other question I would say it's a bug. I just checked and facter4 on my laptop also gives the minor version for os.distro.release.full
[18:48:01] thanks!
[18:51:41] https://w.wiki/xG4
[18:52:05] o11y folks: any idea what is up with the values for the 'version' field in the table? in the graph they look correct
[18:54:30] and even more interestingly, if I hit the magnifying-glass-plus on one of those values in the table, it knows the correct one; it just renders it wrongly?
[19:01:05] can someone tell me about network::constants in puppet?
[19:06:15] cdanis: you ended up sniping me :) https://github.com/puppetlabs/facter/commit/f043ef1ff96402cc3aea07d5f3927636712ac74d
[19:06:59] if it's useful we could look to backport that or try to upstream it
[19:08:36] (upstream in the debian buster package sense)
[19:09:45] cdanis: that's weird. the table value looks like a rounded number, but it should be a string...
[19:09:53] yeah, it's very strange
[19:10:13] rounded to exactly two significant digits? because 10.7 rounds to 11 heh
[19:10:16] chaomodus: I'm not the best person to ask but I can try and help
[19:10:27] jbond42: oh hey nice thanks :)
[19:11:22] Something in the JS must be coercing those values. https://github.com/grafana/grafana/issues/29598
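Back on the facter thread above: since facter on buster reports only "10" for the release while /etc/debian_version has the point release, a fact-gathering script can fall back to the file. A small bash sketch building on the 18:28:39 one-liner (variable names are illustrative):

    # Hedged sketch: gather os.distro via facter as above, then patch in the
    # point release from /etc/debian_version when facter only reports the major.
    DISTRO_FACTS=$(facter -j os.distro | jq -e '."os.distro"')
    FULL=$(jq -r '.release.full' <<<"$DISTRO_FACTS")
    MAJOR=$(jq -r '.release.major' <<<"$DISTRO_FACTS")
    if [ "$FULL" = "$MAJOR" ] && [ -r /etc/debian_version ]; then
        FULL=$(cat /etc/debian_version)   # e.g. 10.7
    fi
    echo "release: $FULL"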
[19:11:53] jbond42: just sort of trying to probe out the implications; it can simplify some changes I'm making if I could modify this slightly
[19:12:32] aha thanks shdubsh, glad it's already reported upstream
[19:13:43] chaomodus: are you saying you want to modify the data in data/data.yaml? if so I think that would likely be a pain to do unless it's backwards compatible
[19:14:23] yah
[19:14:41] I suppose so
[19:14:55] if I could add keys to the management network list it'd save me some typing :)
[19:15:17] to the best of my knowledge (and add me to any change to look deeper) adding something to network::constants should be a noop. also creating a new function may be an option
[19:16:39] you can check the current function in network::parse_abuse_nets or some of the ones that jo.e wrote for service data under wmflib/functions/service/ to get an idea of data manipulation functions
[19:16:40] Hmm interesting
[19:18:07] that said I think ultimately we want to pull all of the data in the network constants module from netbox, either as a hiera shim (as we discussed before), a custom hiera backend, or an ENC script
[19:19:19] right
[19:19:29] it's funny that it's in a module and not in hiera, I didn't know you could do that.
[19:21:52] yes I'm not sure why this pattern is used, could be due to some early puppet limitation. the admin module also does similar. at some point I think we should remove the use of the load_yaml function and move data like this into the standard yaml structure (or in this case one of the other options :))
[19:23:47] right, generated from netbox or whatever
[19:24:07] I'll fiddle with it a bit and see if I can jam the data in and be a bit more DRY about my change :)
[19:24:23] thanks for the context
[19:25:10] no probs, if you hit a wall drop me a mail with the output structure you want and I can take a look tomorrow
[19:26:40] great thank you!
[19:27:06] np
[19:34:03] a while ago, I asked in this channel for recommendations on how to track uploads of packages to Debian testing and after some discussion, we decided that tracking https://release.debian.org/testing-watch/ can be a good way to know if a newer version of a package is available in testing
[19:34:33] as an update to that, I found tracking for changes ("Watch") from https://salsa.debian.org/debian is an easier and cleaner way
[19:35:31] the downside is that you get notified of every change (and not just tagged versions) but you can filter that out
[19:38:54] the notification settings are defined under your profile, such as https://salsa.debian.org/-/profile/notifications. so you can set a global setting of "Watch" and then just follow a project
[19:39:02] anyway, sharing this in case someone else finds this useful :)
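Relatedly, on the 19:18:07 idea of generating network::constants data from Netbox instead of hand-maintained YAML: the standard Netbox REST API can already export prefix lists such as the management networks. A hedged sketch; the hostname and role slug are illustrative assumptions, while /api/ipam/prefixes/ and token auth are standard Netbox:

    # Hedged sketch: pull management prefixes from the Netbox REST API as a
    # starting point for a hiera shim or ENC script. Host and role slug are
    # assumptions.
    curl -s -H "Authorization: Token ${NETBOX_TOKEN}" \
        'https://netbox.example.org/api/ipam/prefixes/?role=management&limit=0' \
        | jq -r '.results[].prefix'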