[06:05:46] good morning
[06:05:54] very interesting: https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-24h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=rsyslog-notice
[06:06:13] there was a big bump in rsyslog-notice messages to kafka logging at around 18 UTC yesterday
[06:06:24] and the consumers are still recovering
[06:11:12] anyway, as an FYI I am going to roll restart the memcached gutter pool for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/605617/ (new settings)
[07:19:50] godog: i've been trying to follow your pontoon instructions, but puppet is refusing to play ball: https://phabricator.wikimedia.org/P11527
[07:24:14] ahh, found a solution (added as comment to the paste)
[08:36:56] kormat: oh good catch! I'll update the branch
[08:49:28] volans: can cookbooks call other cookbooks?
[08:50:28] kormat: good question! It's something I've been thinking about for a bit, but haven't formalized yet.
[08:51:03] mainly for argument parsing, which is something taken care of by spicerack before calling the cookbook
[08:51:41] technically speaking you can import another cookbook and call its run, but I wanted to add something to spicerack for a more official way to do that
[08:51:59] we are starting to see that necessity now
[08:52:14] what's your use case?
[08:52:23] would be useful for shaping the feature ;)
[09:22:48] volans: i'm thinking about composability
[09:23:01] e.g.
if i had a cookbook to set up replication on a db host,
[09:23:25] could a cookbook to change a host to use a different master re-use that
[09:23:35] but probably the simplest thing to do is implement the actual functionality in a library,
[09:23:44] and have cookbooks be thin wrappers around the library
[09:24:04] that was the initial idea yes, we also have a feature, not much used, that is
[09:24:46] if you run cookbook within a directory, you get an interactive menu and you can run the cookbooks within the directory one by one, and by default they share the same parameters (but these can be overridden)
[09:24:54] that's what we use for the DC switchover
[09:25:08] that's why they are named with numbers
[09:25:53] that makes sense for larger operations that have some logical split and might need some manual verification/approval between steps
[09:26:32] but the more we automate stuff, the more it could be concentrated into fewer steps
[09:27:22] the other thing i've realised is that a lot of the db automation stuff is common between what will run via cookbooks and what will run via other environments
[09:27:30] (e.g. directly on the hosts themselves, or from the backup hosts)
[09:27:36] The cross-calling case is more for unrelated stuff, like a cookbook to add new devices to netbox that then wants to call the sre.dns.netbox cookbook to deploy the DNS changes
[09:27:48] so it probably makes sense to keep the spicerack mysql code minimal, and use it to provide a connection object to a common library
[09:29:09] that's totally ok, it's what we do with cumin/conftool/etc...
they are standalone libraries/tools that are wrapped by spicerack for use in the cookbooks but also have their own usage in other places
[09:29:16] +1
[09:29:22] in particular for things that don't require SSH like mysql native connections
[12:31:40] godog: progress - puppet in my VPS setup now breaks in the exact same way as it does in production \o/
[12:34:15] kormat: then you're done, a broken puppet is the best you can get from puppet :-P
[12:34:21] kormat: woot woot!
[12:34:25] volans: hehe
[13:32:53] cdanis: nice dbctl diffs on cumin2001!
[13:33:05] oh?
[13:33:30] https://phabricator.wikimedia.org/P11536
[13:33:34] I thought you did something XD
[13:33:37] Maybe the buster upgrade?
[13:33:55] ah yeah, looks like buster upgrade also means we get icdiff as a library :D
[13:34:10] it still records unified diffs in the SAL
[13:34:19] yeah
[13:41:14] what's up with mw codfw hosts?
[13:43:36] rebooted for kernel upgrades i believe
[13:43:37] "SSL connection failed with error code : 5 : Success
[13:43:41] downtime expired
[13:43:41] great error message
[13:44:07] lol
[13:53:42] jbond42: hello, yt?
[13:53:51] q, in https://gerrit.wikimedia.org/r/c/operations/puppet/+/602646/5/modules/confluent/files/kafka/kafka.sh
[13:53:56] why the change from $@ to $*
[13:53:57] ?
[13:54:02] I think it broke the script
[13:55:26] ottomata: this was me, it looks like, doing shellcheck updates
[13:55:59] not familiar with the script but assuming $@ has additional arguments, as such $@ is probably better, I'll send a CR now
[13:56:05] ok thank you
[13:56:19] yeah it is adding the args as is, i think $* will provide the args as a single string
[13:56:26] yes
[13:56:38] i'm getting describe --topic wdqs_streaming_updater_test is not a recognized option
[13:56:46] i think it is passing all of those as a single shell arg
[13:59:14] ema: I'm going to merge your patch
[13:59:16] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/605918
[13:59:22] andrewbogott: please do
[14:00:50] ottomata: that change is merged, please let me know if you still see an issue
[14:02:35] jbond42: what about the line above? I guess it doesn't matter as much there because it is an echo
[14:02:36] ?
[14:04:30] ottomata: yes exactly, it doesn't really matter, can change that as well if you want
[14:18:03] no i think its find jbond42
[14:18:05] fine*
[14:18:11] ack
[15:17:12] who's doing the thanos stuff?
[15:17:14] cronspam
[15:17:30] from thanos-fe2001, a lot of it
[15:17:30] godog was working on it AFAIK
[15:17:41] and observability in general
[15:18:10] ah ha
[15:19:45] yeah that'd be me, I'll stop the crons for now, the issue itself should be resolved "soon"
[15:19:55] ty
[15:43:14] andrewbogott: looks like there's a downtime from you on icinga1001 and all its services? FYI that's quite a big hammer and also silences unrelated alerts, including availability pages
[15:44:26] godog: that silences /all/ alerts, not just those pertaining to that host specifically?
[15:44:50] (in any case, the alerty thing I was doing is finished so I can cancel the downtime)
[15:44:58] andrewbogott: there are a lot of service-level alerts that have a host of the icinga hostname, because of implementation details
[15:45:14] details, got it :)
[15:46:02] I de-downtimed
[15:46:03] andrewbogott: not all alerts no, a bunch though because of details as cdanis mentioned, i.e. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=icinga1001&scroll=6032
[15:46:07] thanks!
[15:46:50] all check_prometheus alerts
[15:52:14] hmmmm my new test still isn't in the icinga web UI
[15:52:36] https://www.irccloud.com/pastebin/cg5tmroW/
[15:52:46] shouldn't I see that on the cloudcontrol2001-dev icinga dash?
[15:57:35] yeah you should, unless the icinga configuration is invalid and isn't getting reloaded
[15:58:29] config is valid, or was last I cheked
[15:58:32] *checked
[15:58:45] andrewbogott: check the PENDING column in icinga web UI
[15:59:37] I see 0 pending
[15:59:48] andrewbogott: https://phabricator.wikimedia.org/P11544
[15:59:56] I don't see it reloading since this
[16:00:28] dang
[16:00:31] ok, will fix up in a bit
[16:04:11] godog: still seeing cronspam
[16:09:55] apergos: gah, ok should be better now (#2)
[16:11:29] godog, cdanis, I need to step away (pushing a friend around a hospital in a wheel chair), if you need to take desperate measures to fix the icinga config go ahead, otherwise I'll try to revisit as soon as I get a minute
[16:16:55] andrewbogott: ack
[16:26:00] ok, I am briefly back
[16:27:25] godog: so that log is different from "/usr/sbin/icinga -v /etc/icinga/icinga.cfg" ?
[16:28:10] (because that's what I'm using, and it returns 0 errors)
[16:28:48] andrewbogott: it does look like the output of icinga -v to me yeah (Chris pasted it not me)
[16:28:56] ah, sorry
[16:29:02] so in that case I think that's an old error that I've already fixed
[16:29:21] which still leaves me the mystery of why my checks aren't showing up but that's much less urgent :)
[16:29:41] * andrewbogott tries service icinga reload just to be sure
[16:29:47] yeah, seems happy
[16:30:45] andrewbogott: it's in pending now. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=1
[16:31:33] so it is!
[16:31:47] ok, so since I don't think I actually left everything broken I'm going to disappear again
[16:31:50] thanks all
[16:32:03] ping me if I turn out to be wrong re: everything broken
[17:52:27] andrewbogott: so what happened was, Ib9255c30 was submitted, which referenced the at-the-time nonexistent `check_galera`. since it changed the check_command of a monitoring::service class, that triggered an icinga reload, which is where the error message I posted came from. that error message was the last time puppet (or anyone) attempted to reload icinga's config on icinga1001 at the time of posting
[17:53:16] andrewbogott: this is because the followup patch I5c9c6fc1b only edits the definition of the nagios_common::check_command, which does _not_ know to trigger an icinga reload in puppet -- there's no dependency defined. I suspect this is because there are a bunch of NRPE-invoked check_commands that are installed and run on hosts without icinga?
[17:54:18] not sure how best to solve this
[17:54:39] also, I hope you and your friend are okay :)
[18:09:57] ah, no, I was wrong, nagios_common::check_command and nagios_common::check_command::config are only used on icinga hosts
[18:10:04] they should definitely notify the service then
[18:17:49] anyone have a minute for a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/605983 ?
[18:24:15] cdanis: stamped [18:24:19] ty
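
Note on the $@ vs $* exchange (13:53 to 14:18): the failure described there, where "describe --topic wdqs_streaming_updater_test" arrives as one unrecognized option, matches what happens when a wrapper forwards its arguments with a quoted "$*" instead of a quoted "$@". The sketch below is only an illustration, not the actual kafka.sh: count_args, forward_with_at and forward_with_star are hypothetical helpers, and it assumes the changed line quoted "$*".

```shell
#!/bin/sh
# Hypothetical helper (not part of kafka.sh): prints how many arguments it received.
count_args() {
    echo "$#"
}

# "$@" expands to one word per original argument, so the callee
# sees the arguments exactly as the wrapper received them.
forward_with_at() {
    count_args "$@"
}

# "$*" joins all arguments into a single word, separated by the
# first character of IFS (a space by default), so the callee sees one argument.
forward_with_star() {
    count_args "$*"
}

forward_with_at --describe --topic wdqs_streaming_updater_test    # prints 3
forward_with_star --describe --topic wdqs_streaming_updater_test  # prints 1
```

For an echo, as on the line above the changed one in the script, the difference is not visible, because the words end up printed with spaces between them either way.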