[05:20:33] cdanis: really really great, thanks a lot! [05:22:57] bblack: the main issue with those bursts is mcrouter, since we have a tight timeout (1s) to memcached. Currently each mcrouter needs 10x1s timeouts to flag a memcached shard as TKO (not able to serve traffic, so causing errors, until the health check probes pass again) [05:25:39] a burst causing saturation doesn't always lead to a TKO, but this graph tells a lot [05:25:42] https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=6&fullscreen&orgId=1&from=now-12h&to=now [05:30:27] cdanis: already added two graphs to https://grafana.wikimedia.org/d/000000316/memcache?orgId=1, \o/ [06:38:53] * elukey errand for a bit [10:52:30] fyi i have just added schema validation for the admin module data.yaml file to CI (https://gerrit.wikimedia.org/r/c/operations/puppet/+/589021) please ping me if you see any issues [10:56:17] \o/ thanks! [12:49:54] that is where the cool kids are hiding :] [12:49:55] hello [12:51:30] pbuilder USENETWORK=yes caused me some issues previously [12:51:39] well, setting any variable does cause issues at some point [12:51:48] but it seems adding them to the .pbuilderrc works [12:51:56] hi hashar o/ [12:52:01] and I think sudo might sometimes strip some environment variables [12:52:19] jayme: hi and welcome :] [12:52:51] I think I've tried that (at least on deneb) and it did not work. Or do you mean /root/.pbuilderrc - wanted to try that next [12:52:54] Thanks [12:53:58] on my local machine I have it in my ~/.pbuilderrc [12:54:00] Oh [12:54:26] but that really depends on how you build the package I guess [12:54:42] the build gets started as regular user [12:54:56] I've misread the pbuilder message on deneb I guess. It reads "W: pbuilder: network will not be disabled during build!"
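For reference, the mcrouter TKO behavior described at the top of this log maps to upstream mcrouter command-line options; a hedged sketch (flag names from upstream mcrouter, values illustrative, not the production invocation):

```
# 10 consecutive timeouts mark a shard TKO; health probes then
# gate it back into the pool:
mcrouter --config file:/etc/mcrouter/config.json \
         --timeouts-until-tko 10 \
         --probe-timeout-initial 3000
```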
when everything is *fine* [12:55:05] Ah [12:55:13] and supposedly it is supposed to show that message in yellow [12:55:34] the idea is that for Debian, packages should not rely on internet access at all [12:55:43] so their build boxes somehow disable network via black magic [12:55:47] And as deneb does not have internet access the build fails anyways. So I did not try .pbuilderrc on builder01 [12:56:18] ah [12:56:56] also on CI apparently there is no need for USENETWORK for those packages. But maybe I should verify [12:57:44] So. To be precise: With "echo USENETWORK=yes >> ~/.pbuilderrc" on builder01 network access is *not disabled* aka enabled correctly :) [12:58:07] \o/ [12:58:08] ...and the build fails with segfault instead [12:58:32] Oo, [12:58:36] such a pity [12:58:43] progress! [12:58:57] the thing is that apparently we should not rely on network access [12:59:23] we do for a few exceptions (like what Janis is trying to build) [12:59:37] jbond42: i feel this schema validation discriminates against my ability to type random stuff into the wrong place in admin/data ;) [12:59:52] jayme: looking at the new build log [13:00:04] at least operations/debs/helmfile seems to have the dependencies fetched and committed under ./vendor/ [13:01:11] moritzm: it's something related to glide trying to replace stuff in vendor/ I guess [13:03:32] * moritzm snickers at naming the repo https://github.com/Masterminds [13:05:35] moritzm: looks similar to https://github.com/Masterminds/glide/issues/558 which is fixed in glide 0.13.3 (we're at 0.13.1) [13:07:44] note Alexandros has sent patches for helmfile ( https://gerrit.wikimedia.org/r/#/c/operations/debs/helmfile/+/588418/ ) and https://gerrit.wikimedia.org/r/#/c/operations/debs/helmfile/+/588415/ [13:07:49] but I guess that is what you are trying to build?
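The `.pbuilderrc` approach that worked on builder01 amounts to a one-line config; a sketch (pbuilder sources this file as shell, so a plain assignment is enough):

```
# ~/.pbuilderrc — enable network access inside the pbuilder chroot.
# With this set, pbuilder warns "network will not be disabled during build!",
# which is the *success* case discussed above.
USENETWORK=yes
```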
[13:09:56] We just tried to rebuild the existing source package in the first place [13:10:09] that sounds quite likely, let's update to 0.13.3 [13:13:04] kormat: lol [13:16:17] moritzm: How do we do that? By creating an updated +wmf version? [13:16:45] jbond42: if there are similar areas that could do with validation, but you haven't been able to find an excuse to work on it, let me know, and i'll provide the excuse ;) [13:18:11] lol thanks ill keep that in mind ;) [13:20:07] elukey, do you have time to give https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/588752/ another look? [13:20:15] jayme: this is a WMF-internal package, so we don't need to explicitly denote that it's a WMF build, the repo is at operations/debs/helm, I think we can simply bump the version to -2 [13:22:46] moritzm: don't know if I got you right, but glide > 0.13.1 is not available currently [13:23:56] kormat: jbond42: for yaml schema validation, the CI system "zuul" relies on a python module named "voluptuous" https://github.com/alecthomas/voluptuous . That lets one define a schema, load the yaml and assert the resulting data structure matches [13:24:22] kormat: jbond42: or really, just port that to XML with a DTD to validate it [13:26:57] moritzm: but we can remove the build-dependency to glide probably (and have that loaded on demand) [13:27:25] ...as we're relying on network access anyways [13:30:56] sure, give it a shot, I don't know the history of how that package was created [13:31:48] works. I'll submit a patch then [13:33:49] is that for helmfile ? [13:33:59] hashar: no, for helm [13:34:01] oh no for the other one [13:34:22] helmfile has all the vendor packages added in a single large patch file under debian/patches/added-vendor [13:34:33] `helm`, I haven't looked into it [13:37:11] there is a backup freshness warning alert- this is a mistake, should be corrected with a WIP patch [13:44:49] moritzm: the helm git repo does not contain a .gitreview file.
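The schema-validation idea hashar describes (define a schema, load the YAML, assert the data structure matches) can be sketched without any third-party dependency; this is a stdlib-only illustration of the voluptuous-style approach, and the admin-data keys below are illustrative, not the real data.yaml layout:

```python
# Minimal recursive schema checker: dicts describe required keys,
# a one-element list describes the element schema, a type checks isinstance.
def validate(schema, data, path="data"):
    """Raise ValueError at the first place `data` deviates from `schema`."""
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            raise ValueError(f"{path}: expected a mapping")
        for key, sub in schema.items():
            if key not in data:
                raise ValueError(f"{path}: missing required key {key!r}")
            validate(sub, data[key], f"{path}.{key}")
    elif isinstance(schema, list):
        if not isinstance(data, list):
            raise ValueError(f"{path}: expected a list")
        for i, item in enumerate(data):
            validate(schema[0], item, f"{path}[{i}]")
    elif isinstance(schema, type) and not isinstance(data, schema):
        raise ValueError(f"{path}: expected {schema.__name__}")

# Hypothetical admin-group shape, checked against conforming data:
schema = {"groups": {"ops": {"gid": int, "members": [str]}}}
validate(schema, {"groups": {"ops": {"gid": 500, "members": ["alice"]}}})  # passes
```

Real voluptuous adds nicer error aggregation and `Required`/`Optional` markers, but the validation model is the same.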
Is it okay to create the gerrit remote manually or does that need anything on gerrit-side first? [13:45:29] maybe akosiaris ^^ [13:45:37] So weird to read jynus and then jayme XDD [13:45:45] yes [13:45:46] It is like: is jaime talking to himself? [13:46:24] jayme: git remote add gerrit etc etc. And we can probably put a .gitreview file there [13:46:34] nothing needs to be done on the gerrit-side first [13:46:42] akosiaris: okay [13:50:36] most repos ship a .gitreview file, you can also commit one to the repo, see https://github.com/wikimedia/debmonitor/blob/master/.gitreview for an example [13:52:07] just FYI the equivalent link in our gerrit is: [13:52:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/debmonitor/+/master/.gitreview [13:52:29] anyone have any idea why the hieradata added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/589277 at hieradata/common/profile/prometheus/nic_saturation_exporter.yaml doesn't seem to have any effect? [13:53:16] have you tried hiera_lookup to check if it's used? [13:55:02] cdanis: why not role? [13:55:05] role prometheus [13:55:30] no, I haven't, as AFAIK the needed version of ruby is not installable on a modern machine [13:55:32] ./role/common/prometheus.yaml [13:56:45] I don't think that's where I want to specify anything? this is for an exporter that will eventually go on every physical machine [13:59:21] sorry I had read too quickly [13:59:29] you want to set a default for everyone that is not overridden [13:59:55] yes, so I can roll out the exporter to a few clusters/roles at a time [14:00:01] instead of doing the whole fleet at once [14:00:28] jbond42 had put it in the hieradata for the profile itself, which looks reasonable, but doesn't seem to work [14:02:36] the patch is not merged, so why do you say it doesn't work?
[14:03:16] https://puppet-compiler.wmflabs.org/compiler1002/21985/ [14:03:36] pcc shows the exporter being installed on hosts where it shouldn't be [14:03:58] k [14:08:47] andrewbogott: I was out getting groceries, checking in a bit [14:08:56] elukey: thanks! [14:17:31] backup freshness alert fixed [14:18:25] lots of cp alerts in soft state however [14:19:01] probably just ema's maintenance [14:19:23] cdanis: ./utils/hiera_lookup --fqdn=bast3004.wikimedia.org --site=esams --roles=bastionhost::pop profile::prometheus::nic_saturation_exporter::ensure |grep -v puppet [14:19:23] absent [14:19:30] looks like it's working locally [14:19:36] so is PCC bugged? [14:20:00] jynus: yes, rolling restarts in progress [14:20:25] np, lots of things in red saying "varnish" is scary :-D [14:20:37] heya godog yt? [14:20:46] or whoever is working with logstash and kafka these days? [14:21:38] ottomata: at this time of day, herron is a good bet too [14:21:57] ottomata: sure (also what cdanis said) [14:22:11] hey ottomata I was just looking at it [14:22:38] ok [14:22:40] so [14:22:43] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/logstash/collector.pp#250 [14:22:56] we currently import the eventlogging_EventError topic into logstash [14:23:01] we are moving away from eventlogging stuff [14:23:27] there are eventgate topics, e.g. eqiad.eventgate-main.error.validation [14:23:45] these are in any kafka cluster that an eventgate instance produces to [14:23:50] which, i think is all of them now [14:24:06] cdanis: wait, PCC is just telling you the resources are in the catalog, not that they are with an attribute of ensure => absent [14:24:07] kafka jumbo is produced to by 2 different eventgate instances [14:24:15] which is what that change does [14:24:25] anyway, i want to import these into logstash too [14:24:33] would it be best to mirror them to the kafka logging cluster?
[14:24:47] or, just configure logstash collectors like the one I linked to to consume them? [14:24:48] akosiaris: ah, of course, I'm holding it wrong [14:25:18] akosiaris: you're right, looking at https://puppet-compiler.wmflabs.org/compiler1002/21985/bast3004.wikimedia.org/change.bast3004.wikimedia.org.pson they are present but absented [14:25:22] hmm [14:25:26] :) [14:26:15] mirroring them to kafka logging would mean setting up new mirror maker instances [14:26:20] probably less complicated to consume them where they already are [14:26:22] so maybe we should just consume? [14:26:23] yeah [14:26:30] ok i'll submit a patch [14:26:37] ok sounds good [14:26:42] should I do the same for collector7.pp? [14:27:01] yes, both collector and collector7 [14:27:02] +1 to consume from where the topics are already [14:28:40] ottomata: while we are talking about kafka, have you ever run into a case where log cleanup doesn't seem to be happening? troubleshooting this in kafka-logging right now T250133 [14:28:41] T250133: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 [14:30:33] huh no i've never seen that [14:31:44] hmm [14:31:52] isn't log cleaner for compaction? [14:32:45] I checked the running config and log.cleanup.policy = [delete] in kafka-logging [14:33:03] akosiaris: do you know how prometheus::class_config will interact here with a class that is in the catalog, but ensure=>absent? [14:33:09] hm ok [14:34:05] hmm herron i wonder if that error message you found is normal [14:34:35] i think the log cleaner process might only run on compacted topics, for which the only one we use compaction in are the offsets topics [14:35:29] not sure though [14:36:01] hmmm but indeed you have very old log files!
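The "just consume them where they already are" plan would amount to another kafka input in the logstash collector config; a hedged sketch (logstash kafka input plugin options are real, but the broker host and topic pattern here are illustrative):

```
input {
  kafka {
    bootstrap_servers => "kafka-jumbo1001.eqiad.wmnet:9092"
    # match e.g. eqiad.eventgate-main.error.validation across instances:
    topics_pattern    => ".*\.eventgate-.*\.error\.validation"
    codec             => "json"
  }
}
```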
[14:36:55] cdanis: I see it uses query_resources(false, $query, true) from the puppetdbquery module and it looks like they will match, despite the ensure=> absent [14:37:19] that seems like a bug [14:38:26] for what it's worth I did already try manually setting the retention time on a few large topics like `kafka topics --alter --topic udp_localhost-info --config retention.ms=432000000` (5 days) [14:39:35] and also a rolling restart of the kafka-logging eqiad brokers [14:39:48] cdanis: or maybe does. It does take into account parameters if they are passed. https://github.com/wikimedia/puppet/blob/production/modules/prometheus/templates/puppetdb_query_string.erb#L14 [14:39:53] s/does/not/ [14:40:29] <_joe_> when that class was created, it had no way to select parameters [14:40:53] so I guess if you pass class_parameters => {ensure=>'present'} then it will not match the ones with "absent" [14:40:58] <_joe_> btw I would discourage having ensure => absent/present on classes in general, but that's unrelated [14:41:28] <_joe_> yes what akosiaris said is accurate [14:41:32] thanks [14:41:43] I wish there was a way to test prometheus::class_config's output in pcc [14:41:53] and _joe_ I'm just following in jbond's footsteps here [14:42:07] I certainly don't pretend to know what I'm doing ;) [14:42:34] <_joe_> cdanis: it's not very easy to properly test that class without re-running puppetdb-populate on the compiler [14:43:07] I know [14:43:09] <_joe_> cdanis: until we move the compiler to AWS and we re-provision a fresh puppetdb for every run, we won't be able to allow compiler runs to write to puppetdb [14:54:34] I think there's a good argument to have class_parameters => { 'ensure' => 'present' } as a default for prometheus::class_config, fwiw [15:17:47] ahahah sigh [15:18:15] https://puppetboard.wikimedia.org/report/prometheus1003.eqiad.wmnet/f35c91037dd279634f5edf1366d01e117547dd7f [15:18:17] sigh.
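The fix akosiaris suggests — passing class_parameters so absented hosts stop matching the puppetdb query — would look roughly like this in the profile (a sketch; `class_parameters` comes from the discussion above, but the other resource parameters and values here are illustrative, not the production manifest):

```
prometheus::class_config { "nic_saturation_exporter_${::site}":
    dest             => "${targets_path}/nic_saturation_${::site}.yaml",
    class_name       => 'profile::prometheus::nic_saturation_exporter',
    # Only match hosts where the exporter is actually enabled, so
    # ensure => absent hosts don't end up in the scrape targets:
    class_parameters => { 'ensure' => 'present' },
    port             => 9710,
}
```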
[15:24:12] okay, reran puppet on all the memcacheds and then on the proms, now prometheus is scraping them again [15:26:22] hashar: I think the current helm package cannot be built by CI in general (because of it needing network) [15:30:56] ok herron elukey godog the reason you have old log files [15:31:11] is because you have log.message.timestamp.type=CreateTime [15:31:15] (which is usually good) [15:31:15] but [15:31:25] something is producing messages with timestamps in the future [15:32:09] I knew ottomata would have found the issue [15:32:16] TIL [15:32:39] so logs are pruned based on the message timestamps [15:32:42] not their mtime [15:32:55] timestamp.type=CreateTime [15:33:06] otherwise timestamp based consumption would break [15:33:25] but, to have the timestamp set to something in the future [15:33:31] the producer would have to set it when producing the message explicitly [15:33:37] yes yes [15:33:37] does the rsyslog producer do this? [15:33:52] IIUC yes it is rsyslog pushing those logs [15:33:52] is there some machine that has its time set in the future? [15:34:09] ottomata: very interesting! could you point me to an example? [15:35:42] * ottomata https://gist.github.com/ottomata/ba6564fd616dfe2eb73681bab513c37d [15:38:15] <_joe_> ottomata: that request is in the past though [15:38:28] <_joe_> and given the wmf version, that seems about right [15:38:48] wait hm [15:39:02] timestamp 1609441600000 [15:39:14] Thursday, December 31, 2020 11:57:55 PM [15:39:14] Y [15:39:23] jayme: ohh yeah [15:39:28] <_joe_> "timestamp": "2020-01-01T01:33:09+00:00", [15:39:35] that is what kafka has in the index as the message timestamp [15:39:35] hm [15:39:36] <_joe_> in the actual message [15:39:42] yeah [15:39:53] kafka gets the timestamp from the producer in the produce call though [15:39:57] not the message [15:39:57] jayme: so actually we have the use case for a couple of other repositories.
That can be done by making CI inject PBUILDER_USENETWORK in the build environment which will then have pbuilder configured with USENETWORK=yes ;] [15:39:58] hm [15:40:10] i think if no timestamp is set in the produce request, it uses broker server time [15:40:12] but hm [15:41:11] hashar: okay. How do I do that? [15:41:19] herron: where is the code that produces the rsyslog message to kafka? [15:42:38] ottomata: for the udp_localhost-* mediawiki topics it will be /etc/rsyslog.d/50-udp-localhost-compat.conf on mw systems, along with the lookup table [15:43:49] jayme: the CI system configuration is in https://gerrit.wikimedia.org/r/integration/config.git . In that repository there is a python script which is run for each build and that lets us inject magic things to the environment: zuul/parameter_functions.py [15:44:50] jayme: there is some already existing code to inject USENETWORK to a couple of repos. So a copy-paste will handle it. We can pair it over a google meet if you want :] [15:44:53] hm wait ok, that message i printed is not the one with that timestamp [15:44:56] let me try to find it [15:45:26] looking at one example `timestamp: 1609441600000 offset: 4911836348`, this is the year 52971? am I crazy?
[15:48:01] herron: those are in ms [15:49:06] hashar: I would suggest skipping that for now, build the package manually on builder01 (internet access) to unblock the contint* migration refactor the helm package afterwards (including "vendor/") to not need internet access anymore [15:49:29] s/refactor/and refactor/ [15:49:40] ok herron we have an older version of kafkacat so it is harder to use it to query based on timestamps [15:49:54] but on deploy1001 in my home i have a newer kafkacat binary [15:49:55] so [15:50:07] ~otto/kafkacat -Q -b logstash1010.eqiad.wmnet:9092 -t udp_localhost-info:2:1609459075000 [15:50:08] udp_localhost-info [2] offset 4913034082 [15:50:09] then [15:50:18] print the timestamp at that offset [15:50:25] ./kafkacat -C -b logstash1010.eqiad.wmnet:9092 -t udp_localhost-info -o 4913034082 -c 1 -f "%o %T" [15:50:25] 4913034082 1577861397000 [15:50:38] WAIT what [15:50:59] .... [15:51:04] i get different results [15:51:10] different times I run that [15:51:10] jayme: yeah that is the path that has been taken by the helmfile repo: have all the dependencies committed in a vendor directory and a super large debian/patches/vendor-added file.
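The "year 52971" scare above comes from reading an epoch-milliseconds value as seconds; a quick sanity check in Python (the timestamp is the one from the log):

```python
from datetime import datetime, timezone

# Kafka message timestamps are epoch *milliseconds* — divide by 1000 first.
ts_ms = 1609441600000
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # 2020-12-31T19:06:40+00:00
```

So the message is stamped late December 2020 — months in the future relative to this April 2020 log, which is exactly what keeps time-based retention from ever firing.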
[15:52:14] jayme: I have made some comment about it on https://gerrit.wikimedia.org/r/#/c/operations/debs/helmfile/+/588415/ [15:52:16] OH [15:52:18] hm no [15:52:28] oh yes [15:52:35] ok phew that is because i was running without specifying the partition [15:52:35] dh [15:52:36] ok [15:52:37] -p 2 [15:52:43] ./kafkacat -C -b logstash1010.eqiad.wmnet:9092 -t udp_localhost-info -p 2 -o 4913034082 -c 1 -f "%o %T" [15:52:47] 4913034082 1609459075000 [15:53:17] 1609459075000 [15:53:24] is exactly one year after that message's timestamp [15:53:31] OH [15:53:33] no it isn't [15:53:34] right [15:53:43] sorry because I was looking at the wrong message before [15:53:44] ok [15:54:50] https://www.irccloud.com/pastebin/G1DRR0kS/ [15:55:17] so the @timestamp field looks normal, dunno what is up with timestamp though [15:55:18] hah, wow [16:01:34] hashar: Cool. Will look into that [16:06:16] herron: how does the timestamp field get set in these messages? [16:07:17] am trying to understand https://github.com/rsyslog/rsyslog/blob/master/plugins/omkafka/omkafka.c [16:07:20] there is a rsyslog template that transforms cee formatted syslog to json before output to kafka in /etc/rsyslog.d/10-template-syslog-json.conf on the mw hosts [16:07:22] there is some kafka timestamp related stuff [16:07:28] property(name="timereported" outname="timestamp" format="jsonf" dateformat="rfc3339") [16:07:51] so in theory this is using the time within the syslog message received [16:08:09] and how is @timestamp added? [16:08:21] is that just added by logstash when it receives? [16:10:00] these udp_localhost topics logstash is only consuming from so no logstash shouldn't be doing that [16:11:15] hm right sorry yeah it is already in kafka at this point [16:11:24] something is adding @timestamp [16:11:33] i guess the logstash collector somehow?
but we do pass along any json included in the syslog message received like this: [16:12:38] https://www.irccloud.com/pastebin/85Z37xX6/ [16:13:48] I'm curious if changing timereported to timegenerated would address this, since @timestamp is presumably right [16:14:39] ok so timereported is from syslog? [16:14:44] address it going forward I mean, maybe along with a much shorter log.message.timestamp.difference.max.ms [16:14:49] and the property(name="timereported" outname="timestamp" .. is what is setting the timestamp field? [16:16:01] yes [16:16:30] timereported is the time reported by whatever sent the log via syslog, so trusting the app that produced the log [16:16:31] so what sets timereported for our message example? [16:16:41] trusting the app is usually the right thing to do [16:16:49] for syslog it might not matter i guess [16:16:57] since the event time should be super close to the receive time [16:17:09] so this is from mediawiki [16:18:09] yeah, from mediawiki, which is outputting cee formatted syslog messages with a json payload [16:19:22] heh herron just read the actual message in this log [16:19:23] \"unclear-\\\"now\\\"-timestamp\ [16:19:25] hah [16:21:53] anyone want to stamp https://gerrit.wikimedia.org/r/c/operations/puppet/+/589350 ? no impact expected, working fine on memcacheds, just a script that runs a loop at 1hz [16:22:02] heh that's beautiful [16:22:29] so what if we set a max size for these problem topics to address the growing disk utilization problem, and change from timereported to timegenerated to keep it from continuing? [16:24:11] i dunno what is causing these, but i think if we manually delete the old offending log segment files... things will be ok [16:24:26] cdanis: lgtm [16:24:31] but might be more dangerous that way [16:25:26] ok, yea I'd love to persuade kafka log cleanup to do the right thing if possible [16:25:50] do you think log.message.timestamp.difference.max.ms is worth messing with as well to prevent this?
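The timereported → timegenerated change being weighed above is a one-property swap in the rsyslog template; a hedged sketch (not the production 10-template-syslog-json.conf, just the relevant property line in context):

```
template(name="syslog_json_ts" type="list") {
    constant(value="{")
    # Use the time the relay generated the record, instead of the
    # app-reported (and, as seen here, possibly bogus) timereported:
    property(name="timegenerated" outname="timestamp" format="jsonf" dateformat="rfc3339")
    constant(value="}")
}
```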
[16:27:28] rzl: ty! [16:27:55] hm yeah i wonder what happens if the difference is negative though, e.g. the event timestamp is in the future [16:28:03] i think that is usually used to reject timestamps that are too old [16:30:09] herron: i think setting retention.bytes is a good idea [16:30:22] even just setting it huge [16:30:30] you've got data in one partition going back to jan [16:30:34] and the partition is 700G [16:30:46] i think if you set it to 500G it will start pruning [16:30:56] ok, yeah kafka disk util is nearly the same as ES which is bananas [16:31:31] that sounds good, just one other idea before doing it... what about simply deleting the problem topics and letting them be re-created automatically? [16:32:07] https://medium.com/@jiangtaoliu/a-kafka-pitfall-when-to-set-log-message-timestamp-type-to-createtime-c17846813ca3 is relevant [16:32:18] herron that would work too [16:32:19] :) [16:32:32] might see some producer errors for a few seconds or minutes [16:32:49] might be dangerous with an active topic, not sure [16:32:58] IIRC kafka doesn't delete immediately, it will just mark a topic for deletion [16:33:07] i think [16:33:30] if a topic is in the middle of being deleted it might reject produce requests for that topic [16:33:32] not sure about that [16:34:02] ok, yea adjusting the max size sounds safer. I was thinking it may be transparent to delete if the consumers aren't lagging, but yea can keep that in the back pocket if the max size doesn't work out for some reason [16:34:38] ya, i will ask kafka mailing list about log.message.timestamp.difference.max.ms and timestamps in the future [16:34:40] dunno what happens there [16:34:52] awesome thx!
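The retention mechanics being debugged here can be boiled down to a small sketch: with log.message.timestamp.type=CreateTime, Kafka deletes a segment only once its *largest* message timestamp is older than retention.ms, so a single future-stamped message pins its segment (and everything before it) indefinitely. A toy model of that decision, under assumed simplified semantics:

```python
def segment_expired(largest_ts_ms, retention_ms, now_ms):
    """A segment is eligible for time-based deletion only when its
    newest (largest) message timestamp has aged past retention_ms."""
    return now_ms - largest_ts_ms > retention_ms

now = 1_587_000_000_000          # roughly April 2020, epoch milliseconds
day = 86_400_000
retention = 5 * day              # the retention.ms=432000000 set above

print(segment_expired(now - 30 * day, retention, now))   # True: normal old segment, pruned
print(segment_expired(now + 260 * day, retention, now))  # False: future timestamp, never pruned
```

This is why the size-based `retention.bytes` workaround succeeds where `retention.ms` could not: byte-based pruning ignores timestamps entirely.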
[16:42:50] herron: i think it won't hurt to set that anyway [16:42:58] we should probably set it on all kafka clusters to some sane value [16:43:10] for analytics instrumentation it might get weird [16:43:17] offline mobile apps might send events with a big delay [16:43:32] but still, a week or 2 would be pretty generous i think [16:43:49] maybe just setting that to the same as retention.ms is good [16:44:31] ok, sgtm, the default is so huge that even months should be an improvement [16:53:05] elukey: this should clear soon https://usercontent.irccloud-cdn.com/file/lM9Ni9Ky/Screen%20Shot%202020-04-16%20at%2012.51.57%20PM.png [17:02:07] oh nice! [17:02:11] what was the issue at the end herron ? [17:03:27] timestamps in the future that otto.mata found preventing age based cleanup from happening, and worked around it by setting a max size on the problem topics [17:03:37] really nice [17:04:08] still a couple open questions about how best to prevent long-term, but good to have the pressure taken off [17:13:06] yep yep! [17:14:16] herron: I checked "curl -s -k 'http://localhost:9200/_cluster/allocation/explain?pretty'" on logstash1008, the size-based problems are gone but there are other ones [17:14:28] some shards are still not allocated, so cluster yellow [17:14:48] like "the shard cannot be allocated to the same node on which a copy of the shard already exists" [17:14:51] etc.. [17:16:32] (afk :) [17:21:11] elukey thx, keeping an eye on too, looks to be progressing still in that _cluster/health active_shards_percent_as_number is slowly increasing [17:21:19] later! [18:17:43] herron: do you mind if i take a pass at refactoring some of the logstash::input::kafka stuff? [18:17:57] i want to standardize the ssl_truststore_location based on kafka cluster name [18:18:03] so I can DRY some off it up [18:18:04] of* [18:18:05] sure!
have at it [20:37:24] herron: [20:37:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/589400 [20:54:23] ottomata: nice, will take a closer look tomorrow [21:05:12] ... prometheus in both codfw and eqiad is scraping the nic_saturation_hot_seconds_created / nic_saturation_warm_seconds_created that are timestamps-of-first-appearance automatically exported by the client library when I set values for nic_saturation_hot_seconds/nic_saturation_warm_seconds, but is not scraping (or at least, not serving queries for) the actual [21:05:14] nic_saturation_hot_seconds/nic_saturation_warm_seconds metrics [21:05:25] for the LVS hosts I enabled the exporter on [21:10:14] hm, maybe this is some weird problem with grafana/explore, actually [21:28:27] no, I can't read, it's that since buster, python3-prometheus-client automatically adds a `_total` suffix to the names of all counters
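The `_total` suffixing in that last line explains the missing metrics: OpenMetrics-aligned versions of python3-prometheus-client expose a Counter named `x` as `x_total` (plus the `x_created` first-seen series noticed earlier), so queries for the bare name return nothing. A stdlib-only toy of just that naming rule, for illustration (the real logic lives in prometheus_client's exposition code):

```python
import time

def expose_counter(name, value, created=None):
    """Mimic OpenMetrics counter exposition: the sample for counter `x`
    is named `x_total`, alongside an `x_created` creation timestamp."""
    if created is None:
        created = time.time()
    return [f"{name}_total {value}", f"{name}_created {created}"]

lines = expose_counter("nic_saturation_hot_seconds", 3.0, created=1587000000.0)
print(lines[0])  # nic_saturation_hot_seconds_total 3.0
```

So a dashboard querying `nic_saturation_hot_seconds` finds only the `_created` series unless the query is updated to `nic_saturation_hot_seconds_total`.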