[09:22:36] following up here because -operations is noisy, the logstash elastic 5 cluster is currently in trouble, I'm investigating
[09:23:34] godog: ack, shout if you need help, should be considered an incident, need an IC, etc...
[09:25:12] volans: ok thanks! not sure yet
[09:25:21] here also, same
[09:33:00] yeah let's start an incident please, I don't see the cluster recovering
[09:35:16] so what we know so far: the logstash elasticsearch 5 cluster was in gc-trouble earlier, I roll-restarted the cluster hosts and am observing now, but e.g. the master is now logstash1011 and that's experiencing high gc activity too
[09:36:14] hello
[09:36:20] the cluster is recovering in the sense that unassigned shards are going down
[09:36:33] creating the doc
[09:36:46] "paged" in -operations
[09:37:24] thanks
[09:37:46] XioNoX: https://wikitech.wikimedia.org/wiki/Incident_response/Quick_reference
[09:38:30] godog: some ES experts might be useful?
[09:38:31] I also noticed the disk space on logstash101[0-2] going from 71% to 80% from 1 am to 8 am, not sure if related
[09:38:57] volans: dcausse or gehel come to mind, if they are around
[09:38:58] godog, volans https://docs.google.com/document/d/1qutKDBzEYsCVVE9ndHbCJFa9LaW_McNtuSjEiimRSXk/edit?usp=sharing
[09:39:14] * gehel has been summoned?
[09:39:18] thanks XioNoX !
[09:39:59] gehel: hey, yeah we're currently having some troubles with the elasticsearch cluster for logstash 5, namely high gc activity is the symptom
[09:40:17] not sure yet of the root cause and/or how to stop it though
[09:41:13] earlier in the week we increased the number of shards from 1 to 3 for new indices of that cluster, which I reverted now out of caution
[09:41:47] can we reduce the amount of messages sent to logstash?
[09:42:38] looks like there are a lot of shards initializing, that's going to put additional load on the cluster until it is fully recovered
[09:43:03] we can stop logstash until the cluster has recovered yes
[09:43:32] apart from high GC, do we have a measurable impact?
[09:44:08] high GC will increase latency, but if logstash still has enough throughput to consume all logs, that's not really an issue
[09:44:47] no, at the moment things seem ok in terms of impact, logstash is caught up with kafka
[09:45:28] looks like there are timeouts in some of the icinga checks, so I assume that we probably also have impact on the read use cases
[09:46:25] https://logstash.wikimedia.org isn't super fast for me but queries work eventually
[09:47:39] but yeah unassigned shards are going down somewhat slowly
[09:48:15] sorry for my dumb questions, but could the cause of the issue be in the logstash logs? did someone look?
[09:49:07] what are unassigned shards?
[09:49:22] unassigned shards are related to the previous restart of the cluster
[09:49:45] when a node goes down, shards that were on that node need to move around in the cluster
[09:50:22] so even "unused" shards are seeing issues?
[09:50:53] feel free to ignore my questions :) trying to understand what's going on but I don't want to slow the investigation
[09:50:57] unassigned shards are not really an issue, just a byproduct of how ES manages scaling
[09:51:27] XioNoX: I'm taking a look in the logstash logs in case
[09:51:40] gehel: didn't you solve that issue for the main ES cluster on rolling restart? (like stopping the re-assignment for the time of the restart)
[09:51:44] looks like some of the recoveries are for old indices. I would expect those to be recovered from disk since there should be no writes on them
[09:52:06] yes, we solved (or at least optimized) for the search cluster
[09:52:18] the context is fairly different for logstash, but the same principles could apply
[09:53:13] k, added a comment to the doc for potential action item, thx
[09:53:27] I do remember seeing log messages being sent to old indices, that might be something we want to investigate (low priority)
[09:53:53] is there a graph of messages rate? maybe something spiked before?
[09:53:58] can the mediawiki fatals be related in any way?
[09:54:13] * gehel needs to leave in a few minutes (meeting at Oscar's school)
[09:54:28] gehel: thanks for stopping by! appreciate it
[09:55:16] can't find logstash logs for earlier than 8:20 today atm
[09:56:03] https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMDYvMTkvLS10aHJlYWRkdW1wLjItLTktNDgtMjg7Oy0tdGhyZWFkZHVtcC40LS05LTQ4LTI4OzstLXRocmVhZGR1bXAuMy0tOS00OC0yODs7LS10aHJlYWRkdW1wLjEtLTktNDgtMjg=
[09:56:19] ^ quick analysis of a few thread dumps
[09:56:35] Let me know if there is anyone I can try to call/ping if needed
[09:57:22] nothing jumps out except that there are a lot of threads writing to disk (but that's expected)
[09:57:57] *nod* we'll wait for full recovery of all shards and see if things stabilise
[09:58:03] * gehel needs to run, maybe dcausse has a better idea
[09:59:03] godog: recovery from the restart?
[09:59:50] FYI dcausse is not in this chan
[10:00:16] XioNoX: correct
[10:00:55] volans: ack, we can invite him here from -discovery if needed
[10:01:17] if you need him I'll go call
[10:01:28] volans: maybe ping for now so he is aware?
[10:01:33] (if he is around)
[10:01:38] ok trying
[10:01:51] thx
[10:02:48] we're at 18 unassigned shards btw, I'm looking at https://grafana.wikimedia.org/d/oXH_v3rWk/logstash-cluster-health?orgId=1&refresh=5s "elasticsearch details" row
[10:02:53] o/
[10:02:54] is there anyone on the service ops team who can look at the mediawiki alerts? PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:03:12] hello dcausse, sorry for bothering, the logstash ES cluster is having some trouble and we got some help from ge.hel that had to leave
[10:03:19] summary of current status is in the incident doc: https://docs.google.com/document/d/1qutKDBzEYsCVVE9ndHbCJFa9LaW_McNtuSjEiimRSXk/edit#
[10:04:37] much better to have that on a Friday than on a weekend :)
[10:05:10] if you want, the backlog of this chan is at https://bots.wmflabs.org/browser/index.php?start=06%2F19%2F2020&end=06%2F19%2F2020&display=%23wikimedia-sre
[10:05:23] also of note, the high gc activity on logstash1010 started yesterday at ~16, whereas on logstash101[12] it started today at ~5, I'm looking at https://grafana.wikimedia.org/d/000000462/elasticsearch-memory
[10:07:23] volans: oops I resolved your action item by accident
[10:07:50] you can undo
[10:08:11] godog: is the purge of old indices functioning?
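(aside: a minimal sketch of the kind of check being run above, assuming the ES 5 HTTP API answers on port 9200 of a logstash host, as the _stats one-liner later in this log suggests)

    # Overall cluster status plus the shard counters discussed above.
    curl -s localhost:9200/_cluster/health | jq '{status, unassigned_shards, initializing_shards, relocating_shards}'

    # Which shards are not STARTED yet, and why they are unassigned.
    curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | awk 'NR==1 || $4 != "STARTED"'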
[10:08:52] dcausse: afaik yes, I'm checking now
[10:10:16] the increase in terms size since tonight 1am is worrisome
[10:11:10] looking at some of the exceptions, I see a bunch of "NamespaceInfo::isTalk called with non-integer (string) namespace 'arachni_name'" coming from ChangesListSpecialPage->buildQuery more or less, all dewiktionary, but they are from over 20 mins ago (kibana)
[10:11:18] I could check the raw logs
[10:11:36] apergos: is that the mediawiki exceptions?
[10:11:51] dcausse: yeah curator ran tonight at ~00:42 although it failed some actions, namely to set replicas
[10:12:03] yeah, that's the second most popular, the most popular I'm not sure is so helpful,
[10:12:12] dcausse: I'm looking at logstash1007:/var/log/elasticsearch/curator-logstash.log.1
[10:12:39] dcausse: so it could be logs coming in with more terms to be indexed than "usual" ?
[10:12:43] apergos: thanks, let me know if you think it could be linked to the logstash issue
[10:12:57] ErrorException from line 393 of /srv/mediawiki/php-1.35.0-wmf.37/includes/debug/MWDebug.php: PHP Warning: [data-update-failed]: A data update callback triggered an exception (Fail-safe exception. Avoiding infinite loop due to possibily undetectable existing records in master.
[10:12:58] godog: yes
[10:13:01] It may be due to encoding incompatibility between database values and values passed in $neededRecords parameter.) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate...
[10:13:07] that's the most popular
[10:13:14] I don't know that it's related at all
[10:13:23] * apergos goes to look at the raw logs now
[10:13:26] thx
[10:14:36] godog: it could be more terms/fields, if you switched to 3 shards it also increases term usage as they're no longer shared as they were when in the same index
[10:15:32] dcausse: *nod* thanks, yeah the shards change I reverted so tomorrow's indices are back to one shard
[10:15:50] is there a way to inspect which terms/fields grew ?
[10:17:18] godog: somehow https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
[10:17:29] i'll take a look
[10:17:49] I don't see anything too interesting coming in now, which makes sense as the exception rate has recovered
[10:17:57] in the meantime you could perhaps start by closing old indices
[10:18:02] not seeing the same increase in the elk7 cluster fwiw, https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?panelId=48&fullscreen&orgId=1&from=1591957073935&to=1592561873935&var-exported_cluster=production-elk7-eqiad&var-datasource=eqiad%20prometheus%2Fops&var-instance=logstash1020&var-top5_fielddata=All&var-top10_terms=All&var-top5_completion=All
[10:18:18] dcausse: ok! I'll start with the oldest
[10:18:47] let me know what actions you're taking, I can't follow your conversation :)
[10:19:37] XioNoX: haha ok! I'll !log too
[10:20:47] thx!
[10:25:05] almost done closing march indices
[10:26:37] rgr
[10:28:17] the most namespaceinfo messages I see in a minute is 50 (and that's not sustained), and the other ones are warnings anyways (the infinite loop ones), so dead end from me
[10:28:50] s/are/were/ don't see new ones
[10:29:00] ok, thanks!
[10:30:30] godog: you set replicas to 1 for which indices? one month old?
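(aside: a sketch of the "close the March indices" action above; closed indices stay on disk but drop their segment/terms memory from the heap. The *-YYYY.MM.DD daily naming is taken from the _stats one-liner later in the log, the rest is an assumption)

    # Close all March 2020 daily indices to relieve heap pressure.
    curl -s -XPOST 'localhost:9200/*-2020.03.*/_close'

    # Once the cluster is stable again they can be reopened slowly, a day at a time.
    curl -s -XPOST 'localhost:9200/*-2020.03.01/_open'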
[10:31:29] dcausse: yes, >31 days old get 1 replica, >50 days get zero replicas
[10:31:49] dcausse: march indices are all closed now, I see field memory did decrease a little of course, can do april too
[10:39:23] somewhat slowly but shards are recovering, 8 unassigned
[10:41:39] it seems to me things are on their way to recovery *touches wood*
[10:42:12] if I'm reading stats correctly jumping from 1 to 3 shards increased terms_memory_in_bytes from ~650M to ~850M per day (curl localhost:9200/*-2020.06.18/_stats | jq ._all.total.segments.terms_memory_in_bytes)
[10:42:25] cool!
[10:42:34] does not explain the spike we see at 1am tho :/
[10:45:05] indeed, I can't find anything relevant in SAL in case it was say a deploy that changed logging format
[10:46:04] could be a giant merge or multiple ones, will check
[10:47:21] dcausse: thanks! I'm tempted to close april's indices too, what do you think ?
[10:47:33] godog: fine by me
[10:47:49] we can reopen them slowly when things are under control
[10:48:01] sounds good
[10:55:34] godog: what are the kafka topics you read?
[10:57:12] dcausse: there's a bunch, I'll get the list
[10:58:06] I see multiple elastic metrics that increase at 1am so I wonder if it's also visible on the inputs
[10:58:22] https://phabricator.wikimedia.org/P11605
[10:58:54] quite possible yeah, the udp_localhost is also mediawiki logs, I bet that's likely it
[11:00:50] also I researched "give me all messages from timestamp x to y" earlier this week for an unrelated issue and couldn't find a way with our current version of kafka
[11:01:31] you can find the closest offset at timestamp x
[11:02:31] https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/603477/3/streaming-updater/scripts/set_initial_offsets.py is a small tool to set consumer group offsets based on timestamp
[11:03:01] godog: things look under control? Do you still need an IC? :)
[11:03:11] oh nice thank you dcausse
[11:03:25] XioNoX: I _think_ things are under control now yes, thanks for your help!
[11:03:34] great, thx!
[11:05:12] feel free to add data to the doc :)
[11:07:42] *nod* will do
[11:15:53] FWIW I'm basically monitoring things now, logstash 5 in eqiad is at 81% fs utilization so I'm thinking of ditching some indices altogether later
[11:18:10] please ignore the wikimedia-extmon alert
[11:18:50] I have ack'ed it
[11:19:00] no, I did :D
[11:19:05] :-P
[11:19:09] It shows ack'ed by me
[11:19:13] yep
[11:19:35] thanks
[11:19:49] in my app it shows ack'ed by me, not by volans :-/
[11:20:12] yep looks like you won ;)
[11:20:18] I see it by volans
[11:20:18] it changed
[11:20:21] was me before
[11:20:24] WUT?
[11:20:30] uh?
[11:20:40] when I hit it, it wasn't ack'ed
[11:20:45] race?
[11:20:50] eventual consistency?
[11:20:52] mondogb?
[11:20:58] yeah I see marostegui now
[11:20:59] *mongodb
[11:21:02] I guess to :)
[11:31:20] I'll check back after lunch on the cluster
[12:03:06] hey, this morning 20 mw nodes suddenly started to emit errors on wikibase that should be pretty rare (100/h right now while it's usually around 1/day)
[12:03:07] https://logstash.wikimedia.org/goto/ab3489acf83a5695d656dbcc87ae7393
[12:03:33] maybe opcache corruption again?
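(aside: the day-over-day comparison described above, generalized into a small loop around the same _stats | jq one-liner quoted in the log; the dates are only examples)

    # Compare terms memory across a few daily indices to size the 1-to-3 shard jump.
    for day in 2020.06.16 2020.06.17 2020.06.18; do
      printf '%s\t' "$day"
      curl -s "localhost:9200/*-$day/_stats" | jq ._all.total.segments.terms_memory_in_bytes
    done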
[12:04:10] started around 8:20 this morning
[12:09:02] ok the cluster is fully recovered now, I'm deleting some old indices to reclaim some space
[12:22:48] something is pushing way more docs than usual to the cluster (since june 19 ~3:40am) (cf. elasticsearch_indices_docs{cluster="logstash", exported_cluster="production-logstash-eqiad"})
[12:23:11] can't see anything obvious in kafka metrics
[12:36:37] dcausse: interesting, is that metric on a dashboard you are looking at ?
[12:36:46] godog: sadly no
[12:37:01] we have this: https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1
[12:37:12] but it's too cirrus specific to include logstash
[12:38:11] true, yeah I was (per)using that earlier, typing 'logstash' manually into the cluster field you get metrics
[12:38:17] mostly for the IPS
[13:02:12] godog: have you ever tried installing cumin in a pontoon environment?
[13:08:50] kormat: not yet no
[13:09:28] ok, i'll let you know how i get on :)
[13:11:44] sweet -- I'm curious to see how that works out
[13:15:08] ping if I can be of any help ofc
[13:16:04] godog: so far, it's going exactly as well as i expected ;) https://phabricator.wikimedia.org/P11606
[13:17:26] haha! yeah that's fair, I don't think you have set up puppetdb yet ?
[13:17:47] i didn't realise it wasn't included by having a puppet master
[13:18:00] ofc not
[13:18:15] you didn't pay the extra pain yet ;)
[13:18:23] well crap :)
[13:18:46] java and postgres
[13:19:19] if it can help cumin can also use different backends that are not puppetdb
[13:19:26] so depending on what you're trying to do
[13:19:34] might or might not be useful
[13:20:02] volans: i'd like to have some more fine-grained selection of db hosts for cumin, so i'm adding a profile with some specific parameters that i'd like to have cumin match on,
[13:20:12] and i was hoping i could test this in my private puppet kingdom
[13:21:07] then no, you need puppetdb unfortunately
[13:21:13] but
[13:21:19] not sure if worth the effort
[13:21:35] as (almost) any parameter can be queried
[13:21:57] if you show me the profile signature and the query you'd like to do I can tell you if it will work or not
[13:21:59] volans: hieradata can't be queried, which is what led me here :)
[13:22:21] alright, i'll give up on trying to test this and just add you to the CR when i create it
[13:22:29] indeed because hieradata is abstract outside the context of a catalog compilation
[13:22:31] kormat: sth like these variables should be enough to get puppetdb working in pontoon FWIW, plus of course assigning puppetmaster::puppetdb to a new host and init the pg database
[13:22:35] kormat: https://phabricator.wikimedia.org/P11607
[13:23:46] godog: hah, nice :) needing a third vm just to test this is way into overkill, i think.
[13:23:57] (for my current task)
[13:24:47] yeah fair enough
[13:26:43] kormat: tl;dr is, if there is a puppet class (including profiles) or define with a parameter, that's queryable
[13:27:21] with very few limitations
[13:29:03] *with some limitations
[13:29:14] like if you have a hash/list as parameter, etc...
[13:29:24] or custom types
[13:30:04] volans: can you match an unset parameter?
[13:30:28] undef?
[13:31:03] or check if it's using the default value or a passed value?
[13:31:10] `sudo cumin 'R:mariadb::monitor_readonly@title = undef'`
[13:31:12] something like that
[13:31:18] well. title is a bad example.
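(aside: hedged examples of the parameter matching being discussed; cumin's PuppetDB backend selects on class/define parameters with % and on fields such as title with @. The profile and parameter names below are made up for illustration, so check the cumin grammar docs before relying on them)

    # Select hosts whose (hypothetical) profile declares shard = s1 and run a command there.
    sudo cumin 'R:profile::mariadb::core%shard = s1' 'uptime'

    # Regex match on the parameter value instead of trying to match undef directly.
    sudo cumin 'R:profile::mariadb::core%shard ~ "^s[0-9]$"' 'uptime'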
[13:31:27] imagine that's `...%param1 = undef`
[13:31:49] IIRC no because there was an issue on passing undef via the json api of puppetdb
[13:31:56] hah, ok
[13:32:03] need to recheck the patch that chaom.odus did a year ago
[13:32:12] that's included in cumin 4.0.0rc1 on cumin2001
[13:32:40] but you can do regex
[13:36:40] dcausse: still a mystery to me what the offending terms were or what led to terms memory increasing, though things seem stable and I'm tempted to stop curator on the elastic 5 cluster for the weekend in case it can trigger similar behaviours
[13:41:51] godog: going back through the earlier discussion, your restart did trigger a lot of shard initialization
[13:42:12] we can probably help on that side to have a more streamlined (and automated) restart process
[13:42:38] our current cookbooks are specific to search, but some of the ideas could be applied to logstash as well
[13:42:58] we can include ryankemper in this and see if he can adapt the cookbooks
[13:43:01] that's what I commented on the doc as an action item
[13:43:15] gehel: totally! that'd be useful
[13:43:40] * gehel is reading the doc
[13:44:15] I'm going to turn that action item into a task as a followup
[13:48:59] godog: I've added a note about writes to old indices as well
[13:49:35] that one is probably low priority
[13:51:34] *nod* thanks gehel, I'm thinking of comparing docs counts day over day to verify it but maybe there's another way ?
[13:52:28] one indirect measure is to check the number of segments per shard
[13:53:09] Since we forceMerge via curator, old indices should all have a single segment
[13:53:18] if that's not the case, it's suspicious
[13:54:09] I haven't checked the curator logs, but we might even have something in the logs that shows if forceMerge was a noop or not
[13:56:42] interesting, thanks for the lead, I'll take a quick look now to see if there are smoking guns
[14:03:56] yeah the case I'm seeing is from when the cluster was catching up with kafka, anyways low priority
[15:08:55] jbond42: hi. this is your (somewhat late) friday reminder regarding puppet debugging documentation :)
[15:11:26] godog: I'd say that term mem increase is due to more docs being indexed between 4:30am and the incident (
[15:11:28] https://grafana.wikimedia.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22eqiad%20prometheus%2Fops%22,%7B%22expr%22:%22elasticsearch_indices_docs%7Bcluster%3D%5C%22logstash%5C%22,%20exported_cluster%3D%5C%22production-logstash-eqiad%5C%22%7D%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D )
[15:11:29] kormat: thanks, it's erm ... underway :S. seriously, it is still on my list but I didn't get to it today
[15:11:46] hehe. no worries :)
[15:12:16] but I have no clue why... it's not even visible in kibana
[15:14:17] I don't see such an increase on the elastic7 cluster
[15:14:45] dcausse: hehe yeah I was about to say, truly bizarre that elk7 doesn't show a similar increase
[15:15:17] there is an increase at about ~2:40 but that's it, it doesn't spiral out of control
[16:17:15] hi! if anyone is around today, I was wondering if you could help me debug this pcc error: https://puppet-compiler.wmflabs.org/compiler1001/23342/malmok.wikimedia.org/prod.malmok.wikimedia.org.err
[16:17:43] this is a bit weird because the change catalog has the correct output, and I am not referencing this class anywhere in the change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606459/
[16:17:57] the error it gives is: "Could not find class ::passwords::dnsdist::wikidough for malmok.wikimedia.org"
[16:18:20] but I don't have this line in the latest change! this is what the line is: include passwords::wikidough::dnsdist
[16:18:35] it seems to be fetching the line from the older commit for some reason?
[16:19:53] the reallyyy weird thing is the change catalog has the expected output: https://puppet-compiler.wmflabs.org/compiler1001/23342/malmok.wikimedia.org/change.malmok.wikimedia.org.pson (grep for "console_key" and it says "snake oil")
[16:22:57] sukhe: check this now: https://puppet-compiler.wmflabs.org/compiler1001/23344/
[16:23:07] i rebased it and compiled it one more time
[16:23:35] though noop is also not what is expected? heh
[16:23:37] mutante: thanks! but same :( https://puppet-compiler.wmflabs.org/compiler1001/23344/malmok.wikimedia.org/prod.malmok.wikimedia.org.err
[16:24:00] it is trying to find "class ::passwords::dnsdist::wikidough" which is incorrect
[16:24:15] the commit has "include passwords::wikidough::dnsdist" which is what it should be trying
[16:24:49] (the commit corresponds to the structure in puppet master and labs/private)
[16:25:38] if I commit something to labs/private, I just have to run puppet merge on master right?
[16:26:23] sukhe: that link is before the change is applied
[16:26:40] before: https://puppet-compiler.wmflabs.org/compiler1001/23344/malmok.wikimedia.org/prod.malmok.wikimedia.org.err
[16:26:43] after: https://puppet-compiler.wmflabs.org/compiler1001/23344/malmok.wikimedia.org/change.malmok.wikimedia.org.err
[16:26:58] it's because it was broken before the change is applied, and only with the change is it not broken anymore
[16:27:05] which shows up the same as a noop
[16:27:26] "no difference (or compile correctly only with the change)"
[16:27:48] but what's the reason for the noop here
[16:28:28] it's not a real noop. it is the second option "compile correctly only with the change"
[16:28:42] it should be broken now when running puppet on the host?
[16:30:19] ah I see what you mean now. it's because I changed the class name
[16:31:21] so what's a fix for this? apologies if I am missing something here. or will it look up the correct class name correctly once I apply the change?
[16:33:32] just making sure I understand "Hosts that have no differences (or compile correctly only with the change)" correctly :)
[16:33:47] sukhe: the "fix" seems to be to just merge the change, i think
[16:34:01] because the compiler tells us after doing that puppet would be unbroken
[16:34:31] ok yeah makes sense. submitted for review. thanks! first time I have come across this and wanted to make sure what the noop here meant!
[16:34:35] if you manually run puppet right now, is it broken ?
[16:34:48] i would expect it to be (before merging)
[16:35:02] then after merging the issue should be gone
[16:35:14] oh yeah
[16:35:18] broken on malmok
[16:37:42] ack, then it adds up
[16:37:52] yeah :)
[16:37:56] merge it and see the error go away
[16:38:39] +1ed
[16:38:45] thanks, trying!
[16:40:45] mutante: all good! successfully merged. thanks!
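(aside: a hedged sanity check for the class-rename confusion above: confirm the new class name autoloads from the expected file in labs/private and that no stale references to the old name remain; the paths follow standard puppet autoloading and are assumptions)

    # In a labs/private checkout: passwords::wikidough::dnsdist should autoload from here.
    ls modules/passwords/manifests/wikidough/dnsdist.pp

    # In operations/puppet: any leftover use of the old class name that pcc could still compile?
    git grep -n 'passwords::dnsdist::wikidough'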
[16:41:01] also for the question yesterday, jbond42 suggested I use merge in the lookup
[16:44:10] great! and ok, eh.. let's try that
[16:44:34] yeah that's what I did in the current commit and it worked out fine :)
[16:44:46] :)
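(aside: a sketch of what "use merge in the lookup" can look like from the CLI, via puppet's lookup subcommand; the key name is hypothetical and the right merge strategy depends on the hiera data in question. In a manifest the equivalent would be lookup('profile::dnsdist::instances', Hash, 'deep'))

    # Show how a hash key resolves across hierarchy levels with a deep merge,
    # and where each piece of the result came from.
    sudo puppet lookup --merge deep --explain --node malmok.wikimedia.org profile::dnsdist::instances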