[00:12:55] Analytics-Cluster, operations: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1972202 (Dzahn) NEW [00:13:46] Analytics-Cluster, operations: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1972217 (Dzahn) [00:16:09] (CR) Yuvipanda: [WIP] Database selection (2 comments) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk) [00:16:44] Krenair: minor changes. I can test it later today and if it works ok we can merge and deploy today/tomorrow [00:20:37] YuviPanda, what about the stuff in the commit message? [00:21:00] Analytics, operations: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015#1972277 (mforns) NEW [00:21:35] Krenair: what do you mean by 'this list' [00:22:13] the db lisgt [00:22:14] list* [00:25:43] Krenair: and by encoding you mean the JSON encoding? [00:26:26] YuviPanda, well look at the options it produces... [00:26:30] some messed up stuff in there [00:26:57] (CR) Yuvipanda: "Yes we should sort this list - ideally in 'most popular to least popular' (similar to https://github.com/wikimedia/apps-android-wikipedia/" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk) [00:27:15] ottomata: ah, didn't see your ping because of the opening bracket ;) [00:27:28] yes, thanks a lot jynus, much appreciated [00:27:38] Krenair: are you talking about "\u043c\u043e\u043b\u0434\u043e\u0432\u0435\u043d\u044f\u0441\u043a\u044d Wiktionary" [00:27:40] that seems ok [00:27:41] all hail the DBA! 
;) [00:28:22] yes, that sort of mess [00:28:23] it's not ok [00:28:42] that's supposed to show as молдовеняскэ [00:30:24] >>> print u"\u043c\u043e\u043b\u0434\u043e\u0432\u0435\u043d\u044f\u0441\u043a\u044d Wiktionary" [00:30:27] молдовеняскэ Wiktionary [00:30:29] Krenair: ^ [00:30:31] hmm [00:30:50] Krenair: the lack of 'u' is the problem [00:30:53] caused by the json.dumps :) [00:31:19] I still think we're crazy for trying to auto-generate a python file [00:31:25] well, I say we.. :P [00:31:28] Krenair: we could just do a json.dump into a file, and load it right after setting up 'app [00:31:31] ' [00:31:32] and see what happens [00:31:36] it will probably be ok [00:31:41] and it's highly possible I'm crazy :D [00:31:55] I wrote a python script that generates an nginx+lua conf last week (not for wikimedia :P) [00:31:58] so there's a pattern ther [00:32:35] didn't you or andrew get me to write python to generate lua, all from a puppet template? [00:32:41] which uses ruby [00:32:47] that was me yes [00:32:56] or did we change that so it was no longer in a template, but got separate config [00:33:40] I think it's not a template [00:33:43] Krenair: we could just do a json.dump into a file, and load it right after setting up 'app' [00:33:47] that's what I suggested [00:33:50] that's what you had originally suggested [00:33:52] yeah [00:34:01] so let's do that and see how that behaves [00:36:39] (PS9) Alex Monk: [WIP] Database selection [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) [00:56:07] Krenair: what do you think of my sorting suggestion? [00:56:53] I think I will think about it in 10 minutes [00:57:19] or maybe think about it while sync-master is running [00:59:00] YuviPanda, most popular to least popular? how do you plan to look that up? 
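A minimal sketch of the escaping issue discussed above (roughly 00:27 to 00:34), assuming Python 3 and a made-up one-entry database list. By default `json.dumps` writes non-ASCII characters as `\uXXXX` escapes, which is valid JSON and decodes back to the original text. The escapes only look broken when the dumped text ends up inside a plain (non-unicode) string literal. The names `db_names` and `dump_and_reload`, and the sample key, are illustrative and are not taken from the Quarry change.

```python
import json
import os
import tempfile

# Hypothetical sample entry; the real list is generated elsewhere.
db_names = {"mowiktionary": "молдовеняскэ Wiktionary"}


def dump_and_reload(data):
    """Write data as JSON to a temporary file, then load it back."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                     encoding="utf-8") as handle:
        json.dump(data, handle)
        path = handle.name
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    finally:
        os.remove(path)


# Default behaviour: ASCII-only output with \uXXXX escapes.
escaped = json.dumps(db_names)
# ensure_ascii=False keeps the Cyrillic text readable in the output.
readable = json.dumps(db_names, ensure_ascii=False)
# Dumping to a file and loading it back preserves the original strings.
reloaded = dump_and_reload(db_names)
```

Either form loads back to the same strings, so dumping to a JSON file and loading it after setting up the app, as suggested at 00:33, avoids generating Python source at all.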
[00:59:15] Krenair: so for the android app [00:59:25] we just get list of wikipedias sorted by 'pageviews' [00:59:32] err [00:59:34] no [00:59:36] by 'number of good articles' [00:59:40] which is just a byte measure I think [01:00:46] 'just' [01:01:11] right, not a measure of GA / FA ratings since those are enwiki specific [01:01:31] hah [01:01:33] https://github.com/wikimedia/apps-android-wikipedia/blob/master/scripts/generate_wiki_languages.py [01:01:35] github is down [01:01:46] indeed [01:01:59] gerrit is not [01:02:16] yes [01:02:23] but I still have no idea how to find that file in gerrit [01:02:25] I guess [01:02:27] I should find it in diffusion [01:03:06] https://phabricator.wikimedia.org/diffusion/APAW/browse/master/scripts/generate_wiki_languages.py [01:33:42] YuviPanda, if you know the gerrit project, /r/project/ works [01:33:54] YuviPanda, that pulls from inside labs? [01:34:09] we should probably not be depending on other labs projects like that [01:34:16] Krenair: it is a build time thing [01:34:29] run once every few months [01:37:15] YuviPanda, so we have to download one for each family, and presumably one for special sites? [01:37:19] and they don't include dbname [01:38:42] milimetric: 'round? [01:45:14] hey mobrovac what's up [01:45:42] milimetric: there's someone asking about all-days params for AQS @ https://www.mediawiki.org/wiki/Topic:Swpsoqwzco0g3poj so take a look [01:45:56] i asked them to give more info, but maybe you can tell right away what the problem is [01:46:40] Oh, mobrovac yeah, we still haven't been able to compute that efficiently, thanks for the ping [01:46:52] kk cool [01:46:53] thnx [01:50:52] (PS10) Alex Monk: [WIP] Database selection [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) [02:39:16] Krenair: yeah, you'd actually have to 'join' [02:39:28] Krenair: use the stat stuff just for sorting [02:39:50] to 'join'? 
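A rough sketch of the 'join' mentioned at 02:39: keep the wiki list from one source and use the stats only as a sort key. Since the stats reportedly do not include dbnames, this assumes the two sources can be matched on a language code within a single family. All data and names here (`wikis`, `stats_order`, `sort_by_popularity`) are made up for illustration; the real formats are not shown in this log.

```python
# Hypothetical inputs for one family; the real formats are not shown in the log.
wikis = [
    {"dbname": "dewiki", "lang": "de", "name": "Deutsch"},
    {"dbname": "frwiki", "lang": "fr", "name": "Français"},
    {"dbname": "enwiki", "lang": "en", "name": "English"},
]
# Ordering taken from a separate stats source, most popular first.
stats_order = ["en", "de"]


def sort_by_popularity(entries, order):
    """Sort entries by their position in order; unranked ones go last, by name."""
    rank = {lang: index for index, lang in enumerate(order)}
    return sorted(
        entries,
        key=lambda entry: (rank.get(entry["lang"], len(rank)), entry["name"]),
    )


sorted_wikis = sort_by_popularity(wikis, stats_order)
```

Entries missing from the stats are kept and placed at the end, so a gap in the stats never drops a database from the list.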
[02:39:57] no idea what you mean [02:42:40] YuviPanda [02:49:08] Krenair: I mean [02:49:18] Krenair: you get the dbnames + human readable names from the API [02:49:28] then get the sort order from stats [02:49:48] and use the sort order to sort the data you got from the API [03:59:52] Hey all, not sure if anyone is around at the moment who'd be able to help, but it seems like the pagecounts-raw uploads are stuck / missing some of today's data [03:59:54] https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/ [04:04:13] pagecounts-all-sites is also missing the same data, though my consumer only uses pagecounts-raw as we need data consistency back to 2010 (and as such also can't use the pageviews API) [08:00:54] Analytics, Services, RESTBase-API: RESTBase pageview data not updated - https://phabricator.wikimedia.org/T125048#1973342 (mobrovac) [09:25:13] Analytics-Cluster, operations: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1973447 (ArielGlenn) NEW [09:36:57] hello a-team! I need to restart our beloved kafka brokers [09:37:25] let me know when you are free to ensure that EL doesn't get into a weird state again :) [09:52:05] elukey: statsv (which also consumes from kafka) tends to get into a weird state as well [09:52:11] i have not had the time to debug it [09:52:55] could you remember to restart it as well? it's running on hafnium (so `service statsv restart` should do the trick) [09:53:28] ori: sure! I'll add this note to https://wikitech.wikimedia.org/wiki/Service_restarts#Kafka_brokers [09:53:35] is it only related to Kafka? [09:53:56] yeah. 
it's a very primitive script; there is hardly more to it than a for-each loop [10:02:21] Analytics-Cluster, operations: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1973531 (elukey) Hello Ariel, I believe that @Ottomata is still working on the host: ``` # This node was previously a Hadoop Worker, but is now waiting # to be rep... [10:19:41] oh boy, the last email from hdfs@stat1002 doesn't look good joal [10:31:00] wow elukey, that does indeed look bad [10:31:35] elukey: seems to be a camus issue [10:35:27] joal: where are you looking? Just to have an idea.. Yarn? [10:35:58] looked into hue to see which jobas are blocked --> started at the root, load_webrequest - They are blocked [10:36:32] They are failing due to a missing _IMPORTED file --> this is generated after succesful camus run - I therefore think camus is failing [10:36:59] ahh hue I forgot the name :D [10:36:59] Now looking at camus logs on analytics1027:/var/log/camus/webrequest.log [10:37:44] elukey: have restarted analytics 1027 yesterday ? [10:37:54] with, like, an update of java version ? [10:38:27] no 1027 was blacklisted by ottomata [10:38:32] there is a phab task opened [10:38:35] hm [10:39:04] but I am double checking [10:40:03] yep nothing in https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:30] uptime big, so recent restat [10:41:34] Start-Date: 2016-01-27 09:32:48 [10:41:34] Commandline: apt-get -q -y install linux-image-generic [10:41:38] but not restarted [10:41:53] hm [10:41:56] Any change on java ? [10:42:19] and also Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install apt-transport-https [10:42:23] not yesterday [10:42:48] Looks like a java issue with file rights on hadoop [10:46:13] the webrequest.log looks horrible :D [10:46:26] elukey: do you have root on this machine ? [10:46:43] yep [10:46:47] If so, can you try to join screen camus and camus_loop ?h [10:46:48] is the partition mounted? 
[10:48:12] I mean, sorry, since we have failures in hdfs moves I was thinking if everything is ok from the fs side [10:48:35] np, everything is done through hdfs command [10:48:49] have you managed to join screen ? [10:50:01] sorry joal I didn't get what do you mean with join screen [10:50:20] There are screen running on analytics1027, probably from root [10:51:08] actually, they are from otto [10:51:18] and I wonder if there commands running from there [10:53:50] screen -ls doesn't show anything [10:54:02] you sdhould run that as otto :) [10:54:36] elukey: another question: have we changed anything that impact java in yesterday's install-reboot ? [10:57:10] well, the kernel :) [10:57:15] but not on that host [10:57:30] and without a reboot we are still using old packages [10:58:01] elukey: speaking of hadoop global (issue seems to come from hdfs rights to write being handled differently) [11:00:01] batcave? [11:00:28] yes [11:08:36] Analytics-Tech-community-metrics, pywikibot-core, DevRel-January-2016, Upstream: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1973640 (Lcanasdiaz) >>! In T123808#1971343, @jayvdb wrote: >>>! In T123808#1970520, @Lcana... [11:29:57] !log disabled puppet and # Puppet Name: camus-webrequest [11:30:23] mmm [11:31:43] !log Moving faulty camus imported files to a temporary place [11:31:57] !log disabled puppet and camus webrequestlog in analytics1027 [11:37:53] !log run camus manually to check if succesful [11:53:10] \o/ [12:13:57] joal: I am going away for ~30 mins to eat something, I'll be back in a bit to check. [12:14:07] sure, still having issues [12:14:18] ouch.. do you need me to do anything? :( [12:21:38] naaa, fighting with files to remove [12:42:22] mforns, are you onlinez? [12:42:28] elukey, yes :] [12:42:45] o/ [12:42:50] I was reading there are problems with the cluster, can I help? 
[12:42:51] I am going to restart kafka [12:42:56] :) [12:43:02] ok, will follow up EL [12:44:04] joal is kicking Camus very hardly in its teeth, should be fine in a bit :D [12:44:30] this camus guy, would never had thought a writer could fight that bad [12:45:36] :D [12:45:54] let me know if you need any help, in the meantime I am going to deal with Kakfa [12:45:57] *kafka [12:50:46] !log stopping kafka1012 for kernel upgrade [12:51:43] EL logs went crazy [12:53:53] O_O [12:53:59] I need to stop the node this time [12:54:05] not only restart [12:54:15] so it might take it not well [12:56:11] elukey, I think it's OK, kafka will buffer logs and after restart EL will catch up [12:56:27] elukey, do you know how long it will take to upgrade? [12:56:59] I need to restart all the hosts :D [12:58:05] so I'd say, 2 hours to be super conservative? [12:58:21] I'll bring up the node + leader election probably [12:58:29] if the leaders are not balanced [12:59:04] plus the partitions are replicated so it is only a metadata change for EL (theoretically) [13:04:34] elukey, then I think I'll stop EL [13:05:42] elukey, was this scheduled or is this an emergency measure? should we send an email to analytics? [13:07:03] elukey: successful camus run ! huray ! [13:07:13] WWWOOOOOOOW [13:07:16] elukey: can you restart cron and puppet ? [13:07:26] joal: sure [13:07:45] mforns: it is a scheduled reboot of the kafka nodes to upgrade the kernel for a security fix [13:08:04] if you could send an email it would be great, the node is not restarting -.- [13:08:56] elukey, cool, then I'll send an email to analytics mentioning the scheduled reboot and the EL temporary stop [13:09:26] elukey, is there any link to a Ohab task I can paste? [13:09:54] joal: 1027 back to work [13:10:27] mforns: I don't have one atm, but you can mention the latest kernel vulnerability bug [13:11:26] elukey, thanks [13:11:40] thanks elukey ! 
[13:11:43] I'm monitoring [13:21:07] elukey, joal, EL is kinda enduring the connection problems, I think we don't need to stop it... [13:21:32] just restart it when upgrade is done, unless the situation changes [13:21:42] mforns: I trust you :) [13:21:58] mforns: Thx for caring that one :) [13:23:32] !log restart EventLogging to overcome connection problems following kafka maintenance [13:24:19] joal, is there anything I can help with the cluster, it's my ops-week [13:24:36] mforns: I'm gently monitoring, thx for asking :) [13:24:51] mforns: any issue in the past with kafka hosts not responding to ssh? [13:25:01] O.o [13:25:09] not for me [13:25:27] actually I think I never ssh'd a kafka instance [13:25:28] mmmm now it is refusing me sigh sigh [13:26:33] elukey: That bug in the kernel, that was your way to get in ! I understand now ;) [13:27:03] yes you got me! I released a CVE only to reboot hosts :P [13:27:11] :) [13:27:18] now kafka1012.eqiad.wmnet doesn't respond [13:27:20] sigh [13:27:28] mwarf :( [13:35:51] Analytics, Services, RESTBase-API: RESTBase pageview data not updated - https://phabricator.wikimedia.org/T125048#1973848 (JAllemandou) Thanks for raising this issue @mobrovac. Problem comes from down the line, at hadoop data ingestion. Is it now fix, but the cluster will take time to catch up on late... [13:38:28] elukey, joal, mmm EL got worse, it isn't ingesting events at all [13:38:39] :( [13:39:21] mforns: what does the log say? Completely empty or some weird messages? 
[13:39:40] 2016-01-28 13:22:41,201 (MainThread) Unable to connect to broker kafka1012.eqiad.wmnet:9092 [13:39:41] 2016-01-28 13:22:41,202 (MainThread) [Errno 111] Connection refused [13:40:28] ah yes it is down [13:40:34] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [30.0] [13:40:37] also there is an outage in -ops [13:40:51] hello icinga [13:41:55] mforns: even if you restart EL it still get stuck with 1012? [13:42:50] joal, it seems to try to connect to kafka1020 [13:43:18] mforns: is that bad ? [13:43:34] 1012 is down [13:43:35] :( [13:44:03] joal, no, but the processor and the consumer are stuck anyway [13:44:04] Man, we were talking of OOW yesterday, and today machine dies -- That is bad luck ! [13:44:15] xDDDDD [13:44:15] I don't get it mforns [13:44:35] joal, they are not consuming [13:44:58] mforns: right. give me a minute to check the kafka state [13:46:13] mforns: We have leaders that are not 1012 assigned on every topic, so a restart should do I think [13:48:07] joal, just restarted it, but the processor and the consumer are not counsuming from kafka [13:48:19] batcave? [13:48:26] sure [13:55:16] elukey, how is it going with 1012? [13:56:02] well it caused a major outage for mediawiki, people are investigating. Still down :( [13:56:14] 1012 [13:56:20] sorry [13:58:24] elukey: you think the mediawiki errors are due to kafka1012 being down ? 
[13:58:45] the ops team is investigating atm [13:59:25] !log stop EventLogging until kafka is ok [14:00:10] elukey: If so, it might be related to something we have witnessed with mforns on pykafka [14:00:30] I think that the kafka nodes are hardcoded in php files as pool [14:00:37] --> When the first instance of a writing corum is not responding, pykafka fails [14:01:13] If the C-lib used by varnish-kafka does the same, then maybe kafka1012 down is the root cause [14:11:24] yep, I won a t-shirt [14:16:42] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1973932 (ArielGlenn) [14:17:15] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1973881 (ArielGlenn) I checked on stat1002:/mnt/hdfs/wmf/data/archive/pagecounts-raw/2016/2016-01 and the files aren't there. [14:23:22] mforns: how is EL working? [14:23:41] better, s/working/doing [14:23:54] elukey, I've stopped it [14:24:35] ah okok [14:24:41] oh! I forgot, I restarted it, because the good part is, events are making it into kafka from varnishkafka [14:24:47] so no data loss as of now [14:25:04] I restarted it so that server-side events get inserted, too... [14:27:55] all right so it is running atm? [14:28:17] theoretically one node shouldn't matter [14:28:25] but of course the client do mind :) [14:30:33] Analytics-Kanban, Wikipedia-Android-App: Beta Event Logging no longer functional {oryx} - https://phabricator.wikimedia.org/T123781#1973981 (mforns) Open>Resolved @Niedzielski EventLogging logs are no more in `/var/log/eventlogging/all-events.log`, they are in `/srv/log/eventlogging/` and `/srv/log/u... [14:34:43] elukey: t-shirt ? [14:34:53] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1974003 (Aklapper) p:Triage>High [14:34:54] I broke XYZ [14:35:00] :) [14:35:03] :D [14:35:08] Your fault really ? 
[14:37:30] yep [14:37:59] joal: https://phabricator.wikimedia.org/T125084 [14:39:55] right ... mforns_brb --> seems that event-logging client-side might be flowing trhough varnish kafka :) [14:40:08] yes :] [14:43:02] and surprisingly the server-side forwarder works [14:43:02] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1974026 (JAllemandou) Hello, Issue is know, an email has been sent to the analytics list about the problem. Hadoop data ingestion has been failing yesterday, and was restored around 13:00UTC today , but all the... [14:43:02] :) [14:43:02] * joal is happy to have diagnose the stuff correctly [14:43:02] Analytics-Kanban, Wikipedia-Android-App: Beta Event Logging no longer functional {oryx} - https://phabricator.wikimedia.org/T123781#1974027 (Niedzielski) Ah, thanks! That's very helpful! Looks like it's moved to deployment-eventlogging02 but it's working fine. [14:43:02] mforns_brb: What the thing means is that we need to change eventlogging conf to remove kafka1012 from the senders list [14:45:22] HIII [14:45:25] reading backlogs and emails! [14:45:38] ottomata: You're too late, you've missed all the fun :D [14:46:34] If i missed the fun that means things are working smoothly! :D [14:46:45] almoooooooost [14:47:22] Man, kafka is a dangerous player :) [14:48:17] good i see we have a (weird looking) webrequest partititon status email [14:48:23] i was worried we weren't getting those or something [14:48:39] ottomata: that bit is (kinda) solved [14:48:46] kafka is the still broken one [14:48:46] aye [14:48:48] so, what's up? [14:48:54] elukey: you go ? [14:49:02] or shall I ? [14:49:13] looks like I'll go :) [14:49:34] Started the day with broken jobs - camus failure yesterday after cluster reboot [14:49:57] after kafka? or hadoop [14:49:58] ? [14:49:59] reboot [14:50:00] ? 
[14:50:07] after hado [14:50:11] hadoop sorry [14:50:15] Analytics-Kanban, Editing-Analysis: Queries for the edit analysis dashboard failing since December 2015 [5 pts] - https://phabricator.wikimedia.org/T124383#1974038 (mforns) a:Milimetric>mforns [14:50:30] hey [14:50:46] yep joal you have most of the context, I'll go for the API outage [14:51:03] explanation: a camus run failed while writing it's offsets file, leaving the system in a unstable state (data already imported, but no correct offset file) [14:51:24] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1974041 (Ottomata) @jcrespo, could you give us a slave resync status update when you get a chance? Danke! [14:51:35] Then camus tried to reimport the same data over and over again, failing because it actually already existed [14:51:36] doh [14:51:42] indeed [14:52:06] So, we agreed with elukey that, when restarting the cluster (even node after node), we should stop camus :) [14:52:13] not a bad idea [14:52:18] strange thoughj [14:52:20] it should be better than that [14:52:21] I spent a few hours cleaning the wrong files for camus to restart [14:52:24] that happened while restarting datanodes/ [14:52:25] ? [14:52:27] workers? [14:52:33] we haven't done namenode yet, ja? [14:52:33] hm, can't say realy [14:52:37] nope [14:52:54] ottomata: nope [14:52:57] so, are there also kafka broker issues? 
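The failure mode described above (data written, offsets file not written, every later run failing on the already-existing data) can be illustrated with a toy model. This is not Camus code and does not use any Camus API; `ToyImporter` and its fields are hypothetical and only mimic the two-step "write data, then record offset" sequence.

```python
class ToyImporter:
    """Toy model of a two-step import: write data, then record the offset."""

    def __init__(self):
        self.files = {}      # destination path -> imported records
        self.committed = 0   # offset recorded after a fully successful run

    def run(self, source, batch, fail_on_commit=False):
        start = self.committed
        path = f"batch-{start}"
        if path in self.files:
            # Destination already exists: the run cannot proceed.
            raise FileExistsError(path)
        self.files[path] = source[start:start + batch]   # step 1: data lands
        if fail_on_commit:
            raise OSError("offset write failed")         # crash between steps
        self.committed = start + len(self.files[path])   # step 2: offset recorded


source = list(range(10))
importer = ToyImporter()
outcomes = []
for fail in (True, False, False):
    try:
        importer.run(source, batch=5, fail_on_commit=fail)
        outcomes.append("ok")
    except FileExistsError:
        outcomes.append("exists")
        # Manual cleanup: remove the leftover data so the next run can redo it.
        importer.files.pop(f"batch-{importer.committed}")
    except OSError:
        outcomes.append("commit failed")
```

In this model the retry keeps failing until the leftover file is removed by hand, which matches the cleanup described in the conversation.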
[14:53:01] https://phabricator.wikimedia.org/T125084#1973969 [14:53:06] all info in there :) [14:53:17] So while I was fighting with camus, elukey went to his task for today of patching kafkas and rebooting them [14:53:44] Starting with kafka1012 --> At reboot, it has not came up, and broke the full platform [14:53:47] kafka1012 is still down, went is single user mode after reboot [14:54:12] ahhh interesting [14:54:25] yeah, because kafka1012 is the first broker in the list of bootstrap brokers [14:54:26] ops is trying to fix it, I started a leader election a while ago but some partitions have only two replicas now [14:54:29] and it just gets blocked on it? [14:54:33] 15:00:37 < joal> --> When the first instance of a writing corum is not responding, pykafka fails [14:54:36] 15:01:12 < joal> If the C-lib used by varnish-kafka does the same, then maybe kafka1012 down is the root cause [14:54:54] ottomata: you now know [14:55:15] hmmm that is not how it shoudl work. [14:55:23] i highly doubt varnishkafka does that [14:55:26] I do agree with that :) [14:55:31] we have done reboots of these before and not had this problem (at least, before monolog) [14:55:38] pykafka shoudl be better too [14:55:51] elukey: did we watch EL logs while rebooting kafka's / doing leader elections? [14:57:16] ottomata: better explanation of what happened: https://phabricator.wikimedia.org/T125084 [14:57:23] ja read that [14:57:31] is kafka1012 ok though? [14:57:49] so kafka1012 is still down [14:58:01] I think it's dead, we were making fun of discussing OOW yesterday with mforns_brb and killing a machine today :) [14:58:05] Giuseppe is working on it, it went up in single user mode for some reason [14:58:31] EL logs were watched by Marcel that told us about the connection refused [14:58:37] also this, wip: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160128-MediaWiki-API [14:58:55] ok interesting [14:59:03] oh, so kafka1012 was rebooted but didn't come back alive? 
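A small sketch of the behaviour a client is expected to have with a list of bootstrap brokers: if the first one refuses the connection, move on to the next. This is only an illustration written with the standard library; it is not how pykafka or the C library behind varnishkafka is implemented, and `first_reachable` is a hypothetical helper.

```python
import socket


def first_reachable(brokers, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection, else None."""
    for host, port in brokers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # Connection refused, timed out, or unresolvable: try the next one.
            continue
    return None
```

Called with a list such as `[("kafka1012.eqiad.wmnet", 9092), ("kafka1020.eqiad.wmnet", 9092)]`, this would skip a broker that is down and return the next one that answers, instead of blocking on the first entry.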
[14:59:25] yep! [14:59:38] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1974056 (jcrespo) It is continuing resyncing- but I do not have an ETA to finish. I will try to run a script today to appr... [14:59:42] I was trying to figure out why because I don't know how to access the console [14:59:49] and then everything went on fire [14:59:52] hmmm [14:59:56] heh ok cool [15:00:03] it look slike it is up though [15:00:05] OHH [15:00:09] no it doesn't sorry [15:00:37] hm, graphite/grafana is so funky sometimes [15:00:41] why no recent data.. [15:00:43] https://grafana.wikimedia.org/dashboard/db/kafka [15:01:28] ottomata: kafka is in pain because of backfilling old data with camus :( [15:01:51] ottomata: doing a lot of reads (and therefore taking long to move forward in backfilling [15:02:03] its weird though, other recent data for jmxtrans kafka stats are present [15:02:06] just not the top 3 graphs? [15:02:21] yup, from what I see [15:03:10] Actually, data is present, but not charted (if you look at the box displaying numbers, they actually are here) [15:03:15] ottomata: --^ [15:03:19] HMMMM [15:03:34] yeah, i think maybe having 1012 in the group makes it not able to chart? [15:03:42] possibly [15:03:44] if i deslect that one I see graph [15:04:03] okayy [15:04:14] not cool, but ok... [15:04:54] ok, status on hadoop data: no data loss, catching up (slowly) [15:07:05] joal: how hard was it to fix the camus offsets? [15:07:10] did you just put the previous ones in place? [15:07:32] no, I actually deleted the data files that camus was trying to reimport [15:07:51] leading to a huge volume of duplicate data, but no data loss [15:08:32] ottomata, I firmly believe there is something inserting events out-of-order [15:10:09] ottomata: looking at kafka-bytes out chart, camus backfilling is visible :) [15:13:40] jynus: eh? 
[15:16:10] backfilling out of the maintenance gaps is adding like 1% of events [15:17:17] it is worth investigating- either the replication script has a problem or some EL/logs arrive very out of sync [15:17:20] the resync? [15:17:29] yes [15:17:40] could make sense if you are resyncing close to current time [15:17:48] but for past, it shouldn't matter, right? [15:17:55] no, it is almost all timestamps [15:18:04] before the gap, too [15:18:18]