[05:55:50] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10TechCom-RFC, 10Services (designing): RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10Joe) >>! In T198256#4323745, @Ottomata wrote: > Yeah, both protobufs and thrift are options, but neither have the advant... [05:56:20] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Operations, and 2 others: RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10Joe) [06:23:12] hello team! [06:23:30] I'd need to drain the cluster to restart mariadb on analytics1003 (and apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443893/) [06:26:22] !log stop camus to allow mariadb restart on analytics1003 [06:26:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:53:31] sqoop is running, I am wondering if I can restart the db anyway or not? [06:57:00] hi elukey [06:57:32] o/ [06:57:45] sqoop doesn't use the an1003 db, it gets data from an-store, and doesn't access hive - So we should be fine [06:58:01] However, since it runs on an1003, if you restart the host globally, it fails :) [06:58:11] nono I just need to restart the db [06:58:43] basically I need to apply socket based auth, the current lvm snapshot backup is not working anymore sine I added the root password [06:58:59] (it logs in to force the db to prepare the snapshot afaics) [06:59:30] My understanding is limited, but I say: Go ! [06:59:34] and mariadb offers a way to allow auth in localhost via socket (basically you only need to be 'root' on the os to log in as 'root' in mariadb) [06:59:47] we already use it for eventlogging etc.. [07:00:07] so after enabling this, I'll set the script to use the socket [07:00:13] and everything should restart working [07:01:01] elukey: there still is 1 query working [07:01:08] elukey: one hive query I mean [07:01:45] elukey: Just in case you hadn't seen :) [07:03:06] (03PS7) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [07:04:29] joal: thanks :) I am checking my dear friend yarn [07:05:01] * joal sings: Yarn got a friend in me [07:09:08] ah nice only sqop [07:09:13] err sqoop now [07:15:12] elukey: I have a spark job running, doesn't use an1003 db :) [07:43:51] (03PS8) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [07:44:07] !log all jobs re-enabled [07:44:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:44:14] \o/ ! [07:44:16] Thanks elukey :) [07:44:46] all done! 
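The socket-auth change described above relies on MariaDB's unix_socket authentication plugin: a local OS user can connect over the socket as the matching database user, no password needed. A minimal sketch of what enabling it for root typically looks like (exact statements vary by MariaDB version, and the real change here was applied through the puppet patch linked earlier):

  # Load the plugin and switch the local root account over to it
  # (skip the INSTALL if the plugin is already built in / loaded).
  mysql -u root -p -e "INSTALL PLUGIN unix_socket SONAME 'auth_socket';"
  mysql -u root -p -e "UPDATE mysql.user SET plugin='unix_socket' WHERE User='root' AND Host='localhost'; FLUSH PRIVILEGES;"

  # After restarting mariadb, an OS root shell connects with no password,
  # which is what the snapshot/backup script needs (socket path illustrative):
  sudo mysql --socket=/run/mysqld/mysqld.sock -e "SELECT CURRENT_USER();"
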
Now we have a secure passwordless auth to the db [07:44:53] I should have done it the other day [07:45:03] but didn't think about it :) [07:45:17] elukey: You can't imagine the number of things I should have done 'the other day' ;) [07:45:25] (I saw today an alarm for backup generation staleness for the an1003 db on an1002) [07:45:32] :) [07:46:41] lemme update the email thread "to keep archives happy" [07:50:28] (03PS9) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [08:07:55] (03PS10) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [08:18:31] backups on an1002 back in a good shape [08:39:30] (03PS11) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [10:18:30] joal: is hue working for you? [10:19:18] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 2 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue [10:21:22] ah lovely [10:21:29] RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue [10:21:52] in any case, I don't see anything now [10:24:01] so I can connect to an1003's db [10:24:05] from thorium [10:34:23] !log restart hive metastore on an1003, errors after mariadb maintenance this morning [10:34:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:35:15] so it is currently doing this (hue) [10:35:15] [05/Jul/2018 10:28:14 +0000] connectionpool INFO Starting new HTTP connection (40): analytics1003.eqiad.wmnet:11000 [10:35:18] that is weird [10:36:49] !log restart oozie on analytics1003 - connection timeouts from thorium after mariadb maintenance [10:36:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:37:12] hello hue [10:37:21] oozie was stuck [10:37:43] lovely [10:38:07] yeah and all the jobs as well [10:38:13] they are now listed in yarn [10:44:20] back to normal elukey ? [10:45:01] joal: seems so, oozie was a bit upset [10:45:11] (without really telling it to me in the logs) [10:45:30] but hue was hanging talking with an1003:11000 owned by oozie, a restart fixed it [10:45:46] grea [10:45:49] we have some delay in processing jobs :( [10:46:13] it'll catch up :) [10:46:43] I am wondering if we need better alarms on this, maybe a sort of "hey oozie, are you ok?" 
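A check of that sort could simply hit Oozie's admin REST endpoint instead of (or on top of) the existing process check; a rough sketch, using the host/port mentioned above and Icinga-style exit codes (the JSON field name is per the Oozie web-services API):

  # L7 "hey oozie, are you ok?" probe: the admin/status endpoint should answer
  # quickly and report NORMAL system mode.
  if curl -sf --max-time 10 http://analytics1003.eqiad.wmnet:11000/oozie/v1/admin/status \
      | grep -q '"systemMode"[[:space:]]*:[[:space:]]*"NORMAL"'; then
    echo "OK: oozie answering on :11000, system mode NORMAL"
    exit 0
  else
    echo "CRITICAL: oozie not answering (or not NORMAL) on :11000"
    exit 2
  fi
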
[10:46:46] or similar [10:47:04] a l7 health check I mean, not only a process one [10:47:23] I forgot to check if oozie was answering to port 11000 at all [10:47:29] it seems not from hue's logs [11:36:22] (03PS12) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [11:44:56] going afk for a couple of hours people (lunch + errand) [12:13:12] (03PS13) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [13:50:34] (03PS2) 10Sahil505: Updated the functioning & styling of filter & split switches [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443592 (https://phabricator.wikimedia.org/T198183) [13:55:47] (03CR) 10Sahil505: [C: 031] "Still has some URL/time-range problem" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/442257 (https://phabricator.wikimedia.org/T189619) (owner: 10Sahil505) [13:55:58] (03CR) 10Sahil505: [C: 04-1] Added pagecounts metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/442257 (https://phabricator.wikimedia.org/T189619) (owner: 10Sahil505) [14:41:51] (03CR) 10Fdans: "Just a couple simple fixes and we're good!" (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443640 (https://phabricator.wikimedia.org/T198630) (owner: 10Mforns) [14:42:53] 10Analytics, 10Analytics-Wikistats: Improve Wikistats2 map zoom - https://phabricator.wikimedia.org/T198867 (10mforns) [15:18:48] (03PS3) 10Jonas Kress (WMDE): Track number of editors from Wikipedia who also edit on Wikidata over time [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/443069 (https://phabricator.wikimedia.org/T193641) [15:24:25] (03PS3) 10Jonas Kress (WMDE): [WIP] Introduce oozie job to schedule generating metrics for Wikidata co-editors [analytics/refinery] - 10https://gerrit.wikimedia.org/r/443409 (https://phabricator.wikimedia.org/T193641) [15:24:56] (03PS4) 10Jonas Kress (WMDE): Introduce oozie job to schedule generating metrics for Wikidata co-editors [analytics/refinery] - 10https://gerrit.wikimedia.org/r/443409 (https://phabricator.wikimedia.org/T193641) [15:27:29] (03PS4) 10Sahil505: Make bar-chart and line-chart resilient to breakdowns with null values [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443640 (https://phabricator.wikimedia.org/T198630) (owner: 10Mforns) [15:29:49] (03CR) 10Sahil505: [C: 031] "We observed a bug while accessing metrics via Topic Explorer. It is not related to this task specifically. So, it will be fixed a another " [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/442257 (https://phabricator.wikimedia.org/T189619) (owner: 10Sahil505) [15:31:28] (03CR) 10Sahil505: [C: 031] "LGTM!" (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443640 (https://phabricator.wikimedia.org/T198630) (owner: 10Mforns) [15:55:03] 10Analytics, 10Analytics-Kanban: [Wikistats2] Negative numbers in chart axis are not abbreviated - https://phabricator.wikimedia.org/T198510 (10sahil505) Intersting! This has been reported as an issue with numeral. 
https://github.com/adamwdraper/Numeral-js/issues/222 [16:01:50] (03CR) 10Mforns: Updated the functioning & styling of filter & split switches (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443592 (https://phabricator.wikimedia.org/T198183) (owner: 10Sahil505) [16:04:10] (03PS1) 10Sahil505: Corrected negative number abbreviation in chart axis [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/444012 (https://phabricator.wikimedia.org/T198510) [16:10:49] (03PS3) 10Sahil505: Updated the functioning & styling of filter & split switches [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443592 (https://phabricator.wikimedia.org/T198183) [16:33:00] 10Analytics, 10Analytics-Wikistats: Improve Wikistats2 map zoom - https://phabricator.wikimedia.org/T198867 (10Milimetric) p:05Triage>03Normal [16:33:18] 10Analytics, 10Analytics-Wikistats: Improve Wikistats2 map zoom - https://phabricator.wikimedia.org/T198867 (10Milimetric) p:05Normal>03Triage [16:37:33] 10Analytics: Correct pageview_hourly and derived data for T141506 - https://phabricator.wikimedia.org/T175870 (10Milimetric) p:05Triage>03High [16:37:40] 10Analytics, 10Analytics-Wikistats: Unexpected increase in traffic for 4 languages in same region, on smaller projects - https://phabricator.wikimedia.org/T136084 (10Milimetric) p:05Triage>03Normal [17:29:10] (03CR) 10Ottomata: [C: 031] Update user-history job from username to userText [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/443483 (https://phabricator.wikimedia.org/T197926) (owner: 10Joal) [17:29:23] (03CR) 10Ottomata: [C: 031] Update mw-user-history username to user_text [analytics/refinery] - 10https://gerrit.wikimedia.org/r/443487 (https://phabricator.wikimedia.org/T197926) (owner: 10Joal) [17:47:04] joal: here all day , let me know when you are back [17:47:11] Hi nuria_ [17:47:31] still thinking on the false-pos problem after having a talk with Dan [17:47:47] joal: have a 30 minute meeting in a bit but after we can talk about false positive sif you want [17:47:49] *if [17:48:28] works for me nuria_ - Give me time to think and try some more [18:13:30] * elukey off! [18:13:36] By elukey :) [18:14:55] o/ [18:15:18] joal: i've just installed the toree stuff into all user venvs on both notebook1003 and notebook1004 [18:15:23] let' [18:15:26] \o/ !!! [18:15:33] <3 <3 [18:15:48] if all is working still Monday, i'll write an annoucement to analytics and send it [18:15:59] Awesome :) [18:16:18] updating wikitech swap page now [18:16:22] ottomata: I tried to have brunel working with python only - Didn't succeed (if you have time and wanna check) [18:16:30] oh [18:16:37] joal: if you paste me a usage example i'll try [18:16:43] ottomata: https://github.com/Brunel-Visualization/Brunel/tree/master/python [18:16:48] yeahhh..... [18:16:49] but [18:16:53] ottomata: The cars data file is in the git [18:17:13] hm ok [18:22:45] ottomata: The thing doesn't show any error, but no grph is visible :( [18:28:16] is there a replacement for kafkacat? `kafkacat -b kafka1012:9092 -L` from stat1005 should list topics in main-eqiad, but instead i get: % ERROR: Failed to acquire metadata: Local: Broker transport failure [18:30:32] ottomata, I'm seeing that the event database has missing data for june (second half) for some schemas, is that expected? is the job still refining that period? [18:34:03] for eventlogging mforns? [18:34:16] it is not expected that i know of [18:34:17] like which ones? 
[18:34:25] ottomata, yes EL [18:34:39] ebernhardson: kafka1012 is on its way out :) [18:34:49] the mw avro stuff is the only thing it is still being used for :p [18:34:53] i think you want [18:34:55] kafka1001.eqiad.wmnet [18:35:05] if you want main-eqiad [18:35:16] ottomata: doh, that's it exactly! thanks [18:35:17] mforns: like which table? [18:35:46] joal: were you doing brunel in pyspark or just python notebook? [18:35:52] python notebook [18:35:59] python3 [18:36:00] ottomata, https://pastebin.com/JbDX7VGg this is how many hourly partition each schema has [18:36:24] the majority should have 720 cause that's one month worth hours [18:36:30] but very few have that [18:36:42] HUH. [18:36:46] yeah i'm now looking at MobileWikiAppiOSLoginAction [18:36:56] i see that all days are present, but not all hours for all days.... [18:37:05] mforns: maybe there was failing data at that time?? but for all schemas? HMM [18:37:16] investigating [18:37:20] this is not expected at all [18:37:23] ottomata, I didn't receive refine error emails [18:37:26] usually refine does everything prety quickly [18:37:27] yeah me neither [18:38:11] ottomata, there are no _REFINE_FAILED files for june in /wmf/data/event [18:38:32] looking at camus imported data, there is no data for the missing hours [18:38:39] hmm [18:38:40] ls /wmf/data/raw/eventlogging/eventlogging_MobileWikiAppiOSLoginAction/hourly/2018/06/01 [18:38:48] ls /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_MobileWikiAppiOSLoginAction/hourly/2018/06/01 [18:39:09] i dunno how to even find what happened, its been more than 7 days... [18:39:09] let [18:39:13] lets find one for mysql [18:39:16] that is also there [18:39:17] and look [18:39:41] ok [18:40:29] ok MobileWikiAppiOSLoginAction is in mysql [18:40:48] ottomata, https://pastebin.com/F19CezPL here's the existing _REFINED files for SaveTiming, stops at 14th day of june [18:41:13] ok let's look at SaveTiming [18:42:26] hm, mforns there's no more SaveTiming at all [18:42:31] is SaveTiming still being used? [18:42:32] ??? [18:42:59] yup, max timestamp in mysql [18:42:59] 20180705183806 [18:43:07] hm [18:43:09] checking camus... [18:44:40] I did a daily histogram for savetiming and looks good: https://pastebin.com/6JuNNsXS [18:44:52] yeah [18:45:22] maybe there's a bottleneck somewhere [18:45:37] and data is still being imported to hdfs... [18:48:30] Could it be possible that camus doesn't follow the pace of data (too many schemas?) [18:48:58] no ther eis somethign wrong [18:49:09] 18/06/14 16:05:10 INFO kafka.CamusJob: move_to_earliest: false offset_unset: false [18:49:09] 18/06/14 16:05:10 ERROR kafka.CamusJob: Offset range from kafka metadata is outside the previously persisted offset, eventlogging_SaveTiming uri:tcp://kafka-jumbo1005.eqiad.wmnet:9092 leader:1005 partition:0 earliest_offset:15032211 offset:15025121 latest_offset:15035066 avg_msg_size:568 estimated_size:5648760 [18:49:09] Topic eventlogging_SaveTiming will be skipped. [18:49:09] Please check whether kafka cluster configuration is correct. You can also specify config parameter: kafka.move.to.earliest.offset to start processing from earliest kafka metadata offset. 
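The knob named at the end of that error is a Camus job property: when set, Camus falls back to the earliest offset Kafka still retains instead of silently skipping the topic on every run. A sketch of the change (the properties path is a placeholder; in production it is managed through puppet, as the patch linked further down shows):

  # Append to the Camus EventLogging properties file (path illustrative):
  echo 'kafka.move.to.earliest.offset=true' >> /path/to/camus.eventlogging.properties
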
[18:49:11] the previous hour [18:49:12] camus did [18:49:17] eventlogging_SaveTiming uri:tcp://kafka-jumbo1005.eqiad.wmnet:9092 leader:1005 partition:0 earliest_offset:13463018 offset:15015199 latest_offset:15025121 avg_msg_size:568 estimated_size:5635696 [18:49:26] so, something in kafka rolled the logs/offsets beyhond what camus had [18:49:35] and camus has just been giving up since then... [18:49:35] oh [18:49:38] never seen this before.... [18:49:40] pretty bad. [18:49:42] still investigating [18:50:04] I have seen this in labs before [18:50:13] exactly what you describe [18:50:24] ? [18:50:58] More time between 2 camus runs than kafka-retention time [18:51:08] but, [18:51:12] only an hour had past [18:51:18] 18/06/14 15:05:09 [18:51:19] I'm assuming it could also happen due to offset-shifts when changing kafkas [18:51:19] was fine [18:51:21] and then [18:51:24] 18/06/14 16:05:10 ERROR [18:51:25] yeah [18:51:26] but [18:51:31] we used a NEW camus directory when doing that [18:51:33] will double check [18:51:43] but also, the offsets are too close [18:51:47] oh [18:51:48] and no [18:51:57] in both cases it was running from the same broker leader [18:52:06] 18/06/14 15:05:09 INFO kafka.CamusJob: eventlogging_SaveTiming uri:tcp://kafka-jumbo1005.eqiad.wmnet:9092 leader:1005 partition:0 earliest_offset:13463018 [18:52:07] ad then [18:52:21] 18/06/14 16:05:10 ... eventlogging_SaveTiming uri:tcp://kafka-jumbo1005.eqiad.wmnet:9092 leader:1005 partition:0 earliest_offset:1503221 [18:52:28] so, one hou rpasses [18:52:59] sorry [18:53:00] truncated ^ [18:53:12] hour 16: earliest_offset:13463018 [18:53:17] ago no [18:53:19] hour 15: earliest_offset:13463018 [18:53:20] then [18:53:25] hour 16: earliest_offset:15032211 [18:53:40] hmmm [18:53:47] checking some things... [18:57:40] nope dunno yet [18:57:52] i was wondering if some of the kafka timestamp work would have caused this [18:58:05] but still don't see how... [18:58:13] but the timings don't line up anyway [18:58:34] on june 7th we bounced kafka brokers to apply timestamp.type=CreateTime [18:58:37] OH [18:58:38] that does line up. [18:58:42] 7 days after june 7th. [18:58:46] aha [18:59:17] thikning thikngin [19:00:40] what does setting timestamp.type=CreateTime mean? [19:01:25] mforns: it tells kafka to use the message's producer set timestamp for the timestamp index [19:01:29] instead of the time the broker receives the message [19:01:39] aha [19:01:49] i'm thikning mabye somethign is weird when you switch this...although it shoudlnt' be [19:01:56] if a message doesn't have a timestamp set on it, it should fall back to append time [19:02:12] time zones? [19:04:09] no shouldn't be [19:04:14] everything is UTC [19:04:17] also, its exactly 7 days after [19:04:28] there are 2 big problems here, EVEN if this happened, Camus should ahve been fine [19:04:33] and/or we should have known about this earlier [19:04:34] I see, so it should be related to that [19:04:42] yea [19:04:43] camus should havae just restarted, either from latest or earliest! [19:04:49] either way something! [19:05:09] oof [19:05:19] is the timestamp of new events since june 7 healthy? [19:05:43] maybe EL client is sending timestamps in a way kafka does not expect? [19:06:50] or maybe just the config change breaks kafka offsets continuity? 
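Two quick ways to sanity-check the CreateTime theory from the command line, sketched with placeholders (the zookeeper connect string is a stand-in, and the %T format token needs a reasonably recent kafkacat):

  # What timestamp-related overrides the topic carries, if any:
  kafka-configs.sh --zookeeper "$ZK" --describe \
    --entity-type topics --entity-name eventlogging_SaveTiming

  # What timestamps the last few messages actually carry (CreateTime means
  # these come from the producer, not the broker's append time):
  kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_SaveTiming \
    -o -5 -e -f 'offset=%o ts=%T\n'
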
[19:07:31] messages in kafka look fine now [19:07:34] there is 7 days of history [19:07:43] i think that what happened is once 7 days past since we changed the setting [19:08:37] hmm acutlaly i'm not srue [19:08:45] mforns: i think the timestamps are all fine [19:08:50] it hink its somethign with us changing the broker setting [19:08:55] and with the messages that were already on broker disks [19:09:11] aha [19:09:13] not 100% yet but still investigating [19:11:18] ok, there is a camus seting that we should have had set [19:11:27] i will set it, and then hopefully fix data for last 7 days... [19:11:29] oof [19:12:35] we maybe can backfill data from EL logs? [19:12:44] or worst case from mysql [19:13:14] ottomata: do ou mind sharing the settings? [19:13:15] yeah [19:13:24] joal: puppet patch incoming will link [19:13:29] thx [19:13:33] Man - This is tricky [19:14:34] yeah i don't fully understand what happened in kafka yet. [19:14:44] and why only some thigns would be affected? [19:17:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is CRITICAL: 2.017e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:17:58] haha [19:18:22] :[ [19:18:24] maybe i spoke too soon in that kafka mm ticket [19:18:36] ebernhardson: were you guys just doing something possibly related to ^^ [19:18:36] ? [19:18:46] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/444043 [19:19:02] https://github.com/linkedin/camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/mapred/EtlInputFormat.java#L413-L440 [19:20:24] ok mm is not broken, just behind [19:20:28] but is catching up it looks like [19:21:48] we have EL logs in /srv/log/eventlogging/archive since 20180604 [19:21:53] great [19:21:55] ok so we can backfill [19:22:06] ottomata: i did that for your ticket, to see if it caused problems :) [19:22:16] i guess the answer is yes [19:22:29] oh haha [19:22:32] well no problems so far! [19:22:37] just big stuff [19:22:41] the instances are running [19:22:44] they haven't died [19:23:38] ottomata: hmm, the graph just keeps climbing though. Does that suggest running longer would be a problem? [19:23:57] ebernhardson: its catching up, no? [19:24:10] ottomata: well, i only paused for 10 minutes. We might pause for 30 or 60 depending on [19:24:26] oh [19:24:33] like if you left it go it would just climb? [19:24:45] we can run more mirror maker instances but it won't really help if your topic only has one partition [19:24:54] ottomata: yes, thats the purpose of that job. When we can't write to elasticsearch for maintenance reasons we send them into the job queue to wait [19:25:14] mforns: fyi https://phabricator.wikimedia.org/T198906 [19:25:49] thx [19:26:03] ebernhardson: is it possible with job queue to use more than one parititon? [19:26:11] is ordering for your job very important? [19:26:24] ottomata: hmm, i have no clue. Ordering isn't important, it's all built to expect out of order [19:26:29] ok great [19:26:33] then i bet it will be ok then [19:26:40] right now a single consumer (and producer) has to handle ALL of those messages [19:26:48] let's ask petr about that [19:27:10] makes sense. We also had some plan for the consumer itself to detect the pause and stop consuming. 
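On the single-partition point above: Kafka only guarantees ordering within a partition, so a one-partition topic caps MirrorMaker (and any consumer group) at one effective worker. Since ordering doesn't matter for that job, raising the partition count is the usual fix; a sketch with placeholders for the topic and zookeeper string (note that keyed messages get re-hashed across the new partitions):

  kafka-topics.sh --zookeeper "$ZK" --alter --topic "$TOPIC" --partitions 6
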
[19:28:58] ok mforns am running camus eventlogging with reseting to earliest [19:29:45] ottomata: Have you seen my comment on our patch? [19:29:53] ottomata, ok [19:30:10] should it catch up since 7 days ago? [19:30:29] ottomata: Given the issue, the solution of your is good - For more regular times, might be interesting to double check dedup and have some comments [19:33:51] joal: ya, true, sorry i did not see that. Refine job data will be fine [19:34:02] it also will only start re-importing for topics that were broken for camus [19:34:13] for those ones, there will be NO data imported since around june 14th [19:34:31] ottomata: for those broken topics, this is very muh needed - Let's just not break the other ones ;) [19:34:32] mforns: yes [19:34:57] indeed, but I think this is the desired behavior in any case, no? duplicate is better than lost data [19:35:13] ? [19:35:14] ottomata, ok, in a couple hours will check if there's some more data in event database for last days of june [19:35:28] k mforns refine will have to run too [19:35:33] it should pick it up automatically [19:35:34] yea [19:35:36] as for backfilling [19:35:37] hm [19:35:42] ottomata: I assume so ... Still thinking about some other kind of solution [19:35:50] i wonder if we can just pipe the data back through eventlogging? [19:35:55] camus will bucket it correctly [19:36:10] ottomata: if we have logs, seems very doable [19:36:14] will it work to kafkacat archived events into a new topic, and use that to camus data into hdfs? [19:36:21] ottomata: dedicated topics and special camus instances [19:36:33] ok [19:36:44] mforns: that would work to yes [19:36:46] too [19:36:49] but would be a bit annoying [19:36:53] yea [19:37:04] why use a new topic? [19:37:14] you think using the same topic will work? [19:37:17] yes [19:37:18] that would be great [19:37:20] they will be new messages in the topic [19:37:24] ottomata: For event* and webrequest, I think we're safe on the dedup side - For mediawiki, mediawiki_job and netflow, I have no clue [19:37:25] with recent offests [19:37:32] and camus will import them, and then bucket them into the proper location [19:37:42] mediawiki_job is just event* [19:37:48] netflow is bascially not active [19:37:48] and is the log format in all-events.log the exact same that is fed to kafka normally? [19:37:57] mediawiki i'm pretty sure is just raw [19:37:58] there is no refinement [19:38:07] they import the avro messages into their hive table [19:38:12] right [19:38:17] so, in that case, if there are duplicate,s they'll get them, yes [19:38:25] looking [19:38:57] ottomata: Let's at least inform - Do you know who we should contact about those? [19:39:00] yes mforns [19:39:00] it is [19:39:02] riight [19:39:07] nice [19:39:11] mforns: a problem though [19:39:23] i'm pretty sure that 'all-events' is not all events [19:39:26] its eventlogging-valid-mixed [19:39:28] ah... [19:39:34] so, only stuff that also goes to mysql [19:39:38] the blacklisted stuff won't be in there [19:39:41] so the high volume hive tables [19:39:42] mmmmmmmm [19:39:45] we should see if they are affected [19:39:48] oof [19:39:52] like virtualpageviews [19:39:53] oh man [19:40:00] fingers crossed on that one... 
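The replay idea being floated here, sketched very roughly: pull the affected schema's events back out of the archived all-events logs and produce them into the same eventlogging_<Schema> topic, where the next Camus run picks them up as new messages and buckets them by their embedded timestamps. Schema name and grep pattern are illustrative, and the backfill actually done later in this log first builds a file to replay from:

  zcat /srv/log/eventlogging/archive/all-events.log-201806{15..29}.gz \
    | grep -E '"schema": ?"SaveTiming"' \
    | kafkacat -P -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_SaveTiming
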
[19:40:01] eeee [19:40:09] joal: we should contact ebernhardson about them :) [19:40:17] k [19:40:37] ottomata, virtualpageview has all hourly partitions [19:40:38] ok mforns it will take a few camus runs to re-import everything [19:40:43] there is a timeout, but it iwll do ut [19:40:46] gonna keep runnig [19:40:49] ok phew mforns [19:40:51] very strange [19:40:56] i wonder why some are affected but not others [19:40:58] but not sure if it has all events though... [19:41:02] checking [19:41:17] ottomata: about what? :) [19:41:28] ebernhardson: we just encountered a strange problem with camus + kafka i don't yet fully understand [19:41:41] but some eventlogging data was affected when importing into hdfs [19:41:44] https://phabricator.wikimedia.org/T198906 [19:41:50] a fix that would keep it from happening [19:41:51] is [19:41:54] https://gerrit.wikimedia.org/r/#/c/444043/ [19:42:00] i applied it for all current camus jobs [19:42:06] because it seems to be the best behavior [19:42:19] we also need throughput alarms in refined data [19:42:23] but, it means that if camus's stored offsets for mediawiki avro data every get out of whack with what kafka has [19:42:38] camus will now restart from the earliest message [19:42:44] which means you might get duplicates [19:43:00] in the error we just encountered, this wouldn't happen, since kafka actually truncated too much data [19:43:02] but [19:43:06] hmm, ok i will have to check the data. I don't think we do duplicate handling yet [19:43:12] that seems to be better than not importing anything at all [19:43:13] ahh, ok [19:43:14] nuria_: in deed [19:43:32] mulitple issues here [19:43:50] https://phabricator.wikimedia.org/T198908 [19:44:12] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:44:12] nuria_: its going to be hard to do for all [19:44:22] but, we could at least monitor the same ones we do in kafka [19:44:30] ottomata: we do not have to do all , we can do a few [19:44:36] aye [19:44:52] ottomata: we used navigation timing before (due to steady flow) and now pagepreviews seems to have also steday flow [19:44:57] *steady [19:44:58] yup [19:46:19] ottomata, histogram for virtualpageview for june looks good: https://pastebin.com/ngbgmNXj :D [19:46:43] great [19:46:47] that's fortunate [19:56:08] mforns: can you help me get a list of all affected topics? [19:56:26] ottomata, already doing :] [19:56:34] you mean schemas right? [19:56:41] yeah [19:56:44] either ya [19:56:49] el schemas == topics so sure [19:57:05] awesome thank you [19:58:05] actually, i might be able to tell by grepping camus logs [19:59:21] i see some job topics affected, but we don't really care about those... 
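One way to do the "grep the camus logs, then pull out the topic" step described above (the log location is a placeholder, this uses grep -o rather than awk, and the error string is the one pasted earlier):

  grep -rh 'Offset range from kafka metadata is outside the previously persisted offset' \
      /var/log/camus/ \
    | grep -oE 'eventlogging_[A-Za-z0-9]+' \
    | sort -u
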
[19:59:40] webrequest is fine [19:59:45] mediawiki eventbus is fine [19:59:46] ok [19:59:47] cool [19:59:52] so just eventlogging analytics stuff [20:01:50] yeah mforns its ok, i think this is easier [20:02:01] it might be easiest to just find the earliest and latest [20:02:04] i think i have a list [20:02:13] oh cool [20:02:16] https://gist.github.com/ [20:02:26] querying the database might take some time [20:02:44] yeah, i jsut grepped the camus logs for anything that gave that error [20:02:50] and then awked for the topic [20:02:54] and the timing seems to be all about the same [20:03:02] been happening to all of those topics for just the last 507 hours [20:03:12] 3 weeks [20:03:34] ottomata, I can not see the topics, I think you copy-pasted the wrong thing [20:03:51] oops [20:03:52] hah [20:03:52] https://gist.github.com/ottomata/188f8b81f07e8e4eeb28476a1c845c6e [20:03:54] ^ [20:04:02] thx! [20:06:18] so mforns [20:06:45] if we replay all events matching those topics/schemas in files between all-events.log-20180615.gz all-and events.log-20180629.gz (inclusive) [20:06:48] i think we should be ok. [20:06:53] camus will reimport everything [20:06:54] aha [20:06:59] and then refine will deduplicate [20:07:04] awesome [20:07:17] i'm going to first do the grep for those and build a file from which to replay [20:07:22] then we can inspect and then replay [20:07:39] I'll see if there's any schema of those which is blacklisted in mysql [20:08:13] k [20:10:04] ottomata: and isn't there any problem with writing to atime partitio that alredy exists? [20:10:07] cc mforns [20:10:40] it will overwrite the files and thus avoid duplicates [20:12:05] ottomata, only SearchSatisfaction and MobileWikiAppiOSSessions are affected and are not in the log archives [20:12:07] mforns: as in the partition will be deleted and redone again even if it exists [20:12:26] nuria_, yes I think so, otto? [20:13:27] nuria_, that's what happens with ELSanitization, I assume all Refine jobs work like that [20:16:00] nuria_: for the refined yes [20:16:04] camus will not overwrite [20:16:07] so there will be dulicates there [20:16:08] but [20:16:14] Refine job will notice that the source has changed [20:16:17] and re-refine the whole hour [20:16:23] including any transformations, including any dedups [20:16:33] ottomata: we also have setting [20:16:36] ott [20:16:38] ottomata: [20:16:41] https://www.irccloud.com/pastebin/IoB4E11C/ [20:16:52] on camus eventlogging template [20:16:56] for camus [20:16:58] yes [20:17:05] HMMMM [20:17:06] hm [20:17:07] hm [20:17:13] good point looking [20:17:38] OHhhh [20:17:43] very good point nuria_ we will need to change that [20:17:44] for sure [20:18:17] setting [20:19:49] we may also need to tell refine to look back farther for changed partitions [20:19:54] at least for a backfilling run [20:21:04] aha [20:21:11] why woudl we ever want to discard? [20:21:11] hm [20:21:16] i can think of no reason to do that [20:21:51] going to set to -1 for event* stuff at leats [20:21:53] least [20:22:26] ottomata: I think that setting is to prevent late arrivals triggering inappropriate recomputation [20:22:40] yeah [20:22:48] but in our case we don't mind! [20:22:57] right? 
[20:23:47] For the present thing, we don't [20:24:03] I'm trying to think of other reasons but can't find :) [20:26:01] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 (10Ottomata) This is the list of affected eventlogging topics: ``` eventlogging_AdvancedSearchRequest eventlogging_CentralAu... [20:26:18] nuria_: thanks for noticing that [20:30:29] ottomata: let me know if you want to talk on batcave for another pair of eyes when rerunning camus [20:30:50] ottomata: i guess we need to fill in kafka 1st [20:31:14] camus is running looking good [20:31:21] it should be caught up with those topics [20:31:29] i'm grepping the logs now, for the offending schemas [20:34:14] ottomata: ok, can do that too [20:34:36] ottomata: what i do not understand (ahem.. what i understand less than other things) is how this would not affect all topics [20:34:44] ottomata: let me see set of hosts [20:39:19] nuria_: i dont' understand that either [20:39:33] oh maybe leaders on particular brokers [20:39:35] yes maybe... [20:40:14] nuria_: on analytics1003 [20:40:17] ~otto/eventlogging.analytics.offset.reset.T198906.log [20:40:18] T198906: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 [20:40:22] has the log messages that caused problems [20:41:03] hm nope [20:41:08] happens for topics with leaders on all brokers [20:41:13] pretty evenly [20:41:24] cat ~/eventlogging.analytics.offset.reset.T198906.log | awk '{print $17}' | sort | uniq -c [20:41:24] 6084 uri:tcp://kafka-jumbo1001.eqiad.wmnet:9092 [20:41:24] 6084 uri:tcp://kafka-jumbo1002.eqiad.wmnet:9092 [20:41:24] 5070 uri:tcp://kafka-jumbo1003.eqiad.wmnet:9092 [20:41:24] 8114 uri:tcp://kafka-jumbo1004.eqiad.wmnet:9092 [20:41:25] 5577 uri:tcp://kafka-jumbo1005.eqiad.wmnet:9092 [20:41:25] 5589 uri:tcp://kafka-jumbo1006.eqiad.wmnet:9092 [20:46:11] ottomata: i see, where is that log? [20:48:10] nuria_: on analytics1003 [20:48:14] in my homedir [20:48:20] ok nuria_ on eventlog1002 [20:48:25] /srv/el-recovery-otto [20:48:56] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 (10Ottomata) Grepping for missing data in all-events.log in time range: ``` cd /srv/log/eventlogging/archive; for f in all-e... [20:48:57] https://phabricator.wikimedia.org/T198906#4401632 [20:53:21] nuria_: bc real quick? [20:53:29] mforns: feel free to join too if you wanna [20:53:33] just want to vet what i'm about to do [20:53:38] sure [20:56:44] ottomata: omw [21:12:55] (03PS14) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [21:27:48] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 (10Ottomata) Hm, I have a theory! This only affected low volume (EventLogging) topics. Assume that the `log.message.timesta...