[00:03:43] hmm, :S
[00:07:36] on the 2 day graph it looks about right, 3k/s near peak and 1.5k/s near low times (evening PST)
[00:07:44] err, /min not /s
[00:21:02] ebernhardson: ah wait one sec
[00:21:33] ebernhardson: ya, it is @ lowest now, ok
[00:27:02] nuria: it's not clear, do i need to provide all events for the schema for that hour for reprocessing, or only the bad ones?
[00:27:12] (well, multiple hours)
[00:27:26] ebernhardson: only the ones not processed
[00:28:04] ebernhardson: if you create the file with events we can probably do the rest
[00:29:14] nuria: i was trying to figure out if i should start with eventlogging-client-side like your docs suggest, or with the event.EventError table which seems more direct, but i guess doesn't have the right info still
[00:29:36] ebernhardson: right, cause that is delayed from prod by a couple of hours
[00:29:50] ebernhardson: it will have it, just not quite yet
[00:29:55] the annoyance is eventlogging-client-side doesn't know which ones are bad :)
[00:30:11] i can probably find them with heuristics though
[00:30:31] ebernhardson: right right! but you can do this tomorrow (on our end there is no rush, not sure about yours)
[00:30:51] no no rush here, actually this data only goes to dashboards anyways. But be good to fix up :)
[00:33:07] ebernhardson: eventerror will have your first batch no?
[00:33:34] ebernhardson: the ones that happened earlier today (when i cut ticket)
[00:35:14] nuria: yea some are there, i can probably make it work either way though and i guess with eventlogging-client-side i don't have to try to re-create the appropriate line format
[00:37:06] i won't have time to finish this tonight anyways, tomorrow!
[00:44:22] ebernhardson: k, sounds good
[06:23:05] away
[06:23:07] ufff
[06:23:10] morning :)
[08:47:00] (CR) WMDE-leszek: "recheck" [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894) (owner: Ladsgroup)
[08:47:17] (CR) jerkins-bot: [V: -1] Use the internal WDQS endpoint instead [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894) (owner: Ladsgroup)
[08:48:33] (CR) WMDE-leszek: "T214894#5318026 says " Java code should retrieve the hostname from the config once added." (the "once added" part has happened if I unders" [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894) (owner: Ladsgroup)
[08:49:36] (CR) Ladsgroup: "> Patch Set 1:" [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894) (owner: Ladsgroup)
[08:52:01] !log manually created /tmp/hive/operation_logs on an-coord1001
[08:52:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:55:08] (CR) WMDE-leszek: "(joking) all the excuses!" [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894) (owner: Ladsgroup)
[09:10:13] (CR) Ladsgroup: "> Patch Set 1:" [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894) (owner: Ladsgroup)
[10:32:52] * elukey lunch!
[12:12:22] hey teammmm
[12:54:02] hey mforns!
[12:54:09] hey elukey :]
[12:54:14] lookin at alarms
[12:55:10] mforns: when did you restart the jobs after the deployment?
[12:55:17] (to get a sense of timing)
[12:55:49] I restarted them on Wednesday around 8-9pm CEST
[12:56:02] except!
[12:56:25] the cassandra one, I restarted on Thursday around 14pm CEST?
[12:58:25] 14pm is kinda redundant ;P
[12:59:09] next time let's !log everything so we have a trace (if needed)
[12:59:26] oh yes... my bad
[12:59:41] nah it is a minor nit :)
[13:00:59] milimetric: let me know when you are online!
[13:01:19] hey
[13:01:24] elukey: sorry
[13:01:28] cave?
[13:02:44] milimetric: sure! We can do it later on, no problem! I am working on oozie debugging :)
[13:02:57] elukey: up to you, I'm around
[13:03:38] milimetric: would it be ok in ~1h?
[13:03:47] elukey: of course, I just saw the edit hourly job fail, weird
[13:03:53] super
[13:04:08] yeah that one makes me sad
[13:04:18] it is super weird
[13:08:20] mforns: java.lang.ClassNotFoundException: org.json.simple.JSONObject is very weird, we deploy that with refinery-core? /me checks
[13:08:21] elukey, good thing is there are other hourly and daily cassandra workflows that finished successfully
[13:08:53] milimetric, are you looking at cassandra or edit_hourly?
[13:09:33] cassandra
[13:12:37] https://www.irccloud.com/pastebin/oFigBpSB/
[13:12:47] reflect("org.json.simple.JSONObject", "escape", regexp_replace(page_title, '${separator}', '')) AS page_title
[13:14:12] mforns: my guess is we need an ADD JAR to something in that HQL, that maybe we were taking for granted before. Or maybe one of the workers is in a weird state where it's missing that from its JDK or something
[13:14:23] oh interesting
[13:14:35] elukey: is that possible? That some workers have it and some don't, and that's why some workflows work and some don't?
[13:15:06] milimetric: I tried multiple runs that should have been running on different workers, and all of them failed
[13:15:10] because we don't explicitly import that class anywhere, it's just assumed to exist in the classpath when you're in the Hive context, so we use the reflect call
[13:15:13] plus why right after this restart?
[13:15:31] elukey: oh, sorry, talking about the cassandra jobs, the other one is weirder
[13:15:45] maybe it's related too, but maybe if we solve the cassandra one it'll give us a clue there too?
[13:16:09] milimetric, maybe it's only for taps jobs that it happens, because the other 2 top jobs are monthly and have not yet been executed after the migration to hive2
[13:16:19] *tops
[13:16:37] the only place we use that class is the daily/monthly tops jobs
[13:16:40] two hql files
[13:16:55] and there's no reference of importing it, so we assumed it was there before and it's not now
[13:16:56] I see
[13:17:07] ok, that's a candidate! :]
[13:18:26] also people triple check my findings to avoid me derailing the investigation on a dead end :D
[13:20:23] elukey, I've been googling for the error you found, elukey, but didn't find anything matching
[13:21:00] I think it is specific to our deployment, since we add the java agent for the prometheus metrics
[13:21:29] I suspect that somehow the -javaagent etc.. option gets added to HADOOP_OPTS when invoking something
[13:21:40] and a copy of the java agent (that tries to bind a port) fires up
[13:21:45] and fails
[13:21:52] now, why only on edit hourly is a mystery
[13:22:34] mforns: is it fine for you if I try to kill/start again the coordinator?
[13:22:48] of course..
[13:22:49] (trying to turn it off and on again :P)
[13:22:59] edit history finished successfully, and that's as complicated a job as we have, so it's Java/Hive related, and somehow Spark avoids any problems
[13:23:27] :) and then if that doesn't work let's turn it off and count to 20
[13:23:42] milimetric: is it normal that the history reduced job works only with 1 mapper?
[13:23:53] uh...
no
[13:24:34] I don't understand where the org.json.simple.JSONObject is coming from, googling it seems to show it's in a package called mkyong
[13:25:10] https://yarn.wikimedia.org/proxy/application_1564562750409_19522/
[13:25:18] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/history/denormalize/coordinator.properties#L60
[13:25:34] oh elukey that's history reduced
[13:25:34] yes
[13:25:51] though... that should also have more than one mapper...
[13:26:40] !log kill/start edit hourly oozie coordinator as attempt to fix a recurrent failure
[13:26:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:36:10] maybe it does make sense to have one mapper for reduced...
[13:37:24] oh yes I was asking just as curiosity :)
[13:40:46] milimetric, where did you find the java.lang.ClassNotFoundException: org.json.simple.JSONObject thing though? I can't find it
[13:41:10] mforns: in the stack trace that luca sent with the Cassandra jobs:
[13:41:22] https://www.irccloud.com/pastebin/AoSH7NWr/
[13:41:39] the last Caused by
[13:42:00] basically Hive failed because "reflect" failed because the class asked of it isn't there
[13:42:45] I can't find where we reference org.json.simple anywhere in our explicit code, so it must be bundled in some other jar
[13:42:54] like hcatalog or one of those
[13:43:05] and the old Hive must have been able to access it, new Hive can't yet
[13:43:34] I mean, I do remember there was some magic early on to set up Hive to have by default a bunch of stuff loaded in its class path, maybe we didn't do that with Hive2 or maybe we have to do it differently or something?
[13:43:37] yeah I agree, I propose to rollback hive2 actions for edit-hourly and cassandra at this point
[13:43:39] elukey: does that ring a bell?
[13:44:05] elukey: sounds good, we can test those more and figure it out
[13:44:16] yes exactly, I think we spent already too much time
[13:44:19] it makes sense it's just those two, because it might be just a few classes
[13:44:20] prepping the rollback
[13:44:37] like JSONObject, which isn't needed by the other jobs, for example, because they don't output JSON blobs
[13:44:38] (PS1) Elukey: Revert "edit-hourly: move oozie coordinator to hive2 actions" [analytics/refinery] - https://gerrit.wikimedia.org/r/527552
[13:44:51] (tops jobs do)
[13:45:00] (PS1) Elukey: Revert "cassandra: move oozie bundle to hive2 actions" [analytics/refinery] - https://gerrit.wikimedia.org/r/527553
[13:45:48] mforns: if you are ok we could review/merge these --^, deploy refinery and then restart the coord/bundle
[13:45:51] really sorry :(
[13:45:57] elukey, yes lookin
[13:48:08] elukey, you generated that commit automatically?
[13:48:17] yes with "Revert"
[13:49:19] (CR) Mforns: [C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/527553 (owner: Elukey)
[13:49:54] (CR) Mforns: [C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/527552 (owner: Elukey)
[13:50:24] elukey, can I merge?
[13:50:52] yes please!
[13:50:58] (CR) Mforns: [V: +2 C: +2] Revert "edit-hourly: move oozie coordinator to hive2 actions" [analytics/refinery] - https://gerrit.wikimedia.org/r/527552 (owner: Elukey)
[13:51:05] (CR) Mforns: [V: +2 C: +2] Revert "cassandra: move oozie bundle to hive2 actions" [analytics/refinery] - https://gerrit.wikimedia.org/r/527553 (owner: Elukey)
[13:52:59] deploying refinery
[13:55:33] thanks!
[13:57:11] !log deploying refinery up to b50a93955952ed863d5ef7703a91ab59f5d979cf (rollback of cassandra and edit_hourly hive2 actions to unbreak production)
[13:57:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:58:01] Analytics: Incorporate Erik's great work on WiViVi into Wikistats 2 - https://phabricator.wikimedia.org/T229665 (Milimetric)
[13:58:21] Analytics: Incorporate Erik's great work on WiViVi into Wikistats 2 - https://phabricator.wikimedia.org/T229665 (Milimetric) p:Triage→Normal
[14:10:58] mmm, continue to all groups is taking a looong time
[14:11:16] ok now
[14:12:27] mforns: need any help?
[14:12:40] not for now thanks!
[14:14:51] ah deploy finished, good :)
[14:16:04] milimetric: we can meet if you have time
[14:16:44] elukey: to the cave!
[14:27:46] !log finished deploying refinery
[14:27:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:28:25] !log restarting oozie bundle for cassandra and oozie coordinator for edit_hourly
[14:28:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:33:41] erroring jobs are re-running now
[14:35:08] super mforns!
[14:35:09] let
[14:35:14] let's see
[14:35:18] yea
[14:35:28] seem to be working fin
[14:35:30] fine
[14:36:20] edit hourly is definitely working
[14:36:23] so hive2's fault
[14:39:51] mforns: we figured it out:
[14:39:52] ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.96.jar;
[14:40:10] if you don't do that in hive or beeline, you get the error when you try to do the reflect call from here:
[14:40:11] ??
[14:40:16] oooh ok
[14:40:19] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/daily/pageview_top_articles.hql#L38
[14:40:31] so, this needs to be added to...?
[14:40:36] I mean in oozie
[14:40:57] oh, in the same query ok
[14:40:58] will do!
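The fix discussed above comes down to issuing an ADD JAR for the refinery-hive jar in the same query that calls reflect on org.json.simple.JSONObject. A minimal Python sketch of assembling that statement follows. It assumes the jar path layout quoted at 14:39:52; the function names and the `artifacts_directory` / `jar_version` parameters are illustrative only and are not refinery or Oozie APIs.

```python
# Sketch only: builds the ADD JAR statement as text; it does not talk to Hive.
# Function and parameter names are hypothetical, chosen for illustration.

def add_jar_statement(artifacts_directory: str, jar_version: str) -> str:
    """Return the Hive ADD JAR statement for a given refinery-hive version."""
    jar_path = (
        f"{artifacts_directory.rstrip('/')}"
        f"/org/wikimedia/analytics/refinery/refinery-hive-{jar_version}.jar"
    )
    return f"ADD JAR {jar_path};"


def with_refinery_jar(query: str, artifacts_directory: str, jar_version: str) -> str:
    """Prefix a query with ADD JAR so both run in the same Hive session."""
    return add_jar_statement(artifacts_directory, jar_version) + "\n" + query


if __name__ == "__main__":
    # Placeholder query; the real jobs call reflect on a page_title column.
    query = "SELECT reflect('org.json.simple.JSONObject', 'escape', 'some title');"
    print(
        with_refinery_jar(
            query,
            "hdfs://analytics-hadoop/wmf/refinery/current/artifacts",
            "0.0.96",
        )
    )
```

Keeping the directory and version as separate inputs means a jar upgrade only changes one value rather than every query that needs the class.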
[14:41:39] I thought you guys were discussing sth else...
[14:41:50] otherwise I'd have joined
[14:42:43] we were supposed to Marcel but got nerd sniped
[14:42:43] :D
[14:42:53] heh ok
[14:43:46] is there a way to add that jar automatically for all hive actions in oozie, or should I add it via query, e.g. by passing the artifact_directory as a parameter?
[14:44:08] cassandra job also worked :D
[14:47:32] test is very simple:
[14:47:33] select reflect('org.json.simple.JSONObject', 'escape', '{"blah"}');
[14:47:46] aha
[14:48:16] I see other jobs do sth similar, by passing artifacts_directory and refinery_jar_version from properties file to hql file
[14:48:26] ok will do
[15:22:47] (CR) Ottomata: swift-upload.py to handle upload and event emitting (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: Ottomata)
[15:31:47] Analytics: Oozie queries that use 'reflect("org.json.simple.JSONObject"...' need refinery_hive jar - https://phabricator.wikimedia.org/T229669 (mforns)
[15:34:07] (PS1) Mforns: Add jar to cassandra jobs for compatibility with hive2 actions [analytics/refinery] - https://gerrit.wikimedia.org/r/527583 (https://phabricator.wikimedia.org/T229669)
[15:38:09] Analytics, Analytics-Kanban, Patch-For-Review: Oozie queries that use 'reflect("org.json.simple.JSONObject"...' need refinery_hive jar - https://phabricator.wikimedia.org/T229669 (mforns)
[16:04:49] so i'm looking at https://wikitech.wikimedia.org/w/index.php?title=Analytics/Systems/EventLogging/Backfilling#Backfilling_a_kafka_eventlogging_%3CSchema%3E_topic and the eventlogging-processor command given is suspicious
[16:05:33] the kafka url starts as 'kafka- confluent://...' and it seems space shouldn't be there? And then it provides parameters at the end, but it uses & instead of ?
as would be expected in a url
[16:12:44] i was bold and edited it...correct as necessary :)
[16:15:12] thanks :)
[16:19:59] Analytics: Set up a deletion timer for netflow data set - https://phabricator.wikimedia.org/T229674 (mforns)
[16:52:14] ebernhardson: is the spike you backfilling? https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=now-2d&to=now&var-schema=TestSearchSatisfaction2
[16:57:09] (CR) Nuria: [C: +1] "Nice troubleshooting." [analytics/refinery] - https://gerrit.wikimedia.org/r/527583 (https://phabricator.wikimedia.org/T229669) (owner: Mforns)
[17:19:32] Analytics: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (elukey)
[17:20:00] mforns: --^
[17:20:25] elukey, :] thx
[17:21:44] Analytics: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (elukey)
[17:37:03] * elukey off!
[17:37:04] p/
[17:37:22] mforns: thanks a lot for the cassandra testing <#
[17:37:24] <3
[17:37:37] it's running!
[17:56:26] nuria: yes
[18:42:34] (CR) Mforns: [C: +1] "I tested this with the hive2 actions and it run without issues!" [analytics/refinery] - https://gerrit.wikimedia.org/r/527583 (https://phabricator.wikimedia.org/T229669) (owner: Mforns)
[18:47:00] nuria: o/
[18:47:35] nuria: I have one last pending question for ottomata in the software engineer JD. Can you check it out and let me know? (I'm guessing ottomata is not around as he's not in IRC).
[18:48:01] nuria: once that's resolved, I'll send the JD to recruiting/HR for next steps.
[18:48:18] and thanks to all of you who worked closely with me on this. it's been super helpful to have you involved and I appreciate your time.
[18:54:35] milimetric: do you have time today to talk about T208612 ? (Early next week is fine, too, if you prefer that.)
[18:54:36] T208612: Release edit data lake data as a public json dump /mysql dump, other?
- https://phabricator.wikimedia.org/T208612
[18:54:58] leila: anytime
[19:01:30] (PS1) Mforns: Revert "Revert "cassandra: move oozie bundle to hive2 actions"" [analytics/refinery] - https://gerrit.wikimedia.org/r/527616
[19:12:13] milimetric: batcave now?
[19:12:53] leila: omw
[19:13:03] mforns: you're working on this, you may want to join cave
[19:13:14] oh, leila! new batcave: https://meet.google.com/rxb-bjxn-nip?authuser=1
[19:13:18] https://meet.google.com/rxb-bjxn-nip
[19:13:22] ok!
[19:13:39] we had to say goodbye to our old one, but I made this short link: http://bit.ly/a-batcave
[19:44:41] leila / mforns: https://phabricator.wikimedia.org/T208612#5388718
[19:44:47] Analytics, Analytics-Kanban, Research-Backlog: Release edit data lake data as a public json dump /mysql dump, other? - https://phabricator.wikimedia.org/T208612 (Milimetric) Rough draft of a blurb about why this dataset is useful: NOTE: A history of activity on Wikimedia projects as complete and re...
[19:44:56] ah, sorry it's late for Marcel, have a good weekend!
[20:46:26] Analytics, Research: Evaluate best format to release public data lake as a dump - https://phabricator.wikimedia.org/T224459 (leila)
[22:30:49] hey! is there any way to know what is the earliest message in a specific kafka topic?
[22:31:33] from command line (e.g. kafkacat)