[00:25:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [00:33:38] 10Analytics, 06Operations, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3066309 (10Krinkle) [02:27:45] 10Analytics: Remove user_agent_map from pageview_hourly long term - https://phabricator.wikimedia.org/T156965#3066542 (10Tbayer) >>! In T156965#3017221, @Nuria wrote: > Browser data is agreggated and kept long term in https://wikitech.wikimedia.org/wiki/Analytics/Data/Browser_general > Yes, the link was already... [02:33:50] 10Analytics: Remove user_agent_map from pageview_hourly long term - https://phabricator.wikimedia.org/T156965#3066566 (10Tbayer) [05:03:19] 06Analytics-Kanban, 15User-Elukey: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#3066705 (10Milimetric) This is done for now, in that the new site is up and reports that we control are syncing to it. The old unused reports have been deleted and now what's left to do is announce a... [05:07:04] 06Analytics-Kanban: Announce analytics.wikimedia.org/datasets and deprecation of datasets.wikimedia.org - https://phabricator.wikimedia.org/T159409#3066706 (10Milimetric) [07:15:41] hi a-team - any ideas about the eventlogging weirdness at https://phabricator.wikimedia.org/T155639#3066465 ? (TLDR: grafana shows events are being sent, but they don't get recorded in the table, or only with a lag of > 2.5 days) [07:25:31] HaeB: from the SAL - 10:34 hashar@tin: Synchronized wmf-config/InitialiseSettings.php: wme: Set ReadingDepth sampling rate to 0.1% - T155639 (duration: 00m 40s) [07:25:31] T155639: Create reading depth schema - https://phabricator.wikimedia.org/T155639 [07:25:53] this is the drop of the ticket [07:26:24] (not sure if related or not) [07:26:56] (I am brb in one hour, cat to the vet again..) [07:27:23] elukey: not really. one can see that drop at https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=ReadingDepth , but it doesn't drop to 0 at all... [07:28:18] elukey: ... also, the table does have events for several hours after that change now https://phabricator.wikimedia.org/T155639#3066800 [07:29:12] will check later on, thanks for the ping! [07:59:05] 10Analytics-Tech-community-metrics, 07Regression: Git repo blacklist config not applied on wikimedia.biterg.io - https://phabricator.wikimedia.org/T146135#3066842 (10Lcanasdiaz) >>! In T146135#3061178, @Aklapper wrote: >>>! In T146135#2815705, @Lcanasdiaz wrote: >> I confirm that blacklist is not working. Work... [08:00:58] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3066844 (10Lcanasdiaz) >>! In T157898#3061177, @Aklapper wrote: >>>! In T157898#3024337, @Lcanasdiaz wrote: >> @Aklapper I confirm this is broken right no... [08:03:09] 10Analytics-Tech-community-metrics: "Email senders" widget empty though "Mailing Lists" widget states that there are email senders - https://phabricator.wikimedia.org/T159229#3066846 (10Lcanasdiaz) a:03Lcanasdiaz It seems we are creating a schema newer than the panel. Working on that. 
[09:00:37] HaeB: I am not an expert but I grepped ReadingDepth in the EL m4-consumer (that should be the one pushing data to mysql) and I can see logs like [09:00:45] eventlogging_consumer-mysql-m4-master-00.log:2017-03-02 08:11:30,284 [9522] (MainThread) Inserted 625 ReadingDepth_16325045 events in 1.148235 seconds [09:01:02] that have a timestamp relatively recent (not sure about the data) [09:01:46] and I also see some Warning: Duplicate entry log for ReadingDepth, but it wouldn't explain this [09:04:17] at this point it might be related to what data is stored in kafka [09:05:17] elukey: interesting, any idea what "Duplicate entry log" might mean? [09:05:44] nope, but it doesn't match the timeline that you put in the task.. [09:05:57] so it probably is a red herring [09:06:16] FWIW, i just checked again and the most recent timestamp in the table right now is still only 20170227215320 [09:06:53] (from SELECT MAX(timestamp) FROM log.ReadingDepth_16325045; ) [09:12:39] HaeB: I've started to tail the ReadingDepth topic on kafka and one of the last events has "timestamp": 1487880599 [09:12:43] that is really weird [09:14:11] 1487880599 is epoch (unix time) for now, 03/02/2017 @ 9:13am (UTC) [09:14:39] ok I made the wrong conversion then [09:14:44] I thought it was 23rd of Feb [09:15:11] elukey: you're right, i misread http://www.unixtimestamp.com/index.php ;) [09:15:35] ah nice :) [09:15:39] still, that's very interesting [09:15:56] ok so in Kafka we have timestamps that are old.. and I bet these are not produced by varnishkafka [09:16:25] (like the dt field in webrequest logs, that is added by varnishkafka) [09:19:03] ah no I am wrong, this topic is not handled by varnishkafka [09:19:07] nevermind [09:19:23] (coffee still needs to flow in my body probably) [09:20:14] so whatever is producing to the ReadingDepth topic is definitely adding events with a weird timestamp [09:29:31] HaeB: all this theory is based on the assumption that my kafkacat command tails the Kafka topic logs. I *think* it does but I am not 100% sure [09:29:47] (otherwise I might be reading from the start of the log retention time) [09:34:03] elukey: in any case, the table itself doesn't seem to contain any timestamps in the wrong (ie. epoch) format: [09:34:07] mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT timestamp FROM log.ReadingDepth_16325045 WHERE timestamp LIKE '14%' LIMIT 100; [09:34:07] Empty set (0.01 sec) [09:35:23] back onto oozie-spark debugging [09:37:33] HaeB: does it get processed and stored like '20170227215320' maybe? [09:38:28] elukey: yes, that's what should happen ;) [09:39:12] (i.e. the timestamps in EL tables are of the form YYYYMMDDHHMMSS ) [09:57:31] 10Analytics, 15User-Elukey: Piwik puppet configuration refactoring and updates - https://phabricator.wikimedia.org/T159136#3067116 (10elukey) [11:02:16] (03PS1) 10Joal: Update oozie mediacounts load job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340707 [11:08:13] joal: o/ - do yo uneed any help? [11:08:44] elukey: no real for help, but I could use a minute of batcave to complain with somebody :) [11:09:06] joal: sure, joining :) [11:55:03] * elukey lunch! 
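The back-and-forth above boils down to three quick checks. A minimal sketch of that loop, reusing the broker, topic, table, and example epoch value quoted in the log; it assumes kafkacat, GNU date, and a MySQL client with the research grant are available on the host:

    # Tail the topic from the end of the log, not from the start of retention
    kafkacat -C -b kafka1018.eqiad.wmnet -t eventlogging_ReadingDepth -o end -c 5

    # Convert an event's epoch timestamp to the YYYYMMDDHHMMSS form stored in MariaDB
    date -u -d @1487880599 +%Y%m%d%H%M%S

    # Newest row that actually made it into the table being queried
    mysql -h analytics-store.eqiad.wmnet -e \
      "SELECT MAX(timestamp) FROM log.ReadingDepth_16325045;"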
[12:07:33] (03PS1) 10Matthias Mullie: Query the most common UploadWizard exceptions & errors [analytics/limn-multimedia-data] - 10https://gerrit.wikimedia.org/r/340720 (https://phabricator.wikimedia.org/T156694) [12:08:31] (03CR) 10Matthias Mullie: "Please do confirm that this config & queries actually produce the kind of results I expect (of which there is an example in the commit msg" [analytics/limn-multimedia-data] - 10https://gerrit.wikimedia.org/r/340720 (https://phabricator.wikimedia.org/T156694) (owner: 10Matthias Mullie) [13:59:17] PROBLEM - Zookeeper node JVM Heap usage on conf1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [972.0] [13:59:29] wow [14:00:48] this is the new alarm! Probably a misfire [14:00:56] k :) [14:01:22] it is not showing up in icinga [14:01:44] but https://grafana-admin.wikimedia.org/dashboard/db/zookeeper?panelId=40&fullscreen&edit looks not far from what the alarm indicated [14:02:15] I am not that worried since no Old Gen collections have happened [14:02:34] I think that the Old Gen grows over time [14:02:43] (03PS3) 10Joal: Remove pagecounts projectcounts from dump-check [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340548 [14:02:55] elukey: --^ I think this is ready to go [14:04:34] (03PS2) 10Joal: Update oozie mediacounts load job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340707 [14:04:38] elukey: --^ this one too [14:04:57] joal: from a very ignorant perspective it looks good (the first), but maybe a commit comment explaining why we are removing it would help? [14:04:58] Then we'll be able to restart mediacounts [14:06:01] same comment for the second one, looks good but I am a bit ignorant :( [14:06:12] (03PS4) 10Joal: Remove pagecounts & projectcounts from dump-check [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340548 [14:10:35] elukey: updated both- Better? [14:10:40] (03PS3) 10Joal: Update oozie mediacounts load job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340707 [14:10:49] thanks :) [14:11:34] joal: from my pov, we can merge [14:11:53] but we might wait for ottomata [14:12:00] since he shold be online soone [14:12:03] *soon [14:12:21] elukey: he told me yesterday we could go for those [14:15:29] joal: super, I would have +2 you but I know that you sometimes prefer to wait for the "mmmmmmmhhhh" confirmation :D [14:15:55] elukey: Let's move if you don't mind ;) [14:16:09] sure :) [14:17:07] (03CR) 10Elukey: [V: 032 C: 032] Remove pagecounts & projectcounts from dump-check [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340548 (owner: 10Joal) [14:17:23] (03CR) 10Elukey: [V: 032 C: 032] Update oozie mediacounts load job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340707 (owner: 10Joal) [14:17:45] joal: going to deploy, all good? [14:18:11] elukey: all good ! [14:18:13] Thanks ! [14:18:27] elukey: let me know when deployed (+hdfs), I'll restart mediacounts [14:18:29] * elukey checks stat1002's disk space [14:18:33] :D [14:26:12] joal: hdfs deployed, you are free to go :) [14:26:21] I am going to grab a quick cup of coffee [14:26:30] awesome elukey [14:26:31] Thanks [14:27:44] !log restart mediacounts job starting 2017-03-01T11:00Z [14:27:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:34:59] mornnninnng [14:35:08] HHHHHHHHHHiiiiiiii [14:35:13] how goes? 
[14:35:23] ottomata: changing approach [14:35:26] joal is working like crazy [14:35:40] I merely cheer for him [14:35:46] don't have any power :D [14:35:56] not truuuuue ! [14:36:00] ottomata: you have probably received a page for Zookeeper, that was a false alarm! [14:36:14] i saw! figured with new heap alarms it was prob fine [14:36:18] ottomata: mediacounts solved - restarted, first two jobs succeded ! Yay [14:36:20] I added the monitors, they had a typo in the metric and for some reason the first good datapoint after the fix was a critical [14:36:23] spark however doesn't look good [14:36:26] ok [14:36:41] want to give me a status update since my email? or should I just letcha keep hammering? [14:36:52] i'm ready to hammer too [14:37:00] (not THE hammer, don't worry elukey) [14:37:26] ottomata: I decided to try not using hivecontext, cheating with parquet path [14:37:35] ottomata: seems to work -0 but it has not solved our problem [14:37:36] ah [14:37:43] but you said there were problems with that too? [14:37:56] ottomata: there are, but easier to cheat [14:38:06] Actually very easier, but kinda ugly [14:38:18] And I wanted hive context to work ;( [14:38:26] yeah [14:38:45] ok, i'm going to beat at that a little bit more... did oyu happen to try manually listing all of hte hive jars? [14:38:50] because now what that means is I need to change wikidata jobs back from hive context to parquet only (very doable) [14:38:55] that was going to be my next thing, although I don't think it will work [14:39:04] ? [14:39:16] ottomata: sory missed al ine [14:39:19] did you see my email? [14:39:22] oh [14:39:23] yes [14:39:27] did oyu happen to try manually listing all of hte hive jars? [14:39:29] I didn't try the jar list no [14:39:32] ok [14:39:33] i'm goign to try [14:39:46] i think i'll also make a simper faster hive context job [14:40:13] ottomata: Howver I managed to do some tricks: by dropping our jackson version number to the oneb spark uses, we get to a new conflict: javax servlet [14:40:21] And that one, I think there is nothing we can do about [14:40:37] oh you mean when wildcarding all the jars in /usr/lib/hive...? [14:40:59] ok ottomata: parquet trick works, will update wikidata jobs in that direction, and we'll be able to work with less pressure [14:41:33] ok cool [14:41:40] yeah that's fine, keep going [14:41:44] i'll see if i can poke at hive context a little more [14:41:49] at least get us a simple text case [14:41:57] it was totally a pain to test this while the cluster was so busy!!! [14:42:15] sure ottomata [14:42:27] unless there's something more useful I can help with right now? [14:48:49] 06Analytics-Kanban: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#3067857 (10elukey) [15:01:56] hello! unless I'm mistaken, it looks like the pageview REST API is missing yesterday's data, despite all the files existing at https://dumps.wikimedia.org/other/pageviews/2017/2017-03/ ? [15:03:37] weird ! [15:07:49] could they just be late? due to busy cluster? [15:07:57] not sure which jobs generate those [15:08:34] according to the directory logs, it looks like the last file showed up at 02-Mar-2017 06:14, which seems like a while ago [15:08:54] unsure what the processing pipeline is beyond that [15:11:22] andrew: thanks, we'll check up on it, we got lots of lag and some broken jobs due to a big cluster upgrade we did yesterday, so likely its related [15:11:47] much appreciated! 
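For context, the "cheating with parquet path" approach mentioned here amounts to pointing a plain SQLContext at the table's files on HDFS instead of resolving the table through the Hive metastore, which sidesteps the hive-jar conflicts entirely. A rough sketch of the idea against CDH's Spark 1.x shell; the input path is illustrative only, not the actual wikidata job configuration:

    # Read the Parquet files behind a Hive table directly, without a HiveContext.
    # Piping the snippet into spark-shell keeps this copy-pasteable from a prompt.
    echo '
      val sqlCtx = new org.apache.spark.sql.SQLContext(sc)   // no Hive metastore lookup
      val df = sqlCtx.read.parquet("hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source=text/year=2017/month=3/day=1/hour=0")
      println(df.count())
    ' | spark-shell --master yarn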
please let me know if there's anything I can do to help [15:24:47] walking back home real quick to be in time for standup... [15:27:21] andrew: This data being is an artifact of us restarting our oozie jobs yesterday [15:27:44] andrew: data will show up later on, as we acutally backfilled february [15:29:45] ah interesting [15:30:18] backfilled Feb with pageviews API data? on our side at least it looked like it was already there [15:31:18] and should I read later on as "in the next couple hours", "sometime this evening", or "sometime in the next few days"? [15:52:11] elukey: hiiii [15:58:29] ottomata: hhhhhhhhhhhhhhhhheeeeeeeeeelloooo [15:59:54] hi a-team (repeat from earlier today) - any ideas about the eventlogging weirdness at https://phabricator.wikimedia.org/T155639#3066465 ? (TLDR: grafana shows events are being sent, but they don't get recorded in the table, or only with a lag of > 2.5 days) [15:59:58] hiiii oh we have to replace an27! [16:00:04] wondering what you think [16:00:08] we could move hue to thorium [16:00:27] and camus job running anywhere really [16:00:28] (elukey already looked into it a bit earlier today, but it looks like we need input from someone familiar with the eventlogging pipeline) [16:00:31] maybe stat1004? [16:00:40] HaeB: i might be able to help...just started stanup though [16:01:43] ottomata: my only concern would be running camus in a host that is not accessed as general purpose by users [16:02:10] ottomata: thanks! if turns out to be something bigger, feel free to make a new task out of it [16:02:14] a-team: standdupppp [16:02:28] ottomata: re: EL - I tried to run kafkacat -b kafka1018.eqiad.wmnet -C -t eventlogging_ReadingDepth this morning but the timestamp was very old (like Feb 23rd) [16:02:29] fdans: https://gist.github.com/jobar/e2eb8238aed401fb55adb320f39728e4 [16:02:43] ottomata: I wasn't sure if I was reading from the start of the logs or from the tail [16:03:16] milimetric: standdupp [16:03:41] (03PS1) 10Joal: Bump CDH version and update spark jobs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/340758 [16:05:07] elukey: camus doesn't really do anythin on the host though, its just aplace to run cron to launch a hadoop job [16:06:40] HaeB: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=now-7d&to=now&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_ReadingDepth [16:06:53] i see events there, but somethign happened on teh 27th where it dropped off a loit [16:07:51] ottomata: yes, the sampling rate was changed; that's not the issue (we already discussed this above: it dropped, but not to 0 ;) [16:36:46] HaeB: we are troubleshooting at this time a big error in the cluster after our upgrade, we will get back to you once cluster is healthy again [16:38:55] it shouldn't be 0, there are events in kafka, should be around 5-10 per second [16:40:32] hmm, butyeah [16:40:33] hm [16:40:35] ottomata: does kafkacat -b kafka1018.eqiad.wmnet -C -t eventlogging_ReadingDepth tail events from kafka? [16:40:39] (03PS1) 10Joal: Update wikidata oozie jobs using spark [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340766 [16:40:45] it should start at end of log [16:40:45] or start from the beginning? [16:40:47] but i see what you are saying [16:40:52] ahh okok [16:40:54] -o latest should make it start from latest [16:40:58] but it shoudl do that by default [16:41:01] and the timestamps are old... 
[16:41:05] yeah [16:41:11] hm [16:41:12] but [16:41:14] if you do [16:41:17] kafka console-consumer [16:41:19] on a kafka host instead [16:41:21] it does the right thing [16:41:24] and i see recent timestamp [16:41:28] s [16:41:31] so there is new data htere [16:41:34] dunno why kafkacat is doing that... [16:41:45] I saw the old timestamp this morning! [16:42:03] HaeB: so, there should be recent data [16:42:13] where are you missing it? [16:42:52] elukey: kafka console-consumer --topic eventlogging_ReadingDepth | pv -l > /dev/null [16:42:56] 5-10 / sec [16:43:23] "timestamp": 1488472936 [16:44:15] ottomata: ? in the table, log.ReadingDepth_16325045 [16:44:28] see the task [16:44:47] or: [16:44:52] SELECT MAX(timestamp) FROM log.ReadingDepth_16325045; [16:44:52] +----------------+ [16:44:52] | MAX(timestamp) | [16:44:52] +----------------+ [16:44:52] | 20170228133701 | [16:44:52] +----------------+ [16:44:52] 1 row in set (0.00 sec) [16:46:44] ok looking [16:46:52] good news is on master at least, max timestamp is 20170302164530 [16:46:58] so looks like replication? [16:47:03] which, unhh [16:49:29] ungh [16:49:33] HaeB: where is the ticket? [16:49:54] ahhh I didn't think about replication! [16:50:04] so m4 master is not where HaeB is reading data [16:50:12] ya [16:50:14] and [16:50:18] this is the awfuli bash replication [16:50:24] ottomata: for now the discussion is part of this one: https://phabricator.wikimedia.org/T155639#3066465 [16:50:28] yes yes I always forget [16:50:30] insert into slave select * from master where timestamp > slave timestamp [16:51:46] so, HaeB it looks like new events are coming in [16:51:50] they are just seriously lagging [16:52:02] not sure if we can help much with that, the mysql setup is just falling apart [16:52:36] https://phabricator.wikimedia.org/T124307 [16:53:07] great, thanks for clearing that up at least (i connected to master and it's indeed up to date there) [16:53:45] so does this mean that you recommend to avoid querying slave for now? [16:54:50] also, why is only this table lagging and others not? [16:55:12] HaeB: that I don't know, especally since there's not much going on on the slave (it seams) [16:55:13] seems* [16:55:26] the eventlogging_sync.sh custom bash replication script is kind of a mess [16:57:16] HaeB: your data is on HDFS, if you care to try it in hive (sorry this isn't better yet) [16:57:18] :) [16:58:00] https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hadoop [16:59:40] or, if you are feeling CLI inclined: your events are in (daily) files in /srv/eventlogging/archive/all-events.log-20170301.gz on stat1003 [17:01:12] ottomata: well yes, we discussed that multiple times already ;) (i tried to get these instructions to work more than a year ago with marcel, but back then we failed at querying more than one partition/hour, or aggregating partitions into a table) [17:01:24] ah ok [17:01:31] sorry about that, yea we want to make all that better [17:01:40] will be doing so soon (next quarter?) but not yet [17:02:18] HaeB: did you try spark? [17:03:05] joal: have you seen [17:03:06] "java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package" [17:03:07] ? [17:03:18] a-tem: milimetric grooming? [17:03:25] i see that in the same job in which i get that derby error [17:03:31] ottomata: That's where I am [17:03:35] oh [17:04:01] ottomata: no, hive. 
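The custom replication described here works roughly like the sketch below: for each table, copy every row newer than the slave's newest timestamp from the m4 master to the slave. This is only an illustration of the logic paraphrased above, not the actual eventlogging_sync.sh, and the host names are placeholders; it also suggests why one busy table can fall days behind while others stay current, since each table is caught up independently in whatever chunk has accumulated since the last pass.

    TABLE=ReadingDepth_16325045
    # Newest row already present on the slave
    last=$(mysql -h analytics-slave.example -N -e \
      "SELECT COALESCE(MAX(timestamp), '0') FROM log.$TABLE;")
    # Dump everything newer from the m4 master and replay it on the slave
    mysqldump -h m4-master.example --no-create-info \
      --where="timestamp > '$last'" log "$TABLE" \
      | mysql -h analytics-slave.example log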
in any case, looking forward to explore this as additional option, but regarding the move it will overall mean a huge conversion effort for existing queries and recipes that are based on mysql, see the recent email threads and https://phabricator.wikimedia.org/T159170 [17:04:13] ottomata: I think the derby one is just spark trying to back flip after failing to connect to real metastore [17:04:13] ya i've been following those [17:04:22] we aren't going to just switch mysql off without worrying about that [17:04:36] ah hmmm [17:11:30] 06Analytics-Kanban: Create robots.txt policy for datasets - https://phabricator.wikimedia.org/T159189#3068215 (10Nuria) [17:13:14] 10Analytics, 10Analytics-EventLogging: Find an alternative query interface for eventlogging Mariadb storage - https://phabricator.wikimedia.org/T159170#3058941 (10Nuria) [17:14:45] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-extensions-General-or-Unknown, 07Technical-Debt: JsonData and EventLogging have multiple classes with the same name - https://phabricator.wikimedia.org/T159079#3068226 (10Nuria) ping @Krinkle can you advise here? Not sure what this ticket is about [17:15:04] (03PS2) 10Joal: Bump CDH version and update spark jobs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/340758 [17:20:12] 10Analytics, 10Pageviews-API: Pageviews API ignores moved pages - https://phabricator.wikimedia.org/T159046#3055240 (10Nuria) This is better solved by serving metrics by page id instead of title. Once we have the mediawiki edit history data available on an API page id and title history can be tracked and thus... [17:22:03] 10Analytics, 10Recommendation-API: productionize recommendation vectors - https://phabricator.wikimedia.org/T158973#3053346 (10Nuria) We are happy to help but expect research to drive it. [17:28:05] 10Analytics, 10EventBus: EventBus logs don't show up in logstash - https://phabricator.wikimedia.org/T153029#3068274 (10Nuria) [17:41:32] HaeB: tables that are written to at different rates have lagged differently in the past [17:42:23] nuria: is a 2.5 day lag standard? [17:43:14] HaeB: it is happen before yes, for the very huge tables that Edit team had [17:43:23] HaeB: the bigger the table the bigger the issues [17:51:23] ottomata: I have confirmation - same job failed then succeeded [17:51:45] * elukey afk team! [17:51:48] o [17:51:50] o/ [17:51:55] joal: OK [17:51:57] that is interesting [17:51:59] do you have the app ids? [17:52:01] ottomata: my little intuition tells me there is misconfig somewhere [17:52:05] we should compare the hosts where they run [17:52:14] I'll find them [17:53:17] ottomata: failed one: application_1488294419903_10677 [17:53:38] joal and this is w/o hivecontext [17:53:53] ottomata: correct [17:53:59] ottomata: successfull one: application_1488294419903_10751 [17:54:07] ok, and same signer information error [17:54:13] correct ottomata [17:55:06] other info: failed oozie-launcher was on analytics1028, succesfull one was on analytics1050 [17:55:09] ottomata: --^ [17:56:07] joal the app id you gave me was the oozie launcher, looking for spark job app ja? [17:56:27] hm i guess we need to look at both? [17:56:36] hmmm probably not though, it shoudl be the spark app master that is having the problem, right? [17:56:37] hm [17:56:41] but oozie the problem [17:56:42] launching [17:57:34] joal, these two? 
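On the "querying more than one partition/hour" difficulty mentioned a bit earlier: with an hourly-partitioned Hive table the usual pattern is a range predicate on the partition columns rather than naming each partition. The database and table below are hypothetical stand-ins for wherever the raw EventLogging import is registered; only the shape of the WHERE clause is the point.

    hive -e "
      SELECT COUNT(*)
      FROM event_db.ReadingDepth          -- hypothetical database/table name
      WHERE year = 2017 AND month = 3 AND day = 1
        AND hour BETWEEN 0 AND 23         -- spans all 24 hourly partitions in one query
    "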
[17:57:43] fail: https://hue.wikimedia.org/oozie/list_oozie_workflow/0004095-170228165458841-oozie-oozi-W/ [17:57:43] succeed: https://hue.wikimedia.org/oozie/list_oozie_workflow/0004121-170228165458841-oozie-oozi-W/ [17:57:43] ? [18:00:04] looks indeed those ones [18:01:20] ottomata: I double checked configuration, they are really the same [18:01:47] joal: i'm still struggling to see where to look, i want to compare jars / versions installed on hosts [18:01:58] i guess i'll first compare hosts where oozie launchers ran [18:02:08] ottomata: sounds reasonable [18:02:34] ottomata: The weird things about this jars collision is that oozie uses it's lib from HDFS !!! [18:02:41] ottomata: makes no sense :( [18:08:39] joal: i see launchers [18:08:50] failed: analytics1051 https://yarn.wikimedia.org/cluster/app/application_1488294419903_10676 [18:08:58] succeeded: analytics1055 https://yarn.wikimedia.org/cluster/app/application_1488294419903_10751 [18:08:59] right? [18:10:08] ottomata: seems right [18:10:25] ottomata: webrequest load job failed without proper reason [18:10:46] ottomata: I suggest you have a quick look at: oozie job --info 0003986-170228165458841-oozie-oozi-W@add_partition [18:11:41] bwaa [18:12:25] ottomata: sounds bad heh [18:14:09] ottomata: might we have gone through this one ? http://stackoverflow.com/questions/24166026/hdp-2-0-oozie-error-e0803-e0803-io-error-e0603 [18:15:16] !log Restarting webrequest load for tect 2017-03-02T15:00Z [18:15:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:15:26] joal [18:15:27] in etherpad [18:15:27] i see [18:15:28] sudo oozie-setup sharelib create -fs hdfs://analytics-hadoop/ -locallib /usr/lib/oozie/oozie-sharelib-yarn [18:15:29] [18:15:29] !!! [18:15:32] maybe right. [18:15:33] looking [18:16:26] hm but oozie does not have a user account on hdfs [18:16:27] which is why [18:16:31] maybe it needs to be owned as hdfs [18:19:01] joal: i want to try recreating the sharelib [18:20:58] i'm not sure what happens to runing oozie jobs though. [18:21:43] hm, i assume they will be fine, the running workflows will have already loaded the sharelib jars [18:21:47] new ones will pick up the new jars [18:22:12] 10Analytics: Add unique devices dataset to pivot - https://phabricator.wikimedia.org/T159471#3068543 (10Nuria) [18:22:24] joal: , i'm proceeding... [18:22:35] !log deleteing and recreating oozie share lib [18:22:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:23:37] ottomata: shouldn't have we stopped oozie jobs? [18:24:14] (i haven't done aything yet) [18:24:20] Ah :) [18:24:22] joal: maybe? [18:24:32] i was about to, but then i was looking at what oozie user account was where [18:24:45] joal: i think running ones will just keep running? [18:24:54] ottomata: don't kniow ? [18:25:01] i assume they load the jars into memory when they launch [18:25:02] ottomata: let's go, we'll fix [18:25:34] hm, there is this hive-partition-add-wf that is in prep right now [18:25:45] oh [18:25:46] its OLD [18:25:47] old old [18:25:48] what is that? [18:25:51] https://hue.wikimedia.org/oozie/list_oozie_workflows/ [18:25:56] https://hue.wikimedia.org/oozie/list_oozie_workflow/0000768-161020124223818-oozie-oozi-W/ [18:26:28] Weird ! [18:26:47] should we kill that? 
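When two launchers with identical configuration behave differently on different NodeManagers, comparing their aggregated YARN logs is usually the quickest way to spot a classpath difference. A sketch using the two application IDs linked earlier in this exchange; it assumes log aggregation is enabled and that the launcher stdout lists the jars it picked up, which the Oozie launcher normally does:

    yarn logs -applicationId application_1488294419903_10676 > /tmp/launcher-failed.log
    yarn logs -applicationId application_1488294419903_10751 > /tmp/launcher-ok.log

    # Compare which jackson/servlet jars each launcher actually saw
    for f in failed ok; do
      grep -oE '(jackson|javax.servlet|servlet-api)[A-Za-z0-9._-]*\.jar' \
        /tmp/launcher-$f.log | sort -u > /tmp/jars-$f
    done
    diff /tmp/jars-failed /tmp/jars-ok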
[18:27:21] ya [18:27:22] i think so [18:27:28] ok, doing the sharelib [18:27:49] killing the weird step [18:29:15] ok, its recreated [18:29:23] but on more inspection, i don't think it will change aything [18:29:39] Do you want me to relaunch my jobs [18:29:40] ? [18:31:04] try one maybe [18:31:12] actually, i'll just try mine, might be easier [18:31:19] k [18:31:28] 0004194-170228165458841-oozie-oozi-W [18:31:42] ! [18:31:43] ? [18:31:46] hdfs://analytics-hadoop/user/oozie/share/lib/lib_20170228165236/spark/parquet-common.jar [18:31:59] oh [18:32:04] that's the wrong sharelib, the old dir name [18:32:05] uh oh [18:32:09] i think i need to restart oozie [18:32:14] or, wait, i'll just rename the one in hdfs [18:33:19] i think i need to restart 2 joal [18:33:25] https://hue.wikimedia.org/oozie/list_oozie_workflow/0004191-170228165458841-oozie-oozi-W/ [18:33:25] k [18:33:27] https://hue.wikimedia.org/oozie/list_oozie_workflow/0004190-170228165458841-oozie-oozi-W/ [18:33:29] going to rerun [18:33:32] ok? [18:33:48] joal: ^? [18:35:46] hmmmmm [18:35:55] joal: why is the application path set to -scap_sync-dirty? [18:35:59] https://hue.wikimedia.org/oozie/list_oozie_workflow/0004190-170228165458841-oozie-oozi-W/ [18:38:03] joal: now for my test job, i'm getting the jackson error again [18:38:09] ah [18:38:10] but also [18:38:11] java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package [18:38:15] gwah [18:38:24] ok, i think the sharelib perms are not hte problem [18:40:47] :( [18:45:14] ok joal, i dunno, hard so say if there are mismatch in configs or jars installed [18:45:16] i don't see any [18:45:16] but [18:45:22] maybe we should brain bounce with my test job [18:45:24] since it always [18:45:25] fails [18:45:26] for the same reasons [18:45:28] and should work [18:45:30] it is hive context [18:45:40] ottomata: I think the change in oozie broke stuff [18:45:48] but you say you get the servlet error without hive context [18:45:53] joal: the sharelib change I just did? [18:45:57] yes [18:46:02] alert emails [18:46:07] it broke 2 cassandra load jobs [18:46:10] but i fixed that and reran them [18:46:23] ottomata: fixed already ? [18:46:24] it broke beacuse we need to restart oozie if we change sharelib, since oozie will tell new jobs where the sharelib is [18:46:26] yes [18:46:26] awesoem :) [18:46:40] https://hue.wikimedia.org/oozie/list_oozie_coordinator/0001592-170228165458841-oozie-oozi-C/ [18:46:46] the jobs reran succesfully [18:47:00] instead of restarting oozie to pick up the new sharelib dir (it names by timestamp of creation) [18:47:07] i just renamed the dir to the old timestamp name from a few days ago [18:47:22] ottomata: ok [18:47:24] if i had left it, more would have broke [18:47:31] makes sens [18:47:32] ok, do you want to bang on this test hive context / servlet problem? [18:47:34] with me? [18:47:57] ottomata: sure, I'll be gone in 1/2 hour I think, but for now let's go ! [18:48:01] batcave? 
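For reference, the sharelib round-trip in this exchange can usually be done without renaming directories or restarting the server: recreate the lib, then tell the running Oozie to rescan. The create command is the one quoted earlier from the etherpad; the admin subcommands assume an Oozie version (4.1 or later) that supports them, and the URL is the default Oozie endpoint:

    # Recreate the sharelib on HDFS (this creates a new lib_<timestamp> directory)
    sudo oozie-setup sharelib create -fs hdfs://analytics-hadoop/ \
      -locallib /usr/lib/oozie/oozie-sharelib-yarn

    # Point the running server at the newest lib_<timestamp> dir, then verify
    oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
    oozie admin -oozie http://localhost:11000/oozie -shareliblist spark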
[18:48:03] ok [18:48:04] ya [19:05:21] ottomata: https://community.cloudera.com/t5/Batch-Processing-and-Workflow/quot-Failing-Oozie-Launcher-key-not-found-SPARK-HOME-quot/td-p/39830 [19:11:14] joal: https://issues.apache.org/jira/browse/OOZIE-2277 [19:12:37] looks like the channel topic should be updated to point to the new log location https://wm-bot.wmflabs.org/logs/%23wikimedia-analytics/ [19:25:15] joal failed webrequest load wf: 0003986-170228165458841-oozie-oozi-W [19:26:27] App Path : hdfs://analytics-hadoop/wmf/refinery/2017-03-01T14.41.28Z--scap_sync_2017-03-01_0002-dirty/oozie/webrequest/load/workflow.xml [19:27:54] * milimetric lies down to recover from meeting marathon thursday [19:29:20] (03CR) 10Ottomata: [V: 032 C: 032] Update wikidata oozie jobs using spark [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340766 (owner: 10Joal) [19:29:25] (03Abandoned) 10Joal: Update oozie mobile apps metrics jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340542 (owner: 10Joal) [19:29:27] (03CR) 10Ottomata: [C: 032] Bump CDH version and update spark jobs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/340758 (owner: 10Joal) [19:30:45] (03Abandoned) 10Joal: Bump CDH version and update spark jobs accordingly [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/340540 (owner: 10Joal) [20:27:29] nuria: heads up I emailed you about notebook100[12] as I'm not sure what the plan is there but they are technically up for refresh [20:34:29] chasemp: ajem... and those are in analytics budget? [20:35:24] nuria: I don't think they are in anyones budget atm for refresh (to my knowledge). It came up as they are older and would be otherwise scheduled for refresh. [20:35:38] they were alloacted from the spare pool possibly at one time [20:35:41] chasemp: ah ok , sorry [20:35:51] chasemp: I am starting to loose track [20:36:43] nuria: no worries, just didn't know who had a handle on that as we (labs) really do not [20:36:48] chasemp: what would be your preference here? [20:37:24] either we say they are going to be decomissioned or analytics needs them and you guys should include budget for refresh I think, I'm not sure what the specs are but I dont't think it's much [20:37:58] they only got put in our column as yuvi is the person we know has been associated, just trying ot make sure they are not forgotten [20:38:00] either way [20:39:28] ok, will add to hw refresh, was just working on that, could you add specs to our hardware doc if i share it with you? (it will be pretty obvious where things go) cc ottomata [20:40:14] ^ chasemp [20:40:31] I'll give it a whirl if you share the doc sure [20:40:40] ottomata: fyi, added priority column for hardware spec [20:45:29] thanks nuria [20:53:49] nuria: ottomata ballparked. my guess is those aer oversized due to historical raisons but you would know better [20:54:09] chasemp: you as me? ahem... [20:54:20] the royal you :) [20:54:26] analytics [20:54:54] in the spirt of https://en.wikipedia.org/wiki/Royal_we [20:55:02] chasemp: yeah they are old analytics nodes [20:55:03] oow [20:55:09] same as the kafka ones we are replacing [20:55:18] ok, SOMEONE knows [21:02:48] joal: apologies if this is a silly question, but the changes just merged will bring up the oozie launcher, which means that the pageviews API jobs are now progressing, right? [21:04:20] ah nevermind, looks like we've got data for 3/1! 
[22:32:59] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-extensions-General-or-Unknown, 07Technical-Debt: JsonData and EventLogging have multiple classes with the same name - https://phabricator.wikimedia.org/T159079#3069274 (10Krinkle) @Nuria The JsonData and EventLogging mediawiki extensions both define a clas... [23:21:35] 10Analytics, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Analytics cluster should connect to elasticsearch over SSL - https://phabricator.wikimedia.org/T157943#3069485 (10debt) [23:22:11] 10Analytics, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Analytics cluster should connect to elasticsearch over SSL - https://phabricator.wikimedia.org/T157943#3021098 (10debt) p:05Triage>03Normal