[02:09:05] hey Ironholds [02:09:06] you around? [02:09:11] or halfak [02:09:28] or nuria [02:10:36] Analytics, operations: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1240666 (BBlack) Note that aside from the hung task stuff above, there was no final kernel crash output or anything, and other "normal" logging continues through about 01:47. When icinga alerted on all... [02:43:04] Analytics, operations: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1240688 (Ottomata) Yeah this is very strange. This is the 4th node we have had this happen to in the last 2 weeks or so. (Well, 1016 happened today, and we are not sure that the same thing happened th... [03:02:18] Analytics, operations: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1240695 (Ottomata) Note that so far, only the older of the Dells in the cluster have crashed. analytics1011-analytics1020 [12:37:11] (PS1) KartikMistry: Add cs, el, kk and zu languages [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/207063 [13:02:28] o/ joal & milimetric [13:02:33] Heya ! [13:31:21] joal: morninnnng! [13:41:31] ottomata: Moooorning as well ! [13:41:35] wassup ? [13:41:45] impala! is workkiiing. [13:42:07] it is a bit annoying, have to set REQUEST_POOL (queue name) manually, and have to specificy an impalad to connect to [13:42:11] want to talk to you a bit about that [13:42:22] but, also, how to test it? like, what use case should we test it out for right now? [13:42:33] impala kinda needs to do the compute stats things on tables before they are useable [13:42:39] i ran that for the wmf.mediacounts table [13:43:32] is this Give me amin, and let batcave ! [13:43:35] ottomata: --^ [13:44:37] ook [13:45:19] there. [14:50:34] joal, I just invited you to a meeting with the altiscale folks. I will be pretty late for you, so don't feel obligated. [14:50:44] k, thx :) [14:50:51] I figured it would be a good chance to push on any open tickets (e.g. spark logging) [15:33:05] joal: standup! [15:35:57] Thx ottomata [15:36:41] (CR) KartikMistry: [C: 2] Add cs, el, kk and zu languages [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/207063 (owner: KartikMistry) [15:36:54] Analytics-Cluster, Analytics-Kanban, Performance: Implement Unique Clients report on cluster using x-analytics header & last access date {bear} [13 pts] - https://phabricator.wikimedia.org/T92977#1241794 (kevinator) a:madhuvishy [15:44:15] (Merged) jenkins-bot: Add cs, el, kk and zu languages [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/207063 (owner: KartikMistry) [16:00:40] Analytics-EventLogging, Analytics-Kanban, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1241891 (Tgr) declined>Open Sending a warning message when the logging call fails does not fix the issue o... [16:09:55] milimetric, how should I continue with new EL patch? [16:10:02] milimetric, shoould I deploy it? [16:10:39] milimetric, oh! I'm looking at your comments [16:40:31] milimetric, added some comments, thanks! [16:52:40] changing locations, back in a bit [17:08:19] mforns: checking now, how long are you around? 
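For reference, the Impala rough edges mentioned above (having to point at a specific impalad, setting REQUEST_POOL by hand, and running COMPUTE STATS before a table is really usable) can be sketched in a few lines. This is an illustration only: the impyla client, the hostname, and the pool name are assumptions, not taken from the channel; the same statements can equally be run from impala-shell.

```python
# Illustrative sketch only. The impyla client, the impalad hostname and the
# pool name are hypothetical; wmf.mediacounts is the table mentioned above.
from impala.dbapi import connect

# Connect to a specific impalad, as noted above.
conn = connect(host='analytics1026.eqiad.wmnet', port=21050)  # hypothetical host
cur = conn.cursor()

# The admission-control queue (REQUEST_POOL) has to be chosen by hand for now.
cur.execute("SET REQUEST_POOL=default")  # hypothetical pool name

# Impala wants table/column statistics before it plans queries well.
cur.execute("COMPUTE STATS wmf.mediacounts")

# Trivial smoke test against the freshly analyzed table.
cur.execute("SELECT COUNT(*) FROM wmf.mediacounts")
print(cur.fetchone())

cur.close()
conn.close()
```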
[17:08:35] milimetric, for 5 more hours [17:14:30] Analytics-Engineering, MediaWiki-API, Wikipedia-Android-App, Wikipedia-iOS-App: Add page_id and namespace to X-Analytics header in App / api requests - https://phabricator.wikimedia.org/T92875#1242164 (bd808) [17:31:59] mforns: ok, merged, good work. We should deploy it? [17:32:02] I've never deployed... [17:32:18] milimetric, did you apply any changes? [17:35:12] mforns: no, agreed with your comments [17:35:28] milimetric, I also never deployed [17:35:46] ok :) so let's read the docs and meet up in the batcave afterwards and try it together? [17:36:12] I suppose that puppet pulls latest changes to the repo, so the only thing we'd need is execute setup script and restart? [17:38:30] milimetric, the repo is not up to date yet [17:38:51] milimetric, cool! didn't get your last message [17:43:43] mforns: i can't find any documentation, you know where it is? [17:43:55] milimetric, https://wikitech.wikimedia.org/wiki/EventLogging#Deploying_EventLogging [17:46:58] :) I was looking on mediawiki of course [17:47:03] that doc sprint can't come soon enough [17:48:22] ottomata: Do we have access to spark 1.3 in our codebase ? [17:49:12] ottomata: mforns and I are about to try to deploy EL. Anything we should know since it's been moved to eventlog1001? Does the process still go roughly as outlined in https://wikitech.wikimedia.org/wiki/EventLogging#Deploying_EventLogging ? [17:49:23] milimetric: process should be the same [17:49:32] joal: it is not officially instaleld, but I do have it in my homedir on stat1002 [17:49:36] you can use it [17:49:37] I can update small differences, but we do git deploy from tin, then go to eventlog1001 and hafnium and build, start, stop [17:49:48] milimetric, aha [17:49:51] yes [17:49:59] ok, cool, mforns you wanna jump in the batcave now and do it? [17:50:01] milimetric, batcave? [17:50:03] ottomata: https://issues.apache.org/jira/browse/SPARK-4987 [17:50:03] ok [17:50:05] k :) [17:50:12] * joal facepalm [17:50:35] oh yeah [17:50:48] what's that do to us with current spark? we can't read the table at all? [17:50:56] nope [17:51:06] doh, ok. [17:51:06] breaks completely on my side [17:51:07] heh [17:51:17] I wonder how Nuria has managed it ... [17:51:20] ok, welp, i guess I should install 1.3 more officially then, outside of packages. [17:51:30] you are using sqlContext.parquetFile? [17:51:33] Would help for sure [17:51:41] maybe if you only select certain fields? [17:51:46] Sorry man :-S [17:51:57] Well, I don't even look at timestamp [17:52:04] hm [17:53:53] how should I use your spark for spark-shell ? [17:53:55] hmmm. probably only for spark-submit I guess [17:54:30] you can do it with spark-shell [17:54:30] um [17:54:45] cd /home/otto/spark-1.3.0-bin-hadoop2.4 [17:54:47] bin/spark-shell [17:54:49] i think that should do it [17:54:53] cool [17:55:26] uhhh [17:55:28] but i just did that [17:55:30] and got java.lang.UnsupportedOperationException: Parquet does not support timestamp. See HIVE-6384 [17:56:25] hive 0.14 :( [17:56:48] oof right [17:57:04] hm, well, hm [17:57:09] wonder if that is just spark then [17:57:13] maybe we can put 0.14 on classpath [17:57:15] trying... [17:57:46] arrrf, that's bad, even if you don't read the thing, it breaks :( [17:59:23] hm, wait no, spark-assembly.jar has hive . parquet classes in it [17:59:43] would assume that would mean 1.3 would have hive 0.14 deps included [18:01:51] haha, joal, or maybe we should just remove ts for now????? :p [18:01:59] oof [18:02:04] Yeah ... 
I wonder as well :) [18:02:05] sorry bout that [18:02:25] It's painfull, but is it as much as not having data workable ? [18:02:49] Biggest concern here : would have to replicate almost all of april data :( [18:02:56] mrff [18:03:02] oh, ha joal [18:03:03] http://blog.cloudera.com/blog/2015/04/cloudera-enterprise-5-4-is-released/ [18:03:07] reading to see if it has what we need [18:03:15] • Apache Spark 1.3 [18:03:17] • Apache Hive 1.1 [18:03:20] should be good! [18:03:24] Ok, Indeed ! [18:03:58] Man ... That timestamp stuff was really a bad mistake ... [18:04:07] i mean, not quite a mistake [18:04:10] it was the right decisison! [18:04:19] we are just ahead of the times :) [18:04:22] I hope it'll work smoothly to update to 5.4 ! [18:04:37] huhuhu --> No regression test, that's not so good ;) [18:05:15] Let me know if you need help on that [18:05:42] k, reading release notes now [18:07:06] ottomata: so we started git deploy sync from tin, but it said 0/4 minions completed the fetch [18:07:10] also just in time! [18:07:10] Added Spark action which lets you run Spark applications from Oozie workflows [18:07:15] joal^ [18:07:18] hm, milimetric, weird [18:07:21] then we checked EL on eventlog1001 and hafnium and it's in detached head [18:07:27] hmm [18:07:32] ottomata: Hurroay :) [18:07:45] even weirder, what's checked out on hafnium is different from what's on eventlog1001 [18:07:52] ottomata: as you said, jsut ahead of time, one version ! [18:08:20] ottomata: should we just check out master on both machines and try git deploy sync again? [18:08:30] detached head is fine, right? that's how git deploy works? [18:08:31] not sure. [18:08:34] (we're in the batcave trying to figure this out if that's easier) [18:08:36] um, yeah, milimetric that won't hurt [18:08:38] that's what I would try [18:08:41] k [18:09:46] ottomata: it says hafnium is 351 commits behind! [18:09:51] what role does that box play? [18:10:09] awww, joal [18:10:09] The following is not yet supported in a production environment because of its immaturity: [18:10:09] • Spark SQL (which now includes dataframes) [18:10:17] milimetric: i think it is for monitoring? [18:10:25] :((( [18:10:26] k, thx [18:10:31] ottomata: --^ [18:11:58] maybe it is included, but just not officially cloudera supported, joal [18:12:05] they said the same for 1.2 [18:12:08] and we use it :) [18:13:54] ottomata: I think that's what it is, but still [18:14:12] ayway, let's try, it should solve the hive timestamp issue as well [18:15:32] k, lemme finish up something, will go through upgrade in vagrant and then in labs and see how it goes [18:15:38] milimetric: how goes? [18:15:48] ottomata: You are my savior ! [18:15:49] ;) [18:16:03] ottomata: we just merged on hafnium, git deploy had done the fetch part [18:16:09] and i guess just mis-reported that it failed [18:16:20] the merge went ok now we're checking graphite to see that it didn't break anything [18:17:10] milimetric: sometimes with git deploy i wait and ask it to retry, wait a few secs and show report again [18:17:16] sometimes it just takes longer than git deploy wants [18:17:21] gotcha [18:33:17] ottomata: hafnium says "fetch status: 128" in git deploy [18:33:50] and I get "fatal: loose object" when I try to pull from hafnium [18:34:37] sheesh y u gotta have problems?! [18:34:41] with you in 1 min... [18:35:53] milimetric: can I poke around? 
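Returning to the Parquet timestamp problem discussed earlier in this session (SPARK-4987 / HIVE-6384): the failure appears as soon as Spark converts the Hive table schema, not only when the timestamp column is read. A minimal sketch of the failing access pattern is below, written against the Python API for consistency with the other examples; the conversation above used the Scala spark-shell from /home/otto/spark-1.3.0-bin-hadoop2.4, and the table, column and partition names here are assumptions.

```python
# Sketch of the failing access pattern; table/column/partition names are
# hypothetical. Intended for the Spark 1.3.0 build referenced above.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="parquet-timestamp-check")
hc = HiveContext(sc)

# At the time, this failed with
#   java.lang.UnsupportedOperationException:
#   Parquet does not support timestamp. See HIVE-6384
# even though the query never selects the timestamp column (`ts`): the Hive
# support bundled with the Spark build predated Hive 0.14's Parquet TIMESTAMP
# handling, and the whole table schema is converted up front (SPARK-4987).
df = hc.sql("""
    SELECT uri_host, http_status
    FROM wmf.webrequest
    WHERE year = 2015 AND month = 4 AND day = 28 AND hour = 0
    LIMIT 10
""")
df.show()
```

The options weighed above follow from this: drop the `ts` field (and regenerate roughly a month of April data), or move to a stack where both Spark 1.3 and Hive 0.14+ are available, which is what the CDH 5.4 upgrade is meant to provide.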
[18:36:05] ottomata: sure, come to the batcave [18:42:37] joal, ja, known fix in 5.4: — Hive's Decimal type cannot be stored in Parquet and Avro [18:43:03] ottomata: ok, but timestamps ? [18:43:13] oh [18:43:14] sorry [18:43:24] ;) [18:44:05] ah, from 5.3.2 [18:44:05] yes [18:44:14] nice :) [18:48:20] Analytics-Cluster: Upgrade Analytics Cluster to CDH 5.4.0 - https://phabricator.wikimedia.org/T97453#1242768 (Ottomata) NEW a:Ottomata [18:51:57] Time for me to say goodnight ! [18:52:05] Will catch up with you ottomata tomorrow [18:52:57] goodnight! [18:53:52] milimetric, did you !log the eventlogging1001 deployment? [18:56:40] mforns: yes, in -ops [18:56:58] milimetric, ok! [20:24:12] Hey folks. It looks like EL is about 2 hours behind. [20:24:19] FYI. [20:24:25] * halfak waits for his events to show up. [20:52:21] milimetric, eventlogging is crashing [20:54:15] mforns: uh oh [20:54:41] milimetric, sudo tail -f /srv/log/upstart/eventlogging_consumer-mysql-m4-master.log [20:57:45] mforns: hm... so somehow invalid events are getting past the processor? [20:57:51] or something's wrong with the consumer you think? [20:58:03] milimetric, I don't know... [20:58:11] all-events.log seems to be showing events [20:58:20] milimetric, aha [20:58:24] graphite looks like it has a downward trajectory but didn't seem to crash yet [20:59:07] it looks normal historically actually [21:01:05] milimetric, yes [21:01:22] milimetric, I think the problem is in schema.py http_get_schema [21:02:02] milimetric, line 65 [21:02:20] * milimetric looking [21:03:28] i forget if you changed that scid in your patch [21:03:38] milimetric, like the retrieved schema was not valid [21:03:47] yeah, it looks like it [21:04:58] milimetric, I'm going to change the code in production to print the schema when the schema validation raises error, makes sense? [21:05:16] yeah [21:05:19] ok [21:05:25] so I'm looking at this line: [21:05:25] scid, scid_events = events_batch.pop() [21:05:39] would that scid be incompatible with what get_schema(scid) is looking for? [21:05:50] sorry [21:05:57] get_table(meta, scid) [21:08:41] milimetric, do not follow [21:08:52] mforns: the all-events log stopped [21:09:12] milimetric, I restarted eventloggingctl [21:09:32] but should be running now [21:09:50] hm, I don't see events on it [21:10:29] yeah, graphite's showing the dip too [21:10:47] we still have the raw logs so that's ok [21:12:44] mforns: yeah, let's go in the batcave [21:12:49] ok [21:14:31] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 40.00% of data above the critical threshold [1800.0] [21:51:59] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL Stopped EventLogging jobs: reporter/statsd consumer/server-side-events-log consumer/mysql-m4-master consumer/client-side-events-log consumer/client-side-events-kafka-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events-kafka processor/client-side-events forwarder/8422 forwarder/8421 [21:59:20] milimetric, this is the url that EL uses to get schemas: http://meta.wikimedia.org/w/api.php?action=jsonschema&title=MobileWikiAppArticleSuggestions&revid=11448426 [21:59:32] milimetric, required comes already empty [22:02:57] milimetric, maybe this change: https://gerrit.wikimedia.org/r/#/c/207274/1 [22:18:35] kevinator, yt? do you know who from mediawiki-core I can contact for a problem in mediawiki API that is breaking EventLogging? 
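The failing ingredient isolated above can be reproduced with a few lines of stdlib Python: fetch the same jsonschema API URL that EventLogging's schema.py (http_get_schema) uses and look at the per-property `required` flags. The validation step at the end is an illustration of why the consumer started throwing errors: EventLogging schemas are JSON Schema draft-3, where per-property `required` must be a boolean, so the mangled response no longer passes a draft-3 check. The exact code path inside schema.py is not shown in the log and may differ in detail.

```python
# Reproduce the observation above: while the API regression was live, the
# per-property `required` flags came back as "" instead of true.
# (Python 3 shown; the EventLogging codebase itself was Python 2.)
import json
from urllib.request import urlopen

import jsonschema

URL = ("https://meta.wikimedia.org/w/api.php?action=jsonschema"
       "&title=MobileWikiAppArticleSuggestions&revid=11448426")

schema = json.load(urlopen(URL))

for name, prop in schema.get("properties", {}).items():
    # During the incident this printed '' for every property that the
    # Schema: page marks as required: true.
    print(name, repr(prop.get("required")))

# Draft-3 requires `required` to be a boolean, so while the regression was
# live this raised jsonschema.exceptions.SchemaError, consistent with the
# errors seen in the mysql consumer log.
jsonschema.Draft3Validator.check_schema(schema)
```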
[22:19:09] * YuviPanda is curious too, re ^ - now that mediawiki-core does not exist [22:19:38] YuviPanda, hehe do you know someone that can help me? [22:19:53] mforns: usually, if in doubt ask bd808 and he’ll point you to someone [22:19:58] that applies to life in general [22:20:05] YuviPanda, xD ok [22:20:08] thanks! [22:20:46] mforns: what [22:20:51] what's up? [22:20:52] bd808, I think there's a recently deployed problem with mediawiki API [22:20:58] hi bd808! [22:21:11] probably. Brad's been cleaning up a couple of things today [22:21:31] wmf3 had some bigish changes apparently [22:21:37] this url was returning boolean true values normally as of 2 hours ago [22:21:55] bd808, http://meta.wikimedia.org/w/api.php?action=jsonschema&title=MobileWikiAppArticleSuggestions&revid=11448426 [22:22:13] look at the 'required' fields [22:23:11] bd808, they are empty, and if you look at the corresponding page: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppArticleSuggestions [22:23:17] they should be true [22:24:09] bd808, thanks for pointing me to him, I already contacted him, because I thought the problem had something to do with: https://gerrit.wikimedia.org/r/#/c/207274/1, but he seems out [22:25:10] * bd808 had a wifi drop [22:25:47] bd808, can you read the previous messages? [22:25:51] Brad is off for the day by this time generally. He's on EDT (3 hours ahead of SF) [22:26:07] yeah. znc for the win :) [22:26:24] bd808, ok, the problem is completely breaking one of our main systems.. [22:27:05] there were changes to the API output formatting. let me see if I can find somebody to help track this down [22:27:18] bd808, thanks! [22:32:21] mforns: legoktm is going to help track this down. He's got a good idea what's wrong [22:32:30] have you opened a bug yet? [22:32:43] bd808, thanks for the help [22:32:51] bd808, no, no task yet [22:32:55] see, ask bd808 and all problems go away :) [22:33:07] mforns: can you write one up pretty please? [22:33:11] bd808, do you know in which project should I create that [22:33:18] bd808, sure! [22:33:23] what owns the jsonschema module? [22:34:18] mforns: hi, are there any other keys besides 'required' that are being affected? [22:34:31] (JsonSchema* is in EventLogging) [22:34:38] legoktm, hi! I don't know [22:34:52] mforns: tag the bug with EventLogging and mediawiki-api [22:34:54] legoktm, it seems that it only happens when the original value is true [22:35:02] bd808, sure [22:35:03] yes, it does that for any boolean [22:35:09] ok [22:36:04] we need a whitelist of all the acceptable boolean parameters [22:36:12] https://github.com/wikimedia/mediawiki-extensions-TemplateData/commit/6580dc3e872d20750c8ef624299af35952caefa9 is how we fixed this in a different extension [22:37:37] delightful that the fix is a "hack" :/ [22:40:04] legoktm, this would mean that everytime a team creates a schema with a new boolean value, we need to ping you to add a new field to that whitelist? [22:40:13] umm [22:40:27] :| [22:40:36] where's the code that's accessing this? [22:40:54] it might just be easier to switch to formatversion=2... [22:41:19] https://meta.wikimedia.org/w/api.php?action=jsonschema&title=MobileWikiAppArticleSuggestions&revid=11448426&formatversion=2 [22:41:23] legoktm, https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/schema.py#L62 [22:41:39] okay nope :P [22:42:22] legoktm, formatversion=2 seems to work! [22:43:32] legoktm, is formatversion=2 a deprecated version? 
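The formatversion=2 suggestion above is easy to check side by side: the same request with formatversion=2 keeps real JSON booleans, while formatversion=1 (the default the consumer was using) rendered `true` as an empty string. A quick comparison sketch, illustration only; as the conversation just below notes, formatversion=2 was still alpha at the time, so the eventual fix went into core rather than switching EventLogging over.

```python
# Compare the two API format versions for the same schema request.
import json
from urllib.request import urlopen

BASE = ("https://meta.wikimedia.org/w/api.php?action=jsonschema"
        "&title=MobileWikiAppArticleSuggestions&revid=11448426")

for fv in ("1", "2"):
    schema = json.load(urlopen(BASE + "&formatversion=" + fv))
    required = {name: prop.get("required")
                for name, prop in schema.get("properties", {}).items()}
    # While the regression was live: formatversion=1 gave '' here,
    # formatversion=2 gave True.
    print("formatversion=" + fv, required)
```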
[22:43:42] it's alpha [22:43:44] legoktm: can jsut change this right? -- https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/schema.py#L31 [22:44:13] yeah, we can change to formatversion=2 but if it's alpha we'd probably rather think some more, since this is fairly critical [22:44:27] legoktm, bd808, yes we could easily change that to use formatversion=2 [22:44:39] hi milimetric [22:44:42] hi :) [22:44:45] was just catching up [22:45:00] is it wrong that I feel relieved we didn't screw this up this time? :) [22:45:06] I don't think that's a good idea yet [22:45:07] https://gerrit.wikimedia.org/r/207297 [22:45:29] how do we figure out if any other keys were using raw booleans? [22:45:42] mwsearch? [22:45:50] there aren’t many Schema: things [22:45:59] you mean mwgrep? wouldn't really help here.... [22:46:00] legoktm: a bunch of other keys do it, basically EventLogging allows any schema to have any key that uses booleans [22:46:06] hmm [22:46:10] so, essentially, the list of fields is dynamic [22:46:10] milimetric: do you have an example? [22:46:21] sure, sec [22:47:03] legoktm: https://meta.wikimedia.org/wiki/Schema:NavigationTiming look for isAnon [22:47:03] legoktm, https://meta.wikimedia.org/wiki/Schema:MultimediaViewerNetworkPerformance [22:47:22] but that's not the point, EventLogging allows anyone to create a new schema with fields of type boolean [22:47:35] umm [22:47:42] both of those only have 'required' as boolean [22:48:03] the schema would mark it as type boolean right? [22:48:05] doh [22:48:12] I'm asking in the JSON of the schema, which of those fields are a boolean. Not in the data the schema represents [22:48:18] right, the events themselves would have raw booleans [22:48:19] sorry [22:48:29] uh.... yeah, i guess right now it's just required [22:48:55] ok, then https://gerrit.wikimedia.org/r/207297 will fix it :) [22:49:03] is there a bug for this? [22:49:13] and is this unbreak now or? [22:49:14] legoktm: what's the timing on that, when could we expect to see good data again? [22:49:24] well, EventLogging is down without some kind of fix [22:49:28] it's UBN [22:49:33] so we need to know if we should do formatversion=2 [22:50:14] as soon as I can find someone to +2 it I can deploy it [22:50:19] can you guys recreate the problem in beta? We can merge, see the fix there and then backport to wmf3 [22:50:47] bd808: checking beta now [22:50:53] mforns: feel free to go to sleep, I'm ok with this [22:51:05] milimetric, ok, just creating a task [22:51:10] cool [22:51:33] legoktm: +2 {{done}} [22:51:43] thanks [22:51:51] * bd808 should have waited for mforns to reproduce [22:52:23] I reproduced and tested it locally [22:53:24] legoktm, bd808, so no phab task needed? [22:53:28] legoktm / bd808: if we deploy this to beta, and events start showing up in the validated log, then the issue is fixed [22:53:40] mforns: we should have a task [22:53:41] right now, no events are getting added to that log [22:53:41] uh, please file a bug for this [22:53:47] sure [22:54:48] legoktm, bd808, milimetric, here: https://phabricator.wikimedia.org/T97487 [22:56:24] added the wrong reponses to the task description [22:56:48] legoktm / bd808: hm, beta seems fine now, events were coming in when I hit some pages [22:57:02] but i'm unclear if this breaking API change was deployed there [22:57:15] milimetric: we merged the core change. 
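As a footnote to the question above ("how do we figure out if any other keys were using raw booleans?"): the conclusion in the channel was that, in the schema JSON itself, only `required` is boolean-valued. A small, purely hypothetical helper like the one below, run over the raw Schema: page JSON (or the formatversion=2 API output), would confirm that mechanically; nothing like it appears in the log.

```python
def boolean_keys(node, path=""):
    """Yield dotted paths of keys whose values are raw JSON booleans.

    Hypothetical helper: run it over a schema document (e.g. the
    Schema:NavigationTiming JSON mentioned above) to list every place a
    boolean appears; for the EventLogging schemas discussed here that is
    only the per-property `required` flag.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child = path + "." + key if path else key
            if isinstance(value, bool):
                yield child
            else:
                yield from boolean_keys(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from boolean_keys(item, "%s[%d]" % (path, index))


# Example usage (schema loaded however is convenient):
# print(sorted(boolean_keys(schema)))
```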
it should be deployed there now [22:57:22] ah, ok [22:57:29] I'll verify though [22:58:36] milimetric: confirmed that "6c0ae4d API: Force 'required' key to use bools in formatversion=1" is the HEAD of extensions/EventLogging in beta cluster now [22:58:59] https://meta.wikimedia.org/w/api.php?action=jsonschema&title=MobileWikiAppArticleSuggestions&revid=11448426 looks good again [22:59:15] also... you guys should have some tests in beta for this sort of thing. It would have been broken there for >1 week [22:59:17] \o/ [22:59:45] bd808: odd, we did test in beta a bunch [22:59:58] thanks tons legoktm :) [23:00:04] :) [23:00:09] thanks bd808 & legoktm [23:00:17] thanks guys! [23:00:38] mforns: so i'll undo your change to schema and restart [23:00:54] milimetric, I think the mediawiki code was deployed after our deploy today [23:01:00] ok [23:01:10] yeah, but it was supposed to be in beta for a while, is what I think bd808 said [23:01:18] and you tested there over the last two days right? [23:01:21] milimetric, I see [23:01:33] does beta eventlogging use meta.wikimedia.org for the API? or beta cluster's metawiki? [23:01:43] milimetric, yes, but pointing to the normal API [23:01:54] oh [23:02:06] that's the mismatch then, yea [23:02:44] yeah, we should see if we can make beta tests more thorough [23:02:48] and maybe continuous :) [23:03:08] mforns: events are coming into all-events.log again, all is well [23:03:09] even then it should have been caught on test2wiki [23:03:10] thanks! [23:03:18] milimetric, good! [23:03:20] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK All defined EventLogging jobs are runnning. [23:03:21] legoktm: what's test2wiki? [23:03:31] test2.wikipedia.org ? [23:03:40] if ( $wgDBname === 'test2wiki' ) { [23:03:40] // test2wiki has its own Schema: NS. [23:03:40] $wgEventLoggingDBname = 'test2wiki'; [23:03:40] $wgEventLoggingSchemaApiUri = 'http://test2.wikipedia.org/w/api.php'; [23:03:41] $wgEventLoggingBaseUri = "//{$wmfHostnames['bits']}/dummy.gif"; [23:03:43] $wgEventLoggingFile = "udp://$wmfUdp2logDest/EventLogging-$wgDBname"; [23:03:55] hm, i don't think we test there, i've never heard of it [23:04:15] well it's explicitly configured for you to test there... [23:04:18] if you have a sec to explain how that works or if you have any docs, that'd be great [23:04:49] code will be deployed on test2wiki on wednesdays and then it will hit metawiki on tuesday [23:05:05] hm, odd, why would we ever test on beta then... [23:05:16] beta is master, and test2wiki is still release branches [23:05:33] seems like test2 would let us catch more problems earlier then [23:05:58] like, it's a little late if the problem hits master... [23:06:34] this should have never hit production, it should have been caught in beta [23:07:00] you mean for the API change, right? [23:07:03] yes [23:07:08] on our side we should configure beta to use the beta API though [23:07:17] that would've helped and then maybe we could've helped you find the bug [23:07:22] because we would've seen it last Friday [23:07:31] or just test in test2 [23:09:15] good night everyone, see you tomorrow :] [23:09:26] nite [23:11:09] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [1200.0] [23:45:03] halfak: milimetric's computer just crashed :-( [23:45:18] he is rebooting [23:45:24] kevinator, thanks. Good to know I didn't miss him.