[00:32:40] Analytics-Cluster, Analytics-Kanban: Report monthly pageviews for the annual report - https://phabricator.wikimedia.org/T95573#1196508 (Nuria) Well, lila mentioned a max of 500 per day (about 15.000 per month). I think the cluster is overkill as we will be using an infrastructure to made to count and proc... [12:10:29] (CR) Krinkle: "recheck" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/199162 (https://phabricator.wikimedia.org/T93690) (owner: Jdlrobson) [13:49:47] Analytics, Ops-Access-Requests, operations: Grant Sati access to geowiki - https://phabricator.wikimedia.org/T95494#1197783 (Ottomata) I sent en email yesterday to Jody asking for confirmation of Sati's NDA, as the instructions in your link say to do. [13:55:57] halfak: sorry! I just realized the instrumentation meeting we have in 5 minutes conflicts with my standup [13:56:10] is it ok if we do it right after standup? Should only be 15 minutes [13:56:25] Woops. Standup isn't on your calendar! [13:56:28] Sure. No problem [13:56:45] I should have known that you had standup then :S [13:57:45] halfak: weird... i'll see to fixing that [14:10:57] Analytics-Kanban, Analytics-Wikimetrics: confirm vagrant setup works for wikimetrics - https://phabricator.wikimedia.org/T95690#1197822 (Nuria) NEW [14:17:14] Analytics-Cluster, Analytics-Kanban: Add automata value in agent_type field of the refined table - https://phabricator.wikimedia.org/T95693#1197885 (JAllemandou) NEW [14:22:26] halfak: ok, done, wanna chat? [14:24:37] milimetric, don't talk to halfak [14:24:44] he just told me olives aren't really food [14:24:47] we are shunning him. [14:24:48] * Ironholds shuns [14:24:53] done [14:24:55] * milimetric shuns [14:25:02] :P [14:25:07] (I love all of you ;p) [14:25:24] :) I also think olives are more ways to torture small children than food [14:25:43] but I don't want to get shunned so I won't say that kind of thing [14:25:43] milimetric, just went into the batcave looking for you. 
[14:25:47] I suppose we need a new call. [14:25:50] halfak: oh, i'll do it [14:25:57] kk [14:29:05] nuria: Do you mind going for my CR first, like that I can deploy ;) [14:30:53] ottomata, nuria : by the way, I will also add the timestamp value in the CR [14:31:01] I had forgotten about that one ... [14:40:18] joal: looking [14:40:21] Thx [14:40:41] nuria: I'm waiting for my test on hive on timestamp to make the last change [14:46:11] (PS3) Joal: Add ts (unix timestamp), access_method, client_type and is_zero fields to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/202914 [14:46:31] ottomata, nuria: pushed change about timestamp as well [14:47:19] oh COol [14:48:12] hm, joal, camus by default will use 'timestamp'. That is also what rcstream uses [14:48:16] we might want to use that as the field name [14:48:22] i think eventlogging uses that too [14:48:25] kkkkk [14:48:28] no problemo [14:48:42] Since date was dt, I used the same abbreviation :) [14:48:49] joal, is that the proper hive field? [14:48:51] type? [14:48:51] joal: nice, who knew you could do that [14:49:03] huhu, hive doc ;) [14:49:07] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp [14:49:53] I'll try to have the thing as a timestamp type [14:49:57] give me a minute [14:50:09] ja, not sure how it will actually store it, but i think if we can do that it would be better, as we can use hive date functions i think [14:50:15] and also it will probably display nicely [14:55:19] (CR) Ottomata: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [14:56:11] ottomata: We can use timestamp as fieldname, but it is also a reserved keyword for the type ... [14:56:31] oh hm. 
[14:56:34] RiiiGhhht [14:57:31] (CR) Nuria: [C: 1] Add ts (unix timestamp), access_method, client_type and is_zero fields to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/202914 (owner: Joal) [14:57:47] joal: so much work to do in the bot front [14:58:13] nuria: ? [14:58:23] joal: not for this patch [14:58:43] nuria: yeah, if we want a more precise analysis, there is some work to do [14:58:44] joal: but for the future, as for example bingbot is not there [14:58:53] It is for sure [14:59:04] ah yah, ay wait maybe i missed it [15:00:23] two different user agents for bingbot: spider, and iPhone ! [15:00:47] spider bingbot is about 3% of our traffic [15:00:54] nuria: --^ [15:01:06] joal: ah ok, bingbot is being detected ok by Ua-parser, right [15:01:12] that is what we wnat [15:01:14] *want [15:01:31] yup [15:01:33] hm, joal, what should we do? [15:01:35] considered as spider [15:01:39] about timestamp name [15:01:44] i guess not use it? [15:01:56] ottomata: I would prefer to go for ts, yes [15:02:13] prevent any forced use of `timestamp` to users [15:02:13] hm. ok. [15:02:18] hm. [15:02:24] ok [15:02:51] nuria: btw, i think you are right, HiveContext is not available in cloudera jars yet [15:02:56] in spark submit i am getting [15:03:00] Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf [15:03:03] hmMM [15:03:05] wait.
hm [15:03:06] that is hadoop [15:03:08] ok hang on [15:03:17] ottomata: but it is on our mvn deps path [15:03:21] yes [15:03:22] it compiles fine [15:04:07] on: jar -tf ./spark-hive_2.10-1.2.0-cdh5.3.1.jar [15:04:13] Sometimes maven deps can trick you :) [15:05:04] joal: ya mvn will get stuff according to deps no matter whether you use it or not [15:05:13] Analytics-EventLogging, Ops-Access-Requests, operations, Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198063 (RobH) This actually has to have @mark approval, not Toby. (Daniel and I discussed in IRC, this is a task u... [15:05:16] ottomata: and the cloudera version we have is 5.0? [15:05:33] 5.3.1 [15:05:40] actually. [15:05:53] https://phabricator.wikimedia.org/T93952 [15:06:13] (PS4) Joal: Add ts (unix timestamp), access_method, client_type and is_zero fields to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/202914 [15:06:32] nuria, ottomata : last one, I promise ;) [15:06:38] ts is now a hive timestamp [15:06:40] Analytics-EventLogging, Ops-Access-Requests, operations, Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198068 (mark) Approved. [15:07:51] joal, is that timestamp in seconds or milliseconds? [15:09:19] milliseconds, from hive doc [15:10:04] hm, ok, can you note that in the column comment? [15:10:28] Sure, will do [15:12:06] joal, ottomata nice , that makes things easy, now we have to make sure we have the right locale so as to get times on utc, but i guess that is cluster configuration [15:12:29] times are all utc [15:12:30] :) [15:12:32] no prob [15:12:38] the dts are utc, hive assumes utc [15:12:39] etc. [15:13:29] (CR) Nuria: [C: 1] Add ts (unix timestamp), access_method, client_type and is_zero fields to refined webrequest table.
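The seconds-versus-milliseconds question above matters for anyone reading the new `ts` field back out. A minimal Python sketch (not part of the refinery code) of interpreting a unix timestamp stored in milliseconds as a UTC datetime, which is what the log says Hive assumes:

```python
from datetime import datetime, timezone

def ms_to_utc(ts_ms):
    """Convert a unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)

# 1428595200000 ms is 2015-04-09 16:00:00 UTC
print(ms_to_utc(1428595200000).isoformat())
```

Dividing by 1000 is the whole trick; forgetting it silently puts dates decades off, which is why noting the unit in the column comment (as asked above) is worthwhile.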
[analytics/refinery] - https://gerrit.wikimedia.org/r/202914 (owner: Joal) [15:17:30] ottomata: shall I go and merge ? Or do you want to review the comment for milliseconds ;) [15:18:28] (CR) Nuria: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [15:18:57] joal: i don't see the millisecond comment :) [15:19:15] (PS5) Joal: Add ts (unix timestamp in milliseconds), access_method, client_type and is_zero fields to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/202914 [15:19:18] Arrrrrrrrriving :) [15:21:38] (CR) Ottomata: [C: 2] Add ts (unix timestamp in milliseconds), access_method, client_type and is_zero fields to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/202914 (owner: Joal) [15:21:40] :) [15:21:47] Analytics-Cluster, Analytics-Kanban: Add better timestamp field to refined webrequest data - https://phabricator.wikimedia.org/T94584#1198117 (JAllemandou) [15:23:24] nuria, coOOOl HiveContext works. [15:23:39] i need to puppetize hive-site.conf symlink in /etc/spark/conf, and then i need to make hive jars be included on spark's default classpath [15:23:40] that's it. [15:23:42] ottomata: doing what? (besides classpath) [15:23:53] doing that now... [15:23:56] ah the conf there [15:23:57] but if you want to try [15:24:01] aham [15:24:38] unset cp; for f in /usr/lib/hive/lib/*.jar; do cp="$cp:$f"; done [15:24:49] spark-shell --driver-class-path $cp [15:24:52] nuria, ottomata : I go merge and deploy the new fields [15:24:55] ok cool [15:24:57] do it! [15:24:57] yay! [15:25:24] (CR) Joal: [C: 2 V: 2] Add ts (unix timestamp in milliseconds), access_method, client_type and is_zero fields to refined webrequest table.
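The shell one-liner above joins every Hive jar into a ':'-separated classpath for `spark-shell --driver-class-path`. The same idea as a Python sketch, for reference (the `/usr/lib/hive/lib` path is the one from the log; adjust for other installs):

```python
import glob

def hive_classpath(lib_dir="/usr/lib/hive/lib"):
    """Join all jars under lib_dir into a ':'-separated classpath string,
    mirroring the shell loop: for f in $lib_dir/*.jar; do cp="$cp:$f"; done"""
    return ":".join(sorted(glob.glob(lib_dir + "/*.jar")))
```

Unlike the shell version, this one does not emit a leading ':' before the first jar, which some JVM tools are picky about.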
[analytics/refinery] - https://gerrit.wikimedia.org/r/202914 (owner: Joal) [15:27:23] ottomata: trying [15:28:03] Analytics-EventLogging, Ops-Access-Requests, operations, Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198133 (Andrew) [15:28:50] Analytics-EventLogging, Ops-Access-Requests, operations, Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198137 (Andrew) Open>Resolved Done. Tomasz, re-open this ticket or ping me directly if you don't have access... [15:35:00] Sorry for the merge commit guys :( [15:35:09] Forgot to rebase ... Arfff [15:50:41] joal: np, looks good [15:51:22] nuria: waiting for last 15:00 jobs before deploying [15:51:37] nuria: wit [15:51:52] nuria with new starting date at 16:00 [16:04:07] joal, ottomata: ok so -given that hive context works now- should we go for that for the new job? I will do some tests to compare times for 1 day ok? [16:04:51] ok. i think so, because I think it will make ooziefying easier [16:04:58] now you can select data based on months very easily, ja? [16:05:03] where month=4 [16:05:44] ottomata: if no perf issue, let's go then :) [16:07:33] ottomata: yaya [16:07:35] like: [16:07:48] ottomata: val userSessions = hc.sql("SELECT uri_path, uri_query, content_type, user_agent, x_analytics, dt from webrequest where year=2015 and month=03 and day=10 and hour=01") [16:08:10] yes, and sources too, i guess [16:08:12] so for oozie there is no question is easier cause , just like we do for everything else, we parametize the sql [16:08:14] because you don't need bits or upload, right? [16:08:20] yup [16:08:58] ottomata: yes, this was just a test. i am going to run it for 1 day on mobile data and see if it makes a difference on times to quantify ok? 
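Parameterizing the SQL, as discussed above for the oozie job, mostly means assembling the partition predicate from job arguments. A hypothetical Python sketch of building the partition-pruned HiveQL string (field and partition names follow the `hc.sql(...)` example in the log; this is not the actual job code):

```python
def partition_query(table, fields, year, month, day, hour, source=None):
    """Build a partition-pruned HiveQL string like the hc.sql() call above.
    `table` is the parameter suggested in the log (e.g. 'wmf.webrequest');
    `source` optionally restricts to one webrequest_source partition."""
    where = "year={} AND month={} AND day={} AND hour={}".format(year, month, day, hour)
    if source is not None:
        where = "webrequest_source='{}' AND {}".format(source, where)
    return "SELECT {} FROM {} WHERE {}".format(", ".join(fields), table, where)

q = partition_query("wmf.webrequest", ["uri_path", "dt"],
                    2015, 3, 10, 1, source="mobile")
```

Restricting on the partition columns (`year`/`month`/`day`/`hour`, and `webrequest_source`) is what lets Hive prune partitions instead of scanning the whole table.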
[16:09:15] ottomata: if so, i will get patches "more ready" [16:09:59] k, cool [16:10:49] cool, nuria, all puppetized, you shouldn't need classpath option anymore [16:11:14] ottomata: efficiency to the maxxx [16:11:23] https://gerrit.wikimedia.org/r/#/c/203358/ [16:12:20] ottomata: ok, job with parquet takes for 1 day (with 12 executors) 15 mins, let's see how does hivecontext do, [16:13:23] nuria: eager to know ! [16:24:04] Deploying now ! [16:26:56] joal: do we re-start the cluster when we deploy? [16:27:12] We restart refinery job in ozzie [16:27:45] nuria: FYI, oozie job killed [16:28:23] joal: so in oozie every job runs with its own classpath? [16:29:05] nuria: every job has its own oozie definition at load time, yes [16:30:35] nuria: Code updated, altering table in hive [16:31:01] joal: ok, will wait until you are done before launching new job with hive context [16:31:12] Sounds like a good idea :))) [16:31:15] Sorry for that [16:34:01] nuria: Table updated [16:34:12] joal: and job restarted it? [16:34:13] nuria: restarting oozie job [16:34:17] joal: k [16:34:51] nuria: job restarted, double checking everything is fine for the next runs [16:36:06] joal: k [16:37:19] nuria, joal, is next monday vacation day in wmf? [16:37:38] mforns: I have no idea [16:37:53] mforns: easter was last monday for us in france [16:38:23] mforns: no, easter is not a federal holiday here [16:38:26] joal, it is marked as Thomas Jefferson's birthday in the calendar as a US Holiday [16:38:38] mforns: Ahhhhhhh [16:38:41] mforns: ah wait [16:39:31] mforns: maybe it is?, no clue., some american person should know then [16:39:42] nuria, joal, thanks! [16:39:48] ggellerman____: do you know if monday is a holiday? [16:40:29] nuria: don't think so, but will confirm [16:41:24] nuria: I don't know why that's on Google calendar...not celebrated in the US that I know of [16:47:00] ggellerman____: k, that is what i thought [16:51:17] ggellerman____, thanks! 
[17:01:36] nuria: you mentioned some analysis scripts? [17:02:03] yes, christian had some parqued code on El extension to do that. [17:02:03] (for el) [17:02:13] milimetric: you can see them at: [17:02:57] nuria@stat1003:~/EventLogging/server/tools$ [17:03:08] thx! (looking) [17:08:57] nuria that's a pretty awesome script, but I think I need to keep writing mine. Because what I'm seeing is that different tables have missing data for different periods of time [17:09:14] milimetric: but are you looking client side vs server side? [17:09:26] milimetric: cause netweork outages affect those two differently [17:09:31] *network [17:09:32] no, i was just looking at the tables themselves [17:09:44] nuria: Something wrong with the cluster I think :( [17:09:51] in the Edit schema, for example, which takes both client and server side events, there is a huge chunk missing [17:10:05] milimetric: right, but tables populated from "server side events" might be fine if we had a network outage between varnish and el machine [17:10:09] milimetric: makes sense? [17:10:26] milimetric: as data gets to machine through 2 completely different paths [17:10:35] yes, that does, I think there are lots of things to check here, I won't try to write some generic script to do it all [17:10:51] milimetric: what tables did you looked at if i may ask? [17:11:05] i don't remember, i was randomly picking some from the show tables list [17:11:13] joal: did not try anything yet, was trying spark shell still [17:11:25] milimetric: ok, cause that maybe it [17:11:41] i'm writing a simple script that will just give me the count per hour for all tables, and run it for a few hours before and after I notice the problem [17:11:46] nuria: just there was no jobs in the interface ... 
[17:11:53] But some have started again [17:12:25] the cat_db part of his script is actually what i want to do [17:12:36] but per hour instead of per 100 seconds and for all tables instead of specific ones [17:13:12] milimetric: ok, after you can classify tables on client and server side events and see if that makes a difference when it comes to event drop [17:13:27] milimetric: note that many of the tables are "dead" . i. e. they receive no events [17:13:58] yes, that's why i'll look before and after, to catch those cases [17:14:44] milimetric: k [17:33:12] (CR) Nuria: "When I tested this (by adding "raise Exception" to run method on ReportNode on report.py)" (2 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/203241 (https://phabricator.wikimedia.org/T88610) (owner: Mforns) [17:33:42] ottomata: yt? [17:37:48] yup hiya [17:38:10] nuria: [17:39:06] ottomata: best syntax i found that seems to work is: [17:39:14] https://www.irccloud.com/pastebin/9EyS7DSF [17:39:23] ok? [17:40:28] why not use from wmf.webrequest [17:40:28] ? [17:40:29] cc joal, job with hive context and 12 executors on cluster: https://yarn.wikimedia.org/cluster/app/application_1424966181866_79314 [17:40:45] ottomata: argh, see good that i ask you [17:40:58] ottomata: cause *cof* *cof* i did not think about it [17:41:59] mforns: let me know if comments on CR make sense [17:42:21] nuria: aye, that way we can parameterize an argument to the job as table [17:42:22] mforns: we can work on repro the testing together if you want to [17:42:26] table=wmf.webrequest [17:42:27] nuria, I've seen them, will look at them closely and respond & fix [17:42:32] mforns: k [17:42:34] nuria, thanks for the review!
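The per-hour, all-tables count script milimetric describes could generate one query per table. A hypothetical sketch; the only assumption is the standard EventLogging `timestamp` column holding 14-digit YYYYMMDDHHMMSS strings, so the first 10 characters identify the hour (the table name below is a placeholder, not a real schema revision):

```python
def hourly_count_sql(table):
    """Hypothetical per-hour event count for one EventLogging table; assumes
    `timestamp` holds 14-digit YYYYMMDDHHMMSS strings, so SUBSTRING(.., 1, 10)
    buckets rows by hour."""
    return (
        "SELECT SUBSTRING(timestamp, 1, 10) AS hour_bucket, COUNT(*) AS events"
        " FROM {t} GROUP BY SUBSTRING(timestamp, 1, 10)"
        " ORDER BY hour_bucket".format(t=table)
    )

# "Edit_12345" is a placeholder table name, not a real schema revision
q = hourly_count_sql("Edit_12345")
```

Running this for a window before and after the suspected outage, per table, shows exactly which hours drop to zero, including for tables that are normally "dead".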
[17:42:47] ottomata: ahem, yes, much better [17:43:11] ottomata: let's see how long does the job take [17:45:59] nuria: cluster a bit under pressure right now, catching on refine webrequests [17:46:17] nuria: Might not be the best moment for your test [17:51:39] ottomata: have we changed something in the parquet conf of the webrequest refined table ? [17:56:56] not I [17:56:57] what's up? [17:56:59] joal: ? [17:57:04] yup [17:57:15] Got null values for new columns [17:57:25] I think I have nailed down the issue [17:57:30] Will let you know [17:57:34] once confirmed [17:58:56] milimetric: yt? [17:59:57] joal: ok, wil re-run later once this one finishes [18:00:20] kevinator: howdy, yes [18:00:25] I just filed: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150409-EventLogging#Thursday_Apr_9_18:49:48_UTC_2015 [18:00:27] batcave? [18:00:32] sure [18:03:00] (CR) Declerambaul: "re the trailing dot, i recommend against it. you can use the :paste command in the repl to avoid the syntax error." (10 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [18:08:39] Analytics-EventLogging, Mobile-Web: MobileWebClickTracking table is huge and thus querying too slow - https://phabricator.wikimedia.org/T76671#1198805 (Jdlrobson) Open>declined we're no longer using this table. We split into multiple tables. [18:09:54] halfak: yt ? [18:10:13] Meeting. I have another one afterward. I'll be around again in 1.5 hours [18:10:20] :( [18:10:26] np [18:10:35] Just to let you know I got that email :) [18:10:40] halfak: --^ [18:11:07] Will check with you next week [18:11:13] The one from altiscale? [18:11:17] yup [18:11:20] Great :) [18:11:29] Have a good weekend o/ [18:11:34] You too ! [18:11:38] (CR) Ottomata: "Awesoome, thanks Fabian!" 
(2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [18:11:55] mforns: ja Fabian just reviewed your code :) [18:11:58] yay! [18:12:18] mforns: Sorry for trailing dots ;) [18:14:03] ottomata, mforns: i will address his comments, let me upload a new patch. I just want to make sure to test it before it's submitted. [18:18:29] cool, yup np [18:18:47] Analytics: Cannot permalink easily to a single graph - https://phabricator.wikimedia.org/T76670#1198938 (Jdlrobson) [18:19:02] ottomata, joal, nuria: I have been reading his comments, they are cool [18:19:58] joal, why are you sorry? I agree with you and Declerambaul that it's better to have them leading the line. [18:20:19] Analytics, Language-Engineering, MediaWiki-extensions-UniversalLanguageSelector, Mobile-Apps, and 4 others: there should be a comparison of clicks count on interlanguage links on different platforms - https://phabricator.wikimedia.org/T78351#1198952 (Jdlrobson) [18:20:24] Oh, I didn't get that :) [18:20:28] mforns: --^ [18:20:30] joal, I just put them trailing because of the spark-shell [18:20:45] I thought it was personal ;-P [18:21:17] mforns: anyway, Fabian's comments are cool, I definitely agree ! [18:21:18] the spark-shell fails with leading dots, interprets the first line (without dot) as a plain assignment. [18:21:34] joal, totally! [18:21:37] mforns: aye, fabs suggests to use :paste [18:21:45] which is ok, but i agree could be annoying [18:22:00] ottomata, what is :paste? [18:22:05] in the repl [18:22:07] type :pa [18:22:11] (or :paste) [18:22:15] it will let you paste in a block of code [18:22:16] then [18:22:18] ctrl-D [18:22:25] and it will eval that block all at once [18:22:38] Didn't know the trick either [18:22:43] Sounds really useful ! [18:22:44] me neither! learned it yesterday :) [18:23:16] ottomata, I see!
[18:23:30] ottomata: I think I found the error [18:27:03] ottomata: When rerunning jobs after a schema change [18:27:26] If a partition was created before and then overwritten, new values inside are not seen [18:27:47] I have the case here [18:29:46] new values [18:29:49] meaning all the new data? [18:29:53] or the new fields you are adding? [18:29:53] yup [18:30:03] only values added [18:30:10] ? [18:30:26] hm, don't understand [18:30:38] partition existed with parquet data before [18:30:42] you overwrote it [18:30:43] For instance, here ts, and other values newly added by the schema change are null [18:30:47] ah [18:30:49] correct [18:31:04] you sure it got overwritten? [18:31:04] I need to remove partition manually, then recreate it [18:31:17] just drop, add partition? [18:31:18] Yup, (almost) sure [18:31:22] I think so [18:31:25] I'll check [18:31:25] and then the fields have data? [18:31:33] I think so [18:31:37] I need to double check [18:32:12] For the moment I am checking every partition created, just to be sure [18:32:24] And since cluster is a bit loaded, take time [18:39:01] nuria, are you looking at Declerambaul's comments? [18:39:38] mforns: ya, i had already changed some code /added hive context so i will address those [18:40:56] nuria, do you mind if I add comments to combineByKey comment? [18:41:45] mforns: no, this here explains pretty well how does it worK: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html [18:42:55] nuria, yes, I remember having used this when implementing the code [18:43:14] ottomata: confirmed ! [18:43:22] and fixed [18:43:33] Everything back to normal [18:44:28] Will update documentation and send an email to analyitics list [18:45:23] ok [18:48:57] Analytics-Cluster, operations, ops-eqiad: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1199134 (Cmjohnson) spent some time chatting with Dell tech. 
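For reference, the `combineByKey` semantics nuria links to above can be modeled in a few lines of plain Python. This is an illustration of the aggregation pattern only, not the session-metrics job itself; with a single "partition", `mergeCombiners` never fires, so it is omitted:

```python
def combine_by_key(pairs, create_combiner, merge_value):
    """Single-partition model of Spark's combineByKey: create_combiner runs
    the first time a key is seen, merge_value for every later value of that
    key. (mergeCombiners is omitted; it only matters across partitions.)"""
    acc = {}
    for key, value in pairs:
        if key in acc:
            acc[key] = merge_value(acc[key], value)
        else:
            acc[key] = create_combiner(value)
    return acc

# Count events per user -- the shape of a per-user session aggregation
events = [("u1", 1), ("u2", 1), ("u1", 1)]
counts = combine_by_key(events, create_combiner=lambda v: v,
                        merge_value=lambda acc, v: acc + v)
```

The real Spark operator additionally needs `mergeCombiners` because each partition builds its own accumulators, which are then merged across the cluster.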
I did get firmware updates for the R720 that are bootable so I would like attempt to upgrade the bios on a few of the older s... [19:02:28] Analytics-EventLogging, operations, Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199213 (Cmjohnson) [19:02:30] Analytics-EventLogging, operations, ops-eqiad: vanadium failed disk /dev/sda - https://phabricator.wikimedia.org/T94926#1199210 (Cmjohnson) Open>Resolved a:Cmjohnson replaced disk [19:07:05] oh mah goodness, why was that so hard. i haven't gotten spark streaming + avro + schema registry to work [19:07:10] but i did just get a java consumer to use it! [19:07:13] that was actually pretty easy! [19:07:32] KafkaAvroDecoder (from confluent) + point it at schema-registry and give it a topic name [19:12:26] Analytics-EventLogging, operations, Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199253 (Cmjohnson) confirmed ge-4/0/11 is vanadium. I deleted the interface from the switch The disk was replaced. @[[ https://phabricator.wikimedia.org/p/RobH/ | Robh ]] [19:13:04] Analytics-EventLogging, operations, Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199255 (Cmjohnson) @[[ https://phabricator.wikimedia.org/p/RobH/ | RobH ]] did you add to server spares? 
[19:13:16] (CR) Mforns: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [19:13:52] Analytics-EventLogging, operations, Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199257 (Cmjohnson) confirmed in IRC that no it wasn't done...keeping ticket to complete [19:22:19] (CR) Nuria: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [19:39:30] (CR) Nuria: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [19:45:05] (CR) Mforns: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [19:51:05] milimetric, yt? [19:52:42] mforns: hey [19:52:49] hey! [19:53:54] milimetric, do you know of public reports that were added 2014-12-17 to wikimetrics for NamespaceEdits, NewlyRegistered, RAE, RNAE, RSNAE, across all wikis? [19:54:19] I found, all those public recurrent reports, created at this date [19:54:28] I just built the stupidest thing [19:54:54] milimetric, the public reports folder is getting big, 2.4 GB for now [19:55:10] yeah, those should be the Vital Signs reports [19:55:19] they should be created by user WikimetricsBot [19:55:26] milimetric, exactly [19:56:03] ok, I feel better, I thought it had something to do with: https://gerrit.wikimedia.org/r/#/c/180071/5/wikimetrics/api/centralauth.py [19:56:14] milimetric, this was merged one day before [19:56:52] milimetric, so yes, the problem with wikimetrics is that tar-ing of the public reports folder is taking too long [19:57:15] mforns: that makes sense, hm... 
how to fix :) [19:57:17] milimetric, it contains 5000+ folders and 600000+ files [19:57:21] yep :) [19:57:25] boy oh boy [19:57:39] we could compact the individual files [19:57:45] that would reduce the size dramatically [19:57:51] oh duh, we should totally be doing that [19:58:08] milimetric, what do you mean by compacting the individual files? [19:58:19] instead of storing each day separately, we should store "compacted_1", "compacted_2", etc. every few months when the data gets big [19:58:28] milimetric, I see [19:58:34] you know how in each public report folder we store each day separately and then the full_report.json [19:58:42] yes [19:58:48] understand [19:58:58] cool - yeah, the backup will just break until we do that [19:59:07] yes [19:59:27] milimetric, I'll file a task [20:00:35] thanks for looking into it mforns [20:02:08] hey milimetric, batcave w me about all this crap i'm working on? [20:02:46] ottomata: ok, but just finishing up with nuria [20:07:44] ottomata, joal : also loads of parsing errors in spark now, will run jobs later when there are perhaps less tasks scheduled [20:12:04] I don't get the reason for parsing errors though nuria [20:13:21] joal: maybe "compressing errors" is a better description [20:13:24] https://www.irccloud.com/pastebin/S0tYvnVp [20:13:42] joal: but i hadn't seen those up to today [20:15:23] weird, never seen that yet nuria [20:16:33] joal: maybe the hive conf is missing something that tells it that stuff is compressed on a certain way [20:16:49] hmmm [20:17:15] does the error occur on parquet file or on shuffle data ? [20:18:10] joal: https://yarn.wikimedia.org/proxy/application_1424966181866_79389/stages/stage?id=1&attempt=142 [20:19:59] home: I guess it happens when a task get remove because of preemption [20:20:07] But I can't be sure [20:23:37] joal: can you kill that job? [20:23:46] Sure, easy :) [20:23:54] want me to ? 
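The compaction idea milimetric sketches above ("compacted_1", "compacted_2", etc.) could look roughly like this. The file layout (one JSON file per day next to `full_report.json`) follows the description in the log, but the exact names and structure here are assumptions, not wikimetrics' real code:

```python
import json
import os

def compact_daily_reports(report_dir, out_name="compacted_1.json"):
    """Sketch: fold the per-day JSON files in one public-report folder into a
    single compacted file keyed by day, then delete the originals.
    full_report.json is left untouched."""
    merged = {}
    days = [f for f in sorted(os.listdir(report_dir))
            if f.endswith(".json")
            and f != "full_report.json"
            and not f.startswith("compacted")]
    for fname in days:
        path = os.path.join(report_dir, fname)
        with open(path) as fh:
            merged[fname[:-len(".json")]] = json.load(fh)
        os.remove(path)
    out = os.path.join(report_dir, out_name)
    with open(out, "w") as fh:
        json.dump(merged, fh)
    return out
```

Turning 600,000+ small files into a handful of compacted ones is what would let the nightly tar of the 2.4 GB public-reports folder finish again.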
[20:24:51] Actually it's gone already [20:24:55] nuria --^ [20:24:56] Analytics-Kanban, Analytics-Wikimetrics: Compact Wikimetrics' old report files - https://phabricator.wikimedia.org/T95756#1199587 (mforns) NEW [20:25:07] joal: ok [20:26:02] (CR) Declerambaul: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [20:36:57] * joal is back ! [20:37:04] what a lag [20:43:04] joal: you know... could it be that teh addition of the new fields increase the memory? [20:43:26] hm [20:43:58] joal: we can look at this on monday though, no rush, it's super late for ya [20:44:09] nuria: We probably don't know how spark-hive behaves in case of schema change ... [20:44:31] It's planned: I'll go to bed in a minute or so ;) [20:45:56] nuria: what's awesome about ngraph (I took a very quick look) [20:46:13] yessss [20:46:19] hm, nuria, weird [20:46:22] can I reproduce? [20:47:30] ottomata: i just tried to run the code, let me send to gerrit, give me 5 mins [21:05:19] (PS6) Nuria: [WIP] Add Apps session metrics job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [21:05:23] Deskana: do you have few minutes? Are you still on the 5th? [21:05:23] leila: I'm downstairs. I'm coming up in 5 minutes or so. [21:05:24] no rush. let's chat when you're here Deskana. [21:05:24] (CR) Nuria: [WIP] Add Apps session metrics job (6 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [21:05:44] ottomata: ok, so the memory errors can be reproed loading 1 hour of data [21:06:09] okay [21:06:15] who is Ashwin and what the hell is he or she doing? [21:06:25] jaja [21:06:43] cause Ironholds is going to break his neck ninja-style [21:06:54] no, I'm going to go "LOOK AT TOP. EXPLAIN." 
[21:07:08] I'm looking at 10 simultaneous python sessions, each on a different core [21:08:08] Ironholds: that name sounds familiar from a research session [21:08:08] yeah. DarTar, leila ? [21:08:08] ottomata: using hive context [21:08:08] nuria: i just did [21:08:08] who is Ashwin and why are they running 10 simultaneous python sessions on stat1002, having not sent out an announcement prior? [21:08:11] ottomata: and later: [21:08:17] https://www.irccloud.com/pastebin/MdcgJAWC [21:08:20] val hc = new org.apache.spark.sql.hive.HiveContext(sc) [21:08:20] and later: [21:08:20] Hi Ironholds. what's up? [21:08:20] val data = hc.sql("SELECT uri_path, uri_query, content_type, user_agent, x_analytics, dt from wmf.webrequest where webrequest_source='misc' and year=2015 and month=03 and day=20 and hour=0") [21:08:20] data.map(_(0)).take(10) [21:08:50] Ironholds: Ashwin is collaborating with Bob and Leila under an NDA [21:08:54] okay [21:08:59] ottomata: and later: userSessions.take(50) [21:09:12] they need to stop whatever they're doing, send out an announcement checking anyone else needs that machine and, when they don't hear anything back, restart it [21:09:16] ottomata: or yeah, that might work too [21:09:19] ssh into stat2 and look at top. [21:09:37] leila, DarTar , ^ [21:09:40] Ironholds: ahhhh that might be part of our problems too [21:09:40] no poblems there nuria [21:09:43] with what I did [21:09:55] and this is why we send out advance warning >.> [21:10:02] ottomata: did you try the longer snippet? [21:10:03] we should really have it in a guide somewhere [21:10:08] I’ll ask leila to chime in [21:10:16] my plan of "give the evil eye to anyone I see doing it" only works if we stop hiring people [21:10:24] otherwise I have to deathglare over and over [21:10:26] Ya DarTar , i am also using stat1002 and it is over 100% cPU [21:10:31] and we should have ashwin on IRC ideally [21:10:49] ideally. sub-ideally, we should kill whatever...that. is. 
[21:11:10] Ironholds: what's the problem with that? [21:11:21] nuria, it looks like the oldest process is 19037 - assuming a parallelised fork, killing that should reduce the rest [21:11:25] well, either there’s a usage policy that is agreed upon and documented or we cannot blame people for not adhering to it [21:11:49] DarTar: now that we're at it, there is a card in Trello for you to add Ashwin to the research-internal list. [21:11:51] ;-) [21:11:54] I remember the days when ops would come to me saying OMG do you realize that you’re in breach of policy X Y and Z ;) [21:12:03] leila, running over 10 cores without notifying anyone or sending out an email in advance to check if anyone needed to use this machine for anything, with no documentation of what is running, in a way that is potentially disrupting AnEng's work? Quite a lot. [21:12:26] like, I don't know what's running, how long it's going to take, how to turn it off if it breaks anything... does anyone else except ashwin know how long it'll run for or what it's doing? [21:12:30] Ironholds: what process is this interfering with right now? [21:12:36] DarTar: we need a policy but also checking htop once in a while doesn't hurt [21:12:49] nuria: agree, and we can document this [21:12:50] nuria suspects it's responsible for problems that AnENg is working on and it's an active blocker for me even touching the machine [21:13:00] ottomata: did you try the longer snippet [21:13:15] oof nuria, it is hard to do so [21:13:16] because anything complex and processor-intensive will get killed by the system [21:13:17] in spark-shell [21:13:20] Rgggn [21:13:21] i ahve to go soon [21:14:39] i wish it was faster to iterate on this, it is hard to build a jar and submit... [21:14:39] leila: looking at the mailing list request, I think we did this a while ago [21:14:39] nuria: memory errors? or snappy compression problems? [21:14:40] I don't think so, DarTar, or at least I'm not aware of it. 
[21:14:40] ottomata: both [21:14:40] i only saw this one you linked to org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) [21:14:40] okay, well I have to go get fancied up to head over to Harvard [21:14:40] ottomata: we got snappy exceptions like: [21:14:40] https://www.irccloud.com/pastebin/2jKo3Oqe [21:14:40] nuria: and if you run the exact same thing using SQLContext instead of HiveContext? [21:14:40] so, can somebody please (a) get Ashwin to shut this down and (b) get Ashwin to document what it is, what it does, how long it's expected to run for, and do so in advance next time? [21:14:44] I agree we need a formal policy. [21:14:48] ja, nuria, saw those, that is weird [21:15:02] but this isn't exactly, like, complex and obscure insider knowledge. It's how to treat a shared resource 101. [21:15:17] ottomata: i can try that, but do not worry, if you have to leave do so; hopefully your friend doesn't cry when he sees my code [21:15:54] ottomata: but spark shell is also going to run into issues if 1002 is over 100% CPU right? [21:16:04] ottomata: doesn't seem it would work [21:16:05] good night everyone! have a nice weekend :] [21:16:09] if I don't use yarn mode, ja [21:16:20] leila: Hey, I'm coming back up now. [21:16:22] Ironholds: both item (a) and (b) need discussions, except for parts of (b) [21:16:23] leila: Sorry for the delay. [21:16:32] leila, what do you mean? [21:16:39] ottomata: ya i need client mode to see exceptions, so yes leila DarTar we need to lower the cpu usage of those scripts [21:17:05] unless they are about to finish! [21:17:16] in which case i take my lunch break [21:17:21] nuria, can you do --master yarn --deploy-mode client and see exceptions? [21:17:23] Deskana: no rush. [21:17:56] ottomata: lemme try [21:18:06] ottomata: on spark shell right? [21:18:12] ottomata: not spark submit [21:18:46] alright, I give up. 
I'll send out an email and CC Ashwin in [21:18:48] Ironholds: Email before starting a heavy job is good [21:18:53] agreed [21:19:01] documenting how long it's expected to take, also good? [21:19:17] whether Ashwin stops the current job, or does something else, I need to figure out what the problem is now, since stopping the job can delay our work. [21:19:18] what the job is doing is probably a good idea for someone to know so that we don't have a cluster locked for folding@home ;p [21:19:24] nuria: yes? dunno [21:19:25] ok i gotta go [21:19:27] good luck nuria! [21:19:30] those should all go in the email, Ironholds. how long it's expected to take [21:19:31] ottomata: ciao [21:19:38] so to put it another way: Ashwin did not notify anyone or ask permission or check [21:19:44] because of this, the job is inconveniencing people massively [21:19:58] however, stopping the job means inconveniencing Ashwin, so instead, let's massively inconvenience everyone else until whenever it's done? [21:20:00] Ironholds: Bob and I are aware that Ashwin is running jobs [21:20:07] the scale is different from one day to the next [21:20:27] leila: so the machine is now trying to fit 1) eric z scripts 2) ashwin 3) cluster work [21:20:33] yes, and today the scale is "nobody else can run things on the machine" [21:20:36] leila: all those 3 do not fit [21:20:49] cluster work and existing stats have to take priority over research that can be re-run [21:21:01] nuria: can we jump in the batcave? [21:21:02] cluster work and existing stats can't necessarily be rerun (Or at least it can be a real pain to restart those) [21:21:34] leila: yes, one easy thing would be using nice, to let the system prioritize [21:21:38] Ironholds: you have a point, and that is that an email should have gone out about it. [21:21:41] leila: going into batcave [21:21:49] beyond that, I'll talk to nuria to resolve the current issue [21:22:00] okies. Let me know what the result is. 
[21:23:00] thanks for bringing it up, Ironholds, and it would be better if in the future you chose milder language, because you know, at least Ops know who is on the cluster, and this person probably is a volunteer and it's not correct to attack them like this. [21:24:02] leila: in batcave, or wait ... [21:24:03] what did I say that was improper language? [21:24:16] or an attack? [21:37:22] Analytics-Kanban, VisualEditor: Schema:Edit seems to incorrectly set users as anonymous {lion} - https://phabricator.wikimedia.org/T92596#1199862 (Halfak) I just checked the rates at which we see weirdness. It looks like 0.7% of edits saved by registered editors have saveSuccess events associated with us... [21:49:27] halfak: are you around? [21:49:44] Yup. I got 10 minutes :) [21:49:50] great. thanks, halfak. :-) [21:50:02] so you know that Ashwin is using a bunch of cores in stat1002 [21:50:17] he says that he has already talked to you and you have told him about how he can reduce the priority [21:50:25] and that in that case, he should be OK [21:50:32] Yeah. [21:50:36] I'm looking at his jobs, those are PRI 38, 39 [21:50:52] I don't know why Ironholds can't start his job if Ashwin's has very little priority [21:51:09] Is Ironholds trying to start a job on hadoop? [21:51:15] Because that is unrelated to stat1002 [21:51:26] not sure, halfak [21:51:41] nuria: do you know if Ironholds wanted to start the job on Hadoop? (I think he left already) [21:51:41] * halfak sees "bzcat" and "python" and knows that the XML utilities are being used :} [21:52:04] leila: no, i do not know [21:52:30] because halfak is right, if that was a Hadoop job, Ashwin's job shouldn't interfere with that nuria. [21:53:06] so halfak, do you know how NICE works if Ironholds wanted to start his job outside of the cluster? [21:53:25] based on the conversation with you, Ashwin had assumed setting priority to the lowest suffices. 
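(Editor's aside: the PRI values above line up with how `top` displays priority: PR is roughly 20 plus the nice value, so jobs niced to 18 and 19 show as PRI 38 and 39, while an un-niced job (nice 0) shows as 20. A small stdlib Python sketch of a process lowering its own priority, which is presumably what Ashwin's jobs did:)

```python
import os

current = os.nice(0)   # an increment of 0 just reports the current niceness
lowered = os.nice(5)   # raise niceness by 5, i.e. lower the scheduling priority
# Note: unprivileged processes can only increase their niceness, never
# decrease it, so a job that nices itself down cannot reclaim priority later.
print(current, lowered)
```

The shell equivalents are `nice -n 19 command` to start low-priority and `renice -n 19 -p PID` for a running job; either way the scheduler then favours any un-niced process, which is why halfak expected Ironholds' job to "just work".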
[21:53:30] leila: ya , that sounds right [21:53:43] leila, +1 [21:53:50] I think that ashwin's work is causing no one harm. [21:53:56] mmm [21:54:01] halfak: even when cpu is >100% [21:54:02] okay. nuria, what should we do? [21:54:04] mmmm [21:54:22] nuria, yes. If you start something with higher priority, Ashwin's processes will slow/stop [21:55:04] humm. okay. this makes sense halfak. but it's good to hear that you confirm this. [21:55:13] :) [21:55:16] happy to help [21:55:28] so, maybe Ironholds didn't set higher priority for his job and that's the only thing he had to do nuria? [21:55:49] leila: well, he shouldn't have to set it [21:56:03] I think the default priority is 20 [21:56:15] which is higher priority than what Ashwin has set, 38, 39 [21:56:37] so as long as Ironholds submits his job, according to what halfak says it should just work. [21:56:53] you know, i do not know what the default is but since eric z jobs are at 20 that sounds right [21:57:08] Default == 20 [21:57:09] let me test that halfak is correct [21:57:15] :) [21:57:22] testing halfak in action. ;-) [21:57:35] * halfak waits for the results to gloat [21:57:40] :D [21:58:39] Hi guys, I'm the one using a lot of cores [21:58:49] ashwinpp: welcome. :-) [21:58:55] hahaja , i was taelling leila we have changed so many things today that any one of them (for what i was doing) might be causing issues [21:58:57] o/ ashwinpp. [21:59:10] holaaa [21:59:17] * i was telling [21:59:27] hey halfak, your library works awesome btw ;) [21:59:32] Woot! [21:59:40] ashwinpp, nuria is testing what halfak had told you earlier, i.e., if your job has the lowest priority, you should be fine. [21:59:42] I'm very glad that it is making your work easier ashwinpp [21:59:44] :) [21:59:46] also i do not know how many cores 1002 has, halfak do you know? 
[21:59:56] nuria, I don't [22:00:15] I can check [22:00:21] 16 [22:00:35] 16 cores according to python's multiprocessing library [22:02:04] ashwinpp: just hang around for some time until Nuria finishes her test, please. :-) [22:02:56] sure [22:04:11] now that we are waiting, ashwinpp, do you have a bio of yourself handy? [22:04:24] I want to send it to the internal list to introduce you. [22:04:44] halfak: I know that my 10 minutes are over. If you have to go, please go, and I'll update you via email. [22:05:21] halfak: now cpu with my stuff is reported to be at 500%.. cause that makes a lot of sense, meybe it is adding core's usage [22:05:33] *maybe [22:05:47] leila: sure, should I mail you? [22:05:57] sure. thanks! [22:06:19] nuria: does it let you add your job though? [22:06:42] ow yeah, I see it with PR 20 nuria [22:07:35] leila: ya, but then the 100% we saw before is nothing...man, i do not know, otto IS the man. we will ask him on monday [22:08:09] thanks for checking nuria. [22:08:18] so, do you see a need for ashwinpp to kill jobs nuria? [22:08:46] If you don't, I'll send an email to the internal list explaining what happened and where we left off, or you're more than welcome to do so. [22:10:39] leila: well, right now i have a ton of memory exceptions, let me see if i get those with code that i knew was running fine yesterday [22:10:53] sounds good nuria [22:11:20] leila: but ...doesn't seem that related no.. let me verify [22:11:41] no, it doesn't, but it can't hurt that you check before people sign off for the day. :-) [22:20:03] leila: no, ooms again, but who knows where this comes from [22:23:09] so nuria, did your job go through? [22:23:21] ashwinpp: how long do you expect the current jobs on the cores to run? [22:26:47] actually a long time [22:27:35] like days, ashwinpp? [22:27:39] from previous estimates approximately 48 hours [22:27:52] and when did you start them? 
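(Editor's aside: the 500% reading is not surprising once you know that `top`, in its default Irix mode, reports %CPU as a percentage of a single core: a fully parallel job on a 16-core box can legitimately show up to 1600%. A quick sanity check of the arithmetic, using the same multiprocessing call mentioned above:)

```python
import multiprocessing

cores = multiprocessing.cpu_count()  # stat1002 reportedly returned 16

# top's %CPU is per-core, so 500% simply means about 5 cores fully busy,
# and the whole machine tops out at cores * 100 percent.
reported_percent = 500.0
cores_busy = reported_percent / 100.0
machine_max_percent = cores * 100.0

print(cores_busy, machine_max_percent)
```

So 500% on a 16-core machine still leaves roughly 11 cores of headroom, which is consistent with nuria's PR-20 job being scheduled without trouble.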
[22:27:58] but that was also when I had put my jobs at the lowest priority [22:28:11] got it. [22:28:36] 6 hours back [22:28:44] so how about this? don't kill any of them, since we need results, I send an email to the internal list explaining the details, and we let them kill as needed? [22:29:08] are there processes that can be killed more easily? or will you be online over the weekend that people can reach you if a process needs to be killed? [22:29:17] ashwinpp, ^ [22:29:40] sure [22:29:54] But I would request that all jobs be killed or none [22:30:37] I see. Okay. will you be around over the weekend? [22:30:59] because then I would have incomplete data not knowing which jobs were killed and when, and there is no way to resume the job because I need to go sequentially over the entire dump again. [22:31:13] Yes, I can be around on IRC [22:31:43] okay. great. I'll send an email to set expectations. [22:31:49] nuria: we decided not to kill the jobs [22:32:09] there is no half-way killing, we should either kill all or none, nuria. I'll send an email to the internal list to explain this. [22:32:31] ashwinpp will be around over the weekend in case something pops up, nuria. I'll be available by phone/email, too. [22:33:01] ashwinpp: I'll take care of the internal email now. don't worry about it. [22:33:13] leila: why don't you communicate my phone as well as email address [22:33:28] sounds good, ashwinpp. that's better. [22:33:30] that way you need not be bothered [22:34:28] *thumbs up* [22:34:37] thanks for being so responsive, ashwinpp, and thank you leila for leading on this :) [22:34:51] np, Ironholds. sending the email to the internal list now. [22:38:40] *thumbs up* [23:19:04] Analytics-EventLogging: Send graphite metrics for Schema. 
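(Editor's aside: the all-or-nothing constraint above comes from processing the dump strictly sequentially with no saved position. A hypothetical mitigation for future runs, where the file name and structure are illustrative and not Ashwin's actual setup, is to checkpoint an offset so a killed job can resume instead of rereading the entire dump:)

```python
import json
import os

STATE_FILE = "progress.json"  # hypothetical checkpoint location

def load_offset():
    """Return the last saved position in the dump, or 0 on a fresh start."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["offset"]
    return 0

def save_offset(offset):
    """Record how far the sequential pass has gotten, atomically."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, STATE_FILE)  # rename is atomic on POSIX filesystems

start = load_offset()
# ... process the dump from `start`, calling save_offset() periodically ...
```

The write-to-temp-then-rename pattern means a kill at any moment leaves either the old or the new checkpoint intact, never a half-written one.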
as well - https://phabricator.wikimedia.org/T95780#1200088 (yuvipanda) NEW [23:25:24] Analytics-EventLogging, operations, Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200104 (yuvipanda) NEW [23:42:40] Analytics-EventLogging: Send graphite metrics for Schema. as well - https://phabricator.wikimedia.org/T95780#1200168 (yuvipanda) The general problem to be solved here is 'alert based on arbitrary EL criteria', I think. [23:55:31] Analytics-EventLogging: Send graphite metrics for Schema. as well - https://phabricator.wikimedia.org/T95780#1200229 (Nuria) >This would also allow easy alerts for very specific cases that might also interest non-ops/techie folks. >Use case in point is that @Deskana and @bearND would like to g... [23:58:01] Analytics-EventLogging, operations, Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200234 (Nuria) This was caused by the migration and we corrected the problem already. Please see: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150406-Event... [23:58:37] Analytics-EventLogging, operations, Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200245 (Nuria) Closing ticket, let me know if something needs to happen here additionally. [23:58:45] Analytics-EventLogging, operations, Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200246 (Nuria) Open>Resolved