[01:25:06] Analytics-Backlog, Labs, Wikimedia-Apache-configuration, operations, and 3 others: https://wikitech.wikimedia.org/beacon/statsv 404 Not Found - https://phabricator.wikimedia.org/T104359#1600598 (Krenair) Open>Resolved
[08:32:50] Analytics, operations, Graphite: Graphite `reqstats.` hierarchy is filled with apparently unused metrics for each of our wiki domains - https://phabricator.wikimedia.org/T111318#1601266 (hashar) NEW
[08:45:39] Analytics, operations, Graphite: Graphite `reqstats.` hierarchy is filled with apparently unused metrics for each of our wiki domains - https://phabricator.wikimedia.org/T111318#1601287 (hashar)
[08:45:42] Analytics-Kanban, operations, Monitoring, Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1601288 (hashar)
[08:46:39] Analytics, operations, Graphite: Graphite `reqstats.` hierarchy is filled with apparently unused metrics for each of our wiki domains - https://phabricator.wikimedia.org/T111318#1601266 (hashar) Per @fgiunchedi , `reqstats.` is being overhauled: {T83580}.
[13:23:15] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[13:25:24] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[14:17:16] holaaaaa
[14:18:51] hey nuria!
[14:22:16] hey, you're up early
[14:22:58] 5am the cat came and woke me up.
I think I'm gonna get a dog :)
[15:09:33] Hi nuria
[15:09:39] holaaa joal
[15:09:52] Quick question about the ticket update
[15:10:12] joal: yes, please correct as needed
[15:10:38] You say the check implies change to pageview definition itself
[15:10:47] And I don't get why
[15:11:03] joal: ah, poor wording on my part
[15:11:09] So before being an idiot in the ticket, I prefer to be one on IRC :)
[15:12:27] joal;
[15:12:30] corrected
[15:13:00] joal: "This implies changes to the code that tags pageviews but not to the pageview definition itself."
[15:13:00] hm, I still don't get it :)
[15:13:11] batcave for a minute ?
[15:13:33] joal:k
[15:24:53] Analytics-Cluster, operations, ops-eqiad, Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1602601 (Ottomata) I just installed and puppetized these nodes. Thanks!
[15:25:07] Analytics-Cluster, Analytics-Kanban: {mule} Hadoop Cluster Expansion - https://phabricator.wikimedia.org/T99952#1602603 (Ottomata) All new Hadoop workers have been racked and installed, woot!
[15:31:39] ottomata: Standup :)
[15:31:57] !
[16:12:31] Analytics-Backlog, Analytics-Cluster: Audit kernel version on analytics worker nodes - https://phabricator.wikimedia.org/T109834#1602736 (Ottomata) Currently doing: apt-get install linux-headers-3.13.0-62 linux-headers-3.13.0-62-generic linux-image-3.13.0-62-generic linux-image-extra-3.13.0-62-generic...
[16:31:50] ottomata: coming to retro?
[16:32:01] ottomata: i wasn't sure if you attend now
[16:42:20] Analytics-Kanban: Puppetize dashiki dashboard deployments - https://phabricator.wikimedia.org/T110351#1602952 (Milimetric) a:Milimetric
[16:43:00] Analytics-Kanban, Reading-Admin, Patch-For-Review, Wikipedia-Android-App: Update definition of page view and implementation for mobile apps {hawk} [8 pts] - https://phabricator.wikimedia.org/T109383#1602958 (Milimetric) a:Milimetric>Nuria
[16:43:13] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[16:45:22] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[16:48:29] Analytics-Backlog: Resurrect logbot so we can leave a breadcrumb trails for incident reports, teammates, etc. - https://phabricator.wikimedia.org/T111393#1603009 (Milimetric) NEW
[17:11:12] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1603196 (Nuria) ClientIp is always encrypted and takes no space so I do not think is an issue.
[17:13:29] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1603202 (Milimetric) p:Triage>Normal
[17:14:37] Analytics-Backlog, Research consulting, Research-and-Data: Workshop to teach analysts, etc about Quarry, Hive, Wikimetrics and EL {flea} - https://phabricator.wikimedia.org/T105544#1603206 (Milimetric)
[17:17:53] Analytics-Backlog, Analytics-Cluster: Create Kafka deployment checklist on wikitech - https://phabricator.wikimedia.org/T111408#1603220 (ggellerman) NEW
[17:18:33] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [30.0]
[17:21:39] Analytics-Backlog, Analytics-Cluster: Spike replacing Camus with Gobblin - https://phabricator.wikimedia.org/T111409#1603236 (ggellerman) NEW
[17:22:33] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[17:26:47] (CR) Bmansurov: "Dan, sorry this has been dragging on for too long. I've been pretty busy with other stuff, but tomorrow I'm planning on wrapping this up." [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric)
[17:29:11] Analytics-Backlog, Analytics-EventLogging: Send EventLogging validation logs to Logstash {oryx?} - https://phabricator.wikimedia.org/T111412#1603272 (mforns) NEW
[17:39:34] Analytics-Backlog, Analytics-EventLogging: Send EventLogging validation logs to Logstash {oryx?} - https://phabricator.wikimedia.org/T111412#1603378 (Ottomata) With {stag} (and currently but not official yet), all event processing errors go to a special topic in Kafka 'eventlogging_EventError'. EventErro...
[17:58:35] ottomata: so the EL dashboard shows 3 clear holes: http://grafana.wikimedia.org/#/dashboard/db/eventlogging
[17:58:46] and the logs and db are fine during those times
[17:58:49] no loss or anything
[17:59:13] so something must be up with how stats are being reported or received, heard anything about it from ops?
[18:00:05] heh... so now grafana has errors loading the data altogetehr
[18:01:00] milimetric: yes, extra strange is that evnetloging zmq data doesn't have the holes
[18:01:29] yeah dunno
[18:01:56] milimetric: oooh congrats :D
[18:02:18] ottomata: the zmq has the holes, they just show up a little different
[18:02:32] for the hole yesterday, only the raw rate shows up, the other two disappear
[18:02:49] for the first hole today, at 13:00, all the metrics are not on the graph if you zoom in there
[18:02:54] congragz?
[18:03:02] oh, metrics meeting, lemme tune in
[18:03:12] milimetric: yeah, they just announced you finished 3 years :D
[18:04:39] ooh, I had forgotten, yay :)
[18:52:26] ottomata: problem with camus earlier ?
[18:52:50] looking at the kafka chart, 2 hours ago
[19:01:53] yeah joal i saw that too, i looked at camus logs, and everything seems fine
[19:02:06] oh, i did do a rolling restart of nodemanagers (2 hours ago though? no.)
[19:02:07] hm
[19:02:51] weird
[19:43:52] madhuvishy: tried every trick but the fact that pykafka has a tests package inside it's what tripping the test runner and doesn't look like there is an easy way to not make the test running go into that directory, i will try a bit more
[19:48:19] nuria: whatcha doin?
[19:50:56] ottomata: getting tests on EL master to run
[19:51:10] ottomata: with ahem a very ..crude workarround
[19:51:23] ottomata: cause now they error (nothing wrong with tests)
[19:51:45] ottomata: rather the pykafka egg has a test directory that trips the runner
[19:52:11] ottomata: makes sense?
[19:53:37] nuria: wanna talk now ish?
[19:53:48] just had too much good chinese food :)
[19:54:04] madhuvishy: let me submit the crude workarround, to see if you can think of anything better
[19:54:08] madhuvishy: one sec
[19:54:11] nuria: alright
[19:56:58] hm yea
[20:04:50] Analytics-Backlog, Analytics-EventLogging: eventlogging tests in mainline error - https://phabricator.wikimedia.org/T111438#1603865 (Krenair)
[20:06:46] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban: eventlogging tests in mainline error - https://phabricator.wikimedia.org/T111438#1603883 (Nuria)
[20:07:15] hey milimetric
[20:07:22] You have a minute ?
[20:07:27] sure joal
[20:07:47] I'd like to sync on when we are on PV API
[20:07:51] madhuvishy: ok, change is in, ...very high tech as you see
[20:07:55] where we are sorry
[20:08:02] madhuvishy: all yours
[20:08:08] joal: no movement at all yet
[20:08:14] nothing on any of the tickets we opened
[20:08:14] nuria: the gerrit patch is not linked on phab
[20:08:28] joal: if you wanna try to ping those guys in the european morning tomorrow, I'd appreciate it
[20:08:47] Faidon, Alex, or Filipo I guess
[20:09:21] I can do that :)
[20:09:34] more precisely: I will do that
[20:09:49] thx much
[20:10:08] madhuvishy: right, i wonder why? let me add it by hand
[20:10:25] I'm looking for the master ticket milimetric
[20:10:30] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban: eventlogging tests in mainline error - https://phabricator.wikimedia.org/T111438#1603915 (Nuria) https://gerrit.wikimedia.org/r/#/c/235849/
[20:10:32] nuria: ya i see it linked in the commit message - this has happened a couple times now
[20:10:37] backlog dashboard ?
[20:10:43] looking
[20:10:49] Found it
[20:10:54] nuria: may be it needs to be the line above change id?
[20:10:54] kanban, paused 1st
[20:11:08] dont know
[20:11:11] madhuvishy: let me correct that
[20:11:15] joal: https://phabricator.wikimedia.org/T107056
[20:11:23] right
[20:11:24] thanx milimetric
[20:11:41] So there is only material provisionning so far, right ?
[20:11:46] milimetric: --^
[20:11:50] in terms of blocking tasks?
[20:11:54] yup
[20:11:56] yes
[20:12:11] well, that's kind of all that's left really for puppetization
[20:12:13] plus the puppet stuff in the ticket itself
[20:12:18] right
[20:12:27] and once they approve a certain setup we can know how to change the REST endpoints
[20:12:34] nuria: yeah now it appeared
[20:12:40] and then that'll tell us how to change the load job, if at all
[20:12:55] right
[20:13:05] madhuvishy: world is coming to an end !!! i forgot the format of commit messages.... good that qchris is not watching
[20:13:10] so now we're just waiting on ops, and have to ping :)
[20:13:15] nuria: lol why does it say uploaded by alex monk
[20:13:16] I'll ping too if they don't get back to us soon
[20:13:28] milimetric: I'll still build an oozie bundle with hive cubes
[20:13:37] nuria: :-)
[20:13:39] To have a bse ready to be cahnged
[20:13:56] joal: agreed
[20:14:01] coo
[20:14:03] cool
[20:14:30] he's always watching nuria, otherwise the world would come to an end
[20:14:37] ottomata: quick sync on hardware for PV API please?
[20:14:38] madhu: ok, ready if you want to talk about your job or anything else
[20:14:49] nuria: yeah, batcave?
[20:15:33] ottomata: I'm gonna try to ping Faidon or Alex or Filipo tomorrow european morning
[20:16:07] joal: when you ping them, the main thing we need is a review on the puppetization
[20:16:10] ottomata: I'd like to confirm how many spare we have for Cassandra/Restbase
[20:16:14] the rest we can mostly take care of
[20:16:23] madhu: give me 2 mins
[20:16:24] like the hardware, I think otto can help us with
[20:16:31] sure
[20:16:33] ottomata: Faidon agreed on having 3 better than two
[20:16:42] right milimetric
[20:17:03] milimetric: As you see, I'm trying to grab information ;)
[20:17:58] Also milimetric, we'll wait for better knowledge of core restbase integration before correcting the top end point, right ?
[20:18:37] joal: hm, I'm a little skeptical. I'm more leaning towards going with mostly what we have now
[20:18:52] and then later shifting towards what gabriel was suggesting
[20:19:05] so yearly; monthly, daily are only for last year, last month and last day, right
[20:19:14] oh, that
[20:19:16] oh, global setting
[20:19:16] we can talk about that with the team
[20:19:26] yes, global domain I was talking about
[20:19:28] milimetric: missujnderstanding :)
[20:19:31] joal: yes here
[20:19:44] but the /top endpoint I don't have strong opinions about. Maybe let's talk at standup tomorrow
[20:19:45] So milimetric, fully agreed
[20:19:53] Let's start with what we have
[20:19:58] Then maybe change
[20:20:32] joal: we have 4 older dells left
[20:20:33] But we still need the 'top' endpoint corrected
[20:20:46] if 3 go to cassandra(?) and 1 as an extra stat box
[20:20:49] i think that will be fine
[20:20:55] ottomata: awesome :)
[20:21:32] ottomata: I prefer to know a bit before talking with the ops ;)
[20:21:41] ottomata: thanks !
[20:21:50] ottomata: question - can we change the /etc/spark/conf/log4j.properties to append logs to a file or something
[20:21:55] joal too ^
[20:22:06] it's now configured to append to console
[20:22:17] which means when we run through oozie, they are lost
[20:24:39] madhuvishy: http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs
[20:25:37] madhuvishy: You can pass a very ugly parameter to the job for a specific conf file (on hdfs I think, for multi-node purpose)
[20:25:42] madhuvishy: sure
[20:25:59] joal: yeah, i think we should change it though
[20:26:09] madhuvishy: why not :)
[20:26:32] milimetric: let's discuss the 'top' end-point tomorrow in standup :)
[20:26:38] ottomata: is this stuff in a repo? I can submit a patch if so
[20:26:50] or is it like tell ottomata and it will happen :)
[20:27:38] joal: sure, I'll try to remember if you don't
[20:27:43] :)
[20:27:58] * joal is a specialist of forgetting to talk about stuff :)
[20:31:32] madhuvishy: its in puppet
[20:31:37] in the cdh module
[20:31:40] * madhuvishy looks
[20:31:45] its just a file there
[20:31:52] could be templatized, mabye.
[20:32:10] ottomata: it's a submodule?
[20:32:18] modules/cdh/files/spark/log4j.properties
[20:32:19] yup
[20:33:34] madhuvishy: i'm heading out, any other last qs?
[20:33:34] :)
[20:33:50] ottomata: no.. I'm looking at it, thanks :)
[20:35:47] mk laterrssrsrsr
[20:40:02] madhuvishy: I have a lead !
[20:40:09] joal: oh?
[20:40:46] In hue, the failed workflow tells you an error code: JA018
[20:40:59] Then in oozie doc: JA018 is output directory exists error in workflow map-reduce action
[20:41:08] huh
[20:41:29] I think my original guess (yesterday) was correct
[20:41:34] I explain more in detail
[20:41:42] sure. batcave?
[20:41:46] sure !
[20:41:49] omw
[20:56:23] (CR) Mforns: [C: 1 V: 2] "LGTM!"
(2 comments) [analytics/wikihadoop] - https://gerrit.wikimedia.org/r/233937 (https://phabricator.wikimedia.org/T108684) (owner: Joal)
[21:02:11] Analytics-Backlog, Analytics-EventLogging: Send EventLogging validation logs to Logstash {oryx?} - https://phabricator.wikimedia.org/T111412#1604211 (mforns) This is awesome! \o/ Probably there's a tool to easily move those to Logstash, Will look for it. Thx
[21:11:59] just thought to say hi milimetric, now that I'm around for the rest of the day. ;-)
[21:16:15] Analytics-Backlog, Team-Practices-This-Week: Get regular traffic reports on TPG pages - https://phabricator.wikimedia.org/T99815#1604284 (JAufrecht) Since automation is not cheap, let's do one more manual inspection of data and then decide whether or not there's anything in the data worth the investment of...
[21:25:19] Analytics-Backlog, Team-Practices-This-Week: Get regular traffic reports on TPG pages - https://phabricator.wikimedia.org/T99815#1604341 (JAufrecht) a:JAufrecht>kevinator
[21:31:23] madhuvishy: I'll try to investigate more tomorrow
[21:31:40] madhuvishy: Let me know if you find anythin by email :)
[21:31:43] joal: no problem, i'll try these things and leave an update for you
[21:31:45] :)
[21:31:51] Have a good end of day !
[21:32:03] ciao joal, madhu let me know what you find out
[21:32:11] Bye a-team, see you tomorrow !
[21:32:22] bye joal see you!
[21:35:01] nuria: sure
[21:35:13] good night joal!
[21:49:08] Analytics-Kanban, Reading-Admin, Patch-For-Review, Wikipedia-Android-App: Update definition of page view and implementation for mobile apps {hawk} [8 pts] - https://phabricator.wikimedia.org/T109383#1604466 (Nuria) Apps need to tag on X-Analytics header whether a request is a pageview or a preview (r...
[21:53:37] milimetric: then, for the mobile stuff i will just go ahead and do code changes?
[21:54:13] milimetric: did we agree that if we have preview=1 in the x-analytics header it will not be counted as a pageview?
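Note on the preview=1 question above: a minimal sketch of what such a tagging check might look like. It assumes the X-Analytics header value is a semicolon-separated list of key=value pairs, and it treats the preview=1 rule as the proposal being asked about, not as a confirmed decision. The function names are hypothetical and this is not the refinery implementation.

```python
def parse_x_analytics(header: str) -> dict:
    """Parse an X-Analytics value (assumed 'k1=v1;k2=v2') into a dict."""
    fields = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, _, value = pair.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def is_tagged_preview(header: str) -> bool:
    """Hypothetical rule from the question above: preview=1 marks a preview."""
    return parse_x_analytics(header).get("preview") == "1"
```

Under that proposal a request for which is_tagged_preview returns True would be excluded from pageview counts; every other part of the pageview definition is outside this sketch.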
[22:06:14] madhu: about to logoff for a bit, did you learned something?
[22:06:55] nuria: nope, i dont see the logs show up with those options
[22:07:02] i'll keep trying
[22:11:26] madhuvishy: ya, it might work in a simper setup than ours
[22:11:34] nuria: hmmm
[22:11:48] wonder if the logs go to the actual machine which executes it
[22:11:52] not sure
[22:12:09] spark.executor.extraJavaOptions and spark.driver.extraJavaOptions
[22:30:32] Analytics-Backlog, Analytics-Dashiki, Editing-Analysis, Research-and-Data, VisualEditor: Start generating a visual editor adoption metric - https://phabricator.wikimedia.org/T109158#1604667 (Halfak) I think I need a better notion of what is meant/desired by "adoption" before I can comment effect...
[22:40:13] Analytics-Backlog, MediaWiki-API, Research-and-Data: log user agent in api.log - https://phabricator.wikimedia.org/T108618#1604694 (ggellerman)
[22:44:51] Analytics-Backlog, Research consulting, Research-and-Data: Workshop to teach analysts, etc about Quarry, Hive, Wikimetrics and EL {flea} - https://phabricator.wikimedia.org/T105544#1604702 (Halfak) I'm in. Let me know when you want to run the event and I'll help out.
[23:35:53] Analytics-Backlog, Analytics-Dashiki, Editing-Analysis, VisualEditor, Patch-For-Review: Improve the edit analysis dashboard {lion} - https://phabricator.wikimedia.org/T104261#1604805 (Neil_P._Quinn_WMF) p:High>Normal
[23:50:06] Analytics-Backlog, Analytics-Dashiki, Editing-Analysis, Research-and-Data, VisualEditor: Start generating a visual editor adoption metric - https://phabricator.wikimedia.org/T109158#1604878 (Neil_P._Quinn_WMF) p:High>Normal
[23:55:21] (PS7) Madhuvishy: Report RESTBase traffic metrics to Graphite [analytics/refinery/source] - https://gerrit.wikimedia.org/r/234453 (https://phabricator.wikimedia.org/T109547)
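Note on the Spark logging thread above (the "very ugly parameter" and the two extraJavaOptions settings): a rough sketch of how a per-job log4j file is commonly passed to spark-submit. It assumes a log4j 1.x style properties file, shipped with --files so that it lands in each container's working directory under its base name. The helper name and the example path are hypothetical. The log itself reports that these options did not produce the expected logs in this cluster at the time, so treat this as an illustration of the approach rather than a confirmed fix.

```python
import os


def spark_log4j_args(log4j_path: str) -> list:
    """Build spark-submit arguments that point driver and executors at a
    custom log4j properties file (hypothetical helper, log4j 1.x assumed)."""
    # --files ships the file to the containers, where it is addressed by
    # its base name, not by the original local path.
    name = os.path.basename(log4j_path)
    java_opt = "-Dlog4j.configuration=file:" + name
    return [
        "--files", log4j_path,
        "--conf", "spark.driver.extraJavaOptions=" + java_opt,
        "--conf", "spark.executor.extraJavaOptions=" + java_opt,
    ]


if __name__ == "__main__":
    # Example: print the extra arguments for a made-up local path.
    print(" ".join(spark_log4j_args("/tmp/custom-log4j.properties")))
```

In client mode the driver runs on the submitting machine, so its option may need the full local path instead of the base name; that would fit the guess above that logs end up on whichever machine actually executes the job.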