[00:08:12] Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-10-06_(1.27.0-wmf.2): Bug: client IP is being hashed differently by the different parallel processors {stag} [8 pts] - https://phabricator.wikimedia.org/T112688#1687831 (Ottomata) Phew, ok, I got this deployed, but I had to do a little hackery. Our pa... [00:10:01] Analytics-Backlog, Analytics-EventLogging: Upgrade eventlogging servers to Jessie - https://phabricator.wikimedia.org/T114199#1687834 (Ottomata) NEW [01:08:57] Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-10-06_(1.27.0-wmf.2): Bug: client IP is being hashed differently by the different parallel processors {stag} [8 pts] - https://phabricator.wikimedia.org/T112688#1688102 (Jdforrester-WMF) [01:41:20] Analytics, Analytics-Cluster, Fundraising Tech Backlog, Fundraising-Backlog, operations: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1688147 (Jgreen) >>! In T97676#1687605, @awight wrote: > Furthermore, when we do make the change, the `count` c... [02:33:16] Analytics-Backlog, Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1688177 (Bawolff) Can't you already write db queries of this form using UNION and foreign table references? 
(Of course, that's not very user friendly) [03:22:24] Quarry: 'New query' highlighted when looking at existing queries - https://phabricator.wikimedia.org/T106411#1688191 (Ricordisamoa) Because of https://github.com/wikimedia/analytics-quarry-web/blob/2f23db6fa4c2891cefdb40fd2f13e00b6514ba9a/quarry/web/templates/query/view.html#L1 [04:32:50] Analytics-Backlog, Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1688272 (Bawolff) I mean, like http://quarry.wmflabs.org/query/5417 [04:38:46] Analytics-Backlog, Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1688278 (Ricordisamoa) >>! In T95582#1688177, @Bawolff wrote: > Can't you already write db queries of this form using UNION and foreign table references? (... [05:03:52] Analytics-Backlog, Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1688289 (Bawolff) >>! In T95582#1688278, @Ricordisamoa wrote: >>>! In T95582#1688177, @Bawolff wrote: >> Can't you already write db queries of this form us... [09:49:01] Analytics-Tech-community-metrics, Possible-Tech-Projects: Improving MediaWikiAnalysis - https://phabricator.wikimedia.org/T89135#1688646 (Anmolkalia) @jgbrah, thank you for the guidance. I am on it. Thanks. 
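The UNION-with-foreign-table-references idea from T95582 can be sketched with SQLite as a stand-in (the real Quarry replicas are MariaDB, and every table, database, and column name below is illustrative only): attach several databases and qualify table names with the database alias, repeating the same per-wiki query once per database.

```python
import sqlite3

# Stand-in sketch for querying several wiki databases in one statement:
# SQLite's ATTACH lets us qualify table names with a database alias,
# mirroring the enwiki_p.page-style cross-database references available
# on the replicas. All names here are hypothetical.
def union_across_dbs():
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH ':memory:' AS enwiki")
    conn.execute("ATTACH ':memory:' AS dewiki")
    for db, titles in (("enwiki", ["Foo", "Bar"]), ("dewiki", ["Baz"])):
        conn.execute(f"CREATE TABLE {db}.page (page_title TEXT)")
        conn.executemany(f"INSERT INTO {db}.page VALUES (?)",
                         [(t,) for t in titles])
    # The same query repeated per database and UNIONed together, with a
    # literal column identifying which wiki each row came from.
    rows = conn.execute(
        "SELECT 'enwiki' AS wiki, COUNT(*) FROM enwiki.page "
        "UNION ALL "
        "SELECT 'dewiki', COUNT(*) FROM dewiki.page"
    ).fetchall()
    conn.close()
    return rows
```

This also shows the "not very user friendly" part noted in the thread: the query text has to be duplicated once per database by hand.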
[11:50:43] Analytics-Tech-community-metrics: Remove deprecated repositories from korma.wmflabs.org code review metrics - https://phabricator.wikimedia.org/T101777#1688869 (Aklapper) [12:09:30] (PS3) Addshore: Split metrics up [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242199 [12:34:06] (PS4) Addshore: Split metrics up [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242199 [12:41:41] (PS1) Addshore: chmod +x all .sh files [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242529 [12:50:19] headsup, stat1001-stat1003 will be rebooted starting in about 10 minutes [13:10:22] ok, thx moritzm [13:13:56] all three rebooted into fresh kernels, you can use them again [13:18:31] great [13:20:07] :) [13:20:11] stat1002 looks so empty ;) [13:20:31] :) [13:24:28] (CR) Aude: Split metrics up (1 comment) [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242199 (owner: Addshore) [13:26:02] (CR) Addshore: Split metrics up (1 comment) [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242199 (owner: Addshore) [13:27:00] (CR) Aude: Split metrics up (1 comment) [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242199 (owner: Addshore) [13:35:47] Analytics-Backlog: Standardize Hive UDF code comments and generate documentation {flea} - https://phabricator.wikimedia.org/T114238#1689125 (Milimetric) NEW [13:52:03] (PS1) Addshore: Add READMEs for all metrics [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242546 [13:53:33] (CR) Addshore: [C: 2 V: 2] Add READMEs for all metrics [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242546 (owner: Addshore) [13:53:43] (CR) Addshore: [C: 2 V: 2] chmod +x all .sh files [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242529 (owner: Addshore) [13:54:03] (CR) Addshore: [C: 2 V: 2] Move create SQL from comments to own files [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242100 (owner: Addshore) 
[13:54:16] (CR) Addshore: [C: 2 V: 2] Classify wikidata_social.php [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242097 (owner: Addshore) [13:54:30] (CR) Addshore: [C: 2 V: 2] Move scripts to a src dir [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242095 (owner: Addshore) [13:54:50] (CR) Addshore: [C: 2 V: 2] Also copy tsv files to aggregate-datasets [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/240725 (owner: Addshore) [13:55:28] (CR) Addshore: [C: 2 V: 2] Add sql to tsv script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/240724 (owner: Addshore) [13:55:39] (CR) Addshore: [C: 2 V: 2] Add social stats tracking script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/240710 (owner: Addshore) [13:56:32] (CR) Addshore: Add social stats tracking script (2 comments) [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/240710 (owner: Addshore) [13:56:46] (CR) Addshore: [C: 2 V: 2] Add getclaims property use tracking script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/240653 (owner: Addshore) [13:56:56] (CR) Addshore: [C: 2 V: 2] Script for tracking site_stats over time [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/240652 (owner: Addshore) [13:56:59] #spam [13:57:40] *twiddles thumbs* ahh, jenkins isnt going to do the merging.... [14:00:50] (CR) Addshore: [C: 2 V: 2] Split metrics up [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/242199 (owner: Addshore) [14:15:34] holaaa [14:15:46] ey nuria [14:23:53] Analytics-Tech-community-metrics, Possible-Tech-Projects: Improving MediaWikiAnalysis - https://phabricator.wikimedia.org/T89135#1689285 (Qgil) [14:25:01] Analytics-Tech-community-metrics, Possible-Tech-Projects: Improving MediaWikiAnalysis - https://phabricator.wikimedia.org/T89135#1027974 (Qgil) Thank you @jgbarah! Please update the description and propose some microtasks. 
When this is done, we will promote this project idea to #Outreachy-Round-11 candid... [14:29:43] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1689330 (mforns) I'm with Nuria in that we have to evaluate whether the complexity in the... [14:35:02] Analytics-Kanban: Investigate sample cube pageview_count vs unsampled log pageview count [5 pts] - https://phabricator.wikimedia.org/T108925#1689357 (JAllemandou) Hey Tilman, About bots no backfill was done. It could have been possible to do it from July 17th (2 month of refined webrequest are kept), but we t... [15:21:02] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1689560 (BBlack) Logging events and stats without having significant complexity or perf is... [15:35:55] Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-10-06_(1.27.0-wmf.2): Bug: client IP is being hashed differently by the different parallel processors {stag} [13 pts] - https://phabricator.wikimedia.org/T112688#1689630 (ggellerman) [15:38:52] Analytics-Kanban: Introduction to Hive class {flea} [13pts] - https://phabricator.wikimedia.org/T113545#1689645 (ggellerman) [15:39:06] Analytics-Kanban: Introduction to Hive class {flea} [13 pts] - https://phabricator.wikimedia.org/T113545#1689650 (Milimetric) [15:50:35] Analytics, operations, Patch-For-Review: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1689684 (Addshore) Open>Resolved a:Addshore [15:55:32] joal, sorry for the late notice. [15:55:41] Would you be able to join us for a meeting re. altiscale cluster in 5 min [15:55:42] ? [15:56:05] halfak: I have a event-bus meeting in 5 :( [15:56:17] No worries! 
I'll have some stuff to talk to you about though. [15:56:22] It turns out that Nitin, our engineer, got an injury that will prevent him from doing substantial work. [15:56:43] So I was hoping to get a hand from you loading data into our "research cluster" hive instance. [15:57:02] halfak: Woould love to :) [15:57:20] :) Cool. ttyl [15:57:27] Will ask the team to ensure my resource is not bottlenexk anywhere, but I'd gladly help :) [15:57:30] later ! [17:11:28] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1690049 (Nuria) There are many good points on @BBlack reply. We could give a little more... [17:33:23] hello nuria. non-technical comment: can you release R61 from the event we have at 11? madhuvishy added R31 but she doesn't have enough permission to remove R61. :D [17:33:38] leila: :) thanks [17:34:15] np, madhuvishy. as the person who complains about rooms not being available for meetings :D, I take responsibility for releasing this one. :D [17:39:44] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side-08 processor/client-side-07 processor/client-side-06 processor/client-side-05 processor/client-side-04 processor/client-side-03 processor/client-side-02 processor/client-side-01 [17:40:49] milimetric: i just turned on the 12 processors [17:40:50] lets check it out. [17:41:06] cool, i'm at scrum of scrums [17:41:11] so first i'll tell everyone we did that [17:41:14] and then we'll check :) [17:41:17] :D [17:41:23] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [17:41:34] k [17:44:04] milimetric: Can i get a link to the hashing salt patch, whenever you get a chance? 
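The hashing-salt bug being fixed here (T112688) is that each parallel processor generated its own random salt, so the same client IP hashed to different values depending on which processor handled the event, and the mapping reset on every restart; the patch stores one shared salt in etcd instead. A minimal sketch of the idea, with the hash construction and names assumed rather than taken from the actual EventLogging code:

```python
import hashlib
import os

def hash_ip(client_ip: str, salt: bytes) -> str:
    """Keyed hash of a client IP; two processes agree only if they
    share the same salt."""
    return hashlib.sha256(salt + client_ip.encode("utf-8")).hexdigest()

# Old behaviour: each processor draws its own os.urandom salt, so the
# same IP hashes differently per processor.
salt_a, salt_b = os.urandom(16), os.urandom(16)
inconsistent = hash_ip("203.0.113.7", salt_a) != hash_ip("203.0.113.7", salt_b)

# Fixed behaviour: one salt loaded from a shared store (etcd in the
# actual patch), so all twelve parallel processors agree, and the
# hashes survive restarts because the salt outlives the processes.
shared_salt = b"salt-loaded-from-etcd"  # placeholder for the etcd value
consistent = all(
    hash_ip("203.0.113.7", shared_salt) == hash_ip("203.0.113.7", shared_salt)
    for _ in range(12)  # twelve processors, as turned on later in this log
)
```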
[17:44:29] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1690191 (mforns) In the particular case of the CentralNoticeBannerHistory schema, I see th... [17:44:32] csteipp: for eventlogging? [17:45:34] leila: yes, let me get to it [17:45:51] leila: done [17:46:06] thanks nuria [17:47:13] csteipp: https://gerrit.wikimedia.org/r/#/c/238854/ [17:47:46] csteipp: i think that's the one, which puts a shared salt on /etcd [17:47:55] nuria: Cool, thanks [17:48:35] csteipp: is the same hash that was used before (os.urandom), just stored in etcd instead of in memory [17:48:46] so it can be shared by multiple processes [17:50:07] ottomata: back to libjars, will report accordingly, cc madhuvishy [17:50:25] k [17:50:29] GOOD LUCK [17:50:30] :) [17:53:36] nuria: no hangout in the bot mneeting :) [17:53:53] joal:ains.. [17:54:25] leila: alas, i'm working from home and cannot use both rooms :P [17:54:34] mm, same here. [17:54:42] then maybe nuria should cancel the other room, too? [17:54:44] a-team: IP hash frequencies look good! what's cool about this now too is that the hashed IPs stay the same even during eventlogging restarts, whereas before they'd reset whenever that happened [17:54:45] I guess no one is in the office [17:54:46] :-( [17:54:57] i am! [17:54:58] leila: i cancel 6th floor [17:55:01] oh for meeting [17:55:03] nawww [17:55:06] nm [17:55:08] ottomata: :) [17:55:11] leila: i should cancel 3rd floor too? [17:55:46] madhuvishy: do you know of a EL schema that is inactive? [17:55:55] Analytics [17:56:21] hm ok, cool [17:56:22] perfect. [17:56:22] https://meta.wikimedia.org/wiki/Schema:Analytics [17:56:32] ottomata: data in db looks good [17:56:43] I'll keep an eye on the dashboards just to be sure [17:57:10] madhuvishy: ok if I blacklist this schema from eventlogging-valid-mixed? 
it will then not go into MySQL [17:57:16] i guess I could blacklist temporarily [17:57:28] i'm going to blast data into the system with this schema [17:57:48] ottomata: it's just a test schema so you can do whatever, temporary or not [17:57:52] ottomata: no one uses this [17:57:54] yeah [17:58:08] ok [17:58:25] well ok, i will blacklist and leave it so, but just remember if you test with it in the future that it wont' go into mysql [17:59:16] ottomata: that's fine - we added it to the list of tables to delete permanently - so hopefully no one will use it [17:59:39] I'll add this to the talk page, just in case someone gets super confused :) [18:00:07] k [18:00:08] thanks [18:00:12] milimetric: it says obsolete etc [18:00:32] right, but just in case someone else wants to use it to test in the same way we are [18:00:39] and then gets confused when events don't go into sql [18:00:55] milimetric: right, okay [18:01:24] feeding the archives / leaving breadcrumbs is something I try to be better at since I worked with Christian :) [18:01:31] I've gotta run out for a bit, brb [18:33:09] mforns_gym: simple chnage for you to look at whenever [18:33:09] https://gerrit.wikimedia.org/r/#/c/242634/ [18:33:10] no hurry [18:48:29] Analytics-Kanban: Investigate sample cube pageview_count vs unsampled log pageview count [5 pts] - https://phabricator.wikimedia.org/T108925#1690491 (JAllemandou) Tilman: I confirm the discrepancy we observe comes from a difference in the computation of the pageview boolean, and not the sampling (see below).... [18:58:05] Hi a-team ! [18:58:13] hey joal [18:58:14] Need to run, will see you tomorrow ! [18:58:23] good night :) [18:58:30] Thanks for very interesting meeting nuria and madhuvishy :) [18:58:47] nuria: Please let me know if you find the libjar siolution [18:59:35] nuria: an idea could be that package com.linkedin.camus is already existing in a jar, and therefore is not looked after in another for the example subpackage ... 
But it's a wild guess [19:00:44] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [19:02:34] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [19:03:54] PROBLEM - Throughput of event logging events on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [600.0] [19:05:53] Analytics-Cluster: Trouble with access to hue - https://phabricator.wikimedia.org/T114292#1690557 (Krenair) [19:09:03] RECOVERY - Throughput of event logging events on graphite1001 is OK: OK: Less than 15.00% above the threshold [500.0] [19:09:29] Analytics-Backlog, Analytics-Cluster: Trouble with access to hue - https://phabricator.wikimedia.org/T114292#1690567 (madhuvishy) a:madhuvishy>Ottomata [19:09:53] Analytics-Backlog, Analytics-Cluster: Trouble with access to hue - https://phabricator.wikimedia.org/T114292#1690550 (madhuvishy) @Ottomata Could it be that his LDAP account needs to be manually synced? [19:14:26] nuria: I ran the gist that you pasted [19:14:33] and it seems like the job ran for me [19:14:35] but [19:14:42] it says [CamusJob] - Discarding topic (Decoder generation failed) : test [19:14:52] so nothing got written [19:17:31] madhuvishy: but didn't you get a classnotfound exception? [19:17:39] nuria: Nope [19:17:43] waittt [19:17:54] the hadoop job ran and the logs say the job finished [19:18:08] but check the local log [19:18:18] log_camus_avro_test.txt [19:18:19] ? [19:18:20] yes [19:18:27] yup that's the one [19:18:30] let me paste [19:18:50] nuria: aaah yes [19:18:58] it does say decoder not found [19:19:01] madhuvishy: good, well not really [19:19:04] when i scroll up [19:19:09] madhuvishy: but consistent [19:19:19] ok, is joal arround still? 
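For reference while this is being debugged: `-libjars` is a Hadoop generic option, so it only takes effect when it appears after the main class and before the job's own arguments, and only if the tool honors generic options (hence the Camus jar modification mentioned above). A sketch of the invocation in Python, with the jar and properties file names as placeholders rather than the real paths from the gist:

```python
# Hypothetical reconstruction of the hadoop command under test; the jar
# and properties file names are placeholders, not real paths.
def camus_command(schema_jar: str, props: str) -> list:
    return [
        "hadoop", "jar", "camus-example.jar",
        "com.linkedin.camus.etl.kafka.CamusJob",
        # Generic option: ship an extra jar (e.g. one containing the
        # schema registry classes) to every task's classpath.
        "-libjars", schema_jar,
        # Camus reads its job configuration from a properties file.
        "-P", props,
    ]

cmd = camus_command("schemas.jar", "camus.properties")
```

Note that even with the jar on the classpath, the fully qualified class name configured for the registry still has to match the package inside the jar, which turns out to be the second failure below.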
[19:19:34] nuria: nope, he had to leave [19:20:11] ok, let's try to see if ottomata is there [19:21:03] cool, i'm poking around too to see what's going on [19:21:35] so, the -libjars is working (as in the hadoop job swallows the option) [19:21:55] madhuvishy: but the jar is not being loaded [19:22:10] madhuvishy: let me check for typos one more time [19:22:19] right [19:22:34] andrew mentioned yesterday may be it needs to be loaded by all workers [19:22:41] madhuvishy: waitttt [19:22:56] madhuvishy: i think i know what it is [19:23:10] oh? [19:23:42] madhuvishy: ya, the package must be wrong this time, cause libjars is working [19:25:12] madhuvishy: yes, trying again 'DummySchemaRegistry' had the wrong package [19:25:47] madhuvishy: after i modified the camus jar to get -libjars ... things should work [19:25:52] madhuvishy: let's see [19:25:55] okay [19:26:49] madhuvishy: batcave? [19:27:11] nuria: yup joining [19:37:17] ottomata: can you let yuvi in? :D [20:11:18] ottomata: there? [20:11:46] ottomata: is there a way we can flush the test topic on kafka? [20:35:50] Analytics-EventLogging, Beta-Cluster, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, and 2 others: Beta Cluster EventLogging data is disappearing? - https://phabricator.wikimedia.org/T112926#1690997 (DStrine) [20:42:59] ottomata: what is json encoded avro? [20:44:38] leila: ha! i needed help myself a few mins ago [20:44:41] i forgot my card today too [20:44:45] just made it back in [20:44:50] nuria: flush? [20:44:56] ottomata: ;-) [20:45:09] ottomata: as in 'delete all messages' [20:45:23] ah [20:45:23] no [20:45:25] no way [20:45:48] hm, maybe. could mayyybye alter topic configs and set a short retention time [20:46:03] but, not realy woth it [20:46:09] nuria: can you just configure to consume from latest? [20:46:24] so you don't consume old data that is bad format? 
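The "json encoded avro" question that comes up below can be unpacked: the default Camus decoder expects each Kafka message to be the record's fields serialized as plain JSON text (matching the Avro schema's field names), not Avro binary. A minimal stdlib illustration using the exact test message from this session:

```python
import json

# The message sent during this test session: a JSON object whose fields
# mirror the (assumed) Avro schema, rather than Avro binary encoding.
payload = '{"id":123456,"name":"pepito perez", "muchoStuff":{"a": "1"}}'

# What a JSON-expecting decoder does with it:
record = json.loads(payload)

def looks_like_json(raw: bytes) -> bool:
    """Avro *binary* for the same record would fail this check outright."""
    try:
        json.loads(raw.decode("utf-8"))
        return True
    except (UnicodeDecodeError, ValueError):
        return False
```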
[20:46:36] ottomata: yeah that's what we tried [20:46:46] but it looks like a different problem [20:47:12] they want us to send "json encoded avro" and i am not sure what that means [20:47:56] who is they? [20:49:29] ottomata: these people [20:49:30] https://groups.google.com/forum/#!topic/camus_etl/ipmjc02PrBY [20:49:49] the error they report here is the exact same one we get [20:51:28] !!!!! [20:51:28] The decoder by default expects json encoded messages coming from kafka, not avro [20:51:31] !!!! :? [20:51:35] YES [20:51:46] and i dont understand what that is even supposed to mean [20:53:14] oh, i know what it means, still reading.. [20:53:17] ottomata: {"id":123456,"name":"pepito perez", "muchoStuff":{"a": "1"}} is what we tried to send [20:55:41] oh you are trying the dummylog too [20:55:53] yes [20:56:06] ok, looking... [20:59:26] ottomata: ok, some progress today, we can launch a job and pass it a third party jar that contains schemas: https://gist.github.com/nuria/833fef6a74574125a3fc [21:00:25] ottomata: now, we just need to have a dummy schema registry or some registry that we can use to validate messages, the current example on camus is failing but i think we can fix it for reference [21:00:42] yes reading some code now. [21:02:26] yall wanna batcave? [21:02:28] nuria: , madhuvishy [21:02:29] ? [21:03:07] hey ottomata :] when you finish this discussion, do you have 5 mins to talk? [21:03:11] yes [21:03:12] oh [21:03:13] NO [21:03:13] ah [21:03:16] xD [21:03:17] RFC meeting is now, want to join that [21:03:18] sorry yall. [21:03:27] np [21:04:10] oh, mforns nm, that is just an IRC meeting? [21:06:35] mforns: what's up? [21:07:56] ottomata: nuria has meeting with Kevin [21:08:34] ottomata: where is RFC meeting happening? 
[21:09:13] in #wikimedia-office [21:10:13] ottomata, it's about the logstash puppet change [21:11:05] I should apply the logstash::eventlogging role to the deployment-logstash2 instance in the deployment-prep project [21:12:17] is this to be done in here? https://github.com/wikimedia/operations-puppet/blob/production/hieradata/labs/deployment-prep/host/deployment-logstash2.yaml [21:12:50] naw, i think here mforns [21:12:50] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=8c2fe47d-6ea4-4faa-90f9-83a2baee3bae&project=deployment-prep®ion=eqiad [21:12:57] scroll down, see logstash roles checked [21:13:06] mmmmmmmm [21:13:25] i can add the role to that list if you like [21:13:34] role::logstash::eventlogging yes? [21:13:42] ok, or tell me how to do it [21:13:48] not sure if you can, try: [21:13:51] https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup [21:13:54] you have to have super powers [21:13:57] yes role::logstash::eventlogging [21:13:58] scroll down, add class [21:14:17] oh [21:14:17] mforns: i don't think yo have super powers, so i am adding [21:14:31] no, I'm regular [21:14:36] the regular guy [21:14:43] :] [21:14:48] done. [21:14:52] thx [21:14:55] mforns: answering bd808's question about multiple consumers [21:15:06] aha [21:15:22] i don't know how the logstash kafka plugin works [21:15:29] but i think running on multiple hosts will cause duplicates, right? [21:15:38] unless there is some auto balancing stuff [21:15:46] plus, EventError is a single partition topic [21:15:54] aha [21:15:55] so, unless we increase partitions, you can't parallelize those [21:16:16] ok, so I should load the role only in one of the machines [21:16:20] yes, i think so [21:16:37] bd808, which one should I, any preference? [21:16:46] ja hm, not sure the best way to do that, maybe conditional [21:16:57] if $::hostname == 'logstash1001', maybe? 
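The single-host approach discussed here (one consumer only, because EventError is a single-partition topic and multiple logstash readers would duplicate it) can be sketched as a hostname guard inside the shared node block. This is a hypothetical site.pp fragment, not the actual patch; the node regex and host choice follow the conversation below:

```puppet
# Hypothetical sketch: keep one node block for all three logstash
# hosts, but apply the EventLogging reader on only one of them, since
# the single-partition EventError topic cannot be parallelized and
# extra consumers would produce duplicates.
node /^logstash100[1-3]\.eqiad\.wmnet$/ {
    if $::hostname == 'logstash1003' {
        include role::logstash::eventlogging
    }
}
```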
[21:16:58] dunno [21:17:31] node /^logstash1001\.eqiad\.wmnet$/ { [21:17:36] node /^logstash100[2-3]\.eqiad\.wmnet$/ { [21:18:15] mforns: we should look at the current server load and see if there a "best" candidate. 1001 would be the worst. It is already a SPoF for apache2+hhvm logs via syslog [21:18:24] aha [21:18:42] bd808: whichever you choose, should he just make that a conditional in the node blck? [21:18:50] if hostname ... [21:18:55] that's probably easiest [21:18:56] role logstash::eventlogging [21:18:56] ? [21:19:11] I think we have done things like that before in site.pp [21:19:24] aha [21:20:11] can't we just use regexp like it's done in node /^logstash100[1-3]\.eqiad\.wmnet$/ { [21:20:15] ? [21:21:37] oh I see, you mean within the section, ok [21:21:39] mforns: yes [21:21:46] probably not worth separating out the nodes [21:21:52] cool [21:22:20] btw, I added a nitpicky comment in input/kafka.pp [21:22:20] bd808, so 1002 then? [21:22:27] cool! [21:24:10] ottomata, I was looking for zookeeper_url in labs realm and couldn't find it [21:25:20] mforns: 1003 looks like it is taking less traffic right now [21:25:21] ottomata: I think we can merge https://gerrit.wikimedia.org/r/#/c/231574/ then. You should be able to +2 it, alex gave the go-ahead after that last fix [21:25:51] you said I could just use kafka::config::zookeeper_url, that it would use "deployment-zookeeper01.deployment-prep.eqiad.wmflabs:2181/kafka/deployment-prep" when in labs, but I could not find it in puppet, am I missing something? 
[21:25:59] ottomata, ^ [21:26:09] bd808, 1003, cool thx [21:27:00] ottomata: wait :) maybe we need to wait for the new aqs group [21:27:08] I'll ping you then or alex tomorrow [21:30:18] mforns: its in hiera [21:30:28] https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep [21:30:30] ottomata, I see [21:31:09] thx [21:37:14] ottomata, btw, when I try to configure https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-logstash2.deployment-prep.eqiad.wmflabs, it says "The specified resource does not exist." [21:43:25] hmmm [21:43:27] ? [21:43:37] this link doesn't wokr? [21:43:37] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&project=deployment-prep&instanceid=8c2fe47d-6ea4-4faa-90f9-83a2baee3bae®ion=eqiad [21:43:52] ottomata, no, same message [21:44:05] maybe I don't have permits to configure that [21:44:07] weird, mforns maybe log out and log back in [21:44:08] maybe not. [21:44:11] ok [21:44:11] want me to just check the box? [21:44:44] ottomata, mmm [21:44:58] the code is not there yet [21:45:04] ok [21:45:05] right. [21:45:21] bd808: ^ mforns cannot configure deployment-prep instances, you can help? [21:46:40] ottomata, bd808, I pushed the changes to gerrit, if you think that's OK, I'll pull puppet from deployment-logstash2 and apply the role in the machine [21:46:53] yeah. I can help at the top of the hour when the rfc meeting is done [21:46:54] we want to cherry-pick that patch, apply it and see what happens right? [21:47:04] aha, cool [21:47:27] let me know when you have some time, thx [21:53:03] Analytics-Tech-community-metrics, Phabricator, DevRel-November-2015: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1691218 (Krenair) [22:02:28] dear beloved a-team, do you guys have use cases lined up for $real_hardware in labs. i.e. 
managing a box in labs like a vm that is hardware for mucho disk or mem or cpu [22:02:44] I'm trying to collect some actual use cases before we design a solution, seems sensible [22:05:17] chasemp: I think halfak has a bunch of those types of cases [22:05:36] o/ [22:05:38] * halfak reads [22:05:52] yeah. [22:05:56] So. I have two. [22:06:02] chasemp: we'd like a copy of the hadoop cluster we have in prod [22:06:16] so we can test, but nothing else huge. Most of the other stuff we do in labs is dashboards and those can use tiny instances [22:06:30] but that is just functional testing [22:06:31] not load testing [22:06:34] milimetric: how many boxes would that be? [22:06:36] so vms are fine, ja? [22:06:37] mforns: If my irc stays working, I have some time to help now. [22:06:48] bd808, cool [22:07:09] bd808, have you seen the latest changes? [22:07:16] are you ok with them? [22:07:33] https://gerrit.wikimedia.org/r/#/c/241984/ [22:07:41] * bd808 looks [22:10:52] mforns: will $role::analytics::kafka::config::zookeeper_url have the correct data in the beta cluster? [22:11:35] bd808, ottomata says this variable is stored in hiera and it has the correct value depending on the realm [22:11:52] perfect [22:12:22] halfak: if I make a ticket would you mind fleshing out the ideas you have? [22:12:29] Yes [22:12:32] Rather no [22:12:36] I would not mind ;) [22:13:24] right :) cool thanks [22:13:33] mforns: l cherry-pick to the beta cluster puppet master, apply it on deployment-logstash2 and then we can see if logstash starts up or not :) [22:13:47] *I'll [22:13:55] bd808, cool [22:15:15] bd808, I can send some invalid events to beta cluster's eventlogging instance to generate some contents in the topic in question [22:16:04] doing battle with puppet on logstash2 at the moment. :/ [22:23:03] mforns: ok. 
The role is applied and the generated logstash config looks like: https://phabricator.wikimedia.org/P2127 [22:23:12] I just realized that we will need a bit more logstash config to go with this [22:23:56] bd808, aha, what would we need? [22:25:28] we are going to need config to add the "es" tag to these messages to route them into Elasticsearch. We will probably end up wanting to do more transforms too once we see what the records look like. [22:25:50] So we need a filter file similar to https://github.com/wikimedia/operations-puppet/blob/production/files/logstash/filter-logback.conf [22:26:05] I can write it up and add it to the patch for you [22:26:07] easy peasy [22:26:39] aha, ok, if you're busy I can do this, as well [22:27:38] I've got time to help. No worries [22:27:45] ok bd808 thanks! [22:28:10] chasemp: Andrew's right, just functional testing, we don't need anything for load testing in labs that I know of [22:29:54] PROBLEM - Throughput of event logging events on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] [22:30:11] mmm [22:34:51] milimetric, can you see eventlogging metrics in graphite? [22:35:06] mforns: ottomata and I are doing some load testing :) [22:35:15] oh! ok :] [22:35:20] heheh [22:35:22] yes we are doing stuff [22:35:26] i thought i scheudled downtime for that [22:35:26] coool [22:35:41] I think that throughput is much higher now :) [22:35:49] woohoo [22:36:31] anyway, is this a reason why raw and valid metrics wouldn't show up? [22:36:36] Analytics-EventLogging, Beta-Cluster, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, and 2 others: Beta Cluster EventLogging data is disappearing? 
- https://phabricator.wikimedia.org/T112926#1691338 (DStrine) [22:37:45] aj, never mind, I forgot, now we are looking at grafana [22:38:15] RECOVERY - Throughput of event logging events on graphite1001 is OK: OK: Less than 15.00% above the threshold [500.0] [22:40:51] Analytics-Backlog, Analytics-EventLogging, Fundraising-Backlog, Unplanned-Sprint-Work, and 2 others: Promise returned from LogEvent should resolve when logging is complete - https://phabricator.wikimedia.org/T112788#1691357 (Ejegg) Open>declined Seems to be a no-go in EventLogging, we'll jus... [22:41:21] mforns: ok. in theory we are ready to catch the error events. can you make some happen? [22:41:29] bd808, sure! [22:41:57] bd808, I've seen the new changes, awesome [22:42:47] mforns: we haz events! https://logstash-beta.wmflabs.org/#dashboard/temp/AVAga-y3a1EjumVdqYuC [22:43:09] now to tweak the formatting a bit in that new filter [22:43:21] woohoo! [22:44:47] !log testing Event Logging by sending large amounts of events up to 600k to see how fast they process [22:48:04] PROBLEM - Throughput of event logging events on graphite1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [600.0] [22:48:57] bd808, I've sent events with json parsing errors, too. I think a good improvement would be having the event.message field in the message column [22:49:10] *nod* [22:49:18] probably you alread knew, ok :] [22:50:16] Analytics-EventLogging, Fundraising-Backlog: Nested EventLogging data doesn't get copied to MySQL - https://phabricator.wikimedia.org/T112947#1691416 (Nuria) Let us know if this is still a problem . Thus far we have not supported nested schemas but if workarround does not suffice we can do needed changes. 
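One idea that comes up in the exchange below is regexing the schema name out of the raw event when enriching EventError records, covering both the plain-JSON form and the still-percent-encoded form from the query string. A sketch using the two patterns quoted in the discussion (the sample payloads are illustrative; real capsules carry more fields):

```python
import re

# Patterns from the discussion: one for plain JSON, one for payloads
# that are still percent-encoded as they arrived on the query string.
PLAIN = re.compile(r'"schema":"([^"]+)"')
ENCODED = re.compile(r'%22schema%22%3A%20%22([^%]+)%22')

def extract_schema(raw: str):
    """Best-effort schema name extraction from a raw event string."""
    for pattern in (PLAIN, ENCODED):
        m = pattern.search(raw)
        if m:
            return m.group(1)
    return None  # truncated payloads may have lost the schema entirely
```

The regex approach works even when the JSON is malformed or truncated mid-payload, which is exactly the case where `json.loads` on the raw event would fail; the trade-off, also noted below, is that a payload truncated before the schema field yields nothing either way.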
[22:50:19] mforns: https://gerrit.wikimedia.org/r/#/c/241984/8/files/logstash/filter-kafka-eventlogging.conf,unified [22:50:52] so event.message will be message, recvFrom goes into host and ERROR for the level [22:51:35] this is perfect [22:51:37] we don't really have the wiki value. it could be buried in the rawEvent but getting it out seems yucky [22:51:53] no, there are many schemas that do not have it [22:52:12] bd808: mforns, there are some errored events that we could probably parse more good info out of [22:52:20] especially if the problem is an invalid schema, and not invalid json [22:52:31] ok. I'll reapply and let you know when I'm ready for another batch of errors [22:52:32] schema would be particularly useful [22:52:36] we should make a ticket! [22:53:12] Analytics-Kanban: Investigate sample cube pageview_count vs unsampled log pageview count [5 pts] - https://phabricator.wikimedia.org/T108925#1691440 (Tbayer) >>! In T108925#1690491, @JAllemandou wrote: > Tilman: I confirm the discrepancy we observe comes from a difference in the computation of the pageview bo... [22:53:21] bd808, I was wrong, the wiki field is in the eventCapsule, hence in all schemas [22:53:34] ottomata, yes schema would be sweet [22:53:57] we can parse that out of event.rawEvent easily I think right? [22:54:15] there is some number and then the next value is the schema name? [22:54:21] bd808, if the error is schema validation (and not json parsing) yes [22:54:21] !log sending 1200k events to Event Logging now, to see how fast they're processed [22:55:08] mforns: totally can do it, we just need to add more fields to EventError [22:55:17] and then populate them during error handling if we can [22:56:10] bd808, ottomata, we can also regexp for the schema like '"schema":"([^\"]+)"' [22:56:26] yeahhhhhh, could. 
maybe worth trying, I guess
[22:56:45] we can parse the json too (assuming it's not malformed)
[22:56:50] aha
[22:56:58] just thought it would be faster
[22:57:02] yeah, depends, sometimes the data is truncated
[22:57:10] and regexing for schema would work in both cases
[22:57:12] so meh?
[22:57:14] either way!
[22:58:18] if the data is truncated, we should look for '%22schema%22%3A%20%22([^%]+)%22'
[22:58:27] though
[22:59:13] but, if the data is truncated, it may not contain the schema...
[22:59:21] mforns: hit me with some more errors. We will be able to see them at https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/eventlogging
[22:59:29] bd808, OK
[22:59:48] sent 7 more json parsing errors
[23:00:28] hmm. message isn't getting filled in
[23:00:31] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1691516 (Nuria) In the case of this schema: https://meta.wikimedia.org/wiki/Schema:Centra...
[23:00:42] and 7 more validation errors
[23:00:47] so that rename doesn't work. I'll do something else
[23:01:09] bd808, maybe it should be _message?
[23:01:14] instead of message?
[23:01:42] no, makes no sense
[23:02:47] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1691538 (Nuria) You can keep track of client-side length errors on dashboard: https://graf...
[23:04:47] I think it's more likely that renaming a nested element doesn't work
[23:05:19] bd808, the format in the rename clause is strange though, it has 4 elements
[23:05:53] does it actually work like this? ["event.message", "message", "recvFrom", "host"]
[23:05:59] well it worked for the host
[23:06:05] their syntax is wacky.
that's really a hash with 2 from => to pairs
[23:06:10] ok
[23:06:20] crappy DSL
[23:06:32] ready for another burst
[23:06:41] running
[23:07:22] hmmm... that didn't work either
[23:07:46] I'm going to have to resort to reading documentation :)
[23:07:58] hehe
[23:11:04] RECOVERY - Throughput of event logging events on graphite1001 is OK: OK: Less than 15.00% above the threshold [500.0]
[23:14:30] Analytics-EventLogging, Analytics-Kanban: {stag} EventLogging on Kafka - https://phabricator.wikimedia.org/T102225#1691563 (Ottomata) Dan and I just pushed Analytics schema events through this system. During our largest test, we posted around 9000 valid events per second through by using ab to hammer bit...
[23:14:41] a-team: https://phabricator.wikimedia.org/T102225#1691563
[23:14:42] COool!
[23:15:49] hey ottomata, madhuvishy, do we put our HTTP response headers into hadoop? https://phabricator.wikimedia.org/T113672 notes that we don't log API failures, but we do return a MediaWiki-API-Error: HTTP header
[23:15:51] : )
[23:16:20] :O
[23:16:27] spagewmf: only specific headers that are considered by varnish get through, like x-analytics
[23:16:43] but there'd have to be custom code to add a header that we don't already grab
[23:17:10] as a matter of standard, we've been adding whatever we need on the analytics side to x-analytics instead of adding more headers
[23:18:19] Analytics-Kanban: Investigate sample cube pageview_count vs unsampled log pageview count [5 pts] - https://phabricator.wikimedia.org/T108925#1691583 (Tbayer) >>! In T108925#1660700, @JAllemandou wrote: ... > Now some more interesting thing: spiders tagging. > > {F2624413} > > Here we can see a difference be...
[23:19:59] bd808, maybe we can try: rename => ["[event][message]", "message"], as done (inverted) here: https://groups.google.com/forum/#!topic/logstash-users/Fmqk3mK-2mY
[23:20:21] yup.
patch is almost ready
[23:21:03] the magic was documented at https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html#logstash-config-field-references
[23:21:29] aha
[23:22:05] Analytics, MediaWiki-API: api.log does not indicate errors and exceptions - https://phabricator.wikimedia.org/T113672#1691610 (Spage) If - we analyzed api.log in Hadoop - and implemented {T113817} - and the Hadoop webrequest table captured the custom `MediaWiki-API-Error: `//error-code// HTTP header, the...
[23:22:21] milimetric: thanks ^ I noted the possibility of doing this in the bug.
[23:23:51] mforns: can I get another round of errors?
[23:23:55] sure
[23:23:55] Oh java thingy, why u crash? hive (wmf_raw)> select x_analytics from webrequest WHERE year=2015 AND month=9 AND day=30 AND hour=12 LIMIT 10;
[23:23:59] FAILED: RuntimeException org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception nulljava.lang.NullPointerException
[23:24:33] mforns: w00t. much nicer
[23:24:46] oooh
[23:25:38] bd808, cool!
[23:26:17] I smell python here :) "u'14' is not of type u'integer'"
[23:26:43] hehehehe
[23:27:07] one question, can I filter by text within a field, for example rawEvent?
[23:27:14] this way I could filter by schema
[23:28:20] I retried that request in `use wmf;` and it printed a bunch of '-'. So I tried hive (wmf_raw)> select x_analytics from webrequest WHERE year=2015 AND month=9 AND day=30 AND hour=12 LIMIT 10;
[23:28:31] yeah.
if you want to isolate on the rawEvent field then your query would be rawEvent:"something"
[23:28:41] and got 'org.apache.hadoop.security.AccessControlException: Permission denied: user=spage, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x
[23:30:57] mforns: it uses Elasticsearch's "query string syntax" for searches -- https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
[23:31:06] * bd808__ is having bouncer issues today
[23:31:16] spagewmf: why wmf_raw?
[23:31:18] bd808__, thx
[23:31:48] spagewmf: are you in the analytics-privatedata-users group now?
[23:32:57] naw, you aren't so you won't be able to read this stuff
[23:32:57] ottomata: wmf_raw because I thought that might have the raw HTTP header (and it reminds me of WWF Raw wrestlemania 2007 :-) ). I'll check my groups. There's no /mnt/hdfs/user/spage directory
[23:33:10] spagewmf: wmf has everything raw has + more
[23:33:15] and is stored in a more efficient format
[23:33:17] you shouldn't need to use raw
[23:33:22] also...
[23:33:46] wmf.webrequest has wrong file permissions.
[23:33:55] you shouldn't be able to read it at all in the groups you are in right now :o
[23:34:48] `groups`(1) returns "wikidev statistics-privatedata-users", not analytics-xxx
[23:35:35] Analytics-Cluster: Fix file permissions for wmf.webrequest data - https://phabricator.wikimedia.org/T114327#1691644 (Ottomata) NEW a:Ottomata
[23:35:40] yeah, spagewmf!
[23:35:44] it is wrong on our side
[23:35:49] https://phabricator.wikimedia.org/T114327?workflow=create
[23:39:50] bd808__, I think to get the schema, it will be a bit complicated, no? I can only imagine a hacky way to get that with regexps from the rawEvent
[23:40:16] we could populate the schema field in EventLogging code, when producing to kafka
[23:41:06] ottomata: ah, I'm in the three-day waiting period for getting entrée into le analytics-privatedata-users club, T114150. So exciting!
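[Editor's note: the "regexps from the rawEvent" approach mforns mentions above could be sketched as below. This is a hypothetical illustration; the payload shapes and the helper name `extract_schema` are assumptions, not actual EventLogging code. It combines the two regex ideas from earlier in the log: plain `"schema":"..."` matching and the URL-encoded `%22schema%22%3A` case.]

```python
import re
from urllib.parse import unquote

# Matches "schema": "Name" in a plain-JSON rawEvent payload.
SCHEMA_RE = re.compile(r'"schema"\s*:\s*"([^"]+)"')

def extract_schema(raw_event):
    """Best-effort schema name extraction from a raw event string.

    Works on plain JSON (even malformed JSON, since it never parses),
    and falls back to URL-decoding for percent-encoded payloads.
    Returns None if truncation cut the payload off before the schema field.
    """
    match = SCHEMA_RE.search(raw_event)
    if match:
        return match.group(1)
    # Truncated/encoded case discussed above: decode %22schema%22%3A... first.
    match = SCHEMA_RE.search(unquote(raw_event))
    return match.group(1) if match else None
```

[As noted in the conversation, this works whether the JSON is valid or not, but still returns nothing if truncation removed the schema field itself.]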
[23:41:32] mforns: it would probably be nicest if you sent it in from the EL python code
[23:43:37] bd808__, yes, agree.
[23:45:06] ottomata, do you think we can deploy EL_EventError in logstash without the "schema" field? we could always filter by schema using: type:eventlogging AND rawEvent:*SchemaName*
[23:45:31] and create a new task for this, adding the schema from eventlogging code
[23:58:44] kevinator, yt?
[23:58:59] mforns: yes, here
[23:59:19] kevinator, do you have 5 mins for questions on the quarterly review presentation?
[23:59:37] yes... batcave?
[23:59:41] sure!
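[Editor's note: for reference, the Logstash rename form the conversation converged on would look roughly like the sketch below. This is illustrative only, not the actual filter-kafka-eventlogging.conf from the gerrit patch; the `add_field` line for the ERROR level is an assumption based on the 22:50:52 message.]

```
filter {
  mutate {
    # The "4 elements" form is really two from => to pairs.
    # "event.message" (dotted) does not address the nested field;
    # "[event][message]" (Logstash field-reference syntax) does.
    rename => ["[event][message]", "message", "recvFrom", "host"]
    # Assumed: tag these events with ERROR for the level.
    add_field => { "level" => "ERROR" }
  }
}
```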