[00:14:27] SMalyshev: I can try
[00:14:33] sup? :)
[00:14:38] madhuvishy: cool
[00:15:59] madhuvishy: so the question is as follows: suppose we have a dataset and I want to group it into batches (e.g. 1000 rows per batch) and process it in batches (e.g. send it out to another service)
[00:17:02] madhuvishy: what would be the best way to do such a thing? e.g. I can do data.foreach() and manually count rows etc. but I'm not sure it's the best way, and also the last batch may be less than 1000...
[00:17:43] I know there's take() but it only takes from the start... is there something like an iterated take()?
[00:17:45] SMalyshev: aah, I don't think I've done that in the past, all our spark jobs load entire partitions
[00:18:06] interesting, let me think
[00:19:28] I could also probably do count() and then partition by (count/1000), but that's probably too expensive
[00:21:17] SMalyshev: yeah, things like count are not lazy, so it will take a long time
[00:22:06] yeah that's why I wonder if I can do better
[00:26:38] SMalyshev: where does this 1000 number come from?
[00:27:10] as in, what is the motivation? do you just want to process smaller batches at a time, or do you need it to be a specific size every time?
[00:28:06] madhuvishy: it's not exactly 1000, just some big number. Basically, we'd be sending data about documents to Elasticsearch, and for each document it's 1-2 numbers, so we don't want to do an HTTP request for each document. Instead, we want to batch a number of updates (e.g. 1000) and use the ES bulk API to update all of them in one request
[00:28:39] is this the pageviews stuff?
[00:28:44] yes
[00:28:47] it's for both pageviews and page rank
[00:29:01] Partitions may be the right approach, although they are not evenly sized :(
[00:29:05] yeah
[00:29:09] that's what i was thinking
[00:29:15] but if you're looking specifically at pageviews
[00:29:18] right, both. And maybe more in the future, I guess. I want the script to be a generic "get tons of data from analytics and ship it to ES"
[00:29:21] which is why I asked if it needs to be an exact division
[00:29:27] the refined, ETL'd pageview table is a lot smaller and more structured than the raw webrequest data
[00:29:27] sooo
[00:29:34] like, the aggregates
[00:29:37] Ironholds: how big is the partition?
[00:29:59] SMalyshev, much like post-war europe, the size of the partition varies wildly
[00:30:00] Ironholds: this is much later in the pipeline. Something else will read in data and generate scores for page rank and page views. Those final scores are written out to something in hdfs
[00:30:07] ebernhardson, ahhh
[00:30:11] Ironholds: this script just reads the pre-computed data, and sends it to ES
[00:30:28] madhuvishy: no, it doesn't need to be exact. but if it's 100G of data, there's a chance it will choke memory or ES
[00:30:37] SMalyshev, oh it'll never be that much
[00:30:42] (unless that's hyperbolic)
[00:30:51] SMalyshev: Spark dataframes have a repartition function
[00:30:54] yeah I have no idea :)
[00:31:10] i think you can split it into an arbitrary number of smaller partitions
[00:31:16] madhuvishy: yeah I've seen it, but it takes how many partitions, right, not how many per partition?
[00:31:21] or did I miss an option?
[00:31:34] SMalyshev: yeah you don't know how many per partition
[00:31:39] because I have no idea how many pages there will be...
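A minimal PySpark sketch of the count-then-repartition idea being discussed here, assuming df is the DataFrame of (page_id, score) rows; send_bulk, the index name, and the ES URL are illustrative, not the actual script:

    import json
    import urllib2  # Python 2, matching the era; requests would also work

    def send_bulk(rows):
        # Build one Elasticsearch bulk request body for a list of (page_id, score) rows.
        lines = []
        for page_id, score in rows:
            lines.append(json.dumps({"update": {"_id": page_id}}))
            lines.append(json.dumps({"doc": {"score": score}}))
        body = "\n".join(lines) + "\n"
        urllib2.urlopen(urllib2.Request("http://localhost:9200/pages/page/_bulk", body)).read()

    def send_partition(rows):
        rows = list(rows)  # one partition becomes one bulk request
        if rows:
            send_bulk(rows)

    total = df.count()                   # not lazy: costs a full pass over the data
    num_parts = max(1, total // 1000)    # aim for roughly 1000 rows per partition
    df.rdd.map(tuple).repartition(num_parts).foreachPartition(send_partition)

The downsides, as noted in the chat, are the extra count() pass and the fact that repartition only evens partitions out approximately.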
[00:31:42] you'd have to experiment a bit
[00:31:45] on the question of existing partition size
[00:31:50] we can do a bit of experimenting as madhuvishy says
[00:31:58] like, SELECT COUNT(*) GROUP BY [all the partitions]
[00:32:07] over a day of data, work out the divide
[00:32:31] ok, I guess that can be a way to do it... how expensive is repartitioning?
[00:32:38] if we're looking at the ETL'd webrequests to start with you will probably be fine but might not be. If we're looking at the aggregated pageviews data you will definitely be fine.
[00:32:52] but I can just run a hive query and get you some baseline numbers if you throw a phab ticket into the analytics sprint ;p
[00:33:15] SMalyshev: it wouldn't be as even, but maybe foreachPartition? that allows you to run a process per partition, and within a partition you would manually batch 1000 and then send
[00:33:48] but the last request in each partition (however many) will be smaller
[00:34:01] ebernhardson: yeah that's what I'm looking for but I don't know how big the partition is... I guess yeah, if I say I want 10k partitions then they'd probably be around 1000
[00:34:21] Ironholds: is the data in hadoop partitioned by hour or sth?
[00:34:38] madhuvishy, hour and source
[00:34:42] well, ymdh and source
[00:34:55] that script won't be working on raw analytics data though. It would be working on the output of the aggregator script.
[00:35:09] It may also work to convert the data frame into an rdd and group by something like that
[00:35:15] and map across each group
[00:35:19] which would probably be just page_id:score pairs for all pages within the timeframe
[00:35:45] SMalyshev: aah grouping by page id won't work?
[00:36:01] I am not sure how many rows you'd have per page
[00:36:04] 1 :)
[00:36:10] aah
[00:36:11] madhuvishy: hm... That's an interesting idea... if I try to group by pageId%1000 it may work
[00:36:46] why %1000?
[00:36:49] though I'm concerned it's much more work than it should be. repartitioning may be cheaper
[00:37:14] SMalyshev: maybe - you can try them in the spark-shell and see how long they take
[00:37:45] yeah I'll try that probably and see. ok, I've got some good ideas, thanks!
[00:38:23] np! let me know how it goes
[03:43:03] (PS11) Nuria: Add pageview quality check to pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/240099 (https://phabricator.wikimedia.org/T109739) (owner: Joal)
[05:36:50] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1744752 (Smalyshev) @GWicke I would be interested to participate. I'll be in the office, could you add me to the invite?
[07:33:55] Analytics-Tech-community-metrics, DevRel-October-2015, Patch-For-Review: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1744824 (Luiscanasdiaz) @aklapper your changes were merged. I do think it is a good approach.
[07:36:28] Analytics-Tech-community-metrics, DevRel-October-2015, Patch-For-Review: Fine tune "Code Review overview" metrics page in Korma - https://phabricator.wikimedia.org/T97118#1744836 (Luiscanasdiaz) @aklapper not sure about the impact of these changes, I mean, everything works but makes a bit difficult the m...
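Back on the Spark thread: a sketch of ebernhardson's foreachPartition suggestion, which avoids the count() entirely by batching manually inside each partition (send_bulk is the same illustrative helper as in the earlier sketch; only the final batch per partition comes up short):

    from itertools import islice

    BATCH_SIZE = 1000

    def send_partition(rows):
        # Pull at most BATCH_SIZE rows at a time off the partition's iterator,
        # so only one batch is ever materialized in memory.
        while True:
            batch = list(islice(rows, BATCH_SIZE))
            if not batch:
                break
            send_bulk(batch)  # the last batch in each partition may be < BATCH_SIZE

    df.rdd.map(tuple).foreachPartition(send_partition)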
[08:01:34] Analytics-Wikistats: Discrepancies in historical total active editor numbers - https://phabricator.wikimedia.org/T87738#1744887 (Nemo_bis) Independent debugging could also be performed by checking whether any individual wiki exhibits such a fluctuation in the same month and whether there are significant fluc...
[08:06:09] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1744891 (mobrovac) >>! In T114443#1744752, @Smalyshev wrote: > @GWicke I would be interested to participate. I'll be in the office, could you add me to the invite?...
[09:08:14] Analytics-Backlog: Sanitize pageview_hourly - https://phabricator.wikimedia.org/T114675#1744996 (JAllemandou) Task breakdown: # Write an oozie coordinator to backfill sanitization from existing pageview_hourly to pageview_hourly_new (path TBD) # Backfill by hand: create pageview_hourly_new, launch and moni...
[09:50:37] Analytics-Kanban, RESTBase, Services, Patch-For-Review, RESTBase-API: configure RESTBase pageview proxy to Analytics' cluster {slug} [3 pts] - https://phabricator.wikimedia.org/T114830#1745043 (akosiaris)
[09:50:39] Analytics, Services, operations, Patch-For-Review: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1745040 (akosiaris) Open>Resolved a:akosiaris LVS for AQS is up and running. We had to migrate restbase on AQS to port 7232 to avoid conflicting with the services restbase inst...
[09:50:58] !log restart cassandra on aqs1003
[09:51:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[10:03:31] Analytics-Tech-community-metrics, DevRel-October-2015: Correct affiliation for code review contributors of the past 30 days - https://phabricator.wikimedia.org/T112527#1745058 (Aklapper) I pushed my first update. [[ http://korma.wmflabs.org/browser/scr-contributors.html | Data on korma ]] should get updat...
[10:11:05] morning a-team
[10:13:57] Hi mforns
[10:14:07] hello!
[10:16:57] mforns_: about backfilling EL, did you do it yesterday with Dan?
[10:17:51] (PS2) Joal: Correct camus-partition-checker to use hdfs conf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847
[10:31:00] joal, sorry didn't see your message. We tried, we deployed the new patch and created a separate eventlogging instance in dan's home folder
[10:31:23] ok
[10:31:27] np mforns :)
[10:31:38] joal, we created a file containing all events that were created during the outage
[10:31:43] right
[10:31:46] and piped it to the mysql consumer
[10:31:57] however, it didn't go as expected
[10:32:22] ouch :(
[10:32:26] we were slowly progressing but it was too late for me.. so I left
[10:32:37] k
[10:32:49] Thanks a lot for having taken care of that
[10:33:06] we were managing to get the events inserted, with some timezone issues...
[10:33:24] but we got an error created by a badly formatted event
[10:33:47] that crashed the process
[10:33:55] hm, how is that even possible, given we validated everything?
[10:34:01] we should probably fix that error first, and then try to backfill again
[10:34:20] mmm, it seemed like some parsing issue
[10:34:55] k
[10:34:57] the event text contained several single quotes, double quotes and slashes
[10:35:07] pfff
[10:35:14] hehehe
[10:35:18] classical, but uncool
[10:36:15] joal, I'm going to work on the browser report changes now, and in the afternoon, I'll take that up again with Dan
[10:36:59] (PS3) Joal: Update camus-partition-checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847
[10:37:09] ok mforns_
[10:37:15] Thanks a lot
[10:37:22] np
[10:44:01] (CR) Joal: "Tested with:" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 (owner: Joal)
[10:54:48] * joal is gone chopping some wood
[10:54:55] * joal will be back in a few hours
[10:56:51] Analytics, Services, operations: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1745136 (mobrovac)
[11:44:42] (PS1) Christopher Johnson (WMDE): adds graphite module [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/248022
[11:44:56] (PS4) Mforns: Add oozie job to compute browser usage reports [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504)
[11:45:28] (CR) Mforns: Add oozie job to compute browser usage reports (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504) (owner: Mforns)
[11:47:40] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] adds graphite module [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/248022 (owner: Christopher Johnson (WMDE))
[13:25:32] * joal is back !
[13:29:14] hayiIiii
[13:31:52] Hey ottomata :)
[14:03:57] ottomata: I have tested the CamusPartitionChecker: works fine :)
[14:05:09] I'll let you review / merge, I'll add puppet code and we can discuss how to deploy
[14:06:36] awesoome
[14:07:14] (PS1) Christopher Johnson (WMDE): adds bulk sparql query and output scripts removes total_views [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/248033
[14:09:50] ottomata: reading the camus puppet module
[14:10:11] ottomata: shouldn't the camus init.pp also depend on refinery being deployed?
[14:10:25] Maybe we can't enforce that since refinery is manually deployed?
[14:10:59] Like ensuring that /srv/deployment/analytics/refinery exists?
[14:11:16] joal: that would be nice, but i didn't want to introduce that dependency.
[14:11:21] kinda weird to depend from module to role
[14:11:26] so, i made the script a parameter instead
[14:11:43] with a default value, makes sense
[14:11:55] ah but
[14:12:01] And refinery is a role instead of a module because of being manually deployed?
[14:12:01] we could force the dependency here
[14:12:04] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/refinery.pp#L62
[14:12:09] well, actually, it is already being forced
[14:12:11] via the
[14:12:13] require role::analytics::refinery
[14:12:28] so, use of the camus module does not explicitly depend on refinery
[14:12:35] but our use of it does via the role::analytics::refinery::camus class
[14:12:49] joal: refinery is a role, but could be a module. maybe.
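On the crashed backfill above: whatever the root cause of the badly formatted event turns out to be, the consumer loop can be hardened so a single bad event is skipped and logged instead of killing the whole process. A minimal sketch of that idea only — the loop shape and insert_event are illustrative, not eventlogging's actual code:

    import logging

    def consume(events, insert_event):
        # Insert events one at a time; quarantine failures instead of dying.
        failed = []
        for event in events:
            try:
                insert_event(event)  # may raise on quoting/nesting it cannot handle
            except Exception:
                logging.exception("Could not insert event, skipping: %r", event)
                failed.append(event)  # keep bad events around for a later retry
        return failed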
[14:13:09] i generally like to keep very WMF specific things out of modules, unless it makes a lot of sense
[14:13:19] ottomata: I am just trying to get my head around the stuff, not change it :)
[14:13:25] aye
[14:13:33] in the case of a refinery module, the only thing i would put in the module would be the main role::analytics::refinery class
[14:13:41] the role dep, is it in the camus role?
[14:13:43] anything else in the refinery.pp role file, i think i would keep in a role
[14:13:49] there is no camus role.
[14:14:07] joal: generally, but not always, i think of modules as very pluggable libraries
[14:14:13] right
[14:14:14] and roles as usages of those libraries
[14:14:19] sometimes, modules must use other modules
[14:14:26] so the line is kinda blurry
[14:14:35] (PS2) Christopher Johnson (WMDE): adds bulk sparql query and output scripts removes total_views [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/248033
[14:15:39] ottomata: so the dependency on refinery is made in the role that also uses camus
[14:15:42] right?
[14:17:28] yes, um
[14:17:32] role ... refinery::camus
[14:17:33] does
[14:17:38] require role::analytics::refinery
[14:17:47] which ensures that that class is realized before it
[14:17:58] and that class includes
[14:17:59] package { 'analytics/refinery':
[14:17:59] provider => 'trebuchet',
[14:17:59] }
[14:23:39] ottomata: shall I go ahead and add a camus::checker class and use it in the refinery::camus role?
[14:23:53] camus::checker would be in the camus submodule
[14:24:19] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1745395 (Ottomata) COOL. As part of this discussion, I'd like us to think about not only fields that are relevant to edit event...
[14:27:28] HMMM joal, hm.
[14:27:34] i see the reason for your questions now.
[14:27:35] hm.
[14:27:47] so, camus checker is 100% in refinery, right?
[14:27:52] hm.
[14:28:25] camus checker needs refinery-job, a camus.properties file, hadoop conf (for HDFS), and hadoop + spark libs
[14:29:24] Oh, and java, obviously
[14:29:30] ottomata: --^
[14:30:53] joal: hm
[14:31:00] maybe add a wrapper in refinery/bin for your thing
[14:31:15] hmm
[14:31:15] no
[14:31:20] is this going to be launched by oozie?
[14:31:43] As we prefer, ottomata
[14:31:51] whatcha think?
[14:31:57] Since it's java, it is launchable via oozie
[14:32:09] right, but the reason to launch by oozie would be data based
[14:32:10] hm.
[14:32:12] but since it's very camus related, I would have bundled it with camus
[14:32:16] yeah
[14:32:16] hm
[14:32:18] holaaa, let me know if you need help backfilling eventlogging cc milimetric
[14:32:30] joal, maybe you should just incorporate this command into the existing bin/camus wrapper?
[14:32:34] as an option
[14:32:35] like
[14:32:56] morning nuria. we've gotta fix that bug. I just had some blood drawn, so I'm having a late start, gotta get some food :)
[14:33:08] hmmm
[14:33:11] but you don't have the date in there.
[14:33:23] ok, milimetric let's talk later, cause when backfilling from a file some escaping is needed
[14:33:24] that's going to be the hard part, how do you know which date to check?
[14:33:35] ottomata: by default the thing uses the last camus run in history
[14:33:41] hm
[14:33:42] ok
[14:33:48] so, yeah, then that does make sense
[14:33:56] you can pass it a date if you prefer, but with no date, it uses the last one
[14:34:00] all you need is the properties file then
[14:34:04] yessir
[14:34:06] so you can add it to the bin/camus script
[14:34:11] since that is also being passed the properties file
[14:34:48] k ottomata, will look into that
[14:35:07] --check-flag
[14:35:08] or something
[14:35:12] or just --check
[14:35:20] which then makes this job be launched after the camus one is done.
[14:36:07] hm, question: when you say done, you mean launched right, not finished?
[14:36:21] uh
[14:36:27] i mean finished, after the camus mr job is finished
[14:36:30] right?
[14:36:50] so, bin/camus script does
[14:36:53] sys.exit(os.system(command))
[14:36:58] instead, don't exit
[14:36:59] will look into that camus script
[14:37:03] just check the return val
[14:37:11] (maybe? not sure what the ret val of the hadoop job will be)
[14:37:16] but, just don't exit
[14:37:21] then if --check was given
[14:37:28] run your job after os.system returns
[14:37:45] if you think it's worthwhile you can augment the script to use something other than os.system
[14:37:52] subprocess or whatever
[14:37:57] up to you
[14:38:21] ottomata: got it
[14:39:41] thx ottomata
[14:43:14] nuria: ok, I'm about to try to fix that bug
[14:43:31] first I'll try to repro on an04 and I'll fiddle with the python there until I figure it out
[14:44:24] milimetric: are you sure there is a bug?
[14:44:34] cause backfilling from a file is not the same
[14:44:37] as a stream
[14:44:39] oops, sorry, I forgot I didn't email, it's on phab: https://phabricator.wikimedia.org/T116241
[14:45:08] it's that weird schema with the array in it
[14:45:17] I thought it was just failing but it actually kills the consumer completely
[14:45:25] so we can't insert the other events
[14:45:52] the other bug is that the change somehow didn't work, it wasn't sleeping and the memory use grew to like 15GB before we killed it
[14:45:56] gotta look into that too, but the bug first
[14:46:03] nothin's ever easy :)
[14:46:05] milimetric: every event or just a particular one of that schema?
[14:46:25] it looks like just that particular one, but we have to protect the consumer from whatever's happening there
[14:46:34] it's getting past validation somehow, that's the weird part
[14:47:14] milimetric: you know, we should be able to test with a unit test with the event in question
[14:52:57] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1745447 (jcrespo) I am rolling the schema change right now: ``` MariaDB EVENTLOGGING m4 localhost log > SHOW CREATE TABLE MobileWebWatching_11761466\G *************************** 1. ro...
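Putting ottomata's bin/camus suggestion together, the change would be shaped roughly like this (a sketch: camus_command, checker_command and the option handling are illustrative, only the overall flow comes from the discussion above):

    import os
    import sys

    # Run the Camus MapReduce job and wait for it, instead of
    # sys.exit(os.system(command)) right away.
    ret = os.system(camus_command)

    if ret == 0 and options.check:
        # Camus finished OK, so run the CamusPartitionChecker with the same
        # properties file; with no explicit date it checks the last camus run.
        ret = os.system(checker_command)

    sys.exit(ret)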
[14:53:59] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[14:55:50] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[15:03:35] Analytics-Kanban: Backfill mobile and upload for oct 20th [3 pts] {hawk} - https://phabricator.wikimedia.org/T116283#1745474 (JAllemandou) NEW a:JAllemandou
[15:07:33] Analytics-Tech-community-metrics: Empty "subject" and "creator" fields for mailing list thread on mls.html - https://phabricator.wikimedia.org/T116284#1745489 (Aklapper) NEW
[15:07:48] (PS1) Joal: Update oozie diagram to reflect current status [analytics/refinery] - https://gerrit.wikimedia.org/r/248041
[15:08:12] Analytics-Tech-community-metrics: Review/update mailing list repositories in korma - https://phabricator.wikimedia.org/T116285#1745496 (Aklapper) NEW
[15:12:35] Analytics-Engineering, Wikidata: Dashboard repository for limn-wikidata-data - https://phabricator.wikimedia.org/T112506#1745505 (JanZerebecki)
[15:13:28] ottomata: could you merge this: https://gerrit.wikimedia.org/r/#/c/248045/
[15:13:33] it'll let us keep going with backfilling
[15:17:39] milimetric: cool
[15:17:42] edit the comment there
[15:17:49] sorry actually, no rush, I can just grep -v for now to backfill
[15:17:51] the commit is good, but it would be nice to have the same info in the comment
[15:18:00] oh good point
[15:18:12] Analytics-Tech-community-metrics, DevRel-November-2015: Legend for "review time for reviewers" and other strings on repository.html - https://phabricator.wikimedia.org/T103469#1745512 (Aklapper)
[15:18:42] (PS2) Joal: Update oozie diagram to reflect current status [analytics/refinery] - https://gerrit.wikimedia.org/r/248041 (https://phabricator.wikimedia.org/T115993)
[15:19:00] milimetric, "Exclude CentralNoticeBannerHistory from mysql" :]
[15:19:22] are you planning on retrying backfilling?
[15:19:53] mforns: yep, nuria and I are in the cave, we're gonna start to try now
[15:19:58] (PS1) Joal: Update bin/camus to include CamusPartitionChecker [analytics/refinery] - https://gerrit.wikimedia.org/r/248048 (https://phabricator.wikimedia.org/T113252)
[15:20:01] oh ok
[15:20:07] may I join?
[15:20:10] no!
[15:20:13] of course
[15:20:14] xD
[15:20:14] mforns: those events were not making it into the db
[15:20:15] :)
[15:20:16] yessir :D
[15:21:34] Analytics-Kanban, Patch-For-Review: EventLogging mysql consumer can be killed by a bad event with a nested array - https://phabricator.wikimedia.org/T116241#1745517 (Nuria)
[15:22:07] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1745519 (jcrespo) It will take a day to convert MobileWebClickTracking_5929948 (>300GB). If it fails, we will try with regular tokudb online table. Writes can continue without problem w...
[15:23:21] Analytics-Kanban, Patch-For-Review: EventLogging mysql consumer cannot insert events that have a nested json schema that includes a plain "array" - https://phabricator.wikimedia.org/T116241#1744214 (Nuria)
[15:37:28] Analytics-Tech-community-metrics, DevRel-November-2015: Legend for "review time for reviewers" and other strings on repository.html - https://phabricator.wikimedia.org/T103469#1745559 (Aklapper) `review_time_pending_ReviewsWaitingForReviewer_days_median` provides "Review Time for reviewers (days, median)"...
[15:38:22] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [30.0]
[15:40:12] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[15:52:34] Analytics-Tech-community-metrics, DevRel-October-2015, Patch-For-Review: Fine tune "Code Review overview" metrics page in Korma - https://phabricator.wikimedia.org/T97118#1745607 (Aklapper) Merged. Thanks! I'm going to close this task once it's live on korma.
[16:01:52] ottomata, madhuvishy standuppp?
[16:03:11] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [30.0]
[16:03:21] Analytics-Cluster, Analytics-Kanban: Use Burrow for Kafka Consumer offset lag monitoring - https://phabricator.wikimedia.org/T115669#1745630 (Ottomata) I was able to successfully create a Burrow .deb for Jessie today. It is a little hacky and needs some work, but the idea should be fine.
[16:05:45] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-10-27_(1.27.0-wmf.4): Add the schema name to the EL EventError topic [8 pts] - https://phabricator.wikimedia.org/T115121#1745634 (Nuria) Open>Resolved
[16:06:05] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Decommission remaining old Hadoop Workers {hawk} - https://phabricator.wikimedia.org/T112113#1745637 (Nuria) Open>Resolved
[16:06:07] Analytics-Cluster, Analytics-Kanban: {mule} Hadoop Cluster Expansion - https://phabricator.wikimedia.org/T99952#1745638 (Nuria)
[16:07:00] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[16:07:42] Analytics-Backlog, Analytics-Cluster: Implement better Webrequest load monitoring {hawk} - https://phabricator.wikimedia.org/T109192#1745641 (Nuria)
[16:07:43] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Improve daily webrequest partition report {hawk} [5 pts] - https://phabricator.wikimedia.org/T113255#1745640 (Nuria) Open>Resolved
[16:12:21] Analytics-Kanban: Create deb package for Burrow - https://phabricator.wikimedia.org/T116084#1745657 (madhuvishy)
[16:57:22] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: Procure hardware for future druid cluster - https://phabricator.wikimedia.org/T116293#1745790 (Nuria) NEW
[17:02:47] nuria: \o/
[17:07:54] Analytics-Tech-community-metrics, DevRel-October-2015: Automated generation of (Git) repositories for Korma - https://phabricator.wikimedia.org/T110678#1745829 (Aklapper) >>! In T110678#1734742, @Dicortazar wrote: > @Aklapper, how can we automatically retrieve the list of Git repositories available from s...
[17:09:52] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: Procure hardware for future druid cluster - https://phabricator.wikimedia.org/T116293#1745833 (kevinator) This will be similar to T100442
[17:15:54] joal, you want to go to https://plus.google.com/hangouts/_/wikimedia.org/a-batcave-2?
[17:15:58] for oozie changes?
[17:16:01] sure mforns !
[17:18:10] nuria: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/eventlogging.pp#L227
[17:18:45] madhuvishy: right, but what machine is m4?
[17:18:50] madhuvishy: isn't that an alias?
[17:19:00] madhuvishy: is it db1046.eqiad.wmnet?
[17:19:09] oh - hmmm not sure
[17:20:52] sorry a-team, got disconnected faster than expected :)
[17:27:34] (PS12) Joal: Add pageview quality check to pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/240099 (https://phabricator.wikimedia.org/T109739)
[17:38:17] nuria: I pushed the patch for the whitelist, I'll let you confirm and merge :)
[17:38:34] Joal: looking now
[17:38:47] if you look at patch 11, you'll see the diff :)
[17:38:51] nuria: --^
[17:39:01] joal: what did you do
[17:39:10] git-wise that is
[17:41:34] joal: but your change does not have dan's changes, does it?
[17:41:38] oh, sorry nuria
[17:41:50] it does actually: it is onto it :)
[17:42:15] if you look at the entire workflow file, it contains the transform part
[17:42:38] So what I did: picked the commit using the command given by gerrit
[17:42:57] checked out a new branch to make that safer than master
[17:43:17] joal: i see, yes, it has dan's workflow
[17:43:19] actually before picking the commit, I did fetch all and pull
[17:43:36] on master, to make sure I had the latest master
[17:43:45] Then picked up the commit, then rebased on master
[17:43:56] ottomata: i made an initial patch for the module
[17:44:18] 3 conflicts to resolve, so updated the 3 files, then added them, then rebase --continue
[17:44:24] finally review, done :)
[17:44:28] nuria: --^
[17:44:31] joal: k, merging, my apologies again
[17:44:32] makes sense?
[17:44:40] nuria: not to worry :)
[17:44:56] (CR) Nuria: [C: 2 V: 2] Add pageview quality check to pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/240099 (https://phabricator.wikimedia.org/T109739) (owner: Joal)
[17:45:28] nuria: before I leave, talk about the cassandra java code?
[17:45:48] joal: sure
[17:46:00] cave?
[17:46:06] or is it full?
[17:46:27] nuria: https://gerrit.wikimedia.org/r/#/c/247758/3
[17:46:31] looks like it is already on by default
[17:46:46] madhuvishy: cool!
[17:47:11] ottomata: just saw that, great
[17:47:18] ottomata: still have to add the role
[17:47:25] ottomata: abandoning patch
[17:52:52] nuria: i'm gonna eat lunch and take a moment to breathe. my computer crashed but it's ok now
[17:53:14] milimetric: mystery solved with database
[17:55:09] milimetric: let's sync when you are back
[18:27:34] ok, nuria, so the db spike is explained?
[18:27:50] (we can talk in the cave for a minute, but I have a meeting in 5
[18:27:53] milimetric: yes, mforns do you want to explain?
[18:28:10] milimetric, jaime crespo is working on the bucketization of the editCount fields
[18:28:12] milimetric: the dba is bucketing EL editing data
[18:28:22] so executing scripts
[18:28:24] o!
[18:28:30] milimetric: so no backfilling until that is finished
[18:28:32] i know
[18:28:33] makes sense
[18:28:36] will he ping us?
[18:28:36] and this will take at least until tomorrow
[18:28:42] i think our stuff is good
[18:28:51] it was just slow for this reason
[18:29:10] I spoke with him about this this morning, but I didn't think about that when we were discussing, sorry!
[18:29:18] 'sok
[18:29:45] so the last thing to check will be what queue size to use to keep memory usage reasonable during backfilling
[18:29:54] that'll probably take a few tries, but we'll do it when jaime's done
[18:30:05] thx all, i'll put the related tasks on pause for now then.
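For the record, joal's gerrit recipe above maps onto roughly these commands (the ref is whatever gerrit's download box shows for the change and patchset; the branch name here is a placeholder):

    git fetch --all && git pull                  # make sure master is current first
    git fetch origin refs/changes/99/240099/11   # the "download" ref gerrit shows
    git checkout -b quality-check FETCH_HEAD     # work on a branch, safer than master
    git rebase master                            # 3 conflicted files in this case
    # fix each conflicted file, then:
    git add <files>
    git rebase --continue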
[18:30:58] ok, makes sense
[19:01:37] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, iOS-5-app-production: Puppetize Piwik to prepare for production deployment - https://phabricator.wikimedia.org/T103577#1746119 (JMinor)
[19:04:08] ottomata: are you still in the meeting?
[19:04:32] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1746131 (Ottomata)
[19:06:27] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1746147 (Ottomata) etherpad from today's meeting: https://etherpad.wikimedia.org/p/eventbus-events
[19:06:35] madhuvishy: no, done.
[19:06:50] ottomata: coolll. i pushed changes, and added a role class
[19:07:16] need to define what consumer groups to listen to, and not sure if i should hardcode them
[19:07:28] can you make it just do all?
[19:07:34] i guess that'd be annoying :)
[19:07:42] madhuvishy: hiera :)
[19:07:52] ottomata: oh yeah, hiera
[19:08:16] a-team: did we decide what we're gonna do for a place to stay in January?
[19:08:36] mmmm
[19:08:49] ottomata: okay, do i have to do something different for labs?
[19:09:11] the other configs i'm importing from kafka::config and i assume it'll already do the right thing
[19:09:25] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, iOS-5-app-production: Support Pywick in production - https://phabricator.wikimedia.org/T116308#1746155 (JMinor) NEW
[19:10:03] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog: Stand up piwik in a permanent and privacy-sensitive way - https://phabricator.wikimedia.org/T98058#1746166 (JMinor)
[19:10:05] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, iOS-5-app-production: Support Pywick in production - https://phabricator.wikimedia.org/T116308#1746165 (JMinor)
[19:10:41] milimetric: i thought just hotel
[19:10:43] i thought we had to
[19:10:57] madhuvishy: ja that is the right thing
[19:12:32] joal: nice oozie chart! :)
[19:13:18] ottomata: ya okay, and if i do hiera('role::analytics::burrow::blah::consumer_groups') it'll get it from the right place - so i don't have to define anything separate for labs?
[19:14:20] no, the proper way (i think) is to just set the variable name on the module.
[19:14:28] that's actually how the kafka stuff should work too
[19:14:33] buuuuut, kafka predates hiera :)
[19:14:48] module params are automatically set from hiera
[19:15:06] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Production achine to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746196 (JMinor) NEW
[19:15:10] ottomata: uhhh
[19:15:28] looking...
[19:15:39] so madhuvishy
[19:15:41] in the role
[19:15:44] if you do
[19:15:44] include burrow
[19:15:49] (or class { 'burrow': })
[19:15:56] and then in hiera, do
[19:16:05] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746205 (JMinor)
[19:16:05] burrow::consumer_groups = ...
[19:16:07] or
[19:16:08] sorry, it's yaml
[19:16:09] so
[19:16:16] "burrow::consumer_groups": ...
[19:16:18] in hiera
[19:16:30] and where does the yaml file go?
[19:16:33] then puppet will automatically pass the hiera value to the module
[19:16:37] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746196 (JMinor)
[19:16:44] ottomata: aah
[19:16:53] so i don't have to pass it from the role?
[19:17:24] well, hm. i think not. I *think* it will auto fill them but i'm not sure what happens if you have other params set too... i think it will do the right thing
[19:17:26] yes
[19:17:34] i think yes is the answer to your question :)
[19:17:54] so in both the module and role, i don't have to say consumer_groups = hiera('something')
[19:18:02] if i just define the right thing
[19:18:07] it'll automatically show up
[19:18:18] right.
[19:18:23] mm hmm
[19:18:52] ottomata: and where does the yaml go? role/common/analytics/burrow.yaml?
[19:18:59] or just role/common/burrow.yaml
[19:19:25] ha, UM
[19:19:31] i *think* the former
[19:19:49] since your role class is named role::analytics::burrow
[19:20:00] ottomata: hmmmm
[19:20:16] okay i trust you :)
[19:20:20] haha, i mean
[19:20:28] i'd say i'm 80% sure about that :)
[19:20:35] :D okay
[19:20:52] do you have a package i can test with?
[19:21:19] no hurry, just asking if you already made it
[19:24:59] Analytics-Backlog, Analytics-Kanban: Projections of cost and scaling for pageview API. {hawk} [8 pts] - https://phabricator.wikimedia.org/T116097#1746226 (Nuria) p:Normal>High
[19:28:08] ottomata: ^^
[19:29:17] what hiera?
[19:29:21] oh
[19:29:23] burrow
[19:29:27] uhhhh, i have one, but it doesn't have systemd yet
[19:29:29] i'm working on it now
[19:29:36] ottomata: cool!
[19:29:38] also also
[19:29:41] the one i have just installs stuff and the binary
[19:29:42] you want?
[19:30:03] ottomata: nah i'll wait - i have to eat lunch anyway
[19:30:12] i see that in eventlogging
[19:30:29] the consumer groups themselves are defined in hiera
[19:31:00] https://www.irccloud.com/pastebin/rBmfvEBC/
[19:31:14] should i just put eventlogging-00 in burrow.yaml?
[19:31:41] i do not know how to make hiera lookup from hiera
[19:31:52] HMMM
[19:32:03] well, that's just the default if there is none specified
[19:32:07] yeah
[19:32:17] so maybe we need to put that into hiera more explicitly as an object and look it up that way
[19:32:28] not sure.
[19:32:35] might just need hardcoding
[19:32:35] hm.
[19:32:42] ottomata: ya but that'd mean making burrow look up each of them
[19:32:47] yeah
[19:32:50] ideally it shouldn't care about EL, no
[19:33:00] or webrequest or anything we add
[19:33:01] ya, would be nice if you could just look up all consumer groups
[19:33:03] hm
[19:33:21] madhuvishy: this might be possible to DRY, but it might be kinda hard and messy
[19:33:23] I'd be okay if it was like hiera('el-consumer-groups')
[19:33:26] i'd put them into burrow.yaml for now
[19:33:51] yeah
[19:33:52] well
[19:34:05] hm
[19:34:11] we'd have to be able to look them up by name for eventlogging
[19:34:14] so, either it's unDRY
[19:34:16] or
[19:34:18] it'd be something like
[19:34:38] values(hiera('el_consumer_groups'))
[19:34:40] and
[19:35:07] el_consumer_groups:
[19:35:07]   eventlogging_processor: eventlogging-00
[19:35:07]   eventlogging_mysql_consumer: eventlogging-mysql-00
[19:35:08]   etc.
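So the un-DRY version ottomata lands on would look something like this in hiera (a sketch; the file path is his "80% sure" guess from above, and the group names come from the defaults just pasted):

    # hieradata/role/common/analytics/burrow.yaml
    burrow::consumer_groups:
      - eventlogging-00
      - eventlogging-mysql-00

With the role doing include burrow, puppet binds this array to the module's consumer_groups parameter automatically, so no explicit hiera() call is needed.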
[19:35:16] right
[19:35:23] and will have to change el puppet too
[19:35:27] yeah
[19:35:30] if you can figure that out, i'm all for it, but it will be tricky
[19:35:33] hmmmm
[19:35:40] so, for now i'd say just put them in an array in burrow.yaml
[19:35:45] i wanna start with that
[19:35:46] manually list the consumer groups you want to monitor
[19:35:46] ya
[19:35:48] maybe we can refactor later
[19:35:54] okay
[19:36:00] where is the el yaml file
[19:36:48] ottomata: i can't find it in puppet
[19:36:52] the values are just in eqiad.yaml
[19:36:57] oh
[19:37:03] but, the consumer names aren't specified there
[19:37:11] since they aren't overridden from the default values
[19:38:12] ottomata: so even though we say hiera('....'), it's not really doing anything atm?
[19:39:57] right, it allows you to change the consumer group via hiera if you want to
[19:40:00] but provides a default
[19:40:12] that way you don't have to set that value in labs hiera or other environments every time
[19:41:03] ottomata: yeah alright
[19:41:41] (PS5) Mforns: Add oozie job to compute browser usage reports [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504)
[19:42:45] ottomata: does camus not belong to a consumer group? at least it looks like we don't define one explicitly - should we monitor this?
[19:42:54] sorry i have way too many questions
[19:43:20] it does
[19:43:51] https://github.com/wikimedia/operations-puppet/blob/production/modules/camus/templates/webrequest.erb#L76
[19:44:01] OH
[19:44:04] madhuvishy: it does not.
[19:44:05] i mean, it does.
[19:44:09] but, camus does not commit offsets to kafka
[19:44:15] ottomata: aah
[19:44:17] it saves them in hdfs
[19:44:26] so can't monitor using burrow
[19:44:29] right
[19:44:30] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [30.0]
[19:44:34] GAH
[19:44:37] WHY!?
[19:44:53] the false alarm?
[19:45:20] mforns' theory is that graphite updates the two metrics independently with a small delay
[19:45:32] and icinga picks it up during that gap and complains
[19:48:10] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[19:54:26] yeah but there is an averaged window for this thing
[19:54:39] ottomata: oh
[19:57:36] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746317 (Legoktm) Do you mean piwik? https://github.com/sachdevs/pyWick seems unrelated to analytic...
[19:58:42] madhuvishy: is there a reason for /etc/burrow/config/...
[19:58:45] and not just /etc/burrow/
[19:58:45] ?
[19:58:52] ottomata: no
[19:59:13] no reason
[19:59:15] i can change it
[20:00:46] k
[20:00:48] ja just /etc/burrow
[20:00:49] cool
[20:01:26] brb lunch
[20:02:17] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1746338 (GWicke) Some notes from the meeting: ## Framing, for all events - **uri**: string; path or url. Example: /en.wikipedia...
[20:16:14] back
[20:19:59] ottomata: could you do the ansible deploy please?
[20:20:08] ansible-playbook --check -i production -e target=aqs roles/restbase/deploy.yml
[20:20:10] and then
[20:20:14] ansible-playbook -i production -e target=aqs roles/restbase/deploy.yml
[20:22:58] doing
[20:24:28] --check went ok
[20:24:33] doing it for real
[20:24:41] !log deploying aqs
[20:24:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[20:28:26] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport piwick analytics - https://phabricator.wikimedia.org/T116312#1746422 (JMinor)
[20:29:02] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1746426 (Milimetric)
[20:29:06] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1746427 (JMinor) Yes, sorry http://piwik.org/ Corrected spelling...
[20:29:23] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, iOS-5-app-production: Support Piwick in production - https://phabricator.wikimedia.org/T116308#1746428 (JMinor)
[20:29:40] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, iOS-5-app-production: Support Piwik in production - https://phabricator.wikimedia.org/T116308#1746431 (BGerstle-WMF)
[20:30:31] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, iOS-5-app-production: Support Piwik in production - https://phabricator.wikimedia.org/T116308#1746432 (JMinor)
[20:31:06] did it get stuck ottomata?
[20:31:12] I see the new code on 1001 but not 1002
[20:32:07] new table creation typically takes a little longer, but this is only one table
[20:34:36] Ironholds: will CR your stuff today
[20:35:27] milimetric: sorry, started looking at something else
[20:35:32] the table's created gwicke
[20:35:36] yes it got stuck
[20:35:45] 1001 failed the port check
[20:35:51] failed port check, hm...
[20:35:52] so it stopped
[20:36:02] TASK: [check port 7231] *******************************************************
[20:36:03] failed: [aqs1001.eqiad.wmnet] => {"elapsed": 180, "failed": true}
[20:36:03] msg: Timeout when waiting for aqs1001.eqiad.wmnet:7231
[20:36:03] FATAL: all hosts have already failed -- aborting
[20:36:06] should I try again?
[20:36:21] I'd restart 1001 first
[20:36:35] and make sure it's okay
[20:36:40] it's running
[20:36:51] kk
[20:37:06] try again?
[20:37:16] i think maybe it just took too long to come up for ansible?
[20:37:17] deploy will skip the restart if the code is up to date already; that's the reason why I normally restart explicitly before continuing
[20:37:29] if all is well, just continue
[20:37:36] it is a new process since I ran that
[20:37:38] so i think it restarted
[20:37:41] will continue
[20:39:23] (code got updated on 1002 now)
[20:39:35] (table's there no matter which cassandra I query)
[20:39:51] select * from "local_group_default_T_pageviews_per_article_flat"."data";
[20:40:08] yeah, schema changes are always cluster-wide
[20:40:26] which is why all the schema changes we support are backwards-compatible
[20:40:42] right, just making sure one of the nodes wasn't dropped out of the cluster or something weird
[20:43:28] ottomata: it got stuck again on 1002 I'm guessing?
[20:45:14] yes
[20:45:40] milimetric: Active: active (running) since Thu 2015-10-22 20:38:14 UTC; 7min ago
[20:45:45] i think the timeout isn't long enough maybe
[20:45:51] for the ansible check
[20:45:52] gwicke: ?
[20:45:53] hey a-team, going to sign off for today, see you tomorrow!
[20:45:59] doing it again
[20:46:01] see ya marcel
[20:46:05] ciao
[20:46:05] thx ottomata
[20:46:08] bye
[20:46:11] bye mforns!
[20:46:18] bye!
[20:50:35] same deal milimetric, failed check on 1003
[20:50:37] but looks good now
[20:51:21] thx ottomata, looks good to me
[20:51:37] joal: if you're still around, the new table's deployed and the new code is there if you want to mess with it
[20:51:41] thx all
[20:51:50] the timeout is two minutes IIRC, so something seems to be slow about startup
[20:53:26] hm
[20:53:41] ¯\_(ツ)_/¯
[20:53:52] hehe
[20:53:53] (madhuvishy i just used the alias!! :D)
[20:54:05] * gwicke copies new ascii art
[20:54:20] ottomata: what alias?
[20:54:46] shrug
[20:54:54] aaah
[20:54:56] didn't you give me that one?
[20:55:01] no
[20:55:04] oh
[20:55:07] thought it was you, noooO?
[20:55:08] oh well
[20:55:11] nooo
[20:55:46] Analytics-Tech-community-metrics, Developer-Relations, DevRel-October-2015: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1746532 (saper) Wouldn't whitelisting of commits to only those who are listed in LDAP a...
[20:55:53] ottomata: i added comments etc and pushed
[20:55:58] now gotta test
[21:03:29] cool, package getting close madhuvishy !
[21:03:29] :)
[21:03:37] yay
[21:07:10] I'm off for tonight, ping me if you need me
[21:07:33] bye Dan!
[21:20:54] madhuvishy: https://gerrit.wikimedia.org/r/#/c/248245/
[21:21:46] ottomata: cool! I don't know how to use this patch to install the package
[21:22:15] also, you can link this if you want - https://phabricator.wikimedia.org/T116084
[21:22:17] madhuvishy: aye
[21:22:19] oh yeah
[21:22:42] madhuvishy: where are you testing?
[21:22:52] i will put the .deb there for you
[21:22:58] labs - in kafka-jessie01
[21:23:03] perfect
[21:23:08] me too :)
[21:23:12] ha ha
[21:23:19] cool
[21:23:21] i installed it there already
[21:23:30] i've edited files in /etc/burrow/
[21:23:34] should work
[21:23:39] with the same stuff you were testing i think
[21:23:41] it is running
[21:23:45] sudo service burrow status
[21:24:07] feel free to stop and edit and change whatever you need
[21:24:11] and/or apply puppetization
[21:24:25] ottomata: it says inactive now
[21:24:30] oh i stopped it
[21:24:31] :p
[21:24:32] doh
[21:24:34] try it
[21:24:37] sudo service burrow start
[21:24:42] okay :)
[21:24:50] nice
[21:24:51] thanks
[21:25:08] ottomata: curl localhost:8000/v2/kafka/local/
[21:25:58] aye cool!
[21:26:10] pretty sweet
[21:26:13] :)
[21:26:26] it's too bad the kafka folks didn't use etcd instead of zk
[21:26:27] curl localhost:8000/v2/kafka/local/topic/test
[21:26:33] cause then we'd have all that http interface by default
[21:26:35] no consumer groups there so can't check those
[21:26:40] yeah
[21:26:45] aye
[21:26:49] OH COOL
[21:26:50] also have you decided where this is gonna run
[21:26:54] it shows partition offsets of the topic?
[21:26:58] yess
[21:27:00] nice
[21:27:06] like the head of the topic partitions
[21:27:07] cool
[21:27:11] need to check that 8000 is free
[21:27:17] wherever we wanna run it
[21:27:18] oh, that should be parameterized too
[21:27:22] in your puppet stuff ja?
[21:27:24] i didn't make it configurable
[21:27:28] ok i can do that
[21:27:28] you should! :)
[21:27:30] default 8000
[21:27:31] but ja
[21:27:51] cool, i'll make that change too
[21:29:18] thanks ottomata :)
[21:29:21] ok madhuvishy i'm out for the day!
[21:29:22] laterrrs
[21:29:24] byeee
[23:12:13] (CR) Nuria: Functions for identifying search engines as referers. (8 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes)
[23:42:25] (CR) Nuria: Add oozie job to compute browser usage reports (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504) (owner: Mforns)
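As a footnote to the Burrow testing above, the two endpoints exercised with curl are easy to poll from Python as well (stdlib only; the consumer-status path at the end follows the same /v2/kafka/<cluster>/... pattern but is an assumption here, not something shown in the session):

    import json
    import urllib2

    BURROW = "http://localhost:8000/v2/kafka/local"

    def get(path):
        # Burrow answers plain JSON over HTTP.
        return json.load(urllib2.urlopen(BURROW + path))

    print(get("/"))              # cluster info, as in the first curl
    print(get("/topic/test"))    # head offsets of the topic's partitions
    # assumed endpoint, same pattern: status/lag for a consumer group
    # print(get("/consumer/eventlogging-00/status"))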