[06:14:27] Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, Operations, Ops-Access-Requests: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2317094 (Joe) [06:31:34] Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, Operations, and 2 others: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2317203 (Joe) Thanks @madhuvishy I'll amend the ticket title. [06:32:11] Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, Operations, and 2 others: Add kartik to researchers group - https://phabricator.wikimedia.org/T135704#2317204 (Joe) [06:37:13] !log Set kafka retention.ms=172800000 for the topic webrequest_upload to free some disk space on kafka1022 [06:37:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [06:37:37] coffee and kafka disk space alert, what a great start of the week :P [06:42:27] !log Removed Kafka temp. override for webrequest_upload retention.ms after freeing some disk space. [06:42:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [06:44:53] Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, Operations, and 2 others: Add kartik to analytics-privatedata-users - https://phabricator.wikimedia.org/T135704#2317227 (Joe) [06:46:18] Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, Operations, and 2 others: Add kartik to analytics-privatedata-users - https://phabricator.wikimedia.org/T135704#2307853 (Joe) Speaking with @madhuvishy, she realized that we were wrong: @Amire80 has permissions to q... 
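A side note on the retention override !logged above: 172800000 ms is 48 hours, and per-topic retention can be set and later removed from the Kafka CLI. A minimal sketch, assuming a Kafka 0.9-era `kafka-configs.sh` and a `$ZOOKEEPER` placeholder (the exact tool and flags vary by Kafka version, so the cluster commands are left commented out):

```shell
# Sanity check: the override of 172800000 ms equals 48 hours of retention.
RETENTION_MS=172800000
echo $(( RETENTION_MS / 1000 / 60 / 60 ))   # prints 48

# Hypothetical per-topic override to free disk space (not run here):
# kafka-configs.sh --zookeeper "$ZOOKEEPER" --entity-type topics \
#     --entity-name webrequest_upload --alter --add-config "retention.ms=$RETENTION_MS"
#
# ...and once the brokers have purged enough segments, drop the temporary override:
# kafka-configs.sh --zookeeper "$ZOOKEEPER" --entity-type topics \
#     --entity-name webrequest_upload --alter --delete-config retention.ms
```

Kafka deletes whole log segments rather than individual messages, so the disk actually frees up a little after the override is applied.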
[06:54:09] Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, Operations, and 2 others: Add kartik to analytics-privatedata-users - https://phabricator.wikimedia.org/T135704#2317273 (Joe) Open>Resolved [09:12:48] Analytics-Kanban, Datasets-General-or-Unknown, WMDE-Analytics-Engineering: Fix permissions on dumps.wm.o access logs synced to stats1002 - https://phabricator.wikimedia.org/T134776#2317524 (TerraCodes) [09:13:35] Analytics-Kanban, Datasets-General-or-Unknown, WMDE-Analytics-Engineering: Fix permissions on dumps.wm.o access logs synced to stats1002 - https://phabricator.wikimedia.org/T134776#2276672 (TerraCodes) Has this been resolved? [09:40:19] Analytics-Cluster, Analytics-Kanban: Configure Spark YARN Dynamic Resource Allocation - https://phabricator.wikimedia.org/T101343#2317576 (elukey) Relevant guide from CDH: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_running_spark_on_yarn.html#concept_zdf_rbw_ft ``` Starting with... [09:48:05] (PS1) Amire80: Add "hookaborted" error [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/290190 [10:07:51] Analytics-Kanban, Datasets-General-or-Unknown, WMDE-Analytics-Engineering: Fix permissions on dumps.wm.o access logs synced to stats1002 - https://phabricator.wikimedia.org/T134776#2317613 (hoo) Open>Resolved a:hoo Yes, the files can now indeed be accessed. [10:12:22] hi team :] [10:24:26] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2317644 (Framawiki) Hi, I have a bug with the new search provider : https://fr.wikipedia.o... [10:26:05] (CR) Mforns: "LGTM, there's one question in the comments, but I would merge this as is. Let me know if we should change it." 
(1 comment) [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289913 (https://phabricator.wikimedia.org/T134506) (owner: Nuria) [10:47:24] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Configure Spark YARN Dynamic Resource Allocation - https://phabricator.wikimedia.org/T101343#2317662 (elukey) After a lot of changes I finally got a working version on cdh101.analytics.eqiad.wmflabs: ``` 2016-05-23 10:24:23,703 INFO org.apache.h... [10:59:05] * elukey lunch! [11:01:06] (CR) Nikerabbit: [C: 2] Add sorted errors [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/282228 (owner: Amire80) [11:09:10] (Merged) jenkins-bot: Add sorted errors [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/282228 (owner: Amire80) [11:18:27] (PS3) Amire80: Add a script for checking number of pages published despite failures [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/282312 (https://phabricator.wikimedia.org/T127283) [11:18:54] (PS2) Amire80: Add a simple script to run all the sorted errors in one go [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/284128 [11:20:53] (PS3) Amire80: Add a simple script to run all the sorted errors in one go [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/284128 [11:22:00] (PS2) Amire80: Add "hookaborted" error [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/290190 [12:05:22] elukey: aqs needs the user-agent and x-client-ip from the front-end rb [12:05:28] elukey: for that, we need to get out https://gerrit.wikimedia.org/r/#/c/289092/ [12:06:50] joal: fyi ^ [12:07:07] mobrovac: looks good [12:07:40] gwicke mentioned in the CR that it is dependent from https://github.com/wikimedia/hyperswitch/pull/41 but the link is broken [12:07:46] (depends on sorry) [12:07:54] yeah [12:08:04] it's a config-format change [12:08:10] that's already present in prod [12:08:14] let me say that [12:09:21] when do we need to release it? 
[12:09:36] like an hour ago? :D [12:09:43] or maybe later on? [12:10:08] haha [12:10:15] as always - the sooner the better [12:10:36] if you can +2 it, i can take care of restarting checking etc [12:10:44] would you mind running the puppet compiler just to triple check? [12:10:48] I'll merge after that [12:10:55] coming right up! [12:11:08] super [12:15:14] elukey: pcc looking good [12:15:22] (posted on the patch) [12:19:24] Analytics, Pageviews-API, Wikidata: "egranary digital library system" UA should be listed as a spider - https://phabricator.wikimedia.org/T135164#2318127 (Addshore) >>! In T135164#2297815, @Nuria wrote: > .Please try to notify owner of UA policy. If they add the word "bot" to UA this would automatica... [12:19:33] Analytics, Pageviews-API, WMDE-Analytics-Engineering, Wikidata: "egranary digital library system" UA should be listed as a spider - https://phabricator.wikimedia.org/T135164#2318128 (Addshore) [12:24:08] mobrovac: thanks! Should I just merge and run puppet on restbase1008.eqiad.wmnet just to verify or is there another rollout strategy for these things [12:24:11] ? [12:24:51] elukey: you just need to merge, i can run puppet and restart rb [12:25:02] everywhere [12:25:36] * elukey feels used by Marko [12:25:41] :P [12:25:47] all right merging [12:26:15] hahaha [12:26:59] completely the opposite - i try to minimise the actions you need to take so that you don't waste time [12:27:00] :) [12:27:14] s/time/your time/ [12:28:29] mobrovac: merged! Thanks.. and let me know if I can help in some way [12:28:37] gr8 thnx! [12:54:01] elukey: all done, aqs should be receiving the headers now [12:54:05] elukey: thnx for the help! [12:54:13] thank you for the work! 
[12:54:24] there is also the scap change, I was forgetting it [12:54:29] maybe we can wait joal [12:54:35] yeah wanted to ping you about that [12:54:36] he had some questions for you afaik [12:54:49] it should be merged before the next aqs deploy [12:55:12] ah right [12:55:14] kk, no pb [12:55:22] I am a bit ignorant on scap but it looks good. joal is in charge of deployments and afaik he is going to deploy a newer version of AQS tomorrow/this_week [12:57:06] kk [13:04:41] Analytics-Cluster, Operations, ops-eqiad: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2318255 (elukey) Adding a note from the SAL: ``` 10:00 reverted net.netfilter.nf_conntrack_tcp_timeout_time_wait on kafka1013 back to 65 (as set by default by puppet) ``` So now... [13:07:22] Analytics-Cluster, Operations, ops-eqiad: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2303007 (MoritzMuehlenhoff) And also: Why was net.netfilter.nf_conntrack_tcp_timeout_time_wait set to the kernel default value of 120? The value of 65 should have been set on system startup... [13:57:49] a-team: anyone else wanna help moderate the internal list? [13:58:08] since Kevin left some messages got ignored when I didn't see them [13:58:56] * elukey hides [13:59:10] :S [13:59:14] sorry :D [13:59:19] :) only volunteers, it's ok if nobody wants it [13:59:29] maybe everybody could be in it? 
Just to have more people [14:00:00] yeah, but then it'll spam everyone [14:00:26] I'm going through this weird admin interface to see if there's a way to just always accept from @wikimedia.org [14:02:07] ahhh got it [14:32:07] (CR) Milimetric: "-1 a few small changes and some notes for later" (5 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns) [14:32:29] (CR) Milimetric: [C: -1] "oops forgot the -1" [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns) [14:54:58] Analytics, Wikipedia-Android-App-Backlog: Investigate recent decline in views and daily users - https://phabricator.wikimedia.org/T132965#2318537 (JAllemandou) Data has been backfilled from bug correction date (2016-05-03 for a full day). It looks correct in Vital signs. [14:57:59] !log suspended all the oozie bundles as prep step for https://gerrit.wikimedia.org/r/#/c/290252 (yes I know super paranoid mode on) [14:58:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [15:03:53] (PS2) Nuria: Adding favicon and correcting link to github depot [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289913 (https://phabricator.wikimedia.org/T134506) [15:04:22] (CR) Nuria: Adding favicon and correcting link to github depot (1 comment) [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289913 (https://phabricator.wikimedia.org/T134506) (owner: Nuria) [15:05:32] (CR) Mforns: [C: 2 V: 2] "LGTM!" [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289913 (https://phabricator.wikimedia.org/T134506) (owner: Nuria) [15:07:08] ottomata: on analytics1028 [15:07:09] 2016-05-23 15:06:08,196 INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port 7337. Authentication is not enabled. 
Registered executor file is /var/lib/hadoop/data/b/yarn/local/registeredExecutors.ldb [15:08:16] 2016-05-23 15:06:08,625 INFO org.apache.hadoop.mapred.ShuffleHandler: httpshuffle listening on port 13562 [15:08:36] and I can see both java processes listening on netstat [15:08:48] so I guess it's fine to proceed incrementally [15:09:57] (CR) Nikerabbit: Add a script for checking number of pages published despite failures (1 comment) [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/282312 (https://phabricator.wikimedia.org/T127283) (owner: Amire80) [15:10:04] mmmm this is weird [15:10:05] 2016-05-23 15:07:04,342 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error in populating headers : [15:10:27] org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/hdfs/appcache/application_1462814849219_46859/output/attempt_1462814849219_46859_m_000127_0/file.out.index in any of the configured local directories [15:11:19] Analytics, Pageviews-API, WMDE-Analytics-Engineering, Wikidata: "egranary digital library system" UA should be listed as a spider - https://phabricator.wikimedia.org/T135164#2318618 (Nuria) The policy doesn't have a specific owner, if that is what you are asking. Here it is: https://meta.wikimed... [15:11:29] elu hm [15:11:30] elukey: hm [15:11:59] hmmmm, elukey maybe that is a running job that did get confused when dynamic allocation was enabled? [15:12:21] ottomata: could be, there were some spark things running [15:12:27] hm, aye [15:12:39] maybe there's some file a dynamic job expects to have [15:12:49] that gets written on job startup, that wasn't present for a running job [15:12:51] just guessing! [15:12:53] hm [15:13:01] can try with another node [15:13:03] and see [15:13:21] yarn doesn't complain about missing libs, configs, etc.. [15:13:36] elukey: was that on startup, or just some time after? 
[15:13:37] aye [15:13:49] some time after the startup [15:15:53] (CR) Nikerabbit: [C: 2] Add "hookaborted" error [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/290190 (owner: Amire80) [15:17:27] hm, ok, [15:18:57] ottomata: on analytics1029 I can see "org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't find job token for job .." for example, that does seem legit [15:20:56] elukey: legit in what way? [15:21:40] ottomata: well I've just restarted the shuffle service together with Yarn [15:22:14] (CR) Amire80: [C: -1] "The crazy part is that it seems to work despite the weirdness." (1 comment) [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/282312 (https://phabricator.wikimedia.org/T127283) (owner: Amire80) [15:22:15] it should be ok to see "oh I can't find $temp_token for XYZ" [15:22:16] no? [15:26:43] oh yeah, ok yeah [15:26:44] i think so [15:26:49] i'm not sure what that is about actually [15:27:03] SecretManager, since we don't really use security features in hadoop [15:30:05] tried on analytics1030 and again org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't find job token for job job_1462814849219_46859 [15:30:13] the job seems to be oozie [15:30:56] INSERT OVERWRITE TABLE wmf.webrequ...hour=13(Stage-1) [15:31:26] is the job failing? or just the stage? [15:31:55] nono I am checking http://localhost:8088/cluster/app/application_1462814849219_46859 [15:32:01] hmm, elukey that is a hive job [15:32:05] hm! [15:32:18] yeah [15:32:23] I was puzzled too :D [15:32:38] oh, elukey did you suspend jobs? 
[15:32:45] yep yep [15:32:48] ah ok [15:33:15] ok elukey i think that looks fine, it is still running [15:33:22] so we now have two shufflers, one for map reduce and one for spark [15:33:32] if we restart them, it might make sense that the former complains a bit [15:34:03] aye [15:37:40] ottomata: for the "analytics learning series" - I am seeing a spark job using analytics1031.eqiad.wmnet as ApplicationMaster. What happens if I restart yarn? [15:37:44] job failed? [15:38:24] I mean, is it the host driving the spark computation across the cluster or is it guided by the master nodes? [15:38:59] hmm, elukey i'm not 100% sure what happens, but, i'm pretty sure hadoop should be resilient about that somehow [15:39:04] even if it means it has to retry the whole job [15:39:07] i don't think it will though [15:39:20] the resourcemanager is the one that assigned the app master [15:39:24] so it knows about it [15:39:42] ahh okok [15:39:48] so I got it correctly [15:39:52] the app master periodically reports back to the resourcemanager about its health, i'm not sure how much the resourcemanager knows about individual tasks that the app master is responsible for [15:39:52] \o/ [15:39:54] :) [15:40:13] so maybe I can proceed not touching the spark app masters [15:40:27] maybe! might be fun to see what happens though! :) [15:41:38] yeah let's test what hadoop does [15:44:27] ottomata: ah I almost forgot - I had to reduce uploads retention this morning, kafka1022 saturated one partition [15:45:04] we might think to wipe a bit of text to be on the safe side for the next days [15:47:29] also moritzm fixed the conntrack issue, it was a sysctl setting set in the wrong way (conntrack timeout) [15:47:48] oh ooof [15:47:50] thanks elukey [15:47:50] the weird thing is that conntrack max now is 524k for all the kafka analytics node [15:47:53] *nodes [15:48:03] oh all of them! [15:48:04] hm [15:48:06] huh. [15:48:11] after restarting 1022? [15:48:12] but from what I recall it was 256k.. 
not sure what happened [15:48:16] no no before that [15:48:55] for some reason conntrack count is 524k now but I am pretty sure that last week was 256k because otherwise we wouldn't have got the alert [15:49:12] and I can't find any trace in SAL/puppet about this [15:49:43] yeah, hm, it used to hover around 120K+, right? before protocol change [15:52:51] yeah and now all the brokers are in range, kafka1013 was the weird one since conntrack timeout was doubled [15:53:01] so it was keeping more connections tracked [15:53:14] must be something related with the reboot [15:53:59] ottomata: meanwhile, I restarted yarn on 1031 that was a spark app master, and now I can see a second attempt on analytics1050.eqiad.wmnet [15:54:39] (Merged) jenkins-bot: Add "hookaborted" error [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/290190 (owner: Amire80) [15:56:55] elukey: nice [15:57:22] elukey: can you tell if it had to restart even the already completed tasks? [15:58:02] oh but elukey the brokers are now back to hovering at around 130K conntrack count, eh? [15:58:12] yeppa! [15:58:37] cool [15:58:43] I am only puzzled about where the 524k limit comes from [15:58:50] huh, i see [15:58:53] because it should be 256k theoretically [15:58:56] will keep investigating [15:59:07] elukey: i *think* we will be ok with disk space now, it should roll over tomorrow sometime [15:59:09] maybe after the ops meeting we can chat a bit about disk consumption? 
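For reference on the numbers being puzzled over here (65 vs 120 timeout, 256k vs 524k max): the relevant knobs are plain sysctl keys. A sketch of how one would inspect them, assuming root on a Linux host with the conntrack module loaded, so illustrative rather than runnable:

```
# Ceiling on tracked connections (the 256k-vs-524k question above)
sysctl net.netfilter.nf_conntrack_max
# Live count that the alert fires on
cat /proc/sys/net/netfilter/nf_conntrack_count

# TIME_WAIT tracking timeout: the kernel default is 120s, puppet sets 65s here.
# kafka1013 sitting at 120 meant closed connections stayed tracked roughly
# twice as long, hence the inflated conntrack count on that broker.
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait
# Revert to the puppet-managed value (what the SAL entry above records):
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65
```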
[15:59:13] aye k [15:59:52] a-team: standduppp [16:00:57] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Fix hive-metastore vs libmysql-jar race condition when provisioning new hive metastore server - https://phabricator.wikimedia.org/T133198#2318824 (Ottomata) p:Triage>Lowest [16:18:11] Analytics, Pageviews-API: 20160431 produces "end timestamp is invalid, must be a valid date in YYYYMMDD format" - https://phabricator.wikimedia.org/T135812#2311749 (Nuria) Pageview API is reporting an error as it should, this should likely be filed as a bug on the tool itself. [16:19:22] Analytics: 20160431 produces "end timestamp is invalid, must be a valid date in YYYYMMDD format" - https://phabricator.wikimedia.org/T135812#2311749 (Nuria) [16:20:07] Analytics: 20160431 produces "end timestamp is invalid, must be a valid date in YYYYMMDD format" - https://phabricator.wikimedia.org/T135812#2311749 (Milimetric) @Nemo_bis: the query asks for 20160401 through 20160431, and 20160431 is not a valid date, in case that wasn't obvious in what Danny_B was saying. [16:25:23] to all the analytics listeners: I am restarting Yarn on the whole hadoop cluster, if you encounter issues please reach out to me :) [16:32:14] Analytics, ArticlePlaceholder, Pageviews-API, Wikidata: Track pageviews of ArticlePlaceholders - https://phabricator.wikimedia.org/T132223#2191933 (Nuria) I am not sure this requires any work from analytics team. Seems like the data you need is already available on pageview API, correct? 
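On the 20160431 ticket above: the API error is ordinary date validation, since April has only 30 days. The same check is easy to reproduce locally, assuming GNU `date` from coreutils (BSD `date` parses arguments differently):

```shell
# A real date parses; 2016-04-31 does not exist, so date(1) rejects it.
date -d 20160430 +%F                                      # prints 2016-04-30
date -d 20160431 +%F 2>/dev/null || echo "end timestamp is invalid"
```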
[16:34:26] Analytics-Kanban: Prototype Data Pipeline on Druid - https://phabricator.wikimedia.org/T130258#2319023 (Nuria) [16:35:06] Analytics-Kanban: Operational improvements and maintenance in EventLogging in Q4 {oryx} - https://phabricator.wikimedia.org/T130247#2319025 (Nuria) [16:40:28] Analytics, Pageviews-API: Improve pageviews error messages on invalid project - https://phabricator.wikimedia.org/T129899#2119381 (Nuria) Example request IT.wikimedia: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/IT.wikipedia/all-access/user/Antonio_De_Gennaro/daily/20160401/20160430 [16:44:32] Analytics-Kanban: lowercase project parameter - https://phabricator.wikimedia.org/T136016#2319072 (Milimetric) [16:47:30] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2319092 (mobrovac) [Gerrit 290264](https://gerrit.wikimedia.org/r/#/c/290264/) aims at enabling global rate limiting for RESTBase. Once that is out in prod, we will be able to enforce per-endpoint global rate l... [16:48:09] Analytics, Operations, Performance-Team, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310174 (Milimetric) I have a quick suggestion to make this play nice with client-side sampling. First, the problem: some Event Logging instrumentation randomly only sends b... [17:01:56] ottomata: I've restarted almost all the analytics nodes, now I guess I need to do the master failover (only for Yarn though) [17:02:09] after that it would be great if you could test it :P [17:02:10] aye [17:02:54] milimetric: edit data meeting? or we skipping cause no joal? 
[17:04:04] yeah, figured he needs to be there [17:04:11] and marcel [17:04:43] thought you'd still be in ops meeting, miscalculated [17:05:55] ok [17:05:58] it is only 1h [17:05:59] s'ok [17:08:57] Analytics, Documentation: Document a proposal for bundling other than load-refine jobs together (see refine/diagram) - https://phabricator.wikimedia.org/T130734#2319218 (Danny_B) [17:16:05] ottomata: ready to test spark, everything restarted :) [17:18:43] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2319253 (Earwig) The copyvio text has been deleted so I can't really investigate this. [17:19:34] ottomata: so.. something I am not sure of [17:19:54] ottomata: how does cache of https://analytics.wikimedia.org/ get sbusted so our deployments are visible? [17:19:59] *gets busted [17:20:05] !log oozie bundles re-enabled [17:20:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [17:21:14] nuria_: no idea! :) [17:21:16] probably just expiry [17:21:17] dunno [17:21:35] (PS13) Nuria: Fix unique devices bugs. Update to knockout 3.4 [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns) [17:22:08] ottomata: which means that regardless of state of our git clone we will be seeing an outdated version .. let me ask in ops [17:22:54] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Configure Spark YARN Dynamic Resource Allocation - https://phabricator.wikimedia.org/T101343#2319255 (elukey) All the analytics nodes restarted, the new shuffler java process is currently running on all of them (listening on port 7337). The last... [17:24:27] (CR) Nuria: "Updated commit message and two small fixes." 
(2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns) [17:25:03] milimetric: corrected commit message of dashiki [17:28:48] !log re-run from Hue webrequest-load-wf-(text|upload)-2016-5-23-13. The failures were likely caused by my global Yarn restart on the cluster. [17:28:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [17:34:14] (PS1) Nuria: Correcting piwik info [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/290272 (https://phabricator.wikimedia.org/T134506) [17:34:56] (CR) Nuria: [C: 2 V: 2] "Self merging correction of typo." [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/290272 (https://phabricator.wikimedia.org/T134506) (owner: Nuria) [17:53:06] ottomata: whenever you have 10 minutes, can we chat about kafka and spark? [17:57:45] oh ja, gimme 10 mins then ja [18:05:26] elu ja let [18:05:29] hiiii [18:05:32] so jaa kafka [18:05:53] so, what did you do to clear up space this morning? [18:06:22] hhhhhheeellloooo [18:06:36] so I set upload's retention to 48hrs [18:06:43] copy/paste of your recipe [18:06:49] ahhh ok [18:06:56] perfect, so you just cleared out some more upload [18:06:57] nice :) [18:07:13] ottomata: https://grafana.wikimedia.org/dashboard/db/kafka?panelId=35&fullscreen [18:07:22] elukey: did you just do that? [18:07:26] nope! [18:07:26] a couple of hours ago? [18:07:36] oh, maybe the 7 retention mark was just hit! [18:07:41] yesssss \o/ [18:07:46] 7 day* [18:07:46] :) [18:07:47] yeehaw [18:08:00] perfect, so we are good to go! [18:08:12] much better now :) [18:08:22] great, thanks for doing that this morning, we were getting close [18:08:30] how's spark lookin? [18:08:43] you say you aren't sure about that port...I think we have ferm open on hadoop for other reasons, checking... 
[18:08:53] hmm, maybe not [18:09:11] ha, well, pretty open [18:09:12] hehe [18:09:13] ferm::service{ 'hadoop-access': [18:09:13] proto => 'tcp', [18:09:13] port => '1024:65535', [18:09:13] srange => '$ANALYTICS_NETWORKS', [18:09:13] } [18:09:31] ah good I was about to check because didn't get time today :) [18:09:37] well it is good then! [18:09:41] yeh, may have mentioned this to you before [18:09:46] moritz and i tried to lock it down [18:09:51] but appmasters listen on random ports [18:10:03] there is a setting to restrict those ports to a range [18:10:07] but there's some bug that keeps that setting from working [18:10:10] so we had to just open it up :/ [18:10:21] ahhhhh yes yes ok! [18:10:41] hm, but [18:10:42] we should edit [18:10:51] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Ports [18:11:12] elukey: i'll do that [18:11:15] i see some other things i want to add [18:11:20] * elukey didn't know about this page :D [18:11:26] port 7337 cool [18:12:11] yep [18:14:21] ottomata: all right I am going offline and then will recheck later on if oozie is fine, but there is nothing to handover [18:14:30] and kafka solved by itself [18:14:33] so super good [18:14:34] :) [18:15:43] oook, laters elukey! i will see if i can test the spark dynamic stuff later, but also maybe not til tomorrow [18:15:57] super thanks! [18:16:00] elukey: thanks for the ganglia patch [18:16:00] byyyeee a-team! [18:16:03] i added graphite metrics too [18:16:05] see you tomorrow [18:16:20] ori: thanks a lot for the merge! I was about to ping you but you were super quick :) [18:16:27] ottomata: i think git updates are not happening on analytics.wikimedia.org [18:16:51] nuria looking [18:16:53] ottomata: let me make sure i had ensure latest [18:16:58] it is done by puppet, so would take at least 30 mins [18:17:00] ori: ah we have also graphite metrics? [18:17:10] ottomata: i think that's it [18:17:26] nuria_: am running puppet [18:17:27] ottomata: ya but it was.. ahem.. 
hours ago [18:17:28] will check tomorrow then, great! [18:17:29] will see if it pulls [18:17:37] k [18:17:49] ottomata: we do not have ensure =>latest [18:17:50] nope no pull! [18:17:51] ohhhh [18:17:54] there you go [18:17:55] i thought it was there [18:17:57] do you want it? [18:18:00] ottomata: right, let me send patch [18:18:00] or should we pull manually? [18:18:15] ottomata: it should be there, submitting patch [18:18:18] ok [18:20:07] ottomata: so I wrote this script for jenkins to calculate what the changelog between the most recent tag and the future tag is and add those to the changelog.md file and push it to git [18:21:58] hm, ok madhuvishy [18:22:13] sounds cool, but also maybe could be messy if we aren't good with commit messages? :p [18:22:14] https://www.irccloud.com/pastebin/wmi4BmP5/ [18:22:34] but sounds ok i guess [18:22:42] ottomata: hmmm yes but then we can't automate [18:22:46] we do often forget to update changelog [18:22:52] well, we can update changelog manually as we go [18:23:01] we should be updating it with commits, that matter, [18:23:10] but, i'm fine with automating it too [18:23:34] hmmm I think when the jenkins automation was discussed this was one of the things joal mentioned [18:24:11] anyway - it might be a good idea to keep this script in refinery source and just launch it from the jenkins workspace [18:24:45] (PS14) Nuria: Fix unique devices bugs. Update to knockout 3.4 [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns) [18:25:06] ottomata: let me know what you think, and i'll submit a patch. 
will probably make the branch to be pushed to a parameter [18:27:57] madhuvishy: i'm fine with it :) [18:28:04] if joal likes it then I think it's cool [18:28:09] :) [18:28:28] ottomata: okay :) [18:30:47] milimetric: there is another bug for unique devices visualization, nothing related to ko or display so i will work on it on top of current patch which after your comments i think is reday to get merged [18:30:51] *ready [18:31:14] k, will merge [18:35:28] (CR) Milimetric: [C: 2 V: 2] Fix unique devices bugs. Update to knockout 3.4 [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) [18:35:38] cool nuria_, looks good. [18:35:47] you know, I was thinking as I see "ko if: metric && metric() && (mergedData && mergedData() && mergedData().header.length)" which is now kind of a monster [18:36:03] milimetric: k, will fix the new thing i found on another patch [18:36:04] it'd be cool to have like a ready computed on objects that have this kind of situation [18:36:32] like this.ready = ko.computed(... check all the things ...) in the view model so the templates don't have to worry about it [18:36:37] I'll file a task [18:38:22] Analytics: dashiki: Simplify readiness checking by making a ready computed - https://phabricator.wikimedia.org/T136025#2320649 (Milimetric) [18:43:04] ottomata: wonder if we could add a word like [minor commit] borrowing from mediawiki's minor edit - and those would be ignored from changelog too [18:43:29] madhuvishy: suppose, but i wonder if it would be better to do it the other way [18:43:40] mark a commit as being for inclusion [18:43:44] explicitly say show up [18:43:46] sure [18:43:47] madhuvishy: what happens if changelog already has some entries? [18:43:54] for the version that is being released? [18:43:55] ottomata: as in? [18:43:57] ah [18:44:01] not a thing i considered [18:44:10] like what if we want to put something in changelog after commits have been merged? 
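The jenkins script being discussed (the irccloud paste above) boils down to "collect commit subjects since the most recent tag". A toy sketch of that git walk, using a throwaway repo with made-up commit messages rather than refinery itself:

```shell
set -e
# Build a scratch repo: one commit that gets tagged, one commit after the tag.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "initial release work"
git tag v0.1.0
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "Add hookaborted error"

# Changelog candidates: every commit subject since the latest tag.
last_tag=$(git describe --tags --abbrev=0)
git log --format='- %s' "${last_tag}..HEAD"   # prints: - Add hookaborted error
```

Whether one line per commit makes a good changelog is exactly what the rest of the conversation debates: commit messages are fine-grained, changelog entries usually aren't.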
[18:44:23] ottomata: as in after the release [18:44:25] ? [18:44:29] madhuvishy: the reason i am hesitant about this idea, is that changelog does not usually map to commits [18:44:36] commit messages are often small and detailed [18:44:39] changelogs are broader [18:44:46] ottomata: hmmm [18:44:48] a single changelog entry might be just about a new feature or something [18:44:52] which was composed of many commits [18:45:02] or maybe a backwards incompatible change that should really be noticed [18:45:08] it might get lost in a lot of detailed commit messages [18:45:32] i don't really care so much for refinery though :) [18:45:42] ottomata: hmmm [18:45:51] which is why i'm ok with it if joal really likes it, but i can see that it would be handy to at least be able to manually edit changelog [18:45:53] well [18:45:53] somehow [18:46:19] madhuvishy: not after the release, just after commits are merged [18:46:26] once release is out it's a little too late [18:46:41] we could edit a changelog for a release in master for posterity [18:46:54] which is helpful, but it really should be done before the release [18:46:57] since the changelog is part of the release [18:47:01] ya i see what you mean - i guess though we should either automate it fully or do it manually [18:47:24] ja, i would lean towards just making us better at editing changelog as we code [18:47:35] if we push a feature, or a change to pageview def, add it to the changelog [18:47:39] the changelog entry could even be -SNAPSHOT [18:47:41] until the release [18:47:43] the way this will work - once we are ready to release a version - we go click a button on jenkins [18:47:51] you could automate the edit of the version in the changelog [18:47:55] it will commit a changelog for the version to be released [18:48:03] and do the release [18:48:23] aye [18:48:46] the changelog entry won't happen until this release build does [18:48:59] so i dont see how we could do something in between [18:49:21] as in, we can always 
first commit the changelog and then do the release - and not launch my script
[18:49:42] madhuvishy: what about
[18:49:57] what if the automation was similar to the way the pom is updated?
[18:50:05] on release:
[18:50:14] remove -SNAPSHOT from changelog version, commit
[18:50:15] release
[18:50:17] after release
[18:50:30] add entry to changelog that says $version+1-SNAPSHOT
[18:50:35] with no entries
[18:50:44] so, whatever entries are under
[18:50:58] $version-SNAPSHOT when the release is performed, are what get committed
[18:50:59] to the changelog
[18:51:03] then we'd manually add entries
[18:51:11] but the script would automate changing the version and releasing
[18:51:39] hmmm - i don't know what purpose that serves - i thought the reason for this was that we have a manual step now that requires patching changelog, and that jenkins could automatically handle that part
[18:52:12] madhuvishy: i didn't know about this changelog step, i haven't been involved in defining the tasks for this
[18:52:29] in my mind, automating the changelog entries with git commits isn't very useful
[18:52:43] It could also be - go type the changelog as part of the release here - https://integration.wikimedia.org/ci/job/analytics-refinery-release/m2release/ and it'll get committed
[18:52:57] i don't know if that'll work yet
[18:52:58] oh, in another field?
[18:52:59] madhuvishy: ?
[18:53:00] yes
[18:53:04] hm, ja suppose
[18:53:10] that would be kinda cool
[18:53:16] if people can remember
[18:53:28] i don't know if jenkins allows sending these variables to external scripts
[18:53:32] hah, then we'd have to look at the commit log ourselves :p
[18:53:35] to remember
[18:53:41] well
[18:53:56] i could maybe auto-populate it with what i think should go in it
[18:54:08] rather what git thinks
[18:54:26] how would you tell?
[18:54:32] oh
[18:54:37] using this same script
[18:54:43] like, auto-populate the field, but still allow people to edit it
[18:54:43] ?
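The release flow sketched at 18:50 above (strip -SNAPSHOT from the changelog header, release, then open an empty $version+1-SNAPSHOT section for manual entries) could look roughly like this. This is a hypothetical helper, not the actual refinery tooling; the changelog path and the "## x.y.z-SNAPSHOT" header format are assumptions:

```python
from pathlib import Path

def finalize_changelog(path, version):
    """Before the release: turn the '## <version>-SNAPSHOT' header into
    the final '## <version>' header, mirroring the pom version bump.
    (Header format is an assumption, not refinery's real layout.)"""
    text = Path(path).read_text()
    Path(path).write_text(text.replace(f"{version}-SNAPSHOT", version, 1))

def open_next_snapshot(path, version):
    """After the release: prepend an empty section for the next
    '<version+1>-SNAPSHOT', ready for manually added entries."""
    major, minor, patch = (int(p) for p in version.split("."))
    header = f"## {major}.{minor}.{patch + 1}-SNAPSHOT\n\n"
    Path(path).write_text(header + Path(path).read_text())
```

A Jenkins release job could call `finalize_changelog` and commit before running the maven release, then `open_next_snapshot` and commit again afterwards, so whatever entries sit under the -SNAPSHOT header at release time are exactly what ships.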
[18:54:47] yeah
[18:54:52] oh that
[18:54:54] 's cool
[18:54:55] if it isn't hard
[18:54:59] if it is hard....heheh, don't worry about it
[18:55:00] i have no idea if it's possible
[18:55:01] yeah
[18:55:14] i still think it would be better if we were just diligent about modifying changelog as part of regular coding
[18:55:29] wouldn't need to be done for small things, but for big things it would be good if we just did it
[18:55:44] then if someone forgets, it could even be done as a new commit before someone does a release
[18:55:52] it took me a while over the last week to figure out this changelog automation stuff - i'd rather it be useful than not
[18:55:57] :)
[18:56:05] aye
[18:56:22] madhuvishy: hehe, i mean, really, i'm not that picky about refinery for this, no one is going to read the changelog except for us
[18:56:27] if you and joal really like this, then it's ok
[18:56:42] i'm not going to block it...am just giving you my thoughts :)
[18:58:21] ottomata: yeah no problem, will check with joal too :) i should have maybe asked earlier - i was ranting during standup but i guess it's hard to keep track.
[19:32:57] milimetric: does dashiki master work for you? i think it has a dep problem
[19:33:18] (checking)
[19:33:51] I only get two warnings about unnecessary resolutions
[19:35:40] nuria_: ^
[19:35:46] what do you see?
[19:35:49] milimetric: ja
[19:36:00] milimetric: dependency hell pretty much
[19:36:07] milimetric: let me wipe out bower
[19:36:11] weird
[19:48:48] milimetric: i think it was "self-inflicted stupidity"
[19:49:08] :) been there
[20:06:31] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Configure Spark YARN Dynamic Resource Allocation - https://phabricator.wikimedia.org/T101343#2320978 (Ottomata) Hmmm, Just noticed 2 things. [[ http://spark.apache.org/docs/1.5.0/configuration.html#dynamic-allocation | Spark config docs ]] say...
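The auto-populate idea from the 18:53-18:54 exchange above (seed the Jenkins changelog field with what git thinks changed since the last release, then let a human edit it) might look roughly like this. This is a sketch under assumptions: releases are reachable as git tags, `git log --format=%s` subjects are a reasonable starting point, and the section header format is invented for illustration:

```python
import subprocess

def commit_subjects(since_tag, repo="."):
    """Collect one-line commit subjects since the last release tag by
    running `git log --format=%s <since_tag>..HEAD` in the repo."""
    out = subprocess.run(
        ["git", "log", "--format=%s", f"{since_tag}..HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]

def draft_changelog_section(version, subjects):
    """Format commit subjects as a draft section for the release field.
    A human still trims and merges these before the release commits it."""
    bullets = "\n".join(f"- {s}" for s in subjects)
    return f"## {version}\n{bullets}\n"
```

As the discussion notes, this only produces a draft: commit messages are small and detailed while changelog entries are broader, so the value is in pre-filling the field, not in committing the output unedited.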
[20:14:08] (PS1) Nuria: MonthlyUnique devices metric should be bookmarkeable [analytics/dashiki] - https://gerrit.wikimedia.org/r/290306 (https://phabricator.wikimedia.org/T122533)
[20:14:45] (PS2) Nuria: MonthlyUniqueDevices metric should be bookmarkeable [analytics/dashiki] - https://gerrit.wikimedia.org/r/290306 (https://phabricator.wikimedia.org/T122533)
[20:14:51] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Configure Spark YARN Dynamic Resource Allocation - https://phabricator.wikimedia.org/T101343#2321005 (Ottomata) Also, if we want to set `spark.dynamicAllocation.maxExecutors`, we shouldn't hardcode it in the cdh module spark config file, ti shoul...
[20:56:51] Analytics, Analytics-Wikistats, Internet-Archive: Total page view numbers on Wikistats do not match new page view definition - https://phabricator.wikimedia.org/T126579#2017776 (Nuria) Any more updates on this regard?
[20:58:56] Analytics, Wikipedia-Android-App-Backlog: Investigate recent decline in views and daily users - https://phabricator.wikimedia.org/T132965#2321086 (Nuria) @Tbayer: Can you confirm this ticket can be closed?
[21:01:48] Analytics, WMDE-Analytics-Engineering, Wikidata: [Task] dashboard showing browser usage distribution for Wikidata - https://phabricator.wikimedia.org/T130102#2321093 (Nuria) I realize this task did not included the link to browser reports: https://browser-reports.wmflabs.org/#all-sites-by-os Please...
[21:05:37] Analytics: fundraising-tech request: browser version breakdown by country - https://phabricator.wikimedia.org/T133407#2321106 (Nuria) A non technical person could: 1) work with FR tech developers that have cluster access 2) request this data to reading team. >I'm pretty sure hive access would be a bit da...
[21:12:00] Analytics-Kanban, Datasets-General-or-Unknown, WMDE-Analytics-Engineering: Fix permissions on dumps.wm.o access logs synced to stats1002 - https://phabricator.wikimedia.org/T134776#2321110 (Nuria) Resolved>Open
[22:58:41] Analytics, Pageviews-API: Provide a pageview API which pre-filters transient spikes from a few days or so - https://phabricator.wikimedia.org/T136049#2321458 (Jdforrester-WMF)