[08:25:33] joal: o/
[08:30:29] elukey@kafka1013:/var/log/kafka$ grep -i compact log-cleaner.log
[08:30:30] [2016-05-24 13:12:05,573] INFO Compaction for partition [webrequest_upload,9] is resumed (kafka.log.LogCleaner)
[08:30:44] [2016-05-24 13:12:05,790] INFO Compaction for partition [webrequest_text,2] is resumed (kafka.log.LogCleaner)
[08:31:14] that is when we rebooted 4.4 to get the new kernel
[08:31:36] and from that point onwards, the log size increased
[08:31:53] because of the mtime=now change for all the logs
[08:33:00] afaik we are not using message keys so theoretically the log compaction strategy is not needed
[08:33:12] moreover it could easily explain what we are seeing
[08:33:39] Hi elukey
[08:33:49] no compaction happens, but the kafka thread sets mtime to now since in its opinion it changes something
[08:33:52] batcave for this kafka thing?
[08:33:52] helloooo
[08:34:31] I am still grabbing some info/logs, just wanted to know from you if this could be a lead or if it is expected
[08:34:47] but we can chat if you wish :)
[08:34:49] elukey: I need some help to recall the actual issue ;)
[08:34:53] ahhhhh
[08:35:09] sure joining, give me 2 mins
[08:35:19] elukey: There has been a weekend, so my brain has been fully emptied :)
[08:37:33] :D :D
[08:37:40] joining!
[09:04:38] elukey: forgot to tell you: I have run a few queries on misc partition for metric consolidation, and realised it couldn't work :)
[09:11:23] Analytics-Tech-community-metrics: Add remaining KPIs to kpi_overview.html once available in korma - https://phabricator.wikimedia.org/T116572#2338244 (Aklapper) Open>stalled
[09:12:20] joal: sorry I didn't get the last part.. What was related to? (Slow Monday for me :)
[09:13:20] elukey: data warning on misc partition: I tried to rerun queries removing boundaries minutes
[09:13:33] And this actually doesn't change a lot
[09:16:28] ahhhhh
[09:16:35] ok now it makes sense
[09:17:12] I hoped that it would have worked
[09:17:31] but does it reduce the "holes" detected ?
[09:17:44] Actually, if I had thought a bit more about the thing, I could have found without testing that it can't :)
[09:17:52] But, testing is a good approach as well :)
[09:25:47] so back to square 0
[09:25:57] anyhow, about kafka
[09:25:58] last restart
[09:25:59] [2016-05-24 12:51:31,272] INFO KafkaConfig values:
[09:26:00] ..
[09:26:01] ..
[09:26:07] log.cleaner.enable = true
[09:26:14] that afaik is the compactor
[09:27:37] Configuration parameter log.cleaner.enable is now true by default. This means topics with a cleanup.policy=compact will now be compacted by default
[09:28:00] and we don't set the cleanup policy..
[09:28:33] ok
[09:29:01] Therefore expected but not wanted behavior
[09:29:34] I think that we had a kafka::default puppet class with the "delete" cleanup policy set, but it got dropped for the new confluent module
[09:29:38] that has defaults
[09:31:19] buuuuut from kafka docs delete should be the default one
[09:31:21] mmmmmmmmmmmmmmmmmmmmmmmmmm
[09:35:34] cleanup.policy -> delete
[09:35:58] weird
[10:06:17] joal: https://gerrit.wikimedia.org/r/#/c/291697/ - will wait for mr Ottomata
[10:06:48] one thing that I've noticed in labs is that the cleanup policy is set among the list of options set during startup
[10:07:01] meanwhile in prod it is not
[10:07:04] for some reason
[10:07:22] I just put a default in the module, that is basically what we implicitly expect
[10:48:58] elukey: Would you mind having a quick look at https://etherpad.wikimedia.org/p/analytics-aqs-cassandra
[10:56:51] joal: looks good, the only thing that I was wondering is if we have also the percentile calculations
[10:56:55] like ab does
[10:59:01] (brb lunch, will be back in 30/45 mins)
[11:18:59] (back)
[11:58:09] Analytics-Kanban, DC-Ops, Operations, ops-eqiad: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2338489 (elukey) Open>Resolved
[11:58:36] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Configure Spark YARN Dynamic Resource Allocation - https://phabricator.wikimedia.org/T101343#2338490 (elukey) Open>Resolved
[12:27:46] elukey: Got some results for yaaaaaaaaah :)
[12:28:05] mforns: o/
[12:28:09] hi joal
[12:35:34] joal: ??
[12:35:45] aqs load test :)
[12:35:51] elukey: batcave?
[12:41:36] elukey: Hey, let nuria close tasks :-P
[12:41:52] joal: is it ok in 30 mins?
[12:41:58] elukey: sure :)
[12:42:08] joal: one was Ops, I'll notify the other one to Nuria :)
[12:42:29] * joal is happy to have managed to bug elukey for nothing
[12:42:40] * joal whistle looking up in the air
[12:47:19] ahhahaha
[12:47:21] https://grafana-admin.wikimedia.org/dashboard/db/varnishkafka
[12:49:05] elukey: up to 2000 reqs / sec on new-aqs :)
[12:49:43] WOA
[12:49:45] hahahahaha
[12:49:52] are you sure about that?? :D
[12:49:57] 2k/s???
[12:50:40] Every info I have leads me to think there is no bug in the analysis
[13:03:06] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2338675 (elukey) Created a grafana dashboard from Varnishkafka metrics: https://grafana.wikimedia.org/dashboard/db/varnishkafka
[14:05:15] joal: sorry I am still full into varnishkafka, do you mind if we postpone? :(
[14:05:32] np elukey, I'd guessed you'd have other things :)
[14:35:48] joal, qq: how can I compare a binary timestamp with a given date when querying a dataframe? :]
[14:36:24] mforns: I think binary timestamp is a string with a dedicated (and reasonably self-explanatory) format
[14:37:48] mmmm
[14:44:20] mforns: Do you want us to pair for a while before standup?
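(Editor's sketch for the binary timestamp question at 14:35:48 and the cast-based answer that follows at 14:47:05. This is plain Python, not Spark, and it assumes the binary timestamp is a 14-character YYYYMMDDHHMMSS string, which the log does not confirm. The rows and the helper names parse_binary_timestamp and rows_after are made up for illustration. The idea is the same as in the SQL answer: turn both sides into a real timestamp type before comparing.)

```python
from datetime import datetime

# Assumption: the "binary timestamp" is 14 ASCII digits, YYYYMMDDHHMMSS.
TS_FORMAT = "%Y%m%d%H%M%S"


def parse_binary_timestamp(raw):
    """Convert bytes or str in YYYYMMDDHHMMSS form into a datetime."""
    if isinstance(raw, bytes):
        raw = raw.decode("ascii")
    return datetime.strptime(raw, TS_FORMAT)


def rows_after(rows, cutoff):
    """Keep rows whose log_timestamp is strictly later than cutoff."""
    return [
        row for row in rows
        if parse_binary_timestamp(row["log_timestamp"]) > cutoff
    ]


# Hypothetical sample rows; log_timestamp mirrors the column in the query.
rows = [
    {"log_id": 1, "log_timestamp": b"20151231235959"},
    {"log_id": 2, "log_timestamp": b"20160101000000"},
    {"log_id": 3, "log_timestamp": "20160315120000"},
]

# Same cutoff as in the chat: midnight on 2016-01-01.
cutoff = datetime(2016, 1, 1)
recent = rows_after(rows, cutoff)
print([row["log_id"] for row in recent])  # prints [3]
```

(Row 2 sits exactly on the cutoff, so the strict ">" drops it, just as a strict ">" in a where clause would.)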
[14:47:05] joal, I found it, the comparison needs to be like this: where log_timestamp > cast('2016-01-01' as timestamp)
[14:49:03] k mforns
[15:04:15] elukey: do you agree for read-load test while loading a new month of data on aqs?
[15:04:49] sure sure!
[15:05:00] sorry I am deep into vk's code today :(
[15:05:02] okey elukey, going for that then !P
[15:05:12] np, we'll discuss tomorrow
[15:14:38] joal, mforns - I am going to the ops meeting today so I guess that you'll be alone during standup?
[15:14:55] elukey: I think nuria works today
[15:14:57] elukey, ok, maybe nuria comes?
[15:28:09] not sure :)
[16:00:50] a-team: standdupp?
[16:05:10] nuria_: ops meeting, will send the e-scrum :)
[16:14:47] Analytics, Pageviews-API: Provide a pageview API which pre-filters transient spikes from a few days or so - https://phabricator.wikimedia.org/T136049#2321458 (mforns) @Jdforrester-WMF Please, could you provide more context, because if you're talking about spikes that are due to bots, there is already a...
[16:24:55] Analytics-Kanban: Browser report on analytics.wikimedia.org has broken icons - https://phabricator.wikimedia.org/T136217#2339043 (mforns) a:Nuria
[16:25:33] Analytics-Kanban: Enable pivot ui so non analytics engineers can query druid pageview data (poc) - https://phabricator.wikimedia.org/T136331#2339048 (mforns)
[16:29:05] Analytics-Kanban, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2339050 (mforns)
[16:29:55] Analytics-Kanban, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333507 (mforns) @elukey Can you clarify what is the action to do in this task? Thanks!
[16:36:47] Analytics, Pageviews-API: Add support for outreachwiki to pageviews API - https://phabricator.wikimedia.org/T132313#2194166 (mforns) Merging another task that asks basically for the same feature but for the mediawiki project.
[16:36:49] Analytics: stats.grok.se doesn't offer mediawiki.org page view stats - https://phabricator.wikimedia.org/T111662#2339059 (mforns)
[16:36:51] Analytics, Pageviews-API: Add support for outreachwiki to pageviews API - https://phabricator.wikimedia.org/T132313#2339061 (mforns)
[16:38:10] Analytics: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2339063 (Nuria)
[16:39:27] Analytics, Pageviews-API: Add support for outreachwiki to pageviews API - https://phabricator.wikimedia.org/T132313#2339067 (Nuria)
[16:40:14] Analytics: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2339068 (Nuria)
[16:40:16] Analytics: stats.grok.se doesn't offer mediawiki.org page view stats - https://phabricator.wikimedia.org/T111662#2339069 (Nuria)
[16:40:42] Analytics, Pageviews-API: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2130851 (Nuria)
[16:41:32] Analytics, MediaWiki-API, Reading-Infrastructure-Team: Load API request count and latency data from Hadoop to a dashboard - https://phabricator.wikimedia.org/T108414#2339071 (mforns)
[16:43:18] Analytics, MediaWiki-API, Reading-Infrastructure-Team: Load API request count and latency data from Hadoop to a dashboard - https://phabricator.wikimedia.org/T108414#1520419 (mforns) @Tgr This can be easily done with [[ https://wikitech.wikimedia.org/wiki/Analytics/Reportupdater | reportupdater ]] an...
[17:02:36] joal, you ok?
[17:02:40] or it's me?
[17:02:47] it's me for sure, reconnecting
[17:09:34] I just found something horrible in the vk's code that I didn't see before
[17:09:43] after 3 hours of debugging
[17:11:14] elukey: like something that could cause our problem?
[17:12:52] so I was investigating a possible cause of the difference in timings, and one thing that came up discussing with ema is that we now use the time in which varnish receives a request as dt, meanwhile before it was different (response completed afaik, still need to verify).
[17:13:10] ahahahaha
[17:13:13] that would do it
[17:13:19] so I said, let's try to change the matchers! It will be as easy as changing a string
[17:13:34] and then I went into a rabbit hole of horror
[17:14:04] it has been like playing jenga over and over
[17:14:12] removing the wrong brick at first try
[17:14:56] nuria_: I closed two phab tasks today, one was ops and one was ours (the one for Dynamic executors)
[17:15:02] they were in done under my name
[17:15:18] elukey: k
[17:15:35] rabbit hole of horror ... jajajaja
[17:15:37] ay ayaya
[17:16:08] I will win
[17:16:20] vk always defeates me multiple time with no mercy before the final battle
[17:16:28] *defeats
[17:16:43] :D
[17:16:45] * joal do believes in elukey
[17:16:54] need to go a-team, see you !
[17:16:58] byeeee
[17:17:02] joal: sorry again for today :(
[17:17:05] joal, byyeeeeeee :]
[17:21:09] nuria_: we also have now https://grafana.wikimedia.org/dashboard/db/varnishkafka
[17:21:18] elukey: nice!
[17:29:09] Analytics, WMDE-Analytics-Engineering, Wikidata: [Task] dashboard showing browser usage distribution for Wikidata - https://phabricator.wikimedia.org/T130102#2339204 (Nuria) FYI that when we have our pageview dataset working on druid you could look at this data in an easier fashion, now, as i said (o...
[17:30:29] Analytics, WMDE-Analytics-Engineering, Wikidata: [Task] dashboard showing browser usage distribution for Wikidata - https://phabricator.wikimedia.org/T130102#2339227 (Nuria) See attached a rough preview of 1 week of wikidata requests per browser per country via Druid
[17:45:58] a-team logging off!
[17:46:03] Just sent the e-scrum
[17:46:04] elukey, bye!
[17:46:27] also, tomorrow I'll be afk for a couple of hours tops in the EU morning, Cc mforns/joal
[17:46:59] ok elukey, see you tomorrow!
[21:09:53] Quarry: Add date when query was last run - https://phabricator.wikimedia.org/T77941#832144 (agray) This would be very useful for reports which use Quarry data (allowing us to timestamp the source for the end-user). The page currently reports the ID of the last run (in the source as `"qrun_id": 12345`) which...
[21:42:59] (PS1) Nuria: Adding missing icon file [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/291833 (https://phabricator.wikimedia.org/T136217)
[21:51:20] Analytics-Kanban, Patch-For-Review: Create repo analytics.wikimedia.org with index and build of browser reports for puppet to source and deploy to analytics.wikimedia.org - https://phabricator.wikimedia.org/T134506#2339910 (Nuria) Open>Resolved
[21:51:54] Analytics-Kanban: Ease restarting and backfilling of jobs in cluster {hawk} - https://phabricator.wikimedia.org/T115985#1737628 (Nuria) @JAllemandou: can you add a bit of detail here as to what was done?
[21:52:33] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: More solid Eventlogging alarms for raw/validated {oryx} [8 pts] - https://phabricator.wikimedia.org/T116035#2339913 (Nuria) Open>Resolved
[21:52:55] Analytics, Analytics-Cluster, Patch-For-Review: Single Kafka partition replica periodically lags - https://phabricator.wikimedia.org/T121407#2339918 (Nuria)
[21:52:58] Analytics, Analytics-Cluster, Operations, Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2339917 (Nuria)
[21:53:02] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2339916 (Nuria) Open>Resolved
[21:53:40] Analytics-Kanban: Get jenkins to automate releases {hawk} - https://phabricator.wikimedia.org/T130122#2339920 (Nuria)
[21:53:42] Analytics-Kanban: Figure out the exact strategy for release {hawk} - https://phabricator.wikimedia.org/T132180#2339919 (Nuria) Open>Resolved
[21:54:07] Analytics-Kanban, Patch-For-Review: Create debian packages for druid - https://phabricator.wikimedia.org/T134503#2339922 (Nuria) Open>Resolved
[21:54:09] Analytics-Kanban: Prototype Data Pipeline on Druid - https://phabricator.wikimedia.org/T130258#2339923 (Nuria)
[21:55:32] Analytics-Kanban: Backfill Android Apps pageviews from May 2nd hour 21 - https://phabricator.wikimedia.org/T135299#2339929 (Nuria) Open>Resolved
[21:55:52] Analytics-Kanban: Have archiva server credentials available via the Config File Builder in global maven settings.xml - https://phabricator.wikimedia.org/T132178#2339930 (Nuria) Open>Resolved
[21:55:54] Analytics-Kanban: Get jenkins to automate releases {hawk} - https://phabricator.wikimedia.org/T130122#2339931 (Nuria)
[21:56:08] Analytics-Kanban: Generate test data for Cassandra - https://phabricator.wikimedia.org/T134899#2339932 (Nuria) Open>Resolved