[09:01:46] Analytics-Tech-community-metrics, ECT-August-2015: Fine tune "Code Review overview" metrics page in Korma - https://phabricator.wikimedia.org/T97118#1566554 (Aklapper) >>! In T97118#1399396, @Qgil wrote: > bring the generic graphs from gerrit_review_queue.html. scr.html could contain all the general graph...
[09:02:04] Analytics-Tech-community-metrics, ECT-August-2015, Patch-For-Review: Fine tune "Code Review overview" metrics page in Korma - https://phabricator.wikimedia.org/T97118#1566555 (Aklapper)
[09:58:00] Analytics-Backlog: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1566760 (Springle) @mforns, yes, feel free. Eventlogging is alone on its master, so in theory you can hammer it quite hard. However heavy updates can still cause replication lag, so, as ever, it sho...
[09:59:17] Analytics-Backlog: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1566762 (Springle) A white list is fine. No real difference from DBA perspective, so +1.
[10:01:08] Analytics-Backlog: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1566763 (Springle) Regarding implementing this on replicas: nothing special is needed. Change on the master will propagate (but rate-limit everything, to avoid replag).
[12:09:18] (CR) Joal: [C: 2 V: 2] "Looks good to me :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/233044 (https://phabricator.wikimedia.org/T109860) (owner: Ottomata)
[12:14:41] Analytics-Tech-community-metrics: "Tickets" (defunct Bugzilla) vs "Maniphest" sections on korma are confusing - https://phabricator.wikimedia.org/T106037#1567056 (Aklapper)
[12:14:43] Analytics-Tech-community-metrics: Add "Ticket Openers" to Korma's "Activity by contributors" - https://phabricator.wikimedia.org/T105634#1567055 (Aklapper)
[12:15:10] Analytics-Tech-community-metrics: "Tickets" (defunct Bugzilla) vs "Maniphest" sections on korma are confusing - https://phabricator.wikimedia.org/T106037#1457306 (Aklapper) General question: Should we keep Bugzilla statistics in korma for historical purpose? If yes, we should rename "Tickets" to "Bugzilla (d...
[12:15:20] Analytics-Tech-community-metrics, ECT-September-2015: "Tickets" (defunct Bugzilla) vs "Maniphest" sections on korma are confusing - https://phabricator.wikimedia.org/T106037#1567061 (Aklapper)
[12:18:12] Analytics-Tech-community-metrics, ECT-September-2015: "Tickets" (defunct Bugzilla) vs "Maniphest" sections on korma are confusing - https://phabricator.wikimedia.org/T106037#1567069 (Aklapper) Note to myself: This also affects the korma frontpage listing both "Ticket Participants" and "Maniphest participa...
[12:21:50] Analytics-Tech-community-metrics, ECT-September-2015: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1567070 (Aklapper) I'd like to give this a shot myself in September if time allows. >>! In T100978#1513777, @Qgil wrote: > we need a good place to...
[12:32:23] Analytics-Tech-community-metrics, ECT-August-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1567083 (Aklapper)
[12:46:24] Analytics-Tech-community-metrics, ECT-August-2015: Remove deprecated repositories from korma.wmflabs.org code review metrics - https://phabricator.wikimedia.org/T101777#1567112 (Aklapper) @Dicortazar: What's the status when it comes to Octopu? Is this already in place, or will this happen in the next two...
[12:46:45] Analytics-Tech-community-metrics, ECT-August-2015, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1567113 (Aklapper) @Dicortazar: Any news here when it comes to fixing the underlying problem?
[12:53:40] Analytics-Tech-community-metrics, ECT-August-2015: Remove deprecated repositories from korma.wmflabs.org code review metrics - https://phabricator.wikimedia.org/T101777#1567143 (Aklapper) Copying @Dicortazar's comment from T104845#1552564: > Some updates: there's a new tool in Metrics Grimoire named as 'r...
[12:54:54] Analytics-Tech-community-metrics, ECT-August-2015: Remove deprecated repositories from korma.wmflabs.org code review metrics - https://phabricator.wikimedia.org/T101777#1567147 (Aklapper)
[12:55:10] Analytics-Tech-community-metrics, Engineering-Community, ECT-August-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1567148 (Aklapper)
[13:21:27] ottomata: Hi !
[13:21:44] Kafka expension time ?
[13:27:21] hiya!
[13:27:26] i'm going to upgrade camus first and make sure that works.
[13:27:31] doing that now
[13:27:45] yeah, makes sense :)
[13:28:35] have you deployed already ?
[13:28:43] yes
[13:28:45] doing first camus run now
[13:28:50] k
[13:29:06] btw, wanted to talk to you about my load job change, with the hourly stats
[13:29:32] you are right about the partitions, buuuut, that would mean one record per partition, right?
[13:34:40] ottomata: Yeah I have seen your comment
[13:34:49] And I am still thinking about that one
[13:35:14] ottomata: wanna think aloud in cave ?
[13:35:28] k
[14:01:09] Analytics-Backlog: Investigate sample cube pageview_count vs unsampled log pageview count - https://phabricator.wikimedia.org/T108925#1567339 (JAllemandou) Also, it seems the difference on mediawiki.org and wikimediadoundation.org doesn't account for the total of missing data: it sums up to ~12M for the 2015-...
[14:33:01] Analytics-Backlog: Investigate sample cube pageview_count vs unsampled log pageview count - https://phabricator.wikimedia.org/T108925#1567504 (Tbayer) >>! In T108925#1567339, @JAllemandou wrote: > Also, it seems the difference on mediawiki.org and wikimediadoundation.org doesn't account for the total of missi...
[14:34:03] Analytics-Backlog: Investigate sample cube pageview_count vs unsampled log pageview count - https://phabricator.wikimedia.org/T108925#1567507 (Ironholds) Have you considered looking at the codebase used to retrieve this, identifying discrepancies, and, in the absence of discrepancies, just rerunning a sample...
[14:52:03] (PS3) Ottomata: [WIP] Generate hourly aggregate statistics about webrequest sequence stats [analytics/refinery] - https://gerrit.wikimedia.org/r/232644
[14:53:38] Analytics-Kanban: Test for Phab workboard script - https://phabricator.wikimedia.org/T110044#1567628 (mforns) NEW
[14:53:49] (PS4) Ottomata: [WIP] Generate hourly aggregate statistics about webrequest sequence stats [analytics/refinery] - https://gerrit.wikimedia.org/r/232644
[15:31:12] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Upgrade Camus - https://phabricator.wikimedia.org/T109860#1567842 (ggellerman)
[16:14:33] Analytics-Backlog, Reading-Admin, Wikipedia-Android-App: Update definition of page view and implementation for mobile apps {hawk} [8 pts] - https://phabricator.wikimedia.org/T109383#1567936 (JAllemandou)
[16:19:16] Analytics-Backlog, RESTBase: Create a metric for overall RESTBase request rates from Varnish logs [8 pts] - https://phabricator.wikimedia.org/T109547#1567959 (mforns)
[16:20:07] Analytics-Backlog, RESTBase: Create a metric for overall RESTBase request rates from Varnish logs {hawk} [13 pts] - https://phabricator.wikimedia.org/T109547#1551633 (mforns)
[16:37:05] Analytics, operations: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#1568057 (Southparkfan)
[16:41:55] Analytics-Backlog: Add a 'Guard' job for pageviews {hawk} [13 pts] - https://phabricator.wikimedia.org/T109739#1568098 (mforns)
[16:43:05] Analytics-Backlog: Create white list for pageview data {hawk} [8 pts] - https://phabricator.wikimedia.org/T110061#1568100 (JAllemandou) NEW
[16:45:01] Analytics-EventLogging, Analytics-Kanban: Update Schema Talk pages {tick} [8 pts] - https://phabricator.wikimedia.org/T103133#1568119 (madhuvishy)
[16:45:54] Analytics-Kanban, Reading-Admin, Wikipedia-Android-App: Update definition of page view and implementation for mobile apps {hawk} [8 pts] - https://phabricator.wikimedia.org/T109383#1568123 (JAllemandou)
[16:46:05] Analytics-Kanban, RESTBase: Create a metric for overall RESTBase request rates from Varnish logs {hawk} [13 pts] - https://phabricator.wikimedia.org/T109547#1568124 (JAllemandou)
[16:46:09] ok joal, i just reran the full load bundle for a single hour, and all is well.
[16:46:20] select * from wmf_raw.webrequest_sequence_stats_hourly where year=2015;
[16:46:37] i think we can merge, and restart the load bundle
[16:47:27] ottomata: I merge that now
[16:47:49] (CR) Joal: [C: 2 V: 2] "LGTM !" [analytics/refinery] - https://gerrit.wikimedia.org/r/232644 (owner: Ottomata)
[16:48:48] ottomata: talking with mforns and madhuvishy now, can let you deploy ?
[16:50:11] yup
[16:52:12] pshsh, whoops it still had WIP tag on it :/
[16:53:01] aouch .... Sorry ottomata
[16:53:14] not reviewd your commit message thouroughly
[16:53:42] Analytics-Kanban, RESTBase: Create a metric for overall RESTBase request rates from Varnish logs {hawk} [13 pts] - https://phabricator.wikimedia.org/T109547#1568136 (madhuvishy) Couple of questions: - What is the time granularity of the metrics needed - hourly/daily/monthly - etc - We should check where...
[16:54:47] sok
[17:09:23] joal: cool, i have deployed and started a new load bundle.
[17:09:30] awesome
[17:10:41] Analytics-Tech-community-metrics, ECT-August-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1568193 (Aklapper) >>! In T100189#1320120, @Aklapper wrote: > There are plans to exclude obvious Freemail provide...
[17:27:54] Analytics-Kanban: Write scripts to track cycle time of tasked tickets and velocity [8 pts] - https://phabricator.wikimedia.org/T108209#1568264 (mforns) Another idea is to apply the same 4-step logic to lead time: - Given a start date and an end date, - And using the 4-step model described above: - Predi...
[18:38:43] Analytics-Cluster, Analytics-Kanban: Create new Hive / Oozie server from old analytics Dell - https://phabricator.wikimedia.org/T110090#1568618 (Ottomata) NEW a:Ottomata
[18:38:52] Analytics-Cluster, Analytics-Kanban: Create new Hive / Oozie server from old analytics Dell - https://phabricator.wikimedia.org/T110090#1568618 (Ottomata)
[18:38:54] Analytics, operations: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#1568625 (Ottomata)
[18:42:29] Analytics-Cluster, Analytics-Kanban: Create new Hive / Oozie server from old analytics Dell - https://phabricator.wikimedia.org/T110090#1568649 (Ottomata) I have excluded analytics1015 from active Hadoop nodes. Waiting for data to be moved elsewhere.
[18:52:03] Analytics-Kanban, RESTBase: Create a metric for overall RESTBase request rates from Varnish logs {hawk} [13 pts] - https://phabricator.wikimedia.org/T109547#1568723 (GWicke) >>! In T109547#1568136, @madhuvishy wrote: > Couple of questions: > > - What is the time granularity of the metrics needed - hourly...
[18:53:03] Analytics-Cluster, Analytics-Kanban: Add hourly aggregate sequence stats creation to webrequest load job - https://phabricator.wikimedia.org/T109136#1568729 (Ottomata) The load job has been augmented. I am backfilling for the month of August with this query (using dynamic partition values): ``` set hiv...
[19:07:49] ahh crap joal, i didn't set the mapreduce queue properly when i resubmitted load. :/
[19:07:54] gotta kill and resubmit!
[19:14:13] ottomata: no big deal, just a PITA ...
[19:15:03] ja
[19:16:43] ottomata: I have found the real reason for which hdfs2cass is not suitable for us !
[19:16:58] I found that before, forgot about it, and found it again today :)
[19:17:22] ha, oh?
[19:18:48] Composed row keys (first part of the Primary key) is not managed
[19:19:07] I want to ensure that with a proper test, but the code seems not to allow it
[19:20:11] ottomata: --^
[19:20:37] ok, joal, i don't know much about cassandra, but if you say it won't do that's cool, we can use your thing
[19:20:51] want to make that sure :)
[20:15:35] ottomata: https://github.com/Parsely/pykafka/blob/master/pykafka/producer.py They made an async producer
[20:15:42] it was merged 3 days back
[20:24:01] Analytics-Tech-community-metrics, Research consulting, Research-and-Data: Data for audit report - https://phabricator.wikimedia.org/T110067#1569050 (Qgil) About software developers: https://www.openhub.net/orgs/wikimedia is not showing lines of code anymore, and I don't know how to get this informat...
[20:24:49] hmm, madhuvishy interesting. help my memory! what were we (cough, I) doing with python-kafak?
[20:24:50] was that sync?
[20:24:56] one message at at ime?
[20:24:56] time?
[20:25:11] ottomata: No, it was either async or keyed
[20:25:27] aye, so it was always async then, ja?
[20:25:29] async by default, 1 message at a time though
[20:25:33] yup
[20:25:34] hm.
[20:26:12] if we bump pykafka version, we'd have the choice to pick between sync and async
[20:26:30] interesting. well, hm. i guess, since this is so new, let's see what we can do with what we have?
[20:26:40] and maybe think about upgrading as not part of stag?
[20:26:44] ottomata: yeah alright
[20:27:03] anyway, i'm not sure if they're caching the topic and producer
[20:27:07] we'll at least try what we have on analytics1010 as our test-prod setup, and see what we can do
[20:27:33] it seems like you can send messages to the producer batched though
[20:28:24] with current version?
[20:28:25] ottomata: https://gerrit.wikimedia.org/r/#/c/232408/1/server/eventlogging/handlers.py I'm talking about this comment
[20:28:56] hmmm the code's changed on master now, will look at old version
[20:30:29] ottomata: yes, produce takes an iterable of message, key tuples
[20:32:39] yeah, madhuvishy, it looks like topic.get_producer does not cache anything, so each time it would construct a new Producer object?
[20:32:39] hm.
[20:33:08] yup, looks like it. I can ask on their mailing list
[20:34:05] ottomata: If we used it like, create one object - send it a stream of messages - it would be only once. Our current design is different though
[20:35:00] ja since we possibly will have many topics in the same stream
[20:35:01] hm
[20:35:07] ja you should ask them about that
[20:35:20] madhuvishy: solution might be to cache it ourselves then :/
[20:35:34] producers = {}
[20:36:03] keep a map of producers per topic
[20:36:08] and look up before making one
[20:36:48] i think per topic should be good enough
[20:36:58] but maybe need kwargs in key hash somehow?
[20:36:59] dunno
[20:37:25] hmmm, maybe, yeah, especially since some eventlogging processes can use multiple output URIs
[20:37:33] which means they could specify different kafka producer params
[20:37:41] ottomata: Yupp
[20:38:41] that's true
[20:39:03] madhuvishy: maybe? http://stackoverflow.com/a/10220908/555565
[20:39:41] dunno
[20:40:26] hmmm
[20:40:43] ottomata: https://groups.google.com/forum/#!topic/pykafka-user/aO8OivKpm2s I was reading this
[20:41:05] sync + making instances everytime could be really costly yes
[20:41:47] hm! intersting.
[20:43:38] hm, madhuvishy even with pykafka's new async version, i think this would be a problem.
[20:43:47] ottomata: yes
[20:43:48] the problem is that their producers are tied directly to a topic
[20:44:02] whereas python-kafka's are not, you give the topic to the produce message
[20:44:04] method*
[20:44:32] ottomata: yeah... pykafka's way allows batching the messages - but that's not useful to us
[20:48:15] hm, madhuvishy, ja i guess kafka-python here is a litlte more robust on the producer side, eh?
[20:48:16] hm.
[20:48:30] ottomata: it does look like it
[20:48:53] ottomata: i can try implementing a cache like thing
[20:50:01] madhuvishy: i dunno, it sounds like it isn't worth it. the reasons for changing to pykafka proudcer were that hopefully it would be better, including consistency in libs, but if kafka-python is better, i don't think we shoudl change, especially since we already have it written.
[20:50:32] ottomata: last reason for which I won't try anymore with hdfs2cass --> no way to provide cassandra with username and password !
[20:50:38] ottomata: yeah makes sense. We can move if at some point pykafka's implementation suits us better
[20:50:44] haha
[20:50:49] Done for tonight, and will continue with my hand-written stuff :)
[20:50:52] aye, indeed madhuvishy, thanks for looking into that
[20:50:55] ok1
[20:51:01] joal, also thanks for looking into that!
[20:51:05] :)
[20:51:08] no problem, was interesting
[20:51:24] ottomata: Might reuse their codebase and my config ;)
[20:51:48] Later ... when time will be so freely availa
[20:52:01] Have a good night a-team !
[20:52:05] See you tomorrow
[20:52:09] good night joal!
[20:54:20] Good night!
[20:55:08] nighters!
[21:01:13] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Move Eventlogging Kafka writer to use pykafka's Producer instead of python-kafka {stag} [8 pts] - https://phabricator.wikimedia.org/T109244#1569134 (madhuvishy) Looked into this a bit more with @Ottomata, and it looks like atm, python-kafka's...
[21:21:08] perhaps a silly question...but we are intersting in integrating information from the upcoming page views api into the search suggestions cirrussearch gives. Is there anything we can use prior to the prod service being up and running for initial testing/integration work/etc ?
[21:24:05] ebernhardson: it's still in the works, but we are gonna try and set up a labs instance for our testing purposes soon
[21:24:48] i'll let you know when we have something working
[21:26:21] madhuvishy: excellent, thanks
[21:26:31] ebernhardson: no problem :)
[22:14:00] good night a-team! see ya tomorrow :]
[22:16:41] good night mforns!
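
Note on the producer-caching idea from 20:35-20:37 above ("keep a map of producers per topic", "maybe need kwargs in key hash somehow"): a minimal sketch of what that could look like, not what was actually implemented. ProducerCache, make_producer, the topic name and the batch_size parameter are all hypothetical names made up for illustration; with pykafka the factory might wrap something like topic.get_producer(**kwargs), but that wiring is an assumption and is left out so the sketch runs without a Kafka cluster.

```python
class ProducerCache:
    """Keep one producer per (topic, kwargs) combination."""

    def __init__(self, make_producer):
        # make_producer(topic, **kwargs) is a hypothetical factory supplied
        # by the caller; it is only invoked on a cache miss.
        self._make_producer = make_producer
        self._producers = {}

    @staticmethod
    def _key(topic, kwargs):
        # Sort the kwargs so the key does not depend on argument order.
        # Assumes the topic and every kwarg value are hashable.
        return (topic, tuple(sorted(kwargs.items())))

    def get(self, topic, **kwargs):
        key = self._key(topic, kwargs)
        if key not in self._producers:
            self._producers[key] = self._make_producer(topic, **kwargs)
        return self._producers[key]


# Stand-in factory that records each construction instead of talking to Kafka.
created = []


def fake_producer(topic, **kwargs):
    created.append((topic, kwargs))
    return object()


cache = ProducerCache(fake_producer)
first = cache.get("eventlogging_Edit", batch_size=100)   # constructs a producer
second = cache.get("eventlogging_Edit", batch_size=100)  # reuses the same one
third = cache.get("eventlogging_Edit", batch_size=500)   # different params, new producer
```

Putting the kwargs into the key covers the case raised at 20:37, where one eventlogging process uses multiple output URIs with different producer parameters for the same topic.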