[00:16:24] 06Analytics-Kanban, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-11-29_(1.29.0-wmf.4): Create Wikivoyage Finnish - https://phabricator.wikimedia.org/T151570#2821426 (10Krinkle) >>! In T151570#2836702, @Krenair wrote: > Stuff to do: > * Labs replicas - @chasemp or @andrew need to run som... [00:17:45] 06Analytics-Kanban, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-11-29_(1.29.0-wmf.4): Create Wikivoyage Finnish - https://phabricator.wikimedia.org/T151570#2836850 (10Krenair) yeah, do that with maintain-meta_p [00:31:52] 10Analytics, 10EventBus, 10Wikimedia-Stream, 06Services (watching): RecentChanges in Kafka - https://phabricator.wikimedia.org/T152030#2836021 (10Krinkle) >>! In T152030#2836466, @Krenair wrote: > It'll probably involve creating a class that implements the RCFeedEngine interface from MW core, registering t... [00:46:11] 10Analytics, 06Discovery, 06Discovery-Analysis, 03Interactive-Sprint: Add Maps tile usage counts as a Data Cube in Pivot - https://phabricator.wikimedia.org/T151832#2836951 (10mpopov) >>! In T151832#2831300, @Nuria wrote: > How about we split the work here in two chunks: Sounds good! > 1. let's get the d... [00:55:01] (03PS1) 10MaxSem: Don't record empty metrics [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/324635 [00:59:27] (03PS2) 10MaxSem: Don't record empty metrics [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/324635 (https://phabricator.wikimedia.org/T147034) [01:00:38] (03CR) 10jenkins-bot: [V: 04-1] Don't record empty metrics [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/324635 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [01:00:50] <3 jerkins [01:15:12] 06Analytics-Kanban, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-11-29_(1.29.0-wmf.4): Create Wikivoyage Finnish - https://phabricator.wikimedia.org/T151570#2836995 (10hoo) >>! 
In T151570#2836702, @Krenair wrote: > * Wikidata support - can you check this @hoo? I got a lot of 'done.' res... [01:46:10] (03PS7) 10MaxSem: reportupdater queries for EventLogging [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034) [01:58:53] (03PS8) 10MaxSem: reportupdater queries for EventLogging [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034) [01:59:13] fffu... [02:00:28] (03PS9) 10MaxSem: reportupdater queries for EventLogging [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034) [04:14:03] 10Analytics, 10Dumps-Generation, 05Security: Omit private data from being generated during dump runs - https://phabricator.wikimedia.org/T152021#2835669 (10Bawolff) Arent these important as a sort of backup of last resort? [04:52:52] 06Analytics-Kanban, 06Discovery, 06Operations, 06Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2837271 (10Ottomata) YESSHHHHH I think I did it. @mpopov try: ``` CXX=g++-4.8 CXX1X=g++-4.8... [05:31:01] 10Analytics, 06Discovery, 06Discovery-Analysis, 03Interactive-Sprint: Add Maps tile usage counts as a Data Cube in Pivot - https://phabricator.wikimedia.org/T151832#2837286 (10Nuria) How about you let me know when you have some data and I work into loading it into druid? 
[07:51:25] (03PS1) 10Elukey: Add fi.wikivoyage to the pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/324671 [08:02:24] !log added fi.wikivoyage to the pageview whitelist manually [08:02:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:02:54] and pivot deployed on thorium [08:03:00] now let's see if we can make it work [08:24:50] puppet works except the bit for stats.w.o [08:24:54] that needs to be rsynced [08:46:44] 06Analytics-Kanban, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-11-29_(1.29.0-wmf.4): Create Wikivoyage Finnish - https://phabricator.wikimedia.org/T151570#2837609 (10Nikerabbit) > * cxserver - I think this is blocked on imports or something? @Nikerabbit? cxserver only supports Wikipe... [08:55:06] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:55:09] \o/ [09:36:31] (03PS3) 10MarcoAurelio: Configuration for fi.wikivoyage.org [analytics/refinery] - 10https://gerrit.wikimedia.org/r/323699 (https://phabricator.wikimedia.org/T151570) [09:40:23] (03CR) 10MarcoAurelio: "Re-ping. Please V+2 and submit. Wiki is now live. Regards." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/323699 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [09:42:08] (03CR) 10Elukey: [C: 04-1] "not sure if I am correct but the insertion date shouldn't be 2015 right?" 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/323699 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [11:13:46] joal: after reading T144431 I finally got our partition key [11:13:46] T144431: RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431 [11:13:53] and what PARTITION means [11:13:54] finally [11:14:52] I haven't found a simpler description of the cassandra data model yet [11:14:59] (I mean, something for newbies like me) [11:16:46] and now I get that using ("_domain", project, article, granularity) as partition key we have only a limited and slowly growing set of new "records" becuase of timestamp and daily loading [11:16:57] that is what you were referring to yesterday [11:33:37] ok after this enlightment, lunch :) [12:09:22] 10Analytics, 10Dumps-Generation, 13Patch-For-Review, 05Security: Omit private data from being generated during dump runs - https://phabricator.wikimedia.org/T152021#2837936 (10ArielGlenn) >>! In T152021#2837221, @Bawolff wrote: > Arent these important as a sort of backup of last resort? I don't think so.... [12:10:04] 10Analytics, 10Dumps-Generation, 13Patch-For-Review, 05Security: Omit private data from being generated during dump runs - https://phabricator.wikimedia.org/T152021#2837938 (10ArielGlenn) The gerrit changeset could be merged at any time; it's been tested, and won't actually change anything without a corres... [12:40:51] hi team :] [12:46:03] Hi mforns [12:46:14] hello! [13:31:17] mforns: review or not review ? [13:31:25] that is the question joal [13:31:42] batcave? [13:31:55] sure mforns [13:31:58] OMW [13:43:10] 10Analytics, 10Dumps-Generation, 13Patch-For-Review, 05Security: Omit private data from being generated during dump runs - https://phabricator.wikimedia.org/T152021#2838148 (10jcrespo) > I don't think so. We have db backups for that. @jcrespo can say more about those. 
I trust only the dumps as a last reso... [13:57:29] 06Analytics-Kanban, 13Patch-For-Review: Replace stat1001 - https://phabricator.wikimedia.org/T149438#2838179 (10elukey) thorium has been reimaged by Rob, re-deployed pivot (this time it works) and rsynced stats.w.o from stat1001. [14:01:16] omw!~ [14:03:59] ehm ellery is saturating stat1002 home space [14:04:04] 408G /home/ellery [14:06:38] mforns,milimetric - any chance you could free some space on your home dirs on stat1002? [14:06:48] elukey, sure! [14:06:49] just to get some space [14:06:52] thanks :) [14:06:58] * elukey wants quotas [14:08:02] elukey, freed 26GB [14:09:17] nice :) [14:41:05] 10Analytics, 10Dumps-Generation: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2838259 (10ArielGlenn) So, where can we go from here? Should these be filtered out or left in? [14:45:02] ottomata: o/ [14:45:11] I'd need to restart kafka on kafka100[123] [14:45:17] for openjdk upgrades [14:45:31] but on kafka1001 there is a java process under your user [14:45:50] oh! [14:45:51] looking [14:45:57] :) [14:46:38] uhhh, i don't have any screens open, looks like some consumer process i was running when checking stuff while adding brokers [14:46:44] why are they still running?? [14:46:45] weird [14:46:46] killing [14:46:51] thanks! [14:47:08] anything against me restarting kafka one by one (with replica election each time) ? [14:47:17] nope proceed! [14:47:28] elukey: fyi [14:47:41] i have not restarted brokers with that min.insync.replicas thing [14:47:52] i had meant to, but then was waiting cause dzahn said he wanted to restart the next day [14:47:57] but then he didn't have to after all [14:48:07] so, should be no effect, since producers have acks=1 [14:48:10] but, just fyi [14:48:17] sure [14:48:44] one thing that I noticed is that some partitions have only 2 ISRs [14:48:52] !!! [14:48:52] rather than 3 [14:48:54] i missed some!
[14:48:59] in eqiad? [14:49:13] OH [14:49:16] tep [14:49:18] *yep [14:49:19] yes, i kinda remember this [14:49:29] ok, those topics i had planned deleting [14:49:35] since they are deprecated [14:49:40] so i didn't bother adding replicas to them [14:49:53] i probably should have, like I did in codfw. man, adding replicas is such a painful manual process! [14:50:02] i did it better the second time around in codfw [14:51:31] elukey: will follow up with this: https://phabricator.wikimedia.org/T149594 [14:51:52] sure! just wanted to verify that it was ok to proceed [14:52:03] yup, totally cool [14:56:17] ottomata: something weird got logged [14:56:18] [2016-12-01 14:53:17,990] ERROR Processor got uncaught exception. (kafka.network.Processor) [14:56:21] java.lang.ArrayIndexOutOfBoundsException: 18 [15:00:35] 10Analytics, 10EventBus: Delete stale topics from main Kafka clusters - https://phabricator.wikimedia.org/T149594#2838319 (10Ottomata) Topics to delete: ``` change-prop.retry.mediawiki.page_delete change-prop.retry.mediawiki.page_move change-prop.retry.mediawiki.page_restore change-prop.retry.mediawiki.revisi... [15:01:57] elukey: thanks for thorium :) [15:03:49] :) [15:03:55] ottomata: I found https://github.com/edenhill/librdkafka/issues/606 [15:04:48] I think that the IndexOutOfBound is only temporary [15:04:54] but wanted to double check [15:05:16] mmmm I am seeing it logged three times [15:05:21] Ohhh! [15:05:41] yeah i think we've seen that in logs before, right? [15:05:48] I don't recall :( [15:05:53] usually it goes away, but it looks really scary [15:07:07] yeah, def seen that before [15:07:17] ok good to know where it comes from [15:07:29] as long as it doesn't keep happening [15:08:05] mobrovac: do you mind to check changeprop's logs to see what got logged after the restart? 
[15:08:17] what I am seeing is basically the same as https://github.com/edenhill/librdkafka/issues/606 [15:09:39] i saw errcode -192 [15:09:42] message time out [15:10:46] -192: Local: Message timed out: Local timeout inside librdkafka, typically due to broker not being available or too slow. [15:11:13] yup [15:12:00] ok so I launched the preferred replica election [15:12:03] the ISRs are good [15:12:18] I'll wait a bit more before proceeding with the othes [15:12:31] java.lang.ArrayIndexOutOfBoundsException: 18 in the logs is scary :D [15:15:56] ok proceedign with kafka10012 [15:16:00] 1002 [15:19:24] ok this time nothing weird in the logs [15:19:28] let's see how changeprop is doing [15:19:45] elukey: hm, yeah, and i'm seeing the timeouterrors in failed events log too for eventbus :( [15:19:51] so it affects kafka-python too [15:19:57] i will replay these events when you are done [15:20:01] but not ideal :( [15:20:19] arrrghhh a lot of errors on kafka1002 [15:20:42] WARN [ReplicaFetcherThread-0-1001], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@2adf74fd. 
Possible cause: java.io.IOException: Connection to 1001 was disconnected before the response was read (kafka.server.ReplicaFetcherThread) [15:20:57] this one was in the middle [15:20:59] probably nothing [15:21:04] but a lot of index out of bound [15:21:05] again [15:21:18] ok, yeah, i guess index oob is expected now :/ [15:21:30] replica fetch sounds temporary [15:21:35] just warning [15:21:46] yeah [15:21:59] it seems something that the client sends and the broker doesn't like [15:22:16] but nothing really outstanding in the logs except "out of bound" [15:22:30] but the fact that it is at org.apache.kafka.common.protocol.ApiKeys.forId(ApiKeys.java:68) [15:22:37] points to https://github.com/edenhill/librdkafka/issues/606 [15:22:41] ja [15:22:52] and to v [15:22:53] https://issues.apache.org/jira/browse/KAFKA-3547 [15:22:57] well, yeah to that one [15:23:12] because kafka-python is having timeout troubles (probably due to that), and it is not using librdkafka [15:23:42] fixed in 0.10 :P [15:23:46] :) [15:26:10] sudo lsof | grep DEL is now part of my most useful commands ever [15:26:18] so powerful [15:27:31] ok preferred replica ran on 1002 [15:27:37] and mirror maker also restarted [15:29:21] proceeding with 1003 [15:31:26] same problem on 1003 :( [15:34:40] mobrovac: all done [15:35:42] mobrovac: transclusions are down again, weird [15:38:01] ok [15:38:04] i will restart cp now [15:40:54] elukey: what is DEL as FD? [15:42:20] ottomata: it indicates for example what old openjdk libs ares still used even if the package has been upgraded [15:42:37] they are kept around due to the processes that haven't been restarte [15:42:40] *restarted [15:42:49] since they hold a FD [15:43:03] so I can see which processes should be restarted [15:43:36] ah, so the inode for the file exists, but it has been deleted from the tree? 
[15:43:58] or, sorry vice versa [15:44:00] sorta [15:44:40] yeah the inode is still there because of the last FD kept opened [15:44:52] aye [15:46:36] ottomata: https://grafana.wikimedia.org/dashboard/db/eventbus?from=now-6h&to=now&panelId=10&fullscreen - change prop behavior is weird [15:46:56] and I am wondering what "unsupported request" mean in the context of https://issues.apache.org/jira/browse/KAFKA-3547 [15:47:29] elukey: are those timeouts still happening? [15:48:07] nono only after the restart [15:48:13] but I wanted to get what is the issue [15:48:16] reading https://github.com/apache/kafka/commit/bb643f83a95e9ebb3a9f8331fe9998c8722c07d4#diff-d0332a0ff31df50afce3809d90505b25R80 [15:49:57] hm, for librdkafka version we are using, i'm not sure why we'd have that. i think for 0.9.1 api.version.request=false and broker.version.fallback=0.9.0 by default [15:50:12] i would have guessed that unknown requests would be due to the client trying to discover the broker version [15:50:22] before in broker versions where there wasn't an API to ask [15:50:27] (like the one we run) [15:52:14] elukey: btw, on kafka-python connect, i see: [15:52:15] 2016-12-01 15:51:54,570 [7515] (MainThread) Broker is not v(0, 10) -- it did not recognize ApiVersionRequest_v0 [15:52:15] 2016-12-01 15:51:54,675 [7515] (MainThread) Broker version identifed as 0.9 [15:52:15] 2016-12-01 15:51:54,675 [7515] (MainThread) Set configuration api_version=(0, 9) to skip auto check_version requests on startup [15:53:28] ahhhh nice! [15:54:01] where are the logs? 
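The `sudo lsof | grep DEL` trick and the unlinked-inode behaviour explained just above can be demonstrated in a few lines of Python; this is a generic sketch of the mechanism (not tied to the JVM/openjdk case in the chat):

```python
import os
import tempfile

# Sketch of why upgraded libraries linger until processes restart:
# unlinking a file removes its path from the tree, but the inode stays
# alive as long as any process holds an open descriptor to it. lsof
# reports such mapped-but-deleted files as DEL entries.
f = tempfile.NamedTemporaryFile(delete=False)
f.write(b"old shared library contents")
f.flush()

os.unlink(f.name)                    # path gone from the filesystem tree
path_gone = not os.path.exists(f.name)
size = os.fstat(f.fileno()).st_size  # inode still readable via the FD

f.close()                            # last FD closed: inode finally freed
```

Only once the holding process exits (or closes the descriptor) is the old inode released, which is exactly what the restart checks above rely on.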
[15:55:03] elukey: the eventbus logs are /var/log/eventlogging/eventlogging-service-eventbus.log [15:55:23] but, i got that while replaying the timeout errors that were captured in /srv/log/eventlogging/eventlogging-service-eventbus.failed_events.log [15:55:30] https://wikitech.wikimedia.org/wiki/EventBus/Administration#Replaying_failed_events [15:55:58] ahahhaah I like that you put the link to anticipate my question [15:55:59] :D [15:56:06] haha [15:56:09] yeah, this situation sucks though [15:56:20] this is a manual step we shouldn't have to take. [15:56:30] 1. we should not fail the events because of this [15:56:34] they should produce to kafka properly [15:56:39] so Eventbus tries and then pushes on a specific topic if fails? [15:57:06] ah no sorry a file [15:57:10] makes sense [15:57:12] yeah, its configurable [15:57:16] we *could* make it push to a kafka topic [15:57:25] i had considered that, could even push to a topic in the analytics cluster [15:58:34] exactly I was about to say [15:58:48] so if EB goes completely down we have a backup [15:59:02] that would centralize them, instead of having them on 3 hosts [15:59:12] and, we *could* make a process that consumes that topic and auto retries [15:59:14] but ungh [15:59:31] i haven't been able to reproduce this in labs either [15:59:50] i wonder if manually specifying broker version would help with restarts [16:00:12] I'll be a little late to standup nuria [16:00:17] still in meeting with ash [16:01:43] ottomata: I was about to ask [16:01:49] there must be a setting [16:24:49] (03PS2) 10Elukey: Add fi.wikivoyage to the pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/324671 [16:25:57] (03CR) 10Joal: [C: 032 V: 032] "Yay, let's merge" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/324671 (owner: 10Elukey) [16:26:53] (03Abandoned) 10Joal: Configuration for fi.wikivoyage.org [analytics/refinery] - 10https://gerrit.wikimedia.org/r/323699 
(https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [16:30:08] 06Analytics-Kanban, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-11-29_(1.29.0-wmf.4): Create Wikivoyage Finnish - https://phabricator.wikimedia.org/T151570#2838572 (10chasemp) > * Labs replicas - @chasemp or @andrew need to run something like `maintain-views --databases fiwikivoyage -... [16:36:59] 10Analytics, 10Dumps-Generation: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2838609 (10Nuria) @Bawolff @ArielGlenn Valid requests and pageviews are different things. While these could be valid requests in any case should... [16:39:10] ergh, elukey thanks. i'm so upset that we have to deal with this. [16:39:21] If the cluster is up, the producer should not FAIL [16:39:23] GRRRRRRR [16:47:48] 10Analytics: Create 1-off tsv files that dashiki would source with standard metrics from datalake - https://phabricator.wikimedia.org/T152034#2838654 (10Nuria) [16:57:17] 10Analytics: Update Refinery's restart documentation. - https://phabricator.wikimedia.org/T150661#2838672 (10Nuria) [16:57:43] 10Analytics: Update Refinery's restart documentation. - https://phabricator.wikimedia.org/T150661#2792629 (10Nuria) 05Open>03Resolved [16:58:10] 10Analytics: Create SLA alarams for cassandra jobs and others - https://phabricator.wikimedia.org/T152109#2838676 (10Nuria) [17:01:13] 10Analytics: Create 1-off tsv files that dashiki would source with standard metrics from datalake - https://phabricator.wikimedia.org/T152034#2838692 (10Nuria) - Write queries to run metrics for all wikis - create massive tsv file split per wiki on 1st column - split file per wiki per metric - manual rsync of pe... 
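The broker-version probing seen in the kafka-python log lines above ("it did not recognize ApiVersionRequest_v0") can be avoided by pinning `api_version`, the setting being wondered about earlier. A minimal sketch of such a client configuration; the broker hostname is a made-up example and the exact values are assumptions:

```python
# kafka-python settings sketch: pinning api_version skips the
# ApiVersionRequest probe on connect, which pre-0.10 brokers do not
# understand (and surface as ArrayIndexOutOfBoundsException in their
# logs). Hostname below is illustrative only.
producer_config = {
    "bootstrap_servers": ["kafka1001.eqiad.wmnet:9092"],
    "api_version": (0, 9),  # match the broker; skip version discovery
    "acks": 1,              # matches the acks=1 behaviour noted above
}

# A real producer would then be created as:
# from kafka import KafkaProducer
# producer = KafkaProducer(**producer_config)
```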
[17:01:25] 06Analytics-Kanban: Create 1-off tsv files that dashiki would source with standard metrics from datalake - https://phabricator.wikimedia.org/T152034#2838693 (10Nuria) [17:01:43] 06Analytics-Kanban: Create 1-off tsv files that dashiki would source with standard metrics from datalake - https://phabricator.wikimedia.org/T152034#2836112 (10Nuria) Let's update points on completition. [17:04:06] 10Analytics: Valgrind tutorial for periodical mem usage reviews - https://phabricator.wikimedia.org/T147438#2692888 (10Nuria) We can: - provide a wiki page explaining how to use tool and spot issues - include valgrind binary in docker integration testing environment and run as part of tests [17:06:23] 06Analytics-Kanban: Valgrind tutorial for periodical mem usage reviews - https://phabricator.wikimedia.org/T147438#2838706 (10Nuria) [17:06:58] 10Analytics, 10Analytics-Cluster: Update wikitech Kafka documentation - https://phabricator.wikimedia.org/T150277#2838709 (10Ottomata) [17:08:58] elukey: when you have a min: https://gerrit.wikimedia.org/r/#/c/324745/ [17:09:13] 10Analytics, 10Analytics-Cluster: Update wikitech Kafka documentation - https://phabricator.wikimedia.org/T150277#2780405 (10Nuria) Docs need to be updated and move out of analytics space. We agree kafka is tier-1 and our future plan is to have just one kafka cluster, not one analytics-specific (Q4). We should... [17:12:20] 10Analytics: Tech Talk: Pivot - https://phabricator.wikimedia.org/T148776#2838727 (10Nuria) [17:13:32] 06Analytics-Kanban: Reportupdater might be creating TSVs it can't update later - https://phabricator.wikimedia.org/T150954#2802485 (10Nuria) ping @MaxSem can we repro this? otherwise we will close it. [17:15:30] 06Analytics-Kanban: Reportupdater might be creating TSVs it can't update later - https://phabricator.wikimedia.org/T150954#2838740 (10MaxSem) 05Open>03Invalid Nope, I couldn't reproduce the exact sequence of events. 
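The splitting step in the T152034 plan above ("create massive tsv file split per wiki on 1st column - split file per wiki per metric") could be sketched roughly like this; the column layout and sample rows are assumptions for illustration:

```python
import csv
import io
from collections import defaultdict

# Sketch of splitting one big TSV (wiki in the first column, metric in
# the second) into per-wiki, per-metric buckets. Rows are invented.
big_tsv = (
    "enwiki\tnewly_registered\t2016-11-01\t1200\n"
    "fiwikivoyage\tnewly_registered\t2016-11-01\t3\n"
)

buckets = defaultdict(list)
for row in csv.reader(io.StringIO(big_tsv), delimiter="\t"):
    wiki, metric = row[0], row[1]
    buckets[(wiki, metric)].append(row[2:])

# Each bucket would then be written out (e.g. <wiki>/<metric>.tsv) and
# rsynced to wherever dashiki sources its files, per the task notes.
```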
[17:16:11] 06Analytics-Kanban: Create 1-off tsv files that dashiki would source with standard metrics from datalake - https://phabricator.wikimedia.org/T152034#2836112 (10Nuria) a:03Milimetric [17:22:11] 10Analytics, 10Pageviews-API, 07Easy, 03Google-Code-In-2016: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2838759 (10Milimetric) a:05Nuria>03Milimetric [17:28:37] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream: Improve tests for KafkaSSE - https://phabricator.wikimedia.org/T150436#2838773 (10Nuria) [17:38:20] ottomata: in an effort to deprecate the vk ganglia plugin for T152093 (and also ganglia!) I'd like to add logster for statsv and eventlogging data [17:38:20] T152093: Ganglia varnishkafka python module crashing repeatedly - https://phabricator.wikimedia.org/T152093 [17:38:28] anything against it? [17:38:50] so we'll be able to have https://grafana.wikimedia.org/dashboard/db/varnishkafka also for statsv and el [17:38:53] (vk instances) [17:39:38] oh for the instances [17:39:43] was confused for a sec [17:39:44] yeah [17:39:47] +1 elukey that makes sense [17:39:51] i think they'd need a special prefix maybe? [17:39:54] not sure [17:39:58] but yah, +1 to idea [17:43:22] \o/ [17:49:02] elukey: for your q on that review: https://gerrit.wikimedia.org/r/#/c/324751/ [17:49:07] deployed that to analytics eventlogging already [17:49:12] ok, lunchin, thanks for review, bbl [17:52:24] elukey: one question about erik's cassandra doc [17:52:52] nuria: sure [17:53:25] (if I can, Joseph or Eric are surely a better poc than me :) [17:53:38] elukey: do we have also the same issue with cassandra partitions growing unbounded? seems like our growth is more controlled .. 
cc joal [17:53:50] ah yes exactly [17:54:54] for example, for per article we have ("_domain", project, article, granularity) as partition key, and then timestamp [17:55:25] so for each ("_domain", project, article, granularity) combination we add a "record" per day in the partition [17:55:38] that is very controlled and ok for our use case [17:56:14] Eric was saying that we should not see any problem for years, we'll probably change Cassandra before then :D [17:56:40] elukey: or if we increase granularity [17:57:01] elukey: ok, need to read that like 100 more times to understand issues with GC that he is talking about [17:57:16] elukey: but the bottom line is that we can more or less anticipate our growth [17:57:25] elukey: if we do not increase granularity [17:57:35] yes exact,y [17:57:38] *exactly [17:58:12] also we don't delete [17:58:28] so no issue with tombstones etc.. [18:23:07] nuria: correct, the model we've chosen for our keys (partition and clustering keys) makes this number fairly stable (given the number of distinct articles grows slowly) [18:26:42] * elukey going afk! [18:26:48] byeeee o/ [18:26:54] bye elukey :) [18:29:24] joal: if you're around and have a link to that wikistats analysis work you did, that'd be great [18:29:37] like that long list of all the data there (etherpad I think?) [18:29:41] milimetric: give me a minute, looking for it [18:29:52] np - don't waste too much time i can look too [18:33:12] milimetric: looking, but not finding so far :( [18:33:34] it's ok, I'll find it someday :) [18:34:28] milimetric: https://etherpad.wikimedia.org/p/wikistats-edits [18:34:31] Found it ! [18:34:56] milimetric: that was what you were after, right? [18:35:04] nice! [18:35:09] cool :) [18:40:04] milimetric, could you review https://gerrit.wikimedia.org/r/#/c/324635/ ? [18:46:29] MaxSem: will do!
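The growth property discussed above — a partition key of ("_domain", project, article, granularity) with timestamp as the clustering column, so each daily load appends exactly one row per partition — can be shown with a toy model (data structures and values are invented for illustration, not Cassandra itself):

```python
from collections import defaultdict

# Toy model of the key layout from the chat: the dict key plays the
# role of the partition key, the list plays the clustering rows.
partitions = defaultdict(list)

def load_day(day, articles):
    """Simulate one daily load: one new clustering row per partition."""
    for domain, project, article in articles:
        key = (domain, project, article, "daily")
        partitions[key].append(day)

articles = [("fi.wikivoyage.org", "fi.wikivoyage", "Helsinki")]
load_day("2016-12-01", articles)
load_day("2016-12-02", articles)
# Partition count stays bounded by the set of distinct articles;
# each partition grows by only one row per day.
```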
but in meetng [18:46:37] thx [19:22:17] (03CR) 10Yurik: [C: 032] reportupdater queries for EventLogging [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [19:37:57] (03CR) 10MaxSem: [V: 032] reportupdater queries for EventLogging [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [19:45:57] !log restarting eventlogging analytics processes to pick up api_version kafka arg [19:45:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:46:35] ottomata: milimetric: hi! how's it going? If anyone has a sec to take a peek at https://phabricator.wikimedia.org/T152122, that'd be fantastic!! :) [19:46:50] tl;dr: strange dip in FR banner impressions this morning for 1 hr [19:47:12] AndyRussG: waht do you want us to do? [19:47:23] Mmmm here's a specific q: any ideas on how to query pageviews and group by minute? [19:47:26] AndyRussG: jeff green pinged me [19:47:31] Ah cool! [19:47:56] we looked at some charts, nothing crazy in traffic patterns, but we did notice a dip in eventlogging_CentralNoticeBannerHistory topic [19:48:04] Yeah I saw that [19:48:11] and also, jeff noticed a dip in the banner history logs from kafkatee [19:48:26] which, would indicate to me, that the dip is caused by banners not showing [19:48:34] sure could be, yep [19:48:39] so nothing comparable in pageviews? [19:48:56] AndyRussG: pageviews by minute will not help much [19:49:04] why not? [19:49:06] AndyRussG: it will oscillate a lot [19:49:40] yeah, AndyRussG we looked at webrequest_text traffic, (not exactly pageviews), but saw no noticeable change [19:49:44] Maybe by 10 minute buckets? 
The timeframe we're looking at is 8:07 - 9:10 UTC [19:49:48] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567255&to=1480620351298&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=webrequest_text [19:50:18] also looking at https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes looks prett normal [19:50:37] so overall, traffic looks normal, just banner requests dropped i think [19:51:13] We should check Special:BannerLoader [19:51:19] the fact that both eventlogging banner stuff and the parsed webrequests from kafkatee (which feed in into your db) dropped at the same time, while other things remained the same, seems to indicate that banners didn't show as much [19:51:46] cause those are two different code paths that would emit that data [19:51:56] yep [19:52:10] parsed webrequest logs (from the original http request), and an explicit eventlogging http log request from the browser after the js there has loaded [19:56:06] ottomata: yes... It'd be good to try to pull the webrequest logs filtering a bit more. I can do that, but I'd really like to group on time periods smaller than 1 hr [19:56:15] Is there an easy way to do that in Hive? [19:56:22] 10Analytics-Dashiki, 06Analytics-Kanban: Improve initial load performance for dashiki dashboards - https://phabricator.wikimedia.org/T142395#2839418 (10Nuria) We are about to drop a json extension that would change how do we query for the configuration: Example: https://commons.wikimedia.beta.wmflabs.org/w/a... [19:56:42] BTW it's not just banners not being shown. 
It's CN either not running, or not selecting users for a campaign [19:56:50] I wonder if there are any load.php or RL issues [19:57:03] AndyRussG: sure [19:57:19] you can group by whatever you like, your query will just load an hours worth of data at a time [19:57:22] but for just an hour its fine [19:57:33] select the hour like you usually do with the partition (where year=, month= ...) [19:57:39] and then you can use the dt field to group [19:57:43] there shoudl be hive date functions to use i think... [19:58:24] AndyRussG: you can group pageviews in small intervals but what info do you expect to extract? [19:59:44] AndyRussG: Let's think about that for a sec [19:59:52] how do I put grouping by minute in the query? arrrrg I don't see timestamp? (what am I missing?) https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly [20:00:12] it's hourly data AndyRussG [20:00:12] nuria: the unexplained dip is over a period of about an hour [20:00:24] AndyRussG: webrequest table has data in disntict records [20:00:28] *distinct [20:00:53] AndyRussG: this is the table you want: [20:01:11] AndyRussG: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest [20:02:39] AndyRussG: but note that a significant dip would also be present on varnish http 200 codes for the hour which ottomata has already looked at [20:03:17] AndyRussG: you can't do it on pageviews [20:03:24] yeah, you gotta do webrequset [20:03:32] pageviews is an aggregate dataset [20:03:38] K [20:04:57] So we could count pageviews in general, and calls to Special:BannerLoader, and beacon/impression, all filtered for agent_type and country, grouped by some interval smaller than 1 hr [20:06:23] what is the granularity of dt in Webrequest? 
[20:06:47] mforns: I think I have pinpointed the thing: for pages with null page_id from archives, generating a hash based on pagetitle doesn't work -- still too much in same group [20:07:11] mforns: now using a random value per page for fake Ids, problematic point passed [20:07:15] milimetric: --^ [20:07:27] Will go to sleep for now, coming back to that tomorrow :) [20:07:28] AndyRussG: every record has a timestamp [20:07:30] bLater folks [20:07:34] joal [20:07:40] ok [20:07:44] mforns [20:07:45] let's talk tomorrow! [20:07:50] :] [20:07:55] We can talk now if you want :) [20:08:02] I'm still here ;) [20:08:09] no no, disappear :] [20:08:22] huhu :) [20:08:32] hehe, good night, cya tomorrow! [20:08:33] AndyRussG: webrequset is every webrequest to wikimedia sites [20:08:37] each one has a timestamp [20:08:38] dt [20:08:39] is the field [20:09:01] joal: nice, nite [20:09:04] there should be an is_pageview flag field too [20:09:08] that you could filter by [20:09:36] right, but he's asking what the granularity is, I think it's to the microsecond AndyRussG [20:09:39] calls to SpecialBanner loader might not be a pageview though. not sure [20:09:39] ottomata: ah great [20:09:57] Yeah it looks like at least 1/10 of a second [20:09:57] oh sorry AndyRussG misunderstood the q [20:10:01] np! 
:) [20:10:13] AndyRussG: I don't remember exactly it should be in the docs, let us know if you figure it out and we'll update [20:10:16] ja i think it millisecond, but they are ISO 8601 timestamps, not unix timestamps [20:10:19] so it shoudl be obvious [20:15:37] (03CR) 10Milimetric: [C: 032 V: 032] Don't record empty metrics [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/324635 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:22:11] 10Analytics, 10ChangeProp, 10Citoid, 10ContentTranslation-CXserver, and 10 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2839551 (10Yurik) [20:24:56] (03CR) 10jenkins-bot: [V: 04-1] Don't record empty metrics [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/324635 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:25:20] !log restarting eventlogging analytics processes again to pick up api_version change for consumers too [20:25:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:27:11] !log bouncing kafka broker on kafka1018 to test config changes to eventlogging analytics kafka clients [20:27:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:30:22] ottomata, milimetric tells us we should bug you with the changes to the reportupdater :) [20:30:40] shhh not right now, he's busy! 
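The per-minute grouping AndyRussG was after boils down to truncating webrequest's ISO 8601 `dt` field, as discussed above; here is a client-side sketch with invented timestamps (in Hive the equivalent would be grouping on a substring of `dt`, which is an assumption about the exact HQL):

```python
from collections import Counter

# Bucketing ISO 8601 timestamps by minute, as discussed for the
# 8:07-9:10 UTC dip: the first 16 characters of an ISO 8601 timestamp
# ("YYYY-MM-DDTHH:MM") identify the minute. Sample values are made up.
timestamps = [
    "2016-12-01T08:07:12.345",
    "2016-12-01T08:07:59.002",
    "2016-12-01T08:08:01.500",
]
per_minute = Counter(ts[:16] for ts in timestamps)
```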
[20:30:42] :) [20:30:53] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Ensure no dropped messages in eventlogging producers when stopping broker - https://phabricator.wikimedia.org/T142430#2839625 (10Ottomata) [20:30:58] LOL :) [20:31:20] then again, i think he always is, which is a good thing (tm) :) [20:31:48] yeah, I'm jk, I think his ping bandwidth is like 10 times mine [20:32:14] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Ensure no dropped messages in eventlogging producers when stopping broker - https://phabricator.wikimedia.org/T142430#2534697 (10Ottomata) I added `api_version` as a param to avoid possible dropped messages due to https://issues.apache.org/jir... [20:32:22] haha, hi yurik! [20:32:25] ping bandwidth is very different than the request processing time :) [20:32:26] its good, i just tried a thing! [20:32:29] it didn't work! [20:32:35] time to context switch for some merges! [20:32:44] excellente!!! [20:32:46] i'm right on time [20:32:51] MaxSem, ^ :) [20:33:03] this ja? https://gerrit.wikimedia.org/r/#/c/322969/ [20:33:11] its ready? the other stuff has been merged milimetric? [20:33:16] and deployed? [20:33:21] yep [20:34:59] ottomata: actually, one sec [20:35:07] gerrit is refusing to merge [20:35:07] k, i just rebased that [20:35:20] this last patch: https://gerrit.wikimedia.org/r/#/c/324635/2 [20:35:21] hmmm [20:36:43] (03PS3) 10Milimetric: Do not record empty metrics [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/324635 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:36:54] (03CR) 10Milimetric: [V: 032] Do not record empty metrics [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/324635 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:37:12] grr ... 
gerrit is bugging out on reportupdater, annoying [20:37:16] jenkins rather [20:37:46] ok, ottomata, you can merge, deployed now [20:38:48] ok [20:43:26] (03CR) 10Nuria: "We need to rebase this one to be able to merge." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/321676 (https://phabricator.wikimedia.org/T148980) (owner: 10Mforns) [20:44:35] ottomata, can you force run it by hand so we can see if everything is going through? [20:44:52] also, do we have the ability to do that? cc: OuKB [20:45:26] Role::Statistics::Cruncher/Reportupdater::Job[interactive]/Cron[reportupdater_discovery-stats-interactive]/ensure: created [20:45:37] hm, run it by hand ehh? [20:45:44] I can do that.... [20:46:52] :) [20:46:55] yurik: lgtm: /srv/reportupdater/output/metrics/interactive/events/all.tsv [20:46:56] on stat1003 [20:47:08] cc: OuKB [20:47:59] yeppi! [20:48:33] now if only we could get it from graphana/oid/graphen/... whatever it is called [20:53:26] 10Analytics, 10Dumps-Generation, 05Security: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2839698 (10Nuria) [20:53:36] yurik: https://graphite.wikimedia.org/render/?width=1613&height=845&target=daily.kartographer.events.open.mapframe.all&from=00%3A00_20161104&until=23%3A59_20161201 [20:53:39] you saw that right? [20:54:33] you can use that metric in grafana too, of course, make a dashboard with it or whatever, you know how to do that? [20:54:48] (I'm hoping he says yes because I'm pretty bad at it) [20:54:49] milimetric, awesome!!! daily.kartographer.events.open.mapframe.all [20:54:51] yep [20:54:57] i'm on it [20:55:00] or OuKB ? [20:55:01] cool, have fun [20:55:13] yurik: who's OuKB [20:55:18] milimetric, max [20:55:22] oh [20:55:30] he's hiding like that sometimes :) [20:55:43] but... now you revealed his secret... how can he hide? 
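[Editor's note: the Graphite render URL milimetric pastes above can be assembled programmatically. A hedged sketch; the host and metric path are from the log, while the function name and default dimensions are illustrative:]

```python
from urllib.parse import urlencode

def graphite_render_url(target, start, end, width=800, height=400):
    """Build a Graphite /render URL for a single metric target."""
    params = {
        "target": target,
        "from": start,
        "until": end,
        "width": width,
        "height": height,
    }
    return "https://graphite.wikimedia.org/render/?" + urlencode(params)

url = graphite_render_url(
    "daily.kartographer.events.open.mapframe.all",
    "00:00_20161104", "23:59_20161201",
)
print(url)
```

The same metric path can be dropped into a Grafana graph panel as a Graphite query target.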
[20:57:28] 10Analytics, 10Dumps-Generation, 05Security: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2839724 (10Nuria) I added security again cause these requests should not return 200: https://meta.wikimedia.org/wiki/%3Cscript%3E?... [20:57:55] AndyRussG: you can use something like this to group by minute: [20:58:04] https://www.irccloud.com/pastebin/uIDr7z24/ [20:59:14] milimetric: I am going to take a look at moving the available.projects.json, OK? The caching headers issue should resolve when we deploy the extension [20:59:53] K [20:59:55] AndyRussG: but, as i said, i think any oddities would be present too in varnish 200 graphs [21:01:57] 10Analytics-Dashiki, 06Analytics-Kanban: refactor code using available.projects.json to use sitematrix - https://phabricator.wikimedia.org/T136120#2839753 (10Nuria) [21:31:13] nuria: cool, thanks much!!!! [21:31:52] (gotta do non-work stuff for a little while, back soon!) [21:43:59] 10Analytics, 10EventBus, 13Patch-For-Review, 06Services (watching): Check eventbus Kafka cluster settings for reliability - https://phabricator.wikimedia.org/T144637#2839955 (10Ottomata) I reverted back to `min.insync.replicas=1`. During one of my kafka broker restarts today, I saw produce requests fail b... [21:45:26] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Ensure no dropped messages in eventlogging producers when stopping broker - https://phabricator.wikimedia.org/T142430#2839958 (10Ottomata) I just applied the `api_version` change for the eventbus kafka-python producer. Good news: this elimina...
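[Editor's note: the irccloud pastebin above is not preserved in the log. As an illustrative stand-in (not the actual pasted query), grouping records by minute is easy because `dt` is ISO 8601: the first 16 characters of each value identify the minute, so no date parsing is needed:]

```python
from collections import Counter

def requests_per_minute(dt_values):
    """Count records per minute by truncating ISO 8601 `dt` strings.

    "2016-12-01T20:57:28"[:16] -> "2016-12-01T20:57", i.e. the minute bucket.
    """
    return Counter(dt.strip()[:16] for dt in dt_values)

sample = ["2016-12-01T20:57:28", "2016-12-01T20:57:55", "2016-12-01T20:58:04"]
print(requests_per_minute(sample))
```

In Hive the equivalent trick is `GROUP BY substr(dt, 1, 16)`.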
[21:45:38] 10Analytics-EventLogging, 06Analytics-Kanban, 10EventBus: Ensure no dropped messages in eventlogging producers when stopping broker - https://phabricator.wikimedia.org/T142430#2839959 (10Ottomata) [22:06:18] 10Analytics-EventLogging, 06Analytics-Kanban, 10EventBus: Ensure no dropped messages in eventlogging producers when stopping broker - https://phabricator.wikimedia.org/T142430#2840048 (10Ottomata) We could try increasing `sync_timeout`, maybe 2 is just too long. Or we could try https://gerrit.wikimedia.org/... [22:53:42] (03PS1) 10MaxSem: Record sum of all wikis for geo tag counts [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/324822 [23:50:54] nuria: milimetric: I can copy a bunch of partly-filtered data to my own table in Hive, and then query it on more specific criteria more quickly than if I were querying the full webrequest table every time, right?
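[Editor's note: the answer to AndyRussG's question is yes; the usual pattern is a one-time CREATE TABLE AS SELECT into a personal database, then repeated cheaper queries against the copy. A hedged Python sketch that assembles such a HiveQL statement; the database, table, and column names other than `dt` and `is_pageview` are hypothetical, as is the assumption that webrequest is partitioned by year/month/day:]

```python
def build_ctas(user_db, table, year, month, day):
    """Assemble a CREATE TABLE AS SELECT copying one day of filtered rows.

    Column names beyond dt/is_pageview and the year/month/day partition
    scheme are assumptions for illustration.
    """
    return (
        f"CREATE TABLE {user_db}.{table} AS "
        "SELECT dt, uri_host, uri_path, is_pageview "
        "FROM wmf.webrequest "
        f"WHERE year = {year} AND month = {month} AND day = {day} "
        "AND is_pageview"
    )

print(build_ctas("andyrussg", "filtered_requests", 2016, 12, 1))
```

Restricting the SELECT to the partitions and columns you need is what makes the copy, and every query after it, cheap.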